On 9/19/17 at 9:55am EDT, Podio Ops received reports of elevated wait times in Globiflow-based processes, and monitoring tools recorded API errors. As task load on the affected instance increased unexpectedly, subsequent processes became queued. After a manual throttle of the affected tasks proved unsuccessful, the API server was restarted at 10:59am EDT, leading to an almost immediate reduction in queue times and restoring overall performance, which resolved the incident.
Further investigation revealed that the impacted instance had been scheduled for maintenance earlier that day; however, a discrepancy in the backing instance configuration left the instance deployed and prevented the system from failing over gracefully as intended.
Our Operations team has made a number of recommendations for long-term remediation, particularly regarding the backing services configuration and the failover mechanics for instances that are no longer deployed.