Podio Incident August 20 - 21, 2024
Summary of Impact
On August 20, 2024, Podio customers experienced degraded performance in item search, field calculations, and webhook events, which caused delays and disruptions in workflow execution.
On August 21, 2024, Podio customers experienced issues with real-time updates for item activity, notifications, and chat.
Root Cause
On August 20, 2024, at 13:37 EDT, the Podio automated alert system detected a sudden spike in one of our queues responsible for item search, webhook execution, and other tasks. Messages in this queue were not being processed, so the backlog kept growing on top of the existing load. The issue was traced to a failure in the event broker service provided by our third-party vendor, which left the queue system that processes events inoperable. This failure delayed message processing and impacted several services, including item search, calculations, and webhook executions. Because the event broker is a managed service, ShareFile Engineering engaged the vendor and worked closely with them to troubleshoot and resolve the issue. After resolution, the queue had a significant backlog, which took 2 hours to drain completely.
On August 21, 2024, at 00:53 EDT, the Podio automated alert system detected a pattern matching the previous incident: the same issue had recurred, and end users again experienced delays in search, calculations, and webhook executions. To mitigate, ShareFile Engineering created a replacement cluster so that events would be handled properly while the vendor's team continued to work on the original problem.
On August 21, 2024, at 13:41 EDT, ShareFile Engineering opened a new incident to troubleshoot real-time updates after receiving reports through support and community channels of slow updates in the UI. Real-time updates for item activity streams, chat, and notifications were still degraded because these systems were still communicating with the older, now non-functional queue. This caused further delays and required users to refresh the page to see updates.
Mitigation
After identifying the root cause, ShareFile Engineering took the following steps:
All of the above issues were confirmed resolved as of August 21, 2024, 5:30 PM EDT.
Next Steps
ShareFile Engineering continues to work with our third-party vendor to understand why the queuing service stopped processing messages in the first place. If the failure happens again, ShareFile Engineering now has a way to identify the root cause quickly, which will speed up remediation. ShareFile Engineering is also continuing to work internally to confirm the root cause and pursue actions on the following issues: