Provided below is the full root-cause analysis of the events and circumstances that led to the unavailability of the Podio service from 12:30pm ET on January 23rd, 2018 to approximately 3:15pm ET on January 24th, 2018.
What Occurred?
At approximately 10:00am ET on January 23rd, 2018, Podio operations noticed a flat line in our item creation metrics and began receiving reports of 500 errors when users attempted to create items. Upon investigation, we observed a large inflow of database exceptions in the Item table logs. Each exception reported that a new record could not be created in the database because of a duplicate key constraint with value 2147483647. The team recognized this number as 2^31 - 1, the maximum value of a signed int. At this point, Podio was effectively in read-only mode.
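The failure mode described above can be illustrated with a toy model. This is only a sketch under an assumption: that the ID generator saturates at the column maximum and keeps handing out the same value once the range is exhausted, which would explain why every exception cited the same key. The `ToyTable` class and its names are hypothetical and do not reflect Podio's actual schema or database engine.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


class DuplicateKeyError(Exception):
    """Raised when an insert reuses an existing primary key."""


class ToyTable:
    """Toy table whose ID generator saturates at max_id (an assumed behaviour)."""

    def __init__(self, last_id, max_id=INT32_MAX):
        self.max_id = max_id
        self.next_id = last_id + 1
        self.rows = {}

    def insert(self, payload):
        # Saturate: once the counter passes the limit, it keeps yielding max_id.
        new_id = min(self.next_id, self.max_id)
        if new_id in self.rows:
            raise DuplicateKeyError(f"duplicate key value {new_id}")
        self.rows[new_id] = payload
        self.next_id = new_id + 1
        return new_id


table = ToyTable(last_id=INT32_MAX - 2)
first = table.insert("a")   # takes the second-to-last free ID
second = table.insert("b")  # takes the last free ID, INT32_MAX
try:
    table.insert("c")       # no IDs left: the generator repeats INT32_MAX
    failed = False
except DuplicateKeyError as exc:
    failed = True
    message = str(exc)
```

In this model reads keep working because existing rows are untouched, while every new insert fails with the same key value, which matches the read-only behaviour described above.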
The Operations team determined that the database's cardinality limit for item revisions had been reached. The best way to resolve this was to change how the platform wrote data to the affected tables: increase the size of the column datatype from Int to BigInt, which would require maintenance on multiple database tables because of their relationships. The Development team prioritized the list of affected tables and features while the Operations team tested the change on each column of one of the tables to estimate how long the statement would take to run, with a hard cut-off time of 12:30pm ET. During this period, all backend jobs were stopped so that they would not run and fail, keeping data safe while Podio.com was still serving requests. Customers were able to view items but not make changes to them.
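To give a sense of what widening the column buys, the sketch below compares the two ranges. It assumes BigInt is a signed 64-bit integer, as it is in most SQL databases; the `headroom` helper is a hypothetical illustration, not part of Podio's tooling.

```python
INT32_MAX = 2**31 - 1  # maximum of a signed 32-bit Int
INT64_MAX = 2**63 - 1  # maximum of a signed 64-bit BigInt (assumed width)


def headroom(highest_id, max_id):
    """Return the fraction of the ID range still unused (0.0 means exhausted)."""
    return (max_id - highest_id) / max_id


# With the highest ID sitting at the Int limit, no range is left.
exhausted = headroom(INT32_MAX, INT32_MAX)

# The same highest ID in a BigInt column uses a negligible part of the range.
after_widening = headroom(INT32_MAX, INT64_MAX)

# The non-negative range grows by a factor of 2**63 / 2**31 = 2**32.
growth_factor = (INT64_MAX + 1) // (INT32_MAX + 1)
```

A check like `headroom` could also be run periodically against the highest ID in each table to flag columns that are approaching their limit well before inserts start failing.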
Our team made a risk assessment and a decision based on the initial analysis and testing. We knew there were four impacted tables and that we had to migrate at least two of them to the new mechanics. To do this most effectively and without risk of data loss, it was necessary to shut down the entire Podio service. There were two primary reasons:
The team recognized the item feature to be critical. If Podio had been kept online, even in read-only mode, requests would still have hit the database, which would have been under heavier load with the migration running in parallel. Because of the size of the tables in question, Development and Operations decided it would be best to eliminate user traffic to the database to reduce the risk of resource competition and, we hoped, to help the table updates complete more quickly.
Additionally, there were relational concerns: customers using the site would be adding data to parts of the database while some tables could not receive writes, increasing the risk of relational errors in the data after the issue had been resolved.
Upon making this decision, the team implemented the procedure in a test environment before making the final call to bring the service offline for the emergency migration.
While the migration ran, our Operations team prepared the plan to bring the application back up upon completion. We also took the opportunity, while these updates were running, to apply maintenance and security patches that we had in our pipeline. In addition, we decided to run the migration for the last two tables after getting the application online, with the intention to manually remove the feature related to those tables, the item activity log, in order to enable a safe migration and ensure data integrity throughout the process.
The Podio database is enormous, with tables containing more than 500GB of data each. This meant the changes progressed at a slow but steady rate while we waited for them to complete.
At approximately 2:44pm ET the following day, January 24th, 2018, the second migration completed. The Operations team then ran through several procedures to ensure the Podio application could successfully read and write with the new mechanics. At approximately 3:15pm ET the application was once again widely available. The job workers were started and traffic began flowing as soon as we went online.
During initial testing at around 3:17pm ET, we found that the file upload feature was not working as expected and was failing to upload files. Another deploy was done at 3:30pm ET to restart all services that could have been left in a stale state, but it did not solve the problem. The team investigated the code and logs across the different services further and observed that the services handling file requests were not running on the majority of the servers. At around 4:15pm ET a forced restart of the service was applied across all the servers handling file services, which resolved the issue.
As expected, the new mechanism used on the migrated tables left some operations in a somewhat degraded state, including the activity log of item adjustments. Additional maintenance was scheduled immediately to migrate the remaining tables to the new mechanics and alleviate these degraded features. We expect the item revision activity log functionality to be back in service once the last two tables have completed migration, which will run with the application online between Jan 24 and Jan 28.
Podio is now accessible to our customers, with approximately 99% of all features enabled and available. All the tasks in our queues have been processed with no loss of data.
Overall, we want to apologize to our customers for the huge impact this event had on you, your team and your business. We know you depend on us for a reliable and available service at all times, and it is our highest priority to live up to your expectations and improve our availability in the future. We will conduct additional in-depth analysis in the following week to ensure we learn from this event.
In order to improve, we are evaluating our risks and policies and will implement new processes in the following days and weeks to prevent a similar incident in the future. We are still working to complete the immediate actions required to restore service fully and the analysis needed to be the reliable platform you need for your team and business long-term.