1.23.2018 - Emergency Podio Maintenance - Application Will Be Unavailable

Scheduled Maintenance Report for Podio Status Page

Postmortem

Provided below is the full root-cause analysis of the events and circumstances that led to the unavailability of the Podio service from 12:30pm ET on January 23, 2018 to approximately 3:15pm ET on January 24, 2018.

What Occurred?

At approximately 10:00am ET on January 23rd, 2018, Podio operations noticed a flat line in our item creation metrics and began receiving reports of 500 errors when users attempted to create items. Upon investigation, a large inflow of database exceptions was observed in the Item table logs. Each exception referenced an inability to create a new record in the database due to a duplicate key constraint with value 2147483647. The team recognized this number to be 2^31 - 1, which is the maximum value of a signed 32-bit int. At this point, Podio was effectively in read-only mode.
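To illustrate the failure mode described above, here is a minimal sketch of a key counter that cannot advance past the signed 32-bit maximum. The `SaturatingTable` class and its behavior are assumptions chosen for illustration, modelled on how some databases handle an exhausted auto-increment column; it is not Podio's actual schema or database engine.

```python
INT_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


class DuplicateKeyError(Exception):
    """Raised when an insert would reuse an existing primary key."""


class SaturatingTable:
    """Hypothetical toy table whose key counter saturates at INT_MAX."""

    def __init__(self, next_id):
        self.next_id = next_id
        self.rows = {}

    def insert(self, payload):
        # The counter cannot go past INT_MAX, so it sticks at that value.
        key = min(self.next_id, INT_MAX)
        if key in self.rows:
            raise DuplicateKeyError(f"duplicate key value {key}")
        self.rows[key] = payload
        self.next_id = key + 1
        return key


table = SaturatingTable(next_id=INT_MAX - 1)
table.insert("second-to-last row")   # succeeds with key INT_MAX - 1
table.insert("last row")             # succeeds with key INT_MAX
try:
    table.insert("one row too many")  # key would be INT_MAX again
except DuplicateKeyError as exc:
    print(exc)
```

Once the last key is taken, every later insert fails in the same way, which would match a flat line in creation metrics while reads continue to work.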

The Operations team determined that the maximum value had been reached for the database cardinality limits for item revisions. The best way to resolve this issue was to adjust the mechanics that the platform used to write data to the affected tables. The solution was to increase the size of the column's datatype from Int to BigInt, which would require maintenance on multiple database tables because of their relationships and mechanics. The Development team went on to prioritize the list of affected tables and features, while the Operations team tested the change on each column of one of the tables to estimate how long the statement would take to run, with a hard cut-off time of 12:30pm ET. During this period, all backend jobs were stopped so that they would not run and fail, which kept data safe while Podio.com was still serving requests. Customers were able to view items but not make changes to them.
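As a rough illustration of why widening the column from Int to BigInt resolves the limit, the sketch below compares the two key spaces and estimates the remaining headroom. The function names and the insertion rate are hypothetical values chosen for the example, not figures from the incident.

```python
INT_MAX = 2**31 - 1      # signed 32-bit maximum
BIGINT_MAX = 2**63 - 1   # signed 64-bit maximum


def headroom(current_max_id, type_max):
    """Return the fraction of the key space still unused (0.0 to 1.0)."""
    return (type_max - current_max_id) / type_max


def days_until_exhausted(current_max_id, type_max, ids_per_day):
    """Estimate whole days left before the key space runs out at a steady rate."""
    return (type_max - current_max_id) // ids_per_day


# Hypothetical rate, for illustration only.
IDS_PER_DAY = 5_000_000

print(headroom(INT_MAX, INT_MAX))                             # no room left as Int
print(headroom(INT_MAX, BIGINT_MAX))                          # almost all room left as BigInt
print(days_until_exhausted(INT_MAX, BIGINT_MAX, IDS_PER_DAY))
print(BIGINT_MAX // INT_MAX)                                  # how many times larger the key space is
```

A check along these lines, run periodically against the highest key in each large table, is one possible way to get warning well before a column approaches its maximum.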

Our team made a risk assessment and decision based on the initial analysis and testing. From this, we knew there were four impacted tables and that we had to migrate at least two of them to the new mechanics. To do this most effectively without risk of data loss, it was necessary to shut down the entirety of the Podio service. There were two primary reasons:

The team recognized the item feature to be critical. If Podio had been kept online even in a read-only mode, requests would have been hitting the database, and the database would have taken a harder hit with the migration running in parallel. Because of the size of the tables in question, Development and Operations decided that it would be best to eliminate user traffic to the database to reduce the risk of resource competition and, hopefully, to help the table updates complete more quickly.
Additionally, there were relational concerns of customers using the site and adding data to parts of the database while some tables could not receive writes, increasing the risk of relational errors in the data after the issue had been resolved.

Upon making this decision, the team implemented the procedure in a test environment before making the final call to bring the service offline for the emergency migration.

While the migration was running, our Operations team ran through the plan for bringing the application back up upon completion. We also took the opportunity while these updates were running to apply maintenance and security patches that we had in our pipeline. In addition, we decided to run the migration for the last two tables after getting the application online, and to manually remove the feature related to those tables, the item activity log, in order to enable a safe migration and ensure data integrity throughout the process.

The Podio database is enormous, with tables containing more than 500GB of data each. This meant the changes progressed at a slow but steady rate while we waited for the database changes to complete.

At approximately 2:44pm ET the following day, January 24th, 2018, the second migration completed. At this time the Operations team ran through several procedures to ensure the Podio application could successfully read and write with the new mechanics. At approximately 3:15pm ET the application was once again widely available. The job workers were started, and traffic started flowing as soon as we went online.

During initial testing around 3:17pm ET, we found that the file upload feature was failing to upload files. Another deploy was done at 3:30pm ET in order to restart all services that could have been left in a stale state, but it did not solve the problem. The team investigated the code and logs across the different services further and observed that the services handling file requests were not running on the majority of the servers. At around 4:15pm ET a forced restart of the service was applied across all the servers handling file services, which resolved the issue.

As expected, the new mechanism being used on the migrated tables left some operations in a somewhat degraded state. This included the activity log of item adjustments. Additional maintenance was scheduled immediately to migrate the appropriate tables to the new mechanics to alleviate these degraded features as well. We expect the item revision activity log functionality to be back in service after the last two tables have completed migration, which will run with the application online during the period of Jan 24 to Jan 28.

At this point, Podio is accessible to our customers with approximately 99% of all features enabled and available. All the tasks in our queues have been processed with no loss of data.

Overall, we want to apologize to our customers for the huge impact this event had on you, your team and your business. We know you depend on us for a reliable and available service at all times, and it is our highest priority to live up to your expectations and improve our availability in the future. We will conduct additional in-depth analysis in the following week to ensure we learn from this event.

In order to improve, we are evaluating our risks and policies and will implement new processes in the following days and weeks to prevent a similar incident in the future. We are still pursuing the completion of the immediate actions required to restore service fully and to complete the analysis needed to be the reliable platform you need for your team and business long-term.

Posted Jan 26, 2018 - 22:17 EST

Completed

At this point, we are confident that the adjustments made during the maintenance have been successful in bringing the Podio application back online and making it available for customers.
We do still have an open task of performing maintenance to address an open issue where actions are not being logged in our activity log. We will be creating a separate maintenance item to track work on this.
Again, our apologies for the inconvenience caused by this maintenance. We will provide a full RCA on this issue as soon as it is available.

Posted Jan 24, 2018 - 16:53 EST

Verifying

At this time, the Podio application is once again available for use. Please be gentle while we get things back up and running. We are continuing to monitor the service and the adjustments made to ensure the appropriate systems are functioning properly.

As stated previously, some features, such as items writing to the item activity log, may not work properly. We will continue to keep users apprised as we work to ensure ALL Podio functionality is back online and working properly.

Posted Jan 24, 2018 - 15:22 EST

Update

Our table migrations have fully completed and we are making the appropriate moves to bring the service back online. We anticipate the service will be available shortly. As soon as we have a full update on the service becoming available, we will post it here.

Posted Jan 24, 2018 - 14:50 EST

Update

As of 1:45pm ET, both tables that we have been migrating are at 100% and the migration is finalizing. Once the migration has fully finalized, our Ops team will begin working to make the site available. We should be able to provide a firm ETA on availability once the migration has finalized. We will update this site as soon as the information is available.

Posted Jan 24, 2018 - 13:50 EST

Update

As of right now, we are at 100% complete for one migration and close to 90% completion for the other. This information is assisting us in gauging a time of completion. At this point we are anticipating the maintenance to complete between 1:00pm-2:00pm ET.
Upon completion of the maintenance, we anticipate the service becoming available once more with some features being temporarily unavailable. One of those anticipated features will be item revision records being logged in the item activity log.

We will continue to provide updates as they are available, anticipating the next one around 1:30pm ET.

Posted Jan 24, 2018 - 11:01 EST

Update

Update: In order to overcome a capacity problem in our main database, we are currently running a migration to expand the limit for 2 primary tables. The migration is currently running slower than planned. As of right now, we are at 100% complete for one migration and approaching 70% completion for the other. This information is assisting us in gauging a time of completion. We anticipate this will take a few hours longer than expected due to the migration slowdown.

The current estimate for completion is around 12:00 PM Eastern Time / 5:00 PM GMT / 6:00 PM CET.

Posted Jan 24, 2018 - 06:39 EST

Update

Update: In order to overcome a capacity problem in our main database, we are currently running a migration to expand the limit for 2 primary tables. The migration is progressing slowly, but according to plan. As of this time we are at 100% complete for one migration and at 60% completion for another, which is assisting us in estimating an approximate time to resolution.

Posted Jan 24, 2018 - 04:56 EST

Update

To offer a bit more background about the work being completed at this point: In order to overcome a capacity problem in our main database, we are currently running a migration to expand the limit for 2 primary tables. The migration is progressing slowly, but according to plan. Our current estimate is that we are at 90% completion for one migration and 50% completion for the other, which is allowing us to estimate the approximate time to full resolution. It is important to note that there is absolutely no compromise to data integrity throughout this process.

Additionally, our team completely understands the critical importance to our customers of Podio being reliable and available at all times. We are taking every measure possible to ensure we do not see a similar situation in the future. When the incident is resolved, we will provide a full and detailed overview of the main problem, the implemented solution and the long-term measures we are taking to ensure this issue does not recur. The team is working as diligently as possible to bring Podio back online. We appreciate your patience during this process.

Posted Jan 24, 2018 - 03:34 EST

Update

Our operations team continues to work on the necessary procedures, and at this time we are continuing to estimate that the maintenance should end this morning (9 am Eastern Time / 2:00 PM GMT / 3:00 PM CET), pending the outcome of the activities currently underway.
We will provide another update at that time, or sooner should further information become available.

Posted Jan 24, 2018 - 03:13 EST

Update

Our operations team continues to work on the necessary procedures, and at this time we are continuing to estimate that the maintenance should end tomorrow morning (Eastern Time), pending the outcome of the activities currently underway.
We will provide another update at that time, or sooner should further information become available.

Posted Jan 23, 2018 - 21:15 EST

Update

Our operations team continues to perform the necessary maintenance to resolve the issues with the Podio application and bring it back online.
At this point in time, we anticipate that the maintenance will continue overnight and through tomorrow, Wednesday, Jan 24 (until 9 am Eastern Time / 2:00 PM GMT). Should that timeline change, we will immediately update this status page to reflect the change.
It is important to note that there is absolutely no compromise to data integrity throughout this process. The operations team is working to resolve capacity issues that are impacting functionality. We will provide further updates as soon as they are available and we apologize for the inconvenience caused.

Posted Jan 23, 2018 - 16:39 EST

Update

Maintenance is ongoing and we continue to make the appropriate changes in hopes of bringing the Podio application back online as soon as possible. At this time we do not have an ETA for full resolution and when the app should be available again. We will be sure to provide more information via this page, as soon as it is available.

Posted Jan 23, 2018 - 14:30 EST

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Jan 23, 2018 - 12:30 EST

Scheduled

In an effort to resolve and immediately improve underlying issues that are leading to poor performance and behavior within the Podio application, our Development team will be performing maintenance on the application beginning at 12:30pm ET.
While this maintenance is underway, the Podio application will be unavailable. We apologize for whatever inconveniences this may cause.

At this time, we are unsure of the exact amount of time needed to make the appropriate adjustments; however, we will update our customers whenever possible to provide as much information as often as we can. The date listed here is simply a placeholder until we have an ETA.

Posted Jan 23, 2018 - 11:35 EST