At approximately 5:00 AM UTC, the NetFoundry Console began failing to populate data and to allow provisioning or monitoring activities, and our internal systems generated alerts.
During the outage, only the provisioning and viewing of networks were impacted. Deployed networks continued to function in their last configured state.
Our Development and DevOps teams engaged to determine and correct the issue. They discovered that the Core database holding the configuration data of customer networks was overrun with queries and unable to service the load in a timely manner, leaving it effectively locked. The database was restarted in an effort to clear the issue. This worked briefly, but the issue returned quickly, indicating that the source of the queries was continuous.
As more members of our technical teams joined the effort, it was discovered that the source of the problem was our Console software as used by customers. A change pushed the day before had inadvertently created a massive number of parallel operations querying the Core database for network events. The query re-ran periodically whenever the Console was open, so it continued to generate load. The developer who created the query identified it as the specific cause, and a hotfix patch was created and deployed that makes the query return an empty set, stopping the load. The patch deployment began at 14:22 UTC and was completed at 14:58 UTC. A user observed to be issuing the queries was also blocked during this process, because the software in their browser had to be updated to the latest release or it would have continued to behave as before.
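For readers who want a concrete picture of the failure mode, the following is a minimal sketch of one common way to keep a periodic refresh from issuing an unbounded number of parallel queries: capping concurrency with a semaphore. It is an illustration only and does not represent the Console's code or the fix that was deployed. The names `fetch_events`, `poll_once`, and `max_parallel` are hypothetical, and the database call is replaced by a stand-in.

```python
import asyncio


async def fetch_events(network_id, semaphore, stats):
    """Hypothetical per-network event query, limited by a shared semaphore."""
    async with semaphore:
        # Track how many queries are in flight so the cap can be observed.
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0)  # stand-in for the real database query
        stats["in_flight"] -= 1
        return []  # placeholder result; a real query would return events


async def poll_once(network_ids, max_parallel=4):
    """Run one refresh cycle with at most max_parallel concurrent queries."""
    semaphore = asyncio.Semaphore(max_parallel)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_events(n, semaphore, stats) for n in network_ids)
    )
    return results, stats["peak"]


# One refresh cycle over 100 hypothetical networks.
results, peak = asyncio.run(poll_once(range(100), max_parallel=4))
```

With a cap like this, the number of simultaneous queries depends on `max_parallel` rather than on how many entries a network has, which is the property that matters when table sizes differ widely between environments.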
Our Support team began reaching out to customers who had contacted us about the issue with instructions on how to refresh the code running in their browsers, and operations returned to normal.
A more complete fix for the issue has been created and pushed into the software stream in our lower environments, where it will be tested before replacing the hotfix patch when appropriate. The query supported a UI element that shows whether processes are running, and it is not critical to the operation of the Console software.
We are reviewing our testing procedures, as the primary contributing factor to this issue was a misunderstanding of the scale of certain networks' entries in the tables. Testing was performed, but not against the kind of environment that caused the issue. We are also reviewing our troubleshooting steps so that more of our team is aware of the methods that can resolve issues like this more quickly.
NetFoundry apologizes for the inconvenience caused by this issue. As always, please contact us if you have any questions or concerns.