Something Went Wrong Facebook 2019
By Ega Wahyudi, Monday, August 12, 2019
The key flaw that made this outage so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the value in the persistent store is itself invalid.
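To make that read path concrete, here is a minimal sketch. It is not Facebook's actual code: the `ConfigClient` class, the `is_valid` rule, and the `limit` key are hypothetical, and plain dictionaries stand in for the cache and the persistent store. It counts how many store queries five reads produce in each of the two cases described above.

```python
class ConfigClient:
    """Hypothetical client illustrating the self-healing read path."""

    def __init__(self, cache, store, is_valid):
        self.cache = cache          # dict standing in for the cache tier
        self.store = store          # dict standing in for the persistent store
        self.is_valid = is_valid    # validation rule (an assumption)
        self.store_queries = 0      # number of queries sent to the store

    def get(self, key):
        value = self.cache.get(key)
        if value is not None and self.is_valid(value):
            return value
        # Cached value is missing or invalid: refresh it from the store.
        self.store_queries += 1
        value = self.store[key]
        self.cache[key] = value
        return value


def is_valid(value):
    # Assumed rule for this sketch: a valid setting is a positive integer.
    return isinstance(value, int) and value > 0


# Case 1: the cache is corrupted, the store is fine. One query repairs it.
client = ConfigClient(cache={"limit": -1}, store={"limit": 10}, is_valid=is_valid)
for _ in range(5):
    client.get("limit")
transient_queries = client.store_queries

# Case 2: the store itself holds an invalid value. Every read hits the store.
client = ConfigClient(cache={}, store={"limit": -1}, is_valid=is_valid)
for _ in range(5):
    client.get("limit")
persistent_queries = client.store_queries

print(transient_queries, persistent_queries)
```

In the first case the repair costs a single query. In the second, each read finds an invalid value, queries the store, and writes the same invalid value back, so the query count grows with every read.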
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error while querying one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that did not allow the databases to recover.
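The loop can be illustrated with a deliberately simplified simulation. Everything in it is an assumption made for the example: a single shared cache key, clients that all read at the same moment in each tick, and a database that answers at most `db_capacity` queries per tick and returns errors for the rest. The client and capacity numbers are arbitrary.

```python
def simulate(clients, db_capacity, ticks, delete_on_error):
    """Return the number of database queries issued in each tick."""
    cache_has_key = False            # the key was deleted during the incident
    queries_per_tick = []
    for _ in range(ticks):
        # All clients read the cache at the start of the tick; on a miss,
        # every one of them queries the database.
        queries = 0 if cache_has_key else clients
        queries_per_tick.append(queries)

        succeeded = min(queries, db_capacity)
        failed = queries - succeeded
        if succeeded:
            cache_has_key = True     # a successful client repopulates the cache
        if failed and delete_on_error:
            cache_has_key = False    # a failed client deletes the key again
    return queries_per_tick


buggy = simulate(clients=1000, db_capacity=100, ticks=5, delete_on_error=True)
fixed = simulate(clients=1000, db_capacity=100, ticks=5, delete_on_error=False)
print(buggy)  # load never drops
print(fixed)  # load drops after the first tick
```

When errors delete the key, the load stays at 1000 queries per tick even though the stored value is valid again. When errors leave the cache alone, the first successful answer repopulates the key and the load falls to zero.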
The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
This got the site back up and running today, and for now we have turned off the system that attempts to correct configuration values. We are exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
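The post does not say what those designs look like. As one common pattern, and purely as an illustration rather than a description of Facebook's systems, a client can treat a store error differently from an invalid value, keep serving the last value that passed validation, and wait exponentially longer between retries. The names `StoreError`, `BackoffConfigClient`, and `refresh`, and the delay values, are hypothetical.

```python
class StoreError(Exception):
    """Raised when the persistent store cannot answer (hypothetical)."""


class BackoffConfigClient:
    def __init__(self, fetch, is_valid, base_delay=1.0, max_delay=60.0):
        self.fetch = fetch            # callable(key) -> value, may raise StoreError
        self.is_valid = is_valid
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.last_good = {}           # key -> last value that passed validation
        self.failures = 0             # consecutive failed refresh attempts
        self.next_attempt = 0.0       # earliest time another fetch is allowed

    def refresh(self, key, now):
        """Try to refresh `key`; always return the best value known so far."""
        if now < self.next_attempt:
            return self.last_good.get(key)    # still backing off: no query
        try:
            value = self.fetch(key)
        except StoreError:
            value = None                      # an error is not an invalid value
        if value is not None and self.is_valid(value):
            self.last_good[key] = value
            self.failures = 0
            return value
        # Failed or invalid: keep the old value and double the wait.
        self.failures += 1
        delay = min(self.max_delay, self.base_delay * 2 ** (self.failures - 1))
        self.next_attempt = now + delay
        return self.last_good.get(key)


calls = {"n": 0}

def failing_fetch(key):
    calls["n"] += 1
    raise StoreError("store overloaded")

client = BackoffConfigClient(failing_fetch, is_valid=lambda v: v > 0)
client.last_good["limit"] = 10        # value known before the incident
values = [client.refresh("limit", now=float(t)) for t in range(10)]
print(calls["n"], values)
```

Over ten one-second ticks against a store that always fails, this client issues four queries instead of ten and keeps returning the last good value. A production design would typically also add random jitter to the delay so that many clients do not retry at the same instant.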
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.