So, that was a bummer.
Yesterday, we experienced a very long downtime. All told, we were down for about 11 hours, which is unacceptably long. It sucked for everyone (including our team – we all check in every day, too). We know how frustrating this was for all of you because many of you told us how much you’ve come to rely on foursquare when you’re out and about. For the 32 of us working here, that’s quite humbling. We’re really sorry.
This blog post is a bit technical. It has the details of what happened, and what we’re doing to make sure it doesn’t happen again.
The vast bulk of the data we store is from user check-in histories. Our databases are structured so that this data is spread evenly across multiple database “shards”, each of which can only store so many check-ins. Starting around 11:00am EST yesterday, we noticed that one of these shards was performing poorly because a disproportionate share of check-ins was being written to it. For the next hour and a half, until about 12:30pm, we tried various measures to restore a proper load balance. None of them worked. As a next step, we introduced a new shard, intending to move some of the data from the overloaded shard to this new one.
We wanted to move this data in the background while the site remained up. For reasons that are not entirely clear to us right now, though, the addition of this shard caused the entire site to go down. In addition, moving the data over to the new shard did not free up as much space as anticipated (partially because of data fragmentation, partially because our database is partitioned by user ID). We spent the next five hours trying different approaches to migrating data to this new shard and then restarting the site, but each time we encountered the same problem of overloading the initial shard, keeping the site down.
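To make the imbalance easier to picture, here is a minimal Python sketch of how partitioning by user ID can leave one shard with far more check-ins than the others even when users are spread evenly. It is an illustration only, not our production setup or MongoDB's actual sharding logic: the range-based mapping, the shard count, and the workload numbers are all assumptions made up for the example.

```python
from collections import Counter
from typing import Dict

NUM_SHARDS = 4          # hypothetical shard count
USERS_PER_SHARD = 1000  # hypothetical size of each contiguous user-ID range


def shard_for_user(user_id: int) -> int:
    """Map a user ID to a shard using contiguous ID ranges (an assumed scheme)."""
    return (user_id // USERS_PER_SHARD) % NUM_SHARDS


def shard_load(checkins_per_user: Dict[int, int]) -> Counter:
    """Total the check-ins stored on each shard."""
    load = Counter()
    for user_id, count in checkins_per_user.items():
        load[shard_for_user(user_id)] += count
    return load


# Made-up workload: every shard holds 1000 users, but the users with the
# lowest IDs are assumed to check in ten times as often as everyone else.
checkins = {uid: (50 if uid < 1000 else 5) for uid in range(4000)}
load = shard_load(checkins)

for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {load[shard]} check-ins")
```

In this toy workload each shard holds the same number of users, yet shard 0 stores ten times as many check-ins as any other shard. Balancing by user count is not the same as balancing by data volume or write load.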
At 6:30pm EST, we determined the most effective course of action was to re-index the shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours. At 11:30pm EST, the site was brought back up. Because of our safeguards and extensive backups, no data was lost.
What we’ll be doing differently – technically speaking
So we now have more shards and no danger of overloading in the short-to-medium term. There are three general technical things we’re investigating to prevent this type of error from happening in the future:
- The makers of MongoDB – the system that powers our databases – are working very closely with us to better deal with the problems we encountered.
- We’re making changes to our operational procedures to prevent overloading, and adding safeguards so that foursquare stays up if overloading does occur.
- We’re also looking at things like graceful degradation to help in these situations. There may be times when we’re overloaded in the future, and it would be better to turn off certain functionality than to have the whole site go down.
What we’ll be doing differently – in terms of communication
We’re also changing how we keep you posted during an outage:
- Downtime and ‘we’re back up’ messages will be tweeted by @4sqsupport (our support account) and retweeted by @foursquare.
- During these outages, regular updates (at least hourly) will be tweeted from @4sqsupport.
- We’ve created a new status blog at status.foursquare.com, which will have the latest updates.
- A more useful error page; instead of having a static graphic saying we’re upgrading our servers (which was not completely accurate), we’ll have a more descriptive status update. Of course we hope not to see the pouty princess in the future…
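For the technically curious, the idea mentioned above of turning off certain functionality under overload, rather than letting the whole site go down, could be sketched roughly as follows. This is a hedged Python illustration, not a description of how foursquare works: the feature names, the load thresholds, and the notion of a single 0.0 to 1.0 load figure are all hypothetical.

```python
# Hypothetical non-essential features, each with the load level (0.0 to 1.0)
# above which it is switched off. Names and thresholds are made up.
SHED_ABOVE = {
    "leaderboard": 0.70,
    "friend_feed": 0.80,
    "history": 0.90,
}

# Core functionality that is never switched off in this sketch.
CORE = {"checkin"}


def enabled_features(load: float) -> set:
    """Return the features that stay on at the given load."""
    on = set(CORE)
    for feature, limit in SHED_ABOVE.items():
        if load <= limit:
            on.add(feature)
    return on


def handle(feature: str, load: float) -> str:
    """Serve a request if its feature is enabled, otherwise fail softly."""
    if feature in enabled_features(load):
        return f"{feature}: ok"
    return f"{feature}: temporarily unavailable"


print(handle("leaderboard", 0.95))
print(handle("checkin", 0.95))
```

The design choice here is that the least essential features are shed first as load rises, while the core action keeps working for as long as possible.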
Hopefully this makes what happened clear and will help lead to a more reliable foursquare in the future. We feel tremendous responsibility to our community and yesterday’s outage was both disappointing and embarrassing for us. We’re sorry.