July 12, 2012
Scaling lessons learned at Dropbox, part 1

Rajiv Eranki <rajiv.eranki@gmail.com>

I was in charge of scaling Dropbox for a while, from roughly 4,000 to 40,000,000 users. For most of that time we had one to three people working on the backend. Here are some suggestions on scaling, particularly in a resource-constrained, fast-growing environment that can’t always afford to do things “the right way” (i.e., any real-world engineering project ;-). If people find this useful, I’ll try to come up with more tips and write a part 2.

Run with extra load

One technique we repeatedly used was creating artificial extra load in the live site. For example, we would do a lot more memcached reads than necessary. Then when memcached broke, we could quickly switch off the duplicate queries and have time to come up with a solution. Why not just plan ahead? Because most of the time, it was a very abrupt failure that we couldn’t detect with monitoring. But with this canary in the coal mine, we could buy ourselves time to plan as well as take note of the apparent limits.

Note that just doing extra reads isn’t a perfect proxy, since high write load is more likely to cause problems, but writes are hard to simulate safely (you risk endangering data and won’t get a realistic amount of lock contention). In my experience extra reads alone are actually good enough to buy time.

As an idea, you could have a server that just sits there and plays back 10% of your read-only logs at your frontend and can be switched off as necessary. Better than switching off features!
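
To make that concrete, here is a rough sketch (in Python, since that’s what we used for everything) of what such a replay box could look like. The log format, sample rate, endpoint, and kill-switch file here are all invented for illustration, not our actual setup:

import os
import random
import time
import urllib2   # Python 2, matching the era

FRONTEND = "http://frontend.internal"   # hypothetical internal endpoint
SAMPLE_RATE = 0.10                      # replay ~10% of read-only traffic
KILL_SWITCH = "/etc/replay_load_off"    # touch this file to shed the extra load instantly

def follow(path):
    # yield lines as they are appended to the log, tail -f style
    with open(path) as f:
        f.seek(0, os.SEEK_END)
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)
                continue
            yield line

for line in follow("/var/log/nginx/access.log"):
    if os.path.exists(KILL_SWITCH):     # flip the switch when things get hairy
        time.sleep(1)
        continue
    parts = line.split()
    method, path = parts[5].strip('"'), parts[6]   # assumes the standard combined log format
    if method == "GET" and random.random() < SAMPLE_RATE:
        try:
            urllib2.urlopen(FRONTEND + path, timeout=5).read()
        except Exception:
            pass   # best effort; this traffic is disposable by design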

App-specific metrics

Another thing that became increasingly useful was graphing thousands of custom stats, aggregated across thousands of servers. Most out-of-the-box monitoring solutions aren’t meant to handle this sort of load, and we wanted a one-line way of adding stats so that we didn’t have to think about whether it cost anything, or fuss around with configuration files just to add a stat (decreasing the friction of testing and monitoring is a big priority).

We chose to implement a solution in a combination of memcached, cron, and ganglia. Every time something happened that we wanted graphed, we would store it in a thread-local memory buffer. Every second we’d post that stat to a time-keyed bucket (timestamp mod something) in memcached, and on a central machine, every minute, the buckets in memcached would get cleared out, aggregated, and posted to ganglia. Very scalable, and this made it possible for us to track thousands of stats in near real time. Even stats as fine-grained as “average time to read a memcached key” that happened dozens of times per request performed fine.
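
For the curious, here is a condensed sketch of that pipeline. It assumes the python-memcached client, and the key scheme, bucket size, and stat names are placeholders rather than our real code:

import threading
import time
import memcache   # python-memcached

mc = memcache.Client(["statscache:11211"])   # hypothetical dedicated memcached box
local = threading.local()                    # per-thread buffer of counters

def bump(stat, value=1):
    # the one-line call site: bump("memcached.read_ms", 3)
    buf = getattr(local, "buf", None)
    if buf is None:
        buf = local.buf = {}
    buf[stat] = buf.get(stat, 0) + value

def flush():
    # called roughly once a second from each worker thread: push the buffer
    # into a time-keyed bucket so a central cron can aggregate and clear it
    bucket = int(time.time()) % 3600            # bucket keys recycle hourly
    for stat, value in getattr(local, "buf", {}).items():
        key = "stats:%d:%s" % (bucket, stat)
        mc.add(key, 0, time=120)                # create with a short TTL if missing...
        mc.incr(key, int(value))                # ...then accumulate atomically
    local.buf = {}

The central minutely cron isn’t shown; it just walks the recent buckets, sums each stat across all servers, deletes the keys, and reports the totals to ganglia (e.g., via gmetric).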

Of course, when you have thousands of stats it becomes tough to just “look at the graphs” to find anomalies. Here is the one summary graph we found most useful:

[Figure: stacked graph of average response time, broken down by segment of work]

The top line represents average response time on the site (we had one of these graphs for web traffic, and one for client traffic). Each segment represents a partition of work. So you can see there was a spike in response time at around 1:00, and that was caused by something in the MySQL commit phase. Our actual graph had a bunch more segments, so you can imagine how much screen real estate this saves when you’re trying to figure shit out. “CPU” is cheating, it’s actually just the average response time minus everything else we factored out.
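
To be explicit about that last bit, the “CPU” segment is just the remainder after subtracting the measured segments from the total; something like this (the numbers and segment names here are made up, not the real graphing code):

# the "cpu" segment is whatever part of the response time we hadn't explicitly instrumented
segments = {"memcached": 12.0, "mysql_read": 20.0, "mysql_commit": 35.0}   # ms per request
total_ms = 90.0
segments["cpu"] = total_ms - sum(segments.values())   # 23.0 ms of "everything else"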

If there is a way to do this, it would also be cool to have markings on the graph pointing to events you can annotate, such as a code push or an AWS outage.

Poor man’s analytics with bash

If you haven’t used the shell much, it can be eye opening how much faster certain tasks are (and helps demystify why languages like Perl are structured the way they are). Check it out:

Let’s say you’re trying to debug something in your webserver, and you want to know if maybe there’s been a spike of activity recently and all you have are the logs. Having graphing for that webserver would be great, but if it’s on 1 or 5 minute intervals like most systems it might not be fine-grained enough (or maybe you only want to look at a certain kind of request, or whatever).

Apr 8 2012 14:33:59 POST ...
Apr 8 2012 14:34:00 GET ...
Apr 8 2012 14:34:00 GET ...
Apr 8 2012 14:34:01 POST ...

You could use your shell like this:

cut -d' ' -f1-4 log.txt | xargs -L1 -I_ date +%s -d_ | uniq -c | (echo "plot '-' using 2:1 with lines"; cat) | gnuplot

Boom! Very quickly you have a nice graph of what’s going on, and you can tailor it easily (select only one URL, certain time frames, change to a histogram, etc.). Almost all the command line tools take line separated, space delimited 2D arrays as input, with implicit numeric conversions as necessary, so you don’t usually need to do backflips to pipe programs together. They also don’t throw exceptions on bad data, which I think is a good thing when you’re just trying to do something quickly and don’t care about a few dropped data points. If you’re unfamiliar with command line tools, here is a shortlist I’d recommend becoming acquainted with:

sed, awk, grep, cut, head, tail, sort, uniq, tr, date, xargs

For sed and awk, the richer commands, I kept a couple of cheatsheets bookmarked in case I forgot how to do something. The other commands are super quick to learn.

And then the output would be to a summary file, or graphed with gnuplot, or if you want to make a traffic diagram, dot.

Log spam is really helpful

Log spam isn’t all that bad. We used to have so many random print statements scattered around the code that would end up in our webserver logs, and I can’t count the number of times they turned out to be unintentionally useful. It’s almost a way of randomly tracing your code. For example, while debugging a particularly nasty race condition, I noticed that a particular “FUUUCCKKKKKasdjkfnff” wasn’t getting printed where it should have been, and that made it clear where the problem was happening. It’s best to keep a clean log file and a dirty one with all the unstructured spam, but make sure you duplicate the clean stuff into the dirty one so that you don’t have to load both and match up timestamps.
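
One cheap way to get that clean/dirty split is with two handlers in the standard logging module; a minimal sketch (filenames and messages invented):

import logging

# everything goes to dirty.log
root = logging.getLogger()
root.setLevel(logging.DEBUG)
root.addHandler(logging.FileHandler("dirty.log"))

# anything sent through the "clean" logger also lands in clean.log
clean = logging.getLogger("clean")
clean.addHandler(logging.FileHandler("clean.log"))
# "clean" propagates to the root logger by default, so its records show up in both files

clean.info("commit ok for user %d", 42)                  # structured-ish, both logs
logging.debug("FUUUCCKKKKKasdjkfnff %r", {"state": 3})   # spam, dirty.log only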

It’s tempting to remove the super-verbose logging every so often, but I almost always regretted it when I did.

If something can fail, make sure it does

If you have something that you know can fail at any point, and you think the failover will be graceful, you should actually test this every so often. Randomly take that server off the network and make sure the failover works, because a couple things can happen:

  1. Since the last failover, increased load means the failover process now causes a cascade.
  2. In between the last failover and now, there have been a bazillion code pushes, database schema changes, internal DNS renames, etc. so any of those scripts that haven’t been run since then might depend on old assumptions.

These things are better to figure out in peacetime, so it’s best to make this stuff happen intentionally (restart your processes on a cron, clear out memcached once in a while). Maybe it sounds stupid to run fire drills on the live site, but testing environments are not sufficient and this is really good insurance.

Run infrequent stuff more often in general

The above points also go for stuff that just doesn’t run all that often in your codebase. If you can afford to push code through the infrequent code paths more often it’ll save some headaches. Like if you have a cron that runs every month, maybe run it as a dry-run every day or week to make sure that at least the assumptions are consistent, so you don’t have to debug it after a month’s worth of commits. Even just checking the module imports would help. Anything more than crossing your fingers and running it every month. Same goes for scripts that are only run manually (e.g., a script to fix a user’s state, or run diagnostics) – it’d be good to put them in a cron and catch errors, even if they’re not doing useful work (again, testing environments are unlikely to be sufficient for this stuff).
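
As a toy example of the “at least check the imports” idea, a daily cron could be as dumb as this (the module names and the dry_run convention are hypothetical):

import importlib
import traceback

MONTHLY_JOBS = ["billing.invoice_run", "cleanup.purge_deleted_accounts"]

for name in MONTHLY_JOBS:
    try:
        mod = importlib.import_module(name)        # stale imports blow up today, not next month
        dry_run = getattr(mod, "dry_run", None)
        if dry_run is not None:
            dry_run()                               # job opts in by exposing a side-effect-free dry_run()
    except Exception:
        traceback.print_exc()                       # cron mails this to us (assuming MAILTO is set)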

Try to keep things homogeneous

We once long ago had two shards for user data, and once they started getting full I added a third shard to put new users in. Damn, that was a headache! So then we had two shards growing at almost exactly the same pace and a new one growing much faster, meaning we’d have to reshard at different times. It’s much better (but obviously trickier) to just split each shard into two and keep them all looking the same.
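
Purely as an illustration of why split-in-two stays homogeneous (this is not Dropbox’s actual scheme, just the simplest version of the idea): if routing is user_id mod N, then doubling N halves every shard and each user lands on a predictable new shard.

def shard_for(user_id, num_shards):
    return user_id % num_shards

# a user on shard i ends up on shard i or i + old_n, and all shards shrink by half
old_n, new_n = 2, 4
for uid in range(8):
    assert shard_for(uid, new_n) % old_n == shard_for(uid, old_n)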

Homogeneity is good for hardware too, as capacity planning becomes a simpler problem. It’s best to have a small number of machine types. As an example, a breakdown could be high-CPU, high-memory, high-disk, and “beast” (maxed out on everything — analytics and heavy DB) type machines. It might cost a little extra to force everything to be in one of your categories, but I think it’s worth it for the simplicity (you can also switch machines around more easily when you need to).

Keep a downtime log

Every time the site goes down or degrades (even short blips), write down the start and end times for the outage and then label it with any applicable causes (bad code review, insufficient monitoring, log overflow). Then when you look at the list you can objectively answer the question “what can I do to minimize the most minutes of downtime right now?” by figuring out how to cover the most minutes. Solutions might span multiple problems and each problem might be solvable many ways, so it helps to write down as much as you can. For instance, proper monitoring might alert you to an impending disk full problem, or you can limit the amount of stuff being written to disk.
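
The tallying itself is trivial; here is a toy version with invented incident data:

from collections import defaultdict

# (minutes of downtime, labels that applied)
incidents = [
    (35, ["insufficient monitoring", "log overflow"]),
    (10, ["bad code review"]),
    (90, ["insufficient monitoring"]),
]

minutes_by_cause = defaultdict(int)
for minutes, causes in incidents:
    for cause in causes:
        minutes_by_cause[cause] += minutes

for cause, minutes in sorted(minutes_by_cause.items(), key=lambda kv: -kv[1]):
    print "%5d min  %s" % (minutes, cause)   # fix whatever tops this list first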

UTC

Keep everything in UTC internally! Server times, stuff in the database, etc. This will save lots of headaches, and not just around daylight saving time. Some software just doesn’t handle non-UTC time properly, so don’t do it! We kept a clock on the wall set to UTC. When you want to display times to a user, make the timezone conversion happen at the last second.
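
The convert-at-the-last-second part looks something like this (a sketch assuming pytz; the function name and default timezone are invented):

from datetime import datetime
import pytz   # assumption: pytz is available for the timezone database

# storing: naive UTC timestamps everywhere (database columns, server logs, etc.)
created_at = datetime.utcnow()

# displaying: convert only at render time, using whatever tz the user told us about
def render_time(utc_dt, user_tz_name="America/Los_Angeles"):
    aware = pytz.utc.localize(utc_dt)
    return aware.astimezone(pytz.timezone(user_tz_name)).strftime("%b %d %Y %H:%M")

print render_time(created_at)   # Python 2 print, matching the rest of the codebase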

Technologies we used

For those that are curious about what we chose and why, the software we used was:

  1. Python for virtually everything; not more than a couple thousand lines of C
  2. MySQL
  3. Paster/Pylons/Cheetah (web framework – minimal use beyond templating and handling form input)
  4. S3/EC2 for storing and serving file blocks
  5. memcached in front of the database and for handling inter-server coordination
  6. ganglia for graphing, with drraw for custom graphs like the stack graph mentioned above
  7. nginx for the frontend server
  8. haproxy for load balancing to app servers, after nginx (better configurability than nginx’s balancing modules)
  9. nagios for internal health checks
  10. Pingdom for external service monitoring and paging
  11. GeoIP for mapping IPs to locations

Pretty standard. The reason we chose each of these things was the same: reliability. Even memcached, which is the conceptually simplest of these technologies and used by so many other companies, had some REALLY nasty memory corruption bugs we had to deal with, so I shudder to think about using stuff that’s newer and more complicated. We went with MySQL over Postgres because Postgres didn’t have built-in replication at the time, and MySQL has a huge network of support; we were pretty sure that if we hit a problem, Google, Yahoo, or Facebook would have to deal with it and patch it before we did. :)

We used SQLAlchemy for our ORM, but I really hate ORMs and it was just a giant nuisance to deal with. We eventually ended up converting everything to SQLAlchemy’s lowest-level language for constructing SQL (one step away from raw SQL). ORM apologists like to say they’re fine if you use them right, know when to use them, understand them, etc., but sometimes SQLAlchemy was just plain wrong and delivered incorrect results. No need for that. Using one also knocks at least a few nines off of MySQL’s usually rock-solid performance.
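
For reference, the lowest-level style I mean looks roughly like this (old-era SQLAlchemy API, with an invented table and SQLite in memory just to keep the sketch self-contained):

from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select)

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
users = Table("users", metadata,
              Column("id", Integer, primary_key=True),
              Column("email", String(255)))
metadata.create_all(engine)

conn = engine.connect()
conn.execute(users.insert().values(email="someone@example.com"))
row = conn.execute(
    select([users.c.id]).where(users.c.email == "someone@example.com")   # expression language, no ORM
).fetchone()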

My only suggestion for choosing technology would be to pick lightweight things that are known to work and see a lot of use outside your company, or else be prepared to become the “primary contributor” to the project.

Simulate/analyze things before trying them

Unlike product, which is harder to reason about, backend engineering is fairly objective (optimize page load time, uptime, etc.) and so we can use this to our advantage. If you think something will produce a result, you might consider simulating the effects in an easier way before committing to implementing it. Like if you think faster procs are going to help, maybe just run a machine with only one worker and see if it makes that much of a difference (or are you really just I/O bound?). Or if you’re toying around with moving your database server to a place with more latency, add in a few ms of latency in your low-level database glue and see what happens. Or if you’re going to cache something out of the database, simulate the effects and tally up the hits so you can justify spending time worrying about the invalidation policy, etc.
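
The latency experiment, for example, can be as simple as wrapping whatever function actually issues the queries (execute_query here is a stand-in name for your own glue):

import functools
import random
import time

def with_fake_latency(fn, extra_ms=3, jitter_ms=1):
    # sleep a few milliseconds (plus jitter) before every call to simulate a farther-away database
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        time.sleep((extra_ms + random.uniform(-jitter_ms, jitter_ms)) / 1000.0)
        return fn(*args, **kwargs)
    return wrapper

# e.g. in the low-level database module:
# execute_query = with_fake_latency(execute_query, extra_ms=3)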

The security-convenience tradeoff

Security is really important for Dropbox because it’s people’s personal files. But all services are different, and many security decisions will inconvenience someone, whether it’s a programmer or a user.

For instance, almost every website has a thing where if you enter in a wrong username OR wrong password it’ll tell you that you got one wrong, but not tell you which one. This is good for security because you can’t use the information to figure out usernames, but it is a GIANT pain in the ass for people like me who can’t remember which username they registered under. So if you don’t actually care about exposing usernames (maybe on something like a forum or Pinterest where they’re public anyway), consider revealing the information to make it more convenient for users.

Having internal firewalls between servers that don’t need to talk to each other — again a good idea. But if your service doesn’t actually need this, don’t necessarily do it, or do it where it matters (separate your third-party forum software from your core website for sure — those things are buggy and more likely to get hacked than your website).

Maybe this is controversial… but security is something that people like to do lip-service to and armchair philosophize about, but in reality I think a lot of services (even banks!) have serious security problems and seem to be able to weather a small PR storm. So figure it out if it really is important to you (are you worth hacking? do you actually care if you’re hacked? is it worth the engineering or product cost?) before you go and lock down everything.
