Notes - How Dropbox Scaled 2007 to 2012
#scalability #startups #videos
These are some of my notes from the Stanford lecture on YouTube: How We’ve Scaled Dropbox
Main Challenges
There were two primary challenges that motivated the scaling and architecture evolution at Dropbox between 2007 and 2012:
- Application Write Volume
    - Most web applications have a fairly high read-to-write (R/W) ratio; most users are simply reading content and only periodically posting something new
    - Dropbox's ratio was close to 1:1 given the amount of data being uploaded as well as read
    - This made caching difficult: the data you would want to cache is the files users store on Dropbox, and it changes about as often as it is read (see the cache simulation after this list)
- ACID Requirements
    - ACID: Atomicity, Consistency, Isolation, Durability
    - It's not acceptable to upload a file and have only half of it be watchable, so consistency and durability were especially important (see the transaction sketch after this list)
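To see why a roughly 1:1 R/W ratio hurts caching, here is a rough simulation (not from the talk, purely illustrative): every write invalidates the cached entry, so with mostly reads the hit rate stays high, but near 1:1 roughly half the lookups miss.

```python
import random

def hit_rate(read_fraction, operations=100_000, keys=1_000, seed=0):
    """Simulate a cache where every write invalidates the cached entry."""
    rng = random.Random(seed)
    cache = set()          # keys currently present in the cache
    hits = reads = 0
    for _ in range(operations):
        key = rng.randrange(keys)
        if rng.random() < read_fraction:   # read path
            reads += 1
            if key in cache:
                hits += 1
            else:
                cache.add(key)             # miss -> populate the cache
        else:                              # write path invalidates the entry
            cache.discard(key)
    return hits / reads

# Typical read-heavy web app vs. a Dropbox-like ~1:1 workload
print(f"99% reads: {hit_rate(0.99):.2f} hit rate")
print(f"50% reads: {hit_rate(0.50):.2f} hit rate")
```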
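For the ACID point, a minimal transaction sketch of why this matters, using Python's stdlib sqlite3 as a stand-in for MySQL (the schema and table names are made up): the file row and all of its block rows commit together or not at all, so a reader never sees a half-uploaded file.

```python
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS files  (file_id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE IF NOT EXISTS blocks (file_id INTEGER, seq INTEGER, block_hash TEXT);
""")

def commit_file(path, block_hashes):
    """Make a file visible only once ALL of its block records are committed.

    The `with conn:` block is a single transaction: if any insert fails,
    everything rolls back and readers never see a partially uploaded file.
    """
    with conn:
        cur = conn.execute("INSERT INTO files (path) VALUES (?)", (path,))
        file_id = cur.lastrowid
        for seq, h in enumerate(block_hashes):
            conn.execute(
                "INSERT INTO blocks (file_id, seq, block_hash) VALUES (?, ?, ?)",
                (file_id, seq, h),
            )
```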
Application Evolution
Beginning 2007 (0 users)
- Started with a single server
- The server ran a MySQL server, stored data on local disk, and ran the main application instance
Figure: The initial monolithic server
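A minimal sketch of what that single box might have looked like (the paths, schema, and sqlite stand-in for MySQL are all assumptions): the application, the database, and the file bytes all live on one machine.

```python
import os
import sqlite3

DATA_DIR = "/srv/dropbox/data"                  # assumed local path on the one server
db = sqlite3.connect("/srv/dropbox/meta.db")    # sqlite stands in for the local MySQL
db.execute("CREATE TABLE IF NOT EXISTS files (user TEXT, path TEXT, size INTEGER)")

def handle_upload(user, rel_path, data):
    """v1: application, database, and file storage all on the same machine."""
    abs_path = os.path.join(DATA_DIR, user, rel_path)
    os.makedirs(os.path.dirname(abs_path), exist_ok=True)
    with open(abs_path, "wb") as f:             # file contents on local disk
        f.write(data)
    with db:                                    # metadata row in the local database
        db.execute("INSERT INTO files VALUES (?, ?, ?)", (user, rel_path, len(data)))
```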
What broke?
- The server ran out of disk space
- The server became overloaded
Late 2007 (0 users)
- Instead of storing everything on the server's local disk, data was written to Amazon S3, and MySQL was moved to a separate server
Figure: The evolved architecture using Amazon S3 and a separate database server
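Roughly, the upload path changes from "write to local disk" to "write to S3, then record metadata on the separate database server". A hedged sketch using boto3 and PyMySQL; the bucket, host, credentials, and table names are made up for illustration.

```python
import boto3
import pymysql

s3 = boto3.client("s3")
BUCKET = "dropbox-file-data"                     # hypothetical bucket name

meta_conn = pymysql.connect(host="db.internal", user="app",
                            password="...", database="dropbox_meta")

def handle_upload(user, rel_path, data):
    """v2: file bytes go to Amazon S3, metadata goes to MySQL on a separate server."""
    key = f"{user}/{rel_path}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)   # durable blob storage in S3
    with meta_conn.cursor() as cur:                    # metadata on the remote DB box
        cur.execute(
            "INSERT INTO files (user, path, size) VALUES (%s, %s, %s)",
            (user, rel_path, len(data)),
        )
    meta_conn.commit()
```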
What broke?
- Downloading and uploading started to bring down the regular functionality of the site
- Needed to split the work across different servers
- Clients polling for changes was really taxing the servers
Late 2008 (50k users)
- Break the server into three servers:
    - notificationService: sends updates about file uploads and new content to clients
    - metaService: an application-level server that users would interface with
    - blockServer: an API-like server; Dropbox used blocks of data as their data model and API (sketched below)
Figure: A three-service backend architecture splitting responsibilities across a notification server, a meta server, and a block API server
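The notes only say that blocks were the data model and API, so here is one plausible sketch (the block size, hashing scheme, and in-memory dicts are assumptions, not Dropbox's actual implementation): the blockServer stores content-addressed blocks, and the metaServer maps a path to an ordered list of block hashes.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024   # assumed block size, for illustration only

# blockServer side: content-addressed block storage (dict as a stand-in)
block_store = {}

def put_blocks(data):
    """Split a file into blocks, store each by its hash, return the ordered hash list."""
    hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        block_store.setdefault(h, block)   # identical blocks are only stored once
        hashes.append(h)
    return hashes

# metaServer side: a file is just a path plus an ordered list of block hashes
file_meta = {}

def commit_file(path, data):
    file_meta[path] = put_blocks(data)

def read_file(path):
    return b"".join(block_store[h] for h in file_meta[path])
```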
What broke?
- The blockServer was doing DB queries directly, causing slowdowns
- There was only one database, which caused performance bottlenecks
Late 2008 (100k users)
- Added more metaServers and blockServers
- The blockServers started proxying calls to the metaServers through a load balancer
- Also added memcache in front of everything that was easily cacheable (see the caching sketch below)
    - This saved the team from having to do the intensive work of database partitioning or other database scaling strategies
Figure: The same three-service architecture as before, but with additional instances of the meta and block servers
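A sketch of the "memcache everything easily cacheable" idea as a read-through cache, using the python-memcached client; the key naming, TTL, and schema are assumptions.

```python
import json
import memcache   # python-memcached client; pylibmc exposes a similar interface

mc = memcache.Client(["127.0.0.1:11211"])

def get_file_meta(path, db_conn):
    """Read-through cache: check memcache first, fall back to the database."""
    key = "filemeta:" + path
    cached = mc.get(key)
    if cached is not None:
        return json.loads(cached)
    with db_conn.cursor() as cur:              # e.g. a PyMySQL connection
        cur.execute("SELECT size, block_hashes FROM files WHERE path = %s", (path,))
        row = cur.fetchone()
    mc.set(key, json.dumps(row), time=300)     # assumed 5-minute TTL
    return row

def update_file_meta(path, db_conn, new_row):
    """Writes must invalidate (or overwrite) the cached entry."""
    # ... write the new row to the database here ...
    mc.delete("filemeta:" + path)
```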
Early 2012 (50M users)
- Added and stacked up the servers
- Scaled out the existing architecture with additional instances
Figure: Multiple instances of all components. Though trivial in a picture, this is difficult in practice
Lessons
It’s hard to know what will break until it does
- It's difficult to arrive at a solution by building from scratch rather than by evolving your existing architecture
- e.g. once you move to a sharded database, it's difficult to know whether a given application-level query will still work (see the sharding sketch below)
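As an illustration of that sharding point (the shard count, schema, and in-memory sqlite shards are assumptions): a per-user query still maps to a single shard, but a global query can no longer be one SQL statement and has to fan out to every shard and be merged in application code.

```python
import sqlite3

# Two shards, with files split by user_id (in-memory sqlite as a stand-in).
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for s in shards:
    s.execute("CREATE TABLE files (user_id INTEGER, path TEXT, size INTEGER)")

def shard_for(user_id):
    return shards[user_id % len(shards)]

# Per-user queries still work: they hit exactly one shard.
def user_total_bytes(user_id):
    row = shard_for(user_id).execute(
        "SELECT COALESCE(SUM(size), 0) FROM files WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0]

# A global query ("10 largest files anywhere") must query every shard and merge.
def largest_files(n=10):
    candidates = []
    for s in shards:
        candidates += s.execute(
            "SELECT path, size FROM files ORDER BY size DESC LIMIT ?", (n,)
        ).fetchall()
    return sorted(candidates, key=lambda r: r[1], reverse=True)[:n]
```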
The first iteration probably won’t work well
- You don't necessarily want to start with the highest-performing solution first, because that can impede the business use case