The guiding principles and technology stack, explained in simple terms
Instagram increased the number of users from 0 to 14 million in just one year, from October 2010 to December 2011. They did it with just three engineers.
They did this by following 3 key principles and having a robust technology stack.
Keep it simple
Don’t reinvent the wheel
Use proven and reliable technologies whenever possible
Instagram’s early infrastructure ran on AWS using EC2 with Ubuntu Linux. For reference: EC2 is an Amazon service that allows developers to rent virtual computers.
To make things concrete, and since I like to think about the user from an engineer's point of view, let's follow the life of a single user session. (We'll label these steps "Session".)
Session: User opens Instagram
Instagram was originally launched as an iOS app in 2010. Since Swift was released in 2014, we can assume that Instagram was written using Objective-C and a combination of other things like UIKit.
Session: Once the app is opened, a request to get photos from the main feed is sent to the backend where it goes to Instagram’s load balancer.
Instagram used Amazon's Elastic Load Balancer. Behind it sat 3 NGINX instances that were rotated in and out of service depending on their health.
Each request first hits the load balancer and is then forwarded to the actual application server.
Session: The load balancer sends a request to the application server, which contains the logic to handle the request correctly.
Instagram's application server used Django and was written in Python, with Gunicorn as the WSGI server.
As a reminder, WSGI (Web Server Gateway Interface) forwards requests from the web server to the web application.
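To make the WSGI contract concrete, here is a minimal sketch (not Instagram's code): the server calls an application callable with the request environment and a `start_response` callback, and the application returns an iterable of response bytes.

```python
# Minimal WSGI application: the interface Gunicorn speaks to Django.
def application(environ, start_response):
    body = b"Hello from a WSGI app"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# A WSGI server would be pointed at this callable (e.g. `gunicorn module:application`).
# We can also invoke it directly, the way a server would:
captured = {}
def fake_start_response(status, headers):
    captured["status"] = status

body = b"".join(application({"REQUEST_METHOD": "GET", "PATH_INFO": "/"},
                            fake_start_response))
print(captured["status"], body)
```

Django exposes exactly such a callable, which is why swapping web servers in front of it is straightforward.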
Instagram used Fabric to run commands across many instances in parallel, which made it possible to deploy code in seconds.
They were hosted on more than 25 Amazon High-CPU Extra-Large machines. Since the server itself is stateless, when they needed to handle more requests they could add more machines.
Shared data storage
Session: The application server sees that the request needs data for the main feed. For this, let’s say you need:
The latest photo IDs
The actual photos matching those photo IDs
User data for those photos
Session: The application server retrieves the latest matching photo IDs from Postgres.
The application server retrieves data from PostgreSQL, which stores most of Instagram's data, such as user and photo metadata.
Connections between Postgres and Django were pooled using PgBouncer.
Instagram sharded its data because of the volume it was receiving (more than 25 photos and 90 likes per second). In code, they mapped several thousand "logical" shards to a much smaller number of physical shards.
An interesting challenge that Instagram faced and solved was generating IDs that could be sorted by time. The final identifiers sorted by time looked like this:
41 bits for time in milliseconds (gives us 41 years of identifiers with custom epoch)
13 bits representing the logical shard ID
10 bits for an auto-incrementing sequence, modulo 1024, meaning up to 1024 IDs per shard per millisecond
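The bit layout above can be reconstructed directly in code. This is a hedged sketch: the custom epoch value is an assumption (January 1, 2011 UTC), not Instagram's actual constant.

```python
import time

# 41 bits of milliseconds since a custom epoch, 13 bits of logical shard
# ID, 10 bits of a per-shard sequence modulo 1024 (41 + 13 + 10 = 64 bits).
CUSTOM_EPOCH_MS = 1293840000000  # Jan 1, 2011 UTC; assumed for illustration

def make_id(shard_id, sequence, now_ms=None):
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    millis = now_ms - CUSTOM_EPOCH_MS
    # Time occupies the most significant bits, so later IDs compare greater:
    return (millis << 23) | ((shard_id & 0x1FFF) << 10) | (sequence % 1024)

def decode_id(photo_id):
    millis = photo_id >> 23
    shard_id = (photo_id >> 10) & 0x1FFF
    sequence = photo_id & 0x3FF
    return millis + CUSTOM_EPOCH_MS, shard_id, sequence
```

Because the timestamp sits in the high bits, sorting IDs numerically sorts them by creation time, and the shard ID can be recovered from any photo ID without a lookup.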
With time-sorted IDs in Postgres, the application server successfully retrieved the latest matching photo IDs.
Image storage: S3 and Cloudfront
Session: The app server fetches the actual photos for these photo IDs via CDN links so that they load quickly for the user.
Several terabytes of photos were stored in Amazon S3, and they were served to users quickly via Amazon CloudFront.
Caching: Redis and Memcached
Session: To get user data from Postgres, the application server (Django) maps photo IDs to user IDs using Redis.
Instagram used Redis to store a mapping of about 300 million photo IDs to the user ID of the user who created them, in order to know which shard to query when retrieving photos for the main feed, activity feed, and so on. Everything in Redis was held in memory to reduce latency, and the data was partitioned across several machines.
Thanks to smart hashing, Instagram was able to store 300 million key-value mappings in less than 5 GB.
This key-value mapping of the photo ID to the user ID was necessary to know which Postgres fragment to query.
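The memory trick behind fitting 300 million mappings into a few gigabytes was bucketing: instead of one Redis string key per photo, groups of photo IDs share a single Redis hash, which Redis stores in a compact encoding when the hash is small. A hedged sketch of the key scheme (illustrative bucket size, not Instagram's exact code):

```python
# Bucket every 1000 photo IDs into one Redis hash. Small hashes use a
# compact in-memory encoding, drastically reducing per-key overhead
# compared to 300 million individual string keys.
BUCKET_SIZE = 1000  # assumed value

def bucket_key(photo_id):
    """Return the (hash key, field) pair under which a photo's user ID lives."""
    return f"photos:{photo_id // BUCKET_SIZE}", str(photo_id)

# With a real Redis client this would be used roughly as:
#   key, field = bucket_key(1155315)
#   r.hset(key, field, user_id)     # store the mapping
#   user_id = r.hget(key, field)    # look it up later
print(bucket_key(1155315))
```

The trade-off is that a lookup now needs two pieces (bucket key and field) instead of one key, in exchange for a large reduction in memory overhead.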
Session: Thanks to efficient caching with Memcached, retrieving user data from Postgres was fast because the response had been cached recently.
For general caching, Instagram used Memcached. At that time they had 6 Memcached instances. Memcached is relatively easy to layer on top of Django.
Fun fact: two years later, in 2013, Facebook published a landmark paper on how they scaled Memcached to handle billions of requests per second.
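The caching pattern described above is classic cache-aside: check the cache first, fall back to Postgres on a miss, then populate the cache. A self-contained sketch, with a plain dict standing in for Memcached so no server is needed:

```python
# Cache-aside pattern. In a Django app, `cache` would be the Memcached
# backend from django.core.cache; a dict stands in here for illustration.
cache = {}

def get_user(user_id, load_from_db):
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:               # cache miss: go to Postgres once...
        user = load_from_db(user_id)
        cache[key] = user          # ...then cache the result for next time
    return user

# Demonstration with a fake database loader that counts its calls:
db_calls = []
def fake_db_load(user_id):
    db_calls.append(user_id)
    return {"id": user_id, "name": "kevin"}  # hypothetical row

get_user(42, fake_db_load)
get_user(42, fake_db_load)   # second call is served from the cache
print(len(db_calls))         # the database was only hit once
```

In real deployments the cached entry would also carry a TTL so stale user data eventually expires.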
Session: The user now sees a home feed filled with the latest photos of the people they follow.
Setting up master-replica
Both Postgres and Redis ran in a master-replica configuration and used Amazon EBS (Elastic Block Store) snapshots for frequent backups.
Push notifications and asynchronous tasks
Session: Now let’s say the user closes the app, but then receives a push notification that a friend posted a photo.
This push notification was sent using pyapns, like the more than a billion push notifications Instagram had already sent. pyapns is an open-source, all-in-one Apple Push Notification Service (APNS) provider.
Session: The user really likes this photo! So they decide to share it on Twitter.
On the server side, the task is placed in Gearman, a task queue that farms work out to better-suited machines. Instagram had about 200 Python workers consuming the Gearman task queue.
Gearman was used for a variety of asynchronous tasks, such as broadcasting an activity (like posting a new photo) to all of a user's followers (this is known as fan-out).
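The fan-out pattern can be sketched without a real Gearman server. This hedged stand-in uses Python's `queue` and `threading` modules: the web request only enqueues one small task and returns immediately; a background worker expands it into one feed write per follower.

```python
import queue
import threading

tasks = queue.Queue()       # stand-in for the Gearman task queue
delivered = []              # stand-in for writes to followers' feeds

FOLLOWERS = {"kevin": ["mike", "josh", "ana"]}  # hypothetical data

def enqueue_new_photo(author, photo_id):
    # The web request does only this cheap enqueue, then responds.
    tasks.put(("fan_out", author, photo_id))

def worker():
    # One of the ~200 background workers: expand each task into
    # one delivery per follower.
    while True:
        job = tasks.get()
        if job is None:     # shutdown sentinel
            break
        _, author, photo_id = job
        for follower in FOLLOWERS.get(author, []):
            delivered.append((follower, photo_id))
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()
enqueue_new_photo("kevin", 101)
tasks.join()                # wait until the fan-out task is processed
tasks.put(None)
t.join()
print(sorted(delivered))
```

The design choice is the same one Gearman enables: the expensive O(followers) work happens off the request path, so posting a photo stays fast even for users with many followers.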
Session: Uh oh! The Instagram app crashed because the server encountered an error and sent an invalid response. The three Instagram engineers are instantly notified.
Instagram used Sentry, an open-source Django application for real-time Python error monitoring.
Munin was used to graph system-wide metrics and alert on anomalies. Instagram had several custom Munin plugins to track application-level metrics, such as the number of photos posted per second.
Pingdom was used to monitor external services, and PagerDuty was used to handle incidents and notifications.