Generating Unique Identifiers in Distributed Systems - Twitter's Snowflake Id

Introduction

We all have used social media be it Instagram, Facebook or Twitter. Everytime we upload pictures, we know it gets saved in some storage having a unique identifier. Ever wondered how that unique id is generated? 

We have seen a similar problem while designing Tiny URL. There we had used Key Generation server to generate and save unique ids.

Let us understand various other approaches that can be used to solve the problem.

1. Auto incrementing numbers

One of the most simplest way to solve this problem is to use an auto incrementing number as a unique id. This can be easily implemented and maintained using the sql database auto increment feature. One benefit of using sql database is that it is a well understood technology. It is easy to predict its behaviour and scalability features.

Problems:

  1. If we use a single database it becomes a single point of failure.
  2. If the database is situated at a single place for geographically distributed systems there is high latency.
  3. A single database can become performance bottleneck.

Solution:

Use several databases to mitigate the above drawbacks.

Challenges with the solution:

How do we divide the auto-incrementing functionality among the databases?

We can consider something like one database uses odd and other even numbers. Thus we are dividing the numbers among the databases. If we have several such databases we need to come up with some other algorithm which can be as simple as one database starts its count from 1, 2nd from 10 Million, 3rd from 20 Million and so on. Flikr uses this approach for generation of its unique ids.

Drawbacks:

This approach although works will make us loose the ability to sort the items based on arrival time. If we want it to be sortable we need to keep additional metadata information about the arrival timestamp.   

2. Using Timestamps

This brings us to the next approach where we can have timestamps as the unique id. In such scenario we can even sort the data using the timestamp value.

Problems:

  1. There can be 2 items arriving at the same timestamp. In this scenario we will have 2 items with the same id.
  2. This scenario becomes more apparent in distributed systems. Example it is a very common scenario that people across the world are uploading pics on Instagram at the same timestamp.

Solution:

Append some additional data to the timestamp to make the ids unique.

Challenges:

What data can be appended to the timestamp to make the id unique?

Solution:

Since most distributed systems employ worker threads to do their work and these workers have unique id we can use it to generate our unique identifier. Example we have have 10 worker threads accepting images then the unique id can be

timestamp + worker number

Here worker number is the unique worker id. Whenever we spawn workers in any system we give a unique id to it.

Challenges:

What happens if 2 images arrive at the same worker at the same time? In such case the unique id generated for both images will be the same.

Solution:

We need some additional data to make our id unique. We can add a sequence number which is an ever increasing number per worker thread. So we can append these 3 variables to generate a unique Id.

timestamp + worker number + sequence number

 

Example:

Considering our timestamp is 8888, the item arrives at worker thread 05 and we start our sequence with 0001 our unique id will be 8888050001.

Another thing to note is that we need to keep the size of the unique id constant. This is because it is easier to process and distinguish between the timestamp, worker id and sequence number. While sorting we can extract the timestamp and sequence number from the id and sort on them.

Twitter's Snowflake ID

The timestamp based approach is famously used by twitter and the unique id generated is called snowflake id. To meet their requirements twitter uses a 64 bits to save its ids. If you are giving an interview in twitter it is a very important question.

Not just twitter there are various applications where we need a unique id generator. Instagram uses an approach similar to snowflake but instead of having a separate snowflake service it has added the logic inside PostgreSQL.

Twitter’s snowflake is one of the most famous solutions to solving the unique id generation problem. If anyone asks you about unique id generation you should instantly think of snowflake!

The Complete Design Interview Course

Let's connect on LinkedIn