We all have used social media be it Instagram, Facebook or Twitter. Everytime we upload pictures, we know it gets saved in some storage having a unique identifier. Ever wondered how that unique id is generated?
We have seen a similar problem while designing Tiny URL. There we had used Key Generation server to generate and save unique ids.
Let us understand various other approaches that can be used to solve the problem.
One of the most simplest way to solve this problem is to use an auto incrementing number as a unique id. This can be easily implemented and maintained using the sql database auto increment feature. One benefit of using sql database is that it is a well understood technology. It is easy to predict its behaviour and scalability features.
Problems:
Solution:
Use several databases to mitigate the above drawbacks.
Challenges with the solution:
How do we divide the auto-incrementing functionality among the databases?
We can consider something like one database uses odd and other even numbers. Thus we are dividing the numbers among the databases. If we have several such databases we need to come up with some other algorithm which can be as simple as one database starts its count from 1, 2nd from 10 Million, 3rd from 20 Million and so on. Flikr uses this approach for generation of its unique ids.
Drawbacks:
This approach although works will make us loose the ability to sort the items based on arrival time. If we want it to be sortable we need to keep additional metadata information about the arrival timestamp.
This brings us to the next approach where we can have timestamps as the unique id. In such scenario we can even sort the data using the timestamp value.
Problems:
Solution:
Append some additional data to the timestamp to make the ids unique.
Challenges:
What data can be appended to the timestamp to make the id unique?
Solution:
Since most distributed systems employ worker threads to do their work and these workers have unique id we can use it to generate our unique identifier. Example we have have 10 worker threads accepting images then the unique id can be
timestamp + worker number
Here worker number is the unique worker id. Whenever we spawn workers in any system we give a unique id to it.
Challenges:
What happens if 2 images arrive at the same worker at the same time? In such case the unique id generated for both images will be the same.
Solution:
We need some additional data to make our id unique. We can add a sequence number which is an ever increasing number per worker thread. So we can append these 3 variables to generate a unique Id.
timestamp + worker number + sequence number
Example:
Considering our timestamp is 8888, the item arrives at worker thread 05 and we start our sequence with 0001 our unique id will be 8888050001.
Another thing to note is that we need to keep the size of the unique id constant. This is because it is easier to process and distinguish between the timestamp, worker id and sequence number. While sorting we can extract the timestamp and sequence number from the id and sort on them.
The timestamp based approach is famously used by twitter and the unique id generated is called snowflake id. To meet their requirements twitter uses a 64 bits to save its ids. If you are giving an interview in twitter it is a very important question.
Not just twitter there are various applications where we need a unique id generator. Instagram uses an approach similar to snowflake but instead of having a separate snowflake service it has added the logic inside PostgreSQL.
Twitter’s snowflake is one of the most famous solutions to solving the unique id generation problem. If anyone asks you about unique id generation you should instantly think of snowflake!