Designing a Text Sharing Service Like Pastebin

1. Introduction

Pastebin is a website that allows users to share plain text through public posts called “pastes”.

    Prerequisites:
    System design introduction: 3 principles of distributed systems, 5-step guide for system design.

    System design concepts & components: Horizontal scaling, Databases

2. Requirement analysis

    Functional requirements

    1. The user should be able to create a paste and receive a unique, randomly generated URL for it.
    2. The generated links expire after a certain period of time.
    3. The user should be able to specify a title for a paste.
    4. Users should be able to upload text only, up to a maximum size.

    Non-functional requirements

    1. The system should be highly reliable; no data should be lost.
    2. The system should be highly available.
    3. Latency should be minimal.

3. API design

    String createPaste(pasteData, expirationDate : optional, title : optional) : returns shortUrl

    PasteData getPaste(shortUrl)
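
The two calls above can be sketched as a small in-memory service. This is only an illustration under stated assumptions: the 8-character base62 key, the 100KB size limit, the 30-day default expiry, and the names `Paste` and `PasteService` are all hypothetical, and a plain dictionary stands in for the real datastore.

```python
import secrets
import string
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

ALPHABET = string.ascii_letters + string.digits  # 62 characters (assumed alphabet)
KEY_LENGTH = 8                                   # hypothetical key length
MAX_PASTE_BYTES = 100 * 1024                     # hypothetical size limit
DEFAULT_TTL = timedelta(days=30)                 # hypothetical default expiry


@dataclass
class Paste:
    short_url: str
    paste_data: str
    creation_date: datetime
    expiration_date: datetime
    title: Optional[str] = None


class PasteService:
    """In-memory stand-in for the real datastore."""

    def __init__(self):
        self._store = {}  # short_url -> Paste

    def create_paste(self, paste_data, expiration_date=None, title=None, now=None):
        """Store the text and return its short URL key."""
        if len(paste_data.encode("utf-8")) > MAX_PASTE_BYTES:
            raise ValueError("paste exceeds maximum size")
        now = now or datetime.now(timezone.utc)
        # Draw random keys until an unused one is found (collisions are rare).
        while True:
            short_url = "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))
            if short_url not in self._store:
                break
        expires = expiration_date or now + DEFAULT_TTL
        self._store[short_url] = Paste(short_url, paste_data, now, expires, title)
        return short_url

    def get_paste(self, short_url, now=None):
        """Return the paste, or None if it is unknown or has expired."""
        now = now or datetime.now(timezone.utc)
        paste = self._store.get(short_url)
        if paste is None or paste.expiration_date <= now:
            return None
        return paste
```

Expired pastes are simply hidden on read here; a real system would also need a background job or a database TTL feature to reclaim the storage.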

4. Define data model

      PasteUrl: ShortUrl<PK>, Title, CreationDate, ExpirationDate, PasteData

      What kind of database to use?

      NoSQL document databases

      Partition key?

      Consistent hashing on short url.
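
As a rough sketch of what consistent hashing on the short URL could look like, the ring below maps each key to the first node clockwise from the key's hash. The class name, the node names, the use of SHA-256, and the 100 virtual nodes per server are assumptions made for illustration; a managed NoSQL store would normally do this internally.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps short URLs to storage nodes using a hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        # Take the first 8 bytes of SHA-256 as a 64-bit position on the ring.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node, vnodes=100):
        # Each physical node gets several positions to even out the load.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, short_url):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (self._hash(short_url), ""))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
print(ring.get_node("aZ3kQ9xB"))  # one of db-1, db-2, db-3
```

The property that matters for scaling is that removing or adding a node only moves the keys that belonged to that node's ring positions; every other short URL stays on the partition it already had.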

5. Back-of-the-envelope calculations

        We need to make three estimates:

        1. Assuming 10 write requests per second, how many unique short URLs are required over the next 5 years?
        2. Assuming each write request averages 100KB, how much data is generated over the next 5 years?
        3. Assuming reads are 10 times more frequent than writes, how much data is read per second?

        Answer 1 :

        • Requests per day = 10 * 60 * 60 * 24 = 36000 * 24 ≈ 40000 * 20 = 800000 = 800K requests per day
        • So per year = 800K * 365 ≈ 800K * 400 = 320000K = 320 Million URLs per year
        • So 320 Million * 5 = 1600 Million = 1.6 Billion URLs for 5 years
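
To sanity-check the rounding, the same estimate can be computed exactly. The sketch below also derives the shortest key that could cover that many URLs, assuming a 62-character alphabet (a-z, A-Z, 0-9); the alphabet and the function names are assumptions, not part of the original design.

```python
def total_short_urls(writes_per_second, years):
    """Exact number of pastes created over the given period."""
    per_day = writes_per_second * 60 * 60 * 24
    return per_day * 365 * years


def min_key_length(total_keys, alphabet_size=62):
    """Smallest key length whose keyspace holds at least total_keys."""
    length = 1
    while alphabet_size ** length < total_keys:
        length += 1
    return length


total = total_short_urls(10, 5)
print(total)                  # 1576800000
print(min_key_length(total))  # 6
```

Six characters is only the lower bound. Randomly generated keys need a much larger keyspace than the number of stored URLs to keep collisions rare, so a longer key is a reasonable choice in practice.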

        Answer 2 :

        • 10 * 100KB = 1000KB = 1MB writes/s
        • So each day we write 1MB * 60 * 60 * 24 = 3600MB * 24 ≈ 4000MB * 20 = 80000MB = 80GB per day.
        • So each year we write 80GB * 365 ≈ 80GB * 400 = 32000GB = 32TB.
        • We don’t want to use more than 70% of our data storage at any time, so we procure (32 * 100)/70 ≈ (35 * 100)/70 = 50TB of disk space per year.
        • For 5 years we need 50TB * 5 = 250TB of disk space.
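
The storage estimate can be reproduced without rounding as a quick check. This sketch uses decimal units (1TB = 10^9 KB), as the bullets do, and the function name is hypothetical. The exact figures come out somewhat below 32TB and 250TB because the hand calculation rounds up at each step.

```python
def storage_estimate_tb(writes_per_second, avg_paste_kb, years, max_utilization=0.7):
    """Return (raw TB written per year, TB to provision for the whole period)."""
    kb_per_year = writes_per_second * avg_paste_kb * 60 * 60 * 24 * 365
    raw_tb_per_year = kb_per_year / 1_000_000_000  # decimal units: 1TB = 10^9 KB
    # Provision extra space so disks never exceed max_utilization.
    provisioned_tb_per_year = raw_tb_per_year / max_utilization
    return raw_tb_per_year, provisioned_tb_per_year * years


raw, provisioned = storage_estimate_tb(10, 100, 5)
print(f"{raw:.1f} TB written per year")                 # 31.5 TB written per year
print(f"{provisioned:.0f} TB provisioned for 5 years")  # 225 TB provisioned for 5 years
```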

        Answer 3 :

        • 10 * 1MB = 10MB reads/s

        Number of servers required?

        Say each server handles 1MB per second; we would then need a minimum of 10 servers for reads and 1 server for writes.
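
The server count is simply total throughput divided by per-server capacity, rounded up. The per-server figure is an assumption that would have to come from load testing, so the sketch below keeps it as a parameter; the function name is hypothetical.

```python
import math


def servers_needed(total_mb_per_s, per_server_mb_per_s):
    """Minimum number of servers to sustain the given throughput."""
    return max(1, math.ceil(total_mb_per_s / per_server_mb_per_s))


# 10MB/s of reads (Answer 3) and 1MB/s of writes (Answer 2).
print(servers_needed(10, 1))   # 10 read servers at 1MB/s each
print(servers_needed(1, 1))    # 1 write server at 1MB/s
print(servers_needed(10, 10))  # 1 read server if each can handle 10MB/s
```

This is a lower bound only; redundancy for high availability would add servers on top of it.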

        Caching capacity using 80-20 rule?

        • Assuming 20% of each day's data is responsible for 80% of traffic, 80GB * 20% = 80 * 0.2 = 16GB of cache is needed.
        • Because our system has 10MB/s of reads (Answer 3) and we have applied the 80-20 rule, 80% of 10MB/s = 8MB/s of traffic is now read from the cache. The remaining 20% (2MB/s) is read directly from the database.
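
The 80-20 sizing can be written out as a small helper. It takes the 80GB of data written per day from Answer 2 and the 10MB/s read rate from Answer 3; the function and parameter names are hypothetical.

```python
def cache_plan(daily_data_gb, read_mb_per_s, hot_fraction=0.2, hot_traffic_share=0.8):
    """Return (cache size in GB, MB/s served by cache, MB/s reaching the database)."""
    cache_gb = daily_data_gb * hot_fraction               # hot 20% of a day's data
    cache_mb_per_s = read_mb_per_s * hot_traffic_share    # 80% of reads hit the cache
    db_mb_per_s = read_mb_per_s - cache_mb_per_s          # the rest go to the database
    return cache_gb, cache_mb_per_s, db_mb_per_s


cache_gb, from_cache, from_db = cache_plan(daily_data_gb=80, read_mb_per_s=10)
print(f"{cache_gb:.0f}GB cache, {from_cache:.0f}MB/s from cache, {from_db:.0f}MB/s from database")
```

This assumes every request for hot data is a cache hit; real hit rates depend on the eviction policy and on how quickly pastes expire.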

6. High level design

          Problem statement :

          Create a unique URL for each paste, and save all pastes in the document database.

          Solution :

          Refer to Design Tiny URL.

7. Scaling the design

8. Additional thoughts

            Additional requirements

            1. Specify paste exposure: private or public.
            2. Delete pastes.

9. Next Steps
            Ask questions and share your feedback in the course.


            © Copyright CompleteDesignInterviewCourse.com