A web crawler (also called a spider or spiderbot) is an internet bot that systematically browses web pages, mainly for the purpose of indexing them. A distributed web crawler employs several machines to crawl in parallel. One of the best-known distributed web crawlers is Googlebot, which indexes a large share of the pages on the public web.
Prerequisites:
System design introduction: 3 principles of distributed systems, 5-step guide for system design.
System design concepts & components: Horizontal scaling, Databases, Message queues, Caching.
Functional requirements: given a set of seed URLs, the crawler should download each page, save it, extract the URLs it contains, and repeat the process for every newly discovered URL.
Non-functional requirements: the crawler should be scalable (able to crawl more pages by adding machines), polite (it should not overwhelm any single website), and robust (bad HTML, unresponsive servers, and worker crashes should not halt the crawl).
There are no APIs to design. The crawler makes HTTP requests to web pages and saves the responses, as shown in the sketch below.
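A minimal sketch of that interaction, using only Python's standard library; the URL, the User-Agent string, and the local file name are illustrative choices:

    import urllib.request

    def fetch_page(url):
        # Identify ourselves with a User-Agent header, as a polite crawler should.
        request = urllib.request.Request(url, headers={"User-Agent": "MyCrawler/1.0"})
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()

    html = fetch_page("https://example.com")
    with open("example.html", "wb") as f:
        f.write(html)  # later we will save to an object store instead of local disk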
URL table: url (primary key), isVisited, visitedTimestamp
Which database to use?
A NoSQL key-value database such as DynamoDB: every lookup is by URL (the key), we need no joins or complex queries, and key-value stores scale horizontally as the URL set grows.
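As a rough sketch, assuming DynamoDB via boto3, writing a URL record could look like this; the table name crawler-urls is hypothetical, and configured AWS credentials are assumed:

    import time
    import boto3  # assumes AWS credentials are configured

    # Assumed table with partition key "url" (string); the name is hypothetical.
    table = boto3.resource("dynamodb").Table("crawler-urls")

    def mark_visited(url):
        table.put_item(Item={
            "url": url,  # primary key
            "isVisited": True,
            "visitedTimestamp": int(time.time()),
        })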
Where to save the web pages?
An object store such as Azure Blob Storage or Amazon S3: crawled pages are large, immutable blobs, which is exactly the workload object stores are built for.
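A sketch of saving a page to S3 with boto3; the bucket name crawler-pages and the URL-hash key scheme are assumptions for illustration:

    import hashlib
    import boto3  # assumes AWS credentials are configured

    s3 = boto3.client("s3")

    def save_page(url, html):
        # Key each object by a hash of its URL; the bucket name is hypothetical.
        key = hashlib.sha256(url.encode()).hexdigest() + ".html"
        s3.put_object(Bucket="crawler-pages", Key=key, Body=html)
        return key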
We need to perform two back-of-the-envelope estimations:
Solution 1 (throughput): how many pages per second must the crawler download?
Solution 2 (storage): how much space will the downloaded pages need?
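For illustration, assume we must crawl 1 billion pages per month at an average page size of 500 KB (both figures are assumptions, not given requirements):

    Throughput: 10^9 pages / (30 days × 24 h × 3600 s) ≈ 400 pages per second.
    Storage: 10^9 pages × 500 KB ≈ 500 TB of page data per month.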
Diagram: a simple web crawler. A queue of URLs (the frontier) feeds a crawler process that downloads each page, saves it, extracts the links it contains, and pushes them back into the queue.
Which algorithm to use for scanning the pages?
Breadth-first search (BFS). The frontier is a FIFO queue, so pages are crawled level by level; depth-first search is avoided because it can descend endlessly into a single site's link chain.
The same URL might be added to the queue twice, so its page would be downloaded twice. How do we prevent this?
Keep a record of every URL already seen and check it before enqueueing: an in-memory set for a small crawl, or the URL table (or a space-efficient Bloom filter) at scale. The sketch below combines the BFS frontier with such a seen set.
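A minimal sketch of that loop, reusing fetch_page from the earlier sketch; the LinkExtractor helper, the seed argument, and the 100-page limit are illustrative, not part of the original design:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        # Collects the href target of every <a> tag it sees.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def extract_urls(base_url, html):
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the page they came from.
        return [urljoin(base_url, link) for link in parser.links]

    def crawl(seed, limit=100):
        seen = {seed}             # at scale: the URL table or a Bloom filter
        frontier = deque([seed])  # FIFO queue gives breadth-first order
        while frontier and limit > 0:
            url = frontier.popleft()
            html = fetch_page(url).decode("utf-8", errors="replace")
            limit -= 1
            for link in extract_urls(url, html):
                if link not in seen:  # skip URLs we have already queued
                    seen.add(link)
                    frontier.append(link)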
Single responsibility principle
Our crawler is currently doing two things: downloading pages and parsing them for new URLs. Following the single responsibility principle, we split it into two services, a downloader and an extractor/parser, connected by a message queue so each can fail, scale, and be deployed independently, as sketched below.
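A local stand-in for that split, reusing fetch_page and extract_urls from the earlier sketches; queue.Queue plays the role of a real message broker such as Kafka or SQS, which is an assumption, since no specific broker is named here:

    import queue
    import threading

    url_queue = queue.Queue()   # the URL frontier, waiting to be downloaded
    page_queue = queue.Queue()  # downloaded pages, waiting to be parsed

    def downloader():
        # Responsibility 1: fetch pages and hand them off; nothing else.
        while True:
            url = url_queue.get()
            html = fetch_page(url).decode("utf-8", errors="replace")
            page_queue.put((url, html))  # real design: save to the object store first
            url_queue.task_done()

    def extractor():
        # Responsibility 2: parse pages and feed new URLs back into the frontier.
        while True:
            url, html = page_queue.get()
            for link in extract_urls(url, html):
                url_queue.put(link)  # de-duplication (seen set) omitted for brevity
            page_queue.task_done()

    threading.Thread(target=downloader, daemon=True).start()
    threading.Thread(target=extractor, daemon=True).start()
    url_queue.put("https://example.com")  # illustrative seed URL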
How to make the crawler polite?
Respect each site's robots.txt, and rate-limit requests so that no single host is hit too frequently (for example, at most one request per host per second). Both checks are sketched below.
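A sketch using the standard library's robotparser; the user agent string and the one-second delay are assumed values:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "MyCrawler/1.0"  # hypothetical crawler name
    CRAWL_DELAY = 1.0             # assumed minimum seconds between hits to one host

    robots_cache = {}  # host -> parsed robots.txt
    last_hit = {}      # host -> timestamp of our last request

    def allowed(url):
        # Check robots.txt, fetching and caching it once per host.
        host = urlparse(url).netloc
        if host not in robots_cache:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()
            robots_cache[host] = rp
        return robots_cache[host].can_fetch(USER_AGENT, url)

    def wait_for_turn(url):
        # Sleep until CRAWL_DELAY has passed since we last hit this host.
        host = urlparse(url).netloc
        elapsed = time.time() - last_hit.get(host, 0.0)
        if elapsed < CRAWL_DELAY:
            time.sleep(CRAWL_DELAY - elapsed)
        last_hit[host] = time.time()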
Let us revisit our non-functional requirements: scalability comes from horizontally scaling the downloader and extractor services, politeness from the robots.txt and per-host rate-limit checks, and robustness from the message queue, which lets a failed worker be replaced without losing crawl state.