Web Crawler

Medium Data Pipeline

Overview

Designing a Web Crawler (like Googlebot) tests your understanding of distributed crawling, politeness policies, URL deduplication, and handling the scale and unpredictability of the web. Search companies and content aggregators rely on crawlers, and the question reveals how you think about rate limiting, robots.txt, duplicate detection, and frontier management. This design matters in interviews because it combines BFS/DFS graph traversal with distributed systems—URLs as nodes, links as edges—and requires careful handling of politeness (don't overwhelm servers), prioritization (important pages first), and fault tolerance (billions of URLs, many failures).

Requirements

Functional

  • Fetch web pages given seed URLs
  • Extract and follow links from pages
  • Respect robots.txt and crawl-delay
  • Deduplicate URLs (avoid crawling the same page twice)
  • Store raw HTML and extracted metadata
  • Support prioritization (e.g., sitemap first)

Non-Functional

  • Politeness — rate limit per domain
  • Scalability — billions of pages
  • Fault tolerance — handle timeouts, 404s, redirects
  • Freshness — recrawl periodically

Capacity Estimation

Assume 1B pages to crawl at 1,000 pages/sec, so a full pass takes roughly 11.6 days. The frontier holds up to 1B URLs. Bloom filter for dedup (1B entries at ~1% false positives): ~1.2GB. Storage at 100KB/page: ~100TB of raw HTML.
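These numbers can be sanity-checked with quick arithmetic; the Bloom filter sizing assumes a ~1% false-positive rate, a common choice:

```python
import math

PAGES = 1_000_000_000          # 1B pages
RATE = 1_000                   # pages/sec
PAGE_SIZE = 100_000            # 100KB average page

crawl_days = PAGES / RATE / 86_400
storage_tb = PAGES * PAGE_SIZE / 1e12

# Bloom filter: bits per element = -ln(p) / ln(2)^2  (~9.6 at p = 1%)
bits = PAGES * (-math.log(0.01) / math.log(2) ** 2)
bloom_gb = bits / 8 / 1e9

print(f"{crawl_days:.1f} days, {storage_tb:.0f}TB, {bloom_gb:.1f}GB")
# -> 11.6 days, 100TB, 1.2GB
```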

Architecture Diagram

Seed URLs → URL Frontier → Fetcher Pool (with DNS Resolver) → HTML Parser. Parsed content goes to the Content Store; extracted links pass through the Bloom Filter and URL Store for dedup before re-entering the URL Frontier.

Component Deep Dive

URL Frontier

A priority queue of URLs to crawl, partitioned into per-domain queues so a politeness delay can be enforced between successive requests to the same domain.
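A minimal sketch of the per-domain queueing (single-process and in-memory; a production frontier would be sharded and persistent):

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    """Toy frontier: one FIFO queue per domain, plus a min-heap that
    releases a domain only after its politeness delay has elapsed."""

    def __init__(self, delay=1.0):
        self.delay = delay        # seconds between hits to the same domain
        self.queues = {}          # domain -> deque of URLs
        self.ready = []           # min-heap of (next_allowed_ts, domain)

    def add(self, url, now=None):
        now = time.time() if now is None else now
        domain = urlsplit(url).netloc
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.ready, (now, domain))
        self.queues[domain].append(url)

    def next_url(self, now=None):
        """Return a URL whose domain is past its delay, or None."""
        now = time.time() if now is None else now
        if not self.ready or self.ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready)
        url = self.queues[domain].popleft()
        if self.queues[domain]:   # re-arm the domain after the delay
            heapq.heappush(self.ready, (now + self.delay, domain))
        else:
            del self.queues[domain]
        return url
```

The explicit `now` parameter is only there so the delay can be exercised with a fake clock instead of real sleeps.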

URL Filter / Dedup

A Bloom filter or distributed set that checks whether a URL was already crawled. URLs are normalized first: fragments removed, scheme and host lowercased, default ports and trailing slashes stripped.
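A sketch of the normalization step plus a toy Bloom filter (a Python int serves as the bit set here; a real deployment would use a properly sized bit array in Redis or similar):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL before the dedup check: lowercase scheme and
    host (paths stay case-sensitive), drop fragments, default port,
    and trailing slash."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if parts.scheme.lower() == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))

class BloomFilter:
    """Minimal Bloom filter: k bit positions derived from one SHA-256."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.m, self.k, self.bits = num_bits, num_hashes, 0

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            # slice 4 bytes of the digest per hash function
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```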

Fetcher

A pool of HTTP clients that fetches pages, follows redirects, and enforces timeouts. Respects robots.txt rules, including crawl-delay.
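A sketch using the standard library's robots.txt parser; the user-agent string and the per-host cache policy are illustrative assumptions:

```python
import urllib.parse
import urllib.request
import urllib.robotparser

class Fetcher:
    """Caches robots.txt per host and checks can_fetch before requesting."""

    USER_AGENT = "example-crawler/0.1"   # hypothetical UA string

    def __init__(self, timeout=10):
        self.timeout = timeout
        self.robots = {}                 # host -> RobotFileParser

    def allowed(self, url):
        host = urllib.parse.urlsplit(url).netloc
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp.allow_all = True      # unreachable robots.txt: allow
            self.robots[host] = rp
        return self.robots[host].can_fetch(self.USER_AGENT, url)

    def fetch(self, url):
        if not self.allowed(url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": self.USER_AGENT})
        with urllib.request.urlopen(req, timeout=self.timeout) as resp:
            return resp.read()           # urlopen follows redirects itself
```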

Parser

Extracts links, metadata, content. Outputs new URLs to frontier, content to storage.
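Link extraction can be sketched with the standard library's HTML parser; relative hrefs are resolved against the page URL before being queued:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute hrefs from <a> tags; in the pipeline, these go
    back to the frontier while page content goes to the content store."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

p = LinkExtractor("http://example.com/dir/page.html")
p.feed('<a href="/about">About</a> <a href="next.html">Next</a>')
print(p.links)
# -> ['http://example.com/about', 'http://example.com/dir/next.html']
```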

URL Store

Persistent storage for crawled URLs and metadata. Used for dedup and recrawl scheduling.

Content Store

Stores raw HTML or extracted text. Distributed file store (S3, HDFS).

Database Design

URL metadata in Cassandra/DynamoDB: url (PK), status, last_crawled, next_crawl. Content in object store. Bloom filter in Redis for fast dedup check.
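The row shape and recrawl scheduling can be sketched as follows; field names come from the schema above, while the 7-day interval is an assumed default:

```python
import time

def make_record(url, status, crawl_interval=7 * 86_400):
    """Build one URL-metadata row; next_crawl drives recrawl scheduling."""
    now = int(time.time())
    return {
        "url": url,                 # partition key
        "status": status,           # e.g. 200, 404, "redirect"
        "last_crawled": now,
        "next_crawl": now + crawl_interval,
    }

row = make_record("http://example.com/", 200)
```

A scheduler would periodically scan for rows with `next_crawl` in the past and push those URLs back into the frontier.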

API Design

Method  Path                    Description
POST    /api/crawl              Submit seed URLs. Body: {urls[], priority?}. Returns job_id.
GET     /api/crawl/{job_id}     Get crawl job status and stats.
GET     /api/pages/{url_hash}   Retrieve stored page content (internal).

Scalability & Trade-offs
