Build a Web Scraper
A production-grade scraping pipeline: headless browser, queues, retries, structured extraction. Use the LLM-friendly tools when the output goes to RAG; the classic ones when you need raw throughput.
A scraper that crawls thousands of pages reliably, stores clean output, and feeds your AI agent.
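To make the "queues, retries" part of that pipeline concrete, here is a minimal sketch using only the Python standard library. It is not how any tool listed below implements it; the function name `crawl`, the injected `fetch` callable, and the backoff parameters are all assumptions made for illustration.

```python
import time
from collections import deque
from typing import Callable, Dict, Iterable, Optional


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[str]]:
    """Drain a FIFO queue of URLs, retrying failed fetches with exponential backoff.

    Returns a mapping of URL -> page body, or None if every attempt failed.
    `fetch` is a hypothetical callable supplied by the caller (HTTP client,
    headless browser, and so on).
    """
    queue = deque(seeds)
    seen = set()
    results: Dict[str, Optional[str]] = {}

    while queue:
        url = queue.popleft()
        if url in seen:  # skip duplicates so each URL is fetched once
            continue
        seen.add(url)

        results[url] = None
        for attempt in range(max_retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == max_retries:
                    break  # give up; the URL stays mapped to None
                # wait base_delay, then 2x, then 4x, ... before retrying
                sleep(base_delay * (2 ** attempt))
    return results
```

Passing `fetch` and `sleep` in as parameters keeps the loop testable without network access; a production crawler would typically add persistence and concurrency on top of this.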
LLM-grade scrapers
2 repos. One API that returns clean markdown ready for RAG. Self-host or use their cloud: same SDK.
The 2026 default for LLM-grade scraping. Fires Playwright behind the scenes, returns clean markdown ready to feed into RAG. Self-host for fr…
Python alternative when you need strict JSON-schema extraction with an embedded LLM.
Python-native LLM-friendly crawler. Strong at extracting structured data (JSON schemas) from messy HTML using an embedded LLM. Heavier setup…
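As a rough illustration of what "strict" extraction means in practice, the sketch below validates a model's JSON output against a fixed set of fields and types before it enters the pipeline. It is a simplified stand-in, not full JSON Schema and not the API of either tool above; `PRODUCT_SCHEMA` and `parse_strict` are hypothetical names.

```python
import json
from typing import Any, Dict

# Hypothetical schema for a product page: field name -> required Python type.
PRODUCT_SCHEMA: Dict[str, type] = {"title": str, "price": float, "in_stock": bool}


def parse_strict(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse model output as JSON and reject anything that deviates from the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = schema.keys() - data.keys()
    extra = data.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")

    for field, expected in schema.items():
        value = data[field]
        # bool is a subclass of int in Python, so compare exact types
        if type(value) is not expected:
            raise ValueError(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return data
```

Note that exact type matching is deliberately unforgiving: a price emitted as the integer `19` would be rejected here. Whether to coerce or reject such values is a design choice that depends on how much you trust the extractor.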
Production crawlers
3 repos. Node-native. Queues, retries, proxy rotation. The default for serious TS scraping.
Node-native crawler from the Apify team. Built-in queues, retries, proxy rotation, headless browser pool — production patterns out of the bo…
Python's veteran. Battle-tested middleware + pipeline architecture.
The Python scraping veteran. Mature ecosystem, plugins for everything (caching, proxies, middlewares), and a years-honed pipeline architectu…
Go option. Single binary, low memory, runs on a ₹500 VPS.
Go's answer to Scrapy. Built-in rate limiting, caching, parallelism, and storage backends. Compiles to a single binary which makes deploymen…
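The rate limiting these crawlers ship with can be approximated by a per-host minimum delay. The following is a single-threaded sketch under that assumption; `DomainRateLimiter` is a hypothetical class written for this page, not the interface of Crawlee, Scrapy, or Colly.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        # host -> earliest time at which the next request is allowed
        self._next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's host may be hit again; return seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay:
            self._sleep(delay)
        # schedule the next slot relative to when this request actually goes out
        self._next_allowed[host] = now + delay + self.min_interval
        return delay
```

Call `limiter.wait(url)` immediately before each fetch. Because the clock and sleep functions are injectable, the behaviour can be checked with a fake clock; a concurrent crawler would additionally need a lock around the shared dictionary.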
Browser automation
2 repos. JS-heavy sites, anti-bot bypass, multi-browser. Also your E2E test runner.
End-to-end browser testing. ~65k stars. Beats Cypress on speed, parallelism, and multi-browser support. Default for E2E testing in 2026.…
Chrome-only, thinner abstraction. Use when Playwright is too much.
Chrome-only browser automation from Google. Slightly more raw than Playwright with fewer batteries included, but lighter weight and battle-t…
- Plain HTML parsing (1 repo)
- Storage + queue (1 repo)
- Feed to your AI agent (2 repos)
The 4-step AI workflow
The AI agents are good at code. They're bad at deciding what stack to use. This bundle does the second part. You bring the agent.
- 1. Ideate with ChatGPT or Claude.ai (web). Paste your idea: “I'm building a web scraper. Help me sharpen the product spec: features, edge cases, MVP scope.” Iterate for 10-15 minutes until you have a clear one-page brief.
- 2. Pick your coding agent. For this kind of bundle, we recommend Claude Code; Sonnet 4.6/4.7 handles full-stack multi-file reasoning best. See the install guide. Cursor and Codex are also great; pick the one you already pay for.
- 3. Feed this bundle to the agent. Open Claude Code / Cursor / Codex in an empty folder, then paste:
  I'm building a web scraper. Use this bundle as the source of truth for the stack: https://stackpicks.dev/build/web-scraper
  Brief from my product spec: [paste your brief from step 1]
  Follow the bundle order strictly:
  1. LLM-grade scrapers
  2. Production crawlers
  3. Browser automation
  4. Plain HTML parsing
  ...
  Stop and confirm with me after each layer.
- 4. Wire one layer at a time, commit between each. Don't let the agent install everything before the first git commit. One layer = one commit. This catches drift early and makes rollback easy.
Beyond the bundle
- 1. Ship the boring version first. The bundle above is the maximalist list. For an MVP, start with 60% of these and add the rest when real users ask.
- 2. Deploy early. Push to Railway / Vercel after layer 02 (production crawlers), not after the last layer. Production breaks differently than localhost.
- 3. Read CLAUDE.md / .cursor/rules in this repo for the project conventions your AI agent should follow.
- 4. Iterate on the take. If a repo here doesn't fit your specific use case, tell us via the contact link and we'll add a better one within 60 minutes.