When to use
- Extracting content from websites
- Handling paywalls and anti-bot measures
- Implementing scraping cascades with fallbacks
- Processing social media (YouTube, Instagram, TikTok)
- Finding and using undocumented APIs
What's included
Scraping cascade
Three-tier fallback: Trafilatura (fast article extraction), then Requests (plain HTTP), then Playwright (JavaScript rendering with stealth).
Poison pill detection
Detect paywalls, CAPTCHAs, rate limits, Cloudflare, and login walls with pattern matching.
Undocumented APIs
Find and use hidden APIs via browser dev tools, with examples for autocomplete endpoints.
Social media tools
yt-dlp for YouTube/TikTok, instaloader for Instagram, with metadata extraction and download patterns.
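As a rough illustration of the metadata extraction pattern, the sketch below uses yt-dlp's Python API to read video metadata without downloading media. The `FIELDS` tuple and the helper names `summarize_info` and `fetch_metadata` are illustrative assumptions, not part of the skill itself, and the set of fields a site actually returns varies by extractor.

```python
from typing import Any, Dict

# Assumed subset of yt-dlp info-dict keys worth keeping; adjust to taste.
FIELDS = ("id", "title", "uploader", "upload_date", "duration", "view_count", "webpage_url")


def summarize_info(info: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a full yt-dlp info dict to a small, predictable record.

    Keys the extractor did not supply come back as None.
    """
    return {key: info.get(key) for key in FIELDS}


def fetch_metadata(url: str) -> Dict[str, Any]:
    """Fetch metadata for one video URL without downloading the media."""
    import yt_dlp  # imported lazily so the pure helper above works without it

    opts = {"quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return summarize_info(info)
```

Keeping the reduction step separate from the network call makes it easy to test against saved info dicts, and the same record shape could be filled from other sources.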
Scraping cascade architecture
Try extraction strategies in order, from lightest to heaviest, falling back automatically when one fails:
Trafilatura
Lightweight extraction for standard articles. Best for news sites and blogs.
Requests + BeautifulSoup
HTTP requests with rotating user agents. Good for static content.
Playwright with stealth
Full JavaScript rendering with anti-bot bypass. For SPAs and protected sites.
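The three tiers above can be sketched as an ordered list of extractor functions tried in turn. This is a minimal sketch under stated assumptions, not the skill's actual implementation: the function names, the `MIN_CHARS` threshold, and the user-agent strings are hypothetical, and the Playwright tier uses a plain headless browser with no stealth plugin applied.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical user-agent pool; replace with strings you maintain yourself.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]
MIN_CHARS = 200  # assumed threshold: shorter results count as a miss


def tier_trafilatura(url: str) -> Optional[str]:
    import trafilatura  # lazy import: a missing package only disables this tier
    downloaded = trafilatura.fetch_url(url)
    return trafilatura.extract(downloaded) if downloaded else None


def tier_requests(url: str) -> Optional[str]:
    import requests
    from bs4 import BeautifulSoup
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate per request
    resp = requests.get(url, headers=headers, timeout=15)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser").get_text(separator=" ", strip=True)


def tier_playwright(url: str) -> Optional[str]:
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")  # let client-side JS finish
            return page.inner_text("body")
        finally:
            browser.close()


Tier = Tuple[str, Callable[[str], Optional[str]]]
DEFAULT_TIERS: List[Tier] = [
    ("trafilatura", tier_trafilatura),
    ("requests", tier_requests),
    ("playwright", tier_playwright),
]


def scrape(url: str, tiers: List[Tier] = DEFAULT_TIERS, min_chars: int = MIN_CHARS) -> Dict:
    """Try each tier in order; return the first result with enough text."""
    errors: Dict[str, str] = {}
    for name, extract in tiers:
        try:
            text = extract(url)
        except Exception as exc:  # any failure just moves on to the next tier
            errors[name] = repr(exc)
            continue
        if text and len(text.strip()) >= min_chars:
            return {"tier": name, "text": text, "errors": errors}
        errors[name] = "too little text"
    return {"tier": None, "text": None, "errors": errors}
```

Calling `scrape("https://example.com/article")` returns a dict naming the tier that succeeded, plus the errors recorded for any tiers that failed before it. Passing the tiers as a list keeps the ordering explicit and lets a caller drop or reorder tiers.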
Poison pill types
| Type | Detection patterns |
|---|---|
| Paywall | "subscribe to continue", "you've reached your limit" |
| CAPTCHA | "verify you are human", "robot verification" |
| Rate limit | "too many requests", HTTP 429 |
| Cloudflare | "checking your browser", "ddos protection" |
| Login required | "sign in to continue", "create an account" |
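The table above maps naturally onto a small pattern-matching helper. The sketch below is one possible shape, assuming case-insensitive substring matching on the page text plus an optional HTTP status code; the names `POISON_PATTERNS` and `detect_poison_pill` are illustrative, and the phrases are taken directly from the table.

```python
import re
from typing import Optional

# Phrases copied from the table above; the dict keys are assumed labels.
POISON_PATTERNS = {
    "paywall": ["subscribe to continue", "you've reached your limit"],
    "captcha": ["verify you are human", "robot verification"],
    "rate_limit": ["too many requests"],
    "cloudflare": ["checking your browser", "ddos protection"],
    "login_required": ["sign in to continue", "create an account"],
}

# One case-insensitive alternation per type, with phrases escaped literally.
_COMPILED = {
    kind: re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    for kind, phrases in POISON_PATTERNS.items()
}


def detect_poison_pill(text: str, status_code: Optional[int] = None) -> Optional[str]:
    """Return the first matching poison pill type, or None if the page looks clean."""
    if status_code == 429:  # HTTP 429 Too Many Requests
        return "rate_limit"
    # Normalize curly apostrophes so "you’ve" matches "you've".
    normalized = text.replace("\u2019", "'")
    for kind, pattern in _COMPILED.items():
        if pattern.search(normalized):
            return kind
    return None
```

Types are checked in the order they appear in the dict, so a page matching several patterns reports the first one. Phrases such as "create an account" also appear on ordinary pages, so a match is probably better treated as a signal to try the next tier than as proof that the content is blocked.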
Installation
# Recommended: install the dev-toolkit plugin
/plugin marketplace add jamditis/claude-skills-journalism
/plugin install dev-toolkit@claude-skills-journalism
# Or copy just this skill from the plugin tree
git clone https://github.com/jamditis/claude-skills-journalism.git
cp -r claude-skills-journalism/dev-toolkit/skills/web-scraping ~/.claude/skills/
Or browse this skill in the GitHub repository.
Extract what you need, ethically
Cascade architecture, poison pill detection, and social media tools in one skill.