AI Search and Retrieval for Web Crawlers
With mcp-server-webcrawl, your AI client filters and analyzes web content under your direction or autonomously.
Support for multiple crawlers and formats, including wget, WARC, InterroBot, Katana, and SiteOne, is baked in.
The server includes a full-text search interface with boolean support, filtering by type, HTTP status, and more.
Main Features
- Claude Desktop ready
- Multi-crawler compatible
- Filter by type, status, and more
- Boolean search support
- Support for Markdown and snippets
- Roll your own website knowledgebase
Getting Started
mcp-server-webcrawl is free and open source, and requires Claude Desktop and Python (>=3.10). It is installed on the command line via pip:
pip install mcp-server-webcrawl
Setup videos are available for each supported crawler, showing how to connect your crawl data to your LLM.
If you prefer text, step-by-step guides are available in the docs.
MCP Configuration
# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# from /path/to/wget/archives/ as current working directory
# --adjust-extension for file extensions, e.g. *.html
$ wget --mirror https://example.com
$ wget --mirror https://example.com --adjust-extension
# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# from /path/to/warc/archives/ as current working directory
$ wget --warc-file=example --recursive https://example.com
$ wget --warc-file=example --recursive --page-requisites https://example.com
# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "interrobot", "--datasrc", "[homedir]/Documents/InterroBot/interrobot.v2.db"]
    }
  }
}

# crawls executed in InterroBot (windowed)
# Windows: replace [homedir] with /Users/...
# macOS: path provided on InterroBot settings page
# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"]
    }
  }
}

# tested configurations (macOS Terminal/PowerShell/WSL)
# -store-response to save crawl contents
# -store-response-dir allows for expansion of hosts,
# consistent with default Katana behavior to
# spread assets across host directories
$ katana -u https://example.com -store-response -store-response-dir /path/to/katana/crawls/example.com/
# Windows: command set to "mcp-server-webcrawl"
# macOS: command set to absolute path, i.e.
# the value of $ which mcp-server-webcrawl
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "siteone", "--datasrc", "/path/to/siteone/archives/"]
    }
  }
}

# crawls executed in SiteOne (windowed)
# *Generate offline website* must be checked
From Claude Desktop's developer settings, open the MCP configuration file in a text editor and modify the appropriate example above to reflect your --datasrc path.
You can set up additional mcp-server-webcrawl connections under mcpServers as needed.
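As a sketch, a configuration combining two crawlers might look like the following; the server names (webcrawl-wget, webcrawl-katana) are arbitrary keys chosen for this example, and the paths are placeholders to replace with your own:

```json
{
  "mcpServers": {
    "webcrawl-wget": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    },
    "webcrawl-katana": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"]
    }
  }
}
```

Each entry runs as its own connection, so your client can search both data sources side by side.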
For additional technical information, including crawler feature support, see the help documentation.