AI Search and Retrieval for Web Crawlers

With mcp-server-webcrawl, your AI client filters and analyzes web content under your direction or autonomously.

Support for multiple crawlers is baked in, including WARC, wget, InterroBot, Katana, and SiteOne.

The server includes a full-text search interface with boolean support and filtering by type, HTTP status, and more.

Main Features

  • Claude Desktop ready
  • Multi-crawler compatible
  • Filter by type, status, and more
  • Boolean search support
  • Support for Markdown and snippets
  • Roll your own website knowledge base

Getting Started

mcp-server-webcrawl is free and open source, and requires Claude Desktop and Python (>=3.10). It is installed from the command line via pip:

pip install mcp-server-webcrawl

Setup videos are available for each supported crawler, showing how to connect your crawl data to your LLM.

If you prefer text-only, step-by-step guides are available in the docs.

MCP Configuration

# Windows: set command to "mcp-server-webcrawl"
# macOS: set command to the absolute path, i.e.
# the output of `which mcp-server-webcrawl`
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc",
        "/path/to/wget/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# from /path/to/wget/archives/ as the current working directory
# --adjust-extension adds file extensions, e.g. *.html
$ wget --mirror https://example.com
$ wget --mirror https://example.com --adjust-extension
# Windows: set command to "mcp-server-webcrawl"
# macOS: set command to the absolute path, i.e.
# the output of `which mcp-server-webcrawl`
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc",
        "/path/to/warc/archives/"]
    }
  }
}

# tested configurations (macOS Terminal/Windows WSL)
# from /path/to/warc/archives/ as the current working directory
$ wget --warc-file=example --recursive https://example.com
$ wget --warc-file=example --recursive --page-requisites https://example.com
# Windows: set command to "mcp-server-webcrawl"
# macOS: set command to the absolute path, i.e.
# the output of `which mcp-server-webcrawl`
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "interrobot", "--datasrc",
        "[homedir]/Documents/InterroBot/interrobot.v2.db"]
    }
  }
}

# crawls executed in InterroBot (windowed)
# Windows: replace [homedir] with /Users/...
# macOS: path provided on InterroBot settings page
# Windows: set command to "mcp-server-webcrawl"
# macOS: set command to the absolute path, i.e.
# the output of `which mcp-server-webcrawl`
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "katana", "--datasrc",
        "/path/to/katana/crawls/"]
    }
  }
}

# tested configurations (macOS Terminal/PowerShell/WSL)
# -store-response saves crawl contents
# -store-response-dir allows for expansion of hosts,
#   consistent with default Katana behavior of spreading
#   assets across host directories
$ katana -u https://example.com -store-response -store-response-dir /path/to/katana/crawls/example.com/
# Windows: set command to "mcp-server-webcrawl"
# macOS: set command to the absolute path, i.e.
# the output of `which mcp-server-webcrawl`
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "siteone", "--datasrc",
        "/path/to/siteone/archives/"]
    }
  }
}

# crawls executed in SiteOne (windowed)
# *Generate offline website* must be checked
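The five configurations above share a single shape, varying only in crawler type and data source. They can be sketched in Python (make_webcrawl_config is a hypothetical helper for illustration, not part of the package):

```python
import json
import shutil

def make_webcrawl_config(crawler: str, datasrc: str) -> dict:
    """Build an mcpServers entry for mcp-server-webcrawl.

    On macOS the command must be the absolute path to the script
    (the output of `which mcp-server-webcrawl`); on Windows the bare
    command name suffices. Falls back to the bare name if not found.
    """
    command = shutil.which("mcp-server-webcrawl") or "mcp-server-webcrawl"
    return {
        "mcpServers": {
            "webcrawl": {
                "command": command,
                "args": ["--crawler", crawler, "--datasrc", datasrc],
            }
        }
    }

print(json.dumps(make_webcrawl_config("wget", "/path/to/wget/archives/"), indent=2))
```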

From Claude's developer settings, find the MCP configuration file. Open it in a text editor and modify the example above to reflect your crawler type and datasrc path.

You can set up additional mcp-server-webcrawl connections under mcpServers as needed.
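For example, a wget archive and a WARC archive can be served side by side under distinct keys (the key names webcrawl-wget and webcrawl-warc here are arbitrary labels, a sketch only):

```json
{
  "mcpServers": {
    "webcrawl-wget": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    },
    "webcrawl-warc": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"]
    }
  }
}
```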

For additional technical information, including crawler feature support, consult the help documentation.

Abstraction of LLM clients (Claude and OpenAI) communicating with a website archive