# Web to Knowledge Base for Open WebUI

A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.

## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process

## Installation

### Prerequisites

- Python 3.10+
- Open WebUI instance with API access

### Dependencies

Install the required packages:

```bash
pip install requests beautifulsoup4 markitdown
```

### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```

## Usage

Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
                    --base-url "https://your-openwebui-instance.com" \
                    --website-url "https://website-to-crawl.com" \
                    --kb-name "My Website Knowledge Base"
```

### Command Line Arguments

| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | - | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL pattern to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | - | Update existing files in the knowledge base | No | False |
| `--skip-existing` | - | Skip existing files in the knowledge base | No | False |
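For orientation, here is a minimal sketch of how these options map onto Python's `argparse`. The flags and defaults mirror the table above; the internal structure of the real parser in `web_to_kb.py` may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI described in the table above; illustrative only.
    p = argparse.ArgumentParser(
        description="Crawl a website into an Open WebUI knowledge base")
    p.add_argument("--token", "-t", required=True, help="Open WebUI API token")
    p.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
    p.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
    p.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
    p.add_argument("--kb-purpose", "-p", default=None, help="Purpose description")
    p.add_argument("--depth", "-d", type=int, default=2, help="Maximum crawl depth")
    p.add_argument("--delay", type=float, default=1.0, help="Delay between requests (seconds)")
    p.add_argument("--exclude", "-e", action="append", default=None,
                   help="URL pattern to exclude (repeatable)")
    p.add_argument("--include-json", "-j", action="store_true",
                   help="Include JSON files and API endpoints")
    p.add_argument("--update", action="store_true", help="Update existing files")
    p.add_argument("--skip-existing", action="store_true", help="Skip existing files")
    return p
```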

## Examples

### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://docs.example.com" \
                    -n "Example Docs KB" \
                    -d 3
```

### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://blog.example.com" \
                    -n "Example Blog KB" \
                    -e "/tags/" \
                    -e "/author/" \
                    -e "/search/"
```

### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://api-docs.example.com" \
                    -n "Example API Documentation" \
                    -j
```

### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://knowledge-center.example.com" \
                    -n "Knowledge Center" \
                    --update
```

### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://docs.example.com" \
                    -n "Documentation KB" \
                    --skip-existing
```

## How It Works

1. **Website Crawling**: The script starts from the specified website URL and follows links up to the specified depth, staying within the starting domain (see the crawl sketch after this list).
2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown
   - JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management** (see the API sketch after this list):
   - Checks whether a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists
4. **File Upload**:
   - Handles existing files according to the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base
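A condensed sketch of steps 1 and 2, using the same libraries the script depends on (`requests`, BeautifulSoup, MarkItDown). Function and variable names here are illustrative assumptions, not the script's actual internals, and JSON handling is omitted:

```python
import os
import tempfile
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markitdown import MarkItDown

def crawl(start_url, depth=2, delay=1.0, exclude=None):
    """Breadth-first crawl within start_url's domain; returns {url: markdown}."""
    exclude = exclude or []
    domain = urlparse(start_url).netloc
    converter = MarkItDown()
    seen, pages = {start_url}, {}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        resp = requests.get(url, timeout=30)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # JSON handling (--include-json) is left out of this sketch
        # MarkItDown converts files, so stage the HTML in a temporary file.
        with tempfile.NamedTemporaryFile("w", suffix=".html",
                                         delete=False, encoding="utf-8") as tmp:
            tmp.write(resp.text)
        pages[url] = converter.convert(tmp.name).text_content
        os.unlink(tmp.name)
        if level < depth:
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                nxt = urljoin(url, a["href"]).split("#")[0]
                # Stay on the starting domain, skip seen URLs and excluded patterns.
                if (urlparse(nxt).netloc == domain and nxt not in seen
                        and not any(pat in nxt for pat in exclude)):
                    seen.add(nxt)
                    queue.append((nxt, level + 1))
        time.sleep(delay)  # be polite to the target site
    return pages
```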
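And a sketch of steps 3 and 4 against the Open WebUI REST API. The endpoint paths follow Open WebUI's published API documentation at the time of writing; treat them as assumptions and verify them against your instance's version:

```python
import requests

def get_or_create_kb(base_url, token, name, purpose=None):
    """Return the id of the knowledge base called `name`, creating it if needed."""
    headers = {"Authorization": f"Bearer {token}"}
    # List existing knowledge bases (path per Open WebUI API docs; verify for your version).
    existing = requests.get(f"{base_url}/api/v1/knowledge/", headers=headers).json()
    for kb in existing:
        if kb.get("name") == name:
            return kb["id"]
    created = requests.post(
        f"{base_url}/api/v1/knowledge/create",
        headers=headers,
        json={"name": name, "description": purpose or ""},
    ).json()
    return created["id"]

def upload_to_kb(base_url, token, kb_id, filename, content):
    """Upload one Markdown document and attach it to the knowledge base."""
    headers = {"Authorization": f"Bearer {token}"}
    # Step 1: upload the file itself.
    upload = requests.post(
        f"{base_url}/api/v1/files/",
        headers=headers,
        files={"file": (filename, content.encode("utf-8"), "text/markdown")},
    ).json()
    # Step 2: register the uploaded file with the knowledge base.
    requests.post(
        f"{base_url}/api/v1/knowledge/{kb_id}/file/add",
        headers=headers,
        json={"file_id": upload["id"]},
    )
```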

## Notes

- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced (see the sketch below)
- Use `--delay` to space out requests and be respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
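As a hedged illustration of the filename scheme described above, one plausible URL-to-filename mapping looks like this; the script's exact replacement rules may differ:

```python
import re
from urllib.parse import urlparse

def url_to_filename(url, extension=".md"):
    """Turn a URL into a filesystem- and API-safe filename (illustrative)."""
    parsed = urlparse(url)
    # Join host and path, then replace runs of disallowed characters with "_".
    raw = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw) + extension

# e.g. url_to_filename("https://docs.example.com/guide/intro")
# -> "docs.example.com_guide_intro.md"
```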

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments