# Web to Knowledge Base for Open WebUI

A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.

## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process

## Installation

### Prerequisites

- Python 3.10+
- Open WebUI instance with API access

### Dependencies

Install the required packages:

```bash
pip install requests beautifulsoup4 markitdown
```

### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```

## Usage

Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
  --base-url "https://your-openwebui-instance.com" \
  --website-url "https://website-to-crawl.com" \
  --kb-name "My Website Knowledge Base"
```

### Command Line Arguments

| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your OpenWebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your OpenWebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL patterns to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | | Update existing files in the knowledge base | No | False |
| `--skip-existing` | | Skip existing files in the knowledge base | No | False |

## Examples

### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Example Docs KB" \
  -d 3
```

### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://blog.example.com" \
  -n "Example Blog KB" \
  -e "/tags/" \
  -e "/author/" \
  -e "/search/"
```

### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://api-docs.example.com" \
  -n "Example API Documentation" \
  -j
```

### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://knowledge-center.example.com" \
  -n "Knowledge Center" \
  --update
```

### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Documentation KB" \
  --skip-existing
```

## How It Works

1. **Website Crawling**: The script starts crawling from the specified website URL, following links up to the specified depth while staying within the same domain.
2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown
   - JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management**:
   - Checks if a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists
4. **File Upload**:
   - Manages existing files based on the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base

## Notes

- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced
- Add a delay between requests to be respectful of websites' resources
- File updates are performed by uploading a new file and removing the old one

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- [MarkItDown](https://github.com/microsoft/markitdown) for HTML to Markdown conversion [1]
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Requests](https://requests.readthedocs.io/) for HTTP requests
- [Open WebUI](https://github.com/open-webui/open-webui) for the knowledge base API
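
## Appendix: Crawl Logic Sketch

The crawling rules described under "How It Works" and "Notes" (depth limit, same-domain boundary, exclude patterns, URL-derived filenames) can be illustrated with a minimal sketch. This is not the code in `web_to_kb.py`: the function names `in_scope`, `url_to_filename`, and `crawl` are hypothetical, exclude patterns are assumed to be plain substrings, and link extraction is passed in as a `get_links` callable so the logic can be exercised without network access. The `--delay` pause between requests is left out.

```python
import re
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def in_scope(url: str, root: str, excludes: Iterable[str] = ()) -> bool:
    """Return True if url is on the same host as root and matches no exclude pattern."""
    if urlparse(url).netloc != urlparse(root).netloc:
        return False  # external link: outside the domain boundary
    # Assumption: exclude patterns are matched as plain substrings.
    return not any(pattern in url for pattern in excludes)


def url_to_filename(url: str, ext: str = ".md") -> str:
    """Derive a filename from a URL by replacing special characters with underscores."""
    parsed = urlparse(url)
    name = re.sub(r"[^A-Za-z0-9]+", "_", parsed.netloc + parsed.path).strip("_")
    return (name or "index") + ext


def crawl(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    excludes: Iterable[str] = (),
) -> list[str]:
    """Breadth-first crawl from root; returns visited URLs in visit order."""
    excludes = list(excludes)
    seen = {root}
    visited = []
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for href in get_links(url):
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and in_scope(link, root, excludes):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # A tiny in-memory "site" standing in for real HTTP fetches.
    site = {
        "https://docs.example.com/": ["/guide", "/tags/python"],
        "https://docs.example.com/guide": ["/guide/install#top"],
    }
    for page in crawl("https://docs.example.com/", lambda u: site.get(u, []),
                      excludes=["/tags/"]):
        print(page, "->", url_to_filename(page))
```

In a real run, `get_links` would fetch the page with Requests and collect `href` values with BeautifulSoup; keeping it as a parameter separates the traversal rules from the HTTP layer.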