Web to Knowledge Base for Open WebUI
A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.
Features
- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles files already in the knowledge base via the --update or --skip-existing options
- Customizable crawling with exclude patterns
- Detailed logging of the process
Installation
Prerequisites
- Python 3.10+
- Open WebUI instance with API access
Dependencies
Install the required packages:
pip install requests beautifulsoup4 markitdown
Getting the Script
Download the script and make it executable:
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
Usage
Basic usage:
python web_to_kb.py --token "YOUR_API_TOKEN" \
--base-url "https://your-openwebui-instance.com" \
--website-url "https://website-to-crawl.com" \
--kb-name "My Website Knowledge Base"
Command Line Arguments
Argument | Short | Description | Required | Default |
---|---|---|---|---|
--token | -t | Your OpenWebUI API token | Yes | - |
--base-url | -u | Base URL of your OpenWebUI instance | Yes | - |
--website-url | -w | URL of the website to crawl | Yes | - |
--kb-name | -n | Name for the knowledge base | Yes | - |
--kb-purpose | -p | Purpose description for the knowledge base | No | None |
--depth | -d | Maximum depth to crawl | No | 2 |
--delay | | Delay between requests in seconds | No | 1.0 |
--exclude | -e | URL patterns to exclude from crawling (can be specified multiple times) | No | None |
--include-json | -j | Include JSON files and API endpoints | No | False |
--update | | Update existing files in the knowledge base | No | False |
--skip-existing | | Skip existing files in the knowledge base | No | False |
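For readers who want to see how the table above maps onto code, here is a minimal argparse sketch. It is not taken from web_to_kb.py: the function name build_parser and the help strings are illustrative, and using action="append" for --exclude is an assumption based on the note that the flag can be given multiple times.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the argument table (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Crawl a website and upload it to an Open WebUI knowledge base."
    )
    # Required arguments
    parser.add_argument("--token", "-t", required=True, help="Open WebUI API token")
    parser.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
    parser.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
    parser.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
    # Optional arguments with the defaults listed in the table
    parser.add_argument("--kb-purpose", "-p", default=None, help="Purpose description")
    parser.add_argument("--depth", "-d", type=int, default=2, help="Maximum depth to crawl")
    parser.add_argument("--delay", type=float, default=1.0, help="Delay between requests in seconds")
    # Assumption: repeated -e flags accumulate into a list; the default stays None.
    parser.add_argument("--exclude", "-e", action="append", default=None, help="URL pattern to exclude")
    parser.add_argument("--include-json", "-j", action="store_true", help="Include JSON files and API endpoints")
    parser.add_argument("--update", action="store_true", help="Update existing files")
    parser.add_argument("--skip-existing", action="store_true", help="Skip existing files")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
example_args = build_parser().parse_args(
    ["-t", "YOUR_API_TOKEN", "-u", "https://your-openwebui-instance.com",
     "-w", "https://docs.example.com", "-n", "Example Docs KB", "-e", "/tags/", "-e", "/author/"]
)
```

Here example_args.exclude holds both patterns, and the options that were not passed keep the defaults from the table.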
Examples
Basic Crawl with Limited Depth
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://docs.example.com" \
-n "Example Docs KB" \
-d 3
Excluding Certain URL Patterns
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://blog.example.com" \
-n "Example Blog KB" \
-e "/tags/" \
-e "/author/" \
-e "/search/"
Including JSON Content
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://api-docs.example.com" \
-n "Example API Documentation" \
-j
Updating an Existing Knowledge Base
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://knowledge-center.example.com" \
-n "Knowledge Center" \
--update
Skipping Existing Files
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://docs.example.com" \
-n "Documentation KB" \
--skip-existing
How It Works
- Website Crawling: The script starts crawling from the specified website URL, following links up to the specified depth while staying within the same domain.
- Content Processing:
  - HTML content is converted to Markdown using MarkItDown
  - JSON content is preserved in its native format (when --include-json is used)
- Knowledge Base Management:
  - Checks if a knowledge base with the specified name already exists
  - Creates a new knowledge base if none exists
- File Upload:
  - Manages existing files based on the --update or --skip-existing flags
  - Uploads new files to the knowledge base
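The crawling step described above can be sketched as a breadth-first traversal. This is a rough illustration under stated assumptions, not the script's actual code. The function crawl and its fetch_links callable are hypothetical, and exclude patterns are treated as plain substrings. The start page is counted as depth 0, and pages at the maximum depth are visited but their links are not followed. Link extraction is injected as a callable, so the sketch runs without network access. In the real script that role is presumably played by Requests and BeautifulSoup.

```python
import time
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(
    start_url: str,
    max_depth: int,
    fetch_links: Callable[[str], Iterable[str]],
    exclude: Iterable[str] = (),
    delay: float = 0.0,
) -> list[str]:
    """Breadth-first crawl limited by depth, domain, and exclude patterns."""
    domain = urlparse(start_url).netloc
    patterns = list(exclude)
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in fetch_links(url):
            # Resolve relative links and drop any #fragment.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != domain:
                continue  # stay within the starting domain
            if any(pattern in link for pattern in patterns):
                continue  # assumption: patterns are plain substrings
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
        if delay:
            time.sleep(delay)  # pause between requests
    return order


# A tiny in-memory "site" standing in for real HTTP responses.
fake_site = {
    "https://ex.com/": ["/a", "/tags/x", "https://other.com/z", "/b#frag"],
    "https://ex.com/a": ["/c"],
    "https://ex.com/c": ["/d"],
}
visited = crawl("https://ex.com/", 2, lambda u: fake_site.get(u, []), exclude=["/tags/"])
```

In this example the external link and the /tags/ link are dropped, and /d is never reached because it sits one level past the depth limit.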
Notes
- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced
- Use --delay to pause between requests (default 1.0 seconds) so the crawl does not overload the target site
- File updates are performed by uploading a new file and removing the old one
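Two of the notes above, filename generation and update ordering, are easier to follow with a small sketch. Everything here is an assumption made for illustration: the README does not specify the exact replacement rule, the file extension, or what happens when a file exists and neither flag is set. The names url_to_filename and plan_upload are hypothetical and do not come from web_to_kb.py.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url: str, extension: str = ".md") -> str:
    """Turn a URL into a filesystem-safe name (illustrative rule only)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    if parsed.query:
        raw += "_" + parsed.query
    # Replace every run of characters outside [A-Za-z0-9.-] with "_".
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (safe or "index") + extension


def plan_upload(filename: str, existing: set[str], update: bool, skip_existing: bool) -> list[str]:
    """Return the ordered actions for one file."""
    if filename not in existing:
        return ["upload"]
    if skip_existing:
        return ["skip"]  # assumption: skipping wins if both flags are set
    if update:
        # Matches the note above: upload the new copy, then remove the old one.
        return ["upload", "remove_old"]
    # Assumption: the README does not say what happens without either flag.
    return ["upload"]


name = url_to_filename("https://docs.example.com/guide/intro.html?x=1")
actions = plan_upload(name, existing={name}, update=True, skip_existing=False)
```

Uploading before removing means a failed upload leaves the old copy in place, which may be why the update is done in that order.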
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- MarkItDown for HTML to Markdown conversion
- BeautifulSoup for HTML parsing
- Requests for HTTP requests
- Open WebUI for the knowledge base API