# Web to Knowledge Base for Open WebUI
A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.
## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process
## Installation

### Prerequisites

- Python 3.10+
- Open WebUI instance with API access
### Dependencies

Install the required packages (the extras specifier is quoted so it works in all shells):

```bash
pip install requests beautifulsoup4 "markitdown[all]"
```
### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```
## Usage

Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
                    --base-url "https://your-openwebui-instance.com" \
                    --website-url "https://website-to-crawl.com" \
                    --kb-name "My Website Knowledge Base"
```
### Command Line Arguments

| Argument | Short | Description | Required | Default |
|---|---|---|---|---|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | - | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL patterns to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | - | Update existing files in the knowledge base | No | False |
| `--skip-existing` | - | Skip existing files in the knowledge base | No | False |
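
For orientation, here is a minimal sketch of how the flags in the table could be declared with Python's `argparse`. It mirrors the table above but is an illustration, not the script's actual source:

```python
# Illustrative argparse wiring for the documented flags; names and
# defaults mirror the table above, not the script's real source.
import argparse

parser = argparse.ArgumentParser(
    description="Crawl a website and upload its pages to an Open WebUI knowledge base."
)
parser.add_argument("--token", "-t", required=True, help="Your Open WebUI API token")
parser.add_argument("--base-url", "-u", required=True, help="Base URL of your Open WebUI instance")
parser.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
parser.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
parser.add_argument("--kb-purpose", "-p", default=None, help="Purpose description for the knowledge base")
parser.add_argument("--depth", "-d", type=int, default=2, help="Maximum depth to crawl")
parser.add_argument("--delay", type=float, default=1.0, help="Delay between requests in seconds")
parser.add_argument("--exclude", "-e", action="append", default=None, help="URL pattern to exclude (repeatable)")
parser.add_argument("--include-json", "-j", action="store_true", help="Include JSON files and API endpoints")
parser.add_argument("--update", action="store_true", help="Update existing files in the knowledge base")
parser.add_argument("--skip-existing", action="store_true", help="Skip existing files in the knowledge base")

args = parser.parse_args()
```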
## Examples

### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://docs.example.com" \
                    -n "Example Docs KB" \
                    -d 3
```

### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://blog.example.com" \
                    -n "Example Blog KB" \
                    -e "/tags/" \
                    -e "/author/" \
                    -e "/search/"
```

### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://api-docs.example.com" \
                    -n "Example API Documentation" \
                    -j
```

### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://knowledge-center.example.com" \
                    -n "Knowledge Center" \
                    --update
```

### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://docs.example.com" \
                    -n "Documentation KB" \
                    --skip-existing
```
## How It Works

1. **Website Crawling**: The script starts at the specified website URL and follows links up to the specified depth, staying within the same domain.
2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown
   - JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management**:
   - Checks whether a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists
4. **File Upload**:
   - Manages existing files according to the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base (see the sketch below)
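
The sketch below illustrates this flow. The helper names (`crawl`, `page_to_markdown`, `get_or_create_kb`, `upload_markdown`) are hypothetical, and while the endpoint paths follow Open WebUI's documented files/knowledge API, they should be verified against your instance's version:

```python
# Minimal sketch of the crawl -> convert -> upload flow. Helper names are
# illustrative; the Open WebUI endpoints mirror its documented knowledge/files
# API but may differ across versions, so verify against your instance.
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markitdown import MarkItDown


def crawl(start_url, max_depth=2, delay=1.0, exclude=()):
    """Breadth-first crawl that never leaves the start URL's domain."""
    domain = urlparse(start_url).netloc
    seen, found = {start_url}, []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        response = requests.get(url, timeout=30)  # fetched only to discover links
        found.append(url)
        if depth < max_depth:
            for a in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if (urlparse(link).netloc == domain        # respect domain boundary
                        and link not in seen
                        and not any(p in link for p in exclude)):
                    seen.add(link)
                    queue.append((link, depth + 1))
        time.sleep(delay)  # be polite to the target server
    return found


def page_to_markdown(url):
    """Convert one page to Markdown; MarkItDown can fetch URLs directly."""
    return MarkItDown().convert(url).text_content


def get_or_create_kb(base_url, token, name, purpose=None):
    """Return the id of the named knowledge base, creating it if missing."""
    headers = {"Authorization": f"Bearer {token}"}
    existing = requests.get(f"{base_url}/api/v1/knowledge/list", headers=headers).json()
    for kb in existing:
        if kb.get("name") == name:
            return kb["id"]
    created = requests.post(
        f"{base_url}/api/v1/knowledge/create",
        headers=headers,
        json={"name": name, "description": purpose or ""},
    )
    return created.json()["id"]


def upload_markdown(base_url, token, kb_id, filename, markdown_text):
    """Upload a Markdown document, then attach it to the knowledge base."""
    headers = {"Authorization": f"Bearer {token}"}
    uploaded = requests.post(
        f"{base_url}/api/v1/files/",
        headers=headers,
        files={"file": (filename, markdown_text.encode("utf-8"), "text/markdown")},
    )
    requests.post(
        f"{base_url}/api/v1/knowledge/{kb_id}/file/add",
        headers=headers,
        json={"file_id": uploaded.json()["id"]},
    )


# Example wiring:
# kb_id = get_or_create_kb(base_url, token, "My Website Knowledge Base")
# for url in crawl("https://docs.example.com", max_depth=2):
#     upload_markdown(base_url, token, kb_id, "page.md", page_to_markdown(url))
```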
## Notes

- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced (see the sketch below)
- Use the `--delay` option to space out requests and be respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
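
As a rough illustration of the filename convention mentioned above, a sanitizer along these lines would do the job; the script's actual naming scheme may differ in detail:

```python
# Illustrative sketch of deriving a filename from a URL, as the notes
# describe; the actual script's naming scheme may differ in detail.
import re
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Replace special characters so the URL becomes a safe .md filename."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw) or "index"
    return f"{safe}.md"


# url_to_filename("https://docs.example.com/guide/intro")
# -> "docs.example.com_guide_intro.md"
```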
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- MarkItDown for HTML to Markdown conversion
- BeautifulSoup for HTML parsing
- Requests for HTTP requests
- Open WebUI for the knowledge base API