# Web to Knowledge Base for Open WebUI

A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.

## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process

## Installation

### Prerequisites

- Python 3.10+
- Open WebUI instance with API access

### Dependencies

Install the required packages:

```bash
pip install requests beautifulsoup4 markitdown
```

### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```

## Usage

Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
                    --base-url "https://your-openwebui-instance.com" \
                    --website-url "https://website-to-crawl.com" \
                    --kb-name "My Website Knowledge Base"
```

### Command Line Arguments

| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | - | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL pattern to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | - | Update existing files in the knowledge base | No | False |
| `--skip-existing` | - | Skip existing files in the knowledge base | No | False |
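For orientation, here is a minimal sketch of how these options map onto Python's `argparse`. The flags and defaults mirror the table above; the internal structure of the real parser in `web_to_kb.py` may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI described in the table above; illustrative only.
    p = argparse.ArgumentParser(
        description="Crawl a website into an Open WebUI knowledge base")
    p.add_argument("--token", "-t", required=True, help="Open WebUI API token")
    p.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
    p.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
    p.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
    p.add_argument("--kb-purpose", "-p", default=None, help="Purpose description")
    p.add_argument("--depth", "-d", type=int, default=2, help="Maximum crawl depth")
    p.add_argument("--delay", type=float, default=1.0, help="Delay between requests (seconds)")
    p.add_argument("--exclude", "-e", action="append", default=None,
                   help="URL pattern to exclude (repeatable)")
    p.add_argument("--include-json", "-j", action="store_true",
                   help="Include JSON files and API endpoints")
    p.add_argument("--update", action="store_true", help="Update existing files")
    p.add_argument("--skip-existing", action="store_true", help="Skip existing files")
    return p
```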

## Examples

### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://docs.example.com" \
                    -n "Example Docs KB" \
                    -d 3
```

### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://blog.example.com" \
                    -n "Example Blog KB" \
                    -e "/tags/" \
                    -e "/author/" \
                    -e "/search/"
```

### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://api-docs.example.com" \
                    -n "Example API Documentation" \
                    -j
```

### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://knowledge-center.example.com" \
                    -n "Knowledge Center" \
                    --update
```

### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
                    -u "https://your-openwebui-instance.com" \
                    -w "https://docs.example.com" \
                    -n "Documentation KB" \
                    --skip-existing
```

## How It Works

1. **Website Crawling**: The script starts from the specified website URL and follows links up to the specified depth, staying within the starting domain (see the crawl sketch after this list).
2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown
   - JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management** (see the API sketch after this list):
   - Checks whether a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists
4. **File Upload**:
   - Handles existing files according to the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base
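A condensed sketch of steps 1 and 2, using the same libraries the script depends on (`requests`, BeautifulSoup, MarkItDown). Function and variable names here are illustrative assumptions, not the script's actual internals, and JSON handling is omitted:

```python
import os
import tempfile
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markitdown import MarkItDown

def crawl(start_url, depth=2, delay=1.0, exclude=None):
    """Breadth-first crawl within start_url's domain; returns {url: markdown}."""
    exclude = exclude or []
    domain = urlparse(start_url).netloc
    converter = MarkItDown()
    seen, pages = {start_url}, {}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        resp = requests.get(url, timeout=30)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # JSON handling (--include-json) is left out of this sketch
        # MarkItDown converts files, so stage the HTML in a temporary file.
        with tempfile.NamedTemporaryFile("w", suffix=".html",
                                         delete=False, encoding="utf-8") as tmp:
            tmp.write(resp.text)
        pages[url] = converter.convert(tmp.name).text_content
        os.unlink(tmp.name)
        if level < depth:
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                nxt = urljoin(url, a["href"]).split("#")[0]
                # Stay on the starting domain, skip seen URLs and excluded patterns.
                if (urlparse(nxt).netloc == domain and nxt not in seen
                        and not any(pat in nxt for pat in exclude)):
                    seen.add(nxt)
                    queue.append((nxt, level + 1))
        time.sleep(delay)  # be polite to the target site
    return pages
```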
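And a sketch of steps 3 and 4 against the Open WebUI REST API. The endpoint paths follow Open WebUI's published API documentation at the time of writing; treat them as assumptions and verify them against your instance's version:

```python
import requests

def get_or_create_kb(base_url, token, name, purpose=None):
    """Return the id of the knowledge base called `name`, creating it if needed."""
    headers = {"Authorization": f"Bearer {token}"}
    # List existing knowledge bases (path per Open WebUI API docs; verify for your version).
    existing = requests.get(f"{base_url}/api/v1/knowledge/", headers=headers).json()
    for kb in existing:
        if kb.get("name") == name:
            return kb["id"]
    created = requests.post(
        f"{base_url}/api/v1/knowledge/create",
        headers=headers,
        json={"name": name, "description": purpose or ""},
    ).json()
    return created["id"]

def upload_to_kb(base_url, token, kb_id, filename, content):
    """Upload one Markdown document and attach it to the knowledge base."""
    headers = {"Authorization": f"Bearer {token}"}
    # Step 1: upload the file itself.
    upload = requests.post(
        f"{base_url}/api/v1/files/",
        headers=headers,
        files={"file": (filename, content.encode("utf-8"), "text/markdown")},
    ).json()
    # Step 2: register the uploaded file with the knowledge base.
    requests.post(
        f"{base_url}/api/v1/knowledge/{kb_id}/file/add",
        headers=headers,
        json={"file_id": upload["id"]},
    )
```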

## Notes

- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced (see the sketch below)
- Use `--delay` to space out requests and be respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
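As a hedged illustration of the filename scheme described above, one plausible URL-to-filename mapping looks like this; the script's exact replacement rules may differ:

```python
import re
from urllib.parse import urlparse

def url_to_filename(url, extension=".md"):
    """Turn a URL into a filesystem- and API-safe filename (illustrative)."""
    parsed = urlparse(url)
    # Join host and path, then replace runs of disallowed characters with "_".
    raw = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw) + extension

# e.g. url_to_filename("https://docs.example.com/guide/intro")
# -> "docs.example.com_guide_intro.md"
```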

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments