# Web to Knowledge Base for Open WebUI
A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.
## Features
- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process
## Installation
### Prerequisites
- Python 3.10+
- Open WebUI instance with API access
### Dependencies
Install the required packages:
```bash
pip install requests beautifulsoup4 markitdown[all]
```
### Getting the Script
Download the script and make it executable:
```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```
## Usage
Basic usage:
```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
--base-url "https://your-openwebui-instance.com" \
--website-url "https://website-to-crawl.com" \
--kb-name "My Website Knowledge Base"
```
### Command Line Arguments
| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL patterns to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | | Update existing files in the knowledge base | No | False |
| `--skip-existing` | | Skip existing files in the knowledge base | No | False |
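The arguments in the table above could be wired up with `argparse` roughly as follows. This is a hedged sketch mirroring the table, not the script's actual parser; the function name is illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser matching the argument table above."""
    p = argparse.ArgumentParser(
        description="Crawl a website into an Open WebUI knowledge base"
    )
    p.add_argument("--token", "-t", required=True, help="Open WebUI API token")
    p.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
    p.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
    p.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
    p.add_argument("--kb-purpose", "-p", default=None, help="Purpose description")
    p.add_argument("--depth", "-d", type=int, default=2, help="Maximum crawl depth")
    p.add_argument("--delay", type=float, default=1.0, help="Delay between requests (seconds)")
    p.add_argument("--exclude", "-e", action="append", default=None,
                   help="URL pattern to exclude (repeatable)")
    p.add_argument("--include-json", "-j", action="store_true")
    p.add_argument("--update", action="store_true")
    p.add_argument("--skip-existing", action="store_true")
    return p
```

Note that `action="append"` lets `-e` be passed multiple times, matching the repeatable `--exclude` behavior shown in the examples below.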
## Examples
### Basic Crawl with Limited Depth
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://docs.example.com" \
-n "Example Docs KB" \
-d 3
```
### Excluding Certain URL Patterns
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://blog.example.com" \
-n "Example Blog KB" \
-e "/tags/" \
-e "/author/" \
-e "/search/"
```
### Including JSON Content
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://api-docs.example.com" \
-n "Example API Documentation" \
-j
```
### Updating an Existing Knowledge Base
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://knowledge-center.example.com" \
-n "Knowledge Center" \
--update
```
### Skipping Existing Files
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://docs.example.com" \
-n "Documentation KB" \
--skip-existing
```
## How It Works
1. **Website Crawling**: The script starts crawling from the specified website URL, following links up to the specified depth while staying within the same domain.
2. **Content Processing**:
- HTML content is converted to Markdown using MarkItDown
- JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management**:
- Checks if a knowledge base with the specified name already exists
- Creates a new knowledge base if none exists
4. **File Upload**:
- Manages existing files based on the `--update` or `--skip-existing` flags
- Uploads new files to the knowledge base
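The link filtering in step 1 (same-domain check plus exclude patterns) can be sketched as follows. The function name and exact matching logic are illustrative assumptions, not the script's actual implementation:

```python
from urllib.parse import urlparse

def should_crawl(url: str, start_url: str, exclude_patterns: list[str]) -> bool:
    """Decide whether a discovered link should be followed.

    Stays within the start URL's domain and skips any URL containing
    one of the exclude patterns (simple substring match, as an example).
    """
    if urlparse(url).netloc != urlparse(start_url).netloc:
        return False  # external link: respect the domain boundary
    return not any(pattern in url for pattern in exclude_patterns)
```

A real crawler would apply this check to every link extracted from a fetched page before adding it to the crawl queue.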
## Notes
- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced
- A delay between requests (1.0 second by default, configurable with `--delay`) keeps the crawl respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
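The filename-generation note above can be illustrated with a small sketch. The function name, replacement character, and extension handling are assumptions for illustration, not the script's exact scheme:

```python
import re
from urllib.parse import urlparse

def url_to_filename(url: str, extension: str = ".md") -> str:
    """Derive a filesystem-safe filename from a crawled URL.

    Combines host and path, strips a trailing slash, and replaces any
    run of characters outside [A-Za-z0-9._-] with an underscore.
    """
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw)
    return safe + extension
```

For example, `https://docs.example.com/guide/intro/` would map to `docs.example.com_guide_intro.md` under this scheme.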
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- [MarkItDown](https://github.com/microsoft/markitdown) for HTML to Markdown conversion
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Requests](https://requests.readthedocs.io/) for HTTP requests
- [Open WebUI](https://github.com/open-webui/open-webui) for the knowledge base API