# Web to Knowledge Base for Open WebUI
A Python utility script that crawls a website, converts its pages to Markdown (or preserves JSON data as-is), and uploads the results to an Open WebUI knowledge base.

## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process
## Installation
### Prerequisites

- Python 3.10+
- Open WebUI instance with API access

### Dependencies

Install the required packages (the extra is quoted so the command also works in shells like zsh):

```bash
pip install requests beautifulsoup4 "markitdown[all]"
```

### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```
## Usage
Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
  --base-url "https://your-openwebui-instance.com" \
  --website-url "https://website-to-crawl.com" \
  --kb-name "My Website Knowledge Base"
```
### Command Line Arguments
| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL pattern to exclude from crawling (repeatable) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | | Update existing files in the knowledge base | No | False |
| `--skip-existing` | | Skip existing files in the knowledge base | No | False |
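For orientation, the table corresponds to an `argparse` declaration along these lines. This is a sketch of how the interface might be wired up, not the script's actual source; in particular, treating `--update` and `--skip-existing` as mutually exclusive is an assumption.

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Sketch of the CLI surface described in the table above; names and
    # defaults mirror the table, but the real script may differ in detail.
    parser = argparse.ArgumentParser(
        description="Crawl a website into an Open WebUI knowledge base."
    )
    parser.add_argument("--token", "-t", required=True, help="Open WebUI API token")
    parser.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
    parser.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
    parser.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
    parser.add_argument("--kb-purpose", "-p", default=None, help="Purpose description")
    parser.add_argument("--depth", "-d", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--delay", type=float, default=1.0, help="Delay between requests (seconds)")
    parser.add_argument("--exclude", "-e", action="append", default=None,
                        help="URL pattern to exclude; repeatable")
    parser.add_argument("--include-json", "-j", action="store_true",
                        help="Include JSON files and API endpoints")
    # Assumption: the two existing-file strategies cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--update", action="store_true",
                       help="Update existing files in the knowledge base")
    group.add_argument("--skip-existing", action="store_true",
                       help="Skip existing files in the knowledge base")
    return parser.parse_args()
```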
## Examples
### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Example Docs KB" \
  -d 3
```

### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://blog.example.com" \
  -n "Example Blog KB" \
  -e "/tags/" \
  -e "/author/" \
  -e "/search/"
```

### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://api-docs.example.com" \
  -n "Example API Documentation" \
  -j
```

### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://knowledge-center.example.com" \
  -n "Knowledge Center" \
  --update
```

### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Documentation KB" \
  --skip-existing
```
## How It Works
1. **Website Crawling**: The script starts from the specified website URL and follows links up to the requested depth, staying within the starting domain (see the crawl sketch below).

2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown (sketched below)
   - JSON content is preserved in its native format (when `--include-json` is used)

3. **Knowledge Base Management**:
   - Checks whether a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists

4. **File Upload**:
   - Manages existing files according to the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base (see the API sketch below)
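The crawl in step 1 is essentially a breadth-first traversal with a visited set, a depth counter, and a same-domain check. A minimal sketch under those assumptions (the real script additionally applies `--exclude` patterns and logging):

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 2, delay: float = 1.0) -> dict[str, str]:
    """Breadth-first crawl; returns {url: html} for pages on the start domain."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    visited: set[str] = set()
    pages: dict[str, str] = {}

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        response = requests.get(url, timeout=30)
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue
        pages[url] = response.text

        # Enqueue same-domain links one level deeper; strip fragments so
        # "#section" anchors don't count as distinct pages.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#", 1)[0]
            if urlparse(link).netloc == domain:
                queue.append((link, depth + 1))

        time.sleep(delay)  # --delay: be polite to the target server
    return pages
```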
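For step 2, MarkItDown's file-based `convert()` API can be applied to fetched HTML by way of a temporary file; a sketch (the `html_to_markdown` helper name is illustrative):

```python
import tempfile
from pathlib import Path

from markitdown import MarkItDown

def html_to_markdown(html: str) -> str:
    """Convert a fetched HTML string to Markdown text via MarkItDown."""
    converter = MarkItDown()
    # Write the HTML to a temporary .html file so MarkItDown's
    # extension-based dispatch picks the HTML converter.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".html", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(html)
        tmp_path = Path(tmp.name)
    try:
        return converter.convert(str(tmp_path)).text_content
    finally:
        tmp_path.unlink()
```

JSON responses, by contrast, need no conversion: with `--include-json` they can simply be written out verbatim with a `.json` extension.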
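Steps 3 and 4 go through the Open WebUI REST API. A sketch of the typical sequence, assuming the `/api/v1/knowledge` and `/api/v1/files` endpoints exposed by recent Open WebUI releases (verify them against your instance's version before relying on this):

```python
import requests

def get_or_create_kb(base_url: str, token: str, name: str, purpose: str | None = None) -> str:
    """Return the id of the knowledge base called `name`, creating it if needed."""
    headers = {"Authorization": f"Bearer {token}"}
    resp = requests.get(f"{base_url}/api/v1/knowledge/list", headers=headers)
    resp.raise_for_status()
    for kb in resp.json():
        if kb.get("name") == name:
            return kb["id"]
    created = requests.post(
        f"{base_url}/api/v1/knowledge/create",
        headers=headers,
        json={"name": name, "description": purpose or ""},
    )
    created.raise_for_status()
    return created.json()["id"]

def upload_and_attach(base_url: str, token: str, kb_id: str, filename: str, content: str) -> None:
    """Upload one document, then attach it to the knowledge base."""
    headers = {"Authorization": f"Bearer {token}"}
    # 1. Upload the file itself (multipart form upload).
    upload = requests.post(
        f"{base_url}/api/v1/files/",
        headers=headers,
        files={"file": (filename, content.encode("utf-8"))},
    )
    upload.raise_for_status()
    file_id = upload.json()["id"]
    # 2. Attach the uploaded file to the knowledge base.
    attach = requests.post(
        f"{base_url}/api/v1/knowledge/{kb_id}/file/add",
        headers=headers,
        json={"file_id": file_id},
    )
    attach.raise_for_status()
```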
## Notes
- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced (see the sketch below)
- Use the `--delay` option to space out requests and be respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
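The URL-to-filename mapping in the second note can be done with a simple regex slug; a sketch (the exact replacement rules in the script may differ):

```python
import re
from urllib.parse import urlparse

def url_to_filename(url: str, extension: str = ".md") -> str:
    """Derive a safe, readable filename from a URL."""
    parsed = urlparse(url)
    slug = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    # Replace anything outside [A-Za-z0-9._-] with an underscore.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", slug)
    return slug + extension

# e.g. url_to_filename("https://docs.example.com/guide/intro")
#   -> "docs.example.com_guide_intro.md"
```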
## License
This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- [MarkItDown](https://github.com/microsoft/markitdown) for HTML to Markdown conversion
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Requests](https://requests.readthedocs.io/) for HTTP requests
- [Open WebUI](https://github.com/open-webui/open-webui) for the knowledge base API