# Web to Knowledge Base for Open WebUI
A Python utility script that crawls a website, converts its pages to Markdown (or preserves JSON data as-is), and uploads the results to an Open WebUI knowledge base.

## Features

- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process
## Installation
### Prerequisites

- Python 3.10+
- Open WebUI instance with API access

### Dependencies

Install the required packages (the extra is quoted so the command also works in shells like zsh):

```bash
pip install requests beautifulsoup4 "markitdown[all]"
```

### Getting the Script

Download the script and make it executable:

```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```
## Usage
Basic usage:

```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
  --base-url "https://your-openwebui-instance.com" \
  --website-url "https://website-to-crawl.com" \
  --kb-name "My Website Knowledge Base"
```
### Command Line Arguments
| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL pattern to exclude from crawling (repeatable) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | | Update existing files in the knowledge base | No | False |
| `--skip-existing` | | Skip existing files in the knowledge base | No | False |
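For orientation, the table corresponds to an `argparse` declaration along these lines. This is a sketch of how the interface might be wired up, not the script's actual source; in particular, treating `--update` and `--skip-existing` as mutually exclusive is an assumption.

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Sketch of the CLI surface described in the table above; names and
    # defaults mirror the table, but the real script may differ in detail.
    parser = argparse.ArgumentParser(
        description="Crawl a website into an Open WebUI knowledge base."
    )
    parser.add_argument("--token", "-t", required=True, help="Open WebUI API token")
    parser.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
    parser.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
    parser.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
    parser.add_argument("--kb-purpose", "-p", default=None, help="Purpose description")
    parser.add_argument("--depth", "-d", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--delay", type=float, default=1.0, help="Delay between requests (seconds)")
    parser.add_argument("--exclude", "-e", action="append", default=None,
                        help="URL pattern to exclude; repeatable")
    parser.add_argument("--include-json", "-j", action="store_true",
                        help="Include JSON files and API endpoints")
    # Assumption: the two existing-file strategies cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--update", action="store_true",
                       help="Update existing files in the knowledge base")
    group.add_argument("--skip-existing", action="store_true",
                       help="Skip existing files in the knowledge base")
    return parser.parse_args()
```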
## Examples
### Basic Crawl with Limited Depth

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Example Docs KB" \
  -d 3
```

### Excluding Certain URL Patterns

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://blog.example.com" \
  -n "Example Blog KB" \
  -e "/tags/" \
  -e "/author/" \
  -e "/search/"
```

### Including JSON Content

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://api-docs.example.com" \
  -n "Example API Documentation" \
  -j
```

### Updating an Existing Knowledge Base

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://knowledge-center.example.com" \
  -n "Knowledge Center" \
  --update
```

### Skipping Existing Files

```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
  -u "https://your-openwebui-instance.com" \
  -w "https://docs.example.com" \
  -n "Documentation KB" \
  --skip-existing
```
## How It Works
1. **Website Crawling**: The script starts from the specified website URL and follows links up to the requested depth, staying within the starting domain (see the crawl sketch below).

2. **Content Processing**:
   - HTML content is converted to Markdown using MarkItDown (sketched below)
   - JSON content is preserved in its native format (when `--include-json` is used)

3. **Knowledge Base Management**:
   - Checks whether a knowledge base with the specified name already exists
   - Creates a new knowledge base if none exists

4. **File Upload**:
   - Manages existing files according to the `--update` or `--skip-existing` flags
   - Uploads new files to the knowledge base (see the API sketch below)
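The crawl in step 1 is essentially a breadth-first traversal with a visited set, a depth counter, and a same-domain check. A minimal sketch under those assumptions (the real script additionally applies `--exclude` patterns and logging):

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 2, delay: float = 1.0) -> dict[str, str]:
    """Breadth-first crawl; returns {url: html} for pages on the start domain."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    visited: set[str] = set()
    pages: dict[str, str] = {}

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        response = requests.get(url, timeout=30)
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue
        pages[url] = response.text

        # Enqueue same-domain links one level deeper; strip fragments so
        # "#section" anchors don't count as distinct pages.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#", 1)[0]
            if urlparse(link).netloc == domain:
                queue.append((link, depth + 1))

        time.sleep(delay)  # --delay: be polite to the target server
    return pages
```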
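For step 2, MarkItDown's file-based `convert()` API can be applied to fetched HTML by way of a temporary file; a sketch (the `html_to_markdown` helper name is illustrative):

```python
import tempfile
from pathlib import Path

from markitdown import MarkItDown

def html_to_markdown(html: str) -> str:
    """Convert a fetched HTML string to Markdown text via MarkItDown."""
    converter = MarkItDown()
    # Write the HTML to a temporary .html file so MarkItDown's
    # extension-based dispatch picks the HTML converter.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".html", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(html)
        tmp_path = Path(tmp.name)
    try:
        return converter.convert(str(tmp_path)).text_content
    finally:
        tmp_path.unlink()
```

JSON responses, by contrast, need no conversion: with `--include-json` they can simply be written out verbatim with a `.json` extension.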
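Steps 3 and 4 go through the Open WebUI REST API. A sketch of the typical sequence, assuming the `/api/v1/knowledge` and `/api/v1/files` endpoints exposed by recent Open WebUI releases (verify them against your instance's version before relying on this):

```python
import requests

def get_or_create_kb(base_url: str, token: str, name: str, purpose: str | None = None) -> str:
    """Return the id of the knowledge base called `name`, creating it if needed."""
    headers = {"Authorization": f"Bearer {token}"}
    resp = requests.get(f"{base_url}/api/v1/knowledge/list", headers=headers)
    resp.raise_for_status()
    for kb in resp.json():
        if kb.get("name") == name:
            return kb["id"]
    created = requests.post(
        f"{base_url}/api/v1/knowledge/create",
        headers=headers,
        json={"name": name, "description": purpose or ""},
    )
    created.raise_for_status()
    return created.json()["id"]

def upload_and_attach(base_url: str, token: str, kb_id: str, filename: str, content: str) -> None:
    """Upload one document, then attach it to the knowledge base."""
    headers = {"Authorization": f"Bearer {token}"}
    # 1. Upload the file itself (multipart form upload).
    upload = requests.post(
        f"{base_url}/api/v1/files/",
        headers=headers,
        files={"file": (filename, content.encode("utf-8"))},
    )
    upload.raise_for_status()
    file_id = upload.json()["id"]
    # 2. Attach the uploaded file to the knowledge base.
    attach = requests.post(
        f"{base_url}/api/v1/knowledge/{kb_id}/file/add",
        headers=headers,
        json={"file_id": file_id},
    )
    attach.raise_for_status()
```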
## Notes
- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced (see the sketch below)
- Use the `--delay` option to space out requests and be respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
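The URL-to-filename mapping in the second note can be done with a simple regex slug; a sketch (the exact replacement rules in the script may differ):

```python
import re
from urllib.parse import urlparse

def url_to_filename(url: str, extension: str = ".md") -> str:
    """Derive a safe, readable filename from a URL."""
    parsed = urlparse(url)
    slug = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    # Replace anything outside [A-Za-z0-9._-] with an underscore.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", slug)
    return slug + extension

# e.g. url_to_filename("https://docs.example.com/guide/intro")
#   -> "docs.example.com_guide_intro.md"
```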
## License
This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- [MarkItDown](https://github.com/microsoft/markitdown) for HTML to Markdown conversion
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Requests](https://requests.readthedocs.io/) for HTTP requests
- [Open WebUI](https://github.com/open-webui/open-webui) for the knowledge base API