# Web to Knowledge Base for Open WebUI
A Python utility script that crawls websites, converts pages to Markdown or preserves JSON data, and uploads them to an Open WebUI knowledge base.
## Features
- Crawls websites to a specified depth while respecting domain boundaries
- Converts HTML content to Markdown using MarkItDown
- Preserves JSON content in its original format
- Creates or updates knowledge bases in Open WebUI
- Handles existing files through update or skip options
- Customizable crawling with exclude patterns
- Detailed logging of the process
## Installation
### Prerequisites
- Python 3.10+
- Open WebUI instance with API access
### Dependencies
Install the required packages:
```bash
pip install requests beautifulsoup4 markitdown[all]
```
### Getting the Script
Download the script and make it executable:
```bash
curl -O https://raw.githubusercontent.com/yourusername/open-webui-site-crawler/main/web_to_kb.py
chmod +x web_to_kb.py
```
## Usage
Basic usage:
```bash
python web_to_kb.py --token "YOUR_API_TOKEN" \
--base-url "https://your-openwebui-instance.com" \
--website-url "https://website-to-crawl.com" \
--kb-name "My Website Knowledge Base"
```
### Command Line Arguments
| Argument | Short | Description | Required | Default |
|----------|-------|-------------|----------|---------|
| `--token` | `-t` | Your Open WebUI API token | Yes | - |
| `--base-url` | `-u` | Base URL of your Open WebUI instance | Yes | - |
| `--website-url` | `-w` | URL of the website to crawl | Yes | - |
| `--kb-name` | `-n` | Name for the knowledge base | Yes | - |
| `--kb-purpose` | `-p` | Purpose description for the knowledge base | No | None |
| `--depth` | `-d` | Maximum depth to crawl | No | 2 |
| `--delay` | | Delay between requests in seconds | No | 1.0 |
| `--exclude` | `-e` | URL patterns to exclude from crawling (can be specified multiple times) | No | None |
| `--include-json` | `-j` | Include JSON files and API endpoints | No | False |
| `--update` | | Update existing files in the knowledge base | No | False |
| `--skip-existing` | | Skip existing files in the knowledge base | No | False |
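The arguments in the table above could be wired up with `argparse` roughly as follows. This is a hedged sketch mirroring the table, not the script's actual parser; the function name is illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser matching the argument table above."""
    p = argparse.ArgumentParser(
        description="Crawl a website into an Open WebUI knowledge base"
    )
    p.add_argument("--token", "-t", required=True, help="Open WebUI API token")
    p.add_argument("--base-url", "-u", required=True, help="Base URL of the Open WebUI instance")
    p.add_argument("--website-url", "-w", required=True, help="URL of the website to crawl")
    p.add_argument("--kb-name", "-n", required=True, help="Name for the knowledge base")
    p.add_argument("--kb-purpose", "-p", default=None, help="Purpose description")
    p.add_argument("--depth", "-d", type=int, default=2, help="Maximum crawl depth")
    p.add_argument("--delay", type=float, default=1.0, help="Delay between requests (seconds)")
    p.add_argument("--exclude", "-e", action="append", default=None,
                   help="URL pattern to exclude (repeatable)")
    p.add_argument("--include-json", "-j", action="store_true")
    p.add_argument("--update", action="store_true")
    p.add_argument("--skip-existing", action="store_true")
    return p
```

Note that `action="append"` lets `-e` be passed multiple times, matching the repeatable `--exclude` behavior shown in the examples below.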
## Examples
### Basic Crawl with Limited Depth
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://docs.example.com" \
-n "Example Docs KB" \
-d 3
```
### Excluding Certain URL Patterns
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://blog.example.com" \
-n "Example Blog KB" \
-e "/tags/" \
-e "/author/" \
-e "/search/"
```
### Including JSON Content
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://api-docs.example.com" \
-n "Example API Documentation" \
-j
```
### Updating an Existing Knowledge Base
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://knowledge-center.example.com" \
-n "Knowledge Center" \
--update
```
### Skipping Existing Files
```bash
python web_to_kb.py -t "YOUR_API_TOKEN" \
-u "https://your-openwebui-instance.com" \
-w "https://docs.example.com" \
-n "Documentation KB" \
--skip-existing
```
## How It Works
1. **Website Crawling**: The script starts crawling from the specified website URL, following links up to the specified depth while staying within the same domain.
2. **Content Processing**:
- HTML content is converted to Markdown using MarkItDown
- JSON content is preserved in its native format (when `--include-json` is used)
3. **Knowledge Base Management**:
- Checks if a knowledge base with the specified name already exists
- Creates a new knowledge base if none exists
4. **File Upload**:
- Manages existing files based on the `--update` or `--skip-existing` flags
- Uploads new files to the knowledge base
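The link filtering in step 1 (same-domain check plus exclude patterns) can be sketched as follows. The function name and exact matching logic are illustrative assumptions, not the script's actual implementation:

```python
from urllib.parse import urlparse

def should_crawl(url: str, start_url: str, exclude_patterns: list[str]) -> bool:
    """Decide whether a discovered link should be followed.

    Stays within the start URL's domain and skips any URL containing
    one of the exclude patterns (simple substring match, as an example).
    """
    if urlparse(url).netloc != urlparse(start_url).netloc:
        return False  # external link: respect the domain boundary
    return not any(pattern in url for pattern in exclude_patterns)
```

A real crawler would apply this check to every link extracted from a fetched page before adding it to the crawl queue.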
## Notes
- The script respects domain boundaries and will not crawl external links
- URLs are used to generate filenames, with special characters replaced
- A delay between requests (1.0 second by default, configurable with `--delay`) keeps the crawl respectful of the target site's resources
- File updates are performed by uploading a new file and removing the old one
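The filename-generation note above can be illustrated with a small sketch. The function name, replacement character, and extension handling are assumptions for illustration, not the script's exact scheme:

```python
import re
from urllib.parse import urlparse

def url_to_filename(url: str, extension: str = ".md") -> str:
    """Derive a filesystem-safe filename from a crawled URL.

    Combines host and path, strips a trailing slash, and replaces any
    run of characters outside [A-Za-z0-9._-] with an underscore.
    """
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".rstrip("/") or parsed.netloc
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw)
    return safe + extension
```

For example, `https://docs.example.com/guide/intro/` would map to `docs.example.com_guide_intro.md` under this scheme.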
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- [MarkItDown](https://github.com/microsoft/markitdown) for HTML to Markdown conversion
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Requests](https://requests.readthedocs.io/) for HTTP requests
- [Open WebUI](https://github.com/open-webui/open-webui) for the knowledge base API