add_pdf(), add_website(), and add_text(). Documents are automatically chunked and vectorized for retrieval.
Quick Start
add_pdf()
Add a PDF document to the knowledge base.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str | Required | Path to PDF file |
| chunk_size | int | 1024 | Size of text chunks in characters |
| chunk_overlap | int | 128 | Overlap between chunks |
| data_parser | str | "llmsherpa" | PDF parser to use |
| extra_info | str | None | Extra metadata as JSON string |
Examples
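A minimal sketch of an `add_pdf()` call using the parameters from the table above. The object that exposes `add_pdf()` is not shown in this section, so `StubKnowledgeBase` is a hypothetical stand-in that only records the call; swap in the real knowledge base client. The file paths and metadata keys are illustrative as well.

```python
import json


class StubKnowledgeBase:
    """Hypothetical stand-in: records calls instead of parsing anything."""

    def __init__(self):
        self.calls = []

    def add_pdf(self, file_path, chunk_size=1024, chunk_overlap=128,
                data_parser="llmsherpa", extra_info=None):
        # Signature mirrors the parameter table; the real method may differ.
        self.calls.append({
            "file_path": file_path,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
            "data_parser": data_parser,
            "extra_info": extra_info,
        })


kb = StubKnowledgeBase()

# Defaults only: 1024-character chunks with an overlap of 128.
kb.add_pdf("docs/handbook.pdf")

# extra_info is documented as a JSON string, so serialize a dict first.
metadata = json.dumps({"department": "legal", "year": 2024})
kb.add_pdf("docs/contract.pdf", chunk_size=2048, chunk_overlap=256,
           extra_info=metadata)
```

Serializing with `json.dumps` keeps the metadata valid JSON even when values contain quotes or non-ASCII characters.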
add_docx()
Add a Word document to the knowledge base.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str | Required | Path to DOCX file |
| chunk_size | int | 1024 | Size of text chunks |
| chunk_overlap | int | 128 | Overlap between chunks |
| data_parser | str | "docx2txt" | Document parser |
| extra_info | str | None | Extra metadata |
Example
add_txt()
Add a plain text file to the knowledge base.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str | Required | Path to TXT file |
| chunk_size | int | 1024 | Size of text chunks |
| chunk_overlap | int | 128 | Overlap between chunks |
| data_parser | str | "simple" | Text parser |
| extra_info | str | None | Extra metadata |
Example
add_website()
Add website content to the knowledge base with optional crawling.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str \| List[str] | Required | URL or list of URLs |
| source | str | "website" | Source identifier |
| max_pages | int | 1 | Maximum pages to crawl |
| max_depth | int | 0 | Maximum crawl depth (0 = single page) |
| chunk_size | int | 1024 | Size of text chunks |
| chunk_overlap | int | 128 | Overlap between chunks |
| dynamic_content_wait_secs | int | 5 | Wait time for dynamic content |
| crawler_type | str | "cheerio" | Crawler type |
Examples
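The sketch below shows three ways the documented `add_website()` parameters might be combined: a single page, a bounded crawl, and a list of URLs. `StubKnowledgeBase` is a hypothetical stand-in that records calls rather than crawling, and the `example.com` URLs and the `"product-docs"` source label are placeholders.

```python
from typing import List, Union


class StubKnowledgeBase:
    """Hypothetical stand-in: records calls instead of crawling anything."""

    def __init__(self):
        self.calls = []

    def add_website(self, url: Union[str, List[str]], source: str = "website",
                    max_pages: int = 1, max_depth: int = 0,
                    chunk_size: int = 1024, chunk_overlap: int = 128,
                    dynamic_content_wait_secs: int = 5,
                    crawler_type: str = "cheerio") -> None:
        # Signature mirrors the parameter table; the real method may differ.
        urls = [url] if isinstance(url, str) else list(url)
        self.calls.append({
            "urls": urls,
            "source": source,
            "max_pages": max_pages,
            "max_depth": max_depth,
            "dynamic_content_wait_secs": dynamic_content_wait_secs,
        })


kb = StubKnowledgeBase()

# Defaults: max_pages=1 and max_depth=0 fetch only the given page.
kb.add_website("https://example.com/pricing")

# Crawl outward from a docs landing page, capped at 50 pages and depth 2.
kb.add_website("https://example.com/docs", source="product-docs",
               max_pages=50, max_depth=2)

# Several URLs in one call, with a longer wait for dynamic content.
kb.add_website(["https://example.com/faq", "https://example.com/changelog"],
               dynamic_content_wait_secs=10)
```

Setting both `max_pages` and `max_depth` bounds the crawl from two directions, which helps keep processing time predictable on large sites.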
add_text()
Add raw text content directly to the knowledge base.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | Required | Text content to add |
| source | str | Required | Source identifier |
| chunk_size | int | 1024 | Size of text chunks |
| chunk_overlap | int | 128 | Overlap between chunks |
Examples
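A short sketch of `add_text()` with an in-memory string. Unlike the file-based methods, `source` is required here, so the caller has to supply a label. `StubKnowledgeBase` is a hypothetical stand-in, and the FAQ text and `"support-faq"` label are made up for illustration.

```python
class StubKnowledgeBase:
    """Hypothetical stand-in: records calls instead of indexing anything."""

    def __init__(self):
        self.calls = []

    def add_text(self, text, source, chunk_size=1024, chunk_overlap=128):
        # Signature mirrors the parameter table; the real method may differ.
        self.calls.append({
            "text": text,
            "source": source,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        })


kb = StubKnowledgeBase()

faq = (
    "Q: How do I reset my password?\n"
    "A: Open Settings, choose Security, then select Reset password.\n"
)

# Short Q&A content, so smaller chunks than the 1024 default are used here.
kb.add_text(text=faq, source="support-faq", chunk_size=256, chunk_overlap=32)
```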
Chunking Configuration
Documents are split into chunks for efficient retrieval. Configure chunking to optimize for your use case:
Small Chunks (Precise Retrieval)
- FAQ-style content
- Technical documentation
- When precision is important
Large Chunks (More Context)
- Narrative content
- Legal documents
- When context is important
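To make the trade-off above concrete, here is a rough sketch of how `chunk_size` and `chunk_overlap` could interact under a plain character-based sliding window. This is an assumption for illustration only: the splitter the library actually uses is not described in this section and may respect sentence or layout boundaries. `chunk_text` is a hypothetical helper, not part of the API.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1024,
               chunk_overlap: int = 128) -> List[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']

document = "a" * 3000
small = chunk_text(document, chunk_size=256, chunk_overlap=32)
large = chunk_text(document, chunk_size=1024, chunk_overlap=128)
print(len(small), len(large))
# → 14 4
```

In this sketch the defaults start a new chunk every 896 characters (1024 minus 128), so smaller chunks produce more, narrower retrieval units, while larger chunks produce fewer units that each carry more surrounding context.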
Bulk Document Loading
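One possible way to load many files at once is to dispatch on the file extension to the matching `add_pdf()`, `add_docx()`, or `add_txt()` method. The sketch below assumes those three methods accept the chunking keywords listed in their parameter tables. `StubKnowledgeBase`, `LOADERS`, and `load_files` are hypothetical names introduced here, and the file names are placeholders.

```python
from pathlib import Path


class StubKnowledgeBase:
    """Hypothetical stand-in: records which method received which file."""

    def __init__(self):
        self.calls = []

    def add_pdf(self, file_path, **kwargs):
        self.calls.append(("add_pdf", file_path))

    def add_docx(self, file_path, **kwargs):
        self.calls.append(("add_docx", file_path))

    def add_txt(self, file_path, **kwargs):
        self.calls.append(("add_txt", file_path))


# Map each supported extension to the method documented for that file type.
LOADERS = {".pdf": "add_pdf", ".docx": "add_docx", ".txt": "add_txt"}


def load_files(kb, paths, **chunk_options):
    """Add every supported file; return the paths that were skipped."""
    skipped = []
    for path in paths:
        method_name = LOADERS.get(Path(path).suffix.lower())
        if method_name is None:
            skipped.append(path)  # unsupported extension
            continue
        getattr(kb, method_name)(str(path), **chunk_options)
    return skipped


kb = StubKnowledgeBase()
skipped = load_files(
    kb,
    ["guide.pdf", "notes.TXT", "spec.docx", "logo.png"],
    chunk_size=512,
    chunk_overlap=64,
)
print(skipped)
# → ['logo.png']
```

Returning the skipped paths rather than raising lets a batch finish even when a folder contains unsupported file types.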
Examples
Documentation Website
Support Knowledge Base
Mixed Content
Error Handling
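The exceptions the library raises are not listed in this section, so the sketch below takes a defensive approach: check that the file exists before calling `add_pdf()`, then catch any exception from the call and log it. `StubKnowledgeBase` and `safe_add_pdf` are hypothetical names, and the stub's `ValueError` only simulates a parse failure; narrow the `except` clause once the real exception types are known.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("kb_loader")


class StubKnowledgeBase:
    """Hypothetical stand-in that fails on files named 'corrupt*'."""

    def add_pdf(self, file_path, **kwargs):
        if Path(file_path).name.startswith("corrupt"):
            raise ValueError("could not parse PDF")  # simulated failure


def safe_add_pdf(kb, file_path, **kwargs) -> bool:
    """Return True if the PDF was added, False if it was missing or failed."""
    if not Path(file_path).is_file():
        logger.warning("Skipping %s: file not found", file_path)
        return False
    try:
        kb.add_pdf(file_path, **kwargs)
    except Exception as exc:  # assumption: real exception types are unknown
        logger.error("Failed to add %s: %s", file_path, exc)
        return False
    return True


kb = StubKnowledgeBase()
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "report.pdf"
    good.write_bytes(b"%PDF-1.4")
    bad = Path(tmp) / "corrupt.pdf"
    bad.write_bytes(b"not a pdf")
    missing = Path(tmp) / "missing.pdf"  # never created

    results = {p.name: safe_add_pdf(kb, str(p)) for p in (good, bad, missing)}

print(results)
# → {'report.pdf': True, 'corrupt.pdf': False, 'missing.pdf': False}
```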
Processing Time
Document processing can take time, especially for:
- Large PDFs (many pages)
- Website crawling (many pages)
- Complex documents