LLM Caching

OntoCast includes automatic LLM response caching to improve performance, reduce API costs, and enable offline testing capabilities.


Overview

The LLM caching system automatically caches responses from language model providers, ensuring that identical queries return cached results instead of making new API calls. This provides several benefits:

  • Performance: Cached responses return instantly
  • Cost Reduction: Avoids duplicate API calls
  • Offline Testing: Tests can run without API access
  • Transparency: Works automatically with no configuration required

Shared Caching Architecture

OntoCast uses a shared caching architecture where:

  • Single Cacher Instance: One Cacher object manages all caching for all tools
  • Tool-Specific Subdirectories: Each tool gets its own subdirectory within the shared cache
  • Dependency Injection: Tools receive the shared Cacher instance through their constructors
  • Organized Storage: Cache files are organized by tool type (llm/, converter/, chunker/)

Benefits

  1. Memory Efficiency: Single cache instance instead of multiple
  2. Consistent Configuration: All tools use the same cache directory settings
  3. Centralized Management: Easy to clear, monitor, and manage all caches
  4. Better Organization: Clear separation of cache files by tool type

How It Works

Shared Caching

OntoCast uses a shared caching system where all tools share a single Cacher instance:

from ontocast.tool.llm import LLMTool
from ontocast.config import LLMConfig
from ontocast.tool.cache import Cacher

# Create shared cache instance
shared_cache = Cacher()

# Create LLM tool with shared cache
llm_config = LLMConfig(
    provider="openai",
    model_name="gpt-4o-mini",
    api_key="your-api-key"
)

llm_tool = LLMTool.create(config=llm_config, cache=shared_cache)

# First call - hits API and caches response
response1 = llm_tool("What is the capital of France?")

# Second call - returns cached response instantly
response2 = llm_tool("What is the capital of France?")

Cache Key Generation

Cache keys are generated based on:

  • LLM provider and model
  • Prompt text
  • Temperature and other parameters
  • API endpoint URL

This ensures that different configurations or parameters result in separate cache entries.
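
As an illustration only (a minimal sketch, not OntoCast's actual implementation; the function name and field layout are assumptions), such a key can be built by hashing the normalized request together with the prompt:

import hashlib
import json

def make_cache_key(provider: str, model: str, prompt: str, **params) -> str:
    # Serialize everything that influences the response in a deterministic
    # order, then hash it into a fixed-length, filename-safe string.
    payload = json.dumps(
        {"provider": provider, "model": model, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Different parameters yield different keys, so they are cached separately.
key_a = make_cache_key("openai", "gpt-4o-mini", "What is the capital of France?", temperature=0.0)
key_b = make_cache_key("openai", "gpt-4o-mini", "What is the capital of France?", temperature=0.7)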


Cache Locations

Default Locations

The system automatically selects appropriate cache directories (a simplified sketch of the selection logic follows the list):

  • Tests: .test_cache/llm/ in the current working directory
  • Windows: %USERPROFILE%\AppData\Local\ontocast\llm\
  • Unix/Linux: ~/.cache/ontocast/llm/ (or $XDG_CACHE_HOME/ontocast/llm/)
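
The selection logic can be approximated as follows (a simplified sketch; the real implementation may differ in details such as Windows path handling):

import os
import sys
from pathlib import Path

def default_llm_cache_dir(testing: bool = False) -> Path:
    # Tests use a local directory so runs stay isolated from the user cache.
    if testing:
        return Path.cwd() / ".test_cache" / "llm"
    if "ONTOCAST_CACHE_DIR" in os.environ:
        base = Path(os.environ["ONTOCAST_CACHE_DIR"])
    elif sys.platform == "win32":
        base = Path(os.environ["USERPROFILE"]) / "AppData" / "Local" / "ontocast"
    else:
        xdg = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
        base = Path(xdg) / "ontocast"
    return base / "llm"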

Environment Variables

Set the cache directory via environment variables:

# OntoCast cache directory (recommended)
export ONTOCAST_CACHE_DIR=/path/to/custom/cache

# Or use XDG cache home (affects all XDG-compliant applications)
export XDG_CACHE_HOME=/path/to/custom/cache

CLI Parameter

Specify cache directory via command line:

ontocast --env-path .env --working-directory ./work --cache-dir /custom/cache/path

Cache Management

Cache Structure

The LLM cache directory contains subdirectories organized by provider and model:

cache_dir/
├── openai/
│   ├── gpt-4o-mini/
│   │   ├── prompt_hash_1.json
│   │   └── prompt_hash_2.json
│   └── gpt-4/
│       └── prompt_hash_3.json
└── ollama/
    └── llama2/
        └── prompt_hash_4.json

Cache Files

Each cached response is stored as a JSON file containing:

  • Original prompt and parameters
  • Response content
  • Metadata (timestamp, model info)
  • Cache key hash
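
As an illustration only (the field names here are assumptions, not the exact schema), loading one of these files might look like this:

import json
from pathlib import Path

cache_file = (
    Path.home() / ".cache" / "ontocast" / "llm" / "openai" / "gpt-4o-mini" / "prompt_hash_1.json"
)
entry = json.loads(cache_file.read_text())

# Illustrative shape of the stored entry; actual keys may differ:
# {
#     "prompt": "What is the capital of France?",
#     "parameters": {"temperature": 0.0},
#     "response": "The capital of France is Paris.",
#     "metadata": {"timestamp": "2024-01-01T12:00:00Z", "model": "gpt-4o-mini"},
#     "cache_key": "sha256-of-the-request"
# }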


Testing with Caching

Offline Testing

Cached responses enable offline testing:

# First run - with API access
pytest test_llm_functionality.py

# Subsequent runs - offline (uses cached responses)
pytest test_llm_functionality.py

Test Isolation

Each test run uses a separate cache directory (.test_cache/llm/) to avoid interference between tests.
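
If you want the same isolation in your own test suite, one approach (a sketch that reuses the Cacher and LLMTool.create signatures shown elsewhere on this page) is a per-test cache built on pytest's tmp_path fixture:

import pytest

from ontocast.config import LLMConfig
from ontocast.tool.cache import Cacher
from ontocast.tool.llm import LLMTool

@pytest.fixture
def cached_llm(tmp_path):
    # Each test gets its own cache directory, so cached responses
    # from one test can never leak into another test.
    cache = Cacher(subdirectory="llm", cache_dir=tmp_path)
    config = LLMConfig(provider="openai", model_name="gpt-4o-mini", api_key="test-key")
    return LLMTool.create(config=config, cache=cache)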


Performance Benefits

Speed Improvements

  • First Call: Normal API response time
  • Cached Calls: Near-instant response (< 1ms)
  • Batch Processing: Significant speedup for repeated operations

Cost Savings

  • Development: Avoid repeated API calls during development
  • Testing: Run tests without API costs
  • Production: Reduce API usage for common queries

Best Practices

Development

  1. Use Default Locations: Let the system choose appropriate cache directories
  2. Version Control: Add cache directories to .gitignore
  3. Cleanup: Periodically clean old cache files

Production

  1. Persistent Storage: Use persistent cache directories
  2. Monitoring: Monitor cache hit rates
  3. Maintenance: Implement cache cleanup strategies

Testing

  1. Isolated Caches: Each test run gets its own cache
  2. Deterministic: Cached responses ensure consistent test results
  3. Offline Capability: Tests can run without API access

Troubleshooting

Common Issues

  1. Cache Not Working: Check directory permissions
  2. Stale Responses: Clear cache directory
  3. Disk Space: Monitor cache directory size

Debug Cache

from ontocast.tool.llm import LLMTool

# Check cache directory (llm_config as defined in the earlier example)
llm_tool = LLMTool.create(config=llm_config)
print(f"Cache directory: {llm_tool.cache.tool_cache_dir}")

# List cached files
cache_files = list(llm_tool.cache.tool_cache_dir.glob("**/*.json"))
print(f"Cached responses: {len(cache_files)}")

Clear Cache

import shutil
from pathlib import Path

# Clear entire cache
cache_dir = Path.home() / ".cache" / "ontocast" / "llm"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print("Cache cleared!")

Advanced Usage

Custom Cache Implementation

For advanced use cases, you can customize caching by supplying your own Cacher instance (or a Cacher subclass) to a tool:

from ontocast.tool.llm import LLMTool
from ontocast.tool.cache import Cacher
from pathlib import Path

class CustomLLMTool(LLMTool):
    def __init__(self, config, **kwargs):
        super().__init__(config, **kwargs)
        # Override with custom cache
        self.cache = Cacher(subdirectory="llm", cache_dir=Path("/custom/cache"))

Cache Statistics

from ontocast.tool.llm import LLMTool

# Get cache statistics (llm_config as defined in the earlier example)
llm_tool = LLMTool.create(config=llm_config)
stats = llm_tool.cache.get_cache_stats()
print(f"Cache stats: {stats}")

Integration with Other Tools

ToolBox Integration

Caching works seamlessly with the ToolBox through a shared Cacher instance:

from ontocast.toolbox import ToolBox
from ontocast.config import Config

# ToolBox automatically creates and uses a shared Cacher
config = Config()
tools = ToolBox(config)

# All tools (LLM, Converter, Chunker) share the same cache instance
result = tools.llm("Process this document")
converted = tools.converter(document_file)
chunks = tools.chunker(text)

Server Integration

The server automatically uses caching for all LLM operations:

# Start server with automatic caching
ontocast --env-path .env --working-directory /data/working

Security Considerations

Sensitive Data

  • Cache files may contain sensitive prompt data
  • Ensure proper file permissions on cache directories
  • Consider encryption for sensitive deployments

Access Control

  • Restrict access to cache directories
  • Use appropriate file system permissions (see the sketch after this list)
  • Consider network security for shared cache directories
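
On Unix-like systems, one way to tighten permissions on the default cache tree is a small maintenance script such as the following (a sketch using the default location; adjust the path for your deployment):

import stat
from pathlib import Path

cache_root = Path.home() / ".cache" / "ontocast"

# Remove group/other access: directories become 0o700, files 0o600.
if cache_root.exists():
    for path in [cache_root, *cache_root.rglob("*")]:
        if path.is_dir():
            path.chmod(stat.S_IRWXU)
        else:
            path.chmod(stat.S_IRUSR | stat.S_IWUSR)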

Converter and Chunker Caching

In addition to LLM response caching, OntoCast caches document conversion and text chunking operations. This avoids redundant processing when the same documents or text are processed multiple times.

Converter Caching

The ConverterTool automatically caches document conversion results based on the input file content, as illustrated after this list. This means:

  • PDF files: If the same PDF is processed multiple times, the conversion to markdown is cached
  • Other documents: PowerPoint, Word documents, etc. are also cached after conversion
  • Plain text: Text input is not cached as it doesn't require conversion
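
For example, converting the same document twice only performs the expensive conversion once (a sketch; report.pdf is a placeholder file and the call pattern follows the earlier ToolBox example):

from ontocast.tool.converter import ConverterTool

converter = ConverterTool()

# First call converts the PDF to markdown and caches the result.
first = converter("report.pdf")

# A second call with the same file content is served from the cache.
second = converter("report.pdf")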

Chunker Caching

The ChunkerTool caches chunking results based on:

  • Input text content: The exact text being chunked
  • Chunking configuration: All chunking parameters (max_size, min_size, model, etc.)
  • Chunking mode: Whether semantic or naive chunking is used

This ensures that identical text with identical chunking parameters will return cached results.
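
A minimal sketch of this behavior (assuming the call pattern from the earlier ToolBox example and ChunkerTool's default configuration):

from ontocast.tool.chunk.chunker import ChunkerTool

chunker = ChunkerTool()

text = "OntoCast converts documents into semantic triples ..."

# First call computes the chunks and stores them in the chunker cache.
chunks = chunker(text)

# The same text with the same configuration returns the cached chunks.
chunks_again = chunker(text)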

Cache Organization

Caching is organized in subdirectories:

~/.cache/ontocast/
├── llm/           # LLM response cache
├── converter/     # Document conversion cache
└── chunker/       # Text chunking cache

Cache Benefits

  1. Faster Processing: Repeated operations return instantly from cache
  2. Cost Reduction: Avoids redundant LLM API calls and processing
  3. Consistency: Identical inputs always produce identical outputs
  4. Offline Capability: Cached operations work without API access

Cache Management

You can access cache statistics and management through the tool instances:

from ontocast.tool.converter import ConverterTool
from ontocast.tool.chunk.chunker import ChunkerTool

# Get cache statistics
converter = ConverterTool()
stats = converter.cache.get_cache_stats()
print(f"Converter cache: {stats['total_files']} files, {stats['total_size_bytes']} bytes")

# Clear cache if needed
converter.cache.clear()

# Chunker cache management
chunker = ChunkerTool()
chunker.cache.clear()  # Clear chunker cache

Custom Cache Directories

You can specify custom cache directories in several ways:

1. Environment Variables

# OntoCast cache directory (recommended)
export ONTOCAST_CACHE_DIR=/custom/cache/path

# Or use XDG cache home (affects all XDG-compliant applications)
export XDG_CACHE_HOME=/custom/cache/path

2. CLI Parameter

ontocast --env-path .env --working-directory ./work --cache-dir /custom/cache/path

3. Programmatic Configuration

from ontocast.toolbox import ToolBox
from ontocast.config import Config
from pathlib import Path

# Create config and set cache directory
config = Config()
config.tool_config.path_config.cache_dir = Path("/custom/cache/path")

# Create ToolBox with config (cache directory is automatically used)
tools = ToolBox(config)

# All tools will use the same custom cache directory
result = tools.llm("Process this document")
converted = tools.converter(document_file)
chunks = tools.chunker(text)

Cache Key Generation

Cache keys are generated based on:

  • Content hash: SHA256 hash of the input content
  • Configuration: All relevant parameters that affect the output
  • Tool-specific parameters: Model names, chunking modes, etc.

This ensures that different configurations produce different cache entries, even for the same input content.
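
For instance, the content-hash component can be pictured as a SHA256 over the raw input bytes (an illustrative helper, not the library's actual function):

import hashlib
from pathlib import Path

def content_hash(path: str) -> str:
    # Hash the raw bytes, so renaming or moving a file does not change its cache key.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()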

Best Practices

  1. Let caching work automatically: No configuration is needed for basic usage
  2. Monitor cache size: Check cache statistics periodically
  3. Clear the cache when needed: For example, after significantly changing tool configurations
  4. Use custom directories: For testing or specific deployment scenarios
  5. Rely on cache persistence: Caches persist between runs, so long-lived cache directories give the most benefit