predict-otron-9000
AI inference server with an OpenAI-compatible API (limited features)
> This project is an educational aid for bootstrapping my understanding of language model inference at the lowest levels I can, serving as a "rubber-duck" solution for Kubernetes-based, performance-oriented inference on air-gapped networks.
Isolating application behaviors into crate-level components keeps the feedback loop for validation and integration short, which smooths the learning curve for building scalable AI systems. Stability is currently best-effort, and many models require model-specific configuration. Once stability is achieved, this project will be promoted to the seemueller-io GitHub organization under a different name.
A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.
Quick start:
./scripts/run.sh
Project Overview
The predict-otron-9000 is a flexible AI platform that provides:
- Local LLM Inference: Run Gemma and Llama models locally with CPU or GPU acceleration
- Embeddings Generation: Create text embeddings with FastEmbed
- Web Interface: Interact with models through a Leptos WASM chat interface
- TypeScript CLI: Command-line client for testing and automation
- Production Deployment: Docker and Kubernetes deployment options
The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.
Features
- OpenAI Compatible: API endpoints match OpenAI's format for easy integration
- Text Embeddings: Generate high-quality text embeddings using FastEmbed
- Text Generation: Chat completions with OpenAI-compatible API using Gemma and Llama models (various sizes including instruction-tuned variants)
- Performance Optimized: Efficient caching and platform-specific optimizations for improved throughput
- Web Chat Interface: Leptos chat interface
- Flexible Deployment: Run as monolithic service or microservices architecture
Architecture Overview
Workspace Structure
The project uses a 9-crate Rust workspace plus TypeScript components:
crates/
├── predict-otron-9000/ # Main orchestration server (Rust 2024)
├── inference-engine/ # Multi-model inference orchestrator (Rust 2021)
├── embeddings-engine/ # FastEmbed embeddings service (Rust 2024)
└── chat-ui/ # WASM web frontend (Rust 2021)
integration/
├── cli/ # CLI client crate (Rust 2024)
│ └── package/
│ └── cli.ts # TypeScript/Bun CLI client
├── gemma-runner/ # Gemma model inference via Candle (Rust 2021)
├── llama-runner/ # Llama model inference via Candle (Rust 2021)
├── helm-chart-tool/ # Kubernetes deployment tooling (Rust 2024)
└── utils/ # Shared utilities (Rust 2021)
Service Architecture
- Main Server (port 8080): Orchestrates inference and embeddings services
- Embeddings Service (port 8080): Standalone FastEmbed service with OpenAI API compatibility
- Web Frontend (port 8788): chat-ui WASM app
- CLI Client: TypeScript/Bun client for testing and automation
Deployment Modes
The architecture supports multiple deployment patterns:
- Development Mode: All services run in a single process for simplified development
- Docker Monolithic: Single containerized service handling all functionality
- Kubernetes Microservices: Separate services for horizontal scalability and fault isolation
Build and Configuration
Dependencies and Environment Prerequisites
Rust Toolchain
- Editions: Mixed - main services use Rust 2024, some components use 2021
- Recommended: Latest stable Rust toolchain:
rustup default stable && rustup update
- Developer tools:
rustup component add rustfmt (formatting)
rustup component add clippy (linting)
Node.js/Bun Toolchain
- Bun: Required for TypeScript CLI client:
curl -fsSL https://bun.sh/install | bash
- Node.js: Alternative to Bun, supports OpenAI SDK v5.16.0+
ML Framework Dependencies
- Candle: Version 0.9.1 with conditional compilation:
- macOS: Metal support with CPU fallback for stability
- Linux: CUDA support with CPU fallback
- CPU-only: Supported on all platforms
- FastEmbed: Version 4.x for embeddings functionality
Hugging Face Access
- Required for: Gemma model downloads (gated models)
- Authentication:
- CLI:
pip install -U "huggingface_hub[cli]" && huggingface-cli login
- Environment:
export HF_TOKEN="<your_token>"
- Cache management (optional, keeps the cache local):
export HF_HOME="$PWD/.hf-cache"
- Model access: Accept Gemma model licenses on Hugging Face before use
Platform-Specific Notes
- macOS: Metal acceleration available but routed to CPU for Gemma v3 stability
- Linux: CUDA support with BF16 precision on GPU, F32 on CPU
- Conditional compilation: Handled automatically per platform in Cargo.toml
Build Procedures
Full Workspace Build
cargo build --workspace --release
Individual Services
Main Server:
cargo build --bin predict-otron-9000 --release
Inference Engine CLI:
cargo build --bin cli --package inference-engine --release
Embeddings Service:
cargo build --bin embeddings-engine --release
Running Services
Main Server (Port 8080)
./scripts/run_server.sh
- Respects SERVER_PORT (default: 8080) and RUST_LOG (default: info)
- Boots with the default model: gemma-3-1b-it
- Requires HF authentication for first-time model download
Web Frontend (Port 8788)
cd crates/chat-ui
./run.sh
- Serves chat-ui WASM frontend on port 8788
- Sets required RUSTFLAGS for WebAssembly getrandom support
- Auto-reloads during development
TypeScript CLI Client
# List available models
cd integration/cli/package && bun run cli.ts --list-models
# Chat completion
cd integration/cli/package && bun run cli.ts "What is the capital of France?"
# With specific model
cd integration/cli/package && bun run cli.ts --model gemma-3-1b-it --prompt "Hello, world!"
# Show help
cd integration/cli/package && bun run cli.ts --help
API Usage
Health Checks and Model Inventory
curl -s http://localhost:8080/v1/models | jq
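Models can also be listed programmatically with the OpenAI SDK (mentioned under the Node.js toolchain above). A minimal sketch, assuming a server on localhost:8080 and that no API key is enforced locally:

```typescript
// Minimal sketch: list the models the server exposes.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: the local server does not check API keys
});

for await (const model of client.models.list()) {
  console.log(model.id);
}
```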
Chat Completions
Non-streaming:
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 64
}' | jq
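The same request works through the OpenAI SDK pointed at the local server; a minimal sketch under the same assumptions as above (dummy API key, server on localhost:8080):

```typescript
// Minimal sketch: non-streaming chat completion against the local server.
// Assumes `bun add openai` (or `npm install openai`).
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: no API key enforcement locally
});

const completion = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Say hello" }],
  max_tokens: 64,
});

console.log(completion.choices[0].message.content);
```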
Streaming (Server-Sent Events):
curl -N http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Tell a short joke"}],
"stream": true,
"max_tokens": 64
}'
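To consume the SSE stream without hand-parsing events, the SDK's async iterator should work against this endpoint as well (a sketch under the same assumptions):

```typescript
// Minimal sketch: streaming chat completion, printing tokens as they arrive.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: no API key enforcement locally
});

const stream = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Tell a short joke" }],
  stream: true,
  max_tokens: 64,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```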
Model Specification:
- Use "model": "default" for the configured model
- Or specify the exact model ID: "model": "gemma-3-1b-it"
- Requests for unknown models are rejected
Embeddings API
Generate text embeddings compatible with OpenAI's embeddings API.
Endpoint: POST /v1/embeddings
Request Body:
{
"input": "Your text to embed",
"model": "nomic-embed-text-v1.5"
}
Response:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.1, 0.2, 0.3]
}
],
"model": "nomic-embed-text-v1.5",
"usage": {
"prompt_tokens": 0,
"total_tokens": 0
}
}
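Because the endpoint mirrors OpenAI's embeddings API, the SDK's embeddings call should work here too. A minimal sketch, assuming the same local server and dummy key as in the chat examples:

```typescript
// Minimal sketch: generate an embedding via the OpenAI-compatible endpoint.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: no API key enforcement locally
});

const res = await client.embeddings.create({
  model: "nomic-embed-text-v1.5",
  input: "Your text to embed",
});

console.log(res.data[0].embedding.length); // vector dimensionality
```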
Web Frontend
- Navigate to
http://localhost:8788
- Real-time chat interface with the inference server
- Supports streaming responses and conversation history
Testing
Test Categories
- Offline/fast tests: No network or model downloads required
- Online tests: Require HF authentication and model downloads
- Integration tests: Multi-service end-to-end testing
Quick Start: Offline Tests
Prompt formatting tests:
cargo test --workspace build_gemma_prompt
Model metadata tests:
cargo test --workspace which_
These verify core functionality without requiring HF access.
Full Test Suite (Requires HF)
Prerequisites:
- Accept Gemma model licenses on Hugging Face
- Authenticate: huggingface-cli login or export HF_TOKEN=...
- Optional:
export HF_HOME="$PWD/.hf-cache"
Run all tests:
cargo test --workspace
Integration Testing
End-to-end test script:
./scripts/smoke_test.sh
This script:
- Starts the server in background with proper cleanup
- Waits for server readiness via health checks
- Runs CLI tests for model listing and chat completion
- Includes 60-second timeout and process management
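The readiness-wait step can be reproduced in a few lines when scripting your own checks. This is an illustrative sketch, not the script's actual implementation:

```typescript
// Illustrative sketch: poll the models endpoint until the server answers.
async function waitForServer(url: string, timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url);
      if (res.ok) return; // server is up and answering
    } catch {
      // connection refused: server still starting
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error(`Server at ${url} not ready within ${timeoutMs} ms`);
}

await waitForServer("http://localhost:8080/v1/models");
```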
Development
Code Style and Tooling
Formatting:
cargo fmt --all
Linting:
cargo clippy --workspace --all-targets -- -D warnings
Logging:
- The server uses the tracing framework
- Control verbosity via RUST_LOG (e.g., RUST_LOG=debug ./scripts/run_server.sh)
Adding Tests
For fast, offline tests:
- Exercise pure logic without tokenizers/models
- Use descriptive names for easy filtering:
cargo test specific_test_name
- Example patterns: prompt construction, metadata selection, tensor math
Process:
- Add test to existing module
- Run filtered:
cargo test --workspace new_test_name
- Verify in full suite:
cargo test --workspace
OpenAI API Compatibility
Features:
- POST /v1/chat/completions with streaming and non-streaming responses
- Single configured model enforcement (use "model": "default")
- Gemma-style prompt formatting with <start_of_turn>/<end_of_turn> markers
- System prompt injection into the first user turn
- Repetition detection and early stopping in streaming mode
CORS:
- Fully open by default (tower-http CorsLayer::Any)
- Adjust for production deployment
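To make the prompt-formatting behavior concrete, here is an illustrative sketch of how messages might be folded into Gemma's turn markers, with the system prompt injected into the first user turn. The helper is hypothetical; the real logic lives in the inference engine (see the build_gemma_prompt tests):

```typescript
// Hypothetical helper illustrating Gemma-style prompt construction.
type Msg = { role: "system" | "user" | "assistant"; content: string };

function buildGemmaPrompt(messages: Msg[]): string {
  const system = messages.find((m) => m.role === "system")?.content;
  let systemInjected = false;
  const parts: string[] = [];

  for (const m of messages) {
    if (m.role === "system") continue; // folded into the first user turn below
    const speaker = m.role === "assistant" ? "model" : "user";
    let content = m.content;
    if (system && !systemInjected && m.role === "user") {
      content = `${system}\n\n${content}`; // inject the system prompt once
      systemInjected = true;
    }
    parts.push(`<start_of_turn>${speaker}\n${content}<end_of_turn>`);
  }

  parts.push("<start_of_turn>model"); // cue the model to produce its turn
  return parts.join("\n");
}
```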
Architecture Details
Device Selection:
- Automatic device/dtype selection
- CPU: Universal fallback (F32 precision)
- CUDA: BF16 precision on compatible GPUs
- Metal: Available but routed to CPU for Gemma v3 stability
Model Loading:
- Single-file model.safetensors preferred
- Falls back to index resolution via utilities_lib::hub_load_safetensors
- HF cache populated on first access
Multi-Service Design:
- Main server orchestrates inference and embeddings
- Services can run independently for horizontal scaling
- Docker/Kubernetes metadata included for deployment
Deployment
Docker Support
All services include Docker metadata in Cargo.toml:
Main Server:
- Image:
ghcr.io/geoffsee/predict-otron-9000:latest
- Port: 8080
Inference Service:
- Image:
ghcr.io/geoffsee/inference-service:latest
- Port: 8080
Embeddings Service:
- Image:
ghcr.io/geoffsee/embeddings-service:latest
- Port: 8080
Web Frontend:
- Image:
ghcr.io/geoffsee/chat-ui:latest
- Port: 8788
Docker Compose:
# Start all services
docker-compose up -d
# Check logs
docker-compose logs -f
# Stop services
docker-compose down
Kubernetes Support
All services include Kubernetes manifest metadata:
- Single replica deployments by default
- Service-specific port configurations
- Ready for horizontal pod autoscaling
For Kubernetes deployment details, see the ARCHITECTURE.md document.
Build Artifacts
Ignored by Git:
- target/ (Rust build artifacts)
- node_modules/ (Node.js dependencies)
- dist/ (Frontend build output)
- .fastembed_cache/ (FastEmbed model cache)
- .hf-cache/ (Hugging Face cache, if configured)
Common Issues and Solutions
Authentication/Licensing
Symptom: 404 or permission errors fetching models
Solution:
- Accept Gemma model licenses on Hugging Face
- Authenticate with huggingface-cli login or HF_TOKEN
- Verify the token with huggingface-cli whoami
GPU Issues
Symptom: OOM errors or GPU panics
Solution:
- Test on CPU first: set CUDA_VISIBLE_DEVICES="" if needed
- Check available VRAM against model requirements
- Consider using smaller model variants
Model Mismatch Errors
Symptom: 400 errors with type=model_mismatch
Solution:
- Use "model": "default" in API requests
- Or match the configured model ID exactly: "model": "gemma-3-1b-it"
Frontend Build Issues
Symptom: WASM compilation failures
Solution:
- Install the required target:
rustup target add wasm32-unknown-unknown
- Check RUSTFLAGS in chat-ui/run.sh
Network/Timeout Issues
Symptom: First-time model downloads timing out
Solution:
- Ensure stable internet connection
- Consider using local HF cache:
export HF_HOME="$PWD/.hf-cache"
- Download models manually with huggingface-cli
Minimal End-to-End Verification
Build verification:
cargo build --workspace --release
Fast offline tests:
cargo test --workspace build_gemma_prompt
cargo test --workspace which_
Service startup:
./scripts/run_server.sh &
sleep 10 # Wait for server startup
curl -s http://localhost:8080/v1/models | jq
CLI client test:
cd integration/cli/package && bun run cli.ts "What is 2+2?"
Web frontend:
cd crates/chat-ui && ./run.sh &
# Navigate to http://localhost:8788
Integration test:
./scripts/smoke_test.sh
Cleanup:
pkill -f "predict-otron-9000"
For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.
Further Reading
Documentation
- Architecture - Detailed architectural diagrams and deployment patterns
- Server Configuration Guide - Detailed server configuration options
- Testing Documentation - Comprehensive testing guide
- Performance Benchmarking - Instructions for benchmarking
Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature-name
- Make your changes and add tests
- Ensure all tests pass:
cargo test
- Submit a pull request
Warning: Do NOT use this in production unless you are cool like that.