predict-otron-9000

Powerful local AI inference with OpenAI-compatible APIs


This project is an educational aid for building my understanding of language-model inference at the lowest levels I can manage, and a "rubber-duck" testbed for Kubernetes-based, performance-oriented inference on air-gapped networks.

Isolating application behavior into crate-level components keeps development to a short feedback loop for validation and integration, which smooths the learning curve for scalable AI systems. Stability is currently best-effort; many models require unique configuration. Once stability is achieved, this project will be promoted to the seemueller-io GitHub organization under a different name.

A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.

Project Overview

The predict-otron-9000 is a flexible AI platform that provides:

  • Local LLM Inference: Run Gemma and Llama models locally with CPU or GPU acceleration
  • Embeddings Generation: Create text embeddings with FastEmbed
  • Web Interface: Interact with models through a Leptos WASM chat interface
  • TypeScript CLI: Command-line client for testing and automation
  • Production Deployment: Docker and Kubernetes deployment options

The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.

Features

  • OpenAI Compatible: API endpoints match OpenAI's format for easy integration
  • Text Embeddings: Generate high-quality text embeddings using FastEmbed
  • Text Generation: Chat completions with OpenAI-compatible API using Gemma and Llama models (various sizes including instruction-tuned variants)
  • Performance Optimized: Efficient caching and platform-specific optimizations for improved throughput
  • Web Chat Interface: Leptos chat interface
  • Flexible Deployment: Run as monolithic service or microservices architecture

Architecture Overview

Workspace Structure

The project uses a 7-crate Rust workspace plus TypeScript components:

crates/
├── predict-otron-9000/     # Main orchestration server (Rust 2024)
├── inference-engine/       # Multi-model inference orchestrator (Rust 2021)
├── gemma-runner/           # Gemma model inference via Candle (Rust 2021)
├── llama-runner/           # Llama model inference via Candle (Rust 2021)
├── embeddings-engine/      # FastEmbed embeddings service (Rust 2024)
├── leptos-app/             # WASM web frontend (Rust 2021)
└── helm-chart-tool/        # Kubernetes deployment tooling (Rust 2024)
scripts/
└── cli.ts                  # TypeScript/Bun CLI client

Service Architecture

  • Main Server (port 8080): Orchestrates inference and embeddings services
  • Embeddings Service (port 8080): Standalone FastEmbed service with OpenAI API compatibility
  • Web Frontend (port 8788): cargo leptos SSR app
  • CLI Client: TypeScript/Bun client for testing and automation

Deployment Modes

The architecture supports multiple deployment patterns:

  1. Development Mode: All services run in a single process for simplified development
  2. Docker Monolithic: Single containerized service handling all functionality
  3. Kubernetes Microservices: Separate services for horizontal scalability and fault isolation

Build and Configuration

Dependencies and Environment Prerequisites

Rust Toolchain

  • Editions: Mixed - main services use Rust 2024, some components use 2021
  • Recommended: Latest stable Rust toolchain: rustup default stable && rustup update
  • Developer tools:
    • rustup component add rustfmt (formatting)
    • rustup component add clippy (linting)

Node.js/Bun Toolchain

  • Bun: Required for TypeScript CLI client: curl -fsSL https://bun.sh/install | bash
  • Node.js: Alternative to Bun; the CLI works with OpenAI SDK v5.16.0+

ML Framework Dependencies

  • Candle: Version 0.9.1 with conditional compilation:
    • macOS: Metal support with CPU fallback for stability
    • Linux: CUDA support with CPU fallback
    • CPU-only: Supported on all platforms
  • FastEmbed: Version 4.x for embeddings functionality

Hugging Face Access

  • Required for: Gemma model downloads (gated models)
  • Authentication:
    • CLI: pip install -U "huggingface_hub[cli]" && huggingface-cli login
    • Environment: export HF_TOKEN="<your_token>"
  • Cache management: export HF_HOME="$PWD/.hf-cache" (optional, keeps cache local)
  • Model access: Accept Gemma model licenses on Hugging Face before use

Platform-Specific Notes

  • macOS: Metal acceleration available but routed to CPU for Gemma v3 stability
  • Linux: CUDA support with BF16 precision on GPU, F32 on CPU
  • Conditional compilation: Handled automatically per platform in Cargo.toml
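
As a rough illustration, platform-conditional Candle features in a Cargo.toml can be wired like this (a sketch only; the feature names and layout are assumptions, not this repository's actual manifest):

# Hypothetical excerpt: Metal backend on macOS, CUDA backend on Linux.
[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "0.9.1", features = ["metal"] }

[target.'cfg(target_os = "linux")'.dependencies]
candle-core = { version = "0.9.1", features = ["cuda"] }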

Build Procedures

Full Workspace Build

cargo build --workspace --release

Individual Services

Main Server:

cargo build --bin predict-otron-9000 --release

Inference Engine CLI:

cargo build --bin cli --package inference-engine --release

Embeddings Service:

cargo build --bin embeddings-engine --release

Running Services

Main Server (Port 8080)

./scripts/run_server.sh
  • Respects SERVER_PORT (default: 8080) and RUST_LOG (default: info); see the example below
  • Boots with default model: gemma-3-1b-it
  • Requires HF authentication for first-time model download
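
For example, to run on a different port with verbose logging:

SERVER_PORT=8081 RUST_LOG=debug ./scripts/run_server.sh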

Web Frontend (Port 8788)

cd crates/leptos-app
./run.sh
  • Serves Leptos WASM frontend on port 8788
  • Sets required RUSTFLAGS for WebAssembly getrandom support
  • Auto-reloads during development

TypeScript CLI Client

# List available models
bun run scripts/cli.ts --list-models

# Chat completion
bun run scripts/cli.ts "What is the capital of France?"

# With specific model
bun run scripts/cli.ts --model gemma-3-1b-it --prompt "Hello, world!"

# Show help
bun run scripts/cli.ts --help

API Usage

Health Checks and Model Inventory

curl -s http://localhost:8080/v1/models | jq

Chat Completions

Non-streaming:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 64
  }' | jq

Streaming (Server-Sent Events):

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default", 
    "messages": [{"role": "user", "content": "Tell a short joke"}],
    "stream": true,
    "max_tokens": 64
  }'
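
The reply arrives as OpenAI-style SSE chunks. Exact payloads vary by server version, but a stream looks roughly like this (abridged):

data: {"choices":[{"index":0,"delta":{"content":"Why"}}]}

data: {"choices":[{"index":0,"delta":{"content":" did"}}]}

data: [DONE]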

Model Specification:

  • Use "model": "default" for configured model
  • Or specify exact model ID: "model": "gemma-3-1b-it"
  • Requests with unknown models will be rejected

Embeddings API

Generate text embeddings compatible with OpenAI's embeddings API.

Endpoint: POST /v1/embeddings

Request Body:

{
  "input": "Your text to embed",
  "model": "nomic-embed-text-v1.5"
}

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.1, 0.2, 0.3]
    }
  ],
  "model": "nomic-embed-text-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
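
A minimal curl invocation, mirroring the chat examples above:

curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Your text to embed",
    "model": "nomic-embed-text-v1.5"
  }' | jq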

Web Frontend

  • Navigate to http://localhost:8788
  • Real-time chat interface with the inference server
  • Supports streaming responses and conversation history

Testing

Test Categories

  1. Offline/fast tests: No network or model downloads required
  2. Online tests: Require HF authentication and model downloads
  3. Integration tests: Multi-service end-to-end testing

Quick Start: Offline Tests

Prompt formatting tests:

cargo test --workspace build_gemma_prompt

Model metadata tests:

cargo test --workspace which_

These verify core functionality without requiring HF access.

Full Test Suite (Requires HF)

Prerequisites:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate: huggingface-cli login or export HF_TOKEN=...
  3. Optional: export HF_HOME="$PWD/.hf-cache"

Run all tests:

cargo test --workspace

Integration Testing

End-to-end test script:

./smoke_test.sh

This script:

  • Starts the server in background with proper cleanup
  • Waits for server readiness via health checks
  • Runs CLI tests for model listing and chat completion
  • Includes 60-second timeout and process management

Development

Code Style and Tooling

Formatting:

cargo fmt --all

Linting:

cargo clippy --workspace --all-targets -- -D warnings

Logging:

  • Server uses tracing framework
  • Control via RUST_LOG (e.g., RUST_LOG=debug ./scripts/run_server.sh)

Adding Tests

For fast, offline tests:

  • Exercise pure logic without tokenizers/models
  • Use descriptive names for easy filtering: cargo test specific_test_name
  • Example patterns: prompt construction, metadata selection, tensor math
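
A minimal sketch of such a test (the helper and assertions here are hypothetical; adapt them to the crate under test):

#[cfg(test)]
mod tests {
    // Hypothetical helper: wraps a single user turn in Gemma chat markers.
    fn build_gemma_prompt(user: &str) -> String {
        format!("<start_of_turn>user\n{user}<end_of_turn>\n<start_of_turn>model\n")
    }

    #[test]
    fn build_gemma_prompt_wraps_user_turn() {
        let p = build_gemma_prompt("Say hello");
        assert!(p.starts_with("<start_of_turn>user\n"));
        assert!(p.contains("Say hello<end_of_turn>"));
        assert!(p.ends_with("<start_of_turn>model\n"));
    }
}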

Process:

  1. Add test to existing module
  2. Run filtered: cargo test --workspace new_test_name
  3. Verify in full suite: cargo test --workspace

OpenAI API Compatibility

Features:

  • POST /v1/chat/completions with streaming and non-streaming
  • Single configured model enforcement (use "model": "default")
  • Gemma-style prompt formatting with <start_of_turn>/<end_of_turn> markers (illustrated below)
  • System prompt injection into first user turn
  • Repetition detection and early stopping in streaming mode
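
For reference, a single exchange in this format looks roughly as follows; the system prompt, when present, is folded into the first user turn:

<start_of_turn>user
You are a helpful assistant.

What is the capital of France?<end_of_turn>
<start_of_turn>model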

CORS:

  • Fully open by default (tower-http CorsLayer with Any origins, methods, and headers)
  • Adjust for production deployment
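
A sketch of tightening CORS with tower-http (illustrative only; the origin and route here are placeholders, assuming the axum, http, and tower-http crates with the cors feature):

use axum::{routing::get, Router};
use http::HeaderValue;
use tower_http::cors::CorsLayer;

fn app() -> Router {
    // Allow a single known origin instead of the wide-open Any default.
    let cors = CorsLayer::new()
        .allow_origin("https://app.example.com".parse::<HeaderValue>().unwrap());
    Router::new()
        .route("/v1/models", get(|| async { "ok" }))
        .layer(cors)
}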

Architecture Details

Device Selection:

  • Automatic device/dtype selection
  • CPU: Universal fallback (F32 precision)
  • CUDA: BF16 precision on compatible GPUs
  • Metal: Available but routed to CPU for Gemma v3 stability
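
In candle terms, the policy above can be sketched like this (assuming candle-core; it mirrors the described behavior rather than quoting the project's code):

use candle_core::{DType, Device};

// Prefer CUDA with BF16; otherwise fall back to CPU with F32.
// Metal is deliberately skipped, matching the Gemma v3 note above.
fn select_device() -> (Device, DType) {
    match Device::cuda_if_available(0) {
        Ok(d) if d.is_cuda() => (d, DType::BF16),
        _ => (Device::Cpu, DType::F32),
    }
}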

Model Loading:

  • Single-file model.safetensors preferred
  • Falls back to index resolution via utilities_lib::hub_load_safetensors
  • HF cache populated on first access

Multi-Service Design:

  • Main server orchestrates inference and embeddings
  • Services can run independently for horizontal scaling
  • Docker/Kubernetes metadata included for deployment

Deployment

Docker Support

All services include Docker metadata in Cargo.toml:

Main Server:

  • Image: ghcr.io/geoffsee/predict-otron-9000:latest
  • Port: 8080

Inference Service:

  • Image: ghcr.io/geoffsee/inference-service:latest
  • Port: 8080

Embeddings Service:

  • Image: ghcr.io/geoffsee/embeddings-service:latest
  • Port: 8080

Web Frontend:

  • Image: ghcr.io/geoffsee/leptos-app:latest
  • Port: 8788

Docker Compose:

# Start all services
docker-compose up -d

# Check logs
docker-compose logs -f

# Stop services
docker-compose down

Kubernetes Support

All services include Kubernetes manifest metadata:

  • Single replica deployments by default
  • Service-specific port configurations
  • Ready for horizontal pod autoscaling

For Kubernetes deployment details, see the ARCHITECTURE.md document.

Build Artifacts

Ignored by Git:

  • target/ (Rust build artifacts)
  • node_modules/ (Node.js dependencies)
  • dist/ (Frontend build output)
  • .fastembed_cache/ (FastEmbed model cache)
  • .hf-cache/ (Hugging Face cache, if configured)

Common Issues and Solutions

Authentication/Licensing

Symptom: 404 or permission errors fetching models
Solution:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate with huggingface-cli login or HF_TOKEN
  3. Verify token with huggingface-cli whoami

GPU Issues

Symptom: OOM errors or GPU panics
Solution:

  1. Test on CPU first: set CUDA_VISIBLE_DEVICES="" to hide GPUs (see below)
  2. Check available VRAM vs model requirements
  3. Consider using smaller model variants
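
For example, to force a CPU-only run (CUDA_VISIBLE_DEVICES is the standard NVIDIA mechanism for hiding GPUs):

CUDA_VISIBLE_DEVICES="" ./scripts/run_server.sh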

Model Mismatch Errors

Symptom: 400 errors with type=model_mismatch
Solution:

  • Use "model": "default" in API requests
  • Or match configured model ID exactly: "model": "gemma-3-1b-it"

Frontend Build Issues

Symptom: WASM compilation failures
Solution:

  1. Install required targets: rustup target add wasm32-unknown-unknown
  2. Check RUSTFLAGS in leptos-app/run.sh

Network/Timeout Issues

Symptom: First-time model downloads timing out
Solution:

  1. Ensure stable internet connection
  2. Consider using local HF cache: export HF_HOME="$PWD/.hf-cache"
  3. Download models manually with huggingface-cli
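
For example (the repository ID below is the default model; adjust as needed):

HF_HOME="$PWD/.hf-cache" huggingface-cli download google/gemma-3-1b-it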

Minimal End-to-End Verification

Build verification:

cargo build --workspace --release

Fast offline tests:

cargo test --workspace build_gemma_prompt
cargo test --workspace which_

Service startup:

./scripts/run_server.sh &
sleep 10  # Wait for server startup
curl -s http://localhost:8080/v1/models | jq

CLI client test:

bun run scripts/cli.ts "What is 2+2?"

Web frontend:

cd crates/leptos-app && ./run.sh &
# Navigate to http://localhost:8788

Integration test:

./smoke_test.sh

Cleanup:

pkill -f "predict-otron-9000"

For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.

Further Reading

  • ARCHITECTURE.md: system architecture and Kubernetes deployment details

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Ensure all tests pass: cargo test
  5. Submit a pull request

Warning: Do NOT use this in production unless you are cool like that.