predict-otron-9000

Powerful local AI inference with OpenAI-compatible APIs


This project is an educational aid for building my understanding of language-model inference at the lowest levels I can manage, and a "rubber-duck" testbed for Kubernetes-based, performance-oriented inference on air-gapped networks.

Isolating application behavior into crate-level components keeps development to a short feedback loop for validation and integration, which smooths the learning curve for scalable AI systems. Stability is currently best-effort; many models require unique configuration. Once stability is achieved, this project will be promoted to the seemueller-io GitHub organization under a different name.

A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.

Project Overview

The predict-otron-9000 is a flexible AI platform that provides:

  • Local LLM Inference: Run Gemma and Llama models locally with CPU or GPU acceleration
  • Embeddings Generation: Create text embeddings with FastEmbed
  • Web Interface: Interact with models through a Leptos WASM chat interface
  • TypeScript CLI: Command-line client for testing and automation
  • Production Deployment: Docker and Kubernetes deployment options

The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.

Features

  • OpenAI Compatible: API endpoints match OpenAI's format for easy integration
  • Text Embeddings: Generate high-quality text embeddings using FastEmbed
  • Text Generation: Chat completions with OpenAI-compatible API using Gemma and Llama models (various sizes including instruction-tuned variants)
  • Performance Optimized: Efficient caching and platform-specific optimizations for improved throughput
  • Web Chat Interface: Leptos chat interface
  • Flexible Deployment: Run as monolithic service or microservices architecture

Architecture Overview

Workspace Structure

The project uses a 7-crate Rust workspace plus TypeScript components:

crates/
├── predict-otron-9000/     # Main orchestration server (Rust 2024)
├── inference-engine/       # Multi-model inference orchestrator (Rust 2021)
├── gemma-runner/           # Gemma model inference via Candle (Rust 2021)
├── llama-runner/           # Llama model inference via Candle (Rust 2021)
├── embeddings-engine/      # FastEmbed embeddings service (Rust 2024)
├── leptos-app/             # WASM web frontend (Rust 2021)
└── helm-chart-tool/        # Kubernetes deployment tooling (Rust 2024)
scripts/
└── cli.ts                  # TypeScript/Bun CLI client

Service Architecture

  • Main Server (port 8080): Orchestrates inference and embeddings services
  • Embeddings Service (port 8080): Standalone FastEmbed service with OpenAI API compatibility
  • Web Frontend (port 8788): cargo leptos SSR app
  • CLI Client: TypeScript/Bun client for testing and automation

Deployment Modes

The architecture supports multiple deployment patterns:

  1. Development Mode: All services run in a single process for simplified development
  2. Docker Monolithic: Single containerized service handling all functionality
  3. Kubernetes Microservices: Separate services for horizontal scalability and fault isolation

Build and Configuration

Dependencies and Environment Prerequisites

Rust Toolchain

  • Editions: Mixed - main services use Rust 2024, some components use 2021
  • Recommended: Latest stable Rust toolchain: rustup default stable && rustup update
  • Developer tools:
    • rustup component add rustfmt (formatting)
    • rustup component add clippy (linting)

Node.js/Bun Toolchain

  • Bun: Required for TypeScript CLI client: curl -fsSL https://bun.sh/install | bash
  • Node.js: Alternative to Bun; the CLI works with OpenAI SDK v5.16.0+

ML Framework Dependencies

  • Candle: Version 0.9.1 with conditional compilation:
    • macOS: Metal support with CPU fallback for stability
    • Linux: CUDA support with CPU fallback
    • CPU-only: Supported on all platforms
  • FastEmbed: Version 4.x for embeddings functionality

Hugging Face Access

  • Required for: Gemma model downloads (gated models)
  • Authentication:
    • CLI: pip install -U "huggingface_hub[cli]" && huggingface-cli login
    • Environment: export HF_TOKEN="<your_token>"
  • Cache management: export HF_HOME="$PWD/.hf-cache" (optional, keeps cache local)
  • Model access: Accept Gemma model licenses on Hugging Face before use

Platform-Specific Notes

  • macOS: Metal acceleration available but routed to CPU for Gemma v3 stability
  • Linux: CUDA support with BF16 precision on GPU, F32 on CPU
  • Conditional compilation: Handled automatically per platform in Cargo.toml
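
As a rough illustration, platform-conditional Candle features in a Cargo.toml can be wired like this (a sketch only; the feature names and layout are assumptions, not this repository's actual manifest):

# Hypothetical excerpt: Metal backend on macOS, CUDA backend on Linux.
[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "0.9.1", features = ["metal"] }

[target.'cfg(target_os = "linux")'.dependencies]
candle-core = { version = "0.9.1", features = ["cuda"] }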

Build Procedures

Full Workspace Build

cargo build --workspace --release

Individual Services

Main Server:

cargo build --bin predict-otron-9000 --release

Inference Engine CLI:

cargo build --bin cli --package inference-engine --release

Embeddings Service:

cargo build --bin embeddings-engine --release

Running Services

Main Server (Port 8080)

./scripts/run_server.sh
  • Respects SERVER_PORT (default: 8080) and RUST_LOG (default: info); see the example below
  • Boots with default model: gemma-3-1b-it
  • Requires HF authentication for first-time model download
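
For example, to run on a different port with verbose logging:

SERVER_PORT=8081 RUST_LOG=debug ./scripts/run_server.sh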

Web Frontend (Port 8788)

cd crates/leptos-app
./run.sh
  • Serves Leptos WASM frontend on port 8788
  • Sets required RUSTFLAGS for WebAssembly getrandom support
  • Auto-reloads during development

TypeScript CLI Client

# List available models
bun run scripts/cli.ts --list-models

# Chat completion
bun run scripts/cli.ts "What is the capital of France?"

# With specific model
bun run scripts/cli.ts --model gemma-3-1b-it --prompt "Hello, world!"

# Show help
bun run scripts/cli.ts --help

API Usage

Health Checks and Model Inventory

curl -s http://localhost:8080/v1/models | jq

Chat Completions

Non-streaming:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 64
  }' | jq

Streaming (Server-Sent Events):

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default", 
    "messages": [{"role": "user", "content": "Tell a short joke"}],
    "stream": true,
    "max_tokens": 64
  }'
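
The reply arrives as OpenAI-style SSE chunks. Exact payloads vary by server version, but a stream looks roughly like this (abridged):

data: {"choices":[{"index":0,"delta":{"content":"Why"}}]}

data: {"choices":[{"index":0,"delta":{"content":" did"}}]}

data: [DONE]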

Model Specification:

  • Use "model": "default" for configured model
  • Or specify exact model ID: "model": "gemma-3-1b-it"
  • Requests with unknown models will be rejected

Embeddings API

Generate text embeddings compatible with OpenAI's embeddings API.

Endpoint: POST /v1/embeddings

Request Body:

{
  "input": "Your text to embed",
  "model": "nomic-embed-text-v1.5"
}

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.1, 0.2, 0.3]
    }
  ],
  "model": "nomic-embed-text-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
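
A minimal curl invocation, mirroring the chat examples above:

curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Your text to embed",
    "model": "nomic-embed-text-v1.5"
  }' | jq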

Web Frontend

  • Navigate to http://localhost:8788
  • Real-time chat interface with the inference server
  • Supports streaming responses and conversation history

Testing

Test Categories

  1. Offline/fast tests: No network or model downloads required
  2. Online tests: Require HF authentication and model downloads
  3. Integration tests: Multi-service end-to-end testing

Quick Start: Offline Tests

Prompt formatting tests:

cargo test --workspace build_gemma_prompt

Model metadata tests:

cargo test --workspace which_

These verify core functionality without requiring HF access.

Full Test Suite (Requires HF)

Prerequisites:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate: huggingface-cli login or export HF_TOKEN=...
  3. Optional: export HF_HOME="$PWD/.hf-cache"

Run all tests:

cargo test --workspace

Integration Testing

End-to-end test script:

./smoke_test.sh

This script:

  • Starts the server in background with proper cleanup
  • Waits for server readiness via health checks
  • Runs CLI tests for model listing and chat completion
  • Includes 60-second timeout and process management

Development

Code Style and Tooling

Formatting:

cargo fmt --all

Linting:

cargo clippy --workspace --all-targets -- -D warnings

Logging:

  • Server uses tracing framework
  • Control via RUST_LOG (e.g., RUST_LOG=debug ./scripts/run_server.sh)

Adding Tests

For fast, offline tests:

  • Exercise pure logic without tokenizers/models
  • Use descriptive names for easy filtering: cargo test specific_test_name
  • Example patterns: prompt construction, metadata selection, tensor math
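
A minimal sketch of such a test (the helper and assertions here are hypothetical; adapt them to the crate under test):

#[cfg(test)]
mod tests {
    // Hypothetical helper: wraps a single user turn in Gemma chat markers.
    fn build_gemma_prompt(user: &str) -> String {
        format!("<start_of_turn>user\n{user}<end_of_turn>\n<start_of_turn>model\n")
    }

    #[test]
    fn build_gemma_prompt_wraps_user_turn() {
        let p = build_gemma_prompt("Say hello");
        assert!(p.starts_with("<start_of_turn>user\n"));
        assert!(p.contains("Say hello<end_of_turn>"));
        assert!(p.ends_with("<start_of_turn>model\n"));
    }
}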

Process:

  1. Add test to existing module
  2. Run filtered: cargo test --workspace new_test_name
  3. Verify in full suite: cargo test --workspace

OpenAI API Compatibility

Features:

  • POST /v1/chat/completions with streaming and non-streaming
  • Single configured model enforcement (use "model": "default")
  • Gemma-style prompt formatting with <start_of_turn>/<end_of_turn> markers (illustrated below)
  • System prompt injection into first user turn
  • Repetition detection and early stopping in streaming mode
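
For reference, a single exchange in this format looks roughly as follows; the system prompt, when present, is folded into the first user turn:

<start_of_turn>user
You are a helpful assistant.

What is the capital of France?<end_of_turn>
<start_of_turn>model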

CORS:

  • Fully open by default (tower-http CorsLayer with Any origins, methods, and headers)
  • Adjust for production deployment
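
A sketch of tightening CORS with tower-http (illustrative only; the origin and route here are placeholders, assuming the axum, http, and tower-http crates with the cors feature):

use axum::{routing::get, Router};
use http::HeaderValue;
use tower_http::cors::CorsLayer;

fn app() -> Router {
    // Allow a single known origin instead of the wide-open Any default.
    let cors = CorsLayer::new()
        .allow_origin("https://app.example.com".parse::<HeaderValue>().unwrap());
    Router::new()
        .route("/v1/models", get(|| async { "ok" }))
        .layer(cors)
}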

Architecture Details

Device Selection:

  • Automatic device/dtype selection
  • CPU: Universal fallback (F32 precision)
  • CUDA: BF16 precision on compatible GPUs
  • Metal: Available but routed to CPU for Gemma v3 stability
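
In candle terms, the policy above can be sketched like this (assuming candle-core; it mirrors the described behavior rather than quoting the project's code):

use candle_core::{DType, Device};

// Prefer CUDA with BF16; otherwise fall back to CPU with F32.
// Metal is deliberately skipped, matching the Gemma v3 note above.
fn select_device() -> (Device, DType) {
    match Device::cuda_if_available(0) {
        Ok(d) if d.is_cuda() => (d, DType::BF16),
        _ => (Device::Cpu, DType::F32),
    }
}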

Model Loading:

  • Single-file model.safetensors preferred
  • Falls back to index resolution via utilities_lib::hub_load_safetensors
  • HF cache populated on first access

Multi-Service Design:

  • Main server orchestrates inference and embeddings
  • Services can run independently for horizontal scaling
  • Docker/Kubernetes metadata included for deployment

Deployment

Docker Support

All services include Docker metadata in Cargo.toml:

Main Server:

  • Image: ghcr.io/geoffsee/predict-otron-9000:latest
  • Port: 8080

Inference Service:

  • Image: ghcr.io/geoffsee/inference-service:latest
  • Port: 8080

Embeddings Service:

  • Image: ghcr.io/geoffsee/embeddings-service:latest
  • Port: 8080

Web Frontend:

  • Image: ghcr.io/geoffsee/leptos-app:latest
  • Port: 8788

Docker Compose:

# Start all services
docker-compose up -d

# Check logs
docker-compose logs -f

# Stop services
docker-compose down

Kubernetes Support

All services include Kubernetes manifest metadata:

  • Single replica deployments by default
  • Service-specific port configurations
  • Ready for horizontal pod autoscaling

For Kubernetes deployment details, see the ARCHITECTURE.md document.

Build Artifacts

Ignored by Git:

  • target/ (Rust build artifacts)
  • node_modules/ (Node.js dependencies)
  • dist/ (Frontend build output)
  • .fastembed_cache/ (FastEmbed model cache)
  • .hf-cache/ (Hugging Face cache, if configured)

Common Issues and Solutions

Authentication/Licensing

Symptom: 404 or permission errors fetching models
Solution:

  1. Accept Gemma model licenses on Hugging Face
  2. Authenticate with huggingface-cli login or HF_TOKEN
  3. Verify token with huggingface-cli whoami

GPU Issues

Symptom: OOM errors or GPU panics
Solution:

  1. Test on CPU first: set CUDA_VISIBLE_DEVICES="" to hide GPUs (see below)
  2. Check available VRAM vs model requirements
  3. Consider using smaller model variants
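
For example, to force a CPU-only run (CUDA_VISIBLE_DEVICES is the standard NVIDIA mechanism for hiding GPUs):

CUDA_VISIBLE_DEVICES="" ./scripts/run_server.sh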

Model Mismatch Errors

Symptom: 400 errors with type=model_mismatch
Solution:

  • Use "model": "default" in API requests
  • Or match configured model ID exactly: "model": "gemma-3-1b-it"

Frontend Build Issues

Symptom: WASM compilation failures
Solution:

  1. Install required targets: rustup target add wasm32-unknown-unknown
  2. Check RUSTFLAGS in leptos-app/run.sh

Network/Timeout Issues

Symptom: First-time model downloads timing out
Solution:

  1. Ensure stable internet connection
  2. Consider using local HF cache: export HF_HOME="$PWD/.hf-cache"
  3. Download models manually with huggingface-cli
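
For example (the repository ID below is the default model; adjust as needed):

HF_HOME="$PWD/.hf-cache" huggingface-cli download google/gemma-3-1b-it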

Minimal End-to-End Verification

Build verification:

cargo build --workspace --release

Fast offline tests:

cargo test --workspace build_gemma_prompt
cargo test --workspace which_

Service startup:

./scripts/run_server.sh &
sleep 10  # Wait for server startup
curl -s http://localhost:8080/v1/models | jq

CLI client test:

bun run scripts/cli.ts "What is 2+2?"

Web frontend:

cd crates/leptos-app && ./run.sh &
# Navigate to http://localhost:8788

Integration test:

./smoke_test.sh

Cleanup:

pkill -f "predict-otron-9000"

For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.

Further Reading

  • ARCHITECTURE.md: system architecture and Kubernetes deployment details

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Ensure all tests pass: cargo test
  5. Submit a pull request

Warning: Do NOT use this in production unless you are cool like that.