predict-otron-9000
AI inference server with an OpenAI-compatible API (limited features)
> This project is an educational aid for bootstrapping my understanding of language model inference at the lowest levels I can, serving as a "rubber-duck" solution for Kubernetes-based, performance-oriented inference on air-gapped networks.
Isolating application behaviors into crate-level components keeps the feedback loop for validation and integration short, which smooths the learning curve for building scalable AI systems. Stability is currently best-effort, and many models require model-specific configuration. Once stability is achieved, this project will be promoted to the seemueller-io GitHub organization under a different name.
A comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces.
Quick start:
./scripts/run.sh
Project Overview
The predict-otron-9000 is a flexible AI platform that provides:
- Local LLM Inference: Run Gemma and Llama models locally with CPU or GPU acceleration
- Embeddings Generation: Create text embeddings with FastEmbed
- Web Interface: Interact with models through a Leptos WASM chat interface
- TypeScript CLI: Command-line client for testing and automation
- Production Deployment: Docker and Kubernetes deployment options
The system supports both CPU and GPU acceleration (CUDA/Metal), with intelligent fallbacks and platform-specific optimizations.
Features
- OpenAI Compatible: API endpoints match OpenAI's format for easy integration
- Text Embeddings: Generate high-quality text embeddings using FastEmbed
- Text Generation: Chat completions with OpenAI-compatible API using Gemma and Llama models (various sizes including instruction-tuned variants)
- Performance Optimized: Efficient caching and platform-specific optimizations for improved throughput
- Web Chat Interface: Leptos chat interface
- Flexible Deployment: Run as monolithic service or microservices architecture
Architecture Overview
Workspace Structure
The project uses a 9-crate Rust workspace plus TypeScript components:
crates/
├── predict-otron-9000/ # Main orchestration server (Rust 2024)
├── inference-engine/ # Multi-model inference orchestrator (Rust 2021)
├── embeddings-engine/ # FastEmbed embeddings service (Rust 2024)
└── chat-ui/ # WASM web frontend (Rust 2021)
integration/
├── cli/ # CLI client crate (Rust 2024)
│ └── package/
│ └── cli.ts # TypeScript/Bun CLI client
├── gemma-runner/ # Gemma model inference via Candle (Rust 2021)
├── llama-runner/ # Llama model inference via Candle (Rust 2021)
├── helm-chart-tool/ # Kubernetes deployment tooling (Rust 2024)
└── utils/ # Shared utilities (Rust 2021)
Service Architecture
- Main Server (port 8080): Orchestrates inference and embeddings services
- Embeddings Service (port 8080): Standalone FastEmbed service with OpenAI API compatibility
- Web Frontend (port 8788): chat-ui WASM app
- CLI Client: TypeScript/Bun client for testing and automation
Deployment Modes
The architecture supports multiple deployment patterns:
- Development Mode: All services run in a single process for simplified development
- Docker Monolithic: Single containerized service handling all functionality
- Kubernetes Microservices: Separate services for horizontal scalability and fault isolation
Build and Configuration
Dependencies and Environment Prerequisites
Rust Toolchain
- Editions: Mixed - main services use Rust 2024, some components use 2021
- Recommended: Latest stable Rust toolchain:
rustup default stable && rustup update
- Developer tools:
rustup component add rustfmt (formatting)
rustup component add clippy (linting)
Node.js/Bun Toolchain
- Bun: Required for TypeScript CLI client:
curl -fsSL https://bun.sh/install | bash
- Node.js: Alternative to Bun, supports OpenAI SDK v5.16.0+
ML Framework Dependencies
- Candle: Version 0.9.1 with conditional compilation:
- macOS: Metal support with CPU fallback for stability
- Linux: CUDA support with CPU fallback
- CPU-only: Supported on all platforms
- FastEmbed: Version 4.x for embeddings functionality
Hugging Face Access
- Required for: Gemma model downloads (gated models)
- Authentication:
- CLI:
pip install -U "huggingface_hub[cli]" && huggingface-cli login
- Environment:
export HF_TOKEN="<your_token>"
- Cache management (optional, keeps the cache local):
export HF_HOME="$PWD/.hf-cache"
- Model access: Accept Gemma model licenses on Hugging Face before use
Platform-Specific Notes
- macOS: Metal acceleration available but routed to CPU for Gemma v3 stability
- Linux: CUDA support with BF16 precision on GPU, F32 on CPU
- Conditional compilation: Handled automatically per platform in Cargo.toml
Build Procedures
Full Workspace Build
cargo build --workspace --release
Individual Services
Main Server:
cargo build --bin predict-otron-9000 --release
Inference Engine CLI:
cargo build --bin cli --package inference-engine --release
Embeddings Service:
cargo build --bin embeddings-engine --release
Running Services
Main Server (Port 8080)
./scripts/run_server.sh
- Respects SERVER_PORT (default: 8080) and RUST_LOG (default: info)
- Boots with the default model: gemma-3-1b-it
- Requires HF authentication for first-time model download
Web Frontend (Port 8788)
cd crates/chat-ui
./run.sh
- Serves chat-ui WASM frontend on port 8788
- Sets required RUSTFLAGS for WebAssembly getrandom support
- Auto-reloads during development
TypeScript CLI Client
# List available models
cd integration/cli/package && bun run cli.ts --list-models
# Chat completion
cd integration/cli/package && bun run cli.ts "What is the capital of France?"
# With specific model
cd integration/cli/package && bun run cli.ts --model gemma-3-1b-it --prompt "Hello, world!"
# Show help
cd integration/cli/package && bun run cli.ts --help
API Usage
Health Checks and Model Inventory
curl -s http://localhost:8080/v1/models | jq
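Models can also be listed programmatically with the OpenAI SDK (mentioned under the Node.js toolchain above). A minimal sketch, assuming a server on localhost:8080 and that no API key is enforced locally:

```typescript
// Minimal sketch: list the models the server exposes.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: the local server does not check API keys
});

for await (const model of client.models.list()) {
  console.log(model.id);
}
```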
Chat Completions
Non-streaming:
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 64
}' | jq
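The same request works through the OpenAI SDK pointed at the local server; a minimal sketch under the same assumptions as above (dummy API key, server on localhost:8080):

```typescript
// Minimal sketch: non-streaming chat completion against the local server.
// Assumes `bun add openai` (or `npm install openai`).
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: no API key enforcement locally
});

const completion = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Say hello" }],
  max_tokens: 64,
});

console.log(completion.choices[0].message.content);
```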
Streaming (Server-Sent Events):
curl -N http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Tell a short joke"}],
"stream": true,
"max_tokens": 64
}'
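To consume the SSE stream without hand-parsing events, the SDK's async iterator should work against this endpoint as well (a sketch under the same assumptions):

```typescript
// Minimal sketch: streaming chat completion, printing tokens as they arrive.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: no API key enforcement locally
});

const stream = await client.chat.completions.create({
  model: "default",
  messages: [{ role: "user", content: "Tell a short joke" }],
  stream: true,
  max_tokens: 64,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```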
Model Specification:
- Use "model": "default" for the configured model
- Or specify the exact model ID: "model": "gemma-3-1b-it"
- Requests for unknown models are rejected
Embeddings API
Generate text embeddings compatible with OpenAI's embeddings API.
Endpoint: POST /v1/embeddings
Request Body:
{
"input": "Your text to embed",
"model": "nomic-embed-text-v1.5"
}
Response:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.1, 0.2, 0.3]
}
],
"model": "nomic-embed-text-v1.5",
"usage": {
"prompt_tokens": 0,
"total_tokens": 0
}
}
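Because the endpoint mirrors OpenAI's embeddings API, the SDK's embeddings call should work here too. A minimal sketch, assuming the same local server and dummy key as in the chat examples:

```typescript
// Minimal sketch: generate an embedding via the OpenAI-compatible endpoint.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // assumed: no API key enforcement locally
});

const res = await client.embeddings.create({
  model: "nomic-embed-text-v1.5",
  input: "Your text to embed",
});

console.log(res.data[0].embedding.length); // vector dimensionality
```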
Web Frontend
- Navigate to
http://localhost:8788
- Real-time chat interface with the inference server
- Supports streaming responses and conversation history
Testing
Test Categories
- Offline/fast tests: No network or model downloads required
- Online tests: Require HF authentication and model downloads
- Integration tests: Multi-service end-to-end testing
Quick Start: Offline Tests
Prompt formatting tests:
cargo test --workspace build_gemma_prompt
Model metadata tests:
cargo test --workspace which_
These verify core functionality without requiring HF access.
Full Test Suite (Requires HF)
Prerequisites:
- Accept Gemma model licenses on Hugging Face
- Authenticate: huggingface-cli login or export HF_TOKEN=...
- Optional:
export HF_HOME="$PWD/.hf-cache"
Run all tests:
cargo test --workspace
Integration Testing
End-to-end test script:
./scripts/smoke_test.sh
This script:
- Starts the server in background with proper cleanup
- Waits for server readiness via health checks
- Runs CLI tests for model listing and chat completion
- Includes 60-second timeout and process management
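The readiness-wait step can be reproduced in a few lines when scripting your own checks. This is an illustrative sketch, not the script's actual implementation:

```typescript
// Illustrative sketch: poll the models endpoint until the server answers.
async function waitForServer(url: string, timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url);
      if (res.ok) return; // server is up and answering
    } catch {
      // connection refused: server still starting
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error(`Server at ${url} not ready within ${timeoutMs} ms`);
}

await waitForServer("http://localhost:8080/v1/models");
```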
Development
Code Style and Tooling
Formatting:
cargo fmt --all
Linting:
cargo clippy --workspace --all-targets -- -D warnings
Logging:
- The server uses the tracing framework
- Control verbosity via RUST_LOG (e.g., RUST_LOG=debug ./scripts/run_server.sh)
Adding Tests
For fast, offline tests:
- Exercise pure logic without tokenizers/models
- Use descriptive names for easy filtering:
cargo test specific_test_name
- Example patterns: prompt construction, metadata selection, tensor math
Process:
- Add test to existing module
- Run filtered:
cargo test --workspace new_test_name
- Verify in full suite:
cargo test --workspace
OpenAI API Compatibility
Features:
- POST /v1/chat/completions with streaming and non-streaming responses
- Single configured model enforcement (use "model": "default")
- Gemma-style prompt formatting with <start_of_turn>/<end_of_turn> markers
- System prompt injection into the first user turn
- Repetition detection and early stopping in streaming mode
CORS:
- Fully open by default (tower-http CorsLayer::Any)
- Adjust for production deployment
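To make the prompt-formatting behavior concrete, here is an illustrative sketch of how messages might be folded into Gemma's turn markers, with the system prompt injected into the first user turn. The helper is hypothetical; the real logic lives in the inference engine (see the build_gemma_prompt tests):

```typescript
// Hypothetical helper illustrating Gemma-style prompt construction.
type Msg = { role: "system" | "user" | "assistant"; content: string };

function buildGemmaPrompt(messages: Msg[]): string {
  const system = messages.find((m) => m.role === "system")?.content;
  let systemInjected = false;
  const parts: string[] = [];

  for (const m of messages) {
    if (m.role === "system") continue; // folded into the first user turn below
    const speaker = m.role === "assistant" ? "model" : "user";
    let content = m.content;
    if (system && !systemInjected && m.role === "user") {
      content = `${system}\n\n${content}`; // inject the system prompt once
      systemInjected = true;
    }
    parts.push(`<start_of_turn>${speaker}\n${content}<end_of_turn>`);
  }

  parts.push("<start_of_turn>model"); // cue the model to produce its turn
  return parts.join("\n");
}
```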
Architecture Details
Device Selection:
- Automatic device/dtype selection
- CPU: Universal fallback (F32 precision)
- CUDA: BF16 precision on compatible GPUs
- Metal: Available but routed to CPU for Gemma v3 stability
Model Loading:
- Single-file model.safetensors preferred
- Falls back to index resolution via utilities_lib::hub_load_safetensors
- HF cache populated on first access
Multi-Service Design:
- Main server orchestrates inference and embeddings
- Services can run independently for horizontal scaling
- Docker/Kubernetes metadata included for deployment
Deployment
Docker Support
All services include Docker metadata in Cargo.toml:
Main Server:
- Image:
ghcr.io/geoffsee/predict-otron-9000:latest
- Port: 8080
Inference Service:
- Image:
ghcr.io/geoffsee/inference-service:latest
- Port: 8080
Embeddings Service:
- Image:
ghcr.io/geoffsee/embeddings-service:latest
- Port: 8080
Web Frontend:
- Image:
ghcr.io/geoffsee/chat-ui:latest
- Port: 8788
Docker Compose:
# Start all services
docker-compose up -d
# Check logs
docker-compose logs -f
# Stop services
docker-compose down
Kubernetes Support
All services include Kubernetes manifest metadata:
- Single replica deployments by default
- Service-specific port configurations
- Ready for horizontal pod autoscaling
For Kubernetes deployment details, see the ARCHITECTURE.md document.
Build Artifacts
Ignored by Git:
- target/ (Rust build artifacts)
- node_modules/ (Node.js dependencies)
- dist/ (Frontend build output)
- .fastembed_cache/ (FastEmbed model cache)
- .hf-cache/ (Hugging Face cache, if configured)
Common Issues and Solutions
Authentication/Licensing
Symptom: 404 or permission errors fetching models
Solution:
- Accept Gemma model licenses on Hugging Face
- Authenticate with huggingface-cli login or HF_TOKEN
- Verify the token with huggingface-cli whoami
GPU Issues
Symptom: OOM errors or GPU panics
Solution:
- Test on CPU first: set CUDA_VISIBLE_DEVICES="" if needed
- Check available VRAM against model requirements
- Consider using smaller model variants
Model Mismatch Errors
Symptom: 400 errors with type=model_mismatch
Solution:
- Use "model": "default" in API requests
- Or match the configured model ID exactly: "model": "gemma-3-1b-it"
Frontend Build Issues
Symptom: WASM compilation failures
Solution:
- Install the required target:
rustup target add wasm32-unknown-unknown
- Check RUSTFLAGS in chat-ui/run.sh
Network/Timeout Issues
Symptom: First-time model downloads timing out
Solution:
- Ensure stable internet connection
- Consider using local HF cache:
export HF_HOME="$PWD/.hf-cache"
- Download models manually with huggingface-cli
Minimal End-to-End Verification
Build verification:
cargo build --workspace --release
Fast offline tests:
cargo test --workspace build_gemma_prompt
cargo test --workspace which_
Service startup:
./scripts/run_server.sh &
sleep 10 # Wait for server startup
curl -s http://localhost:8080/v1/models | jq
CLI client test:
cd integration/cli/package && bun run cli.ts "What is 2+2?"
Web frontend:
cd crates/chat-ui && ./run.sh &
# Navigate to http://localhost:8788
Integration test:
./scripts/smoke_test.sh
Cleanup:
pkill -f "predict-otron-9000"
For networked tests and full functionality, ensure Hugging Face authentication is configured as described above.
Further Reading
Documentation
- Architecture - Detailed architectural diagrams and deployment patterns
- Server Configuration Guide - Detailed server configuration options
- Testing Documentation - Comprehensive testing guide
- Performance Benchmarking - Instructions for benchmarking
Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature-name
- Make your changes and add tests
- Ensure all tests pass:
cargo test
- Submit a pull request
Warning: Do NOT use this in production unless you are cool like that.