# Predict-Otron-9000 Architecture Documentation

This document provides comprehensive architectural diagrams for the Predict-Otron-9000 multi-service AI platform, showing all supported configurations and deployment patterns.

## Table of Contents

- [System Overview](#system-overview)
- [Workspace Structure](#workspace-structure)
- [Deployment Configurations](#deployment-configurations)
  - [Development Mode](#development-mode)
  - [Docker Monolithic](#docker-monolithic)
  - [Kubernetes Microservices](#kubernetes-microservices)
- [Service Interactions](#service-interactions)
- [Platform-Specific Configurations](#platform-specific-configurations)
- [Data Flow Patterns](#data-flow-patterns)

## System Overview

The Predict-Otron-9000 is a multi-service AI platform built around local LLM inference, embeddings, and web interfaces. The system supports flexible deployment patterns, from a single monolithic process to distributed microservices.

```mermaid
graph TB
    subgraph "Core Components"
        A[Main Server<br/>predict-otron-9000]
        B[Inference Engine<br/>Gemma/Llama via Candle]
        C[Embeddings Engine<br/>FastEmbed]
        D[Web Frontend<br/>Leptos WASM]
    end

    subgraph "Client Interfaces"
        E[TypeScript CLI<br/>Bun/Node.js]
        F[Web Browser<br/>HTTP/WebSocket]
        G[HTTP API Clients<br/>OpenAI Compatible]
    end

    subgraph "Platform Support"
        H[CPU Fallback<br/>All Platforms]
        I[CUDA Support<br/>Linux GPU]
        J[Metal Support<br/>macOS GPU]
    end

    A --- B
    A --- C
    A --- D
    E -.-> A
    F -.-> A
    G -.-> A
    B --- H
    B --- I
    B --- J
```
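Because the main server exposes an OpenAI-compatible HTTP API on port 8080, any standard OpenAI client can talk to it directly. The sketch below uses the official `openai` npm package (the same SDK the TypeScript CLI builds on); the base URL reflects the default port shown above, while the model name and the unused API key are placeholder assumptions — substitute whatever model your deployment actually serves.

```typescript
import OpenAI from "openai";

// Point the standard OpenAI SDK at the local predict-otron-9000 server.
// The base URL assumes the default port 8080; the key is a placeholder since a
// local deployment typically does not validate it (assumption).
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed",
});

// Non-streaming chat completion against the embedded inference engine.
const completion = await client.chat.completions.create({
  model: "placeholder-model", // substitute a model served by your deployment
  messages: [{ role: "user", content: "Hello from the architecture docs!" }],
});

console.log(completion.choices[0].message.content);
```

The same snippet works unchanged against the Docker and Kubernetes deployments described below; only the base URL changes.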
## Workspace Structure

The project uses a 9-crate Rust workspace with TypeScript tooling, designed for maximum flexibility in deployment configurations.

```mermaid
graph TD
    subgraph "Rust Workspace"
        subgraph "Main Orchestrator"
            A[predict-otron-9000<br/>Edition: 2024<br/>Port: 8080]
        end

        subgraph "AI Services"
            B[inference-engine<br/>Edition: 2021<br/>Port: 8080<br/>Multi-model orchestrator]
            J[gemma-runner<br/>Edition: 2021<br/>Gemma via Candle]
            K[llama-runner<br/>Edition: 2021<br/>Llama via Candle]
            C[embeddings-engine<br/>Edition: 2024<br/>Port: 8080<br/>FastEmbed]
        end

        subgraph "Frontend"
            D[chat-ui<br/>Edition: 2021<br/>Port: 8788<br/>WASM UI]
        end

        subgraph "Tooling"
            L[helm-chart-tool<br/>Edition: 2024<br/>K8s deployment]
            E[cli<br/>Edition: 2024<br/>TypeScript/Bun CLI]
        end
    end

    subgraph "Dependencies"
        A --> B
        A --> C
        A --> D
        B --> J
        B --> K
        J -.-> F[Candle 0.9.1]
        K -.-> F
        C -.-> G[FastEmbed 4.x]
        D -.-> H[Leptos 0.8.0]
        E -.-> I[OpenAI SDK 5.16+]
    end

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style J fill:#f3e5f5
    style K fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style L fill:#fff9c4
```

## Deployment Configurations

### Development Mode

Local development runs all services integrated within the main server for simplicity.

```mermaid
graph LR
    subgraph "Development Environment"
        subgraph "Single Process - Port 8080"
            A[predict-otron-9000 Server]
            A --> B[Embedded Inference Engine]
            A --> C[Embedded Embeddings Engine]
            A --> D[SSR Leptos Frontend]
        end

        E[Leptos Dev Server<br/>:8788]
    end

    subgraph "External Clients"
        F[CLI Client<br/>cli.ts via Bun]
        G[Web Browser]
        H[HTTP API Clients]
    end

    F -.-> A
    G -.-> A
    G -.-> E
    H -.-> A

    style A fill:#e3f2fd
    style E fill:#f1f8e9
```
### Docker Monolithic

Docker Compose runs a single containerized service handling all functionality.

```mermaid
graph TB
    subgraph "Docker Environment"
        subgraph "predict-otron-9000 Container"
            A[Main Server :8080]
            A --> B[Inference Engine<br/>Library Mode]
            A --> C[Embeddings Engine<br/>Library Mode]
            A --> D[Leptos Frontend<br/>SSR Mode]
        end

        subgraph "Persistent Storage"
            E[HF Cache Volume<br/>/.hf-cache]
            F[FastEmbed Cache Volume<br/>/.fastembed_cache]
        end

        subgraph "Network"
            G[predict-otron-network<br/>Bridge Driver]
        end
    end

    subgraph "External Access"
        H[Host Port 8080]
        I[External Clients]
    end

    A --- E
    A --- F
    A --- G
    H --> A
    I -.-> H

    style A fill:#e8f5e8
    style E fill:#fff3e0
    style F fill:#fff3e0
```

### Kubernetes Microservices

Kubernetes deployment separates all services for horizontal scalability and fault isolation.

```mermaid
graph TB
    subgraph "Kubernetes Namespace"
        subgraph "Main Orchestrator"
            A[predict-otron-9000 Pod<br/>:8080<br/>ClusterIP Service]
        end

        subgraph "AI Services"
            B[inference-engine Pod<br/>:8080<br/>ClusterIP Service]
            C[embeddings-engine Pod<br/>:8080<br/>ClusterIP Service]
        end

        subgraph "Frontend"
            D[chat-ui Pod<br/>:8788<br/>ClusterIP Service]
        end

        subgraph "Ingress"
            E[Ingress Controller<br/>predict-otron-9000.local]
        end
    end

    subgraph "External"
        F[External Clients]
        G[Container Registry<br/>ghcr.io/geoffsee/*]
    end

    A <--> B
    A <--> C
    E --> A
    E --> D
    F -.-> E
    G -.-> A
    G -.-> B
    G -.-> C
    G -.-> D

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
```

## Service Interactions

### API Flow and Communication Patterns

```mermaid
sequenceDiagram
    participant Client as External Client
    participant Main as Main Server<br/>(Port 8080)
    participant Inf as Inference Engine
    participant Emb as Embeddings Engine
    participant Web as Web Frontend

    Note over Client, Web: Development/Monolithic Mode
    Client->>Main: POST /v1/chat/completions
    Main->>Inf: Internal call (library)
    Inf-->>Main: Generated response
    Main-->>Client: Streaming/Non-streaming response

    Client->>Main: POST /v1/embeddings
    Main->>Emb: Internal call (library)
    Emb-->>Main: Vector embeddings
    Main-->>Client: Embeddings response

    Note over Client, Web: Kubernetes Microservices Mode
    Client->>Main: POST /v1/chat/completions
    Main->>Inf: HTTP POST :8080/v1/chat/completions
    Inf-->>Main: HTTP Response (streaming)
    Main-->>Client: Proxied response

    Client->>Main: POST /v1/embeddings
    Main->>Emb: HTTP POST :8080/v1/embeddings
    Emb-->>Main: HTTP Response
    Main-->>Client: Proxied response

    Note over Client, Web: Web Interface Flow
    Web->>Main: WebSocket connection
    Web->>Main: Chat message
    Main->>Inf: Process inference
    Inf-->>Main: Streaming tokens
    Main-->>Web: WebSocket stream
```
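Both client-facing flows in the diagram correspond to standard OpenAI SDK calls. The sketch below issues a streaming chat completion and an embeddings request against the main server; only the endpoint paths and port come from the diagram, while the base URL, API-key handling, and model names are assumptions.

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1", // main server; in Kubernetes this is reached via the ingress
  apiKey: "not-needed",                // placeholder (assumption: key is not validated locally)
});

// Streaming chat completion: POST /v1/chat/completions with stream: true.
// Tokens are forwarded as they are generated (or proxied from the inference engine).
const stream = await client.chat.completions.create({
  model: "placeholder-model",
  messages: [{ role: "user", content: "Summarize the architecture in one sentence." }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

// Embeddings: POST /v1/embeddings, served by the FastEmbed-backed embeddings engine.
const embeddings = await client.embeddings.create({
  model: "placeholder-embedding-model",
  input: ["local llm inference", "vector embeddings"],
});
console.log(`\n${embeddings.data[0].embedding.length}-dimensional vectors returned`);
```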
### Port Configuration Matrix

```mermaid
graph TB
    subgraph "Port Allocation by Mode"
        subgraph "Development"
            A[Main Server: 8080<br/>All services embedded]
            B[Leptos Dev: 8788<br/>Hot reload: 3001]
        end

        subgraph "Docker Monolithic"
            C[Main Server: 8080<br/>All services embedded<br/>Host mapped]
        end

        subgraph "Kubernetes Microservices"
            D[Main Server: 8080]
            E[Inference Engine: 8080]
            F[Embeddings Engine: 8080]
            G[Leptos Frontend: 8788]
        end
    end

    style A fill:#e3f2fd
    style C fill:#e8f5e8
    style D fill:#f3e5f5
    style E fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```
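From a client's point of view, the matrix reduces to one question: which base URL should an OpenAI-compatible client be pointed at? A minimal sketch, assuming plain HTTP and the hostnames implied by the diagrams (only the ports and the `predict-otron-9000.local` ingress host come from this document):

```typescript
// Hypothetical per-mode base URLs for OpenAI-compatible clients.
// Ports and the ingress hostname are taken from the diagrams in this document;
// the scheme and localhost assumptions are not.
type DeploymentMode = "development" | "docker" | "kubernetes";

const API_BASE_URLS: Record<DeploymentMode, string> = {
  development: "http://localhost:8080/v1",          // main server, all services embedded
  docker: "http://localhost:8080/v1",               // same port, host-mapped from the container
  kubernetes: "http://predict-otron-9000.local/v1", // routed through the ingress controller
};

export function apiBaseUrl(mode: DeploymentMode): string {
  return API_BASE_URLS[mode];
}
```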
## Platform-Specific Configurations

### Hardware Acceleration Support

```mermaid
graph TB
    subgraph "Platform Detection"
        A[Build System]
    end

    subgraph "macOS"
        A --> B[Metal Features Available]
        B --> C[CPU Fallback<br/>Stability Priority]
        C --> D[F32 Precision<br/>Gemma Compatibility]
    end

    subgraph "Linux"
        A --> E[CUDA Features Available]
        E --> F[GPU Acceleration<br/>Performance Priority]
        F --> G[BF16 Precision<br/>GPU Optimized]
        E --> H[CPU Fallback<br/>F32 Precision]
    end

    subgraph "Other Platforms"
        A --> I[CPU Only<br/>Universal Compatibility]
        I --> J[F32 Precision<br/>Standard Support]
    end

    style B fill:#e8f5e8
    style E fill:#e3f2fd
    style I fill:#fff3e0
```

### Model Loading and Caching

```mermaid
graph LR
    subgraph "Model Access Flow"
        A[Application Start] --> B{Model Cache Exists?}
        B -->|Yes| C[Load from Cache]
        B -->|No| D[HuggingFace Authentication]
        D --> E{HF Token Valid?}
        E -->|Yes| F[Download Model]
        E -->|No| G[Authentication Error]
        F --> H[Save to Cache]
        H --> C
        C --> I[Initialize Inference]
    end

    subgraph "Cache Locations"
        J[HF_HOME Cache<br/>.hf-cache]
        K[FastEmbed Cache<br/>.fastembed_cache]
    end

    F -.-> J
    F -.-> K

    style D fill:#fce4ec
    style G fill:#ffebee
```
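Once the first download completes and the model is cached, a quick readiness check is to list models through the OpenAI-compatible API. A minimal sketch, assuming the server exposes the standard `/v1/models` listing endpoint (consistent with the "Model List" branch in the request pipeline below) on the default port:

```typescript
import OpenAI from "openai";

// Readiness probe: if the model cache and HuggingFace authentication are in
// order, the server should answer the standard model-listing endpoint.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed", // placeholder (assumption: not validated locally)
});

const models = await client.models.list();
for (const model of models.data) {
  console.log("available:", model.id);
}
```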
## Data Flow Patterns

### Request Processing Pipeline

```mermaid
flowchart TD
    A[Client Request] --> B{Request Type}
    B -->|Chat Completion| C[Parse Messages]
    B -->|Model List| D[Return Available Models]
    B -->|Embeddings| E[Process Text Input]

    C --> F[Apply Prompt Template]
    F --> G{Streaming?}
    G -->|Yes| H[Initialize Stream]
    G -->|No| I[Generate Complete Response]

    H --> J[Token Generation Loop]
    J --> K[Send Chunk]
    K --> L{More Tokens?}
    L -->|Yes| J
    L -->|No| M[End Stream]

    I --> N[Return Complete Response]

    E --> O[Generate Embeddings]
    O --> P[Return Vectors]

    D --> Q[Return Model Metadata]

    style A fill:#e3f2fd
    style H fill:#e8f5e8
    style I fill:#f3e5f5
    style O fill:#fff3e0
```

### Authentication and Security Flow

```mermaid
sequenceDiagram
    participant User as User/Client
    participant App as Application
    participant HF as HuggingFace Hub
    participant Model as Model Cache

    Note over User, Model: First-time Setup
    User->>App: Start application
    App->>HF: Check model access (gated)
    HF-->>App: 401 Unauthorized
    App-->>User: Requires HF authentication
    User->>User: huggingface-cli login
    User->>App: Retry start
    App->>HF: Check model access (with token)
    HF-->>App: 200 OK + model metadata
    App->>HF: Download model files
    HF-->>App: Model data stream
    App->>Model: Cache model locally

    Note over User, Model: Subsequent Runs
    User->>App: Start application
    App->>Model: Load cached model
    Model-->>App: Ready for inference
```

---

## Summary

The Predict-Otron-9000 architecture provides maximum flexibility through:

- **Monolithic Mode**: Single server embedding all services for development and simple deployments
- **Microservices Mode**: Separate services for production scalability and fault isolation
- **Hybrid Capabilities**: Each service can operate as both a library and a standalone service
- **Platform Optimization**: Conditional compilation for optimal performance across CPU/GPU configurations
- **OpenAI Compatibility**: Standard API interfaces for seamless integration with existing tools

This flexible architecture allows teams to start with simple monolithic deployments and scale to distributed microservices as needs grow, all while maintaining API compatibility and leveraging platform-specific optimizations.