diff --git a/README.md b/README.md
index 39583d2..ab76b80 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,16 @@ Aliens, in a native executable.
- **`predict-otron-9000`**: Main unified server that combines both engines
- **`embeddings-engine`**: Handles text embeddings using FastEmbed with the Nomic Embed Text v1.5 model
- **`inference-engine`**: Provides text generation capabilities using Gemma models (1B, 2B, 7B, 9B variants) via Candle transformers
-- **`leptos-chat`**: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine
+- **`leptos-app`**: WebAssembly-based chat interface built with Leptos framework for browser-based interaction with the inference engine
+
+## Further Reading
+
+### Documentation
+
+- [Architecture](docs/ARCHITECTURE.md) - Architectural diagrams covering all supported configurations and deployment patterns
+- [Server Configuration Guide](docs/SERVER_CONFIG.md) - Detailed server configuration options and deployment modes
+- [Testing Documentation](docs/TESTING.md) - Comprehensive testing guide including unit, integration and e2e tests
+- [Performance Benchmarking](docs/BENCHMARKING.md) - Instructions for running and analyzing performance benchmarks
## Installation
@@ -262,8 +271,8 @@ The project includes a WebAssembly-based chat interface built with the Leptos fr
### Building the Chat Interface
```shell
-# Navigate to the leptos-chat crate
-cd crates/leptos-chat
+# Navigate to the leptos-app crate
+cd crates/leptos-app
# Build the WebAssembly package
cargo build --target wasm32-unknown-unknown
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..4979b4b
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,430 @@
+# Predict-Otron-9000 Architecture Documentation
+
+This document provides comprehensive architectural diagrams for the Predict-Otron-9000 multi-service AI platform, showing all supported configurations and deployment patterns.
+
+## Table of Contents
+
+- [System Overview](#system-overview)
+- [Workspace Structure](#workspace-structure)
+- [Deployment Configurations](#deployment-configurations)
+ - [Development Mode](#development-mode)
+ - [Docker Monolithic](#docker-monolithic)
+ - [Kubernetes Microservices](#kubernetes-microservices)
+- [Service Interactions](#service-interactions)
+- [Platform-Specific Configurations](#platform-specific-configurations)
+- [Data Flow Patterns](#data-flow-patterns)
+
+## System Overview
+
+The Predict-Otron-9000 is a comprehensive multi-service AI platform built around local LLM inference, embeddings, and web interfaces. The system supports flexible deployment patterns from monolithic to microservices architectures.
+
+```mermaid
+graph TB
+ subgraph "Core Components"
+ A[Main Server<br/>predict-otron-9000]
+ B[Inference Engine<br/>Gemma via Candle]
+ C[Embeddings Engine<br/>FastEmbed]
+ D[Web Frontend<br/>Leptos WASM]
+ end
+
+ subgraph "Client Interfaces"
+ E[TypeScript CLI<br/>Bun/Node.js]
+ F[Web Browser<br/>HTTP/WebSocket]
+ G[HTTP API Clients<br/>OpenAI Compatible]
+ end
+
+ subgraph "Platform Support"
+ H[CPU Fallback<br/>All Platforms]
+ I[CUDA Support<br/>Linux GPU]
+ J[Metal Support<br/>macOS GPU]
+ end
+
+ A --- B
+ A --- C
+ A --- D
+ E -.-> A
+ F -.-> A
+ G -.-> A
+ B --- H
+ B --- I
+ B --- J
+```
+
+## Workspace Structure
+
+The project uses a 4-crate Rust workspace with TypeScript tooling, designed for maximum flexibility in deployment configurations.
+
+```mermaid
+graph TD
+ subgraph "Rust Workspace"
+ subgraph "Main Orchestrator"
+ A[predict-otron-9000<br/>Edition: 2024<br/>Port: 8080]
+ end
+
+ subgraph "AI Services"
+ B[inference-engine<br/>Edition: 2021<br/>Port: 8080<br/>Candle ML]
+ C[embeddings-engine<br/>Edition: 2024<br/>Port: 8080<br/>FastEmbed]
+ end
+
+ subgraph "Frontend"
+ D[leptos-app<br/>Edition: 2021<br/>Port: 3000/8788<br/>WASM/SSR]
+ end
+ end
+
+ subgraph "External Tooling"
+ E[cli.ts<br/>TypeScript/Bun<br/>OpenAI SDK]
+ end
+
+ subgraph "Dependencies"
+ A --> B
+ A --> C
+ A --> D
+ B -.-> F[Candle 0.9.1]
+ C -.-> G[FastEmbed 4.x]
+ D -.-> H[Leptos 0.8.0]
+ E -.-> I[OpenAI SDK 5.16+]
+ end
+
+ style A fill:#e1f5fe
+ style B fill:#f3e5f5
+ style C fill:#e8f5e8
+ style D fill:#fff3e0
+ style E fill:#fce4ec
+```
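+
+Because every service is a workspace member, each crate can be built in isolation with Cargo's `-p` flag, which is a convenient way to exercise the hybrid library/standalone design. A minimal sketch (standard Cargo flags; crate names as listed above):
+
+```shell
+# Build a single service crate in isolation
+cargo build -p embeddings-engine
+cargo build -p inference-engine
+
+# Build the whole workspace, orchestrator included
+cargo build --workspace
+```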
+
+## Deployment Configurations
+
+### Development Mode
+
+In local development, all services run embedded in the main server for simplicity.
+
+```mermaid
+graph LR
+ subgraph "Development Environment"
+ subgraph "Single Process - Port 8080"
+ A[predict-otron-9000 Server]
+ A --> B[Embedded Inference Engine]
+ A --> C[Embedded Embeddings Engine]
+ A --> D[SSR Leptos Frontend]
+ end
+
+ subgraph "Separate Frontend - Port 8788"
+ E[Trunk Dev Server<br/>Hot Reload: 3001]
+ end
+ end
+
+ subgraph "External Clients"
+ F[CLI Client<br/>cli.ts via Bun]
+ G[Web Browser]
+ H[HTTP API Clients]
+ end
+
+ F -.-> A
+ G -.-> A
+ G -.-> E
+ H -.-> A
+
+ style A fill:#e3f2fd
+ style E fill:#f1f8e9
+```
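+
+A sketch of starting this mode locally. It assumes the orchestrator's binary target shares its crate name and that the frontend uses Trunk's standard workflow; check the crate manifests for the exact invocations:
+
+```shell
+# Start the main server with embedded engines on port 8080
+cargo run -p predict-otron-9000
+
+# In a second terminal: serve the Leptos frontend with hot reload
+cd crates/leptos-app
+trunk serve --port 8788
+```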
+
+### Docker Monolithic
+
+Docker Compose runs a single containerized service handling all functionality.
+
+```mermaid
+graph TB
+ subgraph "Docker Environment"
+ subgraph "predict-otron-9000 Container"
+ A[Main Server :8080]
+ A --> B[Inference Engine<br/>Library Mode]
+ A --> C[Embeddings Engine<br/>Library Mode]
+ A --> D[Leptos Frontend<br/>SSR Mode]
+ end
+
+ subgraph "Persistent Storage"
+ E[HF Cache Volume<br/>/.hf-cache]
+ F[FastEmbed Cache Volume<br/>/.fastembed_cache]
+ end
+
+ subgraph "Network"
+ G[predict-otron-network<br/>Bridge Driver]
+ end
+ end
+
+ subgraph "External Access"
+ H[Host Port 8080]
+ I[External Clients]
+ end
+
+ A --- E
+ A --- F
+ A --- G
+ H --> A
+ I -.-> H
+
+ style A fill:#e8f5e8
+ style E fill:#fff3e0
+ style F fill:#fff3e0
+```
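+
+Assuming a Compose file at the repository root defines the container, cache volumes, and bridge network shown above, bringing the stack up follows the standard Compose workflow (a sketch, not the project's documented commands; the `/v1/models` path assumes the OpenAI-compatible model-list route):
+
+```shell
+# Build and start the monolithic container in the background
+docker compose up -d --build
+
+# Verify the server responds on the mapped host port
+curl http://localhost:8080/v1/models
+```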
+
+### Kubernetes Microservices
+
+Kubernetes deployment separates all services for horizontal scalability and fault isolation.
+
+```mermaid
+graph TB
+ subgraph "Kubernetes Namespace"
+ subgraph "Main Orchestrator"
+ A[predict-otron-9000 Pod<br/>:8080<br/>ClusterIP Service]
+ end
+
+ subgraph "AI Services"
+ B[inference-engine Pod<br/>:8080<br/>ClusterIP Service]
+ C[embeddings-engine Pod<br/>:8080<br/>ClusterIP Service]
+ end
+
+ subgraph "Frontend"
+ D[leptos-app Pod<br/>:8788<br/>ClusterIP Service]
+ end
+
+ subgraph "Ingress"
+ E[Ingress Controller<br/>predict-otron-9000.local]
+ end
+ end
+
+ subgraph "External"
+ F[External Clients]
+ G[Container Registry<br/>ghcr.io/geoffsee/*]
+ end
+
+ A <--> B
+ A <--> C
+ E --> A
+ E --> D
+ F -.-> E
+
+ G -.-> A
+ G -.-> B
+ G -.-> C
+ G -.-> D
+
+ style A fill:#e3f2fd
+ style B fill:#f3e5f5
+ style C fill:#e8f5e8
+ style D fill:#fff3e0
+ style E fill:#fce4ec
+```
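+
+Once the manifests are applied, standard `kubectl` commands confirm the topology. The service name below is inferred from the pod names in the diagram and may differ in the actual manifests:
+
+```shell
+# Inspect pods and services in the target namespace
+kubectl get pods,svc
+
+# Port-forward the orchestrator for local testing, bypassing the ingress
+kubectl port-forward svc/predict-otron-9000 8080:8080
+```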
+
+## Service Interactions
+
+### API Flow and Communication Patterns
+
+```mermaid
+sequenceDiagram
+ participant Client as External Client
+ participant Main as Main Server<br/>(Port 8080)
+ participant Inf as Inference Engine
+ participant Emb as Embeddings Engine
+ participant Web as Web Frontend
+
+ Note over Client, Web: Development/Monolithic Mode
+ Client->>Main: POST /v1/chat/completions
+ Main->>Inf: Internal call (library)
+ Inf-->>Main: Generated response
+ Main-->>Client: Streaming/Non-streaming response
+
+ Client->>Main: POST /v1/embeddings
+ Main->>Emb: Internal call (library)
+ Emb-->>Main: Vector embeddings
+ Main-->>Client: Embeddings response
+
+ Note over Client, Web: Kubernetes Microservices Mode
+ Client->>Main: POST /v1/chat/completions
+ Main->>Inf: HTTP POST :8080/v1/chat/completions
+ Inf-->>Main: HTTP Response (streaming)
+ Main-->>Client: Proxied response
+
+ Client->>Main: POST /v1/embeddings
+ Main->>Emb: HTTP POST :8080/v1/embeddings
+ Emb-->>Main: HTTP Response
+ Main-->>Client: Proxied response
+
+ Note over Client, Web: Web Interface Flow
+ Web->>Main: WebSocket connection
+ Web->>Main: Chat message
+ Main->>Inf: Process inference
+ Inf-->>Main: Streaming tokens
+ Main-->>Web: WebSocket stream
+```
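+
+Because the endpoints follow the OpenAI API shape, plain `curl` can exercise both paths. The model identifiers below are placeholders, not necessarily the names the server registers:
+
+```shell
+# Chat completion (non-streaming); "gemma-2b-it" is a placeholder model id
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gemma-2b-it", "messages": [{"role": "user", "content": "Hello!"}]}'
+
+# Embeddings; the model id is likewise a placeholder
+curl http://localhost:8080/v1/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{"model": "nomic-embed-text-v1.5", "input": "The quick brown fox"}'
+```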
+
+### Port Configuration Matrix
+
+```mermaid
+graph TB
+ subgraph "Port Allocation by Mode"
+ subgraph "Development"
+ A[Main Server: 8080<br/>All services embedded]
+ B[Leptos Dev: 8788<br/>Hot reload: 3001]
+ end
+
+ subgraph "Docker Monolithic"
+ C[Main Server: 8080<br/>All services embedded<br/>Host mapped]
+ end
+
+ subgraph "Kubernetes Microservices"
+ D[Main Server: 8080]
+ E[Inference Engine: 8080]
+ F[Embeddings Engine: 8080]
+ G[Leptos Frontend: 8788]
+ end
+ end
+
+ style A fill:#e3f2fd
+ style C fill:#e8f5e8
+ style D fill:#f3e5f5
+ style E fill:#f3e5f5
+ style F fill:#e8f5e8
+ style G fill:#fff3e0
+```
+
+## Platform-Specific Configurations
+
+### Hardware Acceleration Support
+
+```mermaid
+graph TB
+ subgraph "Platform Detection"
+ A[Build System]
+ end
+
+ subgraph "macOS"
+ A --> B[Metal Features Available]
+ B --> C[CPU Fallback<br/>Stability Priority]
+ C --> D[F32 Precision<br/>Gemma Compatibility]
+ end
+
+ subgraph "Linux"
+ A --> E[CUDA Features Available]
+ E --> F[GPU Acceleration<br/>Performance Priority]
+ F --> G[BF16 Precision<br/>GPU Optimized]
+ E --> H[CPU Fallback<br/>F32 Precision]
+ end
+
+ subgraph "Other Platforms"
+ A --> I[CPU Only<br/>Universal Compatibility]
+ I --> J[F32 Precision<br/>Standard Support]
+ end
+
+ style B fill:#e8f5e8
+ style E fill:#e3f2fd
+ style I fill:#fff3e0
+```
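+
+Acceleration is selected at compile time. Assuming the workspace forwards Candle's conventional feature names (an assumption; check each crate's `Cargo.toml` for the real feature flags), builds might look like:
+
+```shell
+# Linux with NVIDIA GPU (assumed feature name)
+cargo build --release --features cuda
+
+# macOS with Metal (assumed feature name)
+cargo build --release --features metal
+
+# CPU-only fallback on any platform
+cargo build --release
+```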
+
+### Model Loading and Caching
+
+```mermaid
+graph LR
+ subgraph "Model Access Flow"
+ A[Application Start] --> B{Model Cache Exists?}
+ B -->|Yes| C[Load from Cache]
+ B -->|No| D[HuggingFace Authentication]
+ D --> E{HF Token Valid?}
+ E -->|Yes| F[Download Model]
+ E -->|No| G[Authentication Error]
+ F --> H[Save to Cache]
+ H --> C
+ C --> I[Initialize Inference]
+ end
+
+ subgraph "Cache Locations"
+ J[HF_HOME Cache<br/>.hf-cache]
+ K[FastEmbed Cache<br/>.fastembed_cache]
+ end
+
+ F -.-> J
+ F -.-> K
+
+ style D fill:#fce4ec
+ style G fill:#ffebee
+```
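+
+A sketch of priming the caches before first start, using the directory names from the diagram. `HF_HOME` is the standard Hugging Face cache variable; FastEmbed writes to `.fastembed_cache` by default, so that cache needs no extra configuration:
+
+```shell
+# Authenticate once so gated Gemma weights can be downloaded
+huggingface-cli login
+
+# Point the Hugging Face cache at the repo-local directory shown above
+export HF_HOME=.hf-cache
+```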
+
+## Data Flow Patterns
+
+### Request Processing Pipeline
+
+```mermaid
+flowchart TD
+ A[Client Request] --> B{Request Type}
+
+ B -->|Chat Completion| C[Parse Messages]
+ B -->|Model List| D[Return Available Models]
+ B -->|Embeddings| E[Process Text Input]
+
+ C --> F[Apply Prompt Template]
+ F --> G{Streaming?}
+
+ G -->|Yes| H[Initialize Stream]
+ G -->|No| I[Generate Complete Response]
+
+ H --> J[Token Generation Loop]
+ J --> K[Send Chunk]
+ K --> L{More Tokens?}
+ L -->|Yes| J
+ L -->|No| M[End Stream]
+
+ I --> N[Return Complete Response]
+
+ E --> O[Generate Embeddings]
+ O --> P[Return Vectors]
+
+ D --> Q[Return Model Metadata]
+
+ style A fill:#e3f2fd
+ style H fill:#e8f5e8
+ style I fill:#f3e5f5
+ style O fill:#fff3e0
+```
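+
+For the streaming branch, the OpenAI-style `stream` flag yields incremental chunks; `-N` disables curl's buffering so tokens appear as they arrive (the model id is again a placeholder):
+
+```shell
+# Streaming chat completion; chunks arrive incrementally
+curl -N http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gemma-2b-it", "stream": true, "messages": [{"role": "user", "content": "Tell me a story"}]}'
+```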
+
+### Authentication and Security Flow
+
+```mermaid
+sequenceDiagram
+ participant User as User/Client
+ participant App as Application
+ participant HF as HuggingFace Hub
+ participant Model as Model Cache
+
+ Note over User, Model: First-time Setup
+ User->>App: Start application
+ App->>HF: Check model access (gated)
+ HF-->>App: 401 Unauthorized
+ App-->>User: Requires HF authentication
+
+ User->>User: huggingface-cli login
+ User->>App: Retry start
+ App->>HF: Check model access (with token)
+ HF-->>App: 200 OK + model metadata
+ App->>HF: Download model files
+ HF-->>App: Model data stream
+ App->>Model: Cache model locally
+
+ Note over User, Model: Subsequent Runs
+ User->>App: Start application
+ App->>Model: Load cached model
+ Model-->>App: Ready for inference
+```
+
+---
+
+## Summary
+
+The Predict-Otron-9000 architecture provides maximum flexibility through:
+
+- **Monolithic Mode**: Single server embedding all services for development and simple deployments
+- **Microservices Mode**: Separate services for production scalability and fault isolation
+- **Hybrid Capabilities**: Each service can operate as both library and standalone service
+- **Platform Optimization**: Conditional compilation for optimal performance across CPU/GPU configurations
+- **OpenAI Compatibility**: Standard API interfaces for seamless integration with existing tools
+
+This flexible architecture allows teams to start with simple monolithic deployments and scale to distributed microservices as needs grow, all while maintaining API compatibility and leveraging platform-specific optimizations.
\ No newline at end of file
diff --git a/helm-chart-tool/README.md b/helm-chart-tool/README.md
index 6036328..58ee48d 100644
--- a/helm-chart-tool/README.md
+++ b/helm-chart-tool/README.md
@@ -137,7 +137,7 @@ Parsing workspace at: ..
Output directory: ../generated-helm-chart
Chart name: predict-otron-9000
Found 4 services:
- - leptos-chat: ghcr.io/geoffsee/leptos-chat:latest (port 8788)
+ - leptos-app: ghcr.io/geoffsee/leptos-app:latest (port 8788)
- inference-engine: ghcr.io/geoffsee/inference-service:latest (port 8080)
- embeddings-engine: ghcr.io/geoffsee/embeddings-service:latest (port 8080)
- predict-otron-9000: ghcr.io/geoffsee/predict-otron-9000:latest (port 8080)