Performance Testing and Optimization Guide

This guide provides instructions for measuring, analyzing, and optimizing the performance of predict-otron-9000 components.

Overview

The predict-otron-9000 system consists of three main components:

  1. predict-otron-9000: The main server that integrates the other components
  2. embeddings-engine: Generates text embeddings using the Nomic Embed Text v1.5 model
  3. inference-engine: Handles text generation using various Gemma models

We've implemented performance metrics collection in all three components to identify bottlenecks and measure optimization impact.

Getting Started

Prerequisites

  • Rust 1.85 or newer (required for 2024 edition support)
  • Cargo package manager
  • Basic understanding of the system architecture
  • The project built with cargo build --release

Running Performance Tests

We've created two scripts for performance testing:

  1. performance_test_embeddings.sh: Tests embedding generation with different input sizes
  2. performance_test_inference.sh: Tests text generation with different prompt sizes

Step 1: Start the Server

# Start the server in a terminal window
./run_server.sh

Wait for the server to fully initialize (look for the "server listening" message).

Step 2: Run Embedding Performance Tests

In a new terminal window:

# Run the embeddings performance test
./performance_test_embeddings.sh

This will test embedding generation with small, medium, and large inputs and report timing metrics.

Step 3: Run Inference Performance Tests

# Run the inference performance test
./performance_test_inference.sh

This will test text generation with small, medium, and large prompts and report timing metrics.

Step 4: Collect and Analyze Results

The test scripts store detailed results in temporary directories. Review these results along with the server logs to identify performance bottlenecks.

# Check server logs for detailed timing breakdowns
# Analyze the performance metrics summaries

Performance Metrics Collected

API Request Metrics (predict-otron-9000)

  • Total request count
  • Average response time
  • Minimum response time
  • Maximum response time
  • Per-endpoint metrics

These metrics are logged every 60 seconds to the server console.
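
For reference, per-endpoint counters of this kind can be kept in a small registry behind a mutex and summarized on a timer. The sketch below is illustrative only; `EndpointStats` and `MetricsRegistry` are hypothetical names, not the server's actual types.

use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

// Hypothetical per-endpoint statistics (not the actual predict-otron-9000 types).
#[derive(Default)]
struct EndpointStats {
    count: u64,
    total: Duration,
    min: Option<Duration>,
    max: Duration,
}

#[derive(Default)]
struct MetricsRegistry {
    endpoints: Mutex<HashMap<String, EndpointStats>>,
}

impl MetricsRegistry {
    // Record one completed request for the given endpoint.
    fn record(&self, endpoint: &str, elapsed: Duration) {
        let mut map = self.endpoints.lock().unwrap();
        let stats = map.entry(endpoint.to_string()).or_default();
        stats.count += 1;
        stats.total += elapsed;
        stats.min = Some(stats.min.map_or(elapsed, |m| m.min(elapsed)));
        stats.max = stats.max.max(elapsed);
    }

    // Log one summary line per endpoint (called periodically, e.g. every 60 s).
    fn log_summary(&self) {
        for (endpoint, s) in self.endpoints.lock().unwrap().iter() {
            let avg = s.total / s.count.max(1) as u32;
            println!(
                "{endpoint}: count={} avg={avg:?} min={:?} max={:?}",
                s.count,
                s.min.unwrap_or_default(),
                s.max
            );
        }
    }
}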

Embedding Generation Metrics (embeddings-engine)

  • Model initialization time
  • Input processing time
  • Embedding generation time
  • Post-processing time
  • Total request time
  • Memory usage estimates

Text Generation Metrics (inference-engine)

  • Tokenization time
  • Forward pass time (per token)
  • Repeat penalty computation time
  • Token sampling time
  • Average time per token
  • Total generation time
  • Tokens per second rate
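
As an illustration of how the per-token figures above can be derived, the following sketch times a generation loop with std::time::Instant. The `forward` and `sample` functions are hypothetical stand-ins for the engine's real forward pass and sampler.

use std::time::{Duration, Instant};

fn generate_with_metrics(prompt_tokens: &[u32], max_tokens: usize) -> Vec<u32> {
    let mut output = Vec::new();
    let mut forward_time = Duration::ZERO;
    let start = Instant::now();

    for _ in 0..max_tokens {
        let t = Instant::now();
        let logits = forward(prompt_tokens, &output); // forward pass (per token)
        forward_time += t.elapsed();

        output.push(sample(&logits));                 // token sampling
    }

    let total = start.elapsed();
    let per_token = total / max_tokens.max(1) as u32;
    let tokens_per_sec = output.len() as f64 / total.as_secs_f64().max(1e-9);
    println!("avg/token={per_token:?} forward total={forward_time:?} rate={tokens_per_sec:.1} tok/s");
    output
}

// Hypothetical stand-ins for the real model calls.
fn forward(_prompt: &[u32], _generated: &[u32]) -> Vec<f32> { vec![0.0; 32] }
fn sample(logits: &[f32]) -> u32 {
    logits.iter().enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}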

Potential Optimization Areas

Based on code analysis, here are potential areas for optimization:

Embeddings Engine

  1. Model Initialization: The model is initialized for each request (a singleton sketch follows this list). Consider:

    • Creating a persistent model instance (singleton pattern)
    • Implementing a model cache
    • Using a smaller model for less demanding tasks
  2. Padding Logic: The code pads embeddings to 768 dimensions, which may be unnecessary:

    • Make padding configurable
    • Use the native dimension size when possible
  3. Random Embedding Generation: When embeddings are all zeros, random embeddings are generated:

    • Profile this logic to assess performance impact
    • Consider pre-computing fallback embeddings
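
As a rough sketch of the singleton approach from item 1 (the pattern the embeddings engine now uses via once_cell; see Implemented Optimizations below), a lazily initialized global could look like the following. `TextEmbedder` is a hypothetical placeholder for the actual model type, and the example assumes the once_cell dependency.

use once_cell::sync::Lazy;
use std::sync::Mutex;

// Hypothetical stand-in for the actual embedding model type.
struct TextEmbedder;

impl TextEmbedder {
    fn load() -> Self {
        // Expensive weight loading happens exactly once.
        TextEmbedder
    }
    fn embed(&self, _input: &str) -> Vec<f32> {
        vec![0.0; 768]
    }
}

// Initialized lazily on first use and shared by all subsequent requests.
static EMBEDDER: Lazy<Mutex<TextEmbedder>> = Lazy::new(|| Mutex::new(TextEmbedder::load()));

fn handle_request(input: &str) -> Vec<f32> {
    EMBEDDER.lock().unwrap().embed(input)
}

The Mutex is only needed if the model requires mutable access during inference; a read-only model could be shared without it.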

Inference Engine

  1. Context Window Management: The code uses different approaches for different model versions:

    • Profile both approaches to determine the more efficient one
    • Optimize context window size based on performance data
  2. Repeat Penalty Computation: This computation is done for each token (a caching sketch follows this list):

    • Consider optimizing the algorithm or data structure
    • Analyze if penalty strength can be reduced for better performance
  3. Tensor Operations: The code creates new tensors frequently:

    • Consider tensor reuse where possible
    • Investigate more efficient tensor operations
  4. Token Streaming: Improve the efficiency of token output streaming:

    • Batch token decoding where possible
    • Reduce memory allocations during streaming
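
One way the repeat-penalty caching mentioned in item 2 can work is to carry the set of already generated token ids across steps so that each step only touches those logits, rather than rescanning the full history. This sketch is illustrative and does not claim to match the engine's actual implementation.

use std::collections::HashSet;

// Illustrative repeat-penalty cache: the set of generated token ids persists
// across decoding steps and grows by at most one entry per step.
struct RepeatPenaltyCache {
    seen: HashSet<usize>,
    penalty: f32,
}

impl RepeatPenaltyCache {
    fn new(penalty: f32) -> Self {
        Self { seen: HashSet::new(), penalty }
    }

    // Call once per generated token.
    fn note_token(&mut self, token_id: usize) {
        self.seen.insert(token_id);
    }

    // Penalize only the logits of tokens that have actually appeared.
    fn apply(&self, logits: &mut [f32]) {
        for &id in &self.seen {
            if let Some(logit) = logits.get_mut(id) {
                *logit = if *logit > 0.0 { *logit / self.penalty } else { *logit * self.penalty };
            }
        }
    }
}

With this shape, the per-token cost scales with the number of distinct repeated tokens rather than the full context length.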

Optimization Cycle

Follow this cycle for each optimization:

  1. Measure: Run performance tests to establish baseline
  2. Identify: Find the biggest bottleneck based on metrics
  3. Optimize: Make targeted changes to address the bottleneck
  4. Test: Run performance tests again to measure improvement
  5. Repeat: Identify the next bottleneck and continue

Tips for Effective Optimization

  1. Make One Change at a Time: Isolate changes to accurately measure their impact
  2. Focus on Hot Paths: Optimize code that runs frequently or takes significant time
  3. Use Profiling Tools: Consider using Rust profiling tools like perf or flamegraph
  4. Consider Trade-offs: Some optimizations may increase memory usage or reduce accuracy
  5. Document Changes: Keep track of optimizations and their measured impact

Memory Optimization

Beyond speed, consider memory usage optimization:

  1. Monitor Memory Usage: Use tools like top or htop to monitor process memory
  2. Reduce Allocations: Minimize temporary allocations in hot loops
  3. Buffer Reuse: Reuse buffers instead of creating new ones (see the sketch after this list)
  4. Lazy Loading: Load resources only when needed
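
For example, the buffer-reuse tip above could be applied to token streaming roughly as follows. This is a sketch; `decode_into` is a hypothetical decoder standing in for a real tokenizer call.

// Reuse one decode buffer across iterations of a hot loop instead of
// allocating a fresh String per token.
fn stream_tokens(token_ids: &[u32]) {
    let mut buf = String::with_capacity(64);
    for &id in token_ids {
        buf.clear();                 // keeps the allocation, drops the contents
        decode_into(id, &mut buf);   // hypothetical decoder writing into the buffer
        print!("{buf}");
    }
}

// Hypothetical stand-in for a real tokenizer decode call.
fn decode_into(id: u32, out: &mut String) {
    use std::fmt::Write;
    let _ = write!(out, "<{id}>");
}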

Implemented Optimizations

Several optimizations have already been implemented based on this guide:

  1. Embeddings Engine: Persistent model instance (singleton pattern) using once_cell
  2. Inference Engine: Optimized repeat penalty computation with caching

For details on these optimizations, their implementation, and impact, see the OPTIMIZATIONS.md document.

Next Steps

After the initial optimizations, consider these additional system-level improvements:

  1. Concurrency: Process multiple requests in parallel where appropriate
  2. Caching: Implement caching for common inputs/responses (see the sketch after this list)
  3. Load Balancing: Distribute work across multiple instances
  4. Hardware Acceleration: Utilize GPU or specialized hardware if available
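
As a sketch of the caching idea in item 2, a cache for common inputs could be as simple as a map from input text to result. The type name and the unbounded, eviction-free design here are illustrative assumptions; a real deployment would likely bound the cache (e.g. with an LRU policy).

use std::collections::HashMap;

// Illustrative cache for repeated embedding inputs, keyed by the raw input text.
struct EmbeddingCache {
    entries: HashMap<String, Vec<f32>>,
}

impl EmbeddingCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Return a cached result if present; otherwise compute, store, and return it.
    fn get_or_compute(&mut self, input: &str, compute: impl FnOnce(&str) -> Vec<f32>) -> Vec<f32> {
        if let Some(hit) = self.entries.get(input) {
            return hit.clone();
        }
        let value = compute(input);
        self.entries.insert(input.to_string(), value.clone());
        value
    }
}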

Refer to OPTIMIZATIONS.md for a prioritized roadmap of future optimizations.