Performance Testing and Optimization Guide

This guide provides instructions for measuring, analyzing, and optimizing the performance of predict-otron-9000 components.

Overview

The predict-otron-9000 system consists of three main components:

  1. predict-otron-9000: The main server that integrates the other components
  2. embeddings-engine: Generates text embeddings using the Nomic Embed Text v1.5 model
  3. inference-engine: Handles text generation using various Gemma models

We've implemented performance metrics collection in all three components to identify bottlenecks and measure optimization impact.

Getting Started

Prerequisites

  • Rust 1.85 or newer (required for 2024 edition support)
  • Cargo package manager
  • Basic understanding of the system architecture
  • The project built with cargo build --release

Running Performance Tests

We've created two scripts for performance testing:

  1. performance_test_embeddings.sh: Tests embedding generation with different input sizes
  2. performance_test_inference.sh: Tests text generation with different prompt sizes

Step 1: Start the Server

# Start the server in a terminal window
./run_server.sh

Wait for the server to fully initialize (look for the "server listening" message).

Step 2: Run Embedding Performance Tests

In a new terminal window:

# Run the embeddings performance test
./performance_test_embeddings.sh

This will test embedding generation with small, medium, and large inputs and report timing metrics.

Step 3: Run Inference Performance Tests

# Run the inference performance test
./performance_test_inference.sh

This will test text generation with small, medium, and large prompts and report timing metrics.

Step 4: Collect and Analyze Results

The test scripts store detailed results in temporary directories. Review these results along with the server logs to identify performance bottlenecks.

# Check server logs for detailed timing breakdowns
# Analyze the performance metrics summaries

Performance Metrics Collected

API Request Metrics (predict-otron-9000)

  • Total request count
  • Average response time
  • Minimum response time
  • Maximum response time
  • Per-endpoint metrics

These metrics are logged every 60 seconds to the server console.
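
For reference, per-endpoint counters of this kind can be kept in a small registry behind a mutex and summarized on a timer. The sketch below is illustrative only; `EndpointStats` and `MetricsRegistry` are hypothetical names, not the server's actual types.

use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

// Hypothetical per-endpoint statistics (not the actual predict-otron-9000 types).
#[derive(Default)]
struct EndpointStats {
    count: u64,
    total: Duration,
    min: Option<Duration>,
    max: Duration,
}

#[derive(Default)]
struct MetricsRegistry {
    endpoints: Mutex<HashMap<String, EndpointStats>>,
}

impl MetricsRegistry {
    // Record one completed request for the given endpoint.
    fn record(&self, endpoint: &str, elapsed: Duration) {
        let mut map = self.endpoints.lock().unwrap();
        let stats = map.entry(endpoint.to_string()).or_default();
        stats.count += 1;
        stats.total += elapsed;
        stats.min = Some(stats.min.map_or(elapsed, |m| m.min(elapsed)));
        stats.max = stats.max.max(elapsed);
    }

    // Log one summary line per endpoint (called periodically, e.g. every 60 s).
    fn log_summary(&self) {
        for (endpoint, s) in self.endpoints.lock().unwrap().iter() {
            let avg = s.total / s.count.max(1) as u32;
            println!(
                "{endpoint}: count={} avg={avg:?} min={:?} max={:?}",
                s.count,
                s.min.unwrap_or_default(),
                s.max
            );
        }
    }
}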

Embedding Generation Metrics (embeddings-engine)

  • Model initialization time
  • Input processing time
  • Embedding generation time
  • Post-processing time
  • Total request time
  • Memory usage estimates

Text Generation Metrics (inference-engine)

  • Tokenization time
  • Forward pass time (per token)
  • Repeat penalty computation time
  • Token sampling time
  • Average time per token
  • Total generation time
  • Tokens per second rate
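
As an illustration of how the per-token figures above can be derived, the following sketch times a generation loop with std::time::Instant. The `forward` and `sample` functions are hypothetical stand-ins for the engine's real forward pass and sampler.

use std::time::{Duration, Instant};

fn generate_with_metrics(prompt_tokens: &[u32], max_tokens: usize) -> Vec<u32> {
    let mut output = Vec::new();
    let mut forward_time = Duration::ZERO;
    let start = Instant::now();

    for _ in 0..max_tokens {
        let t = Instant::now();
        let logits = forward(prompt_tokens, &output); // forward pass (per token)
        forward_time += t.elapsed();

        output.push(sample(&logits));                 // token sampling
    }

    let total = start.elapsed();
    let per_token = total / max_tokens.max(1) as u32;
    let tokens_per_sec = output.len() as f64 / total.as_secs_f64().max(1e-9);
    println!("avg/token={per_token:?} forward total={forward_time:?} rate={tokens_per_sec:.1} tok/s");
    output
}

// Hypothetical stand-ins for the real model calls.
fn forward(_prompt: &[u32], _generated: &[u32]) -> Vec<f32> { vec![0.0; 32] }
fn sample(logits: &[f32]) -> u32 {
    logits.iter().enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}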

Potential Optimization Areas

Based on code analysis, here are potential areas for optimization:

Embeddings Engine

  1. Model Initialization: The model is initialized for each request (a singleton sketch follows this list). Consider:

    • Creating a persistent model instance (singleton pattern)
    • Implementing a model cache
    • Using a smaller model for less demanding tasks
  2. Padding Logic: The code pads embeddings to 768 dimensions, which may be unnecessary:

    • Make padding configurable
    • Use the native dimension size when possible
  3. Random Embedding Generation: When embeddings are all zeros, random embeddings are generated:

    • Profile this logic to assess performance impact
    • Consider pre-computing fallback embeddings
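
As a rough sketch of the singleton approach from item 1 (the pattern the embeddings engine now uses via once_cell; see Implemented Optimizations below), a lazily initialized global could look like the following. `TextEmbedder` is a hypothetical placeholder for the actual model type, and the example assumes the once_cell dependency.

use once_cell::sync::Lazy;
use std::sync::Mutex;

// Hypothetical stand-in for the actual embedding model type.
struct TextEmbedder;

impl TextEmbedder {
    fn load() -> Self {
        // Expensive weight loading happens exactly once.
        TextEmbedder
    }
    fn embed(&self, _input: &str) -> Vec<f32> {
        vec![0.0; 768]
    }
}

// Initialized lazily on first use and shared by all subsequent requests.
static EMBEDDER: Lazy<Mutex<TextEmbedder>> = Lazy::new(|| Mutex::new(TextEmbedder::load()));

fn handle_request(input: &str) -> Vec<f32> {
    EMBEDDER.lock().unwrap().embed(input)
}

The Mutex is only needed if the model requires mutable access during inference; a read-only model could be shared without it.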

Inference Engine

  1. Context Window Management: The code uses different approaches for different model versions:

    • Profile both approaches to determine the more efficient one
    • Optimize context window size based on performance data
  2. Repeat Penalty Computation: This computation is done for each token (a caching sketch follows this list):

    • Consider optimizing the algorithm or data structure
    • Analyze if penalty strength can be reduced for better performance
  3. Tensor Operations: The code creates new tensors frequently:

    • Consider tensor reuse where possible
    • Investigate more efficient tensor operations
  4. Token Streaming: Improve the efficiency of token output streaming:

    • Batch token decoding where possible
    • Reduce memory allocations during streaming
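
One way the repeat-penalty caching mentioned in item 2 can work is to carry the set of already generated token ids across steps so that each step only touches those logits, rather than rescanning the full history. This sketch is illustrative and does not claim to match the engine's actual implementation.

use std::collections::HashSet;

// Illustrative repeat-penalty cache: the set of generated token ids persists
// across decoding steps and grows by at most one entry per step.
struct RepeatPenaltyCache {
    seen: HashSet<usize>,
    penalty: f32,
}

impl RepeatPenaltyCache {
    fn new(penalty: f32) -> Self {
        Self { seen: HashSet::new(), penalty }
    }

    // Call once per generated token.
    fn note_token(&mut self, token_id: usize) {
        self.seen.insert(token_id);
    }

    // Penalize only the logits of tokens that have actually appeared.
    fn apply(&self, logits: &mut [f32]) {
        for &id in &self.seen {
            if let Some(logit) = logits.get_mut(id) {
                *logit = if *logit > 0.0 { *logit / self.penalty } else { *logit * self.penalty };
            }
        }
    }
}

With this shape, the per-token cost scales with the number of distinct repeated tokens rather than the full context length.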

Optimization Cycle

Follow this cycle for each optimization:

  1. Measure: Run performance tests to establish baseline
  2. Identify: Find the biggest bottleneck based on metrics
  3. Optimize: Make targeted changes to address the bottleneck
  4. Test: Run performance tests again to measure improvement
  5. Repeat: Identify the next bottleneck and continue

Tips for Effective Optimization

  1. Make One Change at a Time: Isolate changes to accurately measure their impact
  2. Focus on Hot Paths: Optimize code that runs frequently or takes significant time
  3. Use Profiling Tools: Consider using Rust profiling tools like perf or flamegraph
  4. Consider Trade-offs: Some optimizations may increase memory usage or reduce accuracy
  5. Document Changes: Keep track of optimizations and their measured impact

Memory Optimization

Beyond speed, consider memory usage optimization:

  1. Monitor Memory Usage: Use tools like top or htop to monitor process memory
  2. Reduce Allocations: Minimize temporary allocations in hot loops
  3. Buffer Reuse: Reuse buffers instead of creating new ones (see the sketch after this list)
  4. Lazy Loading: Load resources only when needed
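
For example, the buffer-reuse tip above could be applied to token streaming roughly as follows. This is a sketch; `decode_into` is a hypothetical decoder standing in for a real tokenizer call.

// Reuse one decode buffer across iterations of a hot loop instead of
// allocating a fresh String per token.
fn stream_tokens(token_ids: &[u32]) {
    let mut buf = String::with_capacity(64);
    for &id in token_ids {
        buf.clear();                 // keeps the allocation, drops the contents
        decode_into(id, &mut buf);   // hypothetical decoder writing into the buffer
        print!("{buf}");
    }
}

// Hypothetical stand-in for a real tokenizer decode call.
fn decode_into(id: u32, out: &mut String) {
    use std::fmt::Write;
    let _ = write!(out, "<{id}>");
}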

Implemented Optimizations

Several optimizations have already been implemented based on this guide:

  1. Embeddings Engine: Persistent model instance (singleton pattern) using once_cell
  2. Inference Engine: Optimized repeat penalty computation with caching

For details on these optimizations, their implementation, and impact, see the OPTIMIZATIONS.md document.

Next Steps

After the initial optimizations, consider these additional system-level improvements:

  1. Concurrency: Process multiple requests in parallel where appropriate
  2. Caching: Implement caching for common inputs/responses (see the sketch after this list)
  3. Load Balancing: Distribute work across multiple instances
  4. Hardware Acceleration: Utilize GPU or specialized hardware if available
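
As a sketch of the caching idea in item 2, a cache for common inputs could be as simple as a map from input text to result. The type name and the unbounded, eviction-free design here are illustrative assumptions; a real deployment would likely bound the cache (e.g. with an LRU policy).

use std::collections::HashMap;

// Illustrative cache for repeated embedding inputs, keyed by the raw input text.
struct EmbeddingCache {
    entries: HashMap<String, Vec<f32>>,
}

impl EmbeddingCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Return a cached result if present; otherwise compute, store, and return it.
    fn get_or_compute(&mut self, input: &str, compute: impl FnOnce(&str) -> Vec<f32>) -> Vec<f32> {
        if let Some(hit) = self.entries.get(input) {
            return hit.clone();
        }
        let value = compute(input);
        self.entries.insert(input.to_string(), value.clone());
        value
    }
}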

Refer to OPTIMIZATIONS.md for a prioritized roadmap of future optimizations.