predict-otron-9001/server_test.log
geoffsee 8338750beb Refactor apply_cached_repeat_penalty for optimized caching and reuse, add extensive unit tests, and integrate special handling for gemma-specific models.
Removed `test_request.sh`, deprecated functionality, and unused imports; introduced a new CLI tool (`cli.ts`) for testing inference engine and adjusted handling of non-streaming/streaming chat completions.
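The commit's actual `apply_cached_repeat_penalty` is not shown in this log, so the following is only an illustrative sketch of the technique the subject line names: a repeat penalty whose penalized logit values are memoized per token id, so a token that appears many times in the history is computed once and reused. Function name, signature, and the caching strategy are assumptions.

```rust
use std::collections::HashMap;

// Sketch of a cached repetition penalty (CTRL-style): positive logits
// are divided by the penalty, negative logits multiplied, and the
// penalized value for each token id is cached for reuse.
fn apply_cached_repeat_penalty(
    logits: &mut [f32],
    history: &[usize],
    penalty: f32,
    cache: &mut HashMap<usize, f32>,
) {
    for &tok in history {
        if tok >= logits.len() {
            continue; // ignore out-of-vocabulary ids
        }
        let penalized = *cache.entry(tok).or_insert_with(|| {
            let l = logits[tok];
            if l >= 0.0 { l / penalty } else { l * penalty }
        });
        logits[tok] = penalized;
    }
}
```

Because the cache stores the already-penalized value, a token repeated in the history is penalized exactly once rather than compounding on each occurrence.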

- Add CPU fallback support for text generation when primary device is unsupported
- Introduce `execute_with_fallback` method to handle device compatibility and shape mismatch errors
- Extend unit tests to reproduce tensor shape mismatch errors specific to model configurations
- Increase HTTP timeout limits in `curl_chat_stream.sh` script for reliable API testing
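The `execute_with_fallback` method itself is not reproduced in this log; a minimal sketch of the pattern the bullets describe might look like the following, with the device and error types simplified to stand-ins (in the real crate these would be `candle_core` types). Only errors that plausibly indicate a missing kernel or a shape mismatch trigger the CPU retry.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Device { Metal, Cpu }

#[derive(Debug)]
enum GenError {
    UnsupportedOp(&'static str), // e.g. a Metal kernel that is missing
    ShapeMismatch,
    Other(String),
}

// Run `op` on the primary device; if it fails with an error class that
// a CPU retry can fix, run it again on the CPU. All other errors are
// propagated unchanged.
fn execute_with_fallback<T>(
    primary: Device,
    mut op: impl FnMut(Device) -> Result<T, GenError>,
) -> Result<T, GenError> {
    match op(primary) {
        Err(GenError::UnsupportedOp(_)) | Err(GenError::ShapeMismatch)
            if primary != Device::Cpu =>
        {
            op(Device::Cpu)
        }
        other => other,
    }
}
```

The key design choice is to whitelist retryable error classes rather than retrying on any failure, so genuine bugs still surface on the primary device.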

Chat completion endpoint functions with gemma3 (no streaming)

Add benchmarking guide with HTML reporting, Leptos chat crate, and middleware for metrics tracking
2025-08-27 16:15:01 -04:00


warning: unused import: `candle_core::Tensor`
--> crates/inference-engine/src/model.rs:1:5
|
1 | use candle_core::Tensor;
| ^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused import: `Config as Config1`
--> crates/inference-engine/src/model.rs:2:42
|
2 | use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `Config as Config2`
--> crates/inference-engine/src/model.rs:3:43
|
3 | use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `Config as Config3`
--> crates/inference-engine/src/model.rs:4:43
|
4 | use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};
| ^^^^^^^^^^^^^^^^^
warning: unused import: `ArrayBuilder`
--> crates/inference-engine/src/openai_types.rs:23:27
|
23 | use utoipa::openapi::{ArrayBuilder, ObjectBuilder, OneOfBuilder, RefOr, Schema...
| ^^^^^^^^^^^^
warning: unused import: `IntoResponse`
--> crates/inference-engine/src/server.rs:4:38
|
4 | response::{sse::Event, sse::Sse, IntoResponse},
| ^^^^^^^^^^^^
warning: unused import: `future`
--> crates/inference-engine/src/server.rs:9:31
|
9 | use futures_util::{StreamExt, future};
| ^^^^^^
warning: unused import: `std::io::Write`
--> crates/inference-engine/src/text_generation.rs:5:5
|
5 | use std::io::Write;
| ^^^^^^^^^^^^^^
warning: unused import: `StreamExt`
--> crates/inference-engine/src/server.rs:9:20
|
9 | use futures_util::{StreamExt, future};
| ^^^^^^^^^
warning: method `apply_cached_repeat_penalty` is never used
--> crates/inference-engine/src/text_generation.rs:47:8
|
22 | impl TextGeneration {
| ------------------- method in this implementation
...
47 | fn apply_cached_repeat_penalty(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
warning: unused import: `get`
--> crates/embeddings-engine/src/lib.rs:3:47
|
3 | response::Json as ResponseJson, routing::{get, post},
| ^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused imports: `Deserialize` and `Serialize`
--> crates/embeddings-engine/src/lib.rs:9:13
|
9 | use serde::{Deserialize, Serialize};
| ^^^^^^^^^^^ ^^^^^^^^^
warning: `inference-engine` (lib) generated 10 warnings (run `cargo fix --lib -p inference-engine` to apply 7 suggestions)
warning: `embeddings-engine` (lib) generated 2 warnings (run `cargo fix --lib -p embeddings-engine` to apply 2 suggestions)
warning: unused import: `axum::response::IntoResponse`
--> crates/predict-otron-9000/src/main.rs:8:5
|
8 | use axum::response::IntoResponse;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: `predict-otron-9000` (bin "predict-otron-9000") generated 1 warning (run `cargo fix --bin "predict-otron-9000"` to apply 1 suggestion)
Finished `release` profile [optimized] target(s) in 0.14s
Running `target/release/predict-otron-9000`
avx: false, neon: true, simd128: false, f16c: false
2025-08-27T17:54:45.554609Z  INFO hf_hub: Using token file found "/Users/williamseemueller/.cache/huggingface/token"
2025-08-27T17:54:45.555593Z  INFO predict_otron_9000::middleware::metrics: Performance metrics summary:
Checking model_id: 'google/gemma-3-1b-it'
Trimmed model_id length: 20
Using explicitly specified model type: InstructV3_1B
retrieved the files in 1.332041ms
Note: Using CPU for Gemma 3 model due to missing Metal implementations for required operations (e.g., rotary-emb).
loaded the model in 879.2335ms
thread 'main' panicked at crates/predict-otron-9000/src/main.rs:91:61:
called `Result::unwrap()` on an `Err` value: Os { code: 48, kind: AddrInUse, message: "Address already in use" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
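The panic above comes from calling `.unwrap()` on a failed socket bind while another instance of the server is still running on the same port. A small sketch of surfacing `AddrInUse` as a readable error instead of a panic (the helper name is illustrative, not the repository's code):

```rust
use std::net::TcpListener;

// Bind the listener, mapping the common "address already in use" case
// to an actionable message instead of panicking on unwrap().
fn bind_or_explain(addr: &str) -> Result<TcpListener, String> {
    TcpListener::bind(addr).map_err(|e| {
        if e.kind() == std::io::ErrorKind::AddrInUse {
            format!("{addr} is already in use; stop the other instance or choose another port")
        } else {
            format!("failed to bind {addr}: {e}")
        }
    })
}
```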