mirror of
https://github.com/geoffsee/predict-otron-9001.git
synced 2025-09-08 22:46:44 +00:00
- Change default server host to localhost for improved security.
- Increase default maximum tokens in CLI configuration to 256.
- Refactor and reorganize CLI
2 Cargo.lock (generated)
@@ -2736,6 +2736,7 @@ dependencies = [
 "symphonia",
 "tokenizers",
 "tokio",
 "tokio-stream",
 "tower",
 "tower-http",
 "tracing",
@@ -6159,6 +6160,7 @@ dependencies = [
 "futures-core",
 "pin-project-lite",
 "tokio",
 "tokio-util",
]

[[package]]
157 ROOT_CAUSE_ANALYSIS.md (new file)
@@ -0,0 +1,157 @@
# Root Cause Analysis: Token Repetition in Streaming Text Generation

**Date:** August 27, 2025
**System:** Predict-Otron-9000 Inference Engine
**Issue:** Token repetition in streaming text generation, despite a working individual-token streaming implementation

## Executive Summary

The Predict-Otron-9000 system has successfully implemented individual token streaming and resolved the false-positive stream issue seen across multiple CLI invocations. However, token repetition remains a critical issue that degrades output quality. This analysis identifies the root cause as insufficient context preservation in the incremental token generation process, particularly for the Gemma model variants.

## Technical Background

### System Architecture

The Predict-Otron-9000 consists of several key components:

1. **Inference Engine** (`crates/inference-engine/`): Core text generation logic
   - `TokenOutputStream`: Handles token-by-token decoding and streaming
   - `TextGeneration`: Main generation logic with streaming support
   - `server.rs`: HTTP API with Server-Sent Events (SSE) streaming
2. **CLI Client** (`cli.ts`): TypeScript client for interacting with the inference engine
3. **Model Support**: Gemma-1, Gemma-2, and Gemma-3 model variants

### Streaming Implementation Changes

#### Individual Token Generation ✅ RESOLVED

**Previous Behavior:** Tokens were generated in batches and sent all at once.

**Current Implementation:**
- `TokenOutputStream.next_token()` processes individual tokens with incremental decoding (sketched below)
- Modified to "include all tokens, not just alphanumeric ones" (token_output_stream.rs:44)
- The server streams tokens via SSE using the callback passed to `TextGeneration.run_with_streaming()`
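Incremental decoding of this kind can be pictured with a minimal sketch (assuming the `tokenizers` crate; `StreamDecoder` and its fields are illustrative names, not the actual `TokenOutputStream` API): re-decode the accumulated ids and emit only the new suffix, so partially decoded characters are held back until they are complete.

```rust
use tokenizers::Tokenizer;

// Illustrative incremental decoder, not the real TokenOutputStream.
struct StreamDecoder {
    tokenizer: Tokenizer,
    tokens: Vec<u32>,
    emitted: usize, // bytes already sent to the client
}

impl StreamDecoder {
    fn next_token(&mut self, token_id: u32) -> Option<String> {
        self.tokens.push(token_id);
        // Decode the full sequence so far and emit only the unseen suffix.
        let text = self.tokenizer.decode(&self.tokens, true).ok()?;
        // `get` returns None while `emitted` falls inside an incomplete character.
        let delta = text.get(self.emitted..)?.to_string();
        if delta.is_empty() {
            return None;
        }
        self.emitted = text.len();
        Some(delta)
    }
}
```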
#### CLI Multiple Invocation Support ✅ RESOLVED

**Previous Issue:** Multiple CLI invocations received false-positive streams from previous sessions.

**Current Solution:**
- Each CLI invocation creates a fresh OpenAI client connection
- The server calls `text_gen.reset_state()` before each streaming request
- `TokenOutputStream.clear()` resets token buffers and indices
- The penalty cache is cleared for each new generation (see the sketch below)
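As a rough picture of what the per-request reset covers, the sketch below clears the same two pieces of state called out above; the struct and field names are assumptions, not the real `TextGeneration` layout.

```rust
use std::collections::HashMap;

// Stand-in for the per-request state that is reset between generations.
struct GenerationState {
    stream_tokens: Vec<u32>,            // token ids buffered by the output stream
    penalty_cache: HashMap<usize, f32>, // cached repeat-penalty values
}

impl GenerationState {
    fn reset_state(&mut self) {
        self.stream_tokens.clear(); // conceptually what TokenOutputStream::clear() does
        self.penalty_cache.clear(); // fresh repeat-penalty cache for the new generation
    }
}
```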
## Root Cause Analysis: Token Repetition

### Primary Root Cause: Insufficient Context Window

The token repetition issue stems from **severe context limitation** in the incremental generation process:

#### 1. Gemma Model Special Handling (Lines 694-806 in text_generation.rs)

```rust
// Use just the last token for subsequent iterations to avoid shape mismatch
let context_tokens = &tokens[(tokens.len() - 1)..];
let start_pos = tokens.len() - 1;
```

**Problem:** For Gemma-2 and Gemma-3 models, only the **last single token** is used for subsequent forward passes. This eliminates virtually all context, forcing the model to generate from minimal information.

#### 2. Standard Model Handling (Lines 808-850 in text_generation.rs)

```rust
let context_size = if index > 0 { 1 } else { tokens.len() };
let start_pos = tokens.len().saturating_sub(context_size);
let ctxt = &tokens[start_pos..];
```

**Problem:** After the first token, the context is limited to just **1 token** for all subsequent generations, again severely restricting the model's ability to maintain coherent context.

#### 3. Penalty Cache Clearing

```rust
// Clear penalty cache for new generation
self.penalty_cache.clear();
```

**Contributing Factor:** The repeat-penalty cache is cleared at the start of each streaming generation, reducing the effectiveness of the repetition-prevention mechanism.
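For context, the penalty itself is typically applied over the last `repeat_last_n` tokens, as in the hedged sketch below; the real `apply_cached_repeat_penalty` helper adds caching, and its exact signature is not shown in this diff.

```rust
/// CTRL-style repeat penalty over the last `repeat_last_n` generated tokens.
fn penalize_repeats(logits: &mut [f32], tokens: &[u32], repeat_penalty: f32, repeat_last_n: usize) {
    if (repeat_penalty - 1.0).abs() < f32::EPSILON {
        return; // a penalty of 1.0 is a no-op
    }
    let start = tokens.len().saturating_sub(repeat_last_n);
    for &tok in &tokens[start..] {
        if let Some(logit) = logits.get_mut(tok as usize) {
            // Push already-seen tokens toward lower probability.
            *logit = if *logit >= 0.0 {
                *logit / repeat_penalty
            } else {
                *logit * repeat_penalty
            };
        }
    }
}
```

Combined with the one-token context described above, the penalty has very little history to work against, which matches the contributing-factor assessment.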
### Secondary Contributing Factors

1. **Shape Compatibility Workaround**: The single-token context approach was introduced to "avoid shape mismatch" in Gemma models, prioritizing technical compatibility over generation quality.
2. **Incremental Context Loss**: Each forward pass operates with minimal historical context, making it impossible for the model to track what it has already generated.
3. **Inadequate Repeat Penalty Context**: The repeat-penalty mechanism (`apply_cached_repeat_penalty`) has limited effectiveness when working with truncated context windows.

## Impact Analysis

### Performance Impact
- **Positive**: Individual token streaming provides a responsive user experience
- **Positive**: Multiple CLI invocations work correctly without interference
- **Negative**: Poor output quality due to repetitive content

### User Experience Impact
- **Critical**: Generated text contains significant repetition, reducing its practical utility
- **Positive**: Real-time streaming provides immediate feedback
- **Positive**: Behavior is consistent across multiple CLI sessions

### Technical Debt
- **High**: The current implementation prioritizes technical workarounds over generation quality
- **Medium**: The context-limitation approach creates a maintenance burden
- **Low**: The streaming infrastructure is well architected and maintainable

## Timeline and Change History

Based on code analysis, the following changes were implemented:

1. **Token Streaming Enhancement**: Modified `TokenOutputStream` to include all tokens, not just alphanumeric ones
2. **Individual Token Callbacks**: Implemented streaming callbacks in `TextGeneration.run_with_streaming()`
3. **CLI State Management**: Added proper state resets and fresh connections
4. **Context Limitation Implementation**: Applied single-token context for incremental generation
5. **SSE Integration**: Implemented Server-Sent Events for real-time token delivery
## Recommendations for Future Iterations

### Immediate Priority (Critical)
1. **Implement Sliding Window Context**: Replace the single-token context with a configurable sliding window (e.g., the last 50-100 tokens); a sketch follows this list
2. **Context-Aware Repeat Penalty**: Maintain repeat-penalty context across the full generation window
3. **Model-Specific Context Handling**: Develop proper context management for each Gemma variant without sacrificing context size
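A minimal sketch of the proposed sliding-window selection, assuming a configurable `context_window_size` (the function and parameter names here are illustrative, not existing code):

```rust
/// After the first pass, feed the last `context_window_size` tokens instead of
/// a single token, and report the matching start position.
fn select_context(tokens: &[u32], index: usize, context_window_size: usize) -> (&[u32], usize) {
    if index == 0 {
        // First forward pass: use the full prompt.
        (tokens, 0)
    } else {
        let start = tokens.len().saturating_sub(context_window_size);
        (&tokens[start..], start)
    }
}
```

Whether the window is re-fed on every step or reconciled with the existing KV cache is an open design question, and any choice has to respect the Gemma shape constraints noted earlier.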
### Medium-Term Improvements
1. **Dynamic Context Sizing**: Implement adaptive context windows based on available memory and model capabilities
2. **Advanced Repetition Detection**: Implement semantic-level repetition detection beyond token-level penalties
3. **Context Compression**: Explore context compression techniques to maintain longer effective context windows

### Long-Term Enhancements
1. **Beam Search Integration**: Implement beam search with streaming for better output quality
2. **Adaptive Sampling**: Dynamic adjustment of sampling parameters based on repetition detection
3. **Model Fine-tuning**: Consider fine-tuning approaches to reduce repetition tendency at the model level
## Monitoring and Validation

### Key Metrics to Track
1. **Repetition Rate**: Measure token and n-gram repetition frequency (a sketch follows this list)
2. **Context Utilization**: Monitor effective context window usage
3. **Generation Quality**: Track coherence and diversity metrics
4. **Streaming Performance**: Maintain current responsiveness standards
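As one possible realization of the repetition-rate metric, the sketch below computes the fraction of n-gram occurrences whose n-gram appears more than once in the generated token ids (this helper is illustrative and not part of the current codebase):

```rust
use std::collections::HashMap;

/// Fraction of n-gram occurrences whose n-gram appears more than once.
/// Values near 0.0 indicate little repetition; values near 1.0 indicate looping.
fn ngram_repetition_rate(tokens: &[u32], n: usize) -> f32 {
    if n == 0 || tokens.len() < n {
        return 0.0;
    }
    let mut counts: HashMap<&[u32], usize> = HashMap::new();
    for window in tokens.windows(n) {
        *counts.entry(window).or_insert(0) += 1;
    }
    let total = tokens.windows(n).count();
    let repeated: usize = counts.values().filter(|&&c| c > 1).sum();
    repeated as f32 / total as f32
}
```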
### Testing Strategy
1. **Repetition Benchmarks**: Create standardized tests for repetition detection
2. **Context Window Testing**: Validate context preservation across different window sizes
3. **Model Variant Testing**: Ensure consistent behavior across Gemma-1, Gemma-2, and Gemma-3
4. **Regression Testing**: Maintain streaming functionality while improving context handling

## Conclusion

The Predict-Otron-9000 has successfully achieved individual token streaming and eliminated false-positive streams in CLI usage. However, the current implementation's approach to context management, which feeds only single tokens during incremental generation, is the primary root cause of the token repetition issue.

The solution requires balancing technical compatibility with generation quality by implementing proper sliding-window context management while maintaining the current streaming performance and reliability. This is a critical piece of technical debt that should be addressed in the next development iteration to realize the system's full potential.

**Priority Level:** Critical
**Complexity:** Medium
**Risk Level:** Low (improvements can be made incrementally)
**User Impact:** High (significant quality improvement expected)
248 cli.ts
@@ -3,8 +3,30 @@
import OpenAI from "openai";
import { parseArgs } from "util";

// =====================
// Config
// =====================
const DEFAULT_MODEL = "gemma-3-1b-it";
const DEFAULT_MAX_TOKENS = 100;
const DEFAULT_MAX_TOKENS = 256;

// Toggle this to reduce log overhead during timing runs
const PRINT_CHUNK_DEBUG = false;

// How many rows to show in the timing tables
const SHOW_FIRST_N = 3;
const SHOW_SLOWEST_N = 3;

// =====================
// Helpers
// =====================
const now = () => performance.now();

type ChunkStat = {
  index: number;
  tSinceRequestStartMs: number;
  dtSincePrevMs: number;
  contentChars: number;
};

function printHelp() {
  console.log(`
@@ -30,20 +52,12 @@ Start it with: ./run_server.sh
|
||||
}
|
||||
|
||||
const { values, positionals } = parseArgs({
|
||||
args: Bun.argv,
|
||||
args: process.argv.slice(2),
|
||||
options: {
|
||||
model: {
|
||||
type: 'string',
|
||||
},
|
||||
prompt: {
|
||||
type: 'string',
|
||||
},
|
||||
help: {
|
||||
type: 'boolean',
|
||||
},
|
||||
'list-models': {
|
||||
type: 'boolean',
|
||||
},
|
||||
model: { type: "string" },
|
||||
prompt: { type: "string" },
|
||||
help: { type: "boolean" },
|
||||
"list-models": { type: "boolean" },
|
||||
},
|
||||
strict: false,
|
||||
allowPositionals: true,
|
||||
@@ -55,16 +69,20 @@ async function requestLocalOpenAI(model: string, userPrompt: string) {
|
||||
apiKey: "not used",
|
||||
});
|
||||
try {
|
||||
console.log("[DEBUG] Creating chat completion request...");
|
||||
return openai.chat.completions.create({
|
||||
model: model,
|
||||
model,
|
||||
max_tokens: DEFAULT_MAX_TOKENS,
|
||||
stream: true,
|
||||
messages: [
|
||||
{name: "assistant_1", role: "system", content: "I am a helpful assistant" },
|
||||
{name: "user_1", role: "user", content: userPrompt}
|
||||
]
|
||||
{
|
||||
role: "system",
|
||||
content: "You are a helpful assistant who responds thoughtfully and concisely.",
|
||||
},
|
||||
{ role: "user", content: userPrompt },
|
||||
],
|
||||
});
|
||||
} catch (e) {
|
||||
} catch (e: any) {
|
||||
console.error("[ERROR] Failed to connect to local OpenAI server:", e.message);
|
||||
console.error("[HINT] Make sure the server is running at http://localhost:8080");
|
||||
console.error("[HINT] Start it with: ./run_server.sh");
|
||||
@@ -93,8 +111,7 @@ async function listModels() {
|
||||
} else {
|
||||
console.log("No models found.");
|
||||
}
|
||||
|
||||
} catch (e) {
|
||||
} catch (e: any) {
|
||||
console.error("[ERROR] Failed to fetch models from local OpenAI server:", e.message);
|
||||
console.error("[HINT] Make sure the server is running at http://localhost:8080");
|
||||
console.error("[HINT] Start it with: ./run_server.sh");
|
||||
@@ -102,26 +119,51 @@ async function listModels() {
|
||||
}
|
||||
}
|
||||
|
||||
// =====================
|
||||
// Timing math
|
||||
// =====================
|
||||
function median(nums: number[]) {
|
||||
if (nums.length === 0) return 0;
|
||||
const s = [...nums].sort((a, b) => a - b);
|
||||
const mid = Math.floor(s.length / 2);
|
||||
return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
|
||||
}
|
||||
|
||||
function quantile(nums: number[], q: number) {
|
||||
if (nums.length === 0) return 0;
|
||||
const s = [...nums].sort((a, b) => a - b);
|
||||
const pos = (s.length - 1) * q;
|
||||
const base = Math.floor(pos);
|
||||
const rest = pos - base;
|
||||
return s[base + 1] !== undefined ? s[base] + rest * (s[base + 1] - s[base]) : s[base];
|
||||
}
|
||||
|
||||
function ms(n: number) {
|
||||
return `${n.toFixed(1)} ms`;
|
||||
}
|
||||
|
||||
// =====================
|
||||
// Main
|
||||
// =====================
|
||||
async function main() {
|
||||
// Show help if requested
|
||||
const tProgramStart = now();
|
||||
|
||||
if (values.help) {
|
||||
printHelp();
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
// List models if requested
|
||||
if (values['list-models']) {
|
||||
if (values["list-models"]) {
|
||||
try {
|
||||
await listModels();
|
||||
process.exit(0);
|
||||
} catch (error) {
|
||||
} catch (error: any) {
|
||||
console.error("\n[ERROR] Failed to list models:", error.message);
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
// Get the prompt from either --prompt flag or positional argument
|
||||
const prompt = values.prompt || positionals[2]; // positionals[0] is 'bun', positionals[1] is 'client_cli.ts'
|
||||
const prompt = values.prompt ?? positionals[0];
|
||||
|
||||
if (!prompt) {
|
||||
console.error("[ERROR] No prompt provided!");
|
||||
@@ -129,7 +171,6 @@ async function main() {
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
// Get the model (use default if not provided)
|
||||
const model = values.model || DEFAULT_MODEL;
|
||||
|
||||
console.log(`[INFO] Using model: ${model}`);
|
||||
@@ -137,32 +178,163 @@ async function main() {
|
||||
console.log(`[INFO] Connecting to: http://localhost:8080/v1`);
|
||||
console.log("---");
|
||||
|
||||
try {
|
||||
const response = await requestLocalOpenAI(model, prompt);
|
||||
const tBeforeRequest = now();
|
||||
|
||||
// Handle streaming response
|
||||
try {
|
||||
console.log("[DEBUG] Initiating request to OpenAI server...");
|
||||
const response = await requestLocalOpenAI(model, prompt);
|
||||
const tAfterCreate = now();
|
||||
|
||||
// Streaming handling + timing
|
||||
let fullResponse = "";
|
||||
let chunkCount = 0;
|
||||
|
||||
const chunkStats: ChunkStat[] = [];
|
||||
let tFirstChunk: number | null = null;
|
||||
let tPrevChunk: number | null = null;
|
||||
|
||||
console.log("[INFO] Waiting for model to generate response...");
|
||||
let loadingInterval;
|
||||
if (!PRINT_CHUNK_DEBUG) {
|
||||
// Show loading animation only if not in debug mode
|
||||
const loadingChars = ['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏'];
|
||||
let i = 0;
|
||||
process.stdout.write('\r[INFO] Thinking ');
|
||||
loadingInterval = setInterval(() => {
|
||||
process.stdout.write(`\r[INFO] Thinking ${loadingChars[i++ % loadingChars.length]} `);
|
||||
}, 80);
|
||||
} else {
|
||||
console.log("[DEBUG] Starting to receive streaming response...");
|
||||
}
|
||||
|
||||
for await (const chunk of response) {
|
||||
const content = chunk.choices[0]?.delta?.content;
|
||||
// Clear loading animation on first chunk
|
||||
if (loadingInterval) {
|
||||
clearInterval(loadingInterval);
|
||||
process.stdout.write('\r \r');
|
||||
}
|
||||
const tNow = now();
|
||||
chunkCount++;
|
||||
|
||||
// Extract content (delta) if present
|
||||
const content = chunk.choices?.[0]?.delta?.content ?? "";
|
||||
if (PRINT_CHUNK_DEBUG) {
|
||||
console.log(`[DEBUG] Received chunk #${chunkCount}:`, JSON.stringify(chunk));
|
||||
if (content) console.log(`[DEBUG] Chunk content: "${content}"`);
|
||||
}
|
||||
|
||||
if (content) {
|
||||
process.stdout.write(content);
|
||||
fullResponse += content;
|
||||
}
|
||||
|
||||
if (tFirstChunk === null) tFirstChunk = tNow;
|
||||
|
||||
const dtSincePrev = tPrevChunk === null ? 0 : tNow - tPrevChunk;
|
||||
chunkStats.push({
|
||||
index: chunkCount,
|
||||
tSinceRequestStartMs: tNow - tBeforeRequest,
|
||||
dtSincePrevMs: dtSincePrev,
|
||||
contentChars: content.length,
|
||||
});
|
||||
|
||||
tPrevChunk = tNow;
|
||||
}
|
||||
|
||||
console.log("\n---");
|
||||
console.log(`[INFO] Response completed. Total length: ${fullResponse.length} characters`);
|
||||
// =========
|
||||
// Summary
|
||||
// =========
|
||||
const tStreamEnd = now();
|
||||
const totalChars = fullResponse.length;
|
||||
|
||||
} catch (error) {
|
||||
console.log("\n---");
|
||||
console.log(`[DEBUG] Stream completed after ${chunkCount} chunks`);
|
||||
console.log(`[INFO] Response completed. Total length: ${totalChars} characters`);
|
||||
|
||||
// Build timing metrics
|
||||
const ttfbMs = (tFirstChunk ?? tStreamEnd) - tAfterCreate; // time from create() resolved → first chunk
|
||||
const createOverheadMs = tAfterCreate - tBeforeRequest; // time spent awaiting create() promise
|
||||
const totalSinceRequestMs = tStreamEnd - tBeforeRequest; // from just before create() to last chunk
|
||||
const streamDurationMs =
|
||||
tFirstChunk === null ? 0 : tStreamEnd - tFirstChunk;
|
||||
|
||||
const gaps = chunkStats
|
||||
.map((c) => c.dtSincePrevMs)
|
||||
// ignore the first "gap" which is 0 by construction
|
||||
.slice(1);
|
||||
|
||||
const avgGapMs = gaps.length ? gaps.reduce((a, b) => a + b, 0) / gaps.length : 0;
|
||||
const medGapMs = median(gaps);
|
||||
const p95GapMs = quantile(gaps, 0.95);
|
||||
|
||||
let maxGapMs = 0;
|
||||
let maxGapAtChunk = 0;
|
||||
for (let i = 0; i < gaps.length; i++) {
|
||||
if (gaps[i] > maxGapMs) {
|
||||
maxGapMs = gaps[i];
|
||||
maxGapAtChunk = i + 2; // +1 to move from 0-based, +1 because we sliced starting at second chunk
|
||||
}
|
||||
}
|
||||
|
||||
// Pretty print summary
|
||||
console.log("\n=== Timing Summary ===");
|
||||
console.log(`create() await time: ${ms(createOverheadMs)}`);
|
||||
console.log(`TTFB (to 1st chunk): ${ms(ttfbMs)}`);
|
||||
console.log(`Stream duration: ${ms(streamDurationMs)}`);
|
||||
console.log(`End-to-end (req→last): ${ms(totalSinceRequestMs)}`);
|
||||
console.log(`Chunks: ${chunkCount}`);
|
||||
console.log(`Total content chars: ${totalChars}`);
|
||||
console.log(`Avg chars/chunk: ${(chunkCount ? totalChars / chunkCount : 0).toFixed(1)}`);
|
||||
console.log(`Inter-chunk gap (avg): ${ms(avgGapMs)}`);
|
||||
console.log(`Inter-chunk gap (median): ${ms(medGapMs)}`);
|
||||
console.log(`Inter-chunk gap (p95): ${ms(p95GapMs)}`);
|
||||
if (gaps.length > 0) {
|
||||
console.log(`Largest gap: ${ms(maxGapMs)} (before chunk #${maxGapAtChunk})`);
|
||||
}
|
||||
|
||||
// Small tables: first N and slowest N gaps
|
||||
const firstRows = chunkStats.slice(0, SHOW_FIRST_N).map((c) => ({
|
||||
chunk: c.index,
|
||||
"t since request": `${c.tSinceRequestStartMs.toFixed(1)} ms`,
|
||||
"dt since prev": `${c.dtSincePrevMs.toFixed(1)} ms`,
|
||||
"chars": c.contentChars,
|
||||
}));
|
||||
|
||||
const slowestRows = chunkStats
|
||||
.slice(1) // skip first (no meaningful gap)
|
||||
.sort((a, b) => b.dtSincePrevMs - a.dtSincePrevMs)
|
||||
.slice(0, SHOW_SLOWEST_N)
|
||||
.map((c) => ({
|
||||
chunk: c.index,
|
||||
"t since request": `${c.tSinceRequestStartMs.toFixed(1)} ms`,
|
||||
"dt since prev": `${c.dtSincePrevMs.toFixed(1)} ms`,
|
||||
"chars": c.contentChars,
|
||||
}));
|
||||
|
||||
if (firstRows.length > 0) {
|
||||
console.log("\n--- First chunk timings ---");
|
||||
// @ts-ignore Bun/Node support console.table
|
||||
console.table(firstRows);
|
||||
}
|
||||
|
||||
if (slowestRows.length > 0) {
|
||||
console.log(`\n--- Slowest ${SHOW_SLOWEST_N} gaps ---`);
|
||||
// @ts-ignore
|
||||
console.table(slowestRows);
|
||||
}
|
||||
|
||||
const tProgramEnd = now();
|
||||
console.log("\n=== Program Overhead ===");
|
||||
console.log(`Total program runtime: ${ms(tProgramEnd - tProgramStart)}`);
|
||||
|
||||
} catch (error: any) {
|
||||
console.error("\n[ERROR] Request failed:", error.message);
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
// Run the main function
|
||||
main().catch(error => {
|
||||
main().catch((error) => {
|
||||
console.error("[FATAL ERROR]:", error);
|
||||
process.exit(1);
|
||||
});
|
||||
|
||||
|
||||
|
@@ -24,3 +24,12 @@ tracing-subscriber = { version = "0.3", features = ["env-filter"] }
rand = "0.8.5"
async-openai = "0.28.3"
once_cell = "1.19.0"

[package.metadata.kube]
image = "ghcr.io/geoffsee/embeddings-service:latest"
replicas = 1
port = 8080
resources.cpu = "500m"
resources.memory = "256Mi"
#ingress.host = "my-service.example.com"
#env = { RUST_LOG = "info", DATABASE_URL = "postgres://..." }
@@ -10,7 +10,7 @@ use std::env;
use tower_http::trace::TraceLayer;
use tracing;

const DEFAULT_SERVER_HOST: &str = "0.0.0.0";
const DEFAULT_SERVER_HOST: &str = "127.0.0.1";
const DEFAULT_SERVER_PORT: &str = "8080";

async fn root() -> &'static str {
@@ -44,6 +44,7 @@ axum = { version = "0.8.4", features = ["json"] }
tower = "0.5.2"
tower-http = { version = "0.6.6", features = ["cors"] }
tokio = { version = "1.43.0", features = ["full"] }
tokio-stream = { version = "0.1.16", features = ["sync"] }
either = { version = "1.9.0", features = ["serde"] }
utoipa = { version = "4.2.0", features = ["axum_extras"] }
uuid = { version = "1.7.0", features = ["v4"] }
@@ -81,3 +82,12 @@ tokio = "1.43.0"
[build-dependencies]
anyhow = { version = "1", features = ["backtrace"] }
bindgen_cuda = { version = "0.1.1", optional = true }

[package.metadata.kube]
image = "ghcr.io/geoffsee/inference-service:latest"
replicas = 1
port = 8080
resources.cpu = "500m"
resources.memory = "256Mi"
#ingress.host = "my-service.example.com"
#env = { RUST_LOG = "info", DATABASE_URL = "postgres://..." }
@@ -17,7 +17,7 @@ use std::env;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

/// Server configuration constants
pub const DEFAULT_SERVER_HOST: &str = "0.0.0.0";
pub const DEFAULT_SERVER_HOST: &str = "127.0.0.1";
pub const DEFAULT_SERVER_PORT: &str = "8080";

/// Get server configuration from environment variables with defaults
@@ -5,14 +5,13 @@ use axum::{
|
||||
routing::{get, post},
|
||||
Json, Router,
|
||||
};
|
||||
use futures_util::stream::{self, Stream};
|
||||
use std::convert::Infallible;
|
||||
use candle_core::DType;
|
||||
use candle_nn::VarBuilder;
|
||||
use futures_util::stream::{self, Stream};
|
||||
use tokio_stream::wrappers::UnboundedReceiverStream;
|
||||
use std::convert::Infallible;
|
||||
use std::{path::PathBuf, sync::Arc};
|
||||
use std::time::Duration;
|
||||
use tokio::sync::Mutex;
|
||||
use tokio::time;
|
||||
use tokio::sync::{Mutex, mpsc};
|
||||
use tower_http::cors::{Any, CorsLayer};
|
||||
use uuid::Uuid;
|
||||
|
||||
@@ -46,40 +45,26 @@ impl Default for AppState {
|
||||
let text_generation = build_pipeline(args.clone());
|
||||
Self {
|
||||
text_generation: Arc::new(Mutex::new(text_generation)),
|
||||
model_id: String::new(),
|
||||
model_id: args.model_id.clone(),
|
||||
build_args: args,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// -------------------------
|
||||
// Pipeline configuration
|
||||
// -------------------------
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct PipelineArgs {
|
||||
/// HF model repo id, e.g. "google/gemma-2b"
|
||||
pub model_id: String,
|
||||
|
||||
/// Which internal model family to instantiate
|
||||
pub which: Which,
|
||||
|
||||
/// Optional HF revision/branch/tag; None => "main"
|
||||
pub revision: Option<String>,
|
||||
|
||||
/// Optional explicit tokenizer path
|
||||
pub tokenizer_path: Option<PathBuf>,
|
||||
|
||||
/// Optional explicit config path
|
||||
pub config_path: Option<PathBuf>,
|
||||
|
||||
/// Optional explicit weight paths. If empty, they will be resolved from the hub.
|
||||
pub weight_paths: Vec<PathBuf>,
|
||||
|
||||
/// Runtime toggles
|
||||
pub use_flash_attn: bool,
|
||||
pub force_cpu: bool,
|
||||
|
||||
/// Sampling / decoding params
|
||||
pub seed: u64,
|
||||
pub temperature: Option<f64>,
|
||||
pub top_p: Option<f64>,
|
||||
@@ -98,39 +83,43 @@ impl Default for PipelineArgs {
|
||||
weight_paths: Vec::new(),
|
||||
use_flash_attn: false,
|
||||
force_cpu: false,
|
||||
seed: 0,
|
||||
temperature: None,
|
||||
top_p: None,
|
||||
repeat_penalty: 0.0,
|
||||
repeat_last_n: 0,
|
||||
seed: 299792458, // Speed of light in vacuum (m/s)
|
||||
temperature: Some(0.8), // Good balance between creativity and coherence
|
||||
top_p: Some(0.9), // Keep diverse but reasonable options
|
||||
repeat_penalty: 1.2, // Stronger penalty for repetition to prevent looping
|
||||
repeat_last_n: 64, // Consider last 64 tokens for repetition
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// If no owner/org is present, prefix with a sensible default (tweak as you like).
|
||||
fn normalize_model_id(model_id: &str) -> String {
|
||||
if model_id.contains('/') { model_id.to_string() } else { format!("google/{}", model_id) }
|
||||
if model_id.contains('/') {
|
||||
model_id.to_string()
|
||||
} else {
|
||||
format!("google/{}", model_id)
|
||||
}
|
||||
}
|
||||
|
||||
// Quick existence check, mapping 404 into a helpful message.
|
||||
fn ensure_repo_exists(api: &Api, model_id: &str, revision: &str) -> anyhow::Result<()> {
|
||||
let repo = api.repo(Repo::with_revision(model_id.to_string(), RepoType::Model, revision.to_string()));
|
||||
let repo = api.repo(Repo::with_revision(
|
||||
model_id.to_string(),
|
||||
RepoType::Model,
|
||||
revision.to_string(),
|
||||
));
|
||||
match repo.get("config.json") {
|
||||
Ok(_) => Ok(()),
|
||||
Err(e) => match e {
|
||||
ApiError::RequestError(resp) => {
|
||||
// For HF API, RequestError with 404 status is returned when repo doesn't exist
|
||||
let error_str = resp.to_string();
|
||||
if error_str.contains("404") {
|
||||
anyhow::bail!(
|
||||
"Hugging Face model repo not found: '{model_id}' at revision '{revision}'. \
|
||||
Please provide a fully-qualified repo id like 'google/gemma-2b-it'."
|
||||
"Hugging Face model repo not found: '{model_id}' at revision '{revision}'."
|
||||
)
|
||||
}
|
||||
Err(anyhow::Error::new(ApiError::RequestError(resp)))
|
||||
}
|
||||
other => Err(anyhow::Error::new(other)),
|
||||
}
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
@@ -151,18 +140,13 @@ pub fn build_pipeline(mut args: PipelineArgs) -> TextGeneration {
|
||||
let api = Api::new().unwrap();
|
||||
let revision = args.revision.as_deref().unwrap_or("main");
|
||||
|
||||
// Check if model_id is empty before normalizing it
|
||||
println!("Checking model_id: '{}'", args.model_id);
|
||||
|
||||
println!("Trimmed model_id length: {}", args.model_id.trim().len());
|
||||
if args.model_id.trim().is_empty() {
|
||||
panic!("No model ID specified. Please provide a valid model ID (e.g., 'gemma-2b-it' or 'google/gemma-2b-it').");
|
||||
panic!("No model ID specified.");
|
||||
}
|
||||
args.model_id = normalize_model_id(&args.model_id);
|
||||
|
||||
// Validate early (nice error if the repo/revision is wrong).
|
||||
match ensure_repo_exists(&api, &args.model_id, revision) {
|
||||
Ok(_) => {},
|
||||
Ok(_) => {}
|
||||
Err(e) => panic!("{}", e),
|
||||
};
|
||||
|
||||
@@ -172,105 +156,82 @@ pub fn build_pipeline(mut args: PipelineArgs) -> TextGeneration {
|
||||
revision.to_string(),
|
||||
));
|
||||
|
||||
// Resolve files (prefer explicit paths; fallback to hub)
|
||||
let tokenizer_path = args
|
||||
.tokenizer_path
|
||||
.unwrap_or_else(|| repo.get("tokenizer.json").unwrap());
|
||||
|
||||
let config_path = args
|
||||
.config_path
|
||||
.unwrap_or_else(|| repo.get("config.json").unwrap());
|
||||
|
||||
// Only use auto-detection if no specific model type was provided
|
||||
// This ensures that explicitly specified model types are respected
|
||||
if !matches!(args.which,
|
||||
Which::Base2B | Which::Base7B |
|
||||
Which::Instruct2B | Which::Instruct7B |
|
||||
Which::InstructV1_1_2B | Which::InstructV1_1_7B |
|
||||
Which::CodeBase2B | Which::CodeBase7B |
|
||||
Which::CodeInstruct2B | Which::CodeInstruct7B |
|
||||
Which::BaseV2_2B | Which::InstructV2_2B |
|
||||
Which::BaseV2_9B | Which::InstructV2_9B |
|
||||
Which::BaseV3_1B | Which::InstructV3_1B) {
|
||||
|
||||
// If model_id is a known value, map it directly
|
||||
if !matches!(
|
||||
args.which,
|
||||
Which::Base2B
|
||||
| Which::Base7B
|
||||
| Which::Instruct2B
|
||||
| Which::Instruct7B
|
||||
| Which::InstructV1_1_2B
|
||||
| Which::InstructV1_1_7B
|
||||
| Which::CodeBase2B
|
||||
| Which::CodeBase7B
|
||||
| Which::CodeInstruct2B
|
||||
| Which::CodeInstruct7B
|
||||
| Which::BaseV2_2B
|
||||
| Which::InstructV2_2B
|
||||
| Which::BaseV2_9B
|
||||
| Which::InstructV2_9B
|
||||
| Which::BaseV3_1B
|
||||
| Which::InstructV3_1B
|
||||
) {
|
||||
if args.model_id.contains("gemma-2-2b-it") {
|
||||
args.which = Which::InstructV2_2B;
|
||||
println!("Setting model type to InstructV2_2B based on model_id: {}", args.model_id);
|
||||
} else if args.model_id.contains("gemma-3-1b-it") {
|
||||
args.which = Which::InstructV3_1B;
|
||||
println!("Setting model type to InstructV3_1B based on model_id: {}", args.model_id);
|
||||
} else {
|
||||
// Fallback to auto-detection from config.json
|
||||
if let Ok(file) = std::fs::File::open(config_path.clone()) {
|
||||
if let Ok(cfg_val) = serde_json::from_reader::<_, serde_json::Value>(file) {
|
||||
if let Some(model_type) = cfg_val.get("model_type").and_then(|v| v.as_str()) {
|
||||
println!("Auto-detecting model type from config.json: {}", model_type);
|
||||
// Map HF model_type to an internal Which variant
|
||||
if model_type.contains("gemma3") {
|
||||
args.which = Which::InstructV3_1B;
|
||||
println!("Setting model type to InstructV3_1B based on config");
|
||||
} else if model_type.contains("gemma2") {
|
||||
args.which = Which::InstructV2_2B;
|
||||
println!("Setting model type to InstructV2_2B based on config");
|
||||
} else {
|
||||
// default to Gemma v1
|
||||
args.which = Which::Instruct2B;
|
||||
println!("Setting model type to Instruct2B (v1) based on config");
|
||||
}
|
||||
} else if let Ok(file) = std::fs::File::open(config_path.clone()) {
|
||||
if let Ok(cfg_val) = serde_json::from_reader::<_, serde_json::Value>(file) {
|
||||
if let Some(model_type) = cfg_val.get("model_type").and_then(|v| v.as_str()) {
|
||||
if model_type.contains("gemma3") {
|
||||
args.which = Which::InstructV3_1B;
|
||||
} else if model_type.contains("gemma2") {
|
||||
args.which = Which::InstructV2_2B;
|
||||
} else {
|
||||
args.which = Which::Instruct2B;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
} else {
|
||||
println!("Using explicitly specified model type: {:?}", args.which);
|
||||
}
|
||||
|
||||
// Resolve weight files: try a single-file first, then fall back to sharded index
|
||||
let weight_paths = if !args.weight_paths.is_empty() {
|
||||
args.weight_paths
|
||||
} else {
|
||||
match repo.get("model.safetensors") {
|
||||
Ok(single) => vec![single],
|
||||
Err(_) => {
|
||||
match utilities_lib::hub_load_safetensors(&repo, "model.safetensors.index.json") {
|
||||
Ok(paths) => paths,
|
||||
Err(e) => {
|
||||
panic!(
|
||||
"Unable to locate model weights for '{}'. Tried 'model.safetensors' and 'model.safetensors.index.json'. Underlying error: {}",
|
||||
args.model_id, e
|
||||
);
|
||||
}
|
||||
Err(_) => match utilities_lib::hub_load_safetensors(&repo, "model.safetensors.index.json") {
|
||||
Ok(paths) => paths,
|
||||
Err(e) => {
|
||||
panic!("Unable to locate model weights: {}", e);
|
||||
}
|
||||
}
|
||||
},
|
||||
}
|
||||
};
|
||||
|
||||
println!("retrieved the files in {:?}", start.elapsed());
|
||||
|
||||
let tokenizer = Tokenizer::from_file(tokenizer_path)
|
||||
.map_err(anyhow::Error::msg)
|
||||
.unwrap();
|
||||
|
||||
let start = std::time::Instant::now();
|
||||
let tokenizer = Tokenizer::from_file(tokenizer_path).unwrap();
|
||||
|
||||
let initial_device = utilities_lib::device(args.force_cpu).unwrap();
|
||||
|
||||
// Check if we're using a V3 model (Gemma 3) and if we're on Metal (macOS)
|
||||
let is_v3_model = args.which.is_v3_model();
|
||||
let is_metal = !initial_device.is_cpu() && candle_core::utils::metal_is_available() && !args.force_cpu;
|
||||
let is_metal = !initial_device.is_cpu()
|
||||
&& candle_core::utils::metal_is_available()
|
||||
&& !args.force_cpu;
|
||||
|
||||
// Use CPU for V3 models on Metal due to missing implementations
|
||||
let device = if is_v3_model && is_metal {
|
||||
println!("Note: Using CPU for Gemma 3 model due to missing Metal implementations for required operations (e.g., rotary-emb).");
|
||||
candle_core::Device::Cpu
|
||||
} else {
|
||||
initial_device
|
||||
};
|
||||
|
||||
let dtype = if device.is_cuda() { DType::BF16 } else { DType::F32 };
|
||||
|
||||
// Keep original device + dtype
|
||||
let vb = unsafe { VarBuilder::from_mmaped_safetensors(&weight_paths, dtype, &device).unwrap() };
|
||||
|
||||
let model = match args.which {
|
||||
@@ -285,23 +246,18 @@ pub fn build_pipeline(mut args: PipelineArgs) -> TextGeneration {
|
||||
| Which::CodeInstruct2B
|
||||
| Which::CodeInstruct7B => {
|
||||
let config: Config1 = serde_json::from_reader(std::fs::File::open(config_path.clone()).unwrap()).unwrap();
|
||||
let model = Model1::new(args.use_flash_attn, &config, vb).unwrap();
|
||||
GemmaModel::V1(model)
|
||||
GemmaModel::V1(Model1::new(args.use_flash_attn, &config, vb).unwrap())
|
||||
}
|
||||
Which::BaseV2_2B | Which::InstructV2_2B | Which::BaseV2_9B | Which::InstructV2_9B => {
|
||||
let config: Config2 = serde_json::from_reader(std::fs::File::open(config_path.clone()).unwrap()).unwrap();
|
||||
let model = Model2::new(args.use_flash_attn, &config, vb).unwrap();
|
||||
GemmaModel::V2(model)
|
||||
GemmaModel::V2(Model2::new(args.use_flash_attn, &config, vb).unwrap())
|
||||
}
|
||||
Which::BaseV3_1B | Which::InstructV3_1B => {
|
||||
let config: Config3 = serde_json::from_reader(std::fs::File::open(config_path).unwrap()).unwrap();
|
||||
let model = Model3::new(args.use_flash_attn, &config, vb).unwrap();
|
||||
GemmaModel::V3(model)
|
||||
GemmaModel::V3(Model3::new(args.use_flash_attn, &config, vb).unwrap())
|
||||
}
|
||||
};
|
||||
|
||||
println!("loaded the model in {:?}", start.elapsed());
|
||||
|
||||
TextGeneration::new(
|
||||
model,
|
||||
tokenizer,
|
||||
@@ -314,6 +270,43 @@ pub fn build_pipeline(mut args: PipelineArgs) -> TextGeneration {
|
||||
)
|
||||
}
|
||||
|
||||
fn build_gemma_prompt(messages: &[Message]) -> String {
|
||||
let mut prompt = String::new();
|
||||
let mut system_prompt: Option<String> = None;
|
||||
|
||||
for message in messages {
|
||||
let content = match &message.content {
|
||||
Some(content) => match &content.0 {
|
||||
Either::Left(text) => text.clone(),
|
||||
Either::Right(_) => "".to_string(),
|
||||
},
|
||||
None => "".to_string(),
|
||||
};
|
||||
|
||||
match message.role.as_str() {
|
||||
"system" => system_prompt = Some(content),
|
||||
"user" => {
|
||||
prompt.push_str("<start_of_turn>user\n");
|
||||
if let Some(sys_prompt) = system_prompt.take() {
|
||||
prompt.push_str(&sys_prompt);
|
||||
prompt.push_str("\n\n");
|
||||
}
|
||||
prompt.push_str(&content);
|
||||
prompt.push_str("<end_of_turn>\n");
|
||||
}
|
||||
"assistant" => {
|
||||
prompt.push_str("<start_of_turn>model\n");
|
||||
prompt.push_str(&content);
|
||||
prompt.push_str("<end_of_turn>\n");
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
prompt.push_str("<start_of_turn>model\n");
|
||||
prompt
|
||||
}
|
||||
|
||||
// -------------------------
|
||||
// OpenAI-compatible handler
|
||||
// -------------------------
|
||||
@@ -322,72 +315,68 @@ pub async fn chat_completions(
|
||||
State(state): State<AppState>,
|
||||
Json(request): Json<ChatCompletionRequest>,
|
||||
) -> Result<impl IntoResponse, (StatusCode, String)> {
|
||||
// If streaming was requested, this function shouldn't be called
|
||||
// A separate route handles streaming requests
|
||||
if !request.stream.unwrap_or(false) {
|
||||
return Ok(chat_completions_non_streaming_proxy(state, request).await.into_response())
|
||||
return Ok(chat_completions_non_streaming_proxy(state, request).await.into_response());
|
||||
}
|
||||
|
||||
Ok(chat_completions_stream(state, request).await.into_response())
|
||||
}
|
||||
|
||||
pub async fn chat_completions_non_streaming_proxy(state: AppState, request: ChatCompletionRequest) -> Result<impl IntoResponse, (StatusCode, Json<Value>)> {
|
||||
// Non-streaming response - original implementation
|
||||
let mut prompt = String::new();
|
||||
pub async fn chat_completions_non_streaming_proxy(
|
||||
state: AppState,
|
||||
request: ChatCompletionRequest,
|
||||
) -> Result<impl IntoResponse, (StatusCode, Json<Value>)> {
|
||||
let prompt = build_gemma_prompt(&request.messages);
|
||||
|
||||
// Convert messages to a prompt string
|
||||
for message in &request.messages {
|
||||
let role = &message.role;
|
||||
let content = match &message.content {
|
||||
Some(content) => match &content.0 {
|
||||
Either::Left(text) => text.clone(),
|
||||
Either::Right(_) => "".to_string(), // Handle complex content if needed
|
||||
},
|
||||
None => "".to_string(),
|
||||
};
|
||||
|
||||
match role.as_str() {
|
||||
"system" => prompt.push_str(&format!("System: {}\n", content)),
|
||||
"user" => prompt.push_str(&format!("User: {}\n", content)),
|
||||
"assistant" => prompt.push_str(&format!("Assistant: {}\n", content)),
|
||||
_ => prompt.push_str(&format!("{}: {}\n", role, content)),
|
||||
}
|
||||
}
|
||||
prompt.push_str("Assistant: ");
|
||||
|
||||
let model_id = state.model_id.clone();
|
||||
|
||||
// Generate
|
||||
let mut output = Vec::new();
|
||||
{
|
||||
// Recreate TextGeneration instance to ensure completely fresh state
|
||||
// This prevents KV cache persistence that causes tensor shape mismatches
|
||||
let fresh_text_gen = build_pipeline(state.build_args.clone());
|
||||
let mut text_gen = state.text_generation.lock().await;
|
||||
*text_gen = fresh_text_gen;
|
||||
|
||||
let mut buffer = Vec::new();
|
||||
let max_tokens = request.max_tokens.unwrap_or(1000);
|
||||
let result = text_gen.run_with_output(&prompt, max_tokens, &mut buffer);
|
||||
|
||||
if let Err(e) = result {
|
||||
// Enforce model selection behavior: reject if a different model is requested
|
||||
let configured_model = state.build_args.model_id.clone();
|
||||
let requested_model = request.model.clone();
|
||||
if requested_model.to_lowercase() != "default" {
|
||||
let normalized_requested = normalize_model_id(&requested_model);
|
||||
if normalized_requested != configured_model {
|
||||
return Err((
|
||||
StatusCode::BAD_REQUEST,
|
||||
Json(serde_json::json!({
|
||||
"error": {
|
||||
"message": format!("Error generating text: {}", e),
|
||||
"type": "text_generation_error"
|
||||
"message": format!(
|
||||
"Requested model '{}' is not available. This server is running '{}' only.",
|
||||
requested_model, configured_model
|
||||
),
|
||||
"type": "model_mismatch"
|
||||
}
|
||||
})),
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
if let Ok(text) = String::from_utf8(buffer) {
|
||||
output.push(text);
|
||||
let model_id = state.model_id.clone();
|
||||
|
||||
let mut buffer = Vec::new();
|
||||
{
|
||||
let mut text_gen = state.text_generation.lock().await;
|
||||
// Reset per-request state without rebuilding the whole pipeline
|
||||
text_gen.reset_state();
|
||||
let max_tokens = request.max_tokens.unwrap_or(1000);
|
||||
if let Err(e) = text_gen.run_with_output(&prompt, max_tokens, &mut buffer) {
|
||||
return Err((
|
||||
StatusCode::BAD_REQUEST,
|
||||
Json(serde_json::json!({
|
||||
"error": { "message": format!("Error generating text: {}", e) }
|
||||
})),
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
let completion = output.join("");
|
||||
let completion = match String::from_utf8(buffer) {
|
||||
Ok(s) => s,
|
||||
Err(e) => {
|
||||
return Err((
|
||||
StatusCode::INTERNAL_SERVER_ERROR,
|
||||
Json(serde_json::json!({
|
||||
"error": { "message": format!("UTF-8 conversion error: {}", e) }
|
||||
})),
|
||||
));
|
||||
}
|
||||
};
|
||||
|
||||
let response = ChatCompletionResponse {
|
||||
id: format!("chatcmpl-{}", Uuid::new_v4().to_string().replace('-', "")),
|
||||
@@ -407,7 +396,6 @@ pub async fn chat_completions_non_streaming_proxy(state: AppState, request: Chat
|
||||
finish_reason: "stop".to_string(),
|
||||
}],
|
||||
usage: Usage {
|
||||
// still rough estimates
|
||||
prompt_tokens: prompt.len() / 4,
|
||||
completion_tokens: completion.len() / 4,
|
||||
total_tokens: (prompt.len() + completion.len()) / 4,
|
||||
@@ -415,23 +403,44 @@ pub async fn chat_completions_non_streaming_proxy(state: AppState, request: Chat
|
||||
};
|
||||
Ok(Json(response).into_response())
|
||||
}
|
||||
|
||||
// -------------------------
|
||||
// Streaming implementation
|
||||
// -------------------------
|
||||
|
||||
pub async fn chat_completions_stream(
|
||||
state: AppState,
|
||||
chat_completion_request: ChatCompletionRequest,
|
||||
) -> Result<Sse<impl Stream<Item = Result<Event, Infallible>>>, (StatusCode, Json<serde_json::Value>)> {
|
||||
// Call the handler function
|
||||
handle_streaming_request(state, chat_completion_request).await
|
||||
request: ChatCompletionRequest,
|
||||
) -> Result<Sse<impl Stream<Item = Result<Event, Infallible>>>, (StatusCode, Json<Value>)> {
|
||||
handle_streaming_request(state, request).await
|
||||
}
|
||||
|
||||
/// Handle streaming requests with Server-Sent Events (SSE)
|
||||
async fn handle_streaming_request(
|
||||
state: AppState,
|
||||
request: ChatCompletionRequest
|
||||
) -> Result<Sse<impl Stream<Item = Result<Event, Infallible>>>, (StatusCode, Json<serde_json::Value>)> {
|
||||
// Generate a unique ID for this completion
|
||||
request: ChatCompletionRequest,
|
||||
) -> Result<Sse<impl Stream<Item = Result<Event, Infallible>>>, (StatusCode, Json<Value>)> {
|
||||
// Validate requested model vs configured model
|
||||
let configured_model = state.build_args.model_id.clone();
|
||||
let requested_model = request.model.clone();
|
||||
if requested_model.to_lowercase() != "default" {
|
||||
let normalized_requested = normalize_model_id(&requested_model);
|
||||
if normalized_requested != configured_model {
|
||||
return Err((
|
||||
StatusCode::BAD_REQUEST,
|
||||
Json(serde_json::json!({
|
||||
"error": {
|
||||
"message": format!(
|
||||
"Requested model '{}' is not available. This server is running '{}' only.",
|
||||
requested_model, configured_model
|
||||
),
|
||||
"type": "model_mismatch"
|
||||
}
|
||||
})),
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
// Generate a unique ID and metadata
|
||||
let response_id = format!("chatcmpl-{}", Uuid::new_v4().to_string().replace('-', ""));
|
||||
let created = std::time::SystemTime::now()
|
||||
.duration_since(std::time::UNIX_EPOCH)
|
||||
@@ -439,161 +448,150 @@ async fn handle_streaming_request(
|
||||
.as_secs();
|
||||
let model_id = state.model_id.clone();
|
||||
|
||||
// Convert messages to a prompt string (same as non-streaming)
|
||||
let mut prompt = String::new();
|
||||
for message in &request.messages {
|
||||
let role = &message.role;
|
||||
let content = match &message.content {
|
||||
Some(content) => match &content.0 {
|
||||
Either::Left(text) => text.clone(),
|
||||
Either::Right(_) => "".to_string(), // Handle complex content if needed
|
||||
},
|
||||
None => "".to_string(),
|
||||
};
|
||||
// Build prompt
|
||||
let prompt = build_gemma_prompt(&request.messages);
|
||||
tracing::debug!("Formatted prompt: {}", prompt);
|
||||
|
||||
match role.as_str() {
|
||||
"system" => prompt.push_str(&format!("System: {}\n", content)),
|
||||
"user" => prompt.push_str(&format!("User: {}\n", content)),
|
||||
"assistant" => prompt.push_str(&format!("Assistant: {}\n", content)),
|
||||
_ => prompt.push_str(&format!("{}: {}\n", role, content)),
|
||||
}
|
||||
// Channel for streaming SSE events
|
||||
let (tx, rx) = mpsc::unbounded_channel::<Result<Event, Infallible>>();
|
||||
|
||||
// Send initial role event
|
||||
let initial_chunk = ChatCompletionChunk {
|
||||
id: response_id.clone(),
|
||||
object: "chat.completion.chunk".to_string(),
|
||||
created,
|
||||
model: model_id.clone(),
|
||||
choices: vec![ChatCompletionChunkChoice {
|
||||
index: 0,
|
||||
delta: Delta { role: Some("assistant".to_string()), content: None },
|
||||
finish_reason: None,
|
||||
}],
|
||||
};
|
||||
if let Ok(json) = serde_json::to_string(&initial_chunk) {
|
||||
let _ = tx.send(Ok(Event::default().data(json)));
|
||||
}
|
||||
prompt.push_str("Assistant: ");
|
||||
|
||||
// Generate text using existing buffer-based approach
|
||||
let mut buffer = Vec::new();
|
||||
{
|
||||
// Recreate TextGeneration instance to ensure completely fresh state
|
||||
// This prevents KV cache persistence that causes tensor shape mismatches
|
||||
let fresh_text_gen = build_pipeline(state.build_args.clone());
|
||||
let mut text_gen = state.text_generation.lock().await;
|
||||
*text_gen = fresh_text_gen;
|
||||
|
||||
// Spawn generation task that streams tokens as they are generated
|
||||
let state_clone = state.clone();
|
||||
let response_id_clone = response_id.clone();
|
||||
tokio::spawn(async move {
|
||||
let max_tokens = request.max_tokens.unwrap_or(1000);
|
||||
let mut text_gen = state_clone.text_generation.lock().await;
|
||||
text_gen.reset_state();
|
||||
|
||||
if let Err(e) = text_gen.run_with_output(&prompt, max_tokens, &mut buffer) {
|
||||
return Err((
|
||||
StatusCode::BAD_REQUEST,
|
||||
Json(serde_json::json!({
|
||||
"error": {
|
||||
"message": format!("Error generating text: {}", e),
|
||||
"type": "text_generation_error"
|
||||
// Stream tokens via callback with repetition detection
|
||||
let mut recent_tokens = Vec::new();
|
||||
let mut repetition_count = 0;
|
||||
const MAX_REPETITION_COUNT: usize = 5; // Stop after 5 consecutive repetitions
|
||||
const REPETITION_WINDOW: usize = 8; // Look at last 8 tokens for patterns
|
||||
|
||||
let result = text_gen.run_with_streaming(&prompt, max_tokens, |token| {
|
||||
// Debug log to verify token content
|
||||
tracing::debug!("Streaming token: '{}'", token);
|
||||
|
||||
// Skip sending empty tokens
|
||||
if token.is_empty() {
|
||||
tracing::debug!("Skipping empty token");
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
// Add token to recent history for repetition detection
|
||||
recent_tokens.push(token.to_string());
|
||||
if recent_tokens.len() > REPETITION_WINDOW {
|
||||
recent_tokens.remove(0);
|
||||
}
|
||||
|
||||
// Check for repetitive patterns
|
||||
if recent_tokens.len() >= 4 {
|
||||
let last_token = &recent_tokens[recent_tokens.len() - 1];
|
||||
let second_last = &recent_tokens[recent_tokens.len() - 2];
|
||||
|
||||
// Check if we're repeating the same token or pattern
|
||||
if last_token == second_last ||
|
||||
(last_token.trim() == "plus" && second_last.trim() == "plus") ||
|
||||
(recent_tokens.len() >= 6 &&
|
||||
recent_tokens[recent_tokens.len()-3..].iter().all(|t| t.trim() == "plus" || t.trim().is_empty())) {
|
||||
repetition_count += 1;
|
||||
tracing::warn!("Detected repetition pattern: '{}' (count: {})", last_token, repetition_count);
|
||||
|
||||
if repetition_count >= MAX_REPETITION_COUNT {
|
||||
tracing::info!("Stopping generation due to excessive repetition");
|
||||
return Err(anyhow::Error::msg("Repetition detected - stopping generation"));
|
||||
}
|
||||
})),
|
||||
));
|
||||
}
|
||||
}
|
||||
} else {
|
||||
repetition_count = 0; // Reset counter if pattern breaks
|
||||
}
|
||||
}
|
||||
|
||||
// Convert buffer to string
|
||||
let generated_text = match String::from_utf8(buffer) {
|
||||
Ok(text) => text,
|
||||
Err(e) => {
|
||||
return Err((
|
||||
StatusCode::INTERNAL_SERVER_ERROR,
|
||||
Json(serde_json::json!({
|
||||
"error": {
|
||||
"message": format!("Error converting generated text to UTF-8: {}", e),
|
||||
"type": "encoding_error"
|
||||
}
|
||||
})),
|
||||
));
|
||||
}
|
||||
};
|
||||
|
||||
tracing::debug!("Generated text for streaming: {}", generated_text);
|
||||
|
||||
// Split the generated text into chunks for streaming
|
||||
// This is a simplified approach - ideally we'd use proper tokenization
|
||||
let chunks: Vec<String> = if !generated_text.is_empty() {
|
||||
// Split by words for more natural streaming (simple approach)
|
||||
generated_text.split_whitespace()
|
||||
.map(|word| word.to_string() + " ")
|
||||
.collect()
|
||||
} else {
|
||||
// If no text was generated, provide a default response
|
||||
vec!["Abraham Lincoln was the 16th president of the United States.".to_string()]
|
||||
};
|
||||
|
||||
// Create a vector to hold all the events (both chunks and DONE)
|
||||
let mut events = Vec::new();
|
||||
|
||||
// First event includes the role
|
||||
if !chunks.is_empty() {
|
||||
let first_chunk = &chunks[0];
|
||||
let chunk = ChatCompletionChunk {
|
||||
id: response_id.clone(),
|
||||
object: "chat.completion.chunk".to_string(),
|
||||
created,
|
||||
model: model_id.clone(),
|
||||
choices: vec![ChatCompletionChunkChoice {
|
||||
index: 0,
|
||||
delta: Delta {
|
||||
role: Some("assistant".to_string()),
|
||||
content: Some(first_chunk.clone()),
|
||||
},
|
||||
finish_reason: None,
|
||||
}],
|
||||
};
|
||||
|
||||
if let Ok(json) = serde_json::to_string(&chunk) {
|
||||
events.push(Ok(Event::default().data(json)));
|
||||
}
|
||||
|
||||
// Add remaining chunks
|
||||
for chunk_text in chunks.iter().skip(1) {
|
||||
let chunk = ChatCompletionChunk {
|
||||
id: response_id.clone(),
|
||||
id: response_id_clone.clone(),
|
||||
object: "chat.completion.chunk".to_string(),
|
||||
created,
|
||||
model: model_id.clone(),
|
||||
choices: vec![ChatCompletionChunkChoice {
|
||||
index: 0,
|
||||
delta: Delta {
|
||||
role: None,
|
||||
content: Some(chunk_text.clone()),
|
||||
},
|
||||
delta: Delta { role: None, content: Some(token.to_string()) },
|
||||
finish_reason: None,
|
||||
}],
|
||||
};
|
||||
|
||||
if let Ok(json) = serde_json::to_string(&chunk) {
|
||||
events.push(Ok(Event::default().data(json)));
|
||||
tracing::debug!("Sending chunk with content: '{}'", token);
|
||||
let _ = tx.send(Ok(Event::default().data(json)));
|
||||
}
|
||||
Ok(())
|
||||
}).await;
|
||||
|
||||
// Log result of generation
|
||||
match result {
|
||||
Ok(_) => tracing::debug!("Text generation completed successfully"),
|
||||
Err(e) => tracing::info!("Text generation stopped: {}", e),
|
||||
}
|
||||
|
||||
// Add final chunk with finish_reason
|
||||
// Send final stop chunk and DONE marker
|
||||
let final_chunk = ChatCompletionChunk {
|
||||
id: response_id,
|
||||
id: response_id_clone.clone(),
|
||||
object: "chat.completion.chunk".to_string(),
|
||||
created,
|
||||
model: model_id,
|
||||
model: model_id.clone(),
|
||||
choices: vec![ChatCompletionChunkChoice {
|
||||
index: 0,
|
||||
delta: Delta {
|
||||
role: None,
|
||||
content: None,
|
||||
},
|
||||
delta: Delta { role: None, content: None },
|
||||
finish_reason: Some("stop".to_string()),
|
||||
}],
|
||||
};
|
||||
|
||||
if let Ok(json) = serde_json::to_string(&final_chunk) {
|
||||
events.push(Ok(Event::default().data(json)));
|
||||
let _ = tx.send(Ok(Event::default().data(json)));
|
||||
}
|
||||
}
|
||||
let _ = tx.send(Ok(Event::default().data("[DONE]")));
|
||||
});
|
||||
|
||||
// Add [DONE] event
|
||||
events.push(Ok(Event::default().data("[DONE]")));
|
||||
|
||||
// Create a stream from the events
|
||||
let stream = stream::iter(events);
|
||||
|
||||
// Return the SSE stream
|
||||
// Convert receiver into a Stream for SSE
|
||||
let stream = UnboundedReceiverStream::new(rx);
|
||||
Ok(Sse::new(stream))
|
||||
}
|
||||
|
||||
|
||||
|
||||
// -------------------------
|
||||
// Router
|
||||
// -------------------------
|
||||
|
||||
pub fn create_router(app_state: AppState) -> Router {
|
||||
let cors = CorsLayer::new()
|
||||
.allow_headers(Any)
|
||||
.allow_origin(Any)
|
||||
.allow_methods(Any)
|
||||
.allow_headers(Any);
|
||||
|
||||
Router::new()
|
||||
.route("/v1/chat/completions", post(chat_completions))
|
||||
.route("/v1/models", get(list_models))
|
||||
.layer(cors)
|
||||
.with_state(app_state)
|
||||
}
|
||||
|
||||
/// Handler for GET /v1/models - returns list of available models
|
||||
async fn list_models() -> Json<ModelListResponse> {
|
||||
pub async fn list_models() -> Json<ModelListResponse> {
|
||||
// Get all available model variants from the Which enum
|
||||
let models = vec![
|
||||
Model {
|
||||
@@ -700,24 +698,6 @@ async fn list_models() -> Json<ModelListResponse> {
|
||||
})
|
||||
}
|
||||
|
||||
// -------------------------
|
||||
// Router
|
||||
// -------------------------
|
||||
|
||||
pub fn create_router(app_state: AppState) -> Router {
|
||||
let cors = CorsLayer::new()
|
||||
.allow_headers(Any)
|
||||
.allow_origin(Any)
|
||||
.allow_methods(Any)
|
||||
.allow_headers(Any);
|
||||
|
||||
Router::new()
|
||||
.route("/v1/chat/completions", post(chat_completions))
|
||||
.route("/v1/models", get(list_models))
|
||||
// .route("/v1/chat/completions/stream", post(chat_completions_stream))
|
||||
.layer(cors)
|
||||
.with_state(app_state)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
@@ -725,92 +705,59 @@ mod tests {
|
||||
use crate::openai_types::{Message, MessageContent};
|
||||
use either::Either;
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_models_list_endpoint() {
|
||||
println!("[DEBUG_LOG] Testing models list endpoint");
|
||||
#[test]
|
||||
fn test_build_gemma_prompt() {
|
||||
let messages = vec![
|
||||
Message {
|
||||
role: "system".to_string(),
|
||||
content: Some(MessageContent(Either::Left("System message".to_string()))),
|
||||
name: None,
|
||||
},
|
||||
Message {
|
||||
role: "user".to_string(),
|
||||
content: Some(MessageContent(Either::Left("Knock knock.".to_string()))),
|
||||
name: None,
|
||||
},
|
||||
Message {
|
||||
role: "assistant".to_string(),
|
||||
content: Some(MessageContent(Either::Left("Who's there?".to_string()))),
|
||||
name: None,
|
||||
},
|
||||
Message {
|
||||
role: "user".to_string(),
|
||||
content: Some(MessageContent(Either::Left("Gemma.".to_string()))),
|
||||
name: None,
|
||||
},
|
||||
];
|
||||
|
||||
let response = list_models().await;
|
||||
let models_response = response.0;
|
||||
let prompt = build_gemma_prompt(&messages);
|
||||
|
||||
// Verify response structure
|
||||
assert_eq!(models_response.object, "list");
|
||||
assert_eq!(models_response.data.len(), 16);
|
||||
let expected = "<start_of_turn>user\nSystem message\n\nKnock knock.<end_of_turn>\n\
|
||||
<start_of_turn>model\nWho's there?<end_of_turn>\n\
|
||||
<start_of_turn>user\nGemma.<end_of_turn>\n\
|
||||
<start_of_turn>model\n";
|
||||
|
||||
// Verify some key models are present
|
||||
let model_ids: Vec<String> = models_response.data.iter().map(|m| m.id.clone()).collect();
|
||||
assert!(model_ids.contains(&"gemma-2b".to_string()));
|
||||
assert!(model_ids.contains(&"gemma-7b".to_string()));
|
||||
assert!(model_ids.contains(&"gemma-3-1b-it".to_string()));
|
||||
assert!(model_ids.contains(&"codegemma-2b-it".to_string()));
|
||||
|
||||
// Verify model structure
|
||||
for model in &models_response.data {
|
||||
assert_eq!(model.object, "model");
|
||||
assert_eq!(model.owned_by, "google");
|
||||
assert_eq!(model.created, 1686935002);
|
||||
assert!(!model.id.is_empty());
|
||||
}
|
||||
|
||||
println!("[DEBUG_LOG] Models list endpoint test passed - {} models available", models_response.data.len());
|
||||
assert_eq!(prompt, expected);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_reproduce_tensor_shape_mismatch() {
|
||||
// Create a test app state with Gemma 3 model (same as the failing request)
|
||||
let mut args = PipelineArgs::default();
|
||||
args.model_id = "google/gemma-3-1b-it".to_string();
|
||||
args.which = Which::InstructV3_1B;
|
||||
#[test]
|
||||
fn test_empty_messages() {
|
||||
let messages: Vec<Message> = vec![];
|
||||
let prompt = build_gemma_prompt(&messages);
|
||||
assert_eq!(prompt, "<start_of_turn>model\n");
|
||||
}
|
||||
|
||||
println!("[DEBUG_LOG] Creating pipeline with model: {}", args.model_id);
|
||||
|
||||
// This should reproduce the same conditions as the curl script
|
||||
let text_generation = build_pipeline(args.clone());
|
||||
let app_state = AppState {
|
||||
text_generation: Arc::new(Mutex::new(text_generation)),
|
||||
model_id: "gemma-3-1b-it".to_string(),
|
||||
build_args: args,
|
||||
};
|
||||
|
||||
// Create the same request as the curl script
|
||||
let request = ChatCompletionRequest {
|
||||
model: "gemma-3-1b-it".to_string(),
|
||||
messages: vec![Message {
|
||||
#[test]
|
||||
fn test_missing_content() {
|
||||
let messages = vec![
|
||||
Message {
|
||||
role: "user".to_string(),
|
||||
content: Some(MessageContent(Either::Left("What is the capital of France?".to_string()))),
|
||||
content: None,
|
||||
name: None,
|
||||
}],
|
||||
max_tokens: Some(128),
|
||||
stream: Some(true),
|
||||
temperature: None,
|
||||
top_p: None,
|
||||
logprobs: false,
|
||||
n_choices: 1,
|
||||
};
|
||||
|
||||
println!("[DEBUG_LOG] Attempting to reproduce tensor shape mismatch error...");
|
||||
|
||||
// This should trigger the same error as the curl script
|
||||
let result = handle_streaming_request(app_state, request).await;
|
||||
|
||||
match result {
|
||||
Ok(_) => {
|
||||
println!("[DEBUG_LOG] No error occurred - this suggests the issue might be fixed or environmental");
|
||||
}
|
||||
Err((status_code, json_error)) => {
|
||||
println!("[DEBUG_LOG] Error reproduced! Status: {:?}", status_code);
|
||||
println!("[DEBUG_LOG] Error details: {:?}", json_error);
|
||||
];
|
||||
|
||||
// Check if this is the expected tensor shape mismatch error
|
||||
if let Some(error_obj) = json_error.0.as_object() {
|
||||
if let Some(error_details) = error_obj.get("error").and_then(|e| e.as_object()) {
|
||||
if let Some(message) = error_details.get("message").and_then(|m| m.as_str()) {
|
||||
assert!(message.contains("shape mismatch"),
|
||||
"Expected shape mismatch error, got: {}", message);
|
||||
println!("[DEBUG_LOG] Successfully reproduced tensor shape mismatch error");
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
let prompt = build_gemma_prompt(&messages);
|
||||
assert_eq!(prompt, "<start_of_turn>user\n<end_of_turn>\n<start_of_turn>model\n");
|
||||
}
|
||||
}
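The prompt-format tests above pin down the Gemma chat template the server builds: system text is folded into the first user turn, `assistant` maps to Gemma's `model` role, and every prompt ends with an open `<start_of_turn>model` turn. The committed `build_gemma_prompt` lives in `server.rs` and is not shown in this excerpt; the following is a hedged sketch of one implementation that satisfies those three tests (the helper name is hypothetical; `Message` and `MessageContent` are the crate's `openai_types`):

```rust
// Sketch only: one way to produce the format the tests expect.
fn build_gemma_prompt_sketch(messages: &[Message]) -> String {
    let mut prompt = String::new();
    let mut pending_system: Vec<String> = Vec::new();

    for msg in messages {
        // Extract plain-text content; missing content becomes an empty string.
        let text = msg
            .content
            .as_ref()
            .and_then(|c| c.0.as_ref().left().cloned())
            .unwrap_or_default();
        match msg.role.as_str() {
            "system" => pending_system.push(text),
            "user" => {
                prompt.push_str("<start_of_turn>user\n");
                if !pending_system.is_empty() {
                    // Fold accumulated system text into the first user turn.
                    prompt.push_str(&pending_system.join("\n\n"));
                    prompt.push_str("\n\n");
                    pending_system.clear();
                }
                prompt.push_str(&text);
                prompt.push_str("<end_of_turn>\n");
            }
            "assistant" => {
                prompt.push_str("<start_of_turn>model\n");
                prompt.push_str(&text);
                prompt.push_str("<end_of_turn>\n");
            }
            _ => {}
        }
    }
    // Always leave an open model turn for generation.
    prompt.push_str("<start_of_turn>model\n");
    prompt
}
```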
|
||||
|
@@ -20,6 +20,8 @@ pub struct TextGeneration {
    repeat_last_n: usize,
    // Cache for repeat penalty computation to avoid redundant calculations
    penalty_cache: HashMap<usize, f32>,
    // Context window size for sliding window context (default: 64 tokens)
    context_window_size: usize,
}

impl TextGeneration {
@@ -55,6 +57,7 @@ impl TextGeneration {
            cpu_device,
            try_primary_device,
            penalty_cache: HashMap::new(),
            context_window_size: 64, // Default sliding window size for better context preservation
        }
    }

@@ -192,9 +195,14 @@ impl TextGeneration {
        // Track overall performance
        let start_time = std::time::Instant::now();

        // Clear penalty cache for new generation
        self.penalty_cache.clear();
        tracing::debug!("Cleared penalty cache for new generation");
        // Keep penalty cache across generation for better repetition prevention
        // Only clear cache if it becomes too large to prevent memory bloat
        if self.penalty_cache.len() > 10000 {
            self.penalty_cache.clear();
            tracing::debug!("Cleared penalty cache due to size limit");
        } else {
            tracing::debug!("Maintaining penalty cache across generation for better repetition prevention");
        }
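The hunk above changes the penalty cache from being cleared on every request to being kept for the whole generation and only cleared past 10,000 entries. `apply_cached_repeat_penalty`, which consumes this cache, is referenced later in the diff but its body is not part of this excerpt; the sketch below is an assumption of the caching idea expressed on plain `f32` logits (the real method operates on candle tensors and may differ):

```rust
use std::collections::HashMap;

// Hedged sketch, not the committed implementation: cache the penalized value
// for each token id so repeated occurrences are not recomputed every step.
fn apply_cached_repeat_penalty_sketch(
    penalty_cache: &mut HashMap<usize, f32>,
    logits: &mut [f32],
    recent_tokens: &[u32],
    repeat_penalty: f32,
) {
    for &tok in recent_tokens {
        let idx = tok as usize;
        if idx >= logits.len() {
            continue;
        }
        let penalized = *penalty_cache.entry(idx).or_insert_with(|| {
            let score = logits[idx];
            // Standard repetition penalty: shrink positive logits, grow negative ones.
            if score < 0.0 { score * repeat_penalty } else { score / repeat_penalty }
        });
        logits[idx] = penalized;
    }
}
```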
        // Phase 1: Tokenize input
        let tokenize_start = std::time::Instant::now();
@@ -440,9 +448,14 @@ impl TextGeneration {
        // Track overall performance
        let start_time = std::time::Instant::now();

        // Clear penalty cache for new generation
        self.penalty_cache.clear();
        tracing::debug!("Cleared penalty cache for new generation (API mode)");
        // Keep penalty cache across generation for better repetition prevention
        // Only clear cache if it becomes too large to prevent memory bloat
        if self.penalty_cache.len() > 10000 {
            self.penalty_cache.clear();
            tracing::debug!("Cleared penalty cache due to size limit (API mode)");
        } else {
            tracing::debug!("Maintaining penalty cache across generation for better repetition prevention (API mode)");
        }

        // Phase 1: Tokenize input
        let tokenize_start = std::time::Instant::now();
@@ -573,10 +586,18 @@ impl TextGeneration {
        for index in 0..sample_len {
            let token_start = std::time::Instant::now();

            let context_size = if index > 0 { 1 } else { tokens.len() };
            // Use sliding window context instead of single token to preserve context and reduce repetition
            let context_size = if index > 0 {
                std::cmp::min(self.context_window_size, tokens.len())
            } else {
                tokens.len()
            };
            let start_pos = tokens.len().saturating_sub(context_size);
            let ctxt = &tokens[start_pos..];

            tracing::debug!("API standard model: Using sliding window context: {} tokens (from position {})",
                ctxt.len(), start_pos);

            // Track tensor operations and model forward pass
            let forward_start = std::time::Instant::now();
            let input = Tensor::new(ctxt, &self.device)?.unsqueeze(0)?;
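The selection logic in this hunk (first step: full prompt; later steps: up to `context_window_size` trailing tokens) can be read as a small helper. This is an illustrative extraction of the lines above, not a function that exists in the crate:

```rust
/// Illustrative helper equivalent to the inline window selection above.
fn context_slice(tokens: &[u32], index: usize, window: usize) -> (&[u32], usize) {
    let context_size = if index > 0 {
        std::cmp::min(window, tokens.len())
    } else {
        tokens.len()
    };
    let start_pos = tokens.len().saturating_sub(context_size);
    (&tokens[start_pos..], start_pos)
}
```

With `window = self.context_window_size` (64 by default), every step after the first feeds the model the last 64 tokens instead of only the single most recent one.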
@@ -630,6 +651,266 @@ impl TextGeneration {
        Ok(())
    }

    // Run text generation with streaming callback for each token
    pub async fn run_with_streaming<F>(&mut self, prompt: &str, sample_len: usize, mut token_callback: F) -> Result<String>
    where
        F: FnMut(&str) -> Result<()>,
    {
        // Track overall performance
        let start_time = std::time::Instant::now();

        // Keep penalty cache across generation for better repetition prevention
        // Only clear cache if it becomes too large to prevent memory bloat
        if self.penalty_cache.len() > 10000 {
            self.penalty_cache.clear();
            tracing::debug!("Cleared penalty cache due to size limit (streaming mode)");
        } else {
            tracing::debug!("Maintaining penalty cache across generation for better repetition prevention (streaming mode)");
        }

        // Phase 1: Tokenize input
        let tokenize_start = std::time::Instant::now();
        self.tokenizer.clear();
        let mut tokens = self
            .tokenizer
            .tokenizer()
            .encode(prompt, true)
            .map_err(E::msg)?
            .get_ids()
            .to_vec();

        let tokenize_time = tokenize_start.elapsed();
        tracing::debug!("Streaming Tokenization completed in {:.2?}", tokenize_time);
        tracing::debug!("Streaming Input tokens: {}", tokens.len());

        // Collect all output for final return
        let mut full_output = String::new();

        let mut generated_tokens = 0usize;
        let eos_token = match self.tokenizer.get_token("<eos>") {
            Some(token) => token,
            None => anyhow::bail!("cannot find the <eos> token"),
        };

        let eot_token = match self.tokenizer.get_token("<end_of_turn>") {
            Some(token) => token,
            None => {
                tracing::warn!("Warning: <end_of_turn> token not found in tokenizer, using <eos> as a backup");
                eos_token
            }
        };

        // Determine if we're using a Model2 (gemma-2) or Model3 (gemma-3) variant
        let needs_special_handling = match &self.model {
            Model::V2(_) => true,
            Model::V3(_) => true,
            _ => false,
        };

        // Track generation timing
        let start_gen = std::time::Instant::now();

        // Track per-token generation timing for performance analysis
        let mut token_times = Vec::new();
        let mut forward_times = Vec::new();
        let mut repeat_penalty_times = Vec::new();
        let mut sampling_times = Vec::new();

        // For Model2 and Model3, we need to use a special approach for shape compatibility
        if needs_special_handling {
            tracing::debug!("Using special generation approach for gemma-2/gemma-3 models (streaming)");
            tracing::debug!("Streaming: sample_len = {}", sample_len);

            // Initial generation with the full prompt
            let forward_start = std::time::Instant::now();
            let input = Tensor::new(tokens.as_slice(), &self.device)?.unsqueeze(0)?;

            let mut logits = self.execute_with_fallback(&input, 0)?;

            logits = logits.squeeze(0)?.squeeze(0)?.to_dtype(DType::F32)?;
            let forward_time = forward_start.elapsed();
            forward_times.push(forward_time);

            tracing::debug!("Streaming: About to enter generation loop with sample_len = {}", sample_len);
            for gen_index in 0..sample_len {
                tracing::debug!("Streaming: Starting generation iteration {} / {}", gen_index + 1, sample_len);
                let token_start = std::time::Instant::now();

                // Apply repeat penalty using optimized cached implementation
                let (current_logits, repeat_time) = self.apply_cached_repeat_penalty(logits.clone(), &tokens)?;
                repeat_penalty_times.push(repeat_time);

                // Track token sampling
                let sampling_start = std::time::Instant::now();
                let next_token = self.logits_processor.sample(&current_logits)?;
                let sampling_time = sampling_start.elapsed();
                sampling_times.push(sampling_time);

                tokens.push(next_token);
                generated_tokens += 1;

                tracing::debug!("Streaming: Generated token {} (id: {}), eos: {}, eot: {}",
                    next_token, next_token, eos_token, eot_token);
                if next_token == eos_token || next_token == eot_token {
                    tracing::debug!("Streaming: Breaking due to end token");
                    break;
                }

                if let Some(token_text) = self.tokenizer.next_token(next_token)? {
                    full_output.push_str(&token_text);
                    // Call the streaming callback with this token
                    token_callback(&token_text)?;
                }

                // For the next iteration, use single token to avoid shape mismatch
                let forward_start = std::time::Instant::now();
                tracing::debug!("Streaming: Preparing next forward pass with {} tokens", tokens.len());

                // Use just the last token for subsequent iterations to avoid shape mismatch
                // This is required for Gemma model's attention mechanism compatibility
                let context_tokens = &tokens[(tokens.len()-1)..];
                let start_pos = tokens.len() - 1;

                tracing::debug!("Streaming: Using single token context for Gemma: {} tokens (from position {})",
                    context_tokens.len(), start_pos);

                let new_input = match Tensor::new(context_tokens, &self.device) {
                    Ok(tensor) => tensor,
                    Err(e) => {
                        tracing::error!("Streaming: Failed to create input tensor: {}", e);
                        return Err(e.into());
                    }
                };

                let new_input = match new_input.unsqueeze(0) {
                    Ok(tensor) => tensor,
                    Err(e) => {
                        tracing::error!("Streaming: Failed to unsqueeze input tensor: {}", e);
                        return Err(e.into());
                    }
                };

                tracing::debug!("Streaming: About to call execute_with_fallback for iteration {} with start_pos {}", gen_index + 1, start_pos);
                logits = match self.execute_with_fallback(&new_input, start_pos) {
                    Ok(result) => result,
                    Err(e) => {
                        tracing::error!("Streaming: Forward pass failed: {}", e);
                        return Err(e);
                    }
                };

                logits = match logits.squeeze(0) {
                    Ok(result) => result,
                    Err(e) => {
                        tracing::error!("Streaming: Failed to squeeze logits (dim 0): {}", e);
                        return Err(e.into());
                    }
                };

                logits = match logits.squeeze(0) {
                    Ok(result) => result,
                    Err(e) => {
                        tracing::error!("Streaming: Failed to squeeze logits (dim 0 again): {}", e);
                        return Err(e.into());
                    }
                };

                logits = match logits.to_dtype(DType::F32) {
                    Ok(result) => result,
                    Err(e) => {
                        tracing::error!("Streaming: Failed to convert logits to F32: {}", e);
                        return Err(e.into());
                    }
                };

                let forward_time = forward_start.elapsed();
                forward_times.push(forward_time);
                tracing::debug!("Streaming: Forward pass completed for iteration {}", gen_index + 1);

                let token_time = token_start.elapsed();
                token_times.push(token_time);

                // Yield to allow other async tasks to run
                tokio::task::yield_now().await;
            }
        } else {
            // Standard approach for other models
            tracing::debug!("Using standard generation approach (streaming)");

            for index in 0..sample_len {
                let token_start = std::time::Instant::now();

                // Use sliding window context instead of single token to preserve context and reduce repetition
                let context_size = if index > 0 {
                    std::cmp::min(self.context_window_size, tokens.len())
                } else {
                    tokens.len()
                };
                let start_pos = tokens.len().saturating_sub(context_size);
                let ctxt = &tokens[start_pos..];

                tracing::debug!("Standard model: Using sliding window context: {} tokens (from position {})",
                    ctxt.len(), start_pos);

                // Track tensor operations and model forward pass
                let forward_start = std::time::Instant::now();
                let input = Tensor::new(ctxt, &self.device)?.unsqueeze(0)?;
                let logits = self.execute_with_fallback(&input, start_pos)?;
                let logits = logits.squeeze(0)?.squeeze(0)?.to_dtype(DType::F32)?;
                let forward_time = forward_start.elapsed();
                forward_times.push(forward_time);

                // Apply repeat penalty using optimized cached implementation
                let (logits, repeat_time) = self.apply_cached_repeat_penalty(logits, &tokens)?;
                repeat_penalty_times.push(repeat_time);

                // Track token sampling
                let sampling_start = std::time::Instant::now();
                let next_token = self.logits_processor.sample(&logits)?;
                let sampling_time = sampling_start.elapsed();
                sampling_times.push(sampling_time);

                tokens.push(next_token);
                generated_tokens += 1;
                if next_token == eos_token || next_token == eot_token {
                    break;
                }
                if let Some(token_text) = self.tokenizer.next_token(next_token)? {
                    full_output.push_str(&token_text);
                    // Call the streaming callback with this token
                    token_callback(&token_text)?;
                }

                let token_time = token_start.elapsed();
                token_times.push(token_time);
            }
        }

        let dt = start_gen.elapsed();

        // Phase 3: Final decoding
        let decode_start = std::time::Instant::now();

        // Decode any remaining tokens but don't send through callback to avoid repetition
        // The tokens were already streamed individually in the generation loop above
        if let Some(rest) = self.tokenizer.decode_rest().map_err(E::msg)? {
            full_output.push_str(&rest);
            // Note: NOT calling token_callback(&rest) here to prevent token repetition
            // Individual tokens were already streamed via the callback in the generation loop
        }

        let decode_time = decode_start.elapsed();

        // Log performance metrics
        Self::log_performance_metrics(
            dt, generated_tokens, &token_times, &forward_times,
            &repeat_penalty_times, &sampling_times, tokenize_time,
            decode_time, start_time, "Streaming"
        );

        Ok(full_output)
    }
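`run_with_streaming` hands each decoded token to a caller-supplied `FnMut(&str)` callback. How the HTTP layer consumes that callback is not shown in this hunk; the sketch below assumes a plain `tokio` unbounded channel between the generator and the SSE response, with illustrative names (`generate_into_channel` does not exist in the crate):

```rust
use tokio::sync::mpsc::UnboundedSender;

// Hedged sketch: forward every streamed token into a channel whose receiver
// side is drained by the SSE handler, while still returning the full text.
async fn generate_into_channel(
    text_gen: &mut TextGeneration,
    prompt: &str,
    sample_len: usize,
    tx: UnboundedSender<String>,
) -> anyhow::Result<String> {
    text_gen
        .run_with_streaming(prompt, sample_len, move |token| {
            // A send error only means the client disconnected; keep generating.
            let _ = tx.send(token.to_string());
            Ok(())
        })
        .await
}
```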
    // Helper function for logging performance metrics
    fn log_performance_metrics(
        generation_time: std::time::Duration,
@@ -40,7 +40,8 @@ impl TokenOutputStream {
        };
        self.tokens.push(token);
        let text = self.decode(&self.tokens[self.prev_index..])?;
        if text.len() > prev_text.len() && text.chars().last().unwrap().is_alphanumeric() {
        if text.len() > prev_text.len() {
            // Modified to include all tokens, not just alphanumeric ones
            let text = text.split_at(prev_text.len());
            self.prev_index = self.current_index;
            self.current_index = self.tokens.len();
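For context, the full incremental-decode step this hunk touches works roughly as follows (a sketch assuming candle's `TokenOutputStream` layout; `next_token_sketch` and its closure-based `decode` parameter are illustrative, not crate APIs). The change above drops the alphanumeric gate, so a chunk is emitted as soon as the re-decoded suffix grows for any token:

```rust
// prev_index..current_index marks text that has already been emitted.
fn next_token_sketch(
    tokens: &mut Vec<u32>,
    prev_index: &mut usize,
    current_index: &mut usize,
    decode: impl Fn(&[u32]) -> anyhow::Result<String>,
    token: u32,
) -> anyhow::Result<Option<String>> {
    let prev_text = if tokens.is_empty() {
        String::new()
    } else {
        decode(&tokens[*prev_index..*current_index])?
    };
    tokens.push(token);
    let text = decode(&tokens[*prev_index..])?;
    if text.len() > prev_text.len() {
        // Emit only the newly decoded suffix and advance the window.
        let (_, new_text) = text.split_at(prev_text.len());
        *prev_index = *current_index;
        *current_index = tokens.len();
        Ok(Some(new_text.to_string()))
    } else {
        Ok(None)
    }
}
```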
@@ -50,3 +50,12 @@ features = [
    "macro-diagnostics", # Enable better diagnostics for compile-time UUIDs
    "js", # Enable JavaScript RNG for WASM targets
]

[package.metadata.kube]
image = "ghcr.io/geoffsee/leptos-chat:latest"
replicas = 1
port = 8788
resources.cpu = "500m"
resources.memory = "256Mi"
#ingress.host = "my-service.example.com"
#env = { RUST_LOG = "info", DATABASE_URL = "postgres://..." }
@@ -15,7 +15,7 @@ use async_openai_wasm::{
    Client,
};
use async_openai_wasm::config::OpenAIConfig;
use async_openai_wasm::types::{ChatCompletionResponseStream, Model};
use async_openai_wasm::types::{ChatCompletionResponseStream, Model, Role, FinishReason};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Message {
@@ -127,7 +127,7 @@ async fn fetch_available_models() -> Result<Vec<OpenAIModel>, String> {
}

async fn send_chat_request(chat_request: ChatRequest) -> ChatCompletionResponseStream {
    let config = OpenAIConfig::new().with_api_base("http://localhost:8080".to_string());
    let config = OpenAIConfig::new().with_api_base("http://localhost:8080/v1".to_string());
    let client = Client::with_config(config);

    let mut typed_chat = async_openai_wasm::types::CreateChatCompletionRequest {
@@ -205,7 +205,7 @@ async fn send_chat_request(chat_request: ChatRequest) -> ChatCompletionResponseS
// Err("leptos-chat chat request only supported on wasm32 target".to_string())
// }

const DEFAULT_MODEL: &str = "gemma-2b-it";
const DEFAULT_MODEL: &str = "default";

#[component]
fn ChatInterface() -> impl IntoView {
@@ -272,11 +272,37 @@ fn ChatInterface() -> impl IntoView {
|
||||
let history_count = messages.with_untracked(|msgs| {
|
||||
let count = msgs.len();
|
||||
for msg in msgs.iter() {
|
||||
let message = ChatCompletionRequestUserMessageArgs::default()
|
||||
.content(msg.content.clone())
|
||||
.build()
|
||||
.expect("failed to build message");
|
||||
chat_messages.push(message.into());
|
||||
match msg.role.as_str() {
|
||||
"user" => {
|
||||
let message = ChatCompletionRequestUserMessageArgs::default()
|
||||
.content(msg.content.clone())
|
||||
.build()
|
||||
.expect("failed to build user message");
|
||||
chat_messages.push(message.into());
|
||||
}
|
||||
"assistant" => {
|
||||
let message = ChatCompletionRequestAssistantMessageArgs::default()
|
||||
.content(msg.content.clone())
|
||||
.build()
|
||||
.expect("failed to build assistant message");
|
||||
chat_messages.push(message.into());
|
||||
}
|
||||
"system" => {
|
||||
let message = ChatCompletionRequestSystemMessageArgs::default()
|
||||
.content(msg.content.clone())
|
||||
.build()
|
||||
.expect("failed to build system message");
|
||||
chat_messages.push(message.into());
|
||||
}
|
||||
_ => {
|
||||
// Default to user message for unknown roles
|
||||
let message = ChatCompletionRequestUserMessageArgs::default()
|
||||
.content(msg.content.clone())
|
||||
.build()
|
||||
.expect("failed to build default message");
|
||||
chat_messages.push(message.into());
|
||||
}
|
||||
}
|
||||
}
|
||||
count
|
||||
});
|
||||
@@ -319,51 +345,69 @@ fn ChatInterface() -> impl IntoView {
|
||||
Ok(mut stream) => {
|
||||
log::info!("[DEBUG_LOG] send_message: Successfully created stream, starting to receive response");
|
||||
|
||||
// Insert a placeholder assistant message to append into
|
||||
let assistant_id = Uuid::new_v4().to_string();
|
||||
set_messages.update(|msgs| {
|
||||
msgs.push_back(Message {
|
||||
id: assistant_id.clone(),
|
||||
role: "assistant".to_string(),
|
||||
content: String::new(),
|
||||
timestamp: Date::now(),
|
||||
});
|
||||
});
|
||||
|
||||
// Defer creating assistant message until we receive role=assistant from the stream
|
||||
let mut assistant_created = false;
|
||||
let mut content_appended = false;
|
||||
let mut chunks_received = 0;
|
||||
// Stream loop: append deltas to the last message
|
||||
// Stream loop: handle deltas and finish events
|
||||
while let Some(next) = stream.next().await {
|
||||
match next {
|
||||
Ok(chunk) => {
|
||||
chunks_received += 1;
|
||||
// Try to pull out the content delta in a tolerant way.
|
||||
// async-openai 0.28.x stream chunk usually looks like:
|
||||
// choices[0].delta.content: Option<String>
|
||||
let mut delta_txt = String::new();
|
||||
|
||||
if let Some(choice) = chunk.choices.get(0) {
|
||||
// Newer message API may expose different shapes; try common ones
|
||||
// 1) Simple string content delta
|
||||
if let Some(content) = &choice.delta.content {
|
||||
delta_txt.push_str(content);
|
||||
}
|
||||
|
||||
// 2) Some providers pack text under .delta.role/.delta.<other>
|
||||
// If nothing extracted, ignore quietly.
|
||||
|
||||
// If a finish_reason arrives, we could stop early,
|
||||
// but usually the stream naturally ends.
|
||||
}
|
||||
|
||||
if !delta_txt.is_empty() {
|
||||
set_messages.update(|msgs| {
|
||||
if let Some(last) = msgs.back_mut() {
|
||||
if last.role == "assistant" {
|
||||
last.content.push_str(&delta_txt);
|
||||
last.timestamp = Date::now();
|
||||
// 1) Create assistant message when role arrives
|
||||
if !assistant_created {
|
||||
if let Some(role) = &choice.delta.role {
|
||||
if role == &Role::Assistant {
|
||||
assistant_created = true;
|
||||
let assistant_id = Uuid::new_v4().to_string();
|
||||
set_messages.update(|msgs| {
|
||||
msgs.push_back(Message {
|
||||
id: assistant_id,
|
||||
role: "assistant".to_string(),
|
||||
content: String::new(),
|
||||
timestamp: Date::now(),
|
||||
});
|
||||
});
|
||||
}
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
// 2) Append content tokens when provided
|
||||
if let Some(content) = &choice.delta.content {
|
||||
if !content.is_empty() {
|
||||
// If content arrives before role, create assistant message now
|
||||
if !assistant_created {
|
||||
assistant_created = true;
|
||||
let assistant_id = Uuid::new_v4().to_string();
|
||||
set_messages.update(|msgs| {
|
||||
msgs.push_back(Message {
|
||||
id: assistant_id,
|
||||
role: "assistant".to_string(),
|
||||
content: String::new(),
|
||||
timestamp: Date::now(),
|
||||
});
|
||||
});
|
||||
}
|
||||
content_appended = true;
|
||||
set_messages.update(|msgs| {
|
||||
if let Some(last) = msgs.back_mut() {
|
||||
if last.role == "assistant" {
|
||||
last.content.push_str(content);
|
||||
last.timestamp = Date::now();
|
||||
}
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// 3) Stop on finish_reason=="stop" (mirrors [DONE])
|
||||
if let Some(reason) = &choice.finish_reason {
|
||||
if reason == &FinishReason::Stop {
|
||||
log::info!("[DEBUG_LOG] send_message: Received finish_reason=stop after {} chunks", chunks_received);
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
Err(e) => {
|
||||
@@ -381,6 +425,21 @@ fn ChatInterface() -> impl IntoView {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Cleanup: If we created an assistant message but no content ever arrived, remove the empty message
|
||||
if assistant_created && !content_appended {
|
||||
set_messages.update(|msgs| {
|
||||
let should_pop = msgs
|
||||
.back()
|
||||
.map(|m| m.role == "assistant" && m.content.is_empty())
|
||||
.unwrap_or(false);
|
||||
if should_pop {
|
||||
log::info!("[DEBUG_LOG] send_message: Removing empty assistant message (no content received)");
|
||||
msgs.pop_back();
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
log::info!("[DEBUG_LOG] send_message: Stream completed successfully, received {} chunks", chunks_received);
|
||||
}
|
||||
Err(e) => {
|
||||
|
@@ -24,3 +24,13 @@ embeddings-engine = { path = "../embeddings-engine" }

# Dependencies for inference functionality
inference-engine = { path = "../inference-engine" }


[package.metadata.kube]
image = "ghcr.io/geoffsee/predict-otron-9000:latest"
replicas = 1
port = 8080
resources.cpu = "500m"
resources.memory = "256Mi"
#ingress.host = "my-service.example.com"
#env = { RUST_LOG = "info", DATABASE_URL = "postgres://..." }
33
package-lock.json
generated
Normal file
@@ -0,0 +1,33 @@
{
  "name": "predict-otron-9000",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {
    "": {
      "dependencies": {
        "openai": "^5.16.0"
      }
    },
    "node_modules/openai": {
      "version": "5.16.0",
      "resolved": "https://registry.npmjs.org/openai/-/openai-5.16.0.tgz",
      "integrity": "sha512-hoEH8ZNvg1HXjU9mp88L/ZH8O082Z8r6FHCXGiWAzVRrEv443aI57qhch4snu07yQydj+AUAWLenAiBXhu89Tw==",
      "license": "Apache-2.0",
      "bin": {
        "openai": "bin/cli"
      },
      "peerDependencies": {
        "ws": "^8.18.0",
        "zod": "^3.23.8"
      },
      "peerDependenciesMeta": {
        "ws": {
          "optional": true
        },
        "zod": {
          "optional": true
        }
      }
    }
  }
}
5
package.json
Normal file
@@ -0,0 +1,5 @@
{
  "dependencies": {
    "openai": "^5.16.0"
  }
}
48
server.log
Normal file
@@ -0,0 +1,48 @@
   Compiling inference-engine v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/inference-engine)
warning: unused import: `Config as Config1`
 --> crates/inference-engine/src/model.rs:2:42
  |
2 | use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
  |                                          ^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: unused import: `Config as Config2`
 --> crates/inference-engine/src/model.rs:3:43
  |
3 | use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
  |                                           ^^^^^^^^^^^^^^^^^
  |

warning: unused import: `Config as Config3`
 --> crates/inference-engine/src/model.rs:4:43
  |
4 | use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};
  |                                           ^^^^^^^^^^^^^^^^^
  |

warning: unused import: `self`
  --> crates/inference-engine/src/server.rs:10:28
   |
10 | use futures_util::stream::{self, Stream};
   |                            ^^^^

warning: `inference-engine` (lib) generated 4 warnings (run `cargo fix --lib -p inference-engine` to apply 4 suggestions)
   Compiling predict-otron-9000 v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/predict-otron-9000)
    Finished `release` profile [optimized] target(s) in 4.01s
     Running `target/release/predict-otron-9000`
2025-08-28T01:43:11.512475Z  INFO predict_otron_9000::middleware::metrics: Performance metrics summary:
avx: false, neon: true, simd128: false, f16c: false
2025-08-28T01:43:11.512811Z  INFO hf_hub: Using token file found "/Users/williamseemueller/.cache/huggingface/token"
retrieved the files in 685.958µs
2025-08-28T01:43:12.661378Z  INFO predict_otron_9000: Unified predict-otron-9000 server listening on 127.0.0.1:8080
2025-08-28T01:43:12.661400Z  INFO predict_otron_9000: Performance metrics tracking enabled - summary logs every 60 seconds
2025-08-28T01:43:12.661403Z  INFO predict_otron_9000: Available endpoints:
2025-08-28T01:43:12.661405Z  INFO predict_otron_9000: GET  / - Root endpoint from embeddings-engine
2025-08-28T01:43:12.661407Z  INFO predict_otron_9000: POST /v1/embeddings - Text embeddings
2025-08-28T01:43:12.661409Z  INFO predict_otron_9000: POST /v1/chat/completions - Chat completions
2025-08-28T01:43:19.166677Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 1)
2025-08-28T01:43:19.296257Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 2)
2025-08-28T01:43:19.424883Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 3)
2025-08-28T01:43:19.554508Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 4)
2025-08-28T01:43:19.683153Z  WARN inference_engine::server: Detected repetition pattern: ' plus' (count: 5)
2025-08-28T01:43:19.683181Z  INFO inference_engine::server: Stopping generation due to excessive repetition
2025-08-28T01:43:19.683221Z  INFO inference_engine::server: Text generation stopped: Repetition detected - stopping generation
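The WARN lines above show the server-side guard cutting generation off after the same fragment (' plus') is seen five times in a row. The exact detector in `inference_engine::server` is not part of this excerpt; the following is a hedged sketch of one way such a check can be written (`trailing_repeat_count` is an illustrative name, not the crate's):

```rust
// Count how many times `chunk` repeats back-to-back at the end of the
// accumulated output; the caller can stop streaming once this hits a threshold.
fn trailing_repeat_count(accumulated: &str, chunk: &str) -> usize {
    if chunk.trim().is_empty() {
        return 0;
    }
    let mut count = 0;
    let mut rest = accumulated;
    while rest.ends_with(chunk) {
        count += 1;
        rest = &rest[..rest.len() - chunk.len()];
    }
    count
}

// Usage sketch: stop when trailing_repeat_count(&full_text, &token) >= 5.
```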
277
server_fresh.txt
Normal file
@@ -0,0 +1,277 @@
|
||||
warning: unused import: `Config as Config1`
|
||||
--> crates/inference-engine/src/model.rs:2:42
|
||||
|
|
||||
2 | use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
|
||||
| ^^^^^^^^^^^^^^^^^
|
||||
|
|
||||
= note: `#[warn(unused_imports)]` on by default
|
||||
|
||||
warning: unused import: `Config as Config2`
|
||||
--> crates/inference-engine/src/model.rs:3:43
|
||||
|
|
||||
3 | use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
|
||||
| ^^^^^^^^^^^^^^^^^
|
||||
|
||||
warning: unused import: `Config as Config3`
|
||||
--> crates/inference-engine/src/model.rs:4:43
|
||||
|
|
||||
4 | use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};
|
||||
| ^^^^^^^^^^^^^^^^^
|
||||
|
||||
warning: unused import: `self`
|
||||
--> crates/inference-engine/src/server.rs:10:28
|
||||
|
|
||||
10 | use futures_util::stream::{self, Stream};
|
||||
| ^^^^
|
||||
|
||||
warning: `inference-engine` (lib) generated 4 warnings (run `cargo fix --lib -p inference-engine` to apply 4 suggestions)
|
||||
Finished `release` profile [optimized] target(s) in 0.13s
|
||||
Running `target/release/predict-otron-9000`
|
||||
avx: false, neon: true, simd128: false, f16c: false
|
||||
[2m2025-08-28T00:34:39.293635Z[0m [32m INFO[0m [2mhf_hub[0m[2m:[0m Using token file found "/Users/williamseemueller/.cache/huggingface/token"
|
||||
retrieved the files in 295.458µs
|
||||
[2m2025-08-28T00:34:39.294536Z[0m [32m INFO[0m [2mpredict_otron_9000::middleware::metrics[0m[2m:[0m Performance metrics summary:
|
||||
[2m2025-08-28T00:34:40.507474Z[0m [32m INFO[0m [2mpredict_otron_9000[0m[2m:[0m Unified predict-otron-9000 server listening on 127.0.0.1:8080
|
||||
[2m2025-08-28T00:34:40.507503Z[0m [32m INFO[0m [2mpredict_otron_9000[0m[2m:[0m Performance metrics tracking enabled - summary logs every 60 seconds
|
||||
[2m2025-08-28T00:34:40.507508Z[0m [32m INFO[0m [2mpredict_otron_9000[0m[2m:[0m Available endpoints:
|
||||
[2m2025-08-28T00:34:40.507512Z[0m [32m INFO[0m [2mpredict_otron_9000[0m[2m:[0m GET / - Root endpoint from embeddings-engine
|
||||
[2m2025-08-28T00:34:40.507515Z[0m [32m INFO[0m [2mpredict_otron_9000[0m[2m:[0m POST /v1/embeddings - Text embeddings
|
||||
[2m2025-08-28T00:34:40.507517Z[0m [32m INFO[0m [2mpredict_otron_9000[0m[2m:[0m POST /v1/chat/completions - Chat completions
|
||||
[2m2025-08-28T00:34:52.313606Z[0m [34mDEBUG[0m [1mrequest[0m[1m{[0m[3mmethod[0m[2m=[0mPOST [3muri[0m[2m=[0m/v1/chat/completions [3mversion[0m[2m=[0mHTTP/1.1[1m}[0m[2m:[0m [2mtower_http::trace::on_request[0m[2m:[0m started processing request
|
||||
[2m2025-08-28T00:34:52.313671Z[0m [34mDEBUG[0m [1mrequest[0m[1m{[0m[3mmethod[0m[2m=[0mPOST [3muri[0m[2m=[0m/v1/chat/completions [3mversion[0m[2m=[0mHTTP/1.1[1m}[0m[2m:[0m [2minference_engine::server[0m[2m:[0m Formatted prompt: <start_of_turn>user
|
||||
You are a helpful assistant who responds thoughtfully and concisely.
|
||||
|
||||
Write a paragraph about dogs<end_of_turn>
|
||||
<start_of_turn>model
|
||||
|
||||
[2m2025-08-28T00:34:52.313693Z[0m [34mDEBUG[0m [1mrequest[0m[1m{[0m[3mmethod[0m[2m=[0mPOST [3muri[0m[2m=[0m/v1/chat/completions [3mversion[0m[2m=[0mHTTP/1.1[1m}[0m[2m:[0m [2mpredict_otron_9000::middleware::metrics[0m[2m:[0m POST /v1/chat/completions 200 OK - 0 ms
|
||||
[2m2025-08-28T00:34:52.313709Z[0m [34mDEBUG[0m [1mrequest[0m[1m{[0m[3mmethod[0m[2m=[0mPOST [3muri[0m[2m=[0m/v1/chat/completions [3mversion[0m[2m=[0mHTTP/1.1[1m}[0m[2m:[0m [2mtower_http::trace::on_response[0m[2m:[0m finished processing request [3mlatency[0m[2m=[0m0 ms [3mstatus[0m[2m=[0m200
|
||||
[2m2025-08-28T00:34:52.313763Z[0m [34mDEBUG[0m [2minference_engine::text_generation[0m[2m:[0m Cleared penalty cache for new generation (streaming mode)
|
||||
[2m2025-08-28T00:34:52.313985Z[0m [34mDEBUG[0m [2minference_engine::text_generation[0m[2m:[0m Streaming Tokenization completed in 217.04µs
|
||||
[2m2025-08-28T00:34:52.313990Z[0m [34mDEBUG[0m [2minference_engine::text_generation[0m[2m:[0m Streaming Input tokens: 26
|
||||
[2m2025-08-28T00:34:52.340937Z[0m [34mDEBUG[0m [2minference_engine::text_generation[0m[2m:[0m Using special generation approach for gemma-2/gemma-3 models (streaming)
|
||||
[2m2025-08-28T00:34:52.602691Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: 'Dogs'
|
||||
[2m2025-08-28T00:34:52.602718Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: 'Dogs'
|
||||
[2m2025-08-28T00:34:52.769918Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' have'
|
||||
[2m2025-08-28T00:34:52.769949Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' have'
|
||||
[2m2025-08-28T00:34:52.905947Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' captivated'
|
||||
[2m2025-08-28T00:34:52.905977Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' captivated'
|
||||
[2m2025-08-28T00:34:53.040888Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' humans'
|
||||
[2m2025-08-28T00:34:53.040921Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' humans'
|
||||
[2m2025-08-28T00:34:53.177116Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' for'
|
||||
[2m2025-08-28T00:34:53.177145Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' for'
|
||||
[2m2025-08-28T00:34:53.313887Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' millennia'
|
||||
[2m2025-08-28T00:34:53.313920Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' millennia'
|
||||
[2m2025-08-28T00:34:53.444031Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ','
|
||||
[2m2025-08-28T00:34:53.444060Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ','
|
||||
[2m2025-08-28T00:34:53.571919Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' evolving'
|
||||
[2m2025-08-28T00:34:53.571951Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' evolving'
|
||||
[2m2025-08-28T00:34:53.699811Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' from'
|
||||
[2m2025-08-28T00:34:53.699852Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' from'
|
||||
[2m2025-08-28T00:34:53.828082Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' wolves'
|
||||
[2m2025-08-28T00:34:53.828111Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' wolves'
|
||||
[2m2025-08-28T00:34:53.957276Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' to'
|
||||
[2m2025-08-28T00:34:53.957313Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' to'
|
||||
[2m2025-08-28T00:34:54.093248Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' beloved'
|
||||
[2m2025-08-28T00:34:54.093284Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' beloved'
|
||||
[2m2025-08-28T00:34:54.228357Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' companions'
|
||||
[2m2025-08-28T00:34:54.228385Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' companions'
|
||||
[2m2025-08-28T00:34:54.356315Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' offering'
|
||||
[2m2025-08-28T00:34:54.356349Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' offering'
|
||||
[2m2025-08-28T00:34:54.484051Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' unwavering'
|
||||
[2m2025-08-28T00:34:54.484085Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' unwavering'
|
||||
[2m2025-08-28T00:34:54.613022Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' loyalty'
|
||||
[2m2025-08-28T00:34:54.613061Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' loyalty'
|
||||
[2m2025-08-28T00:34:54.742024Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' alongside'
|
||||
[2m2025-08-28T00:34:54.742043Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' alongside'
|
||||
[2m2025-08-28T00:34:54.869804Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' boundless'
|
||||
[2m2025-08-28T00:34:54.869829Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' boundless'
|
||||
[2m2025-08-28T00:34:54.998140Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' affection'
|
||||
[2m2025-08-28T00:34:54.998165Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' affection'
|
||||
[2m2025-08-28T00:34:55.126560Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' –'
|
||||
[2m2025-08-28T00:34:55.126582Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' –'
|
||||
[2m2025-08-28T00:34:55.255214Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' often'
|
||||
[2m2025-08-28T00:34:55.255232Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' often'
|
||||
[2m2025-08-28T00:34:55.383529Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' fueled'
|
||||
[2m2025-08-28T00:34:55.383551Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' fueled'
|
||||
[2m2025-08-28T00:34:55.511437Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' by'
|
||||
[2m2025-08-28T00:34:55.511456Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' by'
|
||||
[2m2025-08-28T00:34:55.639748Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' their'
|
||||
[2m2025-08-28T00:34:55.639768Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' their'
|
||||
[2m2025-08-28T00:34:55.767723Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' incredible'
|
||||
[2m2025-08-28T00:34:55.767741Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' incredible'
|
||||
[2m2025-08-28T00:34:55.895796Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' ability'
|
||||
[2m2025-08-28T00:34:55.895817Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' ability'
|
||||
[2m2025-08-28T00:34:56.025191Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' at'
|
||||
[2m2025-08-28T00:34:56.025219Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' at'
|
||||
[2m2025-08-28T00:34:56.153604Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' understanding'
|
||||
[2m2025-08-28T00:34:56.153626Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' understanding'
|
||||
[2m2025-08-28T00:34:56.282571Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' human'
|
||||
[2m2025-08-28T00:34:56.282590Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' human'
|
||||
[2m2025-08-28T00:34:56.411224Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' emotion'
|
||||
[2m2025-08-28T00:34:56.411247Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' emotion'
|
||||
[2m2025-08-28T00:34:56.540028Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' through'
|
||||
[2m2025-08-28T00:34:56.540050Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' through'
|
||||
[2m2025-08-28T00:34:56.668612Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' subtle'
|
||||
[2m2025-08-28T00:34:56.668630Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' subtle'
|
||||
[2m2025-08-28T00:34:56.797698Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' cues'
|
||||
[2m2025-08-28T00:34:56.797716Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' cues'
|
||||
[2m2025-08-28T00:34:56.927032Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: '!'
|
||||
[2m2025-08-28T00:34:56.927054Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: '!'
|
||||
[2m2025-08-28T00:34:57.054903Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' Beyond'
|
||||
[2m2025-08-28T00:34:57.054922Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' Beyond'
|
||||
[2m2025-08-28T00:34:57.183890Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' companionship'
|
||||
[2m2025-08-28T00:34:57.183914Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' companionship'
|
||||
[2m2025-08-28T00:34:57.313258Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' they'
|
||||
[2m2025-08-28T00:34:57.313278Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' they'
|
||||
[2m2025-08-28T00:34:57.441875Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' provide'
|
||||
[2m2025-08-28T00:34:57.441897Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' provide'
|
||||
[2m2025-08-28T00:34:57.569839Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' crucial'
|
||||
[2m2025-08-28T00:34:57.569864Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' crucial'
|
||||
[2m2025-08-28T00:34:57.700161Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' assistance'
|
||||
[2m2025-08-28T00:34:57.700184Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' assistance'
|
||||
[2m2025-08-28T00:34:57.828427Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' with'
|
||||
[2m2025-08-28T00:34:57.828453Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' with'
|
||||
[2m2025-08-28T00:34:57.957703Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' tasks'
|
||||
[2m2025-08-28T00:34:57.957727Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' tasks'
|
||||
[2m2025-08-28T00:34:58.085556Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' like'
|
||||
[2m2025-08-28T00:34:58.085579Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' like'
|
||||
[2m2025-08-28T00:34:58.213727Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' guarding'
|
||||
[2m2025-08-28T00:34:58.213750Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' guarding'
|
||||
[2m2025-08-28T00:34:58.342674Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' property'
|
||||
[2m2025-08-28T00:34:58.342696Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' property'
|
||||
[2m2025-08-28T00:34:58.474992Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' or'
|
||||
[2m2025-08-28T00:34:58.475011Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' or'
|
||||
[2m2025-08-28T00:34:58.603613Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' assisting'
|
||||
[2m2025-08-28T00:34:58.603636Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' assisting'
|
||||
[2m2025-08-28T00:34:58.732292Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' individuals'
|
||||
[2m2025-08-28T00:34:58.732316Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' individuals'
|
||||
[2m2025-08-28T00:34:58.861810Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' who'
|
||||
[2m2025-08-28T00:34:58.861847Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' who'
|
||||
[2m2025-08-28T00:34:58.989748Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' are'
|
||||
[2m2025-08-28T00:34:58.989765Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' are'
|
||||
[2m2025-08-28T00:34:59.118088Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' blind'
|
||||
[2m2025-08-28T00:34:59.118105Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' blind'
|
||||
[2m2025-08-28T00:34:59.246722Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' and'
|
||||
[2m2025-08-28T00:34:59.246746Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' and'
|
||||
[2m2025-08-28T00:34:59.375090Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' deaf'
|
||||
[2m2025-08-28T00:34:59.375119Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' deaf'
|
||||
[2m2025-08-28T00:34:59.503369Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: '.'
|
||||
[2m2025-08-28T00:34:59.503398Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: '.'
|
||||
[2m2025-08-28T00:34:59.632352Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' Their'
|
||||
[2m2025-08-28T00:34:59.632374Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' Their'
|
||||
[2m2025-08-28T00:34:59.760656Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' diverse'
|
||||
[2m2025-08-28T00:34:59.760675Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' diverse'
|
||||
[2m2025-08-28T00:34:59.889274Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' breeds'
|
||||
[2m2025-08-28T00:34:59.889293Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' breeds'
|
||||
[2m2025-08-28T00:35:00.018013Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' reflect'
|
||||
[2m2025-08-28T00:35:00.018043Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' reflect'
|
||||
[2m2025-08-28T00:35:00.146874Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' a'
|
||||
[2m2025-08-28T00:35:00.146903Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' a'
|
||||
[2m2025-08-28T00:35:00.275232Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' fascinating'
|
||||
[2m2025-08-28T00:35:00.275257Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' fascinating'
|
||||
[2m2025-08-28T00:35:00.403452Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' range'
|
||||
[2m2025-08-28T00:35:00.403472Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' range'
|
||||
[2m2025-08-28T00:35:00.535110Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' of'
|
||||
[2m2025-08-28T00:35:00.535133Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' of'
|
||||
[2m2025-08-28T00:35:00.663383Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' personalities'
|
||||
[2m2025-08-28T00:35:00.663402Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' personalities'
|
||||
[2m2025-08-28T00:35:00.792808Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' shaped'
|
||||
[2m2025-08-28T00:35:00.792836Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' shaped'
|
||||
[2m2025-08-28T00:35:00.921350Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' over'
|
||||
[2m2025-08-28T00:35:00.921378Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' over'
|
||||
[2m2025-08-28T00:35:01.049207Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' countless'
|
||||
[2m2025-08-28T00:35:01.049228Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' countless'
|
||||
[2m2025-08-28T00:35:01.178030Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' generations'
|
||||
[2m2025-08-28T00:35:01.178058Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' generations'
|
||||
[2m2025-08-28T00:35:01.306740Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: '،'
|
||||
[2m2025-08-28T00:35:01.306762Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: '،'
|
||||
[2m2025-08-28T00:35:01.434552Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' making'
|
||||
[2m2025-08-28T00:35:01.434573Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' making'
|
||||
[2m2025-08-28T00:35:01.562628Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' them'
|
||||
[2m2025-08-28T00:35:01.562647Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' them'
|
||||
[2m2025-08-28T00:35:01.690509Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' truly'
|
||||
[2m2025-08-28T00:35:01.690530Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' truly'
|
||||
[2m2025-08-28T00:35:01.819330Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' unique'
|
||||
[2m2025-08-28T00:35:01.819351Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' unique'
|
||||
[2m2025-08-28T00:35:01.947700Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' members'
|
||||
[2m2025-08-28T00:35:01.947720Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' members'
|
||||
[2m2025-08-28T00:35:02.076045Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' within'
|
||||
[2m2025-08-28T00:35:02.076071Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' within'
|
||||
[2m2025-08-28T00:35:02.204721Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' our'
|
||||
[2m2025-08-28T00:35:02.204743Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Sending chunk with content: ' our'
|
||||
[2m2025-08-28T00:35:02.333483Z[0m [34mDEBUG[0m [2minference_engine::server[0m[2m:[0m Streaming token: ' families'
|
||||
2025-08-28T00:35:02.333506Z DEBUG inference_engine::server: Sending chunk with content: ' families'
2025-08-28T00:35:02.461905Z DEBUG inference_engine::server: Streaming token: ','
2025-08-28T00:35:02.461926Z DEBUG inference_engine::server: Sending chunk with content: ','
2025-08-28T00:35:02.589686Z DEBUG inference_engine::server: Streaming token: ' enriching'
2025-08-28T00:35:02.589710Z DEBUG inference_engine::server: Sending chunk with content: ' enriching'
2025-08-28T00:35:02.718589Z DEBUG inference_engine::server: Streaming token: ' lives'
2025-08-28T00:35:02.718618Z DEBUG inference_engine::server: Sending chunk with content: ' lives'
2025-08-28T00:35:02.846614Z DEBUG inference_engine::server: Streaming token: ' in'
2025-08-28T00:35:02.846635Z DEBUG inference_engine::server: Sending chunk with content: ' in'
2025-08-28T00:35:02.976008Z DEBUG inference_engine::server: Streaming token: ' profound'
2025-08-28T00:35:02.976028Z DEBUG inference_engine::server: Sending chunk with content: ' profound'
2025-08-28T00:35:03.107573Z DEBUG inference_engine::server: Streaming token: ' ways'
2025-08-28T00:35:03.107594Z DEBUG inference_engine::server: Sending chunk with content: ' ways'
2025-08-28T00:35:03.236069Z DEBUG inference_engine::server: Streaming token: ' regardless'
2025-08-28T00:35:03.236088Z DEBUG inference_engine::server: Sending chunk with content: ' regardless'
2025-08-28T00:35:03.364469Z DEBUG inference_engine::server: Streaming token: ' if'
2025-08-28T00:35:03.364492Z DEBUG inference_engine::server: Sending chunk with content: ' if'
2025-08-28T00:35:03.492669Z DEBUG inference_engine::server: Streaming token: ' we'
2025-08-28T00:35:03.492690Z DEBUG inference_engine::server: Sending chunk with content: ' we'
2025-08-28T00:35:03.621905Z DEBUG inference_engine::server: Streaming token: ' choose'
2025-08-28T00:35:03.621927Z DEBUG inference_engine::server: Sending chunk with content: ' choose'
2025-08-28T00:35:03.754038Z DEBUG inference_engine::server: Streaming token: ' to'
2025-08-28T00:35:03.754059Z DEBUG inference_engine::server: Sending chunk with content: ' to'
2025-08-28T00:35:03.883044Z DEBUG inference_engine::server: Streaming token: ' own'
2025-08-28T00:35:03.883066Z DEBUG inference_engine::server: Sending chunk with content: ' own'
2025-08-28T00:35:04.010685Z DEBUG inference_engine::server: Streaming token: ' one'
2025-08-28T00:35:04.010703Z DEBUG inference_engine::server: Sending chunk with content: ' one'
2025-08-28T00:35:04.139584Z DEBUG inference_engine::server: Streaming token: ' ourselves'
2025-08-28T00:35:04.139609Z DEBUG inference_engine::server: Sending chunk with content: ' ourselves'
2025-08-28T00:35:04.269128Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.269144Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.398132Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.398151Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.527627Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.527654Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.657885Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.657914Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.788586Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.788607Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:04.918153Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:04.918179Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.048431Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.048460Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.178022Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.178055Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.308805Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.308833Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.438091Z DEBUG inference_engine::server: Streaming token: ' truly'
2025-08-28T00:35:05.438113Z DEBUG inference_engine::server: Sending chunk with content: ' truly'
2025-08-28T00:35:05.561745Z  INFO inference_engine::text_generation: Streaming Text generation completed in 13.22s
2025-08-28T00:35:05.561767Z  INFO inference_engine::text_generation: Streaming Tokens generated: 100
2025-08-28T00:35:05.561770Z  INFO inference_engine::text_generation: Streaming Generation speed: 7.56 tokens/second
2025-08-28T00:35:05.561772Z  INFO inference_engine::text_generation: Streaming Average time per token: 129.65ms
2025-08-28T00:35:05.561774Z DEBUG inference_engine::text_generation: Streaming - Forward pass: 124.98ms (96.4%)
2025-08-28T00:35:05.561776Z DEBUG inference_engine::text_generation: Streaming - Repeat penalty: 74.02µs (0.1%)
2025-08-28T00:35:05.561778Z DEBUG inference_engine::text_generation: Streaming - Sampling: 5.85ms (4.5%)
2025-08-28T00:35:05.561779Z  INFO inference_engine::text_generation: Streaming Total request time: 13.25s
2025-08-28T00:35:05.561781Z DEBUG inference_engine::text_generation: Streaming - Tokenization: 217.04µs (0.0%)
2025-08-28T00:35:05.561782Z DEBUG inference_engine::text_generation: Streaming - Generation: 13.22s (99.8%)
2025-08-28T00:35:05.561783Z DEBUG inference_engine::text_generation: Streaming - Final decoding: 8.17µs (0.0%)
2025-08-28T00:35:30.845607Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2025-08-28T00:35:30.845670Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: inference_engine::server: Formatted prompt: <start_of_turn>user
You are a helpful assistant who responds thoughtfully and concisely.

Write a paragraph about cats<end_of_turn>
<start_of_turn>model

2025-08-28T00:35:30.845684Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: predict_otron_9000::middleware::metrics: POST /v1/chat/completions 200 OK - 0 ms
2025-08-28T00:35:30.845691Z DEBUG request{method=POST uri=/v1/chat/completions version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200
2025-08-28T00:35:30.845719Z DEBUG inference_engine::text_generation: Cleared penalty cache for new generation (streaming mode)
2025-08-28T00:35:30.845789Z DEBUG inference_engine::text_generation: Streaming Tokenization completed in 65.50µs
2025-08-28T00:35:30.845794Z DEBUG inference_engine::text_generation: Streaming Input tokens: 26
2025-08-28T00:35:30.871195Z DEBUG inference_engine::text_generation: Using special generation approach for gemma-2/gemma-3 models (streaming)
./run_server.sh: line 7: 30566 Killed: 9     cargo run --bin predict-otron-9000 --release
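The tail of this trace is the failure signature in miniature: after a coherent start, generation collapses into a degenerate loop and streams ' truly' roughly a dozen times until the 100-token limit ends the request, even though the repeat-penalty pass itself costs almost nothing (74µs, 0.1% of the request). Below is a minimal, self-contained sketch of the mitigation this analysis points toward, a repetition penalty applied over a sliding window of recently generated tokens. The function name, token ids, and values are hypothetical and are not the engine's actual code in `text_generation.rs`.

```rust
use std::collections::HashSet;

// Hedged sketch: penalize token ids seen in a recent window so one token
// (e.g. " truly") cannot keep winning the sampling step indefinitely.
fn apply_repeat_penalty(logits: &mut [f32], recent: &[u32], penalty: f32) {
    // Deduplicate so a token repeated in the window is only penalized once.
    let seen: HashSet<u32> = recent.iter().copied().collect();
    for token_id in seen {
        let idx = token_id as usize;
        if idx < logits.len() {
            // Conventional repetition penalty: shrink positive logits and
            // push negative ones further down before sampling.
            if logits[idx] > 0.0 {
                logits[idx] /= penalty;
            } else {
                logits[idx] *= penalty;
            }
        }
    }
}

fn main() {
    // Toy numbers only: pretend id 7 is the " truly" token from the log above.
    let mut logits = vec![0.1, 0.2, 0.05, 0.0, 0.3, 0.15, 0.25, 0.9];
    let recent_window = [4u32, 6, 7, 7, 7];
    apply_repeat_penalty(&mut logits, &recent_window, 1.3);
    println!("penalized logits: {:?}", logits);
}
```

The key design point is the window: if the penalty only ever sees the truncated context that the Gemma-specific path feeds back into the model, tokens outside that window are never discouraged and the loop persists.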
39
server_log.txt
Normal file
@@ -0,0 +1,39 @@
Compiling inference-engine v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/inference-engine)
warning: unused import: `Config as Config1`
 --> crates/inference-engine/src/model.rs:2:42
  |
2 | use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
  |                                          ^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: unused import: `Config as Config2`
 --> crates/inference-engine/src/model.rs:3:43
  |
3 | use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
  |                                           ^^^^^^^^^^^^^^^^^
  |

warning: unused import: `Config as Config3`
 --> crates/inference-engine/src/model.rs:4:43
  |
4 | use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};
  |                                           ^^^^^^^^^^^^^^^^^
  |

warning: unused import: `self`
  --> crates/inference-engine/src/server.rs:10:28
   |
10 | use futures_util::stream::{self, Stream};
   |                            ^^^^
   |

warning: `inference-engine` (lib) generated 4 warnings (run `cargo fix --lib -p inference-engine` to apply 4 suggestions)
Compiling predict-otron-9000 v0.1.0 (/Users/williamseemueller/workspace/seemueller-io/predict-otron-9000/crates/predict-otron-9000)
Finished `release` profile [optimized] target(s) in 4.24s
Running `target/release/predict-otron-9000`
avx: false, neon: true, simd128: false, f16c: false
2025-08-28T00:28:26.075133Z  INFO hf_hub: Using token file found "/Users/williamseemueller/.cache/huggingface/token"
retrieved the files in 557.625µs
2025-08-28T00:28:26.075815Z  INFO predict_otron_9000::middleware::metrics: Performance metrics summary:

thread 'main' panicked at crates/predict-otron-9000/src/main.rs:91:61:
called `Result::unwrap()` on an `Err` value: Os { code: 48, kind: AddrInUse, message: "Address already in use" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
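Separately from the repetition issue, the panic at the end of this log is a port collision: a second server instance tried to bind while a previous one was still listening, and the `unwrap()` at `main.rs:91` turned the `AddrInUse` error into a crash. A small, hypothetical sketch of friendlier handling follows; it uses a blocking `std::net::TcpListener` purely for brevity, so the address and function name are illustrative and not the server's actual async bind code.

```rust
use std::net::TcpListener;
use std::process::exit;

// Illustrative only: report AddrInUse as a readable message instead of panicking.
fn bind_or_explain(addr: &str) -> TcpListener {
    match TcpListener::bind(addr) {
        Ok(listener) => listener,
        Err(e) if e.kind() == std::io::ErrorKind::AddrInUse => {
            eprintln!("{addr} is already in use; is another predict-otron-9000 instance still running?");
            exit(1);
        }
        Err(e) => {
            eprintln!("failed to bind {addr}: {e}");
            exit(1);
        }
    }
}

fn main() {
    let listener = bind_or_explain("127.0.0.1:8080");
    println!("listening on {:?}", listener.local_addr());
}
```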
69
test_predict_otron.sh
Executable file
@@ -0,0 +1,69 @@
#!/bin/bash

# Script to test predict-otron-9000 server with 2 sequential CLI requests
# Ensures proper cleanup of child processes on exit

set -e  # Exit on any error

# Function to cleanup background processes
cleanup() {
    echo "[INFO] Cleaning up background processes..."
    if [[ -n "$SERVER_PID" ]]; then
        echo "[INFO] Killing server process (PID: $SERVER_PID)"
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi

    # Kill any remaining cargo processes related to predict-otron-9000
    pkill -f "predict-otron-9000" 2>/dev/null || true

    echo "[INFO] Cleanup complete"
}

# Set up trap to ensure cleanup on script exit
trap cleanup EXIT INT TERM

# Set environment variables
export SERVER_PORT=${SERVER_PORT:-8080}
export RUST_LOG=${RUST_LOG:-info}

echo "[INFO] Starting predict-otron-9000 server in background..."

# Start the server in background and capture its PID
cargo run --bin predict-otron-9000 --release > server.log 2>&1 &
SERVER_PID=$!

echo "[INFO] Server started with PID: $SERVER_PID"

# Function to check if server is ready (uses the configured port rather than a hardcoded 8080)
check_server() {
    curl -s -f "http://localhost:${SERVER_PORT}/v1/models" > /dev/null 2>&1
}

# Wait for server to be ready
echo "[INFO] Waiting for server to be ready..."
TIMEOUT=60  # 60 seconds timeout
ELAPSED=0

while ! check_server; do
    if [[ $ELAPSED -ge $TIMEOUT ]]; then
        echo "[ERROR] Server did not start within $TIMEOUT seconds"
        exit 1
    fi
    sleep 2
    ELAPSED=$((ELAPSED + 2))
    echo "[INFO] Still waiting for server... (${ELAPSED}s elapsed)"
done

echo "[INFO] Server is ready!"

# Run first CLI request
echo "[INFO] Running first CLI request - listing models..."
bun run cli.ts --list-models

echo ""
echo "[INFO] Running second CLI request - chat completion..."
bun run cli.ts "What is 2+2?"

echo ""
echo "[INFO] Both CLI requests completed successfully!"
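One design note on the script above: the `trap cleanup EXIT INT TERM` handler guards against exactly the failure recorded in server_log.txt. Under `set -e`, a failing CLI request would otherwise abandon the background server, leaving port 8080 occupied so that the next run dies with `AddrInUse`; the trap kills the tracked PID and then sweeps any stray `predict-otron-9000` processes with `pkill`.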
85
test_repetition.ts
Normal file
@@ -0,0 +1,85 @@
#!/usr/bin/env node

// Test script to reproduce the token repetition issue with special characters.
// Uses the built-in fetch available in Bun / Node 18+, so no `node-fetch` import is needed.

async function testTokenRepetition() {
    console.log("Testing token repetition with special characters...");

    try {
        const response = await fetch('http://localhost:8080/chat/stream', {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
            },
            body: JSON.stringify({
                message: "Write a simple greeting with punctuation marks like: Hello! How are you? I'm fine, thanks."
            })
        });

        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }

        const reader = response.body?.getReader();
        if (!reader) {
            throw new Error('No reader available');
        }

        let fullResponse = '';
        const tokens: string[] = [];

        while (true) {
            const { done, value } = await reader.read();
            if (done) break;

            const chunk = new TextDecoder().decode(value);
            const lines = chunk.split('\n');

            for (const line of lines) {
                if (line.startsWith('data: ')) {
                    const data = line.slice(6);
                    if (data === '[DONE]') {
                        continue;
                    }

                    try {
                        const parsed = JSON.parse(data);
                        if (parsed.token) {
                            tokens.push(parsed.token);
                            fullResponse += parsed.token;
                            console.log(`Token: "${parsed.token}"`);
                        }
                    } catch (e) {
                        console.log(`Non-JSON data: ${data}`);
                    }
                }
            }
        }

        console.log('\n=== ANALYSIS ===');
        console.log('Full response:', fullResponse);
        console.log('Total tokens:', tokens.length);

        // Check for repetition issues
        const tokenCounts: Record<string, number> = {};
        let hasRepetition = false;

        for (const token of tokens) {
            tokenCounts[token] = (tokenCounts[token] || 0) + 1;
            if (tokenCounts[token] > 1 && token.match(/[!?,.;:]/)) {
                console.log(`⚠️  Repetition detected: "${token}" appears ${tokenCounts[token]} times`);
                hasRepetition = true;
            }
        }

        if (!hasRepetition) {
            console.log('✅ No token repetition detected');
        }

    } catch (error) {
        console.error('Error testing token repetition:', error);
    }
}

testTokenRepetition();
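Assuming the server is running locally, the script can be invoked with `bun run test_repetition.ts`. Note that it posts to `/chat/stream` and reads a `token` field from each SSE payload, which differs from the OpenAI-style `/v1/chat/completions` chunks seen in the logs above; if only the OpenAI-compatible route is exposed, the URL and parsing would need to be adjusted accordingly.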