# Gemma Runner

Fast Gemma inference with the Candle framework in Rust.

## Features

- Support for multiple Gemma model versions (v1, v2, v3)
- GPU acceleration with CUDA and Metal
- Configurable sampling parameters
- Multiple model variants including instruct and code models

## Supported Models

### Gemma v1

- `gemma-2b` - Base 2B model
- `gemma-7b` - Base 7B model
- `gemma-2b-it` - Instruct 2B model
- `gemma-7b-it` - Instruct 7B model
- `gemma-1.1-2b-it` - Instruct 2B v1.1 model
- `gemma-1.1-7b-it` - Instruct 7B v1.1 model

### CodeGemma

- `codegemma-2b` - Code base 2B model
- `codegemma-7b` - Code base 7B model
- `codegemma-2b-it` - Code instruct 2B model
- `codegemma-7b-it` - Code instruct 7B model

### Gemma v2

- `gemma-2-2b` - Base 2B v2 model (default)
- `gemma-2-2b-it` - Instruct 2B v2 model
- `gemma-2-9b` - Base 9B v2 model
- `gemma-2-9b-it` - Instruct 9B v2 model

### Gemma v3

- `gemma-3-1b` - Base 1B v3 model
- `gemma-3-1b-it` - Instruct 1B v3 model

## Installation

```bash
cd gemma-runner
cargo build --release
```

For GPU support:

```bash
# CUDA
cargo build --release --features cuda

# Metal (macOS)
cargo build --release --features metal
```
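
For repeated use, the compiled binary can be invoked directly instead of going through `cargo run` (a minimal sketch, assuming the build produces a binary named `gemma-runner`):

```bash
# Call the release binary directly (binary name assumed from the crate directory)
./target/release/gemma-runner --prompt "The capital of France is"
```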

## Usage

### Basic Usage

```bash
# Run with default model (gemma-2-2b)
cargo run -- --prompt "The capital of France is"

# Specify a different model
cargo run -- --model gemma-2b-it --prompt "Explain quantum computing"

# Generate more tokens
cargo run -- --model codegemma-2b-it --prompt "Write a Python function to sort a list" --max-tokens 200
```

### Advanced Options

```bash
# Use CPU instead of GPU
cargo run -- --cpu --prompt "Hello world"

# Adjust sampling parameters
cargo run -- --temperature 0.8 --top-p 0.9 --prompt "Write a story about"

# Use custom model from HuggingFace Hub
cargo run -- --model-id "google/gemma-2-2b-it" --prompt "What is AI?"

# Enable tracing for performance analysis
cargo run -- --tracing --prompt "Explain machine learning"
```

### Command Line Arguments

- `--prompt, -p` - The prompt to generate text from (default: "The capital of France is")
- `--model, -m` - The model to use (default: "gemma-2-2b")
- `--cpu` - Run on CPU rather than GPU
- `--temperature, -t` - Sampling temperature (optional)
- `--top-p` - Nucleus sampling probability cutoff (optional)
- `--seed` - Random seed (default: 299792458)
- `--max-tokens, -n` - Maximum tokens to generate (default: 100)
- `--model-id` - Custom model ID from HuggingFace Hub
- `--revision` - Model revision (default: "main")
- `--use-flash-attn` - Use flash attention
- `--repeat-penalty` - Repetition penalty (default: 1.1)
- `--repeat-last-n` - Context size for repeat penalty (default: 64)
- `--dtype` - Data type (f16, bf16, f32)
- `--tracing` - Enable performance tracing
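
These flags compose freely. For instance, a fully specified run that pins the sampling behavior and forces CPU execution might look like this (values are illustrative):

```bash
# Fixed seed plus explicit sampling settings, CPU-only
cargo run -- \
  --model gemma-2-2b-it \
  --prompt "Summarize the Rust ownership model" \
  --max-tokens 150 \
  --temperature 0.7 \
  --top-p 0.95 \
  --repeat-penalty 1.15 \
  --seed 42 \
  --cpu
```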

## Examples

### Text Generation

```bash
cargo run -- --model gemma-2b-it --prompt "Explain the theory of relativity" --max-tokens 150
```

### Code Generation

```bash
cargo run -- --model codegemma-7b-it --prompt "Write a Rust function to calculate factorial" --max-tokens 100
```

### Creative Writing

```bash
cargo run -- --model gemma-7b-it --temperature 0.9 --prompt "Once upon a time in a magical forest" --max-tokens 200
```

### Chat with Gemma 3 (Instruct format)

```bash
cargo run -- --model gemma-3-1b-it --prompt "How do I learn Rust programming?"
```
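
Gemma instruct checkpoints are trained on a turn-based prompt format. If the runner forwards the prompt verbatim rather than applying a chat template itself (an assumption worth checking against the source), the turn markers can be embedded by hand:

```bash
# Manually wrap the question in Gemma's user/model turn markers
# (only needed if the runner does not apply the chat template itself)
cargo run -- --model gemma-3-1b-it --prompt \
  $'<start_of_turn>user\nHow do I learn Rust programming?<end_of_turn>\n<start_of_turn>model\n'
```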

## Performance Notes

- GPU acceleration is automatically detected and used when available
- BF16 precision is used on CUDA for better performance
- F32 precision is used on CPU
- Flash attention can be enabled with `--use-flash-attn` for supported models
- Model files are cached locally after the first download
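
The automatic precision choice can also be overridden. For example, to force BF16 and enable flash attention on a CUDA build (illustrative; flash attention is generally only meaningful with CUDA):

```bash
# Force BF16 weights and enable flash attention (CUDA build)
cargo run --features cuda -- \
  --model gemma-2-9b-it \
  --dtype bf16 \
  --use-flash-attn \
  --prompt "Explain the borrow checker in Rust"
```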

## Requirements

- Rust 1.70+
- CUDA toolkit (for CUDA support)
- Metal (automatically available on macOS)
- Internet connection for first-time model download