Gemma Runner

Fast Gemma inference with the Candle framework in Rust.

Features

  • Support for multiple Gemma model versions (v1, v2, v3)
  • GPU acceleration with CUDA and Metal
  • Configurable sampling parameters
  • Multiple model variants including instruct and code models

Supported Models

Gemma v1

  • gemma-2b - Base 2B model
  • gemma-7b - Base 7B model
  • gemma-2b-it - Instruct 2B model
  • gemma-7b-it - Instruct 7B model
  • gemma-1.1-2b-it - Instruct 2B v1.1 model
  • gemma-1.1-7b-it - Instruct 7B v1.1 model

CodeGemma

  • codegemma-2b - Code base 2B model
  • codegemma-7b - Code base 7B model
  • codegemma-2b-it - Code instruct 2B model
  • codegemma-7b-it - Code instruct 7B model

Gemma v2

  • gemma-2-2b - Base 2B v2 model (default)
  • gemma-2-2b-it - Instruct 2B v2 model
  • gemma-2-9b - Base 9B v2 model
  • gemma-2-9b-it - Instruct 9B v2 model

Gemma v3

  • gemma-3-1b - Base 1B v3 model
  • gemma-3-1b-it - Instruct 1B v3 model

Installation

cd gemma-runner
cargo build --release

For GPU support:

# CUDA
cargo build --release --features cuda

# Metal (macOS)
cargo build --release --features metal

Usage

Basic Usage

# Run with default model (gemma-2-2b)
cargo run -- --prompt "The capital of France is"

# Specify a different model
cargo run -- --model gemma-2b-it --prompt "Explain quantum computing"

# Generate more tokens
cargo run -- --model codegemma-2b-it --prompt "Write a Python function to sort a list" --max-tokens 200

Advanced Options

# Use CPU instead of GPU
cargo run -- --cpu --prompt "Hello world"

# Adjust sampling parameters
cargo run -- --temperature 0.8 --top-p 0.9 --prompt "Write a story about"

# Use custom model from HuggingFace Hub
cargo run -- --model-id "google/gemma-2-2b-it" --prompt "What is AI?"

# Enable tracing for performance analysis
cargo run -- --tracing --prompt "Explain machine learning"
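
The --seed, --temperature, and --top-p flags correspond to the knobs of Candle's standard sampler, candle_transformers::generation::LogitsProcessor. A minimal sketch of that wiring (illustrative, not this crate's exact code; function names are hypothetical):

use candle_core::{Result, Tensor};
use candle_transformers::generation::LogitsProcessor;

// Build a sampler from CLI-style options.
// With temperature = None, LogitsProcessor falls back to greedy (argmax) decoding.
fn build_sampler(seed: u64, temperature: Option<f64>, top_p: Option<f64>) -> LogitsProcessor {
    LogitsProcessor::new(seed, temperature, top_p)
}

// Pick the next token id from the logits of the current step's final position.
fn next_token(sampler: &mut LogitsProcessor, last_logits: &Tensor) -> Result<u32> {
    sampler.sample(last_logits)
}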

Command Line Arguments

  • --prompt, -p - The prompt to generate text from (default: "The capital of France is")
  • --model, -m - The model to use (default: "gemma-2-2b")
  • --cpu - Run on CPU rather than GPU
  • --temperature, -t - Sampling temperature (optional)
  • --top-p - Nucleus sampling probability cutoff (optional)
  • --seed - Random seed (default: 299792458)
  • --max-tokens, -n - Maximum tokens to generate (default: 100)
  • --model-id - Custom model ID from HuggingFace Hub
  • --revision - Model revision (default: "main")
  • --use-flash-attn - Use flash attention
  • --repeat-penalty - Repetition penalty (default: 1.1)
  • --repeat-last-n - Context size for repeat penalty (default: 64)
  • --dtype - Data type (f16, bf16, f32)
  • --tracing - Enable performance tracing
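
These flags follow the usual clap derive pattern. A sketch of how a subset of them might be declared, assuming the crate uses clap (field names and attributes are illustrative):

use clap::Parser;

// Illustrative subset of the CLI options listed above.
#[derive(Parser, Debug)]
struct Args {
    /// The prompt to generate text from.
    #[arg(long, short = 'p', default_value = "The capital of France is")]
    prompt: String,

    /// The model to use.
    #[arg(long, short = 'm', default_value = "gemma-2-2b")]
    model: String,

    /// Run on CPU rather than GPU.
    #[arg(long)]
    cpu: bool,

    /// Sampling temperature (omit for greedy decoding).
    #[arg(long, short = 't')]
    temperature: Option<f64>,

    /// Nucleus sampling probability cutoff.
    #[arg(long)]
    top_p: Option<f64>,

    /// Maximum tokens to generate.
    #[arg(long, short = 'n', default_value_t = 100)]
    max_tokens: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}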

Examples

Text Generation

cargo run -- --model gemma-2b-it --prompt "Explain the theory of relativity" --max-tokens 150

Code Generation

cargo run -- --model codegemma-7b-it --prompt "Write a Rust function to calculate factorial" --max-tokens 100

Creative Writing

cargo run -- --model gemma-7b-it --temperature 0.9 --prompt "Once upon a time in a magical forest" --max-tokens 200

Chat with Gemma 3 (Instruct format)

cargo run -- --model gemma-3-1b-it --prompt "How do I learn Rust programming?"
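
Gemma instruct models are trained on a turn-based chat template, so a runner typically wraps the raw prompt before generation. A sketch of that wrapping (the template is Google's documented Gemma format; the helper function itself is illustrative):

// Wrap a raw prompt in Gemma's instruct turn template.
// Generation then continues from the model turn and stops at <end_of_turn>.
fn to_instruct_prompt(user_prompt: &str) -> String {
    format!("<start_of_turn>user\n{user_prompt}<end_of_turn>\n<start_of_turn>model\n")
}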

Performance Notes

  • GPU acceleration is automatically detected and used when available
  • BF16 precision is used on CUDA for better performance; F32 is used on CPU (see the device/dtype sketch after this list)
  • Flash attention can be enabled with --use-flash-attn for supported models
  • Model files are cached locally after first download
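
The device and dtype behavior described above is the common Candle pattern. A minimal sketch using candle_core's standard API (not this crate's exact code; a Metal build would construct Device::new_metal(0) similarly):

use candle_core::{DType, Device, Result};

// Prefer CUDA with BF16; fall back to CPU with F32.
fn pick_device_and_dtype(force_cpu: bool) -> Result<(Device, DType)> {
    if force_cpu {
        return Ok((Device::Cpu, DType::F32));
    }
    // cuda_if_available returns the CPU device when CUDA isn't compiled in or present.
    let device = Device::cuda_if_available(0)?;
    let dtype = if device.is_cuda() { DType::BF16 } else { DType::F32 };
    Ok((device, dtype))
}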

Requirements

  • Rust 1.70+
  • CUDA toolkit (for CUDA support)
  • Metal (automatically available on macOS)
  • Internet connection for first-time model download