# Gemma Runner

Fast Gemma inference with the Candle framework in Rust.
## Features
- Support for multiple Gemma model versions (v1, v2, v3)
- GPU acceleration with CUDA and Metal
- Configurable sampling parameters
- Multiple model variants including instruct and code models
## Supported Models

### Gemma v1

- `gemma-2b` - Base 2B model
- `gemma-7b` - Base 7B model
- `gemma-2b-it` - Instruct 2B model
- `gemma-7b-it` - Instruct 7B model
- `gemma-1.1-2b-it` - Instruct 2B v1.1 model
- `gemma-1.1-7b-it` - Instruct 7B v1.1 model

### CodeGemma

- `codegemma-2b` - Code base 2B model
- `codegemma-7b` - Code base 7B model
- `codegemma-2b-it` - Code instruct 2B model
- `codegemma-7b-it` - Code instruct 7B model

### Gemma v2

- `gemma-2-2b` - Base 2B v2 model (default)
- `gemma-2-2b-it` - Instruct 2B v2 model
- `gemma-2-9b` - Base 9B v2 model
- `gemma-2-9b-it` - Instruct 9B v2 model

### Gemma v3

- `gemma-3-1b` - Base 1B v3 model
- `gemma-3-1b-it` - Instruct 1B v3 model
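Each short name above resolves to a model repository on the HuggingFace Hub. A minimal sketch of such a lookup, assuming Google's published `google/...` repo ids (the runner's actual table may differ):

```rust
/// Illustrative mapping from short model names to Hugging Face repo ids.
/// The ids follow Google's published naming; this is not the runner's
/// actual source.
fn model_repo(name: &str) -> Option<&'static str> {
    Some(match name {
        "gemma-2b" => "google/gemma-2b",
        "gemma-2b-it" => "google/gemma-2b-it",
        "codegemma-7b-it" => "google/codegemma-7b-it",
        "gemma-2-2b" => "google/gemma-2-2b",
        "gemma-2-2b-it" => "google/gemma-2-2b-it",
        "gemma-3-1b-it" => "google/gemma-3-1b-it",
        // ... remaining variants follow the same pattern
        _ => return None,
    })
}
```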
## Installation

```bash
cd gemma-runner
cargo build --release
```

For GPU support:

```bash
# CUDA
cargo build --release --features cuda

# Metal (macOS)
cargo build --release --features metal
```
## Usage

### Basic Usage

```bash
# Run with default model (gemma-2-2b)
cargo run -- --prompt "The capital of France is"

# Specify a different model
cargo run -- --model gemma-2b-it --prompt "Explain quantum computing"

# Generate more tokens
cargo run -- --model codegemma-2b-it --prompt "Write a Python function to sort a list" --max-tokens 200
```

### Advanced Options

```bash
# Use CPU instead of GPU
cargo run -- --cpu --prompt "Hello world"

# Adjust sampling parameters
cargo run -- --temperature 0.8 --top-p 0.9 --prompt "Write a story about"

# Use custom model from HuggingFace Hub
cargo run -- --model-id "google/gemma-2-2b-it" --prompt "What is AI?"

# Enable tracing for performance analysis
cargo run -- --tracing --prompt "Explain machine learning"
```
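Under the hood, `--seed`, `--temperature`, and `--top-p` feed Candle's `LogitsProcessor`, and tokens are sampled one at a time against the model's KV cache. A simplified sketch of that decode loop (the runner's actual loop also handles EOS detection, the repeat penalty, and streaming output):

```rust
use candle_core::{DType, Device, Result, Tensor};
use candle_transformers::generation::LogitsProcessor;

// Simplified decode loop. `forward` stands in for the model's forward pass,
// which is assumed to return logits for the last position only (as Candle's
// Gemma implementations do).
fn decode(
    mut forward: impl FnMut(&Tensor, usize) -> Result<Tensor>,
    mut tokens: Vec<u32>,
    max_tokens: usize,
    seed: u64,
    temperature: Option<f64>,
    top_p: Option<f64>,
    device: &Device,
) -> Result<Vec<u32>> {
    // With `temperature` set to None, sampling falls back to greedy decoding.
    let mut sampler = LogitsProcessor::new(seed, temperature, top_p);
    for step in 0..max_tokens {
        // Feed the whole prompt on the first step; afterwards only the newest
        // token, since the KV cache retains the earlier context.
        let start = if step == 0 { 0 } else { tokens.len() - 1 };
        let input = Tensor::new(&tokens[start..], device)?.unsqueeze(0)?;
        let logits = forward(&input, start)?
            .squeeze(0)?
            .squeeze(0)?
            .to_dtype(DType::F32)?;
        tokens.push(sampler.sample(&logits)?);
    }
    Ok(tokens)
}
```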
### Command Line Arguments

- `--prompt, -p` - The prompt to generate text from (default: "The capital of France is")
- `--model, -m` - The model to use (default: "gemma-2-2b")
- `--cpu` - Run on CPU rather than GPU
- `--temperature, -t` - Sampling temperature (optional)
- `--top-p` - Nucleus sampling probability cutoff (optional)
- `--seed` - Random seed (default: 299792458)
- `--max-tokens, -n` - Maximum tokens to generate (default: 100)
- `--model-id` - Custom model ID from the HuggingFace Hub
- `--revision` - Model revision (default: "main")
- `--use-flash-attn` - Use flash attention
- `--repeat-penalty` - Repetition penalty (default: 1.1)
- `--repeat-last-n` - Context size for the repeat penalty (default: 64)
- `--dtype` - Data type (f16, bf16, f32)
- `--tracing` - Enable performance tracing
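These are standard derive-style `clap` flags. A hypothetical sketch of how such an argument struct might be declared (field names and defaults mirror the list above, but this is illustrative, not the runner's actual source):

```rust
use clap::Parser;

/// Hypothetical argument struct mirroring the flags documented above.
#[derive(Parser, Debug)]
struct Args {
    #[arg(short, long, default_value = "The capital of France is")]
    prompt: String,
    #[arg(short, long, default_value = "gemma-2-2b")]
    model: String,
    #[arg(long)]
    cpu: bool,
    #[arg(short, long)]
    temperature: Option<f64>,
    #[arg(long)]
    top_p: Option<f64>,
    #[arg(long, default_value_t = 299792458)]
    seed: u64,
    #[arg(short = 'n', long, default_value_t = 100)]
    max_tokens: usize,
    #[arg(long, default_value_t = 1.1)]
    repeat_penalty: f32,
    #[arg(long, default_value_t = 64)]
    repeat_last_n: usize,
}
```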
## Examples

### Text Generation

```bash
cargo run -- --model gemma-2b-it --prompt "Explain the theory of relativity" --max-tokens 150
```

### Code Generation

```bash
cargo run -- --model codegemma-7b-it --prompt "Write a Rust function to calculate factorial" --max-tokens 100
```

### Creative Writing

```bash
cargo run -- --model gemma-7b-it --temperature 0.9 --prompt "Once upon a time in a magical forest" --max-tokens 200
```

### Chat with Gemma 3 (Instruct format)

```bash
cargo run -- --model gemma-3-1b-it --prompt "How do I learn Rust programming?"
```
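Instruct-tuned Gemma variants expect the chat template published with the models, which wraps each turn in `<start_of_turn>`/`<end_of_turn>` markers. A sketch of that formatting (whether the runner applies it automatically is an assumption here):

```rust
// Wraps a raw prompt in Gemma's documented instruct chat template.
fn format_instruct_prompt(user_prompt: &str) -> String {
    format!("<start_of_turn>user\n{user_prompt}<end_of_turn>\n<start_of_turn>model\n")
}
```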
## Performance Notes

- GPU acceleration is automatically detected and used when available
- BF16 precision is used on CUDA for better performance
- F32 precision is used on CPU
- Flash attention can be enabled with `--use-flash-attn` for supported models
- Model files are cached locally after the first download
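The precision rules above roughly correspond to the following device/dtype selection, assuming Candle's standard availability helpers (a sketch, not the runner's exact logic):

```rust
use candle_core::utils::{cuda_is_available, metal_is_available};
use candle_core::{DType, Device, Result};

// Pick the best available backend, then apply the dtype policy above:
// BF16 on CUDA for speed, F32 otherwise for compatibility.
fn device_and_dtype(force_cpu: bool) -> Result<(Device, DType)> {
    let device = if force_cpu {
        Device::Cpu
    } else if cuda_is_available() {
        Device::new_cuda(0)?
    } else if metal_is_available() {
        Device::new_metal(0)?
    } else {
        Device::Cpu
    };
    let dtype = if device.is_cuda() { DType::BF16 } else { DType::F32 };
    Ok((device, dtype))
}
```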
## Requirements
- Rust 1.70+
- CUDA toolkit (for CUDA support)
- Metal (automatically available on macOS)
- Internet connection for first-time model download