Gemma Runner

Fast Gemma inference with the Candle framework in Rust.

Features

  • Support for multiple Gemma model versions (v1, v2, v3)
  • GPU acceleration with CUDA and Metal
  • Configurable sampling parameters
  • Multiple model variants including instruct and code models

Supported Models

Gemma v1

  • gemma-2b - Base 2B model
  • gemma-7b - Base 7B model
  • gemma-2b-it - Instruct 2B model
  • gemma-7b-it - Instruct 7B model
  • gemma-1.1-2b-it - Instruct 2B v1.1 model
  • gemma-1.1-7b-it - Instruct 7B v1.1 model

CodeGemma

  • codegemma-2b - Code base 2B model
  • codegemma-7b - Code base 7B model
  • codegemma-2b-it - Code instruct 2B model
  • codegemma-7b-it - Code instruct 7B model

Gemma v2

  • gemma-2-2b - Base 2B v2 model (default)
  • gemma-2-2b-it - Instruct 2B v2 model
  • gemma-2-9b - Base 9B v2 model
  • gemma-2-9b-it - Instruct 9B v2 model

Gemma v3

  • gemma-3-1b - Base 1B v3 model
  • gemma-3-1b-it - Instruct 1B v3 model

Installation

cd gemma-runner
cargo build --release

For GPU support:

# CUDA
cargo build --release --features cuda

# Metal (macOS)
cargo build --release --features metal

Usage

Basic Usage

# Run with default model (gemma-2-2b)
cargo run -- --prompt "The capital of France is"

# Specify a different model
cargo run -- --model gemma-2b-it --prompt "Explain quantum computing"

# Generate more tokens
cargo run -- --model codegemma-2b-it --prompt "Write a Python function to sort a list" --max-tokens 200

Advanced Options

# Use CPU instead of GPU
cargo run -- --cpu --prompt "Hello world"

# Adjust sampling parameters
cargo run -- --temperature 0.8 --top-p 0.9 --prompt "Write a story about"

# Use custom model from HuggingFace Hub
cargo run -- --model-id "google/gemma-2-2b-it" --prompt "What is AI?"

# Enable tracing for performance analysis
cargo run -- --tracing --prompt "Explain machine learning"
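
The --seed, --temperature, and --top-p flags correspond to the knobs of Candle's standard sampler, candle_transformers::generation::LogitsProcessor. A minimal sketch of that wiring (illustrative, not this crate's exact code; function names are hypothetical):

use candle_core::{Result, Tensor};
use candle_transformers::generation::LogitsProcessor;

// Build a sampler from CLI-style options.
// With temperature = None, LogitsProcessor falls back to greedy (argmax) decoding.
fn build_sampler(seed: u64, temperature: Option<f64>, top_p: Option<f64>) -> LogitsProcessor {
    LogitsProcessor::new(seed, temperature, top_p)
}

// Pick the next token id from the logits of the current step's final position.
fn next_token(sampler: &mut LogitsProcessor, last_logits: &Tensor) -> Result<u32> {
    sampler.sample(last_logits)
}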

Command Line Arguments

  • --prompt, -p - The prompt to generate text from (default: "The capital of France is")
  • --model, -m - The model to use (default: "gemma-2-2b")
  • --cpu - Run on CPU rather than GPU
  • --temperature, -t - Sampling temperature (optional)
  • --top-p - Nucleus sampling probability cutoff (optional)
  • --seed - Random seed (default: 299792458)
  • --max-tokens, -n - Maximum tokens to generate (default: 100)
  • --model-id - Custom model ID from HuggingFace Hub
  • --revision - Model revision (default: "main")
  • --use-flash-attn - Use flash attention
  • --repeat-penalty - Repetition penalty (default: 1.1)
  • --repeat-last-n - Context size for repeat penalty (default: 64)
  • --dtype - Data type (f16, bf16, f32)
  • --tracing - Enable performance tracing
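
These flags follow the usual clap derive pattern. A sketch of how a subset of them might be declared, assuming the crate uses clap (field names and attributes are illustrative):

use clap::Parser;

// Illustrative subset of the CLI options listed above.
#[derive(Parser, Debug)]
struct Args {
    /// The prompt to generate text from.
    #[arg(long, short = 'p', default_value = "The capital of France is")]
    prompt: String,

    /// The model to use.
    #[arg(long, short = 'm', default_value = "gemma-2-2b")]
    model: String,

    /// Run on CPU rather than GPU.
    #[arg(long)]
    cpu: bool,

    /// Sampling temperature (omit for greedy decoding).
    #[arg(long, short = 't')]
    temperature: Option<f64>,

    /// Nucleus sampling probability cutoff.
    #[arg(long)]
    top_p: Option<f64>,

    /// Maximum tokens to generate.
    #[arg(long, short = 'n', default_value_t = 100)]
    max_tokens: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}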

Examples

Text Generation

cargo run -- --model gemma-2b-it --prompt "Explain the theory of relativity" --max-tokens 150

Code Generation

cargo run -- --model codegemma-7b-it --prompt "Write a Rust function to calculate factorial" --max-tokens 100

Creative Writing

cargo run -- --model gemma-7b-it --temperature 0.9 --prompt "Once upon a time in a magical forest" --max-tokens 200

Chat with Gemma 3 (Instruct format)

cargo run -- --model gemma-3-1b-it --prompt "How do I learn Rust programming?"
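
Gemma instruct models are trained on a turn-based chat template, so a runner typically wraps the raw prompt before generation. A sketch of that wrapping (the template is Google's documented Gemma format; the helper function itself is illustrative):

// Wrap a raw prompt in Gemma's instruct turn template.
// Generation then continues from the model turn and stops at <end_of_turn>.
fn to_instruct_prompt(user_prompt: &str) -> String {
    format!("<start_of_turn>user\n{user_prompt}<end_of_turn>\n<start_of_turn>model\n")
}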

Performance Notes

  • GPU acceleration is automatically detected and used when available
  • BF16 precision is used on CUDA for better performance; F32 is used on CPU (see the device/dtype sketch after this list)
  • Flash attention can be enabled with --use-flash-attn for supported models
  • Model files are cached locally after first download
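
The device and dtype behavior described above is the common Candle pattern. A minimal sketch using candle_core's standard API (not this crate's exact code; a Metal build would construct Device::new_metal(0) similarly):

use candle_core::{DType, Device, Result};

// Prefer CUDA with BF16; fall back to CPU with F32.
fn pick_device_and_dtype(force_cpu: bool) -> Result<(Device, DType)> {
    if force_cpu {
        return Ok((Device::Cpu, DType::F32));
    }
    // cuda_if_available returns the CPU device when CUDA isn't compiled in or present.
    let device = Device::cuda_if_available(0)?;
    let dtype = if device.is_cuda() { DType::BF16 } else { DType::F32 };
    Ok((device, dtype))
}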

Requirements

  • Rust 1.70+
  • CUDA toolkit (for CUDA support)
  • Metal (automatically available on macOS)
  • Internet connection for first-time model download