
Llama Runner

A fast Rust implementation for running Llama and other language models using the Candle deep learning framework. Built on the official Candle examples with optimizations for speed and usability.

Features

  • 🚀 High Performance: Metal GPU acceleration on macOS, CUDA support on Linux/Windows
  • 🤖 Multiple Models: Supports Llama 3.2, SmolLM2, TinyLlama, and more
  • ⚡ Fast Inference: Optimized with F16 precision and KV caching
  • 🎯 Advanced Sampling: Top-k, top-p, temperature, and repeat penalty controls
  • 📊 Performance Metrics: Real-time tokens/second reporting
  • 🔧 Easy CLI: Simple command-line interface with sensible defaults

Supported Models

| Model | Size | Command | Description |
|---|---|---|---|
| SmolLM2-135M | 135M | smollm2-135m | Tiny, fast model for testing |
| SmolLM2-360M | 360M | smollm2-360m | Small, efficient model |
| SmolLM2-1.7B | 1.7B | smollm2-1.7b | Balanced performance/speed |
| Llama-3.2-1B | 1B | llama-3.2-1b | Meta's compact model |
| Llama-3.2-3B | 3B | llama-3.2-3b | Larger Llama model |
| TinyLlama-1.1B | 1.1B | tinyllama-1.1b-chat | Chat-optimized small model |

Add -instruct suffix for instruction-tuned variants (e.g., smollm2-135m-instruct).
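
For example, assuming the instruct variant is fetched the same way as its base model, it can be selected with the same flags used throughout this README (the prompt is only illustrative):

# Run an instruction-tuned variant
cargo run --features metal -- \
  --prompt "Summarize the benefits of Rust in two sentences." \
  --model smollm2-135m-instruct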

Installation

# Clone the repository
git clone <repository-url>
cd llama-runner

# Build with GPU acceleration (recommended)
cargo build --release --features metal  # macOS
cargo build --release --features cuda   # Linux/Windows with NVIDIA GPU

# CPU-only build
cargo build --release

Quick Start

# Fast inference with GPU acceleration
cargo run --features metal -- --prompt "What is quantum computing?"

# Specify a model and parameters
cargo run --features metal -- \
  --prompt "Write a short story about space exploration" \
  --model smollm2-360m \
  --max-tokens 100 \
  --temperature 0.8

# Use CPU (slower but works everywhere)
cargo run -- --prompt "Hello, world!" --model smollm2-135m --cpu

Usage Examples

Basic Text Generation

# Simple completion
cargo run --features metal -- --prompt "The capital of France is"

# Creative writing with higher temperature
cargo run --features metal -- \
  --prompt "Once upon a time" \
  --temperature 1.0 \
  --max-tokens 200

Advanced Sampling

# Top-k and top-p sampling
cargo run --features metal -- \
  --prompt "Explain artificial intelligence" \
  --top-k 40 \
  --top-p 0.9 \
  --temperature 0.7

# Reduce repetition
cargo run --features metal -- \
  --prompt "List the benefits of renewable energy" \
  --repeat-penalty 1.2 \
  --repeat-last-n 64

Different Models

# Ultra-fast with tiny model
cargo run --features metal -- \
  --prompt "Quick test" \
  --model smollm2-135m

# Better quality with larger model
cargo run --features metal -- \
  --prompt "Explain quantum physics" \
  --model llama-3.2-1b \
  --max-tokens 150

Command-Line Options

| Option | Short | Default | Description |
|---|---|---|---|
| --prompt | -p | "The capital of France is" | Input prompt |
| --model | -m | smollm2-135m | Model to use |
| --max-tokens | -n | 100 | Maximum tokens to generate |
| --temperature | -t | 0.8 | Sampling temperature (0.0 = deterministic) |
| --top-k | | None | Top-k sampling |
| --top-p | | None | Top-p (nucleus) sampling |
| --seed | | 299792458 | Random seed for reproducibility |
| --repeat-penalty | | 1.1 | Repetition penalty (1.0 = no penalty) |
| --repeat-last-n | | 128 | Context window for repeat penalty |
| --cpu | | false | Force CPU usage |
| --dtype | | f16 | Data type: f16, bf16, f32 |
| --no-kv-cache | | false | Disable key-value caching |
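
Putting several options together, a fully specified run might look like the sketch below (every flag comes from the table above; the prompt and values are illustrative):

# Reproducible run with explicit sampling and precision settings
cargo run --features metal -- \
  --prompt "Describe the water cycle" \
  --model llama-3.2-1b \
  --max-tokens 150 \
  --temperature 0.7 \
  --top-p 0.9 \
  --seed 42 \
  --repeat-penalty 1.1 \
  --dtype f16

Fixing --seed keeps the sampled output reproducible across runs with otherwise identical settings.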

Performance

Typical performance on Apple M2 with Metal acceleration:

| Model | Size | Speed | Memory |
|---|---|---|---|
| SmolLM2-135M | 135M | ~100 tok/s | ~500MB |
| SmolLM2-360M | 360M | ~80 tok/s | ~1GB |
| SmolLM2-1.7B | 1.7B | ~50 tok/s | ~3GB |
| Llama-3.2-1B | 1B | ~40 tok/s | ~2GB |

Requirements

  • Rust: 1.70+ (latest stable recommended)
  • Memory: 2-8GB RAM depending on model size
  • Storage: 1-10GB for model weights
  • Network: Internet connection for first-time model download
  • GPU (optional): Metal on macOS, CUDA on Linux/Windows

GPU Support

macOS (Metal)

cargo run --features metal -- [options]

Linux/Windows (CUDA)

cargo run --features cuda -- [options]  

CPU Only

cargo run -- --cpu [options]

Model Downloads

Models are automatically downloaded from the HuggingFace Hub on first use and cached locally. Approximate download times:

  • SmolLM2-135M: ~1 minute
  • SmolLM2-360M: ~2 minutes
  • Llama-3.2-1B: ~5 minutes
  • Larger models: 10+ minutes
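
To check how much space cached models are taking up, inspect the cache directory (this assumes the default HuggingFace cache location noted in the troubleshooting section below):

# Check the disk space used by the local model cache
du -sh ~/.cache/huggingface/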

Troubleshooting

Slow Performance

  • Use --features metal on macOS or --features cuda on Linux/Windows
  • Try smaller models like smollm2-135m for faster inference
  • Ensure sufficient RAM for your chosen model

Out of Memory

  • Use --cpu to use system RAM instead of GPU memory
  • Try smaller models or reduce --max-tokens
  • Use --dtype f32 if f16 causes issues
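
As a concrete fallback, the options above can be combined into a single low-memory run (all flags are from the command-line table; the prompt is only an example):

# CPU-only run with full precision and a shorter generation
cargo run -- \
  --prompt "Hello, world!" \
  --model smollm2-135m \
  --cpu \
  --dtype f32 \
  --max-tokens 50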

Model Download Issues

  • Check internet connection
  • Some models may require HuggingFace Hub authentication
  • Verify sufficient disk space in ~/.cache/huggingface/
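
For gated models, one common approach is to supply a HuggingFace access token before running. This sketch assumes the runner uses the standard HuggingFace Hub token lookup (the HF_TOKEN environment variable or the token file under ~/.cache/huggingface/); that is an assumption about its configuration, not something this README guarantees:

# Assumption: the runner honors the standard HuggingFace Hub token lookup
export HF_TOKEN=<your-huggingface-token>
# Alternatively, if the Python huggingface_hub CLI is installed:
huggingface-cli login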

Contributing

Contributions welcome! This project is based on the Candle framework by HuggingFace.

License

MIT License - see LICENSE file for details.