# Llama Runner

A fast Rust implementation for running Llama and other language models using the Candle deep learning framework. Built on the official Candle examples, with optimizations for speed and usability.

## Features

- 🚀 **High Performance**: Metal GPU acceleration on macOS, CUDA support on Linux/Windows
- 🤖 **Multiple Models**: Supports Llama 3.2, SmolLM2, TinyLlama, and more
- ⚡ **Fast Inference**: Optimized with F16 precision and KV caching
- 🎯 **Advanced Sampling**: Top-k, top-p, temperature, and repeat penalty controls
- 📊 **Performance Metrics**: Real-time tokens/second reporting
- 🔧 **Easy CLI**: Simple command-line interface with sensible defaults

## Supported Models

| Model | Size | Command | Description |
|-------|------|---------|-------------|
| SmolLM2-135M | 135M | `smollm2-135m` | Tiny, fast model for testing |
| SmolLM2-360M | 360M | `smollm2-360m` | Small, efficient model |
| SmolLM2-1.7B | 1.7B | `smollm2-1.7b` | Balanced performance/speed |
| Llama-3.2-1B | 1B | `llama-3.2-1b` | Meta's compact model |
| Llama-3.2-3B | 3B | `llama-3.2-3b` | Larger Llama model |
| TinyLlama-1.1B | 1.1B | `tinyllama-1.1b-chat` | Chat-optimized small model |

Add the `-instruct` suffix for instruction-tuned variants (e.g., `smollm2-135m-instruct`).

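For example, the instruction-tuned SmolLM2-135M can be run with the same flags used throughout this README; only the `--model` value changes:

```bash
# Instruction-tuned variant of the smallest model
cargo run --features metal -- \
  --prompt "Summarize the benefits of unit testing in two sentences." \
  --model smollm2-135m-instruct \
  --max-tokens 80
```
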
## Installation

```bash
# Clone the repository
git clone <repository-url>
cd llama-runner

# Build with GPU acceleration (recommended)
cargo build --release --features metal   # macOS
cargo build --release --features cuda    # Linux/Windows with NVIDIA GPU

# CPU-only build
cargo build --release
```

## Quick Start

```bash
# Fast inference with GPU acceleration
cargo run --features metal -- --prompt "What is quantum computing?"

# Specify a model and parameters
cargo run --features metal -- \
  --prompt "Write a short story about space exploration" \
  --model smollm2-360m \
  --max-tokens 100 \
  --temperature 0.8

# Use CPU (slower but works everywhere)
cargo run -- --prompt "Hello, world!" --model smollm2-135m --cpu
```

## Usage Examples

### Basic Text Generation

```bash
# Simple completion
cargo run --features metal -- --prompt "The capital of France is"

# Creative writing with higher temperature
cargo run --features metal -- \
  --prompt "Once upon a time" \
  --temperature 1.0 \
  --max-tokens 200
```

### Advanced Sampling

```bash
# Top-k and top-p sampling
cargo run --features metal -- \
  --prompt "Explain artificial intelligence" \
  --top-k 40 \
  --top-p 0.9 \
  --temperature 0.7

# Reduce repetition
cargo run --features metal -- \
  --prompt "List the benefits of renewable energy" \
  --repeat-penalty 1.2 \
  --repeat-last-n 64
```

### Different Models

```bash
# Ultra-fast with tiny model
cargo run --features metal -- \
  --prompt "Quick test" \
  --model smollm2-135m

# Better quality with larger model
cargo run --features metal -- \
  --prompt "Explain quantum physics" \
  --model llama-3.2-1b \
  --max-tokens 150
```

## Command-Line Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--prompt` | `-p` | "The capital of France is" | Input prompt |
| `--model` | `-m` | `smollm2-135m` | Model to use |
| `--max-tokens` | `-n` | 100 | Maximum tokens to generate |
| `--temperature` | `-t` | 0.8 | Sampling temperature (0.0 = deterministic) |
| `--top-k` | | None | Top-k sampling |
| `--top-p` | | None | Top-p (nucleus) sampling |
| `--seed` | | 299792458 | Random seed for reproducibility |
| `--repeat-penalty` | | 1.1 | Repetition penalty (1.0 = no penalty) |
| `--repeat-last-n` | | 128 | Context window for repeat penalty |
| `--cpu` | | false | Force CPU usage |
| `--dtype` | | f16 | Data type: `f16`, `bf16`, or `f32` |
| `--no-kv-cache` | | false | Disable key-value caching |

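The `--seed`, `--dtype`, and `--no-kv-cache` flags don't appear in the usage examples above; a combined invocation using only flags from this table might look like:

```bash
# Reproducible generation with bf16 weights and the KV cache disabled
cargo run --features metal -- \
  --prompt "Describe the water cycle" \
  --model smollm2-360m \
  --seed 42 \
  --dtype bf16 \
  --no-kv-cache
```
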
## Performance

Typical performance on an Apple M2 with Metal acceleration:

| Model | Size | Speed | Memory |
|-------|------|-------|--------|
| SmolLM2-135M | 135M | ~100 tok/s | ~500MB |
| SmolLM2-360M | 360M | ~80 tok/s | ~1GB |
| SmolLM2-1.7B | 1.7B | ~50 tok/s | ~3GB |
| Llama-3.2-1B | 1B | ~40 tok/s | ~2GB |

## Requirements

- **Rust**: 1.70+ (latest stable recommended)
- **Memory**: 2-8GB RAM depending on model size
- **Storage**: 1-10GB for model weights
- **Network**: Internet connection for first-time model download
- **GPU** (optional): Metal on macOS, CUDA on Linux/Windows

## GPU Support

### macOS (Metal)

```bash
cargo run --features metal -- [options]
```

### Linux/Windows (CUDA)

```bash
cargo run --features cuda -- [options]
```

### CPU Only

```bash
cargo run -- --cpu [options]
```

## Model Downloads

Models are automatically downloaded from the Hugging Face Hub on first use and cached locally. Typical download times:

- SmolLM2-135M: ~1 minute
- SmolLM2-360M: ~2 minutes
- Llama-3.2-1B: ~5 minutes
- Larger models: 10+ minutes

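Weights land in the standard Hugging Face hub cache, so you can inspect or clear them there. The exact directory names depend on the `hf-hub` cache layout, so treat the paths below as a rough sketch rather than a guarantee:

```bash
# Inspect cached model repositories (default cache location)
ls ~/.cache/huggingface/hub

# Reclaim disk space by deleting a cached model you no longer need
rm -rf ~/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M
```
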
## Troubleshooting

### Slow Performance

- Use `--features metal` on macOS or `--features cuda` on Linux/Windows
- Try smaller models like `smollm2-135m` for faster inference
- Ensure sufficient RAM for your chosen model

### Out of Memory

- Use `--cpu` to use system RAM instead of GPU memory
- Try smaller models or reduce `--max-tokens`
- Use `--dtype f32` if f16 causes issues

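Combining these suggestions, a conservative low-memory invocation (using only flags from the options table above) could be:

```bash
# Smallest model, CPU execution, f32 weights, short output
cargo run -- \
  --prompt "Hello" \
  --model smollm2-135m \
  --cpu \
  --dtype f32 \
  --max-tokens 50
```
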
### Model Download Issues

- Check your internet connection
- Some models may require Hugging Face Hub authentication
- Verify sufficient disk space in `~/.cache/huggingface/`

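For gated models, authenticating with the Hugging Face Hub before running usually resolves the issue. A minimal sketch, assuming the `huggingface-cli` tool from the Python `huggingface_hub` package is available (recent versions of the underlying `hf-hub` crate can also read an `HF_TOKEN` environment variable):

```bash
# Log in once; the token is stored under ~/.cache/huggingface/
pip install -U huggingface_hub
huggingface-cli login

# Or export a token just for the current shell session
export HF_TOKEN=hf_your_token_here
```
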
## Contributing

Contributions are welcome! This project is based on the [Candle](https://github.com/huggingface/candle) framework by Hugging Face.

## License

MIT License - see the LICENSE file for details.