# Llama Runner

A fast Rust implementation for running Llama and other language models using the Candle deep learning framework. Built on the official Candle examples with optimizations for speed and usability.

## Features

- 🚀 **High Performance**: Metal GPU acceleration on macOS, CUDA support on Linux/Windows
- 🤖 **Multiple Models**: Supports Llama 3.2, SmolLM2, TinyLlama, and more
- ⚡ **Fast Inference**: Optimized with F16 precision and KV caching
- 🎯 **Advanced Sampling**: Top-k, top-p, temperature, and repeat penalty controls
- 📊 **Performance Metrics**: Real-time tokens/second reporting
- 🔧 **Easy CLI**: Simple command-line interface with sensible defaults

## Supported Models

| Model | Size | Command | Description |
|-------|------|---------|-------------|
| SmolLM2-135M | 135M | `smollm2-135m` | Tiny, fast model for testing |
| SmolLM2-360M | 360M | `smollm2-360m` | Small, efficient model |
| SmolLM2-1.7B | 1.7B | `smollm2-1.7b` | Balanced performance/speed |
| Llama-3.2-1B | 1B | `llama-3.2-1b` | Meta's compact model |
| Llama-3.2-3B | 3B | `llama-3.2-3b` | Larger Llama model |
| TinyLlama-1.1B | 1.1B | `tinyllama-1.1b-chat` | Chat-optimized small model |

Add `-instruct` suffix for instruction-tuned variants (e.g., `smollm2-135m-instruct`).
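An instruction-tuned variant is selected the same way as any other model. A minimal sketch, assuming the `smollm2-135m-instruct` variant mentioned above and the flags documented under Command-Line Options:

```bash
# Run the instruction-tuned SmolLM2-135M on an instruction-style prompt
cargo run --features metal -- \
  --model smollm2-135m-instruct \
  --prompt "Summarize the benefits of unit testing in two sentences." \
  --max-tokens 80
```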
## Installation

```bash
# Clone the repository
git clone <repository-url>
cd llama-runner

# Build with GPU acceleration (recommended)
cargo build --release --features metal  # macOS
cargo build --release --features cuda   # Linux/Windows with NVIDIA GPU

# CPU-only build
cargo build --release
```

## Quick Start

```bash
# Fast inference with GPU acceleration
cargo run --features metal -- --prompt "What is quantum computing?"

# Specify a model and parameters
cargo run --features metal -- \
  --prompt "Write a short story about space exploration" \
  --model smollm2-360m \
  --max-tokens 100 \
  --temperature 0.8

# Use CPU (slower but works everywhere)
cargo run -- --prompt "Hello, world!" --model smollm2-135m --cpu
```
## Usage Examples

### Basic Text Generation

```bash
# Simple completion
cargo run --features metal -- --prompt "The capital of France is"

# Creative writing with higher temperature
cargo run --features metal -- \
  --prompt "Once upon a time" \
  --temperature 1.0 \
  --max-tokens 200
```

### Advanced Sampling

```bash
# Top-k and top-p sampling
cargo run --features metal -- \
  --prompt "Explain artificial intelligence" \
  --top-k 40 \
  --top-p 0.9 \
  --temperature 0.7

# Reduce repetition
cargo run --features metal -- \
  --prompt "List the benefits of renewable energy" \
  --repeat-penalty 1.2 \
  --repeat-last-n 64
```

### Different Models

```bash
# Ultra-fast with tiny model
cargo run --features metal -- \
  --prompt "Quick test" \
  --model smollm2-135m

# Better quality with larger model
cargo run --features metal -- \
  --prompt "Explain quantum physics" \
  --model llama-3.2-1b \
  --max-tokens 150
```
## Command-Line Options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--prompt` | `-p` | "The capital of France is" | Input prompt |
| `--model` | `-m` | `smollm2-135m` | Model to use |
| `--max-tokens` | `-n` | 100 | Maximum tokens to generate |
| `--temperature` | `-t` | 0.8 | Sampling temperature (0.0 = deterministic) |
| `--top-k` | | None | Top-k sampling |
| `--top-p` | | None | Top-p (nucleus) sampling |
| `--seed` | | 299792458 | Random seed for reproducibility |
| `--repeat-penalty` | | 1.1 | Repetition penalty (1.0 = no penalty) |
| `--repeat-last-n` | | 128 | Context window for repeat penalty |
| `--cpu` | | false | Force CPU usage |
| `--dtype` | | f16 | Data type: f16, bf16, f32 |
| `--no-kv-cache` | | false | Disable key-value caching |
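The table covers a few flags the examples above do not show. A minimal sketch combining them, assuming each flag behaves as documented in the table:

```bash
# Reproducible run: fixed seed, deterministic sampling, bf16 weights
cargo run --features metal -- \
  --prompt "Explain the difference between stack and heap memory" \
  --model llama-3.2-1b \
  --seed 42 \
  --temperature 0.0 \
  --dtype bf16 \
  --max-tokens 120
```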
## Performance

Typical performance on Apple M2 with Metal acceleration:

| Model | Size | Speed | Memory |
|-------|------|-------|--------|
| SmolLM2-135M | 135M | ~100 tok/s | ~500MB |
| SmolLM2-360M | 360M | ~80 tok/s | ~1GB |
| SmolLM2-1.7B | 1.7B | ~50 tok/s | ~3GB |
| Llama-3.2-1B | 1B | ~40 tok/s | ~2GB |
## Requirements

- **Rust**: 1.70+ (latest stable recommended)
- **Memory**: 2-8GB RAM depending on model size
- **Storage**: 1-10GB for model weights
- **Network**: Internet connection for first-time model download
- **GPU** (optional): Metal on macOS, CUDA on Linux/Windows
## GPU Support

### macOS (Metal)

```bash
cargo run --features metal -- [options]
```

### Linux/Windows (CUDA)

```bash
cargo run --features cuda -- [options]
```

### CPU Only

```bash
cargo run -- --cpu [options]
```
## Model Downloads

Models are automatically downloaded from HuggingFace Hub on first use and cached locally. Download times:

- SmolLM2-135M: ~1 minute
- SmolLM2-360M: ~2 minutes
- Llama-3.2-1B: ~5 minutes
- Larger models: 10+ minutes
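The weights land in the standard Hugging Face cache (the Troubleshooting section below assumes `~/.cache/huggingface/`). A minimal sketch for checking how much space the cache uses and, optionally, relocating it before the first download; using the `HF_HOME` environment variable here is an assumption about this setup, so adjust if your build resolves the cache differently:

```bash
# Check how much disk the cached model weights occupy
du -sh ~/.cache/huggingface/

# Optionally point the cache at a larger disk before the first run
export HF_HOME=/mnt/bigdisk/huggingface
cargo run --features metal -- --prompt "Hello, world!" --model smollm2-135m
```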
## Troubleshooting

### Slow Performance

- Use `--features metal` on macOS or `--features cuda` on Linux/Windows
- Try smaller models like `smollm2-135m` for faster inference
- Ensure sufficient RAM for your chosen model
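A minimal sketch of the fast path (note that the examples above omit `--release`, and debug builds are significantly slower):

```bash
# Release build, GPU acceleration, and the smallest model
cargo run --release --features metal -- \
  --model smollm2-135m \
  --prompt "Quick benchmark prompt" \
  --max-tokens 50
```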
### Out of Memory

- Use `--cpu` to use system RAM instead of GPU memory
- Try smaller models or reduce `--max-tokens`
- Use `--dtype f32` if f16 causes issues
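For example, a conservative fallback that combines these suggestions (a sketch using only the flags documented above):

```bash
# CPU execution, f32 weights, small model, short generation
cargo run -- \
  --cpu \
  --dtype f32 \
  --model smollm2-135m \
  --max-tokens 50 \
  --prompt "Hello, world!"
```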
### Model Download Issues

- Check internet connection
- Some models may require HuggingFace Hub authentication
- Verify sufficient disk space in `~/.cache/huggingface/`
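For gated models (e.g., the Llama 3.2 checkpoints), one common approach is to authenticate with the Hugging Face CLI so a token is available locally; whether this runner picks the token up from the cached login or from `HF_TOKEN` is an assumption about this setup:

```bash
# Store a Hugging Face token locally (requires the huggingface_hub CLI)
pip install -U "huggingface_hub[cli]"
huggingface-cli login

# Or export a token just for the current shell
export HF_TOKEN=<your-token>
```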
## Contributing

Contributions welcome! This project is based on the [Candle](https://github.com/huggingface/candle) framework by HuggingFace.

## License

MIT License - see LICENSE file for details.