# @open-web-agent-rs/inference-engine

A Rust-based inference engine for running large language models locally. This tool supports both CLI mode for direct text generation and server mode with an OpenAI-compatible API.

## Features

- Run Gemma models locally (1B, 2B, 7B, 9B variants)
- CLI mode for direct text generation
- Server mode with OpenAI-compatible API
- Support for various model configurations (base, instruction-tuned)
- Metal acceleration on macOS

## Installation

### Prerequisites

- Rust toolchain (install via [rustup](https://rustup.rs/))
- Cargo package manager
- For GPU acceleration:
  - macOS: Metal support
  - Linux/Windows: CUDA support (requires appropriate drivers)

### Building from Source

1. Clone the repository:

   ```bash
   git clone https://github.com/seemueller-io/open-web-agent-rs.git
   cd open-web-agent-rs
   ```

2. Build the local inference engine:

   ```bash
   cargo build -p inference-engine --release
   ```

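Optionally, verify the build by printing the CLI help. This assumes the CLI exposes a standard `--help` flag; the exact option set may vary between versions:

```bash
# Quick sanity check: list the available flags for the freshly built binary
cargo run -p inference-engine --release -- --help
```
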
## Usage

### CLI Mode

Run the inference engine in CLI mode to generate text directly:

```bash
cargo run -p inference-engine --release -- --prompt "Your prompt text here" --which 3-1b-it
```

#### CLI Options

- `--prompt <TEXT>`: The prompt text to generate from
- `--which <MODEL>`: Model variant to use (default: "3-1b-it")
  - Available options: "2b", "7b", "2b-it", "7b-it", "1.1-2b-it", "1.1-7b-it", "code-2b", "code-7b", "code-2b-it", "code-7b-it", "2-2b", "2-2b-it", "2-9b", "2-9b-it", "3-1b", "3-1b-it"
- `--server`: Run an OpenAI-compatible server
- `--temperature <FLOAT>`: Temperature for sampling (higher = more random)
- `--top-p <FLOAT>`: Nucleus sampling probability cutoff
- `--sample-len <INT>`: Maximum number of tokens to generate (default: 10000)
- `--repeat-penalty <FLOAT>`: Penalty for repeating tokens (default: 1.1)
- `--repeat-last-n <INT>`: Context size for repeat penalty (default: 64)
- `--cpu`: Run on CPU instead of GPU
- `--tracing`: Enable tracing (generates a trace-timestamp.json file)

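The sampling flags can be combined in a single invocation; the values below are purely illustrative:

```bash
# Illustrative run combining the documented sampling options
cargo run -p inference-engine --release -- \
  --prompt "Write a haiku about Rust" \
  --which 3-1b-it \
  --temperature 0.8 \
  --top-p 0.95 \
  --sample-len 256 \
  --repeat-penalty 1.1
```
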
### Server Mode with OpenAI-compatible API

Run the inference engine in server mode to expose an OpenAI-compatible API:

```bash
cargo run -p inference-engine --release -- --server --port 3777 --which 3-1b-it
```

This starts a web server on the specified port (default: 3777) with an OpenAI-compatible chat completions endpoint.

#### Server Options

- `--server`: Run in server mode
- `--port <INT>`: Port to use for the server (default: 3777)
- `--which <MODEL>`: Model variant to use (default: "3-1b-it")
- The remaining model and sampling options behave as described in CLI mode

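For example, to serve a larger instruction-tuned variant on a different port (memory requirements grow with model size):

```bash
# Serve the 9B instruction-tuned Gemma 2 variant on port 8080
cargo run -p inference-engine --release -- --server --port 8080 --which 2-9b-it
```
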
## API Usage

The server exposes an OpenAI-compatible chat completions endpoint:

### Chat Completions

```
POST /v1/chat/completions
```

#### Request Format

```json
{
  "model": "gemma-3-1b-it",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "top_p": 0.9,
  "stream": false
}
```

#### Response Format

```json
{
  "id": "chatcmpl-123abc456def789ghi",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "gemma-3-1b-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 15,
    "total_tokens": 40
  }
}
```

### Example: Using cURL

```bash
curl -X POST http://localhost:3777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

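The request schema above includes a `stream` field. Assuming the server follows the OpenAI convention of emitting server-sent-event chunks when `"stream": true` is set (worth verifying against your build), a streaming request would look like:

```bash
# Assumes OpenAI-style SSE streaming; -N disables curl's output buffering
curl -N -X POST http://localhost:3777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Stream a short poem."}],
    "stream": true
  }'
```
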
### Example: Using Python with OpenAI Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3777/v1",
    api_key="dummy",  # API key is not validated but required by the client
)

response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=100,
)

print(response.choices[0].message.content)
```

### Example: Using JavaScript/TypeScript with OpenAI SDK

```javascript
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:3777/v1',
  apiKey: 'dummy', // API key is not validated but required by the client
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'gemma-3-1b-it',
    messages: [
      { role: 'user', content: 'What is the capital of France?' }
    ],
    temperature: 0.7,
    max_tokens: 100,
  });

  console.log(response.choices[0].message.content);
}

main();
```

## Troubleshooting

### Common Issues

1. **Model download errors**: Make sure you have a stable internet connection. The models are downloaded from Hugging Face Hub.

2. **Out of memory errors**: Try using a smaller model variant or reducing the batch size.

3. **Slow inference on CPU**: This is expected. For better performance, use GPU acceleration if available.

4. **Metal/CUDA errors**: Ensure you have the latest drivers installed for your GPU. To rule out a driver problem, you can force CPU execution as shown below.

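The documented `--cpu` flag runs the model entirely on the CPU, which is a quick way to isolate Metal/CUDA driver issues (expect much slower generation):

```bash
# Force CPU execution to bypass Metal/CUDA entirely
cargo run -p inference-engine --release -- --prompt "Hello" --which 3-1b-it --cpu
```
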
## License

This project is licensed under the terms specified in the LICENSE file.