19 Commits

Author SHA1 Message Date
geoffsee
296d4dbe7e add root dockerfile that contains binaries for all services 2025-09-04 14:54:20 -04:00
geoffsee
ff55d882c7 reorg + update docs with new paths 2025-09-04 12:40:59 -04:00
geoffsee
400c70f17d streaming implementation re-added to the UI 2025-09-02 14:45:16 -04:00
geoffsee
21f20470de patch version 2025-09-01 22:55:59 -04:00
geoffsee
2deecb5e51 chat client only displays available models 2025-09-01 22:29:54 -04:00
geoffsee
d1a7d5b28e fix format error 2025-08-31 19:59:09 -04:00
geoffsee
64daa77c6b leptos chat ui renders 2025-08-31 18:50:25 -04:00
geoffsee
2b4a8a9df8 chat-ui not functional yet but builds 2025-08-31 18:18:56 -04:00
geoffsee
7bc9479a11 fix format issues; needs a pre-commit hook 2025-08-31 13:24:51 -04:00
geoffsee
0580dc8c5e move cli into crates and stage for release 2025-08-31 13:23:50 -04:00
geoffsee
e6c417bd83 align dependencies across inference features 2025-08-31 10:49:04 -04:00
geoffsee
315ef17605 supports small llama and gemma models
Refactor inference

dedicated crates for llama and gemma inference; not yet integrated
2025-08-29 20:00:41 -04:00
geoffsee
e38a2d4512 predict-otron-9000 serves a leptos SSR frontend 2025-08-28 12:06:22 -04:00
geoffsee
45d7cd8819 - Introduced ServerConfig for handling deployment modes and services.
- Added HighAvailability mode for proxying requests to external services.
- Maintained Local mode for embedded services.
- Updated `README.md` and included `SERVER_CONFIG.md` for detailed documentation.
2025-08-28 09:55:39 -04:00
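The ServerConfig entry above describes two deployment modes: an embedded Local mode and a HighAvailability mode that proxies to external services. A rough sketch of how such a config might be modeled follows; the type and field names here are illustrative guesses, not the schema documented in `SERVER_CONFIG.md`:

```rust
/// Hypothetical model of the two deployment modes described in the
/// commit; the actual schema is documented in SERVER_CONFIG.md.
#[derive(Debug)]
enum ServerMode {
    /// Run embeddings and inference in-process.
    Local,
    /// Proxy requests to external service URLs.
    HighAvailability {
        inference_url: String,
        embeddings_url: String,
    },
}

/// Summarize what the server would do in a given mode.
fn describe(mode: &ServerMode) -> String {
    match mode {
        ServerMode::Local => "serving embedded services".to_string(),
        ServerMode::HighAvailability { inference_url, .. } => {
            format!("proxying inference to {inference_url}")
        }
    }
}

fn main() {
    let mode = ServerMode::HighAvailability {
        inference_url: "http://inference:8080".into(),
        embeddings_url: "http://embeddings:8081".into(),
    };
    println!("{}", describe(&mode));
}
```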
geoffsee
719beb3791 - Change default server host to localhost for improved security.
- Increase default maximum tokens in CLI configuration to 256.
- Refactor and reorganize CLI
2025-08-27 21:47:31 -04:00
geoffsee
432c04d9df Removed legacy inference engine assets. 2025-08-27 16:19:31 -04:00
geoffsee
8338750beb Refactor apply_cached_repeat_penalty for optimized caching and reuse, add extensive unit tests, and integrate special handling for gemma-specific models.
Removed `test_request.sh`, deprecated functionality, and unused imports; introduced a new CLI tool (`cli.ts`) for testing the inference engine and adjusted handling of non-streaming and streaming chat completions.

- Add CPU fallback support for text generation when primary device is unsupported
- Introduce `execute_with_fallback` method to handle device compatibility and shape mismatch errors
- Extend unit tests to reproduce tensor shape mismatch errors specific to model configurations
- Increase HTTP timeout limits in `curl_chat_stream.sh` script for reliable API testing

chat completion endpoint now works with gemma3 (no streaming)

Add benchmarking guide with HTML reporting, Leptos chat crate, and middleware for metrics tracking
2025-08-27 16:15:01 -04:00
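The `apply_cached_repeat_penalty` refactor above concerns a standard sampling step: scaling down the logits of tokens that already appear in the context, with the penalized values cached so a token occurring many times is only recomputed once. A minimal sketch of the idea, where the function signature and caching scheme are illustrative assumptions rather than the crate's actual API:

```rust
use std::collections::HashMap;

/// Illustrative sketch (not the predict-otron-9000 API): apply a
/// repeat penalty to logits in place, caching penalized values per
/// token id so repeated context tokens are computed only once.
fn apply_cached_repeat_penalty(
    logits: &mut [f32],
    context_tokens: &[u32],
    penalty: f32,
    cache: &mut HashMap<u32, f32>,
) {
    for &tok in context_tokens {
        let idx = tok as usize;
        if idx >= logits.len() {
            continue;
        }
        let penalized = *cache.entry(tok).or_insert_with(|| {
            let l = logits[idx];
            // Common repeat-penalty rule: shrink positive logits,
            // amplify negative ones, making repeats less likely.
            if l >= 0.0 { l / penalty } else { l * penalty }
        });
        logits[idx] = penalized;
    }
}

fn main() {
    let mut logits = vec![2.0_f32, -1.0, 0.5];
    let mut cache = HashMap::new();
    // Tokens 0 and 1 already appeared; token 0 appeared twice,
    // so its second occurrence hits the cache.
    apply_cached_repeat_penalty(&mut logits, &[0, 1, 0], 1.5, &mut cache);
    println!("{:?}", logits);
}
```

Caching also keeps the penalty idempotent here: the second occurrence of token 0 reuses the stored value instead of compounding the penalty.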
geoffsee
b8ba994783 Integrate create_inference_router from inference-engine into predict-otron-9000, simplify server routing, and update dependencies to unify versions. 2025-08-16 19:53:33 -04:00
geoffsee
2aa6d4cdf8 Introduce predict-otron-9000: Unified server combining embeddings and inference engines. Includes OpenAI-compatible APIs, full documentation, and example scripts. 2025-08-16 19:11:35 -04:00