19 Commits

Author SHA1 Message Date
geoffsee
296d4dbe7e add root dockerfile that contains binaries for all services 2025-09-04 14:54:20 -04:00
geoffsee
ff55d882c7 reorg + update docs with new paths 2025-09-04 12:40:59 -04:00
geoffsee
400c70f17d streaming implementation re-added to the UI 2025-09-02 14:45:16 -04:00
geoffsee
21f20470de patch version 2025-09-01 22:55:59 -04:00
geoffsee
2deecb5e51 chat client only displays available models 2025-09-01 22:29:54 -04:00
geoffsee
d1a7d5b28e fix format error 2025-08-31 19:59:09 -04:00
geoffsee
64daa77c6b leptos chat ui renders 2025-08-31 18:50:25 -04:00
geoffsee
2b4a8a9df8 chat-ui not functional yet but builds 2025-08-31 18:18:56 -04:00
geoffsee
7bc9479a11 fix format issues; needs a pre-commit hook 2025-08-31 13:24:51 -04:00
geoffsee
0580dc8c5e move cli into crates and stage for release 2025-08-31 13:23:50 -04:00
geoffsee
e6c417bd83 align dependencies across inference features 2025-08-31 10:49:04 -04:00
geoffsee
315ef17605 supports small llama and gemma models
Refactor inference

dedicated crates for llama and gemma inference; not yet integrated
2025-08-29 20:00:41 -04:00
geoffsee
e38a2d4512 predict-otron-9000 serves a leptos SSR frontend 2025-08-28 12:06:22 -04:00
geoffsee
45d7cd8819 - Introduced ServerConfig for handling deployment modes and services.
- Added HighAvailability mode for proxying requests to external services.
- Maintained Local mode for embedded services.
- Updated `README.md` and included `SERVER_CONFIG.md` for detailed documentation.
2025-08-28 09:55:39 -04:00
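The ServerConfig entry above describes two deployment modes: an embedded Local mode and a HighAvailability mode that proxies to external services. A rough sketch of how such a config might be modeled follows; the type and field names here are illustrative guesses, not the schema documented in `SERVER_CONFIG.md`:

```rust
/// Hypothetical model of the two deployment modes described in the
/// commit; the actual schema is documented in SERVER_CONFIG.md.
#[derive(Debug)]
enum ServerMode {
    /// Run embeddings and inference in-process.
    Local,
    /// Proxy requests to external service URLs.
    HighAvailability {
        inference_url: String,
        embeddings_url: String,
    },
}

/// Summarize what the server would do in a given mode.
fn describe(mode: &ServerMode) -> String {
    match mode {
        ServerMode::Local => "serving embedded services".to_string(),
        ServerMode::HighAvailability { inference_url, .. } => {
            format!("proxying inference to {inference_url}")
        }
    }
}

fn main() {
    let mode = ServerMode::HighAvailability {
        inference_url: "http://inference:8080".into(),
        embeddings_url: "http://embeddings:8081".into(),
    };
    println!("{}", describe(&mode));
}
```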
geoffsee
719beb3791 - Change default server host to localhost for improved security.
- Increase default maximum tokens in CLI configuration to 256.
- Refactor and reorganize CLI
2025-08-27 21:47:31 -04:00
geoffsee
432c04d9df Removed legacy inference engine assets. 2025-08-27 16:19:31 -04:00
geoffsee
8338750beb Refactor apply_cached_repeat_penalty for optimized caching and reuse, add extensive unit tests, and integrate special handling for gemma-specific models.
Removed `test_request.sh`, deprecated functionality, and unused imports; introduced a new CLI tool (`cli.ts`) for testing the inference engine and adjusted handling of non-streaming and streaming chat completions.

- Add CPU fallback support for text generation when primary device is unsupported
- Introduce `execute_with_fallback` method to handle device compatibility and shape mismatch errors
- Extend unit tests to reproduce tensor shape mismatch errors specific to model configurations
- Increase HTTP timeout limits in `curl_chat_stream.sh` script for reliable API testing

chat completion endpoint now works with gemma3 (no streaming)

Add benchmarking guide with HTML reporting, Leptos chat crate, and middleware for metrics tracking
2025-08-27 16:15:01 -04:00
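The `apply_cached_repeat_penalty` refactor above concerns a standard sampling step: scaling down the logits of tokens that already appear in the context, with the penalized values cached so a token occurring many times is only recomputed once. A minimal sketch of the idea, where the function signature and caching scheme are illustrative assumptions rather than the crate's actual API:

```rust
use std::collections::HashMap;

/// Illustrative sketch (not the predict-otron-9000 API): apply a
/// repeat penalty to logits in place, caching penalized values per
/// token id so repeated context tokens are computed only once.
fn apply_cached_repeat_penalty(
    logits: &mut [f32],
    context_tokens: &[u32],
    penalty: f32,
    cache: &mut HashMap<u32, f32>,
) {
    for &tok in context_tokens {
        let idx = tok as usize;
        if idx >= logits.len() {
            continue;
        }
        let penalized = *cache.entry(tok).or_insert_with(|| {
            let l = logits[idx];
            // Common repeat-penalty rule: shrink positive logits,
            // amplify negative ones, making repeats less likely.
            if l >= 0.0 { l / penalty } else { l * penalty }
        });
        logits[idx] = penalized;
    }
}

fn main() {
    let mut logits = vec![2.0_f32, -1.0, 0.5];
    let mut cache = HashMap::new();
    // Tokens 0 and 1 already appeared; token 0 appeared twice,
    // so its second occurrence hits the cache.
    apply_cached_repeat_penalty(&mut logits, &[0, 1, 0], 1.5, &mut cache);
    println!("{:?}", logits);
}
```

Caching also keeps the penalty idempotent here: the second occurrence of token 0 reuses the stored value instead of compounding the penalty.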
geoffsee
b8ba994783 Integrate create_inference_router from inference-engine into predict-otron-9000, simplify server routing, and update dependencies to unify versions. 2025-08-16 19:53:33 -04:00
geoffsee
2aa6d4cdf8 Introduce predict-otron-9000: Unified server combining embeddings and inference engines. Includes OpenAI-compatible APIs, full documentation, and example scripts. 2025-08-16 19:11:35 -04:00