openinfer

A from-scratch LLM inference engine in pure Rust + CUDA. No PyTorch, no ONNX, no model framework runtime.

Rust + CUDA, nothing else

The entire stack (weight loading, paged KV cache, schedulers, kernels) is built from the ground up in Rust and CUDA, with Triton AOT and FlashInfer kernels compiled at build time. Python is used only at build time.
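To make the paged KV cache item concrete, here is a minimal Rust sketch of the block-table idea that paged caches are commonly built on: each sequence maps logical token positions to fixed-size physical blocks drawn from a shared pool. This is an illustration under assumptions, not openinfer's implementation; `BlockAllocator`, `SequenceBlocks`, and the block size of 16 are hypothetical names and values.

```rust
/// Tokens per KV block. Hypothetical value, chosen only for illustration.
const BLOCK_SIZE: usize = 16;

/// Hands out physical block ids from a fixed pool (illustrative type).
struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        // Reversed so that block 0 is handed out first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

/// Per-sequence block table: logical block index -> physical block id.
struct SequenceBlocks {
    table: Vec<usize>,
    len: usize, // number of tokens stored so far
}

impl SequenceBlocks {
    fn new() -> Self {
        Self { table: Vec::new(), len: 0 }
    }

    /// Reserve a slot for one more token. Returns (physical block, offset),
    /// or None if the pool has no free block left.
    fn append(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        let pos = self.len;
        if pos % BLOCK_SIZE == 0 {
            // All current blocks are full (or there are none yet).
            self.table.push(alloc.alloc()?);
        }
        self.len += 1;
        self.slot(pos)
    }

    /// Translate a logical token position into (physical block, offset).
    fn slot(&self, pos: usize) -> Option<(usize, usize)> {
        if pos >= self.len {
            return None;
        }
        Some((self.table[pos / BLOCK_SIZE], pos % BLOCK_SIZE))
    }

    /// Return every block of a finished sequence to the pool.
    fn free(self, alloc: &mut BlockAllocator) {
        for block in self.table {
            alloc.release(block);
        }
    }
}
```

With this layout a sequence only holds the blocks it has actually filled, and blocks released by a finished sequence can be reused immediately by another one.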

OpenAI-compatible API

Serves a /v1/completions endpoint with streaming. Point any OpenAI SDK or curl at it and start generating.
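As a rough sketch of what talking to the endpoint can look like without an SDK, the Rust snippet below builds a raw HTTP POST for /v1/completions and interprets streamed response lines. It assumes the server follows the OpenAI streaming convention of server-sent events (`data: {json}` lines, terminated by `data: [DONE]`). The helper names, and any host, port, or model name passed in, are placeholders rather than openinfer defaults.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// JSON body for a streaming completion request (OpenAI-style field names).
fn completion_body(model: &str, prompt: &str, max_tokens: u32) -> String {
    format!(
        r#"{{"model":"{}","prompt":"{}","max_tokens":{},"stream":true}}"#,
        json_escape(model),
        json_escape(prompt),
        max_tokens
    )
}

/// Raw HTTP/1.1 request text that could be written to a TCP socket.
fn http_request(host: &str, port: u16, body: &str) -> String {
    format!(
        "POST /v1/completions HTTP/1.1\r\nHost: {}:{}\r\nContent-Type: application/json\r\nAccept: text/event-stream\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        host,
        port,
        body.len(),
        body
    )
}

/// One parsed line of a server-sent-event stream.
#[derive(Debug, PartialEq)]
enum SseEvent<'a> {
    /// A JSON chunk; decoding it is left to a JSON library.
    Data(&'a str),
    /// The `[DONE]` sentinel that ends an OpenAI-style stream.
    Done,
}

/// Returns None for blank lines, comments, and anything that is not a data line.
fn parse_sse_line(line: &str) -> Option<SseEvent<'_>> {
    let payload = line.trim_end().strip_prefix("data:")?.trim_start();
    if payload == "[DONE]" {
        Some(SseEvent::Done)
    } else {
        Some(SseEvent::Data(payload))
    }
}
```

Writing the string from `http_request` to a `std::net::TcpStream` and feeding each response line to `parse_sse_line` is enough for experiments. A real client should use an HTTP library that handles chunked transfer encoding and a JSON library for the payloads.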

One engine per model

There is no universal model abstraction. Each model line owns its scheduler, kernel plan, and execution path, covering full attention, hybrid linear attention, MLA, and MoE with expert parallelism.
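One way to picture this design in Rust is a closed enum of concrete engine types, each with its own inherent methods, instead of a shared trait that every model must fit. The sketch below is purely illustrative: the type names, the model identifier strings, and the `describe_step` method are hypothetical and are not taken from openinfer's code.

```rust
/// Hypothetical engine for a dense full-attention model line.
struct DenseEngine {
    tensor_parallel: usize,
}

/// Hypothetical engine for an MoE model line.
struct MoeEngine {
    expert_parallel: usize,
}

impl DenseEngine {
    /// Line-specific description of how a decode step would run.
    fn describe_step(&self) -> String {
        format!("full attention, tensor parallel across {} GPU(s)", self.tensor_parallel)
    }
}

impl MoeEngine {
    /// Deliberately a separate method, not an implementation of a shared trait.
    fn describe_step(&self) -> String {
        format!("MoE routing, experts sharded across {} GPU(s)", self.expert_parallel)
    }
}

/// Closed set of engines: supporting a new model line means adding a variant.
enum Engine {
    Dense(DenseEngine),
    Moe(MoeEngine),
}

/// Pick the concrete engine for a model identifier (identifiers are illustrative).
fn select_engine(model: &str, gpus: usize) -> Option<Engine> {
    match model {
        "Qwen3-4B" | "Qwen3-8B" => Some(Engine::Dense(DenseEngine { tensor_parallel: gpus })),
        "DeepSeek-V2-Lite" => Some(Engine::Moe(MoeEngine { expert_parallel: gpus })),
        _ => None,
    }
}

/// The only shared code is the dispatch; each arm calls a line-specific path.
fn describe(engine: &Engine) -> String {
    match engine {
        Engine::Dense(e) => e.describe_step(),
        Engine::Moe(e) => e.describe_step(),
    }
}
```

The trade-off in a layout like this is some duplicated plumbing per model line in exchange for freedom to specialize the scheduler and kernels of each one.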

CUDA Graph decode

The decode path is captured as a CUDA graph with pre-allocated buffers, keeping per-token overhead low and decode latency flat across context lengths.

Model | Architecture
--- | ---
Qwen3-4B / 8B | Full attention, tensor parallel
Qwen3.5-4B | Hybrid: 24 linear + 8 full attention layers
DeepSeek-V4 | MoE + compressor + indexer, 8-GPU
DeepSeek-V2-Lite | MoE + expert parallelism, 2-GPU
Kimi-K2 | MLA + MoE + Marlin INT4, 8-GPU expert parallelism