Getting Started
openinfer builds from source with Cargo. Python is needed once at build time for Triton AOT kernel compilation — the running server has no Python dependency.
Prerequisites
- Rust (2024 edition)
- CUDA Toolkit (nvcc, cuBLAS) and a CUDA-capable GPU
- NVIDIA driver R535 (CUDA 12.2) or newer
- Python 3 + Triton (build-time only)
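A quick preflight check can catch a missing tool before a long build starts. The sketch below is not part of openinfer: the tool list is an assumption derived from the prerequisites above (`python3` stands in for "Python 3"), and it only checks that the executables are on `PATH` and that the driver branch reported by `nvidia-smi` is at least 535.

```python
import shutil
import subprocess

# Assumed executable names for the prerequisites listed above.
REQUIRED_TOOLS = ["cargo", "nvcc", "nvidia-smi", "python3"]
MIN_DRIVER_MAJOR = 535  # R535, per the prerequisites list


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


def driver_is_supported(version, min_major=MIN_DRIVER_MAJOR):
    """True if a version string such as '535.104.05' meets the minimum branch."""
    return int(version.strip().split(".")[0]) >= min_major


def installed_driver_version():
    """Ask nvidia-smi for the first GPU's driver version; None if the query fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.splitlines()[0].strip()
    except (OSError, subprocess.CalledProcessError, IndexError):
        return None


def main():
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("missing from PATH:", ", ".join(missing))
        return
    version = installed_driver_version()
    if version is None or not driver_is_supported(version):
        print(f"driver {version!r} is missing or older than R{MIN_DRIVER_MAJOR}")
    else:
        print(f"prerequisites look OK (driver {version})")


if __name__ == "__main__":
    main()
```

If `nvcc` is installed but reported missing, the CUDA `bin` directory is probably not on `PATH`; the check does not look inside `CUDA_HOME`.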
Build & run
    git clone https://github.com/openinfer-project/openinfer
    cd openinfer

    # One-time Python setup for Triton AOT kernel compilation
    uv venv && source .venv/bin/activate
    uv pip install torch --index-url https://download.pytorch.org/whl/cu128

    # Download a model
    huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B

    # Build & start the server on port 8000
    export CUDA_HOME=/usr/local/cuda
    export OPENINFER_TRITON_PYTHON=.venv/bin/python
    cargo run --release -- --model-path models/Qwen3-4B

Always build with --release: debug builds of the CUDA paths are far too slow to be usable.
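The first `cargo run` has to compile before the server starts listening, so scripts that send requests right after launch may want to wait for the port. The following is a minimal sketch under stated assumptions: it only tests whether port 8000 accepts TCP connections, the function names are hypothetical, and the 600-second deadline is an arbitrary placeholder. An open port does not by itself prove that the model has finished loading.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_server(host="localhost", port=8000, deadline_s=600.0, poll_s=2.0):
    """Poll until the port accepts connections; False if the deadline passes first."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if port_is_open(host, port):
            return True
        time.sleep(poll_s)
    return False
```

A calling script could run `wait_for_server()` and exit with an error when it returns `False`.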
Send a request
The server exposes an OpenAI-compatible /v1/completions endpoint:

    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "The capital of France is", "max_tokens": 32}'

Streaming:

    curl -N http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "The capital of France is", "max_tokens": 32, "stream": true}'

Any OpenAI SDK works the same way: set the base URL to http://localhost:8000/v1.
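For scripts that should not depend on an SDK, the same two requests can be sketched with the Python standard library alone. The helper names here are hypothetical, and the response handling is an assumption: it presumes the server follows the usual OpenAI completions conventions, with the generated text at `choices[0].text` and streamed chunks arriving as `data: {...}` lines terminated by `data: [DONE]`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same base URL as the curl examples


def build_request(prompt, max_tokens=32, stream=False, base_url=BASE_URL):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    if stream:
        payload["stream"] = True
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(line):
    """Decode one streamed line into a chunk dict.

    Returns None for blank or non-data lines and for the end marker.
    Assumes OpenAI-style events: 'data: {json}' lines, then 'data: [DONE]'.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if body == "[DONE]":
        return None
    return json.loads(body)


def complete(prompt, max_tokens=32):
    """Non-streaming call; assumes the text is at choices[0].text."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        return json.load(resp)["choices"][0]["text"]


def stream_complete(prompt, max_tokens=32):
    """Streaming call; yields text fragments as they arrive."""
    request = build_request(prompt, max_tokens, stream=True)
    with urllib.request.urlopen(request) as resp:
        for raw in resp:  # the HTTP response iterates line by line
            chunk = parse_sse_line(raw.decode("utf-8"))
            if chunk is not None:
                yield chunk["choices"][0]["text"]
```

With the server running, `complete("The capital of France is")` would return the generated text, and iterating over `stream_complete(...)` would print fragments as they are produced, provided the assumed response shape holds.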
Next steps
Pick a model from the sidebar for model-specific launch flags, performance numbers, and architecture notes. Qwen3-4B is the most mature line and the best place to start.