xLLM

The High-Performance Inference Framework for LLM, VLM, DiT, and REC Models on Heterogeneous Accelerators

Easy

Deploy mainstream open models on supported accelerators with a unified service-engine stack and production-oriented tooling.

Fast

Improve throughput with asynchronous scheduling, graph optimization, multi-stream execution, and dynamic load balancing.

Cost Efficient

Reduce inference cost through hardware-aware optimization, efficient memory management, and global KV cache control.

Broad Support

Cross-Platform Model & Hardware Support

xLLM supports mainstream open models across heterogeneous accelerator targets.

Supported Hardware

  • Ascend NPUs

  • Cambricon MLUs

  • Iluvatar CoreX GPUs

  • Moore Threads GPUs

Supported Models

  • DeepSeek

  • Qwen

  • Kimi

  • GLM

  • Flux

  • OneRec

Got Questions?

We are here to help

Whether you are getting started or debugging a deployment, the project's public resources are the fastest way to find answers.