The High-Performance Inference Framework for LLM, VLM, DiT and REC Models on Heterogeneous Accelerators

Easy

Deploy mainstream open models on supported accelerators with a unified service-engine stack and production-oriented tooling.

Fast

Improve throughput with asynchronous scheduling, graph optimization, multi-stream execution, and dynamic load balancing.

Cost Efficient

Reduce inference cost through hardware-aware optimization, efficient memory management, and global KV cache control.

Broad Support

Cross-Platform Model & Hardware Support

xLLM supports mainstream open models across heterogeneous accelerator targets.

Supported Hardware

Ascend NPUs
Cambricon MLUs
Iluvatar CoreX GPUs
Moore Threads GPUs

Supported Models

DeepSeek
Qwen
Kimi
GLM
Flux
OneRec

Got Questions?

We are here to help

Whether you are getting started or debugging a deployment, the public project resources are the fastest way to find answers.

Documentation

Setup guides, features, launch flows, and reference material.

GitHub Issues

Bug reports, feature requests, and implementation questions.

Contributors

See the developers and maintainers contributing to the project.

Resources

Explore docs, papers, and releases

Core project resources in one place.

Documentation

Guides, setup instructions, and feature docs.

Publications

Read xLLM papers, reports, and supporting artifacts.

Docker

Use published images as a starting point for deployment and development.
