Model Compression & Quantization

Novel quantization techniques and pruning methods that significantly reduce model size and inference cost with minimal accuracy degradation.

INT4/INT8 · Polar Transform · Sparsity
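
To make the quantization theme concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in NumPy. It is not tied to any particular method above; the function names and shapes are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 with a single symmetric scale."""
    scale = np.abs(w).max() / 127.0           # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"mean abs quantization error: {err:.5f}")  # small relative to weight scale
```

Storing `q` (1 byte per weight) plus one scale in place of float32 weights gives the roughly 4x size reduction that per-tensor INT8 schemes target; finer-grained (per-channel or per-group) scales trade a little extra metadata for lower error.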

Fast Inference & Runtime Optimization

Hardware-aware inference algorithms and memory-efficient runtime systems for low-latency, high-throughput deployment on real devices.

Latency · Throughput · GPU/Edge
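
The latency/throughput tension that this line of work navigates can be illustrated with a tiny benchmark, using a matmul as a stand-in for one inference step. Everything here (sizes, iteration counts, function names) is an assumption for illustration only.

```python
import time
import numpy as np

w = np.random.randn(1024, 1024).astype(np.float32)

def run_batch(batch_size: int, n_iters: int = 50) -> tuple[float, float]:
    """Return (latency per batch in ms, throughput in requests/s)."""
    x = np.random.randn(batch_size, 1024).astype(np.float32)
    start = time.perf_counter()
    for _ in range(n_iters):
        _ = x @ w                       # one "inference step"
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1e3
    throughput = batch_size * n_iters / elapsed
    return latency_ms, throughput

for bs in (1, 8, 64):
    lat, thr = run_batch(bs)
    print(f"batch={bs:3d}  latency={lat:7.2f} ms  throughput={thr:9.0f} req/s")
```

Larger batches typically raise throughput but also raise per-request latency; hardware-aware runtimes pick operating points on this curve that fit each device's constraints.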

Efficient Large Language Models

Attention approximation, KV cache optimization, and speculative decoding that push the efficiency frontier of LLMs toward practical deployment.

LLM · KV Cache · Attention
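
As one example of why KV caching matters, here is a minimal sketch of autoregressive decoding with a growing key/value cache: past keys and values are stored so each new token attends over them without recomputation. Single head, NumPy only; all names and shapes are illustrative assumptions.

```python
import numpy as np

d = 64                                    # head dimension
k_cache = np.zeros((0, d), dtype=np.float32)
v_cache = np.zeros((0, d), dtype=np.float32)

def decode_step(q, k_new, v_new):
    """Append this step's K/V to the cache, then attend over the full cache."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k_new[None, :]])
    v_cache = np.vstack([v_cache, v_new[None, :]])
    scores = k_cache @ q / np.sqrt(d)     # (t,) attention logits
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ v_cache                # (d,) attention output

rng = np.random.default_rng(0)
for t in range(5):
    q, k, v = rng.standard_normal((3, d)).astype(np.float32)
    out = decode_step(q, k, v)
print("cache length after 5 steps:", k_cache.shape[0])  # grows linearly with tokens
```

The cache grows linearly with sequence length, which is exactly the memory pressure that KV cache compression, eviction, and quantization techniques aim to relieve.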