Research
We develop the fundamental algorithms and theoretical frameworks behind fast, memory-efficient machine learning.
Model Compression & Quantization
Novel quantization techniques and pruning methods that significantly reduce model size and inference cost with minimal accuracy degradation.
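To make the core idea concrete, here is a minimal sketch of symmetric per-tensor int8 post-training quantization. The function names and the random toy weight matrix are illustrative only, not our actual pipeline.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0  # map the largest-magnitude weight to the int8 limit
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
# Storage drops 4x (float32 -> int8); the reconstruction error stays
# small relative to the weight scale.
print("mean abs error:", np.abs(w - dequantize(q, scale)).mean())
```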
Fast Inference & Runtime Optimization
Hardware-aware inference algorithms and memory-efficient runtime systems for low-latency, high-throughput deployment on real devices.
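One ingredient of memory-efficient runtimes is reusing pre-allocated activation buffers rather than allocating on every call. A hedged sketch of that pattern, with layer sizes and names made up for illustration:

```python
import numpy as np

# Toy 2-layer MLP whose activation buffers are allocated once and reused
# across requests, so the steady-state serving loop does no heap allocation.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((512, 2048)).astype(np.float32)
W2 = rng.standard_normal((2048, 512)).astype(np.float32)

BATCH = 8
h = np.empty((BATCH, 2048), dtype=np.float32)    # hidden activation buffer
out = np.empty((BATCH, 512), dtype=np.float32)   # output buffer

def forward(x: np.ndarray) -> np.ndarray:
    np.matmul(x, W1, out=h)     # write into the preallocated buffer
    np.maximum(h, 0.0, out=h)   # in-place ReLU
    np.matmul(h, W2, out=out)
    return out

x = rng.standard_normal((BATCH, 512)).astype(np.float32)
y = forward(x)  # subsequent calls reuse h and out
```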
Efficient Large Language Models
Attention approximation, KV cache optimization, and speculative decoding that push the efficiency frontier of LLMs toward practical deployment.
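As an illustration of the speculative-decoding idea, here is the greedy-verification variant in a few lines. The toy "models" are arbitrary deterministic next-token functions over an integer vocabulary, standing in for a cheap draft model and an expensive target model.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft proposes k tokens cheaply; the target verifies them.

    Greedy variant: accept the longest prefix of draft tokens that the
    target model would itself have produced, then emit one corrected token.
    """
    draft = list(prefix)
    for _ in range(k):                     # cheap model proposes k tokens
        draft.append(draft_next(draft))
    proposed = draft[len(prefix):]

    accepted = []
    ctx = list(prefix)
    for tok in proposed:                   # verify each proposal
        expected = target_next(ctx)        # in practice: one batched target pass
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)      # first mismatch: take the target's token
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: target adds one more
    return prefix + accepted

# Toy deterministic "models" over a vocabulary of 100 integer tokens.
draft_next = lambda seq: (seq[-1] * 7 + 3) % 100
target_next = lambda seq: (seq[-1] * 7 + 3) % 100 if seq[-1] % 5 else (seq[-1] + 1) % 100

print(speculative_step([1], draft_next, target_next))  # multiple tokens per target step
```

The payoff is that each verification step costs roughly one target-model pass but can emit several tokens when the draft agrees with the target.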