Model Compression & Quantization

Novel quantization techniques and pruning methods that significantly reduce model size and inference cost with minimal accuracy degradation.

INT4/INT8 · Polar Transform · Sparsity
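
To make the quantization theme concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in NumPy. It is not tied to any particular method above; the function names and shapes are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 with a single symmetric scale."""
    scale = np.abs(w).max() / 127.0           # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"mean abs quantization error: {err:.5f}")  # small relative to weight scale
```

Storing `q` (1 byte per weight) plus one scale in place of float32 weights gives the roughly 4x size reduction that per-tensor INT8 schemes target; finer-grained (per-channel or per-group) scales trade a little extra metadata for lower error.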

Fast Inference & Runtime Optimization

Hardware-aware inference algorithms and memory-efficient runtime systems for low-latency, high-throughput deployment on real devices.

Latency · Throughput · GPU/Edge
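
The latency/throughput tension that this line of work navigates can be illustrated with a tiny benchmark, using a matmul as a stand-in for one inference step. Everything here (sizes, iteration counts, function names) is an assumption for illustration only.

```python
import time
import numpy as np

w = np.random.randn(1024, 1024).astype(np.float32)

def run_batch(batch_size: int, n_iters: int = 50) -> tuple[float, float]:
    """Return (latency per batch in ms, throughput in requests/s)."""
    x = np.random.randn(batch_size, 1024).astype(np.float32)
    start = time.perf_counter()
    for _ in range(n_iters):
        _ = x @ w                       # one "inference step"
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1e3
    throughput = batch_size * n_iters / elapsed
    return latency_ms, throughput

for bs in (1, 8, 64):
    lat, thr = run_batch(bs)
    print(f"batch={bs:3d}  latency={lat:7.2f} ms  throughput={thr:9.0f} req/s")
```

Larger batches typically raise throughput but also raise per-request latency; hardware-aware runtimes pick operating points on this curve that fit each device's constraints.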

Efficient Large Language Models

Attention approximation, KV cache optimization, and speculative decoding that push the efficiency frontier of LLMs toward practical deployment.

LLM · KV Cache · Attention
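
As one example of why KV caching matters, here is a minimal sketch of autoregressive decoding with a growing key/value cache: past keys and values are stored so each new token attends over them without recomputation. Single head, NumPy only; all names and shapes are illustrative assumptions.

```python
import numpy as np

d = 64                                    # head dimension
k_cache = np.zeros((0, d), dtype=np.float32)
v_cache = np.zeros((0, d), dtype=np.float32)

def decode_step(q, k_new, v_new):
    """Append this step's K/V to the cache, then attend over the full cache."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k_new[None, :]])
    v_cache = np.vstack([v_cache, v_new[None, :]])
    scores = k_cache @ q / np.sqrt(d)     # (t,) attention logits
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ v_cache                # (d,) attention output

rng = np.random.default_rng(0)
for t in range(5):
    q, k, v = rng.standard_normal((3, d)).astype(np.float32)
    out = decode_step(q, k, v)
print("cache length after 5 steps:", k_cache.shape[0])  # grows linearly with tokens
```

The cache grows linearly with sequence length, which is exactly the memory pressure that KV cache compression, eviction, and quantization techniques aim to relieve.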