Auto LLM Tuner
Problem
LLMs are expensive to run due to high inference latency and memory cost. Standard quantization applies a single precision across the whole model, but layers differ in how sensitive they are to reduced precision. This project builds a system that automatically selects per-layer bit-widths to balance efficiency and output quality, despite a search space that grows exponentially with the number of layers.
Approach
We represent a model configuration as a per-layer precision vector (e.g., [8, 4, 4, 16, 8, ...]). An evolutionary algorithm generates candidate vectors, applies layer-wise quantization, runs evaluation, combines metrics into a fitness score, and iterates toward high-quality tradeoffs on the Pareto frontier.
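A minimal sketch of this representation and the variation operators, assuming three allowed precisions (4/8/16 bits) and a fixed layer count; the function names (`random_vector`, `mutate`, `crossover`) are illustrative, not the project's actual API:

```python
import random

BIT_CHOICES = [4, 8, 16]   # allowed per-layer precisions
NUM_LAYERS = 32            # e.g., number of transformer blocks

def random_vector():
    """Random per-layer precision vector, e.g. [8, 4, 4, 16, 8, ...]."""
    return [random.choice(BIT_CHOICES) for _ in range(NUM_LAYERS)]

def mutate(vec, rate=0.1):
    """Resample each layer's bit-width with probability `rate`."""
    return [random.choice(BIT_CHOICES) if random.random() < rate else b for b in vec]

def crossover(a, b):
    """Single-point crossover between two parent vectors."""
    point = random.randint(1, NUM_LAYERS - 1)
    return a[:point] + b[point:]
```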
System Architecture
- Model management layer: load base model, apply per-layer bit-widths, run inference, collect metrics
- Precision vector system: create/manage per-layer precision vectors
- Evaluation pipeline: compute accuracy, latency, and memory footprint
- Fitness module: aggregate metrics into a single optimization score (sketched after this list)
- Evolutionary search engine: selection + mutation/crossover to find Pareto-optimal configs
- Visualization: Pareto frontier plots + fitness progression + bit-width heatmaps
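As referenced above, one possible shape for the fitness module and the Pareto comparison, assuming accuracy is maximized while latency and memory are minimized; the `EvalResult` structure, the weights, and the baseline normalization are assumptions for illustration, not the project's exact definitions:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float      # e.g., MATH-500 solve rate in [0, 1]
    latency_s: float     # mean seconds per problem
    memory_gb: float     # peak GPU memory footprint

def fitness(r: EvalResult, baseline: EvalResult,
            w_acc=1.0, w_lat=0.5, w_mem=0.5) -> float:
    """Higher is better: reward accuracy, penalize latency and memory
    relative to a full-precision baseline."""
    return (w_acc * r.accuracy
            - w_lat * r.latency_s / baseline.latency_s
            - w_mem * r.memory_gb / baseline.memory_gb)

def dominates(a: EvalResult, b: EvalResult) -> bool:
    """a Pareto-dominates b if it is no worse on every metric and better on at least one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency_s <= b.latency_s
                and a.memory_gb <= b.memory_gb)
    better = (a.accuracy > b.accuracy or a.latency_s < b.latency_s
              or a.memory_gb < b.memory_gb)
    return no_worse and better
```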
Dataset & Setup
- Evaluated on MATH-500 (and a MATH-50 subset for faster iteration) for math reasoning benchmarking
- Quantization levels explored: 4-bit / 8-bit / 16-bit
- Models tested included Qwen2.5-7B-Instruct, Trinity-Mini, and Mistral-7B
- Multi-GPU evaluation using Python multiprocessing; experiments were run on 8× H200 GPUs (Vast.ai), with additional runs on Pace/ICE
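One way the multi-GPU evaluation could be organized with Python multiprocessing, pinning each worker process to a GPU via `CUDA_VISIBLE_DEVICES` before any CUDA initialization; `evaluate_config` is a placeholder standing in for the project's actual quantize-and-evaluate pipeline:

```python
import os
import multiprocessing as mp

NUM_GPUS = 8  # e.g., 8x H200

def evaluate_config(precision_vector):
    """Placeholder for the real pipeline: apply per-layer quantization,
    run the MATH subset, and return (accuracy, latency, memory)."""
    raise NotImplementedError

def _worker(gpu_id, task_queue, result_queue):
    """Pin this process to a single GPU, then evaluate candidates pulled from the queue."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # set before any CUDA init
    while True:
        item = task_queue.get()
        if item is None:                              # sentinel: no more work
            return
        idx, precision_vector = item
        result_queue.put((idx, evaluate_config(precision_vector)))

def evaluate_population(population):
    """Spread one generation of precision vectors across the available GPUs."""
    ctx = mp.get_context("spawn")                     # spawn avoids fork/CUDA issues
    tasks, results = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=_worker, args=(g, tasks, results))
               for g in range(NUM_GPUS)]
    for w in workers:
        w.start()
    for idx, vec in enumerate(population):
        tasks.put((idx, vec))
    for _ in workers:
        tasks.put(None)
    scored = [results.get() for _ in population]
    for w in workers:
        w.join()
    return [score for _, score in sorted(scored)]
```

Using the `spawn` start method keeps workers from inheriting CUDA state from the parent process, which is a common source of hangs when `fork` is combined with GPU libraries.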
Findings
Across the tested models, the strongest configurations tended to lie in the low-latency, low-memory region of the tradeoff space, and they frequently used non-uniform precision vectors (different bit-widths per layer) rather than uniform quantization.