Auto LLM Tuner
Problem
LLMs are expensive to run due to high inference latency and memory cost. Standard quantization applies a single precision across the whole model, but layers differ in how sensitive they are to reduced precision. This project builds a system that automatically selects per-layer bit-widths to balance efficiency and output quality, despite a search space that grows exponentially with the number of layers.
Approach
We represent a model configuration as a per-layer precision vector (e.g., [8, 4, 4, 16, 8, ...]). An evolutionary algorithm generates candidate vectors, applies layer-wise quantization, runs evaluation, combines metrics into a fitness score, and iterates toward high-quality tradeoffs on the Pareto frontier.
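A minimal sketch of this representation and the variation operators, assuming three allowed precisions (4/8/16 bits) and a fixed layer count; the function names (`random_vector`, `mutate`, `crossover`) are illustrative, not the project's actual API:

```python
import random

BIT_CHOICES = [4, 8, 16]   # allowed per-layer precisions
NUM_LAYERS = 32            # e.g., number of transformer blocks

def random_vector():
    """Random per-layer precision vector, e.g. [8, 4, 4, 16, 8, ...]."""
    return [random.choice(BIT_CHOICES) for _ in range(NUM_LAYERS)]

def mutate(vec, rate=0.1):
    """Resample each layer's bit-width with probability `rate`."""
    return [random.choice(BIT_CHOICES) if random.random() < rate else b for b in vec]

def crossover(a, b):
    """Single-point crossover between two parent vectors."""
    point = random.randint(1, NUM_LAYERS - 1)
    return a[:point] + b[point:]
```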
System Architecture
- Model management layer: load base model, apply per-layer bit-widths, run inference, collect metrics
- Precision vector system: create/manage per-layer precision vectors
- Evaluation pipeline: compute accuracy, latency, and memory footprint
- Fitness module: aggregate metrics into a single optimization score (sketched after this list)
- Evolutionary search engine: selection + mutation/crossover to find Pareto-optimal configs
- Visualization: Pareto frontier plots + fitness progression + bit-width heatmaps
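As referenced above, one possible shape for the fitness module and the Pareto comparison, assuming accuracy is maximized while latency and memory are minimized; the `EvalResult` structure, the weights, and the baseline normalization are assumptions for illustration, not the project's exact definitions:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float      # e.g., MATH-500 solve rate in [0, 1]
    latency_s: float     # mean seconds per problem
    memory_gb: float     # peak GPU memory footprint

def fitness(r: EvalResult, baseline: EvalResult,
            w_acc=1.0, w_lat=0.5, w_mem=0.5) -> float:
    """Higher is better: reward accuracy, penalize latency and memory
    relative to a full-precision baseline."""
    return (w_acc * r.accuracy
            - w_lat * r.latency_s / baseline.latency_s
            - w_mem * r.memory_gb / baseline.memory_gb)

def dominates(a: EvalResult, b: EvalResult) -> bool:
    """a Pareto-dominates b if it is no worse on every metric and better on at least one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency_s <= b.latency_s
                and a.memory_gb <= b.memory_gb)
    better = (a.accuracy > b.accuracy or a.latency_s < b.latency_s
              or a.memory_gb < b.memory_gb)
    return no_worse and better
```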
Dataset & Setup
- Evaluated on MATH-500 (and a MATH-50 subset for faster iteration) for math reasoning benchmarking
- Quantization levels explored: 4-bit / 8-bit / 16-bit
- Models tested included Qwen2.5-7B-Instruct, Trinity-Mini, and Mistral-7B
- Multi-GPU evaluation using Python multiprocessing; experiments were run on 8× H200 GPUs (Vast.ai), with additional runs on Pace/ICE
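One way the multi-GPU evaluation could be organized with Python multiprocessing, pinning each worker process to a GPU via `CUDA_VISIBLE_DEVICES` before any CUDA initialization; `evaluate_config` is a placeholder standing in for the project's actual quantize-and-evaluate pipeline:

```python
import os
import multiprocessing as mp

NUM_GPUS = 8  # e.g., 8x H200

def evaluate_config(precision_vector):
    """Placeholder for the real pipeline: apply per-layer quantization,
    run the MATH subset, and return (accuracy, latency, memory)."""
    raise NotImplementedError

def _worker(gpu_id, task_queue, result_queue):
    """Pin this process to a single GPU, then evaluate candidates pulled from the queue."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # set before any CUDA init
    while True:
        item = task_queue.get()
        if item is None:                              # sentinel: no more work
            return
        idx, precision_vector = item
        result_queue.put((idx, evaluate_config(precision_vector)))

def evaluate_population(population):
    """Spread one generation of precision vectors across the available GPUs."""
    ctx = mp.get_context("spawn")                     # spawn avoids fork/CUDA issues
    tasks, results = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=_worker, args=(g, tasks, results))
               for g in range(NUM_GPUS)]
    for w in workers:
        w.start()
    for idx, vec in enumerate(population):
        tasks.put((idx, vec))
    for _ in workers:
        tasks.put(None)
    scored = [results.get() for _ in population]
    for w in workers:
        w.join()
    return [score for _, score in sorted(scored)]
```

Using the `spawn` start method keeps workers from inheriting CUDA state from the parent process, which is a common source of hangs when `fork` is combined with GPU libraries.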
Findings
Across the tested models, the strongest configurations tended to lie in the low-latency, low-memory region of the tradeoff space, and they frequently used non-uniform precision vectors (different bit-widths per layer) rather than uniform quantization.