Quantization and Deployment for Finetuned Models
Summary: Finetuned models are converted to GGUF and published to Ollama in three quantization variants (q4_k_m, q5_k_m, q8_0). The GPU-served reference baseline uses BnB NF4 4-bit quantization at inference (not true fp16 — the evaluation GPU lacked sufficient VRAM for the full model), so all comparisons are NF4 vs. GGUF. Quality degrades with quantization relative to the NF4 baseline: ~2% loss at q8_0, ~12% at q5_k_m, ~9% at q4_k_m. q8_0 is the best quantized variant; q5_k_m offers the best quality/size trade-off for CPU/RPi deployment.
Evaluation basis: performance numbers in this document were measured on the finetuned summarization model (47 held-out incidents, judge: gpt-oss-120b). The GPU-served reference is serve_model.py with --quant 4bit (bitsandbytes NF4), not true fp16. The conversion and publication methodology applies to any finetuned model in this pipeline.
GGUF Conversion
Script: convert_to_gguf.py
GGUF (GPT-Generated Unified Format) is the binary format used by llama.cpp and Ollama to store model weights, typically quantized, for efficient CPU and GPU inference. The conversion script takes the merged 16-bit PyTorch model produced by training and converts it to a self-contained GGUF file at a target quantization level.
Standard path
Used for q8_0 and f16 (already near-lossless or lossless, so importance weighting brings no benefit):
1. Load at full precision: the model is loaded via FastLanguageModel.from_pretrained() with load_in_4bit=False, ensuring no precision is lost before conversion.
2. Convert and quantize: model.save_pretrained_gguf() is called with the target quantization method; Unsloth delegates to its bundled llama.cpp binaries to perform tensor-level quantization and write the GGUF file.
3. Relocate output: Unsloth always writes to <model_dir>_gguf/; the script optionally moves the result to a user-specified --output directory.
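The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of convert_to_gguf.py: the function names `convert_standard` and `relocate_gguf` are hypothetical, and the `<model_dir>_gguf/` output location is taken from the description above rather than verified against a specific Unsloth version.

```python
import shutil
from pathlib import Path
from typing import Optional


def relocate_gguf(model_dir: str, output: Optional[str] = None) -> Path:
    """Move *.gguf files from <model_dir>_gguf/ into `output`, if one is given."""
    produced = Path(f"{model_dir.rstrip('/')}_gguf")  # location assumed from the text above
    if output is None:
        return produced
    target = Path(output)
    target.mkdir(parents=True, exist_ok=True)
    for gguf in produced.glob("*.gguf"):
        shutil.move(str(gguf), str(target / gguf.name))
    return target


def convert_standard(model_dir: str, quant: str = "q8_0",
                     output: Optional[str] = None) -> Path:
    """Hypothetical wrapper for the standard (non-imatrix) path."""
    # Imported lazily so relocate_gguf() stays usable without Unsloth installed.
    from unsloth import FastLanguageModel

    # Step 1: load the merged 16-bit weights without 4-bit loading.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_dir,
        load_in_4bit=False,
    )
    # Step 2: let Unsloth's bundled llama.cpp tooling quantize and write the GGUF.
    model.save_pretrained_gguf(model_dir, tokenizer, quantization_method=quant)
    # Step 3: optionally move the result to the requested directory.
    return relocate_gguf(model_dir, output)
```

Keeping the relocation step separate from the conversion makes it possible to exercise the file handling without a GPU or an Unsloth install.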
Imatrix-guided path
Standard quantization maps all weights to lower precision uniformly, which can degrade quality on the layers that matter most. The imatrix (importance matrix) path addresses this by using calibration data to identify which weights have the highest activation impact, then allocating more precision to those weights during quantization.
Used when a calibration.txt is present and the target quant is one of q2_k, q3_k_m, q4_0, q4_k_m, q5_0, q5_k_m:
1. Produce an intermediate F16 GGUF: a lossless 16-bit GGUF is generated first as the input for imatrix computation.
2. Compute activation statistics: llama-imatrix runs inference on the calibration text using the F16 GGUF, recording how much each weight matrix contributes to the model’s outputs across the calibration corpus; the result is a .imatrix.dat file.
3. Re-quantize with importance guidance: llama-quantize --imatrix performs non-uniform quantization, so weights more important to the model’s predictions are preserved at higher precision, while less critical weights are compressed more aggressively.
4. Cleanup: the intermediate F16 GGUF and .imatrix.dat files are deleted; only the final quantized GGUF is kept.
The number of calibration chunks (default: 128) controls how much calibration text is processed — more chunks produce more accurate importance estimates at the cost of longer computation.
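As a rough sketch of the imatrix path, the snippet below decides whether the path applies and builds the two llama.cpp command lines. The helper names, the file-naming scheme, and the assumption that `llama-imatrix` and `llama-quantize` are on `PATH` are illustrative; the real script may locate the binaries and name files differently.

```python
import subprocess
from pathlib import Path

# Quant types for which the text above says imatrix guidance is used.
IMATRIX_QUANTS = {"q2_k", "q3_k_m", "q4_0", "q4_k_m", "q5_0", "q5_k_m"}


def use_imatrix(quant: str, calibration: Path) -> bool:
    """True when the quant is eligible and a calibration file exists."""
    return quant.lower() in IMATRIX_QUANTS and Path(calibration).is_file()


def imatrix_commands(f16_gguf: str, calibration: str, out_gguf: str,
                     quant: str, chunks: int = 128) -> list:
    """Build the llama-imatrix and llama-quantize invocations (not executed here)."""
    imatrix_file = f"{out_gguf}.imatrix.dat"  # hypothetical naming scheme
    compute = ["llama-imatrix", "-m", f16_gguf, "-f", calibration,
               "-o", imatrix_file, "--chunks", str(chunks)]
    quantize = ["llama-quantize", "--imatrix", imatrix_file,
                f16_gguf, out_gguf, quant.upper()]
    return [compute, quantize]


def run_imatrix_path(f16_gguf: str, calibration: str, out_gguf: str, quant: str) -> None:
    """Run both steps, then delete the intermediates as described above."""
    for cmd in imatrix_commands(f16_gguf, calibration, out_gguf, quant):
        subprocess.run(cmd, check=True)
    Path(f"{out_gguf}.imatrix.dat").unlink(missing_ok=True)
    Path(f16_gguf).unlink(missing_ok=True)
```

Building the command lists separately from running them keeps the logic inspectable: the lists can be printed or logged before any long-running calibration starts.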
Modelfile generation
After the GGUF is produced, the script auto-detects the chat template by inspecting tokenizer_config.json:
- <|im_start|> → ChatML format (Qwen2.5)
- <|start_header_id|> → Llama-3 format
- Falls back to ChatML if detection is inconclusive
An Ollama-compatible Modelfile is written alongside the GGUF, embedding the correct template with {{ .System }}, {{ .Prompt }}, and {{ .Response }} variables, plus appropriate stop tokens (<|im_end|> and <|endoftext|> for ChatML). If an OLLAMA_README.md exists next to the script, it is copied into the output directory as README.md to populate the model card on Ollama.com.
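The detection and Modelfile generation could look something like the sketch below. It only covers the ChatML case; the function names and the exact template text are assumptions for illustration, and the Llama-3 template and stop tokens are deliberately left out because they are not specified here.

```python
# ChatML template using the Ollama template variables mentioned above.
CHATML_TEMPLATE = (
    "{{ if .System }}<|im_start|>system\n"
    "{{ .System }}<|im_end|>\n"
    "{{ end }}<|im_start|>user\n"
    "{{ .Prompt }}<|im_end|>\n"
    "<|im_start|>assistant\n"
    "{{ .Response }}<|im_end|>\n"
)
CHATML_STOPS = ["<|im_end|>", "<|endoftext|>"]


def detect_template(tokenizer_config_text: str) -> str:
    """Classify the chat format from the raw text of tokenizer_config.json."""
    if "<|im_start|>" in tokenizer_config_text:
        return "chatml"
    if "<|start_header_id|>" in tokenizer_config_text:
        return "llama3"
    return "chatml"  # fallback when detection is inconclusive


def render_chatml_modelfile(gguf_name: str) -> str:
    """Return Modelfile text that points at a GGUF file in the same directory."""
    lines = [f"FROM ./{gguf_name}", f'TEMPLATE """{CHATML_TEMPLATE}"""']
    lines += [f'PARAMETER stop "{stop}"' for stop in CHATML_STOPS]
    return "\n".join(lines) + "\n"
```

Scanning the raw JSON text for marker tokens avoids depending on where a given tokenizer stores its chat template.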
cd unsloth-scripts/
python3 convert_to_gguf.py \
--model /path/to/qwen_finetuned_merged_16bit \
--quant q5_k_m \
--output ./gguf_q5_k_m/
Ollama Publication
Script: publish_to_ollama.sh
Automates the complete pipeline from raw 16-bit weights to a publicly accessible model on Ollama. For each quantization variant (default: q4_k_m, q5_k_m, q8_0), the following steps run sequentially:
Step 1 — Convert to GGUF
Calls convert_to_gguf.py with the model path and target quant, writing the GGUF file, Modelfile, and README to ./gguf_<quant>/.
Step 2 — Register locally with Ollama
Runs ollama create <model-name>:<quant> -f Modelfile from within the output directory. This registers the model in the local Ollama registry, making it immediately usable via ollama run without a network round-trip.
Step 3 — Tag :latest
After all quants are built, the first quant in the list (q4_k_m) is copied to the :latest tag. This ensures a bare ollama pull without an explicit tag fetches the most portable variant.
Step 4 — Push to Ollama.com
Each tag is pushed with ollama push. Requires prior ollama login with the stratosphere organization credentials.
# Publish all quants
./publish_to_ollama.sh
# Publish a single quant
./publish_to_ollama.sh --quant q5_k_m
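For readers who want to see the order of operations at a glance, here is a small Python sketch that builds the same sequence of commands the four steps describe. It is not a port of publish_to_ollama.sh: `publish_plan` is a hypothetical helper, and the use of `ollama cp` for the :latest tag is an assumption about how the copy is done.

```python
def publish_plan(model: str, model_dir: str,
                 quants=("q4_k_m", "q5_k_m", "q8_0")) -> list:
    """Return (working_dir, command) pairs in the order the steps run."""
    plan = []
    for quant in quants:
        out_dir = f"./gguf_{quant}/"
        # Step 1: convert to GGUF (also writes the Modelfile and README).
        plan.append((".", ["python3", "convert_to_gguf.py", "--model", model_dir,
                           "--quant", quant, "--output", out_dir]))
        # Step 2: register locally, run from inside the output directory.
        plan.append((out_dir, ["ollama", "create", f"{model}:{quant}",
                               "-f", "Modelfile"]))
    # Step 3: the first quant in the list becomes :latest.
    plan.append((".", ["ollama", "cp", f"{model}:{quants[0]}", f"{model}:latest"]))
    # Step 4: push every tag, including :latest.
    for tag in (*quants, "latest"):
        plan.append((".", ["ollama", "push", f"{model}:{tag}"]))
    return plan


if __name__ == "__main__":
    for cwd, cmd in publish_plan("stratosphere/example-model", "/path/to/merged"):
        print(f"[{cwd}] {' '.join(cmd)}")
```

Printing the plan first is a cheap dry run before anything is pushed to a public registry.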
| Tag | Quantization | Notes |
|---|---|---|
| q4_k_m | 4-bit k-quant | Most portable, smallest size |
| q5_k_m | 5-bit k-quant | Recommended for CPU/low-VRAM |
| q8_0 | 8-bit integer | Best quality among quantized variants |
| latest | = q4_k_m | Default tag for bare pulls |
Published model: stratosphere/qwen2.5-1.5b-slips-immune-summarization
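Once a variant has been pulled, it can be queried through Ollama's local REST API. The sketch below assumes a default Ollama server on localhost:11434 and uses the q5_k_m tag as an example; the helper names are illustrative and the prompt format expected by the finetuned model is not shown here.

```python
import json
import urllib.request

MODEL = "stratosphere/qwen2.5-1.5b-slips-immune-summarization:q5_k_m"


def build_generate_request(prompt: str, model: str = MODEL,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Swapping the tag in `MODEL` is enough to compare variants on the same prompt.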
Performance by Quantization
The GPU-served NF4 model serves as the reference; all GGUF variants are compared against it.
Overall
| Quantization | Avg Score | Win Rate | Score Loss | Size |
|---|---|---|---|---|
| NF4 4-bit / GPU (reference) | 4.70 | 19.1% | n/a | ~3.0 GB |
| q8_0 | 4.59 | 17.4% | −0.11 (−2%) | ~1.6 GB |
| q4_k_m | 4.28 | 14.9% | −0.42 (−9%) | ~0.9 GB |
| q5_k_m | 4.14 | 8.5% | −0.56 (−12%) | ~1.1 GB |
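The Score Loss column can be reproduced from the average scores. The short sketch below shows the arithmetic, assuming the percentage is the absolute loss divided by the NF4 reference score and rounded to the nearest whole percent.

```python
REFERENCE = 4.70  # NF4 4-bit / GPU average score
SCORES = {"q8_0": 4.59, "q4_k_m": 4.28, "q5_k_m": 4.14}


def score_loss(score: float, reference: float = REFERENCE) -> tuple:
    """Return (absolute loss, loss as a whole percent of the reference)."""
    delta = round(score - reference, 2)
    return delta, round(100 * delta / reference)


if __name__ == "__main__":
    for name, score in SCORES.items():
        delta, pct = score_loss(score)
        print(f"{name}: {delta:+.2f} ({pct:+d}%)")
```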
By Complexity
| Tier | NF4 / GPU | q8_0 | q5_k_m | q4_k_m |
|---|---|---|---|---|
| Simple (<500 events) | 5.45 | 5.24 | 4.67 | 4.93 |
| Medium (500–1999 events) | 4.00 | 4.25 | 3.00 | 3.60 |
| Complex (≥2000 events) | 3.11 | 3.00 | 3.12 | 3.22 |
| Normal traffic | 2.00 | 3.00 | 2.00 | 2.00 |
Key observations:
- q8_0 quality loss is negligible (−0.11); it is the best quantized variant by a clear margin.
- q5_k_m and q4_k_m both show meaningful score drops. Contrary to the expected ordering, q4_k_m scores higher than q5_k_m overall (4.28 vs 4.14) and on the simple, medium, and complex tiers.
- All variants struggle on Normal incidents; this is a training data imbalance issue, not a quantization artifact.
- Complex incident scores are consistent across all variants (~3.0–3.2), tied to input truncation at 4096 tokens rather than quantization effects.
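The event-count thresholds used in the By Complexity table can be written as a small helper. This is an illustrative sketch only: the function name is hypothetical, and the "Normal traffic" tier is not covered because it is presumably assigned by label rather than by event count.

```python
def complexity_tier(n_events: int) -> str:
    """Map an incident's event count to the tier names used in the table."""
    if n_events < 500:
        return "Simple"
    if n_events < 2000:
        return "Medium"
    return "Complex"
```

For example, an incident with 1999 events falls in the Medium tier, while one with 2000 events is Complex.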
Deployment Recommendation
| Scenario | Recommended variant | Rationale |
|---|---|---|
| Raspberry Pi 5 (CPU-only) | q5_k_m | Best quality/size balance at 1.1 GB; fits RPi RAM with headroom |
| Low-VRAM GPU (≤4 GB) | q8_0 | Only 2% score loss vs NF4 baseline at half the memory |
| GPU with ≥6 GB VRAM | NF4 via serve_model.py | Reference quality: 4.70 avg score, 19.1% win rate |
| Edge / minimal storage | q4_k_m | Smallest footprint (0.9 GB); 9% score loss acceptable for triage-only use |
Risk Assessment Model Quantization
Evaluation basis: performance numbers below were measured on the finetuned risk model (67 held-out incidents, judge: qwen3.5, date: 2026-04-24). The GPU-served reference is serve_model.py with --quant 4bit (bitsandbytes NF4), not true fp16. Scores are cause score (max 30) and risk score (max 30); win rate is the fraction of incidents ranked 1st among 5 models.
Overall Performance
| Variant | Avg Position | Cause Score | Risk Score | Win Rate |
|---|---|---|---|---|
| NF4 4-bit / GPU (reference) | 1.99 | 20.21 | 13.33 | 35.8% |
| q8_0 | 2.25 | 17.66 | 14.15 | 34.3% |
| q4_k_m | 2.34 | 18.03 | 14.46 | 26.9% |
| q5_k_m | 2.57 | 16.46 | 14.45 | 23.9% |
By Complexity
Each cell lists cause score / risk score / win rate.

| Tier | NF4/GPU | q8_0 | q4_k_m | q5_k_m |
|---|---|---|---|---|
| Simple (<500 events) | 21.05 / 12.39 / 38.6% | 19.27 / 14.45 / 45.5% | 19.68 / 15.05 / 34.1% | 17.95 / 15.48 / 29.5% |
| Medium (500–1999 events) | 21.38 / 15.62 / 50.0% | 16.25 / 16.62 / 37.5% | 18.25 / 15.00 / 37.5% | 15.62 / 15.38 / 37.5% |
| Complex (≥2000 events) | 17.13 / 14.87 / 20.0% | 13.67 / 11.93 / 0.0% | 13.07 / 12.47 / 0.0% | 12.53 / 10.93 / 0.0% |
Key observations:
- NF4/GPU is the only variant competitive on complex incidents (20% win rate). All GGUF quantized variants collapse to 0% wins on complex incidents: quantization significantly degrades performance on long, evidence-heavy DAGs.
- Quantization degrades cause analysis more than risk assessment. Overall cause scores drop 2–4 points with quantization, while overall risk scores rise by roughly 1 point. Cause analysis requires precise evidence grounding from the DAG and is more sensitive to precision loss.
- q8_0 is the best quantized variant. It has the smallest gap to the NF4 reference on average position (2.25 vs 1.99) and win rate (34.3% vs 35.8%), although its cause score (17.66 vs 20.21 for NF4) is slightly below that of q4_k_m (18.03). It even outperforms the NF4 reference on risk score and on simple-incident win rate (45.5% vs 38.6%).
- q4_k_m outperforms q5_k_m (cause 18.03 vs 16.46, win rate 26.9% vs 23.9%). q5_k_m is not a reliable quality/size middle ground for this task.
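The cause-versus-risk asymmetry can be checked directly from the Overall Performance table. The sketch below computes each variant's score difference against the NF4 reference; the data is copied from the table and the helper name is illustrative.

```python
REFERENCE = {"cause": 20.21, "risk": 13.33}  # NF4 4-bit / GPU
VARIANTS = {
    "q8_0": {"cause": 17.66, "risk": 14.15},
    "q4_k_m": {"cause": 18.03, "risk": 14.46},
    "q5_k_m": {"cause": 16.46, "risk": 14.45},
}


def deltas(variant: str) -> dict:
    """Score difference vs the NF4 reference (negative means worse)."""
    return {key: round(VARIANTS[variant][key] - REFERENCE[key], 2)
            for key in REFERENCE}


if __name__ == "__main__":
    for name in VARIANTS:
        d = deltas(name)
        print(f"{name}: cause {d['cause']:+.2f}, risk {d['risk']:+.2f}")
```

Every GGUF variant has a negative cause delta and a positive risk delta overall, which is the pattern the observations describe.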
Deployment Recommendation
| Use Case | Recommended Variant |
|---|---|
| Best accuracy (research/offline) | NF4 via serve_model.py |
| Production deployment (GPU server) | q8_0: near-NF4 quality on simple/medium, ~2× memory saving |
| Edge / constrained deployment | q4_k_m: outperforms q5_k_m on this task |
For complex incident analysis specifically, only the NF4/GPU variant is competitive. If complex incidents are a priority deployment target, do not use quantized variants without further fine-tuning on longer DAGs.
Unified Model Quantization
Evaluation basis: performance numbers below were measured on the unified model v2 (47 summary incidents, judge: gpt-oss-120b; 67 risk incidents, judge: qwen3.5; date: 2026-06-09). The fp16 baseline uses BnB NF4 4-bit quantization at inference (the evaluation GPU lacked sufficient VRAM for the full model in true fp16), so the comparison is NF4 vs. GGUF rather than full-precision vs. GGUF.
Summary Task Performance (47 incidents)
| Variant | Avg Score /10 | Win Rate |
|---|---|---|
| fp16 / NF4 (reference) | 5.20 | 17.0% |
| q8_0 | 5.09 | 32.6% |
| q5_k_m | 5.00 | 14.9% |
| q4_k_m | 4.91 | 12.8% |
Risk Task Performance (67 incidents)
| Variant | Avg Cause /30 | Avg Risk /30 | Win Rate |
|---|---|---|---|
| fp16 / NF4 (reference) | 18.30 | 12.36 | 23.9% |
| q8_0 | 17.43 | 12.75 | 26.9% |
| q5_k_m | 17.30 | 13.66 | 26.9% |
| q4_k_m | 17.75 | 13.70 | 26.9% |
Key observations:
- Quantization does not hurt; it slightly helps on risk. Unlike the standalone risk model, all three unified GGUF variants exceed the fp16 / NF4 reference on risk win rate (26.9% vs 23.9%). This follows from the comparison being NF4 vs GGUF: the reference is already a 4-bit quantized inference path, so GGUF quantization is competitive with it.
- q8_0 is the standout for summary. q8_0 wins 32.6% of summary incidents vs 17.0% for the reference, a +15.6 pp gain, while its average score is slightly lower (5.09 vs 5.20). Ollama’s inference stack appears to produce more consistently top-ranked outputs than the BnB NF4 GPU server on this task.
- All three quants are equivalent on risk win rate. q4_k_m, q5_k_m, and q8_0 all win 26.9% of risk incidents. There is no meaningful quality degradation from lower quantization on the risk task.
- q4_k_m is the recommended deployment variant. At 986 MB, it matches the larger quants on risk and comes close on summary, with the smallest footprint.
Deployment Recommendation
| Use Case | Recommended Variant |
|---|---|
| Best summary quality | q8_0: 32.6% win rate, nearly identical avg score to fp16 / NF4 |
| Best risk quality | q4_k_m: best cause score among GGUF variants, smallest size |
| Balanced / general deployment | q5_k_m: good middle ground on both tasks |
| Edge / RPi5 | q4_k_m: 986 MB, competitive on both tasks |
Published model: stratosphere/qwen2.5-1.5b-slips-immune-unified