Summarization Fine-Tuning: Dataset and Training Procedure
Summary: The summarization model is trained on a quality-filtered subset of the Slips summarization dataset, using the highest-scoring model response per incident as the training target. The same general LoRA+Unsloth pipeline applies; this document covers the summarization-specific dataset preparation and system prompt.
Index
Dataset
Source: summarization_dataset_v3.json
532 security incidents from Slips, each with four LLM-generated summaries (GPT-4o, GPT-4o-mini, Qwen2.5 1B, Qwen2.5 3B) and associated LLM-as-judge quality scores.
For how this dataset was generated, see Summarization Dataset Report.
Step 1 — Quality Filtering
filter_dataset.py applies two filters per incident:
Score threshold: reject if best model score < 4 (bottom quality quartile)
Token length: reject if best response summary is < 50 or > 400 tokens (too short = trivial; too long = likely template failure or prompt echo)
Surviving incidents are split 90/10 into train and eval sets (random_state=42).
cd unsloth-scripts/
python3 filter_dataset.py
# Outputs: filtered_train.json, filtered_eval.json
Step 2 — Ground Truth Selection
select_best_responses.py selects the highest-scoring model response per incident and formats each record as a two-turn SFT conversation using a merged prompt format:
user— a single message containing both the instructions (security analyst persona, task description, output format rules) and the DAG analysis textassistant— best-scoring summary (ground truth)
No system prompt is used. Instructions and DAG are concatenated into one user message so the model learns to handle both in the same context.
DAG inputs exceeding the token budget are truncated at clean line boundaries with an explicit truncation marker, so the model learns to handle partial inputs gracefully.
python3 select_best_responses.py
# Outputs: train_dataset.json, eval_dataset.json
The user message instructs the model to group identical events, assign severity labels (CRITICAL / HIGH / MEDIUM / LOW / INFO), and produce a fixed structured output format. This format is what the judge and downstream Slips components expect.
Training
Training follows the general procedure in Fine-Tuning Approach. Summarization-specific config values:
Parameter |
Value |
|---|---|
Max sequence length |
4096 |
Epochs |
3 |
Learning rate |
1e-5 |
LoRA dropout |
0.05 |
Batch size (effective) |
8 (1 × grad accum 8) |
Precision |
FP16 |
Quantization (training) |
4bit |
Hardware |
A100 80GB MiG 20GB slice (e-infra.cz cloud) |
python3 train_qwen.py
# Reads config.yaml, outputs merged 16-bit weights + GGUF (q4_k_m)
Published Model
The trained model is published on HuggingFace:
For evaluation results, see Summarization Fine-Tuned Model: Evaluation Results.
For GGUF conversion and Ollama deployment, see Quantization and Deployment.