Unified Fine-Tuning: Dataset and Training Procedure

Summary: The unified model trains a single LoRA adapter to handle all three analysis tasks — incident summarization (Task S), cause analysis (Task A), and risk assessment (Task B) — from one GGUF file. It uses a purpose-built 5-script dataset pipeline that intersects the summarization and risk source datasets and augments with risk-only incidents to counteract task dilution. The production model trains on 2195 SFT records with lora_r=128 + RSLoRA.

Index

Motivation
Dataset
Step 1 — Merge Summary Judge Results
Step 2 — Build Unified Dataset
Step 3 — Filter Dataset
Step 4 — Select Best Responses and Augment
Training
Published Model

Motivation

Running two separate models on the Raspberry Pi 5 — one for summarization, one for cause+risk — requires loading and unloading GGUF files between tasks, adding memory and latency overhead. A unified adapter handles all three tasks from a single model file, eliminating model-switching at deployment time.

The trade-off is a harder training objective: one adapter must learn three distinct output formats and reasoning patterns simultaneously. This requires a higher LoRA rank than either standalone model, and a dataset that gives each task type adequate representation throughout training.

Dataset

Source datasets:

Dataset	Records	Description
`summarization_dataset_v4.json`	961	Summary LLM responses (GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B)
`risk_dataset_v2.json`	826	Cause+risk LLM responses + dag_analysis
`summarization_results_merged.json`	802	Summary judge scores (gpt-oss-120b)
`risk_dataset_v2_results_qwen35.json`	826	Cause+risk judge scores (Qwen3.5, subscored)

The unified pipeline requires every incident to have both summary and risk judge scores — a stricter requirement than either standalone pipeline. Only the 802 incidents scored on both tasks enter the pipeline. The final training set contains 2195 SFT records across all three task types.

The final SFT dataset is published on HuggingFace: stratosphere/immune-unified-sft-dataset

For how the source datasets were generated, see Summarization Dataset Report and Risk Analysis Dataset Report.

Step 1 — Merge Summary Judge Results

merge_summary_results.py merges judge results from two evaluation runs into a single file:

summarization_dataset_v3_results_oss.json (532 incidents, gpt-oss-120b via e-infra.cz)
summarization_v4_new_results_oss.json (270 new v4 incidents, gpt-oss-120b via NVIDIA NIM)

Output: summarization_results_merged.json — 802 incidents with summary judge scores.

Step 2 — Build Unified Dataset

build_unified_dataset.py joins all four source datasets on incident_id, keeping only the 802 incidents scored on both tasks:

Identity fields + timeline from summarization_dataset_v4.json
dag_analysis from risk_dataset_v2.json (full coverage for all 802)
Summary LLM responses (4 models) from summarization_dataset_v4.json
Cause+risk LLM responses (4 models) from risk_dataset_v2.json
Summary judge scores from summarization_results_merged.json
Cause+risk subscored judge scores from risk_dataset_v2_results_qwen35.json

Output: datasets/unified_dataset.json — 802 incidents.

Step 3 — Filter Dataset

filter_dataset_unified.py applies quality thresholds from both standalone pipelines simultaneously — an incident must pass all filters to be retained:

Filter	Threshold
Best summary score	≥ 4 / 10
Best cause total	≥ 14 / 30
Best risk total	≥ 10 / 30
Summary response token length	50–400 tokens
Cause response token length	50–600 tokens
Risk response token length	30–300 tokens
Risk level keyword	Critical / High / Medium / Low

Result: 750 / 802 incidents passed (93.5%). Split 90/10 (seed=42): 675 train / 75 eval.

The intersection requirement excludes incidents that passed the standalone risk quality filter but have no summary judge score. To recover those risk training examples, Step 4 appends them separately.

Step 4 — Select Best Responses and Augment

select_best_responses_unified.py selects the highest-scoring model response per task per incident and builds SFT conversation records:

Task S: best summary score (1–10) → [system, user(dag), assistant(summary)]
Task A: best cause total (subscored) → [user(cause_prompt+dag), assistant(cause_analysis)]
Task B: same winner as Task A → [user(risk_prompt+dag), assistant(risk_assessment)]

DAG inputs are truncated at 3500 tokens at clean line boundaries with an explicit truncation marker. Records are interleaved S→A→B per incident so the adapter sees all three task types continuously throughout training.

Intermediate output: unified_train_dataset.json — 2025 train records (675 × 3 tasks).

The three task types use distinct prompt formats:

Task	Prompt focus	Output structure
Task S (Summarization)	Human-readable incident summary	Summary + Key Events + Threat Assessment
Task A (Cause Analysis)	Structured root cause identification	Possible Causes × 3 categories + Conclusion
Task B (Risk Assessment)	Calibrated risk evaluation	Risk Level + Justification + Business Impact + Likelihood + Priority

Augmentation with risk-only incidents: The intersection requirement keeps only incidents with both summary and risk scores, which naturally shrinks the risk-task pool: the unified pipeline produces 675 cause+risk training records, compared to 1328 in the standalone risk pipeline. This halving of risk-specific signal causes task dilution — the model learns cause and risk analysis from fewer examples relative to the standalone model.

To recover this signal without modifying the unified pipeline, augment_unified_with_risk.py appends cause+risk SFT records from risk-only incidents — incidents present in risk_filtered_train.json (standalone risk train split) that were excluded from the unified pipeline because they lacked summary judge scores. Of the 151 incidents in risk_dataset_v2 but not in the unified pool, 85 passed filter_dataset_risk.py quality filters and are appended.

The augmentation uses the same prompt templates and DAG truncation (3500 tokens) as select_best_responses_unified.py. Each incident contributes two records (cause + risk), interleaved before appending.

Source	Incidents	Records
Unified filtered train (S+A+B)	675	2025
Risk-only extras (A+B only)	85	170
Augmented total	760	2195

Final output: unified_train_dataset_augmented.json — 2195 train records (675 summary + 760 cause + 760 risk).

cd alert_summary/
python3 merge_summary_results.py
python3 build_unified_dataset.py
python3 filter_dataset_unified.py
python3 select_best_responses_unified.py

cd ../unsloth-scripts/
python3 augment_unified_with_risk.py
# Output: unified_train_dataset_augmented.json (2195 records)

Training

Training follows the general procedure in Fine-Tuning Approach. Config: config_unified_4096_20gb_v2.yaml.

Parameter	Value
Max sequence length	4096
LoRA rank (`r`)	128
LoRA alpha	128
LoRA dropout	0.0
RSLoRA	enabled (required at r=128)
LoRA targets	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Epochs	2
Learning rate	2e-5
LR scheduler	cosine
Warmup steps	20
Weight decay	0.01
Batch size (effective)	16 (2 × grad accum 8)
Optimizer	adamw_8bit
Precision	BF16
Quantization (training)	4bit (QLoRA)
Hardware	A100 80GB MiG 20GB slice (e-infra.cz cloud)

The higher LoRA rank (r=128 vs r=16 for summarization, r=64 for risk) gives the adapter more representational capacity for three competing task objectives. RSLoRA normalizes the adapter contribution at higher ranks to prevent training instability. Two epochs are used to avoid overfitting toward the most frequent task pattern (summary) in the mixed dataset.

cd unsloth-scripts/
python3 train_qwen.py --config config_unified_4096_20gb_v2.yaml
# Outputs: qwen_unified_finetuned_v2/ (adapter) + qwen_unified_finetuned_v2_merged_16bit/

Training History Notes

v1 used the base unified_train_dataset.json (2025 records, lora_r=64, 3 epochs). Evaluation showed that risk and cause analysis performance was significantly below the standalone risk model. The root cause was task dilution: at r=64, the adapter did not have enough capacity to learn all three tasks simultaneously, and the summary task — being the most frequent pattern — dominated training signal at the cost of risk quality.

v3 was an experiment that doubled all cause+risk records via 2× upsampling (unified_train_dataset_augmented_2x_risk.json). It backfired: the model overfit to the repeated pattern, producing a win rate drop of −6.4pp on summary and −13.5pp on risk compared to v2. v2, which reaches the same ~80% cause+risk proportion naturally from dataset sizes, is the better-calibrated training distribution.

Published Model

The trained model is published on HuggingFace:

stratosphere/qwen2.5-1.5b-slips-immune-unified

For evaluation results, see Unified Fine-Tuned Model: Evaluation Results.
For GGUF conversion and Ollama deployment, see Quantization and Deployment.