Unified Fine-Tuning: Dataset and Training Procedure

Summary: The unified model trains a single LoRA adapter to handle all three analysis tasks — incident summarization (Task S), cause analysis (Task A), and risk assessment (Task B) — from one GGUF file. It uses a purpose-built 5-script dataset pipeline that intersects the summarization and risk source datasets and augments with risk-only incidents to counteract task dilution. The production model trains on 2195 SFT records with lora_r=128 + RSLoRA.


Index


Motivation

Running two separate models on the Raspberry Pi 5 — one for summarization, one for cause+risk — requires loading and unloading GGUF files between tasks, adding memory and latency overhead. A unified adapter handles all three tasks from a single model file, eliminating model-switching at deployment time.

The trade-off is a harder training objective: one adapter must learn three distinct output formats and reasoning patterns simultaneously. This requires a higher LoRA rank than either standalone model, and a dataset that gives each task type adequate representation throughout training.


Dataset

Source datasets:

Dataset

Records

Description

summarization_dataset_v4.json

961

Summary LLM responses (GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B)

risk_dataset_v2.json

826

Cause+risk LLM responses + dag_analysis

summarization_results_merged.json

802

Summary judge scores (gpt-oss-120b)

risk_dataset_v2_results_qwen35.json

826

Cause+risk judge scores (Qwen3.5, subscored)

The unified pipeline requires every incident to have both summary and risk judge scores — a stricter requirement than either standalone pipeline. Only the 802 incidents scored on both tasks enter the pipeline. The final training set contains 2195 SFT records across all three task types.

The final SFT dataset is published on HuggingFace: stratosphere/immune-unified-sft-dataset

For how the source datasets were generated, see Summarization Dataset Report and Risk Analysis Dataset Report.


Step 1 — Merge Summary Judge Results

merge_summary_results.py merges judge results from two evaluation runs into a single file:

  • summarization_dataset_v3_results_oss.json (532 incidents, gpt-oss-120b via e-infra.cz)

  • summarization_v4_new_results_oss.json (270 new v4 incidents, gpt-oss-120b via NVIDIA NIM)

Output: summarization_results_merged.json — 802 incidents with summary judge scores.


Step 2 — Build Unified Dataset

build_unified_dataset.py joins all four source datasets on incident_id, keeping only the 802 incidents scored on both tasks:

  • Identity fields + timeline from summarization_dataset_v4.json

  • dag_analysis from risk_dataset_v2.json (full coverage for all 802)

  • Summary LLM responses (4 models) from summarization_dataset_v4.json

  • Cause+risk LLM responses (4 models) from risk_dataset_v2.json

  • Summary judge scores from summarization_results_merged.json

  • Cause+risk subscored judge scores from risk_dataset_v2_results_qwen35.json

Output: datasets/unified_dataset.json — 802 incidents.


Step 3 — Filter Dataset

filter_dataset_unified.py applies quality thresholds from both standalone pipelines simultaneously — an incident must pass all filters to be retained:

Filter

Threshold

Best summary score

≥ 4 / 10

Best cause total

≥ 14 / 30

Best risk total

≥ 10 / 30

Summary response token length

50–400 tokens

Cause response token length

50–600 tokens

Risk response token length

30–300 tokens

Risk level keyword

Critical / High / Medium / Low

Result: 750 / 802 incidents passed (93.5%). Split 90/10 (seed=42): 675 train / 75 eval.

The intersection requirement excludes incidents that passed the standalone risk quality filter but have no summary judge score. To recover those risk training examples, Step 4 appends them separately.


Step 4 — Select Best Responses and Augment

select_best_responses_unified.py selects the highest-scoring model response per task per incident and builds SFT conversation records:

  • Task S: best summary score (1–10) → [system, user(dag), assistant(summary)]

  • Task A: best cause total (subscored) → [user(cause_prompt+dag), assistant(cause_analysis)]

  • Task B: same winner as Task A → [user(risk_prompt+dag), assistant(risk_assessment)]

DAG inputs are truncated at 3500 tokens at clean line boundaries with an explicit truncation marker. Records are interleaved S→A→B per incident so the adapter sees all three task types continuously throughout training.

Intermediate output: unified_train_dataset.json — 2025 train records (675 × 3 tasks).

The three task types use distinct prompt formats:

Task

Prompt focus

Output structure

Task S (Summarization)

Human-readable incident summary

Summary + Key Events + Threat Assessment

Task A (Cause Analysis)

Structured root cause identification

Possible Causes × 3 categories + Conclusion

Task B (Risk Assessment)

Calibrated risk evaluation

Risk Level + Justification + Business Impact + Likelihood + Priority

Augmentation with risk-only incidents: The intersection requirement keeps only incidents with both summary and risk scores, which naturally shrinks the risk-task pool: the unified pipeline produces 675 cause+risk training records, compared to 1328 in the standalone risk pipeline. This halving of risk-specific signal causes task dilution — the model learns cause and risk analysis from fewer examples relative to the standalone model.

To recover this signal without modifying the unified pipeline, augment_unified_with_risk.py appends cause+risk SFT records from risk-only incidents — incidents present in risk_filtered_train.json (standalone risk train split) that were excluded from the unified pipeline because they lacked summary judge scores. Of the 151 incidents in risk_dataset_v2 but not in the unified pool, 85 passed filter_dataset_risk.py quality filters and are appended.

The augmentation uses the same prompt templates and DAG truncation (3500 tokens) as select_best_responses_unified.py. Each incident contributes two records (cause + risk), interleaved before appending.

Source

Incidents

Records

Unified filtered train (S+A+B)

675

2025

Risk-only extras (A+B only)

85

170

Augmented total

760

2195

Final output: unified_train_dataset_augmented.json2195 train records (675 summary + 760 cause + 760 risk).

cd alert_summary/
python3 merge_summary_results.py
python3 build_unified_dataset.py
python3 filter_dataset_unified.py
python3 select_best_responses_unified.py

cd ../unsloth-scripts/
python3 augment_unified_with_risk.py
# Output: unified_train_dataset_augmented.json (2195 records)

Training

Training follows the general procedure in Fine-Tuning Approach. Config: config_unified_4096_20gb_v2.yaml.

Parameter

Value

Max sequence length

4096

LoRA rank (r)

128

LoRA alpha

128

LoRA dropout

0.0

RSLoRA

enabled (required at r=128)

LoRA targets

q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Epochs

2

Learning rate

2e-5

LR scheduler

cosine

Warmup steps

20

Weight decay

0.01

Batch size (effective)

16 (2 × grad accum 8)

Optimizer

adamw_8bit

Precision

BF16

Quantization (training)

4bit (QLoRA)

Hardware

A100 80GB MiG 20GB slice (e-infra.cz cloud)

The higher LoRA rank (r=128 vs r=16 for summarization, r=64 for risk) gives the adapter more representational capacity for three competing task objectives. RSLoRA normalizes the adapter contribution at higher ranks to prevent training instability. Two epochs are used to avoid overfitting toward the most frequent task pattern (summary) in the mixed dataset.

cd unsloth-scripts/
python3 train_qwen.py --config config_unified_4096_20gb_v2.yaml
# Outputs: qwen_unified_finetuned_v2/ (adapter) + qwen_unified_finetuned_v2_merged_16bit/

Training History Notes

v1 used the base unified_train_dataset.json (2025 records, lora_r=64, 3 epochs). Evaluation showed that risk and cause analysis performance was significantly below the standalone risk model. The root cause was task dilution: at r=64, the adapter did not have enough capacity to learn all three tasks simultaneously, and the summary task — being the most frequent pattern — dominated training signal at the cost of risk quality.

v3 was an experiment that doubled all cause+risk records via 2× upsampling (unified_train_dataset_augmented_2x_risk.json). It backfired: the model overfit to the repeated pattern, producing a win rate drop of −6.4pp on summary and −13.5pp on risk compared to v2. v2, which reaches the same ~80% cause+risk proportion naturally from dataset sizes, is the better-calibrated training distribution.


Published Model

The trained model is published on HuggingFace:

For evaluation results, see Unified Fine-Tuned Model: Evaluation Results.
For GGUF conversion and Ollama deployment, see Quantization and Deployment.