Fine-Tuning Evaluation Methodology

Summary: Fine-tuned models are evaluated using the same LLM-as-judge framework used for baseline comparison, extended to include the finetuned model as a fifth competitor. The methodology — blind comparative ranking, three metrics, two breakdown dimensions — is identical across tasks.


Index


Overview

After fine-tuning, the model is evaluated against a fixed set of competitors using LLM-as-judge methodology. A strong external judge model (default: gpt-oss-120b) acts as an experienced security analyst and scores all model outputs for each incident. Model labels are randomized per incident to prevent position bias.

This extends the LLM Evaluation Guide used for baseline model comparison — the judge criteria, scoring format, and analysis scripts are identical; the only change is adding the finetuned model as a fifth competitor.

Scoring differs by task:

  • Summarization: single 1–10 quality score per incident

  • Risk assessment: separate cause analysis score (max 30) and risk assessment score (max 30), summed for ranking (max 60). See LLM-as-Judge Rubric for the full scoring criteria.


Evaluation Pipeline

Four sequential steps, applied identically for any task:

Step 1 — Inference on the eval split

The fine-tuned model is served locally via serve_model.py (OpenAI-compatible API on localhost:8000) and queried on each incident in the held-out eval split, using the same system prompt and input format as training.

cd unsloth-scripts/
python3 serve_model.py /path/to/finetuned_model --device cuda --quant 4bit
python3 run_finetuned_inference.py \
  --input filtered_eval.json \
  --output finetuned_eval_results.json \
  --url http://localhost:8000/v1

Step 2 — LLM-as-judge scoring

The judge receives all five model responses simultaneously and ranks them. The task-specific judge script is used (e.g. evaluate_summaries.py for summarization, evaluate_risk.py for cause & risk assessment).

cd alert_summary/
python3 evaluate_summaries.py \
  --input ../unsloth-scripts/finetuned_eval_results.json \
  --output results/finetuned_results.json \
  --judge gpt-oss-120b \
  --base-url http://YOUR_LOCAL_ENDPOINT/v1

Step 3 — Statistical analysis

python3 analyze_results.py \
  --results results/finetuned_results.json \
  --summary results/finetuned_report.md \
  --csv results/finetuned_data.csv

Step 4 — Improvement report

Reads the report and CSV, computes deltas vs. both baselines, identifies the weakest complexity tier and category, and outputs concrete suggestions for data composition and training config.


Baselines and Success Criteria

Two baselines are tracked for any finetuned model:

Baseline

Role

Success criterion

Qwen2.5 1B (untuned)

Lower bound — same family, smaller

Finetuned must beat this

Qwen2.5 3B (untuned)

Stretch goal — larger, slower on RPi

Finetuned should match or exceed

The stretch goal tests whether task-specific fine-tuning compensates for parameter count, which determines whether the smaller model can replace the larger one on the Raspberry Pi 5.


Metrics

Three metrics are computed per model:

Metric

Description

Avg Score

Mean quality score from the judge (1–10 scale)

Win Rate

Fraction of incidents where the model ranks 1st

Avg Position

Mean rank position (lower = better, 1–5 scale)


Breakdown Dimensions

Results are broken down along two dimensions for every task:

By incident category:

  • Malware incidents

  • Normal traffic incidents

By incident complexity (based on event count):

  • Simple: < 500 events

  • Medium: 500–1999 events

  • Complex: ≥ 2000 events

The complexity dimension is particularly relevant for RPi deployment: complex incidents stress the input token budget and expose truncation behavior. A model that performs well on simple/medium but fails on complex incidents has a known, addressable limitation.


Running the Evaluation

The full pipeline (dataset prep → training → inference → eval → report) can be run end-to-end:

/finetune-eval

# Or jump to evaluation if the model is already trained:
/finetune-eval --skip-to 5

Task-Specific Results

Task

Results Document

Incident Summarization

Summarization Fine-Tuned Model: Evaluation Results

Risk Assessment & Cause Analysis

Risk Fine-Tuned Model: Evaluation Results

Unified (Summary + Cause + Risk)

Unified Fine-Tuned Model: Evaluation Results