Dataset Generation Pipeline for Slips Alert Analysis
Note: This guide covers the Summarization workflow (summary + behavior analysis). For Cause & Risk analysis workflow, see README_dataset_risk_workflow.md.
1. Overview
This pipeline transforms raw Slips security alerts into structured multi-model analysis datasets. The workflow consists of four stages: (1) sampling incidents from raw logs into JSONL format, (2) generating DAG-based structural analysis, (3) producing LLM-enhanced summaries with behavior analysis from multiple models, and (4) correlating all analyses into a unified JSON dataset. The output provides comprehensive incident analysis from different analytical perspectives, enabling comparative evaluation of model performance on security analysis tasks.
2. Pipeline Components
2.1 Python Scripts
sample_dataset.py
Samples INCIDENT alerts and their associated EVENT alerts from Slips alerts.json files. Preserves the complete event context for each incident by following CorrelID references. Supports filtering by category (normal/malware), severity (low/medium/high), and reproducible sampling via random seeds. Outputs JSONL format compatible with downstream analysis tools.
alert_dag_parser.py
Parses JSONL incident files and generates Directed Acyclic Graph (DAG) analysis showing the chronological structure of security events. Extracts incident metadata (source IPs, timewindows, threat levels, timelines) and produces comprehensive event summaries. Outputs structured JSON with incident-level analysis.
alert_dag_parser_llm.py
Generates LLM-enhanced analysis by querying language models with structured incident data. Implements two key optimizations: (1) event grouping by pattern normalization (replaces IPs, ports, numbers with placeholders to identify identical patterns), reducing token counts by 96-99% for large incidents, and (2) dual-prompt analysis generating both severity-assessed summaries and structured behavior explanations. Supports multiple LLM backends via OpenAI-compatible APIs. Outputs JSON with both summary and behavior_analysis fields.
correlate_incidents.py
Merges multiple JSON analysis files by matching incident_id fields. Combines DAG analysis with multiple LLM analyses (from different models) into a single unified dataset. Automatically detects analysis types from filenames (e.g., .dag.json, .llm.gpt-4o-mini.json, .llm.qwen2.5.json) and creates appropriately named fields in the output. Produces consolidated JSON suitable for model comparison and evaluation.
merge_datasets.py
Merges multiple correlated dataset JSON files into a single unified dataset. Removes duplicates based on incident_id while preserving all analysis fields from each incident. Useful for extending existing datasets by combining separately generated correlated datasets. Supports multiple input files, automatic deduplication, and optional compact output format.
2.2 Shell Wrappers
sample_dataset.sh
Wrapper for sample_dataset.py providing simplified command-line interface. Handles argument parsing, validation, and automatic file naming (appends .jsonl extension). Supports filtering options, random seed configuration, and optional statistics generation.
generate_dag_analysis.sh
Wrapper for alert_dag_parser.py with automatic output filename generation based on input JSONL file. Converts input.jsonl to input.dag.json by default. Provides colored status logging and error handling.
generate_llm_analysis.sh
Wrapper for alert_dag_parser_llm.py supporting multiple model configurations. Auto-generates output filenames incorporating model names (e.g., input.llm.gpt-4o-mini.json, input.llm.qwen2.5.json). Handles model endpoint configuration for both cloud APIs (OpenAI) and local servers (Ollama). Passes through optimization flags for event grouping and behavior analysis.
3. Dataset Generation Workflow
3.1 Prerequisites
Input Requirements:
Raw Slips logs:
alerts.jsonfiles from Slips network security analysisDirectory structure:
sample_logs/datasets/{Normal,Malware}/...
Model Configuration:
GPT-4o-mini: OpenAI API key in environment variable
OPENAI_API_KEYQwen2.5:3b: Ollama server running at
http://10.147.20.102:11434/v1(adjust as needed)Qwen2.5:1.5b: Ollama server with model installed
Software Dependencies:
Python 3.6+ with standard library only (no external packages required)
bash,jqfor shell scriptsOpenAI Python package for LLM analysis:
pip install openai
3.2 Step-by-Step Process
Step 1: Sample Incidents from Raw Logs
Generate a JSONL file containing sampled incidents with all associated events:
./sample_dataset.sh 20 my_dataset --category malware --seed 42 --include-stats
This creates:
my_dataset.jsonl- Sampled incidents and events in JSONL formatmy_dataset.stats.json- Statistics about the sample (optional)
Step 2: Generate DAG Analysis
Parse the JSONL file and generate structural DAG analysis:
./generate_dag_analysis.sh my_dataset.jsonl
Output: my_dataset.dag.json - JSON array of incidents with DAG-based analysis
Step 3: Generate LLM Analysis (GPT-4o-mini)
Query GPT-4o-mini for enhanced analysis with event grouping and behavior analysis:
./generate_llm_analysis.sh my_dataset.jsonl \
--model gpt-4o-mini \
--base-url https://api.openai.com/v1 \
--group-events \
--behavior-analysis
Output: my_dataset.llm.gpt-4o-mini.json - JSON array with summary and behavior_analysis fields
Step 4: Generate LLM Analysis (Qwen2.5:3b)
Query Qwen2.5:3b model via Ollama with same optimization flags:
./generate_llm_analysis.sh my_dataset.jsonl \
--model qwen2.5:3b \
--base-url http://10.147.20.102:11434/v1 \
--group-events \
--behavior-analysis
Output: my_dataset.llm.qwen2.5.json - JSON array with model-specific analysis
Step 5: Generate LLM Analysis (Qwen2.5:1.5b)
Query Qwen2.5:1.5b model for comparison with smaller model:
./generate_llm_analysis.sh my_dataset.jsonl \
--model qwen2.5:1.5b \
--base-url http://10.147.20.102:11434/v1 \
--group-events \
--behavior-analysis
Output: my_dataset.llm.qwen2.5.1.5b.json - JSON array from smaller model
Step 6: Correlate All Analyses
Merge all analysis files into a unified dataset by incident_id, including category information from the original JSONL:
python3 correlate_incidents.py my_dataset.*.json --jsonl my_dataset.jsonl -o final_dataset.json
Output: final_dataset.json - Consolidated dataset with all analyses per incident
Note: The --jsonl parameter is used to extract the category field (Malware/Normal) from the original sampled data, ensuring proper ground truth labeling in the final dataset.
3.3 Complete Workflow Example
# Full pipeline execution
./sample_dataset.sh 20 my_dataset --category malware --seed 42
./generate_dag_analysis.sh my_dataset.jsonl
./generate_llm_analysis.sh my_dataset.jsonl --model gpt-4o-mini --group-events --behavior-analysis
./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:3b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis
./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:1.5b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis
python3 correlate_incidents.py my_dataset.*.json --jsonl my_dataset.jsonl -o final_dataset.json
Files generated:
my_dataset.jsonl- Sampled incidents (JSONL)my_dataset.dag.json- DAG analysismy_dataset.llm.gpt-4o-mini.json- GPT-4o-mini analysismy_dataset.llm.qwen2.5.json- Qwen2.5:3b analysismy_dataset.llm.qwen2.5.1.5b.json- Qwen2.5:1.5b analysisfinal_dataset.json- Unified correlated dataset
3.4 Extending Existing Datasets
To add more incidents to an existing correlated dataset without regenerating from scratch:
Step 1: Sample Additional Incidents
Use a different random seed to ensure new samples don’t duplicate existing ones:
./sample_dataset.sh 20 extension --category malware --seed 99
Step 2: Generate All Analyses for Extension
Run the full analysis pipeline on the new samples:
./generate_dag_analysis.sh extension.jsonl
./generate_llm_analysis.sh extension.jsonl --model gpt-4o-mini --group-events --behavior-analysis
./generate_llm_analysis.sh extension.jsonl --model qwen2.5:3b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis
./generate_llm_analysis.sh extension.jsonl --model qwen2.5:1.5b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis
Step 3: Correlate Extension Data
python3 correlate_incidents.py extension.*.json --jsonl extension.jsonl -o extension_dataset.json
Step 4: Merge with Existing Dataset
Combine the original and extension datasets, automatically removing any duplicates:
python3 merge_datasets.py final_dataset.json extension_dataset.json -o final_dataset_v2.json
Alternative: Merge Multiple Extensions
If you have multiple extension datasets:
python3 merge_datasets.py final_dataset.json extension1_dataset.json extension2_dataset.json -o combined_dataset.json
Note on Deduplication: The merge_datasets.py script automatically detects and removes duplicate incidents based on incident_id. If the same incident appears in multiple input files, only the first occurrence is kept.
Verification: After merging, verify the operation completed successfully:
python3 verify_merge.py --verbose
This validates file integrity, count accuracy, deduplication correctness, completeness, and data integrity. Use --inputs and --output flags to verify custom merge operations.
4. Output Dataset Structure
The final correlated dataset is a JSON array where each object represents one incident with all analyses:
[
{
"incident_id": "bd47e95b-a211-41b1-9644-40d6a2e77a07",
"category": "Malware",
"source_ip": "10.0.2.15",
"timewindow": "12",
"timeline": "2024-04-05 16:53:07 to 16:53:50",
"threat_level": 15.36,
"event_count": 4604,
"dag_analysis": "Comprehensive analysis:\n- Source IP: 10.0.2.15\n- Timewindow: 12...",
"llm_gpt4o_mini_analysis": {
"summary": "Incident bd47e95b-a211-41b1-9644-40d6a2e77a07 involves...",
"behavior_analysis": "**Source:** 10.0.2.15\n**Activity:** Port scanning...\n**Detected Flows:**\n• 10.0.2.15 → 185.29.135.234:443/TCP (HTTPS)\n..."
},
"llm_qwen2_5_3b_analysis": {
"summary": "This incident represents a sophisticated attack...",
"behavior_analysis": "**Source:** 10.0.2.15\n**Activity:** Multi-stage attack...\n..."
},
"llm_qwen2_5_1_5b_analysis": {
"summary": "The incident shows malicious behavior with...",
"behavior_analysis": "**Source:** 10.0.2.15\n**Activity:** Network reconnaissance...\n..."
}
}
]
Key Fields:
incident_id: UUID identifying the unique security incidentcategory: Classification of the capture origin (“Malware” or “Normal”)source_ip: Primary source IP address for the incidenttimewindow: Slips timewindow number for temporal contexttimeline: Human-readable time range (start to end)threat_level: Accumulated threat score from Slipsevent_count: Number of security events in this incidentdag_analysis: Structural DAG-based analysis (string)llm_<model>_analysis: Object withsummaryandbehavior_analysisstrings
Analysis Field Contents:
DAG Analysis: Chronological event summary with threat levels, detection types, and temporal patterns.
LLM Summary: Severity-assessed event descriptions prioritizing high-confidence and high-threat-level evidence. Groups similar events by pattern to reduce verbosity.
LLM Behavior Analysis: Structured technical explanation formatted as:
**Source:** <IP>
**Activity:** <brief activity type>
**Detected Flows:**
• <src:port/proto> → <dest> (service)
• [additional flows]
**Summary:** [1-2 sentence technical summary]
5. Performance Considerations
Event Grouping (–group-events)
Purpose: Reduce token count for large incidents to enable processing on low-specification devices.
Mechanism: Normalizes event descriptions by replacing variable components (IP addresses → <IP>, ports → <PORT>, numbers → <NUM>) to identify identical patterns. Groups events with matching normalized patterns while preserving threat level and timing information.
Impact:
Small incident (103 events): 3,522 tokens → 976 tokens (72% reduction)
Large incident (4,604 events): ~50,000 tokens → 1,897 tokens (96% reduction)
Trade-off: Slight reduction in granularity (individual IPs/ports shown as samples) for massive token savings. Recommended for all production use.
Behavior Analysis (–behavior-analysis)
Purpose: Generate structured technical explanations of network behavior alongside severity-assessed summaries.
Mechanism: Issues two separate LLM queries per incident:
Summary prompt: Assesses severity and filters high-priority evidence
Behavior prompt: Produces structured flow analysis and technical summary
Impact:
Adds ~1,500 tokens per incident (behavior prompt)
Doubles API calls and processing time per incident
Provides richer analytical context for security analysts
Trade-off: Enhanced analysis quality and readability at cost of increased processing time and API usage. Recommended for datasets under 100 incidents or when quality is prioritized over speed.
Combined Usage
Using both flags together (--group-events --behavior-analysis) achieves optimal balance:
Event grouping minimizes prompt size (token reduction)
Behavior analysis maximizes output quality (richer insights)
Large incidents become processable while maintaining analytical depth
Example token counts with both flags:
4,604 events: 1,897 tokens (summary) + 1,527 tokens (behavior) = 3,424 total tokens
Processing time: ~10-15 seconds per incident on low-spec devices (Ollama on Raspberry Pi)