SELID: Selective Event Labeling for Intrusion Detection Datasets.

Woohyuk Jang, Hyunmin Kim, Hyungbin Seo, Minsong Kim, Myungkeun Yoon

Sensors (Basel)

Department of Computer Science, Kookmin University, 77, Jeongneung-ro, Seongbuk-gu, Seoul 02707, Republic of Korea.

Published: July 2023

The large volume of security events collected by distributed monitoring sensors overwhelms human analysts at security operations centers and causes alert fatigue. Machine learning is expected to mitigate this problem by automatically distinguishing true alerts (attacks) from falsely reported ones. Machine learning models must first be trained on correctly labeled datasets, but the labeling process itself requires considerable human effort. In this paper, we present a new selective sampling scheme for efficient data labeling via unsupervised clustering. The scheme transforms the byte sequence of an event into a fixed-size vector through content-defined chunking and feature hashing. A clustering algorithm is then applied to the vectors, and only a few samples from each cluster are selected for manual labeling. The experimental results demonstrate that the scheme needs to select only 2% of the data for labeling without degrading the F1-score of the machine learning model. Two datasets are used: a private dataset from a real security operations center and a public dataset from the Internet for experimental reproducibility.
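The pipeline described in the abstract (content-defined chunking, feature hashing, clustering, per-cluster sampling) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the chunking hash, the vector size, or the clustering algorithm, so everything concrete here is an assumption made for illustration: the CRC32-over-a-window boundary test (production systems typically use a rolling hash), the window, mask, and dimension values, the greedy "leader" clustering with a cosine threshold, and all function names.

```python
import math
import random
import zlib
from collections import defaultdict


def cdc_chunks(data: bytes, window: int = 4, mask: int = 0x07) -> list:
    """Content-defined chunking: cut wherever the hash of the trailing
    `window` bytes has its low bits equal to zero (assumed boundary rule)."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if end - start < window:  # enforce a minimum chunk length
            continue
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):  # trailing remainder becomes the last chunk
        chunks.append(data[start:])
    return chunks


def hash_vector(chunks: list, dim: int = 64) -> list:
    """Feature hashing: each chunk adds +1 or -1 to one of `dim` buckets."""
    vec = [0.0] * dim
    for chunk in chunks:
        h = zlib.crc32(chunk)
        sign = 1.0 if (h >> 31) & 1 else -1.0  # top bit picks the sign
        vec[h % dim] += sign
    return vec


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def leader_cluster(vectors: list, threshold: float = 0.8) -> list:
    """Greedy stand-in clustering: join the first leader that is similar
    enough, otherwise become the leader of a new cluster."""
    leaders, labels = [], []
    for v in vectors:
        for cid, lead in enumerate(leaders):
            if cosine(v, lead) >= threshold:
                labels.append(cid)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels


def select_for_labeling(events: list, per_cluster: int = 1,
                        threshold: float = 0.8, seed: int = 0):
    """Return (indices chosen for manual labeling, cluster label per event)."""
    vectors = [hash_vector(cdc_chunks(e)) for e in events]
    labels = leader_cluster(vectors, threshold)
    clusters = defaultdict(list)
    for idx, cid in enumerate(labels):
        clusters[cid].append(idx)
    rng = random.Random(seed)
    selected = []
    for members in clusters.values():
        selected.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(selected), labels
```

Calling `select_for_labeling(events)` on a list of raw event payloads (`bytes`) returns the indices an analyst would label by hand. Under this sketch's assumptions, the remaining events in each cluster could then inherit the label of their cluster's sampled members. The `threshold` and `per_cluster` values control how small the selected fraction is and would need tuning on real data.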

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10347169
DOI: http://dx.doi.org/10.3390/s23136105
