A Kullback-Liebler divergence-based representation algorithm for malware detection.

Faitouri A Aboaoja, Anazida Zainal, Fuad A Ghaleb, Norah Saleh Alghamdi, Faisal Saeed, Husayn Alhuwayji

PeerJ Comput Sci

Higher Institute of Science and Technology, Qarabulli, Tripoli, Libya.

Published: September 2023

Malware poses a significant security risk, as sophisticated evasion methods make it hard to distinguish malicious from legitimate software behaviors.
Existing solutions often use the TF-IDF technique to represent malware behaviors, but this approach represents features inaccurately, leading to high false alarm rates.
The proposed Kullback-Liebler Divergence-based Term Frequency-Probability Class Distribution (KLD-based TF-PCD) algorithm weights features by how their probability distributions differ between the malware and benign classes, achieving an accuracy of 0.972.

Background: Malware (malicious software) is a major security concern in the digital realm. Conventional cyber-security solutions are challenged by sophisticated malicious behaviors. In particular, the overlap between malicious and legitimate behaviors makes it harder to characterize a given behavior as one or the other. For instance, evasive malware often mimics legitimate behaviors, and evasion techniques are used by both legitimate and malicious software.

Problem: Most existing solutions use the traditional term frequency-inverse document frequency (TF-IDF) technique, or variants of it, to represent malware behaviors. However, TF-IDF and its variants represent features inaccurately, especially features shared by both classes, because they compute each feature's weight from its distribution across all documents rather than from its distribution within each class. This weakens the discriminative meaning of those features, and using them to classify malware leads to a high false alarm rate.
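The limitation described above can be illustrated with a small sketch. The API-call names and the toy corpus below are hypothetical and are not taken from the study; the `idf` helper uses the common log(N / df) form, which is only one of several IDF variants. The point is that two terms with the same overall document frequency receive the same IDF even when one occurs only in malware and the other is shared evenly by both classes.

```python
import math

def idf(term, documents):
    """Plain inverse document frequency: log(N / df). Class labels are ignored."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n_docs / df)

# Hypothetical toy corpus: each document is the set of API calls seen in one sample.
malware_docs = [
    {"CreateRemoteThread", "ReadFile"},
    {"CreateRemoteThread", "ReadFile"},
    {"CreateRemoteThread"},
    {"CreateRemoteThread"},
]
benign_docs = [
    {"ReadFile"},
    {"ReadFile"},
    {"OpenKey"},
    {"OpenKey"},
]
all_docs = malware_docs + benign_docs

# "CreateRemoteThread" appears in 4 malware documents and 0 benign ones;
# "ReadFile" appears in 2 malware and 2 benign documents.
# Both have document frequency 4 out of 8, so IDF cannot tell them apart.
idf_malware_only = idf("CreateRemoteThread", all_docs)
idf_shared = idf("ReadFile", all_docs)
print(idf_malware_only, idf_shared)
```

Because the weight depends only on the pooled document frequency, the class-specific term and the shared term are indistinguishable to a classifier that relies on this weight alone.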

Method: This study proposes a Kullback-Liebler Divergence-based Term Frequency-Probability Class Distribution (KLD-based TF-PCD) algorithm that represents the extracted features based on the differences between the probability distributions of the terms in the malware and benign classes. Unlike existing solutions, the proposed algorithm increases the weights of important features by using the Kullback-Liebler divergence to measure how much their probability distributions differ between the two classes.
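The abstract does not state the exact TF-PCD formula, so the following is only a sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes each term's presence in a class is modeled as a Bernoulli variable, estimates that probability with add-one smoothing, and uses the KL divergence between the malware and benign estimates as a class-aware score. The function names, the smoothing constant, and the toy corpus are all hypothetical.

```python
import math

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats; requires 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def class_probability(term, docs, alpha=1.0):
    """Smoothed probability that a document of this class contains the term.

    Add-alpha smoothing is an assumption made here to keep the estimate
    strictly inside (0, 1), so the divergence stays finite.
    """
    df = sum(1 for doc in docs if term in doc)
    return (df + alpha) / (len(docs) + 2 * alpha)

def kld_score(term, malware_docs, benign_docs):
    """Divergence between the term's malware and benign presence distributions."""
    p = class_probability(term, malware_docs)
    q = class_probability(term, benign_docs)
    return bernoulli_kl(p, q)

# Hypothetical toy corpus: each document is the set of API calls seen in one sample.
malware_docs = [
    {"CreateRemoteThread", "ReadFile"},
    {"CreateRemoteThread", "ReadFile"},
    {"CreateRemoteThread"},
    {"CreateRemoteThread"},
]
benign_docs = [
    {"ReadFile"},
    {"ReadFile"},
    {"OpenKey"},
    {"OpenKey"},
]

score_malware_only = kld_score("CreateRemoteThread", malware_docs, benign_docs)
score_shared = kld_score("ReadFile", malware_docs, benign_docs)
print(score_malware_only, score_shared)
```

In this toy setting the term that occurs only in malware gets a positive score, while the term shared evenly by both classes scores zero. A score of this kind could then scale a term-frequency value, which is one way to read "increases the weights of important features"; how the study combines the two is not specified in the abstract.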

Results: The experimental results show that the proposed KLD-based TF-PCD algorithm achieved an accuracy of 0.972, a false positive rate of 0.037, and an F-measure of 0.978. These results compare favorably with those of related studies. Thus, the proposed KLD-based TF-PCD algorithm contributes to improving the security of cyberspace.
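For readers unfamiliar with the three reported metrics, the sketch below shows their standard definitions in terms of confusion-matrix counts, treating malware as the positive class. The counts used are purely illustrative and are not the study's data; the abstract does not report its confusion matrix.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def false_positive_rate(fp, tn):
    """Fraction of benign samples wrongly flagged as malware."""
    return fp / (fp + tn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * tp / (2 * tp + fp + fn)

# Illustrative counts only (not from the paper).
tp, fp, tn, fn = 8, 1, 9, 2

acc = accuracy(tp, fp, tn, fn)
fpr = false_positive_rate(fp, tn)
f1 = f_measure(tp, fp, fn)
print(acc, fpr, f1)
```

A low false positive rate matters here because the Problem statement identifies false alarms on benign software as the main cost of inaccurate feature weights.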

Conclusion: The proposed algorithm adds new meaningful characteristics that enrich what the classifiers learn, increasing their ability to classify malicious behaviors accurately.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557483
DOI: http://dx.doi.org/10.7717/peerj-cs.1492
