Automatically building datasets of labeled IP traffic traces: A self-training approach

Gargiulo, Francesco; Mazzariello, Claudio; Sansone, Carlo

doi:10.1016/j.asoc.2012.02.012

Many approaches have been proposed to tackle computer network security. Among them, several systems exploit Machine Learning and Pattern Recognition techniques by treating malicious behavior detection as a classification problem. Both supervised and unsupervised algorithms have been used in this context, each with its own benefits and shortcomings. Supervised techniques require a representative training set that reliably indicates, by means of suitably labeled samples, what a human expert wants the system to learn and recognize. In real environments, collecting a representative dataset of correctly labeled traffic traces is significantly difficult. In adversarial environments the task is made even harder by malicious attackers who try to conceal the evidence of their actions. To overcome this problem, this paper presents a self-training system that builds a dataset of labeled network traffic from raw tcpdump traces, with no prior knowledge of the data. Results on both emulated and real traffic traces show that intrusion detection systems trained on such a dataset perform as well as the same systems trained on correctly hand-labeled data.

Automatically building datasets of labeled IP traffic traces: A self-training approach / Gargiulo, Francesco; Mazzariello, Claudio; Sansone, Carlo. - In: APPLIED SOFT COMPUTING. - ISSN 1568-4946. - 12:6(2012), pp. 1640-1649. [10.1016/j.asoc.2012.02.012]