Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification

Peer-to-Peer (P2P) detection by Machine Learning (ML) classification is affected by the quality and recency of the training dataset. Hence, classifying P2P traffic on-line requires the removal of these limitations. In this research work, a novel practical training dataset generation and automatic retrai...

Bibliographic Details
Main Author: Zarei, Roozbeh
Format: Thesis
Language: English
Published: 2012
Subjects:
Online Access: http://eprints.utm.my/id/eprint/33398/5/RoozehZareiMFKE2012.pdf
http://eprints.utm.my/id/eprint/33398/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:72709?site_name=Restricted Repository
id my.utm.33398
record_format eprints
spelling my.utm.33398 2018-05-27T08:07:40Z http://eprints.utm.my/id/eprint/33398/ Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification Zarei, Roozbeh TK Electrical engineering. Electronics Nuclear engineering Peer-to-Peer (P2P) detection by Machine Learning (ML) classification is affected by the quality and recency of the training dataset. Hence, classifying P2P traffic on-line requires the removal of these limitations. In this research work, a novel practical training dataset generation and an automatic retraining mechanism for on-line P2P traffic classification are proposed. These two proposals are integrated in a system that removes the limitations of ML classifiers and makes them suitable for on-line P2P traffic classification. For the first part, a novel two-stage training dataset generation is proposed that combines a 3-class heuristic and a 3-class statistical classification to generate the training dataset accurately. In the heuristic stage, traffic is classified as P2P, nonP2P or unknown. In the statistical stage, a dual-Decision Tree (DT) is built from the dataset generated in the heuristic stage to classify unknown traffic into three classes in order to reduce the amount of traffic left classified as unknown. The final training dataset is generated from all flows classified in these two stages. In the second part of the system, an automatic retraining mechanism is proposed to meet the need to retrain the ML classifier by detecting changes in traffic behavior and updating the on-line ML classifier with a recent, accurate training dataset. This mechanism evaluates the accuracy of the on-line ML classifier against flows labeled by the two-stage training dataset generation. The on-line ML classifier is retrained if its accuracy falls below a predefined threshold.
The proposed system has been evaluated on traces captured from the Universiti Teknologi Malaysia (UTM) campus network between October and November 2011. The overall results show that the two-stage training dataset generation can generate an accurate training dataset by classifying more than 95% of total flows with high accuracy (98.59%) and a low false positive rate (0.91%). The on-line ML classifier, which is built with the J48 algorithm and the training dataset generated by the two-stage training dataset generation, classifies traffic with high accuracy (99%) using 25 features extracted from the first 5 packets of each flow. The results also show that the automatic retraining mechanism allows the on-line ML classifier to maintain its accuracy above a set threshold over time. 2012-01 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/33398/5/RoozehZareiMFKE2012.pdf Zarei, Roozbeh (2012) Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification. Masters thesis, Universiti Teknologi Malaysia, Faculty of Electrical Engineering. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:72709?site_name=Restricted Repository
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
language English
topic TK Electrical engineering. Electronics Nuclear engineering
spellingShingle TK Electrical engineering. Electronics Nuclear engineering
Zarei, Roozbeh
Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
description Peer-to-Peer (P2P) detection by Machine Learning (ML) classification is affected by the quality and recency of the training dataset. Hence, classifying P2P traffic on-line requires the removal of these limitations. In this research work, a novel practical training dataset generation and an automatic retraining mechanism for on-line P2P traffic classification are proposed. These two proposals are integrated in a system that removes the limitations of ML classifiers and makes them suitable for on-line P2P traffic classification. For the first part, a novel two-stage training dataset generation is proposed that combines a 3-class heuristic and a 3-class statistical classification to generate the training dataset accurately. In the heuristic stage, traffic is classified as P2P, nonP2P or unknown. In the statistical stage, a dual-Decision Tree (DT) is built from the dataset generated in the heuristic stage to classify unknown traffic into three classes in order to reduce the amount of traffic left classified as unknown. The final training dataset is generated from all flows classified in these two stages. In the second part of the system, an automatic retraining mechanism is proposed to meet the need to retrain the ML classifier by detecting changes in traffic behavior and updating the on-line ML classifier with a recent, accurate training dataset. This mechanism evaluates the accuracy of the on-line ML classifier against flows labeled by the two-stage training dataset generation. The on-line ML classifier is retrained if its accuracy falls below a predefined threshold. The proposed system has been evaluated on traces captured from the Universiti Teknologi Malaysia (UTM) campus network between October and November 2011. The overall results show that the two-stage training dataset generation can generate an accurate training dataset by classifying more than 95% of total flows with high accuracy (98.59%) and a low false positive rate (0.91%).
The on-line ML classifier, which is built with the J48 algorithm and the training dataset generated by the two-stage training dataset generation, classifies traffic with high accuracy (99%) using 25 features extracted from the first 5 packets of each flow. The results also show that the automatic retraining mechanism allows the on-line ML classifier to maintain its accuracy above a set threshold over time.
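
The retraining rule in the description (score the on-line classifier against flows labeled by the two-stage generator, retrain when accuracy drops below a threshold) can be illustrated with a minimal sketch. This is not the thesis implementation: the names `accuracy`, `maybe_retrain` and `train_majority`, the feature-vector type, and the default threshold of 0.95 are assumptions for illustration only, since the record does not state the threshold value. The toy majority-class trainer merely stands in for the J48 decision tree.

```python
from typing import Callable, List, Sequence, Tuple

Flow = Sequence[float]            # hypothetical per-flow feature vector
Labeled = Tuple[Flow, str]        # (features, label from the two-stage generator)
Classifier = Callable[[Flow], str]


def accuracy(predict: Classifier, labeled: List[Labeled]) -> float:
    """Fraction of flows where the classifier agrees with the reference label."""
    if not labeled:
        return 1.0  # nothing to evaluate, so no evidence of degraded accuracy
    hits = sum(1 for features, label in labeled if predict(features) == label)
    return hits / len(labeled)


def maybe_retrain(
    predict: Classifier,
    train: Callable[[List[Labeled]], Classifier],
    labeled: List[Labeled],
    threshold: float = 0.95,      # assumed value; the record only says "predefined"
) -> Tuple[Classifier, bool]:
    """Return (classifier, retrained): retrain only if accuracy is below threshold."""
    if accuracy(predict, labeled) < threshold:
        return train(labeled), True
    return predict, False


def train_majority(labeled: List[Labeled]) -> Classifier:
    """Toy stand-in for a real learner such as J48: always predict the majority label."""
    labels = [label for _, label in labeled]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority


# Usage: a stale classifier that labels everything nonP2P gets replaced.
batch = [((1.0,), "P2P"), ((2.0,), "P2P"), ((3.0,), "nonP2P")]
stale = lambda features: "nonP2P"
current, retrained = maybe_retrain(stale, train_majority, batch, threshold=0.5)
```

In a real deployment the `train` callable would wrap an actual decision-tree learner, and `labeled` would be a recent window of flows labeled by the two-stage generation.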
format Thesis
author Zarei, Roozbeh
author_facet Zarei, Roozbeh
author_sort Zarei, Roozbeh
title Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
title_short Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
title_full Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
title_fullStr Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
title_full_unstemmed Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
title_sort practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
publishDate 2012
url http://eprints.utm.my/id/eprint/33398/5/RoozehZareiMFKE2012.pdf
http://eprints.utm.my/id/eprint/33398/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:72709?site_name=Restricted Repository
_version_ 1643649318985400320
score 13.252575