Context
The data presented here was collected in a network section from Universidad Del Cauca, Popayán, Colombia by performing packet captures at different hours, during morning and afternoon, over six days (April 26, 27, 28 and May 9, 11 and 15) of 2017. A total of 3.577.296 instances were collected and are currently stored in a CSV (Comma Separated Values) file.
Content
This dataset contains 87 features. Each instance holds the information of an IP flow generated by a network device i.e., source and destination IP addresses, ports, interarrival times, layer 7 protocol (application) used on that flow as the class, among others. Most of the attributes are numeric type but there are also nominal types and a date type due to the Timestamp.
The flow statistics (IP addresses, ports, inter-arrival times, etc) were obtained using CICFlowmeter (http://www.unb.ca/cic/research/applications.html - https://github.com/ISCX/CICFlowMeter). The application layer protocol was obtained by performing a DPI (Deep Packet Inspection) processing on the flows with ntopng (https://www.ntop.org/products/traffic-analysis/ntop/ - https://github.com/ntop/ntopng).
For further information and if you find this dataset useful, please read and cite the following papers:
Research Gate:
https://www.researchgate.net/publication/326150046_Personalized_Service_Degradation_Policies_on_OTT_Applications_Based_on_the_Consumption_Behavior_of_Users
Research Gate:
https://www.researchgate.net/publication/335954240_Consumption_Behavior_Analysis_of_Over_The_Top_Services_Incremental_Learning_or_Traditional_Methods
Springer:
https://link.springer.com/chapter/10.1007/978-3-319-95168-3_37
IEEExplore
https://ieeexplore.ieee.org/document/8845576
Research Gate:
https://www.researchgate.net/publication/345990587_Smart_User_Consumption_Profiling_Incremental_Learning-based_OTT_Service_Degradation
IEEExpore
https://ieeexplore.ieee.org/document/9258898
Acknowledgements
I would like to thank Universidad Del Cauca for supporting the research that generated this dataset and Colciencias for my PhD scholarship.
Inspiration
Considering that most of the network traffic classification datasets are aimed only at identifying the type of application an IP flow holds (WWW, DNS, FTP, P2P, Telnet,etc), this dataset goes a step further by generating machine learning models capable of detecting specific applications such as Facebook, YouTube, Instagram, etc, from IP flow statistics (currently 75 applications).