Vectorized malware byte-files.
Dataset Description
Source
This dataset contains vectorized byte-files taken from the original dataset of the Microsoft Malware Classification Challenge (BIG 2015) competition. The original dataset is described in http://arxiv.org/abs/1802.10135.
The original train and test sets are ~18GB each. This randomly sampled, vectorized subset is only ~15MB in size.
How is the dataset sampled?
- An equal number of malware byte-files is randomly selected from each class (except Simda).
- The byte data, stored as hexadecimal characters, is then preprocessed.
- Finally, the preprocessed hex strings are vectorized using scikit-learn's CountVectorizer.
Note: The original dataset contains only 42 byte-files for malware class 5 (Simda), so this class is left out of the sample.