
Microsoft Malware Sample

Kaggle

@kaggle.dheemanthbhat_microsoft_malware_sample


Vectorized malware byte-files.

Dataset Description

Source

This dataset contains vectorized byte-files taken from the original dataset of the Microsoft Malware Classification Challenge (BIG 2015) competition. The original dataset is described in http://arxiv.org/abs/1802.10135.
The original train and test sets are ~18GB each. This random sample, after extraction and vectorization, is only ~15MB in size.

How is the dataset sampled?

  1. An equal number of malware byte-files is randomly selected from each class (except Simda).
  2. The byte data, stored as hexadecimal characters, is then preprocessed.
  3. Finally, the preprocessed hex strings are vectorized using scikit-learn's CountVectorizer.

Note: The original dataset contains only 42 byte-files for malware class 5 (Simda), which is why this class is excluded from the equal-size sampling.
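
The three steps above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the code used to build this dataset: the description does not specify the preprocessing, so dropping the leading address column and the "??" placeholders is a guess; the helper names, file ids, and toy file contents are hypothetical; and the sketch simply skips class 5, since the description does not say how Simda files are handled.

```python
import random
from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer


def sample_equal_per_class(labels, n_per_class, skip_classes=(), seed=0):
    """Pick n_per_class random file ids from every class not in skip_classes."""
    by_class = defaultdict(list)
    for file_id, cls in labels.items():
        if cls not in skip_classes:
            by_class[cls].append(file_id)
    rng = random.Random(seed)
    picked = []
    for cls in sorted(by_class):
        picked.extend(rng.sample(sorted(by_class[cls]), n_per_class))
    return picked


def preprocess_bytes_text(text):
    """Assumed preprocessing: drop the address column and '??' placeholders."""
    tokens = []
    for line in text.splitlines():
        parts = line.split()
        # parts[0] is assumed to be the address; the rest are hex byte values.
        tokens.extend(p.lower() for p in parts[1:] if p != "??")
    return " ".join(tokens)


# Toy stand-ins for real byte-files (hypothetical ids and contents).
raw_files = {
    "a1": "00401000 56 8D 44 24\n00401010 ?? ?? 56 00",
    "a2": "00401000 00 00 8D FF",
    "b1": "00401000 FF FF 56 ??",
    "b2": "00401000 44 24 44 24",
}
labels = {"a1": 1, "a2": 1, "b1": 2, "b2": 2}

# Step 1: equal-size random sample per class, skipping class 5 (Simda).
sampled_ids = sample_equal_per_class(labels, n_per_class=1, skip_classes={5})

# Step 2: turn each sampled file into one whitespace-separated hex string.
docs = [preprocess_bytes_text(raw_files[i]) for i in sampled_ids]

# Step 3: count hex tokens; the pattern keeps two-character tokens like "00".
vectorizer = CountVectorizer(token_pattern=r"\b[0-9a-f]{2}\b")
X = vectorizer.fit_transform(docs)
```

Here X is a sparse document-term matrix with one row per sampled file and one column per distinct byte value seen. Passing ngram_range=(1, 2) to CountVectorizer would add byte bigrams as features, at the cost of a much larger vocabulary.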

