Baselight

Microsoft Malware Sample

Vectorized malware byte-files.

@kaggle.dheemanthbhat_microsoft_malware_sample


About this Dataset

Source

This dataset contains vectorized byte-files sampled from the original dataset of the Microsoft Malware Classification Challenge (BIG 2015) competition. The original dataset is described in http://arxiv.org/abs/1802.10135.
The original train and test datasets are ~18GB each. This random sample, once extracted and vectorized, is only ~15MB in size.

How is the dataset sampled?

  1. An equal number of malware byte-files is randomly selected from each class (except Simda).
  2. The byte data, stored as hexadecimal characters, is then preprocessed.
  3. Finally, the preprocessed hex strings are vectorized using scikit-learn's CountVectorizer.

Note: The original dataset contains only 42 byte-files for malware class 5 (Simda), so this class is excluded from the equal-size sampling.
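
The preprocessing and vectorization steps above can be sketched roughly as follows. This is not the author's actual pipeline: it uses the standard library's collections.Counter as a dependency-free stand-in for scikit-learn's CountVectorizer with a fixed vocabulary of 256 byte tokens. The helper names (preprocess, vectorize, VOCAB), the sample text, and the assumed byte-file layout (an address followed by hex byte tokens on each line, with non-hex tokens such as "??" skipped) are all illustrative assumptions.

```python
from collections import Counter

# Fixed vocabulary: every byte value as a two-character lowercase hex token.
VOCAB = [f"{b:02x}" for b in range(256)]
HEX_DIGITS = set("0123456789abcdef")


def preprocess(byte_file_text):
    """Return the lowercase hex byte tokens of a byte-file.

    Assumption: each line looks like '00401000 56 8D 44 ...', where the first
    field is an address. Tokens that are not two hex digits (e.g. '??') are
    skipped in this sketch.
    """
    tokens = []
    for line in byte_file_text.splitlines():
        parts = line.split()
        for tok in parts[1:]:  # parts[0] is assumed to be the address
            tok = tok.lower()
            if len(tok) == 2 and set(tok) <= HEX_DIGITS:
                tokens.append(tok)
    return tokens


def vectorize(tokens):
    """Count each byte token, using the n_XX column naming of the tables."""
    counts = Counter(tokens)
    return {f"n_{v}": counts.get(v, 0) for v in VOCAB}


# Made-up two-line sample, not taken from the real dataset.
sample = "00401000 56 8D 00 00\n00401010 FF 00 ?? 8d\n"
row = vectorize(preprocess(sample))
print(row["n_00"], row["n_8d"], row["n_ff"])
```

Each resulting dictionary corresponds to one row of byte-count features. With scikit-learn, a comparable result would likely come from CountVectorizer configured with a fixed vocabulary and a token pattern that keeps two-character tokens, but the exact settings used for this dataset are not stated in the source.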

Tables

Test Vec

@kaggle.dheemanthbhat_microsoft_malware_sample.test_vec
  • 7.73 MB
  • 10873 rows
  • 258 columns

CREATE TABLE test_vec (
  "id" VARCHAR,
  "bytfsize" BIGINT,
  "n_00" BIGINT,
  "n_01" BIGINT,
  "n_02" BIGINT,
  "n_03" BIGINT,
  "n_04" BIGINT,
  "n_05" BIGINT,
  "n_06" BIGINT,
  "n_07" BIGINT,
  "n_08" BIGINT,
  "n_09" BIGINT,
  "n_0a" BIGINT,
  "n_0b" BIGINT,
  "n_0c" BIGINT,
  "n_0d" BIGINT,
  "n_0e" BIGINT,
  "n_0f" BIGINT,
  "n_10" BIGINT,
  "n_11" BIGINT,
  "n_12" BIGINT,
  "n_13" BIGINT,
  "n_14" BIGINT,
  "n_15" BIGINT,
  "n_16" BIGINT,
  "n_17" BIGINT,
  "n_18" BIGINT,
  "n_19" BIGINT,
  "n_1a" BIGINT,
  "n_1b" BIGINT,
  "n_1c" BIGINT,
  "n_1d" BIGINT,
  "n_1e" BIGINT,
  "n_1f" BIGINT,
  "n_20" BIGINT,
  "n_21" BIGINT,
  "n_22" BIGINT,
  "n_23" BIGINT,
  "n_24" BIGINT,
  "n_25" BIGINT,
  "n_26" BIGINT,
  "n_27" BIGINT,
  "n_28" BIGINT,
  "n_29" BIGINT,
  "n_2a" BIGINT,
  "n_2b" BIGINT,
  "n_2c" BIGINT,
  "n_2d" BIGINT,
  "n_2e" BIGINT,
  "n_2f" BIGINT,
  "n_30" BIGINT,
  "n_31" BIGINT,
  "n_32" BIGINT,
  "n_33" BIGINT,
  "n_34" BIGINT,
  "n_35" BIGINT,
  "n_36" BIGINT,
  "n_37" BIGINT,
  "n_38" BIGINT,
  "n_39" BIGINT,
  "n_3a" BIGINT,
  "n_3b" BIGINT,
  "n_3c" BIGINT,
  "n_3d" BIGINT,
  "n_3e" BIGINT,
  "n_3f" BIGINT,
  "n_40" BIGINT,
  "n_41" BIGINT,
  "n_42" BIGINT,
  "n_43" BIGINT,
  "n_44" BIGINT,
  "n_45" BIGINT,
  "n_46" BIGINT,
  "n_47" BIGINT,
  "n_48" BIGINT,
  "n_49" BIGINT,
  "n_4a" BIGINT,
  "n_4b" BIGINT,
  "n_4c" BIGINT,
  "n_4d" BIGINT,
  "n_4e" BIGINT,
  "n_4f" BIGINT,
  "n_50" BIGINT,
  "n_51" BIGINT,
  "n_52" BIGINT,
  "n_53" BIGINT,
  "n_54" BIGINT,
  "n_55" BIGINT,
  "n_56" BIGINT,
  "n_57" BIGINT,
  "n_58" BIGINT,
  "n_59" BIGINT,
  "n_5a" BIGINT,
  "n_5b" BIGINT,
  "n_5c" BIGINT,
  "n_5d" BIGINT,
  "n_5e" BIGINT,
  "n_5f" BIGINT,
  "n_60" BIGINT,
  "n_61" BIGINT
);

Trainlabels Bal

@kaggle.dheemanthbhat_microsoft_malware_sample.trainlabels_bal
  • 40.05 KB
  • 1642 rows
  • 2 columns

CREATE TABLE trainlabels_bal (
  "id" VARCHAR,
  "class" BIGINT
);

Train Vec

@kaggle.dheemanthbhat_microsoft_malware_sample.train_vec
  • 1.82 MB
  • 1642 rows
  • 259 columns

CREATE TABLE train_vec (
  "id" VARCHAR,
  "bytfsize" BIGINT,
  "n_00" BIGINT,
  "n_01" BIGINT,
  "n_02" BIGINT,
  "n_03" BIGINT,
  "n_04" BIGINT,
  "n_05" BIGINT,
  "n_06" BIGINT,
  "n_07" BIGINT,
  "n_08" BIGINT,
  "n_09" BIGINT,
  "n_0a" BIGINT,
  "n_0b" BIGINT,
  "n_0c" BIGINT,
  "n_0d" BIGINT,
  "n_0e" BIGINT,
  "n_0f" BIGINT,
  "n_10" BIGINT,
  "n_11" BIGINT,
  "n_12" BIGINT,
  "n_13" BIGINT,
  "n_14" BIGINT,
  "n_15" BIGINT,
  "n_16" BIGINT,
  "n_17" BIGINT,
  "n_18" BIGINT,
  "n_19" BIGINT,
  "n_1a" BIGINT,
  "n_1b" BIGINT,
  "n_1c" BIGINT,
  "n_1d" BIGINT,
  "n_1e" BIGINT,
  "n_1f" BIGINT,
  "n_20" BIGINT,
  "n_21" BIGINT,
  "n_22" BIGINT,
  "n_23" BIGINT,
  "n_24" BIGINT,
  "n_25" BIGINT,
  "n_26" BIGINT,
  "n_27" BIGINT,
  "n_28" BIGINT,
  "n_29" BIGINT,
  "n_2a" BIGINT,
  "n_2b" BIGINT,
  "n_2c" BIGINT,
  "n_2d" BIGINT,
  "n_2e" BIGINT,
  "n_2f" BIGINT,
  "n_30" BIGINT,
  "n_31" BIGINT,
  "n_32" BIGINT,
  "n_33" BIGINT,
  "n_34" BIGINT,
  "n_35" BIGINT,
  "n_36" BIGINT,
  "n_37" BIGINT,
  "n_38" BIGINT,
  "n_39" BIGINT,
  "n_3a" BIGINT,
  "n_3b" BIGINT,
  "n_3c" BIGINT,
  "n_3d" BIGINT,
  "n_3e" BIGINT,
  "n_3f" BIGINT,
  "n_40" BIGINT,
  "n_41" BIGINT,
  "n_42" BIGINT,
  "n_43" BIGINT,
  "n_44" BIGINT,
  "n_45" BIGINT,
  "n_46" BIGINT,
  "n_47" BIGINT,
  "n_48" BIGINT,
  "n_49" BIGINT,
  "n_4a" BIGINT,
  "n_4b" BIGINT,
  "n_4c" BIGINT,
  "n_4d" BIGINT,
  "n_4e" BIGINT,
  "n_4f" BIGINT,
  "n_50" BIGINT,
  "n_51" BIGINT,
  "n_52" BIGINT,
  "n_53" BIGINT,
  "n_54" BIGINT,
  "n_55" BIGINT,
  "n_56" BIGINT,
  "n_57" BIGINT,
  "n_58" BIGINT,
  "n_59" BIGINT,
  "n_5a" BIGINT,
  "n_5b" BIGINT,
  "n_5c" BIGINT,
  "n_5d" BIGINT,
  "n_5e" BIGINT,
  "n_5f" BIGINT,
  "n_60" BIGINT,
  "n_61" BIGINT
);
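
Since train_vec and trainlabels_bal both carry an "id" column and the same row count, the features and labels can presumably be combined with a join on "id". The sketch below shows this with Python's built-in sqlite3 module on an in-memory database. It assumes "id" is the shared key, uses only three of the train_vec columns, and inserts made-up rows ("sampleA", "sampleB") purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down versions of the two training tables (only a few columns kept).
conn.execute('CREATE TABLE train_vec ("id" VARCHAR, "bytfsize" BIGINT, "n_00" BIGINT)')
conn.execute('CREATE TABLE trainlabels_bal ("id" VARCHAR, "class" BIGINT)')

# Hypothetical rows; real ids and counts come from the dataset itself.
conn.executemany(
    "INSERT INTO train_vec VALUES (?, ?, ?)",
    [("sampleA", 1024, 10), ("sampleB", 2048, 25)],
)
conn.executemany(
    "INSERT INTO trainlabels_bal VALUES (?, ?)",
    [("sampleA", 1), ("sampleB", 3)],
)

# Attach the class label to each feature row via the assumed shared key.
rows = conn.execute(
    'SELECT v."id", l."class", v."bytfsize", v."n_00" '
    'FROM train_vec AS v '
    'JOIN trainlabels_bal AS l ON v."id" = l."id" '
    'ORDER BY v."id"'
).fetchall()

for r in rows:
    print(r)
conn.close()
```

The same SELECT statement should carry over to other SQL engines with little change, since it relies only on a standard inner join and quoted identifiers.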
