Baselight

Assembly Shellcode Dataset

The Largest Collection of Linux Assembly Shellcodes

@kaggle.thedevastator_assembly_shellcode_dataset


About this Dataset

By SoLID (From Huggingface)


The dataset is split into three CSV files: train.csv, validation.csv, and test.csv. The validation.csv file contains assembly shellcodes set aside for validation, that is, for checking the accuracy of models or algorithms trained on this dataset.

The train.csv file pairs each intent, a description of what a shellcode is meant to do, with the corresponding assembly code snippet, so that it can be used for supervised learning. It is aimed at researchers, practitioners, and developers working on malicious code analysis or other security-related tasks.

The test.csv file provides a further collection of assembly shellcodes that can serve as test cases for assessing the performance, robustness, and generalization of models or methods developed on this data.

How to use the dataset

Understanding the Dataset

The dataset consists of multiple files that serve different purposes:

  • train.csv: Intent and assembly snippet pairs for training (2,560 rows). Use it to train machine learning models or to develop algorithms for shellcode analysis.

  • test.csv: Shellcodes held out for testing (320 rows). Use them to evaluate the final performance of your models or analysis techniques.

  • validation.csv: Shellcodes reserved for validation (320 rows). Use them separately from the training data to check the accuracy and reliability of your models.

Columns in the Dataset

The columns available in each CSV file are as follows:

  • intent: A description of the purpose or objective of the shellcode entry, that is, what the code in the same row is meant to do.

  • snippet: The assembly code that corresponds to the intent in the same row, including the instructions and data needed to carry out that action.
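
As a minimal sketch of how one split could be read and checked against the two columns described above, the following uses only the Python standard library. The function name load_split and the two sample rows are hypothetical and stand in for real file contents; the actual CSV files may contain additional columns or different quoting.

```python
import csv
import io

# Column names as listed on this page; assumed to match the CSV header.
EXPECTED_COLUMNS = {"intent", "snippet"}


def load_split(source):
    """Read one split from an open text file and return a list of row dicts."""
    reader = csv.DictReader(source)
    if set(reader.fieldnames or []) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


# Hypothetical two-row sample standing in for train.csv (not real dataset rows).
sample = io.StringIO(
    "intent,snippet\n"
    "push eax onto the stack,push eax\n"
    '"move 1 into ebx","mov ebx, 1"\n'
)
rows = load_split(sample)
print(len(rows), rows[1]["snippet"])
```

For a file on disk, the same function could be called as load_split(open("train.csv", newline="", encoding="utf-8")), assuming the file sits in the working directory.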

Utilizing the Dataset

To effectively utilize this dataset, follow these general steps:

  • Familiarize yourself with assembly language: Assembly language is essential when working with shellcodes since they consist of low-level machine instructions understood by processors directly.

  • Explore intents: Start by reading through the intents in the dataset. Each intent states the specific goal of one piece of code.

  • Examine snippets: Review the assembly snippet paired with each intent. Study the instructions and data it uses, since these determine what the shellcode actually does.

  • Train your models: If you are working on machine learning or algorithm development, use train.csv to train your models on the labeled intent and snippet pairs. Such models can then form the basis of tools that analyze or detect shellcodes automatically.

  • Evaluate on the test set: Use the shellcodes in test.csv to evaluate your trained models or analysis techniques.
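
The train and evaluate steps above can be sketched with a very simple baseline. This is not a method described by the dataset authors: it is a hypothetical nearest-neighbor lookup that, given an intent, returns the snippet of the most similar training intent (by word overlap) and reports exact-match accuracy on held-out rows. The direction of prediction (intent to snippet) and the sample rows are assumptions made for illustration.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(intent, train_rows):
    """Return the snippet of the training row whose intent is most similar."""
    query = tokens(intent)
    best = max(train_rows, key=lambda row: jaccard(query, tokens(row["intent"])))
    return best["snippet"]


def exact_match_accuracy(train_rows, test_rows):
    """Fraction of test rows whose predicted snippet equals the reference."""
    hits = sum(predict(r["intent"], train_rows) == r["snippet"] for r in test_rows)
    return hits / len(test_rows)


# Hypothetical rows standing in for train.csv and test.csv.
train_rows = [
    {"intent": "push eax onto the stack", "snippet": "push eax"},
    {"intent": "move 1 into ebx", "snippet": "mov ebx, 1"},
    {"intent": "clear the ecx register", "snippet": "xor ecx, ecx"},
]
test_rows = [
    {"intent": "push eax on the stack", "snippet": "push eax"},
    {"intent": "move 2 into ebx", "snippet": "mov ebx, 2"},
]
accuracy = exact_match_accuracy(train_rows, test_rows)
print(accuracy)  # 0.5 on this toy sample
```

A lookup like this cannot produce snippets it has never seen, which is why the second toy example fails; it is mainly useful as a floor against which a trained model can be compared.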

Research Ideas

  • Malware analysis: The dataset can be used for studying and analyzing various shellcode techniques used in malware attacks. Researchers and security professionals can use this dataset to develop detection and prevention mechanisms against such attacks.
  • Penetration testing: Security experts can use this dataset to simulate real-world attack scenarios and test the effectiveness of their defensive measures. By having access to a diverse range of shellcodes, they can identify vulnerabilities in systems and patch them before malicious actors exploit them.
  • Machine learning training: This dataset can be used to train machine learning models for automatic detection or classification of shellcodes. By combining the intent column (which describes the objective of each shellcode) with the corresponding assembly code snippets, researchers can develop algorithms that identify the purpose of a given piece of shellcode from its characteristics.
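
As one possible starting point for the machine learning idea above, a snippet can be reduced to simple characteristics before any model is involved. The sketch below counts the leading mnemonic of each snippet. It assumes each snippet starts with an instruction mnemonic; snippets that begin with a label or contain several instructions would need extra handling. The function name and sample snippets are hypothetical.

```python
from collections import Counter


def mnemonic_counts(snippets):
    """Count the first whitespace-separated token of each snippet, lowercased."""
    counts = Counter()
    for snippet in snippets:
        parts = snippet.strip().split(None, 1)
        if parts:  # skip empty snippets
            counts[parts[0].lower()] += 1
    return counts


# Hypothetical snippets, not taken from the dataset.
sample_snippets = ["push eax", "mov ebx, 1", "PUSH ebx", "xor ecx, ecx", ""]
counts = mnemonic_counts(sample_snippets)
print(counts.most_common(1))
```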

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Columns

File: validation.csv

Column name | Type | Description
intent      | Text | The purpose or objective of the shellcode.
snippet     | Text | The actual assembly code of the shellcode, including instructions and data necessary to execute the desired action.

File: train.csv

Column name | Type | Description
intent      | Text | The purpose or objective of the shellcode.
snippet     | Text | The actual assembly code of the shellcode, including instructions and data necessary to execute the desired action.

File: test.csv

Column name | Type | Description
intent      | Text | The purpose or objective of the shellcode.
snippet     | Text | The actual assembly code of the shellcode, including instructions and data necessary to execute the desired action.

Acknowledgements

If you use this dataset in your research, please credit SoLID (From Huggingface).

Tables

Test

@kaggle.thedevastator_assembly_shellcode_dataset.test
  • 12.57 KB
  • 320 rows
  • 2 columns

CREATE TABLE test (
  "intent" VARCHAR,
  "snippet" VARCHAR
);

Train

@kaggle.thedevastator_assembly_shellcode_dataset.train
  • 78.23 KB
  • 2560 rows
  • 2 columns

CREATE TABLE train (
  "intent" VARCHAR,
  "snippet" VARCHAR
);

Validation

@kaggle.thedevastator_assembly_shellcode_dataset.validation
  • 13.11 KB
  • 320 rows
  • 2 columns

CREATE TABLE validation (
  "intent" VARCHAR,
  "snippet" VARCHAR
);
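
For local experiments, the three table definitions above can be recreated in an in-memory SQLite database using Python's standard sqlite3 module. This is only a sketch: SQLite is an assumption and not necessarily the engine behind this page, and the inserted rows are hypothetical samples rather than dataset contents.

```python
import sqlite3

TABLES = ("train", "test", "validation")

conn = sqlite3.connect(":memory:")
for table in TABLES:
    # Same two VARCHAR columns as in the CREATE TABLE statements above.
    conn.execute(f'CREATE TABLE {table} ("intent" VARCHAR, "snippet" VARCHAR)')

# Hypothetical sample rows; real rows would come from train.csv.
sample_rows = [
    ("push eax onto the stack", "push eax"),
    ("move 1 into ebx", "mov ebx, 1"),
]
conn.executemany("INSERT INTO train VALUES (?, ?)", sample_rows)


def split_sizes(connection):
    """Return the row count of each split table as a dict."""
    sizes = {}
    for table in TABLES:
        (count,) = connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        sizes[table] = count
    return sizes


sizes = split_sizes(conn)
print(sizes)
```

Once the full CSV files are loaded, the counts should match the sizes listed on this page: 2560 rows for train and 320 each for test and validation.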
