MBAL: 10 Millions Crypto Address Label Dataset

A dataset of 10 million annotated crypto addresses with categories and entities

@kaggle.yidongchaintoolai_mbal_10m_crypto_address_label_dataset

About this Dataset

This dataset accompanies the article "MBAL: A Dataset of 10 Million Annotated Crypto Addresses with Categories and Entities on Leading Blockchain Networks". It includes the released dataset and the sample data used in the experiments described in the article.

The dataset comprises six files, covering three sections, described as follows:

Section 1: The publicly released dataset

  • dataset_10m_ads.csv
    This file contains labeled data for 10 million addresses. It has six columns; those documented on this page are explained below:
    | column_name | description |
    | --- | --- |
    | chain | The blockchain network of the address, with five possible values: bitcoin_mainnet, ethereum_mainnet, bnb_chain_mainnet, polygon_mainnet, avalanche_c_chain |
    | address | The cryptocurrency address |
    | categories | The category of the address, as enumerated in the article, with 62 possible values. An address may belong to multiple categories |
    | entity | The entity associated with the address: either a single entity or empty |
    | source | The source of the data, with three possible values: ground_truth, heuristic, external |
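To get a first feel for the columns above, the following minimal sketch counts addresses per chain, per source, and per individual category using only the standard library. The column names come from the table; the separator used when an address has several categories is not stated on this page, so `CATEGORY_SEPARATOR` is an assumption, and the sample rows (addresses, category names, entity) are made-up placeholders rather than real records.

```python
import csv
import io
from collections import Counter

# Assumption: multiple categories are separated by a comma inside one field.
# Check the real file before relying on this.
CATEGORY_SEPARATOR = ","


def summarize_labels(csv_file, separator=CATEGORY_SEPARATOR):
    """Count addresses per chain, per source, and per individual category."""
    per_chain, per_source, per_category = Counter(), Counter(), Counter()
    for row in csv.DictReader(csv_file):
        per_chain[row["chain"]] += 1
        per_source[row["source"]] += 1
        # An address may belong to several categories: split and count each.
        for category in row["categories"].split(separator):
            category = category.strip()
            if category:
                per_category[category] += 1
    return per_chain, per_source, per_category


# Made-up sample in the documented layout (all values are placeholders).
SAMPLE = io.StringIO(
    "chain,address,categories,entity,source\n"
    'bitcoin_mainnet,addr_1,"exchange,hot_wallet",example_entity,ground_truth\n'
    "ethereum_mainnet,addr_2,defi,,heuristic\n"
    "ethereum_mainnet,addr_3,,,external\n"
)

chains, sources, categories = summarize_labels(SAMPLE)
print(chains)
print(sources)
print(categories)
```

For the real file, the same function can be given an open file handle, for example `open("dataset_10m_ads.csv", newline="")`. Because it reads row by row, it does not need to hold all 10 million rows in memory.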

Section 2: Sample data for Experiment 1 (COMPARATIVE EXPERIMENT BETWEEN MBAL AND BABD-13)

Experiment 1 focuses on addresses on the Bitcoin mainnet. The three files below share the same columns: address, category, and 144 additional feature fields. With this sample data, Experiment 1 as described in the article can be fully replicated.

The training/test sets are constructed from the sample data as shown in the accompanying figure. We merged and deduplicated the positive samples of the two datasets, then randomly selected 50,000 of them as the positive samples of the test set. Negative samples were constructed in the same way, giving a test set of 100,000 samples in total. The sample data that remains after removing the test set forms the training set. In the figure, the white part indicates duplicate data, yellow indicates test data, and light yellow indicates training data.

  • exp1_bitcoin_sample_test_dd.csv
    Public test samples for Experiment 1.
  • exp1_bitcoin_sample_train_mbal_dd.csv
    Training samples from the MBAL dataset for Experiment 1.
  • exp1_bitcoin_sample_train_babd_dd.csv
    Training samples from the BABD dataset for Experiment 1.
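The split procedure described for Experiment 1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `build_split`, the fixed seed, and the toy addresses are hypothetical, and the page does not say how addresses shared by both datasets are treated on the training side, so the sketch simply keeps them in each dataset's training set.

```python
import random


def build_split(mbal_samples, babd_samples, test_size, seed=0):
    """Build one class (positive or negative) of the shared test set.

    Returns (test, mbal_train, babd_train) as sorted lists of addresses.
    """
    # Merge the two sample sets and drop duplicates.
    merged = sorted(set(mbal_samples) | set(babd_samples))
    # Randomly select the test samples from the merged pool.
    rng = random.Random(seed)
    test = set(rng.sample(merged, test_size))
    # Each dataset's training set is its samples minus the test set.
    mbal_train = sorted(set(mbal_samples) - test)
    babd_train = sorted(set(babd_samples) - test)
    return sorted(test), mbal_train, babd_train


# Toy data: two overlapping pools of placeholder addresses.
mbal_pos = [f"addr_{i}" for i in range(0, 60)]
babd_pos = [f"addr_{i}" for i in range(40, 100)]

test_pos, mbal_train_pos, babd_train_pos = build_split(mbal_pos, babd_pos, test_size=20)
print(len(test_pos), len(mbal_train_pos), len(babd_train_pos))
```

Running the same function on the negative samples and concatenating the two test lists would give the full test set; with `test_size=50_000` per class that matches the 100,000 samples quoted above.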

Section 3: Sample data for Experiment 2 (EXPERIMENT ON SPECIFIC CATEGORIES)

Experiment 2 focuses on addresses on the Ethereum mainnet. The files below share the same columns: address, category, and 207 additional feature fields. With this sample data, Experiment 2 as described in the article can be fully replicated.

Sample Dataset Construction: For the Ethereum category analysis, we selected transaction data from 2022 and built the sample dataset in the same way as in Experiment 1. In total, this yielded 55,571,103 addresses and 591,892,912 corresponding transactions.

Training/Test Set Construction: We constructed the training/test sets for the category-specific experiments using the same methodology as in Experiment 1, but at a larger scale: 4,749,952 training samples and 1,000,000 test samples (500,000 positive and 500,000 negative).

  • exp2_ethereum_sample_test_mbal_dd.csv
    Test samples from the MBAL dataset for Experiment 2.
  • exp2_ethereum_sample_train_mbal_dd.csv
    Training samples from the MBAL dataset for Experiment 2.
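Before training on these files, it can help to confirm that a loaded split has the properties described above. The sketch below computes the share of samples held out for testing from the quoted counts and checks two things on any split: that no address appears in both sets, and that the test set is balanced. The helper names and the 0/1 label encoding in the toy example are assumptions; the page does not specify how positive and negative samples are encoded.

```python
from collections import Counter

# Counts quoted in the text for Experiment 2.
TRAIN_SIZE = 4_749_952
TEST_SIZE = 1_000_000
TEST_POSITIVES = 500_000


def holdout_fraction(train_size, test_size):
    """Share of all samples that is held out for testing."""
    return test_size / (train_size + test_size)


def check_split(train_addresses, test_addresses, test_labels):
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    overlap = set(train_addresses) & set(test_addresses)
    if overlap:
        problems.append(f"{len(overlap)} addresses appear in both train and test")
    counts = Counter(test_labels)
    # Balanced means exactly two classes with equal counts.
    if len(counts) != 2 or len(set(counts.values())) != 1:
        problems.append(f"test set is not balanced: {dict(counts)}")
    return problems


print(f"held-out share: {holdout_fraction(TRAIN_SIZE, TEST_SIZE):.1%}")

# Toy examples with placeholder addresses and assumed 0/1 labels.
good = check_split(["a", "b", "c"], ["d", "e"], [1, 0])
bad = check_split(["a", "b"], ["b", "c", "d"], [1, 1, 0])
print(good)
print(bad)
```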

About categories
