MBAL: 10 Million Crypto Address Label Dataset
A dataset of 10 million annotated crypto addresses with categories and entities
@kaggle.yidongchaintoolai_mbal_10m_crypto_address_label_dataset
This dataset was published with the article "MBAL: A Dataset of 10 Million Annotated Crypto Addresses with Categories and Entities on Leading Blockchain Networks" and includes the data for the dataset and for the experiments conducted in the article.
The dataset comprises six files, covering three sections, described as follows:
Experiment 1 focuses on addresses on the Bitcoin mainnet. The three files below share the same columns, mainly address, category, and 144 other feature fields. Using these sample data, Experiment 1 as described in the article can be fully replicated.
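As a rough illustration of this column layout, the sketch below separates the address, the category label, and the feature fields when reading a file. It assumes the files are CSV with header columns literally named `address` and `category`; the feature names (`feat_1`, `feat_2`), the category values, and the inline sample rows are hypothetical stand-ins, not taken from the real files.

```python
import csv
import io

# Hypothetical stand-in for one sample file. The real Experiment 1 files
# have 144 feature columns; two are shown here to keep the sketch short.
sample_csv = io.StringIO(
    "address,category,feat_1,feat_2\n"
    "addr_0,exchange,0.5,3\n"
    "addr_1,mining,1.25,7\n"
)

reader = csv.DictReader(sample_csv)

# Every column that is not the address or the label is treated as a feature.
feature_names = [c for c in reader.fieldnames if c not in ("address", "category")]

addresses, labels, features = [], [], []
for row in reader:
    addresses.append(row["address"])
    labels.append(row["category"])
    # Assumption: all feature fields are numeric.
    features.append([float(row[name]) for name in feature_names])

print(feature_names)
print(len(addresses), len(labels), len(features))
```

To read an actual file, replace `sample_csv` with an open file handle and adjust the column names to whatever the file header really contains.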
The method of constructing a training/test set from the sample data is shown in the figure. We merged and deduplicated the positive samples of the two datasets, then randomly selected 50,000 of them as the positive samples of the test set. Negative samples were constructed in the same way, giving a final test set of 100,000 samples. The sample data that remain after removing the test set form the training set. In the figure, the white part indicates duplicate data, yellow indicates test data, and light yellow indicates training data.
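The procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_split`, the fixed seed, the 1/0 label encoding, and the toy address lists are all hypothetical, and it assumes no address appears in both the positive and the negative pool.

```python
import random

def build_split(pos_a, pos_b, neg_a, neg_b, n_test_per_class, seed=0):
    """Merge two sources per class, deduplicate, and draw a balanced test set.

    Returns (train, test) as lists of (address, label) pairs, where label 1
    marks a positive sample and 0 a negative one (hypothetical encoding).
    """
    rng = random.Random(seed)

    # Merge and deduplicate each class; sorting keeps sampling reproducible.
    positives = sorted(set(pos_a) | set(pos_b))
    negatives = sorted(set(neg_a) | set(neg_b))

    # Randomly pick the same number of test samples from each class.
    test_pos = set(rng.sample(positives, n_test_per_class))
    test_neg = set(rng.sample(negatives, n_test_per_class))

    test = [(a, 1) for a in sorted(test_pos)] + [(a, 0) for a in sorted(test_neg)]

    # Everything not drawn into the test set becomes training data.
    train = [(a, 1) for a in positives if a not in test_pos]
    train += [(a, 0) for a in negatives if a not in test_neg]
    return train, test

# Toy data: the two sources overlap on two addresses per class.
pos_a = [f"addr_p{i}" for i in range(6)]
pos_b = [f"addr_p{i}" for i in range(4, 10)]
neg_a = [f"addr_n{i}" for i in range(6)]
neg_b = [f"addr_n{i}" for i in range(4, 10)]

train, test = build_split(pos_a, pos_b, neg_a, neg_b, n_test_per_class=3)
print(len(train), len(test))  # prints "14 6"
```

With the sizes stated for Experiment 1, `n_test_per_class` would be 50,000, which yields the 100,000-sample test set.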
Experiment 2 focuses on addresses on the Ethereum mainnet. These files share the same columns, mainly address, category, and 207 other feature fields. Using these sample data, Experiment 2 as described in the article can be fully replicated.
Sample Dataset Construction: For the Ethereum category analysis, we selected transaction data from 2022 to build the sample dataset, following the same procedure as in Experiment 1. In total, we obtained 55,571,103 addresses and the corresponding 591,892,912 transactions.
Training/Test Set Construction: We constructed the training/test sets for the category-specific analysis experiments using the same methodology as in Experiment 1, but at a larger scale: 4,749,952 training samples and 1,000,000 test samples (500,000 positive and 500,000 negative).