A subset of the Microsoft Academic Graph (heterogeneous)

OGBN-MAG

Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag

Usage in Python

*Warning:* Currently not usable.

import torch_geometric  # PyG is required by the PyG loader below
from ogb.nodeproppred import PygNodePropPredDataset

# root points at the Kaggle input directory that holds the dataset files
dataset = PygNodePropPredDataset('ogbn-mag', root='/kaggle/input')
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph = dataset[0]  # PyG graph object

Description

Graph: The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. Similar to ogbn-arxiv, each paper is associated with a 128-dimensional word2vec feature vector, and all the other types of entities are not associated with input node features.
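As a quick sanity check on the figures above, the schema can be written down as plain Python data. This is only an illustrative sketch: the names `NODE_COUNTS` and `RELATIONS`, and the short type and relation labels, are hypothetical choices made here, not identifiers guaranteed by the OGB package.

```python
# Node counts per entity type, copied from the description above.
# The keys are illustrative labels, not guaranteed OGB identifiers.
NODE_COUNTS = {
    "paper": 736_389,
    "author": 1_134_649,
    "institution": 8_740,
    "field_of_study": 59_965,
}

# The four directed relations as (source type, relation, target type).
RELATIONS = [
    ("author", "affiliated_with", "institution"),
    ("author", "writes", "paper"),
    ("paper", "cites", "paper"),
    ("paper", "has_topic", "field_of_study"),
]

total_nodes = sum(NODE_COUNTS.values())

# Every relation endpoint should be one of the declared node types.
endpoints_ok = all(
    src in NODE_COUNTS and dst in NODE_COUNTS for src, _, dst in RELATIONS
)

print(f"{total_nodes:,} nodes, schema consistent: {endpoints_ok}")
```

The four counts add up to 1,939,743, which agrees with the #Nodes figure in the Summary.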

Prediction task: Given the heterogeneous ogbn-mag data, the task is to predict the venue (conference or journal) of each paper, given its content, references, authors, and authors’ affiliations. This is of practical interest as some manuscripts’ venue information is unknown or missing in MAG, due to the noisy nature of Web data. In total, there are 349 different venues in ogbn-mag, making the task a 349-class classification problem.
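To make the accuracy metric concrete, here is a minimal sketch in plain Python. The helper `accuracy` and the toy venue ids are hypothetical and exist only for illustration; real predictions are meant to be scored with the OGB Evaluator.

```python
NUM_CLASSES = 349  # number of venues stated above


def accuracy(y_true, y_pred):
    """Fraction of papers whose predicted venue equals the true venue."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy example with made-up venue ids in [0, NUM_CLASSES).
y_true = [3, 17, 17, 348, 0]
y_pred = [3, 17, 42, 348, 1]
toy_acc = accuracy(y_true, y_pred)  # 3 of the 5 predictions match

# Reference point for uniform random guessing. This assumes balanced
# classes, which the real venue distribution may not satisfy.
chance = 1 / NUM_CLASSES
```

Under the balanced-class assumption, random guessing would score roughly 0.3% accuracy, which gives a sense of how hard a 349-way problem is.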

Dataset splitting: The authors of this dataset follow the same time-based strategy as ogbn-arxiv and ogbn-papers100M to split the paper nodes in the heterogeneous graph: models are trained to predict venue labels of all papers published before 2018, validated on papers published in 2018, and tested on papers published in 2019 or later.
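The time-based rule can be sketched in a few lines of Python. The function `split_for_year` and the `(paper_id, year)` pairs are hypothetical and only illustrate the rule; with the real dataset the split is precomputed and returned by `dataset.get_idx_split()`.

```python
def split_for_year(year):
    """Map a publication year to 'train', 'valid', or 'test'."""
    if year < 2018:
        return "train"
    if year == 2018:
        return "valid"
    return "test"  # 2019 or later


# Hypothetical (paper_id, year) pairs for illustration only.
papers = [(0, 2010), (1, 2017), (2, 2018), (3, 2019), (4, 2020)]

split_idx = {"train": [], "valid": [], "test": []}
for paper_id, year in papers:
    split_idx[split_for_year(year)].append(paper_id)

print(split_idx)
```

Splitting by time rather than at random means the model is always evaluated on papers newer than anything it was trained on.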

Summary

| Package | #Nodes | #Edges | Split Type | Task Type | Metric |
|---|---|---|---|---|---|
| `ogb>=1.2.1` | 1,939,743 | 21,111,007 | Time | Multi-class classification | Accuracy |

Open Graph Benchmark

Website: https://ogb.stanford.edu

The Open Graph Benchmark (OGB) [2] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

References

[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
[2] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

Disclaimer

I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content of this dataset. For any questions, problems, or issues, please contact the original authors via their website or their GitHub repo.
