Dataset: OGBN-ArXiv (Processed For PyG)

About this Dataset

OGBN-ArXiv

Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv

Usage in Python

import os.path as osp
import pandas as pd
import datatable as dt
import torch
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

class PygOgbnArxiv(PygNodePropPredDataset):
    def __init__(self):
        root, name, transform = '/kaggle/input', 'ogbn-arxiv', T.ToSparseTensor()
        # ogbn-master.csv has one column per OGB dataset; take the ogbn-arxiv column.
        master = pd.read_csv(osp.join(root, name, 'ogbn-master.csv'), index_col=0)
        meta_dict = master[name]
        # Point the loader at the dataset files stored under root/name.
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name=name, root=root, transform=transform, meta_dict=meta_dict)

    def get_idx_split(self):
        # The split type is read from the dataset metadata.
        split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        # Read the first column of each header-less CSV as a tensor of node indices.
        train_idx = dt.fread(osp.join(path, 'train.csv'), header=False).to_numpy().T[0]
        train_idx = torch.from_numpy(train_idx).to(torch.long)
        valid_idx = dt.fread(osp.join(path, 'valid.csv'), header=False).to_numpy().T[0]
        valid_idx = torch.from_numpy(valid_idx).to(torch.long)
        test_idx = dt.fread(osp.join(path, 'test.csv'), header=False).to_numpy().T[0]
        test_idx = torch.from_numpy(test_idx).to(torch.long)
        return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}

dataset = PygOgbnArxiv()
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph = dataset[0] # PyG Graph object

Description

Graph: The ogbn-arxiv dataset is a directed graph representing the citation network between all Computer Science (CS) arXiv papers indexed by MAG [1]. Each node is an arXiv paper, and each directed edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract. The embeddings of individual words are computed by running the skip-gram model [2] over the MAG corpus. The authors also provide a mapping from MAG paper IDs to the raw texts of titles and abstracts. In addition, each paper is associated with the year in which it was published.
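As a rough illustration of how such a feature vector is formed, the sketch below averages word embeddings for a short text. The vocabulary, the 4-dimensional vectors, and the function name average_embedding are made up for the example (the dataset itself uses 128-dimensional skip-gram embeddings), and skipping words without an embedding is an assumption, not something the dataset description specifies.

```python
# Toy illustration: a paper's feature vector as the mean of its word embeddings.
# The vocabulary and 4-dimensional vectors are invented for this example.

def average_embedding(tokens, embeddings):
    """Return the element-wise mean of the embeddings of the known tokens."""
    # Assumption: tokens without an embedding are simply ignored.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

toy_embeddings = {
    "graph":   [1.0, 0.0, 2.0, 0.0],
    "neural":  [0.0, 1.0, 0.0, 2.0],
    "network": [2.0, 2.0, 1.0, 1.0],
}

feature = average_embedding("graph neural network".split(), toy_embeddings)
print(feature)  # [1.0, 1.0, 1.0, 1.0]
```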

Prediction task: The task is to predict the subject area of each arXiv CS paper out of 40 candidates, e.g., cs.AI, cs.LG, and cs.OS. These labels are determined manually by the paper’s authors and arXiv moderators. With the volume of scientific publications doubling every 12 years over the past century, it is practically important to automatically classify each publication’s areas and topics. Formally, the task is to predict the primary category of each arXiv paper, which is formulated as a 40-class classification problem.
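A minimal sketch of how predictions for this 40-class problem can be scored with plain accuracy. The helper names argmax and accuracy and the toy labels are hypothetical; in practice the labels and predictions would be tensors, and the OGB Evaluator would normally compute the metric.

```python
# Toy sketch: turn per-class scores into a predicted class and compute accuracy.
NUM_CLASSES = 40  # one class per arXiv CS subject area

def argmax(scores):
    """Index of the highest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda c: scores[c])

def accuracy(y_true, y_pred):
    """Fraction of papers whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# One row of made-up model scores in which class 7 scores highest.
scores = [0.0] * NUM_CLASSES
scores[7] = 1.0
predicted_class = argmax(scores)

# Made-up labels for four papers; three of the four predictions are correct.
y_true = [0, 12, 39, 5]
y_pred = [0, 12, predicted_class, 5]
acc = accuracy(y_true, y_pred)
print(acc)  # 0.75
```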

Dataset splitting: The authors consider a realistic data split based on the publication dates of the papers. The general setting is that ML models are trained on existing papers and then used to predict the subject areas of newly published papers, which allows them to be applied directly in real-world scenarios, such as helping the arXiv moderators. Specifically, the authors propose to train on papers published in or before 2017, validate on those published in 2018, and test on those published in 2019 or later.
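The time-based split can be sketched as follows, assuming one integer publication year per node and treating the 2017 training cutoff as inclusive. The function name time_split and the toy years are hypothetical; the dataset ships precomputed split files, so this only illustrates the rule.

```python
# Toy sketch of a time-based split: assign each node index to a split by year.

def time_split(years):
    """Return node indices grouped into train/valid/test by publication year."""
    split = {"train": [], "valid": [], "test": []}
    for idx, year in enumerate(years):
        if year <= 2017:      # existing papers: training set
            split["train"].append(idx)
        elif year == 2018:    # validation set
            split["valid"].append(idx)
        else:                 # 2019 or later: test set
            split["test"].append(idx)
    return split

# Made-up publication years for six papers (index = node ID).
years = [2015, 2019, 2018, 2017, 2020, 2018]
split = time_split(years)
print(split)  # {'train': [0, 3], 'valid': [2, 5], 'test': [1, 4]}
```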

Summary

Package	#Nodes	#Edges	Split Type	Task Type	Metric
`ogb>=1.1.1`	169,343	1,166,243	Time	Multi-class classification	Accuracy

Open Graph Benchmark

Website: https://ogb.stanford.edu

The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader, and model performance can be evaluated in a unified manner using the OGB Evaluator.

References

[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
[3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

Disclaimer

I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content of this dataset. For any questions, problems, or issues, please contact the original authors through their website or their GitHub repo.

Tables

Ogbn Master

@kaggle.dataup1_ogbn_arxiv.ogbn_master

5.44 KB
15 rows
6 columns


CREATE TABLE ogbn_master (
  "unnamed_0" VARCHAR,
  "ogbn_proteins" VARCHAR,
  "ogbn_products" VARCHAR,
  "ogbn_arxiv" VARCHAR,
  "ogbn_mag" VARCHAR,
  "ogbn_papers100m" VARCHAR
);