Dataset: Train File Sizes Google Identify Contrails

About this Dataset

This dataset contains file-size metadata for the 225,819 train files of the Google Research - Identify Contrails to Reduce Global Warming challenge.

The file sizes were collected with a simple bash script:

shopt -s globstar dotglob nullglob

for pathname in train/**/*; do
    if [[ -f $pathname ]] && [[ ! -h $pathname ]]; then
        stat -c $'%s\t%n' "$pathname"
    fi
done >train_file_sizes.csv
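
For readers without bash, the same walk can be approximated in Python. This is a minimal sketch, not the method used to build the dataset: the helper names `collect_file_sizes` and `write_tsv` are hypothetical, and symlink handling may differ slightly from the `globstar` behavior of the bash version.

```python
import os
from pathlib import Path


def collect_file_sizes(root):
    """Return (size_in_bytes, path) pairs for regular, non-symlink files under root."""
    rows = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Mirror the bash test: regular file (-f) and not a symlink (! -h).
            if path.is_file() and not path.is_symlink():
                rows.append((path.stat().st_size, path.as_posix()))
    # Sort by path so the output order is deterministic.
    return sorted(rows, key=lambda row: row[1])


def write_tsv(rows, out_path):
    """Write one 'size<TAB>path' line per file, like stat -c $'%s\\t%n'."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for size, path in rows:
            handle.write(f"{size}\t{path}\n")
```

Calling `write_tsv(collect_file_sizes("train"), "train_file_sizes.csv")` from the competition data directory would produce a tab-separated file in the same two-column layout.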

The resulting file was then preprocessed with the following Python code:

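import pandas as pd
# The bash output is tab-separated: size in bytes, then path
train_sizes = pd.read_csv('data/train_file_sizes.csv', sep='\t', names=['file_size', 'file_path'])
# The second path component (train/<record_id>/...) is the record ID
train_sizes['record_id'] = train_sizes.file_path.str.split('/', expand=True)[1].astype(int)
train_sizes.to_csv('data/train_file_sizes.csv', index=False)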
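
To make the `record_id` extraction concrete, here is a small standard-library sketch of the same path split, plus one possible use: summing file sizes per record. The function names and the sample paths are hypothetical and only illustrate the `train/<record_id>/<file>` layout implied by the code above.

```python
from collections import defaultdict


def record_id_from_path(file_path):
    """Return the record ID, i.e. the second component of 'train/<record_id>/<file>'."""
    return int(file_path.split("/")[1])


def total_bytes_per_record(rows):
    """Sum file sizes per record ID from (file_size, file_path) pairs."""
    totals = defaultdict(int)
    for file_size, file_path in rows:
        totals[record_id_from_path(file_path)] += file_size
    return dict(totals)


# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    (10, "train/7/a.npy"),
    (5, "train/7/b.npy"),
    (3, "train/9/a.npy"),
]
totals = total_bytes_per_record(sample_rows)
```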

Tables

Train File Sizes

@kaggle.sergiosaharovskiy_train_file_sizes_google_identify_contrails.train_file_sizes

1.83 MB
225819 rows
3 columns


CREATE TABLE train_file_sizes (
  "file_size" BIGINT,
  "file_path" VARCHAR,
  "record_id" BIGINT
);
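
The schema above maps directly onto a typed record. As a hedged sketch, the preprocessed CSV could be read with the standard library as follows; `TrainFileSize` and `read_train_file_sizes` are hypothetical names, and the two-row sample is made up for illustration.

```python
import csv
import io
from typing import NamedTuple


class TrainFileSize(NamedTuple):
    file_size: int  # BIGINT, size in bytes
    file_path: str  # VARCHAR, path such as train/<record_id>/<file>
    record_id: int  # BIGINT, parsed from file_path


def read_train_file_sizes(handle):
    """Parse a CSV with a header row into TrainFileSize records."""
    reader = csv.DictReader(handle)
    return [
        TrainFileSize(int(row["file_size"]), row["file_path"], int(row["record_id"]))
        for row in reader
    ]


# Hypothetical sample in the same three-column layout as the table.
sample = io.StringIO(
    "file_size,file_path,record_id\n"
    "1024,train/111/a.npy,111\n"
    "2048,train/111/b.npy,111\n"
)
rows = read_train_file_sizes(sample)
```

For the real file, an open file handle (for example `open('data/train_file_sizes.csv', newline='')`) would be passed in place of the in-memory sample.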