Train File Sizes Google Identify Contrails

Exploring the relationship between train file sizes

@kaggle.sergiosaharovskiy_train_file_sizes_google_identify_contrails

About this Dataset

This dataset comprises metadata (file size, file path, and record ID) for 225,819 train files from the Google Research - Identify Contrails to Reduce Global Warming challenge.

The data was obtained with a simple bash script:

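# globstar enables recursive **, dotglob includes hidden files, nullglob drops unmatched patterns
shopt -s globstar dotglob nullglob

for pathname in train/**/*; do
    # Keep regular files only; skip symbolic links
    if [[ -f $pathname ]] && [[ ! -h $pathname ]]; then
        # GNU stat: print size in bytes, a tab, then the file name
        stat -c $'%s\t%n' "$pathname"
    fi
done >train_file_sizes.csv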
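
Where GNU stat or a recent bash is not available, the same scan can be approximated in Python. The following is a minimal sketch under stated assumptions, not the script used to build this dataset: the function name `list_file_sizes` is hypothetical, and the output is sorted by path string, which may differ from the ordering of the bash glob.

```python
import os


def list_file_sizes(root):
    """Return (size_in_bytes, path) pairs for regular files under root.

    Symbolic links are skipped, mirroring the -f and ! -h tests
    in the bash script. The function name is hypothetical.
    """
    rows = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            rows.append((os.path.getsize(path), path))
    # Sort by path so the result is deterministic.
    return sorted(rows, key=lambda row: row[1])


def write_tsv(rows, out_path):
    """Write one 'size<TAB>path' line per file, like the stat output."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for size, path in rows:
            handle.write(f"{size}\t{path}\n")
```

Calling `write_tsv(list_file_sizes("train"), "train_file_sizes.csv")` from the directory that contains `train/` would produce a tab-separated file with the same two columns.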

After running the bash script, the file was preprocessed with the following Python code:

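import pandas as pd
# The bash script writes tab-separated rows, so split on tabs
train_sizes = pd.read_csv('data/train_file_sizes.csv', sep='\t', names=['file_size', 'file_path'])
# The record ID is the second path component: train/<record_id>/<file>
train_sizes['record_id'] = train_sizes.file_path.str.split('/', expand=True)[1].astype(int)
train_sizes.to_csv('data/train_file_sizes.csv', index=False)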
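
To illustrate what the `record_id` column makes possible, the sketch below extracts the record ID from a path and sums file sizes per record. It uses only the standard library so it runs without pandas. The helper names and the sample rows are hypothetical; the sample values are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def record_id_from_path(file_path):
    """Extract the integer record ID from a path shaped like 'train/<record_id>/<file>'."""
    return int(file_path.split("/")[1])


def total_size_per_record(rows):
    """Sum file sizes in bytes per record ID from (file_size, file_path) pairs."""
    totals = defaultdict(int)
    for file_size, file_path in rows:
        totals[record_id_from_path(file_path)] += file_size
    return dict(totals)


# Made-up rows for illustration only; these are not values from the dataset.
sample_rows = [
    (1024, "train/1000/example_a.npy"),
    (2048, "train/1000/example_b.npy"),
    (512, "train/2000/example_a.npy"),
]
print(total_size_per_record(sample_rows))  # {1000: 3072, 2000: 512}
```

With pandas, the equivalent aggregation on the preprocessed table would be a `groupby` on `record_id` followed by a sum of `file_size`.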
