This dataset comprises metadata (file size in bytes and file path) for the 225,819 train files of the Google Research - Identify Contrails to Reduce Global Warming challenge.
The metadata was collected with a simple bash script:
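# Let ** recurse, match dotfiles, and expand to nothing when there is no match
shopt -s globstar dotglob nullglob
for pathname in train/**/*; do
    # Keep regular files only, skipping symbolic links
    if [[ -f $pathname ]] && [[ ! -h $pathname ]]; then
        # Print the size in bytes, a tab, then the path (GNU stat syntax)
        stat -c $'%s\t%n' "$pathname"
    fi
done >train_file_sizes.csv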
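On systems without GNU stat, roughly the same listing could be produced in Python. This is a minimal sketch, not the script used to build the dataset; the helper names `list_file_sizes` and `write_tsv` are hypothetical.

```python
from pathlib import Path


def list_file_sizes(root):
    """Return (size_in_bytes, path) pairs for regular, non-symlink files under root."""
    rows = []
    for path in sorted(Path(root).rglob('*')):
        # Same filter as the bash script: regular files only, no symbolic links
        if path.is_file() and not path.is_symlink():
            rows.append((path.stat().st_size, path.as_posix()))
    return rows


def write_tsv(rows, out_path):
    """Write one 'size<TAB>path' line per file, like the stat output above."""
    with open(out_path, 'w', newline='') as f:
        for size, name in rows:
            f.write(f"{size}\t{name}\n")
```

Calling `write_tsv(list_file_sizes('train'), 'train_file_sizes.csv')` from the directory that contains `train` should give a file in the same tab-separated layout.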
The resulting file was then preprocessed with the following Python code:
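import pandas as pd
# The bash output is tab-separated with no header; delim_whitespace is deprecated in recent pandas
train_sizes = pd.read_csv('data/train_file_sizes.csv', sep='\t', names=['file_size', 'file_path'])
# The second path component (train/<record_id>/...) is the record id
train_sizes['record_id'] = train_sizes.file_path.str.split('/', expand=True)[1].astype(int)
train_sizes.to_csv('data/train_file_sizes.csv', index=False)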
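As a usage illustration, the final CSV (columns `file_size`, `file_path`, `record_id`) can be aggregated per record with the standard library alone. This is a small sketch: the function name `total_size_per_record` is hypothetical, and the sample rows are made up, not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def total_size_per_record(csv_file):
    """Sum file_size per record_id from an open CSV with a header row."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[int(row['record_id'])] += int(row['file_size'])
    return dict(totals)


# Made-up rows in the same layout as train_file_sizes.csv (illustrative values only)
sample = io.StringIO(
    "file_size,file_path,record_id\n"
    "100,train/1/a.npy,1\n"
    "50,train/1/b.npy,1\n"
    "7,train/2/a.npy,2\n"
)
print(total_size_per_record(sample))  # {1: 150, 2: 7}
```

For the real file, pass `open('data/train_file_sizes.csv', newline='')` instead of the sample. With pandas already loaded, `train_sizes.groupby('record_id')['file_size'].sum()` gives the same totals.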