This is a part of the VGGSound dataset: everything related to cats and dogs, converted to 10-second 16 kHz mono wav.
I made it for my university research, because the original dataset is kind of huge :)
There are also two csv files with the train/test split, collected from the VGGSound splits. All data is numbered according to the indexes of the original csv tables.
Each line in the csv files has the following columns:
Index in the original VGGSound (my addition), YouTube ID, start seconds, label, train/test split.
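For illustration, here is a minimal sketch of reading such a table with Python's standard `csv` module. The column names (`orig_index`, `youtube_id`, and so on) are my own labels for the columns listed above, and the sketch assumes the files have no header row and that start seconds are integers. The two sample rows are made up, not real entries.

```python
import csv
import io

# Column names are hypothetical labels; the order follows the column list above.
COLUMNS = ["orig_index", "youtube_id", "start_seconds", "label", "split"]


def read_split(file_obj):
    """Parse a split csv (assumed to have no header row) into a list of dicts."""
    rows = []
    for row in csv.DictReader(file_obj, fieldnames=COLUMNS):
        row["orig_index"] = int(row["orig_index"])
        row["start_seconds"] = int(row["start_seconds"])
        rows.append(row)
    return rows


# Made-up rows for illustration only, not real entries from the dataset.
sample = io.StringIO(
    "12,AAAAAAAAAAA,30,cat meowing,train\n"
    "57,BBBBBBBBBBB,0,dog barking,test\n"
)
rows = read_split(sample)
train_rows = [r for r in rows if r["split"] == "train"]
```

With a real file you would pass `open(path, newline="")` instead of the in-memory sample.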
Also, some of the video links in the tables (~800 of them) lead to unavailable videos (age-restricted, deleted, etc.). These were not downloaded and are therefore not here, so there is no audio for some indexes.
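Because of these gaps, it can help to keep only the indexes that actually have a file on disk before building a training list. The sketch below assumes each clip is stored as `<index>.wav` in a single folder; that naming pattern and the helper name `indexes_with_audio` are assumptions for illustration, so adjust `pattern` to match the real file names.

```python
from pathlib import Path


def indexes_with_audio(indexes, audio_dir, pattern="{}.wav"):
    """Return only the indexes whose audio file exists in audio_dir.

    `pattern` is a hypothetical naming scheme; change it to the real one.
    """
    audio_dir = Path(audio_dir)
    return [i for i in indexes if (audio_dir / pattern.format(i)).is_file()]
```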
An example of practical use of the dataset can be found in my VQ-VAE 2 notebook 👨💻.
I've also got this helper notebook 🐱🐶, which shows some simple things you can do with the audio data. In particular:
- Some of the audio files are 9 seconds long – how to pad them
- How to prepare spectrograms so they can be used as regular pictures
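As a rough idea of what both steps can look like, here is a NumPy-only sketch. It is not necessarily what the helper notebook does: zero-padding at the end, the STFT settings (`n_fft=512`, `hop=256`) and the log scaling to 0-255 are all my assumptions, and a synthetic tone stands in for a real clip.

```python
import numpy as np

SAMPLE_RATE = 16_000            # the clips are 16 kHz mono
TARGET_LEN = 10 * SAMPLE_RATE   # 10 seconds


def pad_to_length(wave, target_len=TARGET_LEN):
    """Zero-pad at the end (or truncate) a 1-D waveform to target_len samples."""
    wave = np.asarray(wave, dtype=np.float32)[:target_len]
    return np.pad(wave, (0, target_len - len(wave)))


def log_spectrogram_image(wave, n_fft=512, hop=256):
    """Return a (frequency, time) uint8 array usable as a grayscale picture."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)
    log_mag = np.log1p(magnitude)                      # compress the dynamic range
    span = log_mag.max() - log_mag.min()
    scaled = (log_mag - log_mag.min()) / (span if span > 0 else 1.0)
    return (scaled * 255).astype(np.uint8)


# Synthetic 9-second 440 Hz tone standing in for one of the short clips.
t = np.arange(9 * SAMPLE_RATE) / SAMPLE_RATE
short_clip = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

padded = pad_to_length(short_clip)      # now exactly 10 seconds long
image = log_spectrogram_image(padded)   # 2-D uint8 array
```

Padding everything to the same length first means every spectrogram gets the same width, which makes it easier to batch them like ordinary images.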
And umm, I could not figure out how to do a proper citation, but here is the one from the original VGGSound:
@InProceedings{Chen20,
author = "Honglie Chen and Weidi Xie and Andrea Vedaldi and Andrew Zisserman",
title = "VGGSound: A Large-scale Audio-Visual Dataset",
booktitle = "International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
year = "2020",
}