Favicons
Image data and metadata for 360,000 favicons scraped from popular websites
@kaggle.colinmorris_favicons
Favicons are the (usually tiny) image files that browsers may use to represent websites in tabs, in the URL bar, or for bookmarks. Kaggle, for example, uses an image of a blue lowercase "k" as its favicon. This dataset contains about 360,000 favicons from popular websites.
These favicons were scraped in July 2016. I wrote a crawler that went through Alexa's top 1 million sites, and made a request for 'favicon.ico' at the site root. If I got a 200 response code, I saved the result as ${site_url}.ico. For domains that were identical but for the TLD (e.g. google.com, google.ca, google.jp...), I scraped only one favicon. My scraping/cleaning code is on GitHub.
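The crawl described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual crawler from the GitHub repository. The function names, the use of plain http, the timeout, and the "everything before the first dot" rule for collapsing TLD variants are all assumptions made for the example.

```python
import http.client
import urllib.request
from pathlib import Path


def favicon_url(site):
    """Build the URL for favicon.ico at the root of a site (http is an assumption)."""
    return f"http://{site}/favicon.ico"


def dedupe_key(site):
    """Naive key: the part before the first dot, so google.com and google.ca collide.

    This is an assumed rule for illustration. It mishandles subdomains
    (www.example.com) and is not necessarily what the original crawler did.
    """
    return site.lower().split(".", 1)[0]


def unique_sites(sites):
    """Keep the first site seen for each dedupe key, preserving input order."""
    seen = set()
    kept = []
    for site in sites:
        key = dedupe_key(site)
        if key not in seen:
            seen.add(key)
            kept.append(site)
    return kept


def fetch_favicon(site, timeout=10.0):
    """Return the response body if the server answers 200, otherwise None."""
    try:
        with urllib.request.urlopen(favicon_url(site), timeout=timeout) as resp:
            if resp.status == 200:
                return resp.read()
    except (OSError, ValueError, http.client.HTTPException):
        # Covers DNS failures, timeouts, HTTP error codes, and malformed URLs.
        pass
    return None


def crawl(sites, out_dir):
    """Fetch one favicon per dedupe key and save it as <site>.ico. Returns the count saved."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for site in unique_sites(sites):
        body = fetch_favicon(site)
        if body is not None:
            (out / f"{site}.ico").write_bytes(body)
            saved += 1
    return saved
```

Note that a 200 response says nothing about the body being an image, which is why a filtering pass is still needed afterwards.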
Of 1m sites crawled, 540k responded with a 200 code. The dataset has 360k images, which were the remains after filtering out, among other things, files that were not images according to the file command (-40k). These mostly had type HTML, ASCII, or UTF-*.
The remaining files are exactly as I received them from the site. They are mostly ICO files, with the most common sizes being 16x16, 32x32, and 48x48. But there's a long tail of more exotic formats and sizes (there is at least one person living among us who thought that 88x31 was a fine size for a favicon).
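To give a feel for how non-image responses can be spotted, here is a small sketch that checks the leading "magic" bytes of a file. It is a simplified stand-in for the file command, which consults a far larger signature database. The function name and the short list of formats are choices made for this example, not part of the dataset's tooling.

```python
# Leading byte signatures for a few common image formats.
MAGIC_NUMBERS = [
    (b"\x00\x00\x01\x00", "ico"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"BM", "bmp"),
]


def sniff_image_type(data):
    """Return a short format name if the bytes start with a known signature, else None."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return None
```

An HTML error page served with a 200 code, for example, would come back as None here and could be dropped.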
The favicon files are divided among 6 zip files, full-0.zip, full-1.zip... full-5.zip. (If you wish to download the full dataset as a single tarball, you can do so from the Internet Archive.)
favicon_metadata.csv is a csv file with one row per favicon in the dataset. The split_index says which of the zip files the image landed in. For an example of loading and interacting with particular favicons in a kernel context, check out the Favicon helper functions kernel.
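As a rough sketch of how split_index can be used to locate a favicon, the snippet below maps a metadata row to its zip file and reads the image bytes. Only split_index and the full-N.zip naming come from the description above. The fname column name and the assumption that zip members can be matched by base filename are hypothetical, so check the real CSV header and archive layout before relying on them.

```python
import csv
import zipfile
from pathlib import Path, PurePosixPath


def zip_name_for(split_index):
    """Map a split_index value (int or numeric string) to its zip file name."""
    return f"full-{int(split_index)}.zip"


def load_metadata(csv_path):
    """Read the metadata CSV into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def read_favicon(row, data_dir, name_column="fname"):
    """Return the raw bytes of the favicon described by one metadata row.

    name_column is a hypothetical column holding the image file name.
    Members are matched on base name because the directory layout inside
    the zips is not specified here.
    """
    zip_path = Path(data_dir) / zip_name_for(row["split_index"])
    wanted = row[name_column]
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            if PurePosixPath(member).name == wanted:
                return zf.read(member)
    raise KeyError(f"{wanted} not found in {zip_path.name}")
```

Scanning namelist() for every lookup is slow for bulk access. When reading many images, it would be better to open each zip once and build a name-to-member dictionary.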
As mentioned above, the full dataset is a dog's breakfast of different file formats and dimensions. I've created 'standardized' subsets of the data that may be easier to work with (particularly for machine learning applications, where it's necessary to have fixed dimensions).
16_16.tar.gz is a tarball containing all 16x16 favicons in the dataset, converted to PNG. It has 290k images. ICO is a container format, and many of the ico files in the raw dataset contain several versions of the same favicon at different resolutions. 16x16 favicons that were stuffed together in an ICO file with images of other sizes are included in this set. But I did no resizing - if a favicon has no 'native' 16x16 version, it isn't in this set.
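Because ICO is a container, checking whether a raw file has a 'native' 16x16 image means reading its directory. The sketch below parses the standard ICO header (a 6-byte header followed by one 16-byte entry per image) and lists the declared sizes. It illustrates the format only and is not the conversion code used to build this subset. The function names are invented for the example, and it trusts the sizes declared in the directory rather than decoding the images.

```python
import struct


def ico_image_sizes(data):
    """Return a list of (width, height) pairs declared in an ICO file's directory."""
    if len(data) < 6:
        raise ValueError("too short to be an ICO file")
    # Header: reserved (must be 0), type (1 = icon), number of images.
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind != 1:
        raise ValueError("not an ICO file")
    sizes = []
    for i in range(count):
        offset = 6 + 16 * i
        if offset + 16 > len(data):
            raise ValueError("truncated ICO directory")
        # The first two bytes of each entry are width and height; 0 means 256.
        width, height = struct.unpack_from("<BB", data, offset)
        sizes.append((width or 256, height or 256))
    return sizes


def has_native_16x16(data):
    """True if the ICO directory lists at least one 16x16 image."""
    return (16, 16) in ico_image_sizes(data)
```

Files in the raw dataset that are really PNG or GIF despite the .ico name will raise ValueError here and need separate handling.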
16_16_distinct.tar.gz is identical to the above, but with 70k duplicate or near-duplicate images removed. There are a small number of commonly repeated favicons like the Blogger "B" that occur thousands of times, which could be an annoyance depending on the use case - e.g. a generative model might get stuck in a local maximum of spitting out Blogger Bs.
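The description does not say how duplicates were detected, so the following is only a sketch of the simplest part of the problem: dropping byte-identical files by hashing their contents. It does not catch near-duplicates, nor images with identical pixels but different encodings, both of which would need comparison on decoded pixel data. The function name and the (name, bytes) input shape are assumptions for the example.

```python
import hashlib


def drop_exact_duplicates(images):
    """Given an iterable of (name, data) pairs, keep the first name for each distinct content.

    Only byte-identical files are treated as duplicates.
    """
    seen = set()
    kept = []
    for name, data in images:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept
```

For something like the repeated Blogger "B", this would collapse every byte-identical copy down to a single representative.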
Alexa's top 1-million list includes 'adult' sites, so some URLs and favicons may be NSFW or offensive. (It's pretty hard to make a credible depiction of nudity in 256 pixels, but there are some occasional attempts.)
I hope this dataset might be especially useful for small-scale deep learning experiments. Scaling photographs down to 16x16 would render many of them unintelligible, but these favicons were born tiny. The 16_16 fold has more instances than MNIST, and the images are even smaller! (Though, unlike MNIST, most of the images in this dataset are not grayscale.)
If you liked this, you should also check out the recently released Large Logo Dataset. They've currently made available 550k favicons resized to 32x32. Their data was collected more recently, and their scraping process was more robust, so their dataset should probably be preferred (though you might still want to use this one if you need the raw favicon files, or if you prefer to use 16x16 non-resized images).