Favicons

Image data and metadata for 360,000 favicons scraped from popular websites

@kaggle.colinmorris_favicons

About this Dataset

Context

Favicons are the (usually tiny) image files that browsers may use to represent websites in tabs, in the URL bar, or for bookmarks. Kaggle, for example, uses an image of a blue lowercase "k" as its favicon. This dataset contains about 360,000 favicons from popular websites.

Content and Acknowledgements

These favicons were scraped in July 2016. I wrote a crawler that went through Alexa's top 1 million sites and requested 'favicon.ico' at each site's root. If I got a 200 response code, I saved the result as ${site_url}.ico. For domains that were identical except for the TLD (e.g. google.com, google.ca, google.jp...), I scraped only one favicon. My scraping/cleaning code is on GitHub.
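
As a rough illustration of the crawl logic described above (not the actual crawler code), here is a minimal Python sketch. The function names, the use of plain http, the timeout, and the naive "drop the last label" rule for treating domains as identical but for the TLD are all assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Iterable, List, Optional


def favicon_url(domain: str) -> str:
    """URL of favicon.ico at the site root (the http scheme is an assumption)."""
    return f"http://{domain}/favicon.ico"


def output_name(domain: str) -> str:
    """File name following the ${site_url}.ico pattern."""
    return f"{domain}.ico"


def name_without_tld(domain: str) -> str:
    """Naive key: drop the last dot-separated label.

    Assumption: multi-part suffixes such as co.uk are not handled.
    """
    return domain.rsplit(".", 1)[0]


def dedupe_by_tld(domains: Iterable[str]) -> List[str]:
    """Keep only the first domain seen for each name, preserving order."""
    seen = set()
    kept = []
    for domain in domains:
        key = name_without_tld(domain)
        if key not in seen:
            seen.add(key)
            kept.append(domain)
    return kept


def fetch_favicon(domain: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the response body on HTTP 200, otherwise None.

    Note: urlopen follows redirects, so the 200 may come from a redirected URL.
    """
    try:
        with urllib.request.urlopen(favicon_url(domain), timeout=timeout) as resp:
            return resp.read() if resp.status == 200 else None
    except (urllib.error.URLError, OSError):
        return None
```

With this sketch, dedupe_by_tld(["google.com", "google.ca", "kaggle.com"]) keeps google.com and kaggle.com, and a successful fetch_favicon("kaggle.com") would be written to kaggle.com.ico.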

Of the 1 million sites crawled, about 540k responded with a 200 code. The dataset has the roughly 360k images that remained after filtering out:

  • empty files (-140k)
  • non-image files, according to the file command (-40k). These mostly had type HTML, ASCII, or UTF-*.
  • corrupt/malformed image files - i.e. those that were sufficiently messed up that ImageMagick failed to parse them. (-1k)
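
The first two filtering steps above can be sketched in Python as follows. This is only an approximation: it checks a few well-known magic-byte signatures as a stand-in for the file command, and it does not attempt the ImageMagick parse check at all. The classify function and the signature list are assumptions made for the example.

```python
# Leading byte signatures of a few common image formats.
IMAGE_MAGICS = (
    b"\x00\x00\x01\x00",   # ICO
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
)


def classify(data: bytes) -> str:
    """Label a downloaded file as 'empty', 'image', or 'non-image'.

    Stand-in for the `file` command: only the leading bytes are inspected.
    """
    if len(data) == 0:
        return "empty"
    if data.startswith(IMAGE_MAGICS):
        return "image"
    return "non-image"


# Tally of the counts quoted above (all rounded to the nearest thousand).
remaining = 540_000 - 140_000 - 40_000 - 1_000  # 359_000, i.e. about 360k
```

An HTML error page served with a 200 code, for instance, would come back as 'non-image' here.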

The remaining files are exactly as I received them from the site. They are mostly ICO files, with the most common sizes being 16x16, 32x32, and 48x48. But there's a long tail of more exotic formats and sizes (there is at least one person living among us who thought that 88x31 was a fine size for a favicon).

The favicon files are divided among 6 zip files: full-0.zip, full-1.zip, ..., full-5.zip. (If you wish to download the full dataset as a single tarball, you can do so from the Internet Archive.)

favicon_metadata.csv is a CSV file with one row per favicon in the dataset. The split_index column says which of the zip files the image landed in. For an example of loading and interacting with particular favicons in a kernel context, check out the Favicon helper functions kernel.
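
As a small illustration of how split_index maps to the zip files, here is a hedged Python sketch that groups favicons by the archive they should be in. Only split_index is named in this description; the "fname" column used below is a hypothetical placeholder, so check the real header of favicon_metadata.csv before relying on it.

```python
import csv
import io
from typing import Dict, List, TextIO


def zip_name(split_index: int) -> str:
    """Archive name for a given split_index (full-0.zip ... full-5.zip)."""
    return f"full-{split_index}.zip"


def index_by_zip(csv_file: TextIO, name_column: str = "fname") -> Dict[str, List[str]]:
    """Group favicon names by the zip file that holds them.

    `name_column` is a hypothetical column name used for illustration.
    """
    by_zip: Dict[str, List[str]] = {}
    for row in csv.DictReader(csv_file):
        archive = zip_name(int(row["split_index"]))
        by_zip.setdefault(archive, []).append(row[name_column])
    return by_zip


# Tiny made-up sample standing in for favicon_metadata.csv.
sample_csv = "fname,split_index\nkaggle.com.ico,0\nexample.org.ico,3\n"
grouped = index_by_zip(io.StringIO(sample_csv))
```

For the real file, the same function could be called on open("favicon_metadata.csv", newline="").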

As mentioned above, the full dataset is a dog's breakfast of different file formats and dimensions. I've created 'standardized' subsets of the data that may be easier to work with (particularly for machine learning applications, where it's necessary to have fixed dimensions).

16_16.tar.gz is a tarball containing all 16x16 favicons in the dataset, converted to PNG. It has 290k images. ICO is a container format, and many of the ico files in the raw dataset contain several versions of the same favicon at different resolutions. 16x16 favicons that were stuffed together in an ICO file with images of other sizes are included in this set. But I did no resizing - if a favicon has no 'native' 16x16 version, it isn't in this set.
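
To make the "container format" point concrete, here is a minimal Python sketch that reads the directory at the start of an ICO file and lists the image sizes it declares. It inspects only the 6-byte header and the 16-byte directory entries; it does not decode pixels or convert anything to PNG, and it is not the code used to build this subset. The function names are made up for the example.

```python
import struct
from typing import List, Tuple


def ico_sizes(data: bytes) -> List[Tuple[int, int]]:
    """Return the (width, height) declared by each entry in an ICO directory."""
    # ICONDIR header: reserved (must be 0), type (1 = icon), image count.
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind != 1:
        raise ValueError("not an ICO file")
    sizes = []
    for i in range(count):
        # Each directory entry is 16 bytes; the first two are width and height.
        width, height = struct.unpack_from("<BB", data, 6 + 16 * i)
        # A stored value of 0 means 256 pixels.
        sizes.append((width or 256, height or 256))
    return sizes


def has_native_16x16(data: bytes) -> bool:
    """True if the ICO declares at least one 16x16 image."""
    return (16, 16) in ico_sizes(data)
```

An ICO holding 16x16 and 32x32 versions would pass has_native_16x16, while one holding only 32x32 and 48x48 would not, which matches the "no resizing" rule described above.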

16_16_distinct.tar.gz is identical to the above, but with 70k duplicate or near-duplicate images removed. There are a small number of commonly repeated favicons like the Blogger "B" that occur thousands of times, which could be an annoyance depending on the use case - e.g. a generative model might get stuck in a local maximum of spitting out Blogger Bs.
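
For readers who want to do something similar on their own subset, here is a hedged Python sketch of the simplest form of deduplication: dropping byte-for-byte identical files by content hash. It is not the method used to build 16_16_distinct.tar.gz and it does not catch near-duplicates; the function names are made up for the example.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def dedupe_exact(images: Dict[str, bytes]) -> Dict[str, bytes]:
    """Keep the first file for each distinct content hash (exact duplicates only)."""
    seen = set()
    kept: Dict[str, bytes] = {}
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[name] = data
    return kept


def most_repeated(images: Dict[str, bytes], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most common content hashes with their counts."""
    counts = Counter(hashlib.sha256(data).hexdigest() for data in images.values())
    return counts.most_common(n)
```

most_repeated gives a quick way to spot heavily repeated icons, such as a platform default, before deciding whether to drop them.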

Alexa's top 1-million list includes 'adult' sites, so some URLs and favicons may be NSFW or offensive. (It's pretty hard to make a credible depiction of nudity in 256 pixels, but there are some occasional attempts.)

Inspiration

I hope this dataset might be especially useful for small-scale deep learning experiments. Scaling photographs down to 16x16 would render many of them unintelligible, but these favicons were born tiny. The 16_16 subset has more instances than MNIST, and the images are even smaller! (Though, unlike MNIST, most of the images in this dataset are not grayscale.)

If you liked this, you should also check out the recently released Large Logo Dataset. They've currently made available 550k favicons resized to 32x32. Their data was collected more recently, and their scraping process was more robust, so their dataset should probably be preferred (though you might still want to use this one if you need the raw favicon files, or if you prefer to use 16x16 non-resized images).
