Google Safe Browsing Transparency Report Data

Data automatically scraped from the Google Safe Browsing Transparency Report.

@kaggle.robroseknows_google_safe_browsing_transparency_report_data

About this Dataset

Context

I made this as a potential helper dataset for the Microsoft Malware Prediction competition. I was also inspired by Kaggle's new ability to create datasets from the outputs of Kernels, which I leveraged here.

Content

This dataset contains all of the data published on the Google Safe Browsing Transparency Report web page. There is plenty of missing data: sometimes the source data doesn't start for a while, and there are periodic gaps for unspecified reasons. It's up to you to decide how to handle those gaps. The reinfection rate has been multiplied by 100 and converted to an int so that it reads as a percentage.
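As a rough illustration of the two points above, here is a minimal Python sketch of the percentage encoding and of one possible way to handle gaps. The function names and the sample values are hypothetical and are not taken from the dataset or the source kernel. Rounding in `to_percent_int` is an assumption, since the description only says the rate was multiplied by 100 and converted to an int. Forward-filling is just one option among many.

```python
from typing import List, Optional


def to_percent_int(rate: float) -> int:
    """Encode a fractional rate (e.g. 0.27) as an integer percentage (27).

    Assumption: the value is rounded. The dataset description does not say
    whether the source kernel rounds or truncates.
    """
    return int(round(rate * 100))


def to_fraction(percent: Optional[int]) -> Optional[float]:
    """Decode an integer percentage back to a fraction, keeping gaps as None."""
    return None if percent is None else percent / 100


def forward_fill(values: List[Optional[int]]) -> List[Optional[int]]:
    """Fill each gap with the most recent known value.

    Leading gaps (before the source data starts) stay None, because there
    is no earlier value to carry forward.
    """
    filled: List[Optional[int]] = []
    last: Optional[int] = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical weekly reinfection percentages with a late start and one gap.
weekly_reinfection_pct = [None, None, 27, None, 31, 30]
filled = forward_fill(weekly_reinfection_pct)
fractions = [to_fraction(value) for value in filled]
```

Whether to fill, interpolate, or drop the missing rows depends on what you are modelling, so treat `forward_fill` as a starting point rather than a recommendation.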

Acknowledgements

Thanks to @rquintino for publishing the splits for the Microsoft competition that originally inspired me to gather this data, and to @cdeotte, who originally published some scraped datasets in the Microsoft competition; see this discussion post for details.

Inspiration

I hope some people find this useful, whether for the Microsoft challenge or any future challenges! Please leave an upvote here or on the source kernel if you found it useful! I plan to rerun the source kernel weekly on Fridays. I hope Kaggle will eventually enable some way to automate that, but for now I do it manually. If the data is stale, feel free to ping me in the discussions section or on the source kernel and I'll run it.
