USA News Dataset by Kaggle | Media and Entertainment

About this Dataset

USA News Dataset

Problem Description

Construct two types of models: (A) a deep learning classifier, such as an LSTM or a similar model, that predicts the category of a news article from its title and abstract, and (B) a recommendation system that recommends the posts a user is most likely to click.

The dataset consists of two files -- (1) user_news_clicks.csv, and (2) news_text.csv.

Model A, the deep learning classifier, requires only the news_text.csv dataset. The goal is to predict the ‘category’ label from the ‘title’ and ‘abstract’ columns. Model B, the recommendation system, requires only user_news_clicks.csv; you may also use news_text.csv if you like, but it is not necessary for this exercise. The goal is to recommend news articles that users are likely to click.

Data Description

In news_text.csv - each record consists of three attributes and a target variable:

category - Target label. The dataset contains many news categories, but this exercise uses only 3 of them: news, sports, and finance
news_id - Identification number of the news
title - Title of the news
abstract - Abstract of the news
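
As a minimal sketch of preparing input for Model A, the snippet below keeps only the three target categories and joins the title and abstract into one text field. It uses only the standard library. The sample rows, the `lifestyle` category, and the helper name `load_examples` are hypothetical, and the handling of empty abstracts is an assumption rather than something the dataset description states.

```python
import csv
import io

# Hypothetical rows standing in for news_text.csv; the real file has the
# columns news_id, title, abstract, category.
SAMPLE_CSV = """news_id,title,abstract,category
N1,Stocks rally,Markets closed higher on Friday.,finance
N2,Team wins final,The home side won in overtime.,sports
N3,New recipe ideas,Five quick dinners.,lifestyle
N4,Storm warning,,news
"""

KEEP = {"news", "sports", "finance"}


def load_examples(handle):
    """Return (text, label) pairs for the three target categories."""
    examples = []
    for row in csv.DictReader(handle):
        if row["category"] not in KEEP:
            continue
        # Assumption: an abstract may be empty, so join only non-empty parts.
        parts = [row["title"].strip(), (row["abstract"] or "").strip()]
        text = " ".join(p for p in parts if p)
        examples.append((text, row["category"]))
    return examples


examples = load_examples(io.StringIO(SAMPLE_CSV))
```

To read the real file, the same function could be given `open("news_text.csv", newline="", encoding="utf-8")` instead of the in-memory sample.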

In user_news_clicks.csv - each record consists of two attributes and a target variable:

click - Whether the user clicked the article
user_id - Identification number of the user
item - Identification number of an item
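
One simple starting point for Model B, before any deep learning, is a popularity baseline built directly from these three columns. The sketch below assumes `click` is 1 for a click and 0 otherwise; the sample rows and function names are hypothetical. It is a baseline to compare against, not the recommender the exercise asks for.

```python
from collections import defaultdict

# Hypothetical (user_id, item, click) rows standing in for user_news_clicks.csv.
ROWS = [
    ("U1", "N1", 1), ("U1", "N2", 0),
    ("U2", "N1", 1), ("U2", "N3", 1),
    ("U3", "N2", 0), ("U3", "N3", 1), ("U3", "N4", 0),
]


def clicked_items(rows):
    """Map each user_id to the set of items that user clicked."""
    clicked = defaultdict(set)
    for user_id, item, click in rows:
        if click == 1:
            clicked[user_id].add(item)
    return dict(clicked)


def popularity_ranking(rows):
    """Rank items by total clicks, most clicked first (ties broken by item id)."""
    counts = defaultdict(int)
    for _, item, click in rows:
        counts[item] += click
    return sorted(counts, key=lambda item: (-counts[item], item))


def recommend(user_id, rows, k=2):
    """Recommend the k most popular items the user has not already clicked."""
    seen = clicked_items(rows).get(user_id, set())
    return [item for item in popularity_ranking(rows) if item not in seen][:k]
```

For example, `recommend("U1", ROWS)` skips `N1`, which U1 already clicked, and returns the next most clicked items.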

Goals

Design the deep learning classifier and the recommendation system models
Build and train the models using a Python deep learning library such as TensorFlow or PyTorch
Test the models’ performance using a set of metrics
Report on the performance of the models
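
The description does not name specific metrics. As one possible choice, the sketch below computes macro-averaged F1 for the classifier and hit rate at k for the recommender, using plain Python. Both metric choices and both function names are assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def hit_rate_at_k(recommended, clicked, k):
    """Fraction of users with at least one clicked item in their top-k list."""
    hits = sum(
        1
        for user, items in recommended.items()
        if set(items[:k]) & clicked.get(user, set())
    )
    return hits / len(recommended)
```

Here `recommended` maps each user id to a ranked list of items, and `clicked` maps each user id to the set of held-out items that user actually clicked.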

Instructions

Read about the dataset at
https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md

NOTE: We do not need to use the entire dataset if resources are limited. Feel free to sample.
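
One way to sample the click data is by user rather than by row, so that each retained user keeps a complete click history. This is a sketch under that assumption; the function name `sample_users` and the `(user_id, item, click)` row layout are illustrative choices.

```python
import random


def sample_users(rows, fraction, seed=0):
    """Keep all rows belonging to a random subset of users."""
    users = sorted({user_id for user_id, _, _ in rows})
    if not users:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_keep = max(1, round(len(users) * fraction))
    kept = set(rng.sample(users, n_keep))
    return [row for row in rows if row[0] in kept]
```

Row-level sampling would be simpler, but it thins out every user's history, which may hurt a recommender more than dropping whole users.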

For Model A, use only the top 3 categories (news, sports, and finance) for model training and validation.
Code and build models A and B using a Python library such as PyTorch or TensorFlow
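
For Model A, a minimal PyTorch sketch of an LSTM classifier might look like the following. The class name `NewsLSTM`, the layer sizes, the vocabulary size, and the random token ids are all hypothetical; tokenization and vocabulary building are assumed to happen beforehand and are not shown. It runs a single training step on fake data to show how the pieces connect.

```python
import torch
from torch import nn


class NewsLSTM(nn.Module):
    """Embedding -> LSTM -> linear layer over the final hidden state."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=3):
        super().__init__()
        # Index 0 is reserved for padding in this sketch.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return self.classifier(hidden[-1])     # (batch, num_classes)


torch.manual_seed(0)
model = NewsLSTM(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 20))  # 4 fake articles, 20 token ids each
labels = torch.tensor([0, 1, 2, 0])      # class ids for news, sports, finance

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step: forward pass, loss, backward pass, parameter update.
logits = model(batch)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With real, variable-length text, the final hidden state would also absorb padded positions; packing the batch with `torch.nn.utils.rnn.pack_padded_sequence` is one common way to avoid that.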

Tables

News Text

@kaggle.vinayakshanawad_us_news_dataset.news_text

10.01 MB
51282 rows
4 columns


CREATE TABLE news_text (
  "news_id" VARCHAR,
  "title" VARCHAR,
  "abstract" VARCHAR,
  "category" VARCHAR
);

User News Clicks

@kaggle.vinayakshanawad_us_news_dataset.user_news_clicks

23.97 MB
10951083 rows
3 columns


CREATE TABLE user_news_clicks (
  "user_id" VARCHAR,
  "item" VARCHAR,
  "click" BIGINT
);