Netflix Data: Cleaning, Analysis And Visualization
Cleaning and Visualization with Pgsql and Tableau
@kaggle.ariyoomotade_netflix_data_cleaning_analysis_and_visualization
Cleaning and Visualization with Pgsql and Tableau
@kaggle.ariyoomotade_netflix_data_cleaning_analysis_and_visualization
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original contents. This dataset is a cleaned version of the original version which can be found here. The data consist of contents added to Netflix from 2008 to 2021. The oldest content is as old as 1925 and the newest as 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here .
We are going to:
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
COUNT(*) FILTER (WHERE date_added IS NULL) AS date_addes_nulls,
COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS.
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
The director column nulls is about 30% of the whole column, therefore I will not delete them. I will rather find another column to populate it. To populate the director column, we want to find out if there is relationship between movie_cast column and director column
-- Below, we find out if some directors are likely to work with particular cast
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate NULL rows in directors
using their record with movie_cast
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--When I was doing this, I found a less complex and faster way to populate a column which I will use next
Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column with the director column
--Populate the country using the director column
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--To confirm if there are still directors linked to country that refuse to update
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
The date_added rows nulls is just 10 out of over 8000 rows, deleting them cannot affect our analysis or visualization
--Show date_added nulls
SELECT show_id, date_added
FROM netflix_clean
WHERE date_added IS NULL;
--DELETE nulls
DELETE FROM netflix
WHERE show_id
IN ('6797', 's6067', 's6175', 's6807', 's6902', 's7255', 's7197', 's7407', 's7848', 's8183');
rating nulls is 4. Delete them
--Show rating NULLS
SELECT show_id, rating
FROM netflix_clean
WHERE date_added IS NULL;
--Delete the nulls, and show deleted fields
DELETE FROM netflix
WHERE show_id
IN (SELECT show_id FROM netflix WHERE rating IS NULL)
RETURNING *;
--duration nulls is 4. Delete them
DELETE FROM netflix
WHERE show_id
IN (SELECT show_id FROM netflix WHERE duration IS NULL);
Now run the query to show the number of nulls in each column to confirm if there are still nulls. After this, run the query to confirm the row number in each column is the same
--Check to confirm the number of rows are the same(NO NULL)
SELECT count(*) filter (where show_id IS NOT NULL) AS showid_nulls,
count(*) filter (where type IS NOT NULL) AS type_nulls,
count(*) filter (where title IS NOT NULL) AS title_nulls,
count(*) filter (where director IS NOT NULL) AS director_nulls,
count(*) filter (where country IS NOT NULL) AS country_nulls,
count(*) filter (where date_added IS NOT NULL) AS date_addes_nulls,
count(*) filter (where release_year IS NOT NULL) AS release_year_nulls,
count(*) filter (where rating IS NOT NULL) AS rating_nulls,
count(*) filter (where duration IS NOT NULL) AS duration_nulls,
count(*) filter (where listed_in IS NOT NULL) AS listed_in_nulls
FROM netflix;
--Total number of rows are the same in all columns
We can drop the description and movie_cast column because they are not needed for our analysis or visualization task.
--DROP unneeded columns
ALTER TABLE netflix
DROP COLUMN movie_cast,
DROP COLUMN description;
Some of the rows in country column has multiple countries, for my visualization, I only need one country per row to make my map visualization clean and easy. Therefore, I am going to split the country column and retain the first country by the left which I believe is the original country of the movie
SELECT *,
SPLIT_PART(country,',',1) AS countryy,
SPLIT_PART(country,',',2),
SPLIT_PART(country,',',4),
SPLIT_PART(country,',',5),
SPLIT_PART(country,',',6),
SPLIT_PART(country,',',7),
SPLIT_PART(country,',',8),
SPLIT_PART(country,',',9),
SPLIT_PART(country,',',10)
FROM netflix;
-- NOW lets update the table
ALTER TABLE netflix
ADD country1 varchar(500);
UPDATE netflix
SET country1 = SPLIT_PART(country, ',', 1);
--This will create a column named country1 and Update it with the first split country.
Delete the country column that has multiple country entries
--Delete column
ALTER TABLE netflix
DROP COLUMN country;
Rename the country1 column to country
ALTER TABLE netflix
RENAME COLUMN country1 TO country;
After cleaning, the dataset is set for some analysis and visualization with Tableau.
Note: In the visualization captions, Contents means Movies and TV shows, and Content may either mean Movie or TV Show.
Sheet 1. Content type in percentage
This first sheet shows the two categories of content in the dataset which are Movie and Tv show.
Sheet 2. Movie & TV Show by Country
This shows the the total amount of Movies and Tv shows per country within the given period of time(2008 - 2021). This can be noted by the size of the coloured circle in the map.
To give an alternate and a clearer view. Movie & TV shows by country bar chart is below
Sheet 3. Number of Contents Added through the Years
This time series chart shows the total number of contents added to Netflix all through the given years (2008 - 2021)
Sheet 4. Top Directors
This chart shows the top 10 directors with most contents on Netflix. This char shows the count of Movie and Tv shows in their catalouge.
Sheet 5. Top Genres
This chart shows the genres with the highest numbers on Netflix.
Sheet 6. Top Ratings
Rating is a system to rate motion picture's suitability for certain audiences based on its content. This chart shows the top ratings on Netflix
-We can note that most contents on Netflix are rated TV-MA. TV-MA in the United States by the TV Parental Guidelines signifies content for mature audiences.
Sheet 7. Oldest Contents on Netflix by Content Release year
This table shows the 10 oldest movies and tv shows on Netflix
Sheet 8. Content Types over the Years
This line chart compares the Movie and Tv shows contents added to Netflix all through the years.
Sheet 9. Release Years with Highest Contents
This chart shows the Movies and Tv shows production year which has with highest contents on Netflix. We focus on the top 10 release year/production year.
-We can see that from 2012 to 2018, Netflix added most recent contents, they made sure most recent contents per release year are higher than the older release year contents. Then in 2019, it started dropping, this may be due to the Covid-19, but further analysis may be needed to determine this.
And with this, I have come to the end of this exercise. As I said this is just an exercise to test my skills as I look forward to be better. Thanks for following through. Cheers!
Anyone who has the link will be able to view this.