Baselight

Netflix Data: Cleaning, Analysis And Visualization

Cleaning and Visualization with Pgsql and Tableau

@kaggle.ariyoomotade_netflix_data_cleaning_analysis_and_visualization

About this Dataset

Netflix Data: Cleaning, Analysis And Visualization

Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original contents. This dataset is a cleaned version of the original version which can be found here. The data consist of contents added to Netflix from 2008 to 2021. The oldest content is as old as 1925 and the newest as 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here .

Data Cleaning

We are going to:

  1. Treat the Nulls
  2. Treat the duplicates
  3. Populate missing rows
  4. Drop unneeded columns
  5. Split columns
    Extra steps and more explanation on the process will be explained through the code comments
--View dataset

SELECT * 
FROM netflix;

--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                                                   
SELECT show_id, COUNT(*)                                                                                                                                                                            
FROM netflix 
GROUP BY show_id                                                                                                                                                                                            
ORDER BY show_id DESC;

--No duplicates
--Check null values across columns

SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
       COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
       COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
       COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
	   COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
	   COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
       COUNT(*) FILTER (WHERE date_added IS NULL) AS date_addes_nulls,
       COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
       COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
	   COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
       COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
	   COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS. 
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3  

The director column nulls is about 30% of the whole column, therefore I will not delete them. I will rather find another column to populate it. To populate the director column, we want to find out if there is relationship between movie_cast column and director column

-- Below, we find out if some directors are likely to work with particular cast

WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
FROM netflix
)

SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;

With this, we can now populate NULL rows in directors 
using their record with movie_cast 
UPDATE netflix 
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;

--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET director = 'Not Given'
WHERE director IS NULL;

--When I was doing this, I found a less complex and faster way to populate a column which I will use next

Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column with the director column

--Populate the country using the director column

SELECT COALESCE(nt.country,nt2.country) 
FROM netflix  AS nt
JOIN netflix AS nt2 
ON nt.director = nt2.director 
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
AND netflix.country IS NULL;


--To confirm if there are still directors linked to country that refuse to update

SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;

--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET country = 'Not Given'
WHERE country IS NULL;

The date_added rows nulls is just 10 out of over 8000 rows, deleting them cannot affect our analysis or visualization

--Show date_added nulls

SELECT show_id, date_added
FROM netflix_clean
WHERE date_added IS NULL;

--DELETE nulls

DELETE FROM netflix
WHERE show_id 
IN ('6797', 's6067', 's6175', 's6807', 's6902', 's7255', 's7197', 's7407', 's7848', 's8183');

rating nulls is 4. Delete them

--Show rating NULLS

SELECT show_id, rating
FROM netflix_clean
WHERE date_added IS NULL;

--Delete the nulls, and show deleted fields
DELETE FROM netflix 
WHERE show_id 
IN (SELECT show_id FROM netflix WHERE rating IS NULL)
RETURNING *;

--duration nulls is 4. Delete them


DELETE FROM netflix 
WHERE show_id 
IN (SELECT show_id FROM netflix WHERE duration IS NULL);

Now run the query to show the number of nulls in each column to confirm if there are still nulls. After this, run the query to confirm the row number in each column is the same

--Check to confirm the number of rows are the same(NO NULL)

SELECT count(*) filter (where show_id IS NOT NULL) AS showid_nulls,
       count(*) filter (where type IS NOT NULL) AS type_nulls,
       count(*) filter (where title IS NOT NULL) AS title_nulls,
       count(*) filter (where director IS NOT NULL) AS director_nulls,
	   count(*) filter (where country IS NOT NULL) AS country_nulls,
       count(*) filter (where date_added IS NOT NULL) AS date_addes_nulls,
       count(*) filter (where release_year IS NOT NULL) AS release_year_nulls,
       count(*) filter (where rating IS NOT NULL) AS rating_nulls,
	   count(*) filter (where duration IS NOT NULL) AS duration_nulls,
       count(*) filter (where listed_in IS NOT NULL) AS listed_in_nulls
FROM netflix;

 --Total number of rows are the same in all columns

We can drop the description and movie_cast column because they are not needed for our analysis or visualization task.

--DROP unneeded columns

ALTER TABLE netflix
DROP COLUMN movie_cast, 
DROP COLUMN description;

Some of the rows in country column has multiple countries, for my visualization, I only need one country per row to make my map visualization clean and easy. Therefore, I am going to split the country column and retain the first country by the left which I believe is the original country of the movie

SELECT *,
	   SPLIT_PART(country,',',1) AS countryy, 
           SPLIT_PART(country,',',2),
	   SPLIT_PART(country,',',4),
	   SPLIT_PART(country,',',5),
	   SPLIT_PART(country,',',6),
	   SPLIT_PART(country,',',7),
	   SPLIT_PART(country,',',8),
	   SPLIT_PART(country,',',9),
	   SPLIT_PART(country,',',10) 
	   
FROM netflix;
	   
-- NOW lets update the table

ALTER TABLE netflix 
ADD country1 varchar(500);
UPDATE netflix 
SET country1 = SPLIT_PART(country, ',', 1);

--This will create a column named country1 and Update it with the first split country.

Delete the country column that has multiple country entries

--Delete column
ALTER TABLE netflix 
DROP COLUMN country;

Rename the country1 column to country

ALTER TABLE netflix 
RENAME COLUMN country1 TO country;

Data Visualization

After cleaning, the dataset is set for some analysis and visualization with Tableau.

Note: In the visualization captions, Contents means Movies and TV shows, and Content may either mean Movie or TV Show.

Sheet 1. Content type in percentage

This first sheet shows the two categories of content in the dataset which are Movie and Tv show.

  • As we can see the majority of the content is Movie which takes 69.9%.
  • There are more details in the tooltip which shows the exact count of Movie and Tv show

Sheet 2. Movie & TV Show by Country

This shows the the total amount of Movies and Tv shows per country within the given period of time(2008 - 2021). This can be noted by the size of the coloured circle in the map.

  • We can see that the United State of America has the largest size, followed by India and the United Kingdom.
  • In the Tableau hosted dashboard/sheet, there is a filter for the years between 2008 and 2021 to calculate yearly record.

To give an alternate and a clearer view. Movie & TV shows by country bar chart is below

Sheet 3. Number of Contents Added through the Years

This time series chart shows the total number of contents added to Netflix all through the given years (2008 - 2021)

  • It shows that most movies and tv shows on Netflix were added in 2019
  • In the Tableau sheet, there is a filter to know how much Movies and Tv shows were added in each month of the year

Sheet 4. Top Directors

This chart shows the top 10 directors with most contents on Netflix. This char shows the count of Movie and Tv shows in their catalouge.

  • We can see that most of these directors contents are movies.
  • We can also note that the duo of Raul Campos and Jan Suter are fond of working together and have directed 18 movies on Netflix.

Sheet 5. Top Genres

This chart shows the genres with the highest numbers on Netflix.

  • We can see that Drama & International movies followed by Documentary have the highest number of contents on Netflix within the period.

Sheet 6. Top Ratings

Rating is a system to rate motion picture's suitability for certain audiences based on its content. This chart shows the top ratings on Netflix
-We can note that most contents on Netflix are rated TV-MA. TV-MA in the United States by the TV Parental Guidelines signifies content for mature audiences.

Sheet 7. Oldest Contents on Netflix by Content Release year

This table shows the 10 oldest movies and tv shows on Netflix

  • The oldest is as old as 1925

Sheet 8. Content Types over the Years

This line chart compares the Movie and Tv shows contents added to Netflix all through the years.

  • We can see that more movies have always been added.
  • In 2013, the number of contents added to Netflix for both were almost the same with Movies having 6 contents that year and Tv shows having 5.
  • It shows that in the first 5 years, only movies were added to Netflix.

Sheet 9. Release Years with Highest Contents

This chart shows the Movies and Tv shows production year which has with highest contents on Netflix. We focus on the top 10 release year/production year.
-We can see that from 2012 to 2018, Netflix added most recent contents, they made sure most recent contents per release year are higher than the older release year contents. Then in 2019, it started dropping, this may be due to the Covid-19, but further analysis may be needed to determine this.

And with this, I have come to the end of this exercise. As I said this is just an exercise to test my skills as I look forward to be better. Thanks for following through. Cheers!

Share link

Anyone who has the link will be able to view this.