A Beginner's Guide to Data Analysis with Python

Mina Omobonike
12 min read · Sep 20, 2021

Python is a very versatile programming language that is also beginner-friendly, so if you are interested in data analysis or visualization with a programming language like Python, this article is just for you. If you are a seasoned Python programmer looking to get into the data visualization world, this article is also for you. It contains a step-by-step guide on how to visualize and analyze data with Python; the dataset used is the Netflix dataset obtained from Kaggle. Familiarity with Python is required to follow along. Tools and libraries used include Google Colaboratory, Matplotlib, Plotly, Seaborn, NumPy, and Pandas.

Step 1: Importing libraries

The first thing you should do before starting data visualization in your notebook is import the libraries you will need. If your notebook doesn't have a library installed, you won't be able to import it. If you are using a Jupyter notebook, you can check this article to find out how to install modules in Jupyter notebooks. Google Colaboratory comes with the following libraries installed, so the only thing you need to do is import them. You can follow along by typing out the code.
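For example, you can check from inside the notebook whether a library is already available before importing it. This is a minimal sketch using only the standard library; the `is_installed` helper name is my own, not part of any library:

```python
import importlib.util

# Hypothetical helper: True if the package can be imported in this environment
def is_installed(package_name):
    return importlib.util.find_spec(package_name) is not None

print(is_installed("math"))                 # standard-library modules are always available
print(is_installed("no_such_package_xyz"))  # a missing package returns False
```

If a check like this returns False in Jupyter, install the package first and then import it.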

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Step 2: Getting your Data

The data I am using is stored in my Google Drive, so I am mounting my Google Drive in Google Colaboratory. Then I read the data as a CSV file using the pandas library. If your data is stored in your local storage, GitHub, or elsewhere, you can upload it from there too. If you are not using Google Colaboratory, some of this code will throw an error. To upload data in a Jupyter notebook, you can follow the instructions in this documentation.

from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/SCA Dataset/netflix_titles.csv'
df = pd.read_csv(path)
df

You should get an output like this showing your data frame.

To get your data from GitHub, click on your dataset that is present in the Github repository, then click on view raw. You can use the code below:

url = 'copied_raw_Github_link'
df1 = pd.read_csv(url)
# Dataset from Github is now stored in a Pandas Dataframe

To get the data stored in your local storage, run the following code first:

from google.colab import files
uploaded = files.upload()

You will get a prompt asking you to choose a file. Click on the Choose files prompt, select the file in the file explorer window that comes up, and upload it. Wait for your upload to reach 100%; you will then see the name Colab stored your file under. Use the following code to import it into a dataframe. Remember to double-check that the file name you pass into uploaded is the same as the one Colab stored your file under.

import io
df2 = pd.read_csv(io.BytesIO(uploaded['Dataset.csv']))
#Or you can just read the csv directly like the code below
df2 = pd.read_csv("copy and paste the name colab saved your file as in the output shown after you have uploaded your files")
# Dataset from local storage is now stored in a Pandas Dataframe

Step 3: Know your Data

Let us get information like the number of rows and columns that our data has as well as the data type of each column. To do this, you will run an info function on your data frame.

df.info()

Output:

To see the first five rows of your data frame:

df.head()
#default value for the head function is 5 and you can pass in any number you want into it, try it and see.

You will get an output like this:

To see the statistics of your numerical column:

df.describe()
#to see the statistics for all your columns try the code below
df.describe(include="all")

You will get an output that looks like this:

To see the total number of missing values that each column has:

df.isnull().sum()

Your output will look like this:

We have missing values in the director, cast, country, date_added, and rating columns. The director column has the highest number of missing values, so we are going to start by replacing the missing values in the director and cast columns with the word 'missing'.

#filling missing rows in director column with the keyword missing
df.director.fillna('missing', inplace=True)
#filling missing rows in cast column with the keyword missing
df.cast.fillna('missing', inplace=True)

For the cast column, we are going to create a dictionary that stores each unique cast value as a key and the number of times it appears in the dataset as a value. This will help while performing the data analysis.

data = []
for i in range(len(df)):
    data.extend(df.cast.iloc[i].split(','))
element = {}
for i in data:
    element[i] = data.count(i)
element = sorted(element.items(), key=lambda item: item[1], reverse=True)
len(set(element))

This outputs the length of element, which is 35373.
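Note that `data.count(i)` rescans the whole list for every element, which gets slow on large datasets. `collections.Counter` builds the same tally in a single pass. A sketch on a tiny made-up frame standing in for the Netflix data:

```python
from collections import Counter

import pandas as pd

# Hypothetical mini-frame, not the real dataset
demo = pd.DataFrame({"cast": ["Anupam Kher, Om Puri", "Anupam Kher", "missing"]})

names = []
for row in demo["cast"]:
    names.extend(n.strip() for n in row.split(","))

counts = Counter(names)       # one pass instead of list.count() per element
print(counts.most_common(1))  # → [('Anupam Kher', 2)]
```

`most_common(n)` returns the n most frequent entries already sorted, which replaces the manual `sorted(...)` call as well.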

Next, we are going to fill the missing values in the country and rating columns and drop the rows with missing values in the date_added column. There are various ways to deal with missing values in a data frame; this is just one of them.

#filling missing rows in country column with the keyword missing
df.country.fillna('missing', inplace=True)
#filling missing rows in rating column with the keyword missing
df.rating.fillna('missing', inplace=True)
#Dropping missing values in date_added column
df.dropna(subset=['date_added'], inplace=True)

After running the code above, you can run the following code again and you will notice that all our columns now have the same number of rows.

df.info()

Output:
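As mentioned above, there are various ways to handle missing values. One common alternative to a 'missing' keyword is filling with the column's most frequent value (its mode). A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical mini-frame, not the real dataset
demo = pd.DataFrame({"rating": ["TV-MA", None, "TV-MA", "PG"]})

# Fill missing ratings with the most frequent rating instead of a placeholder
demo["rating"] = demo["rating"].fillna(demo["rating"].mode()[0])
print(demo["rating"].tolist())  # → ['TV-MA', 'TV-MA', 'TV-MA', 'PG']
```

Mode filling keeps the column's categories intact, at the cost of slightly inflating the most common one.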

We are done with handling the missing values in the dataset. Now we move forward to further cleaning of our dataset. What data can we extract from it? We are going to:

  • split the date_added column into month, date, and year.
  • split the listed_in column into different categories.

#new added month column
df['added_month'] = np.nan
for i in range(len(df)):
    df.loc[df.index[i], 'added_month'] = df.date_added.iloc[i].split(' ')[0]
#new added date column
df['added_date'] = np.nan
for i in range(len(df)):
    df.loc[df.index[i], 'added_date'] = df.date_added.iloc[i].split(' ')[1][:-1]
#new added year column
df["date_added"] = pd.to_datetime(df['date_added'])
df['added_year'] = df['date_added'].dt.year
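As an aside, once the column is parsed with `pd.to_datetime`, pandas can extract all three parts without a Python loop via the `.dt` accessor. A sketch on made-up dates:

```python
import pandas as pd

# Hypothetical mini-frame with dates in the Netflix format
demo = pd.DataFrame({"date_added": ["September 9, 2019", "August 14, 2020"]})

dates = pd.to_datetime(demo["date_added"])
demo["added_month"] = dates.dt.month_name()
demo["added_date"] = dates.dt.day
demo["added_year"] = dates.dt.year
print(demo)
```

The vectorized version also avoids the string-slicing needed to strip the trailing comma from the day.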

Since we are done with extraction from the date_added column, we can drop this column from the dataset.

df.drop('date_added', axis=1, inplace=True)

Splitting values in listed_in column.

#Splitting values
listed_in = []
for i in range(len(df)):
    listed_in.extend(df.listed_in.iloc[i].split(','))
#counting values in the newly created dictionary
listed_dic = {}
for i in listed_in:
    listed_dic[i] = listed_in.count(i)
listed_dic = dict(listed_dic)
#dropping any null values
df.dropna(inplace=True)
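The same counts can be computed in one chained pandas expression using `str.split` and `explode`; a sketch on made-up data:

```python
import pandas as pd

# Hypothetical mini-frame, not the real dataset
demo = pd.DataFrame({"listed_in": ["Dramas, International Movies", "Dramas"]})

genre_counts = (demo["listed_in"]
                .str.split(", ")    # each cell becomes a list of genres
                .explode()          # one row per genre
                .value_counts())    # tally, sorted descending
print(genre_counts.to_dict())  # → {'Dramas': 2, 'International Movies': 1}
```

`explode` turns each list element into its own row, so `value_counts` can count genres across the whole column in one go.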

One thing to keep in mind is that when you work with data, a large percentage of your time will be spent preparing and cleaning it. Now that we are done with data cleaning, we can move on to the next step.

Step 4: Asking Questions/ Data Analysis

QUESTIONS WE ARE GOING TO BE ANSWERING WITH OUR ANALYSIS

  • What are the different types of content uploaded on Netflix and their percentages?
  • Which director appeared the most on the Netflix dataset?
  • Which director directed the highest number of TV shows?
  • Which celebrity was cast the most?
  • Which are the top 20 countries where content was produced?
  • Which year had the highest number of entries for shows?
  • What are the different types of ratings on Netflix and which one is the most popular?
  • What are the correlation relationships between the different genres in our dataset?

#To check the number of TV shows and movies in our dataset
content_count = df['type'].value_counts().sort_values()
print(content_count)

Output:

To see the above output in a visualization, you can run the following code:

#Matplotlib library is used here to visualize the data
plt.figure(figsize=(10,5))
plt.pie(df['type'].value_counts(), labels=df['type'].value_counts().index, explode=[0.05,0],
        autopct='%1.2f%%', colors=['Blue','Green'])
plt.show()

Output:

The pie chart above shows that more than half of our Netflix dataset is movies. It seems people prefer movies over TV shows, possibly because:

  • Movies are more intense: in a TV show one episode can be intense, but not every episode. A good movie keeps you glued to your seat throughout.
  • They have better special effects: movie editors can cut any unnecessary shots because they know they have limited time to hold the audience's attention. In TV shows, some shots are kept just to stretch an episode, which makes people lose interest.
  • Movies are easier to access: you can watch a movie anywhere and anytime. TV shows have fixed dates and times for each episode, whereas once a movie is released, online platforms let you watch it whenever you want, whether early in the morning or in the middle of the night, without waiting for episodes to drop.
  • Movies are easier to follow: in a film you can often guess what will happen in the next couple of minutes, which keeps you in your seat. In a TV show, a few episodes can leave you wondering what is happening and what will happen next; shows can take you on a roller-coaster ride.
  • Movies wrap up quickly: you sit for about two hours and the film is done. With a series, you sit down every day or week and try to guess what will happen while you wait for the next episode. A movie tells you the whole story in one go.
  • More art forms can be incorporated: movies can include many art forms that TV shows rarely have room for.
  • The social aspect: you can go out with friends to watch a movie because it takes less time, whereas your friends may not like a TV show the way you do. Unless, of course, you plan a binge-watch party, which is likely less frequent than going to the movies.

Let us look at the director column:

director_count = df['director'].value_counts()
print(director_count)
#most values in the director column are missing

Output:

#The 20 most common directors in the dataset (index 0 is the 'missing' placeholder, so we skip it)
df.director.value_counts()[1:21].sort_values(ascending=False).plot(kind='bar', width=0.5, color='Blue')

Output:

Raúl Campos and Jan Suter are the most common directors of the content found in the Netflix dataset. We can also check the type of content, i.e. TV show or movie, that they and other directors directed.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
data = df.groupby('type')['director'].value_counts()['Movie'][1: 20]
data = pd.DataFrame(data)
ax1.bar(data.index, data.director, color='red')
ax1.tick_params(labelrotation=90)
ax1.set_title('Movie', fontsize=24, fontweight='bold')
data2 = df.groupby('type')['director'].value_counts()['TV Show'][1: 20]
data2 = pd.DataFrame(data2)
ax2.bar(data2.index, data2.director, color='green')
ax2.tick_params(labelrotation=90)
ax2.set_title('TV Show', fontsize=24, fontweight='bold')

Output:

We can see that Raúl Campos and Jan Suter mostly direct Movies that are uploaded on Netflix and Alastair Fothergill mostly directs TV Shows that were found on Netflix. Let us look at the dictionary of casts that we created earlier:

#dictionary of actors and how many contents they appeared in
dic_element = dict(element[1:])
print(dic_element)

The output will look like

Now let us visualize the above output:

dic_element = dict(element[1:])
dic_element_key = list(dic_element.keys())
dic_element_value = list(dic_element.values())
plt.bar(dic_element_key[:20], dic_element_value[:20], color='red')
plt.xticks(rotation='vertical')

Output:

Anupam Kher is the cast member who appears most often in the Netflix dataset; he mostly appears in Indian movies on Netflix. The next thing we will do is look at the rate at which content has been added over the years. Run the following code in your notebook:

import plotly.graph_objects as go
df_tv = df[df["type"] == "TV Show"]
df_movies = df[df["type"] == "Movie"]
df_content = df['added_year'].value_counts().reset_index().rename(columns = {
'added_year' : 'count', 'index' : 'added_year'}).sort_values('added_year')
df_content['percent'] = df_content['count'].apply(lambda x : 100*x/sum(df_content['count']))
df_tv1 = df_tv['added_year'].value_counts().reset_index().rename(columns = {
'added_year' : 'count', 'index' : 'added_year'}).sort_values('added_year')
df_tv1['percent'] = df_tv1['count'].apply(lambda x : 100*x/sum(df_tv1['count']))
df_movies1 = df_movies['added_year'].value_counts().reset_index().rename(columns = {
'added_year' : 'count', 'index' : 'added_year'}).sort_values('added_year')
df_movies1['percent'] = df_movies1['count'].apply(lambda x : 100*x/sum(df_movies1['count']))
t1 = go.Scatter(x=df_movies1['added_year'], y=df_movies1["count"], name="Movies", marker=dict(color="#a678de"))
t2 = go.Scatter(x=df_tv1['added_year'], y=df_tv1["count"], name="TV Shows", marker=dict(color="#6ad49b"))
t3 = go.Scatter(x=df_content['added_year'], y=df_content["count"], name="Total Contents", marker=dict(color="brown"))
data = [t1, t2, t3]
layout = go.Layout(title="Content added over the years", legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()
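The three nearly identical count-and-percent blocks above can be factored into one small helper. A minimal sketch on toy data (the `yearly_counts` name is my own, not from the article):

```python
import pandas as pd

def yearly_counts(frame, year_col="added_year"):
    """Count rows per year and attach a percent column (hypothetical helper)."""
    out = (frame[year_col].value_counts()
           .rename_axis(year_col)
           .reset_index(name="count")
           .sort_values(year_col)
           .reset_index(drop=True))
    out["percent"] = 100 * out["count"] / out["count"].sum()
    return out

# Hypothetical mini-frame, not the real dataset
demo = pd.DataFrame({"added_year": [2019, 2019, 2020]})
print(yearly_counts(demo))
```

With this helper, `df_content`, `df_tv1`, and `df_movies1` each become a single call such as `yearly_counts(df_tv)`.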

The output will look like this; it will be an interactive visual:

Netflix’s initial business model included DVD sales and rental by mail, but the founder abandoned the sales about a year after the company was founded to focus on the initial DVD rental business. Netflix expanded its business in 2007 with the introduction of streaming media while retaining the DVD and Blu-ray rental business. The company expanded internationally in 2010 with streaming available in Canada, followed by Latin America and the Caribbean. Netflix entered the content-production industry in 2013, debuting its first series House of Cards. This is probably the reason why the growth in content started in 2013.

The growth in the number of movies on Netflix is much higher than that of TV shows.

About 1200 new movies were added in both 2018 and 2019, and 2019 had the highest amount of content added. I was expecting more content to have been added in 2020, since we had seen steady growth in entries until then. I think the amount of content added dropped in 2020 because of the pandemic: there were lockdowns, and production of most series and movies was halted.

Let us look at the different ratings in our dataset:

plt.figure(figsize=(8, 10))
sns.countplot(y='rating', data=df, order=df.rating.value_counts().index.to_list(), palette='dark:salmon_r')
plt.title('Different Ratings', fontsize=24, fontweight='bold');

Output

So, TV-MA is the most common rating on Netflix. This rating is for mature audiences, and it might be the most common rating because it is adults who are paying for the subscription; Netflix's largest audience ranges from 16 to 34 years old. So far we have seen pie charts, scatter-line charts, and bar charts. Now, let us see how to do a correlation matrix.

from sklearn.preprocessing import MultiLabelBinarizer
#MultiLabelBinarizer allows you to encode multiple labels per instance.
def relation_heatmap(df, title):
    df['genre'] = df['listed_in'].apply(lambda x: x.replace(' ,', ',').replace(', ', ',').split(','))
    Types = []
    for i in df['genre']:
        Types += i
    Types = set(Types)
    print("There are {} types in the Netflix {} Dataset".format(len(Types), title))
    test = df['genre']
    mlb = MultiLabelBinarizer()
    res = pd.DataFrame(mlb.fit_transform(test), columns=mlb.classes_, index=test.index)
    corr = res.corr()
    mask = np.zeros_like(corr, dtype=bool)  #np.bool is deprecated; use the builtin bool
    mask[np.triu_indices_from(mask)] = True
    fig, ax = plt.subplots(figsize=(10, 7))
    pl = sns.heatmap(corr, mask=mask, cmap="coolwarm", vmax=.5, vmin=-.5, center=0, square=True,
                     linewidths=.7, cbar_kws={"shrink": 0.6})
    plt.show()

Call the function we created above in the next code:

relation_heatmap(df_movies, 'Movie')

Output:

The negative relationship between Dramas and Documentaries is remarkable. For Independent and International films, we see a neutral correlation, and there is a strong correlation between Sci-Fi & Fantasy and Action & Adventure. A negative relationship implies that the probability of the two genres occurring together is very low, while a positive relationship implies that they often go together.

To see the correlation heatmap of TV shows:

relation_heatmap(df_tv, 'TV Show')

Output:

The negative relationship between Kids' TV and International TV Shows is remarkable; it suggests these two genres do not go together. There is a strong positive correlation between Science & Nature TV and Docuseries, which can mean that most science and nature shows are also docuseries.

For reference, here is a link to my Google Colaboratory notebook, where you will find the full analysis. Data visualization is a very important part of data analysis, and it helps you tell the story of your data visually. Thank you for reading, and kindly drop a comment in the comment section.
