15 – 3/8 Learn Python for Data Analysis

  1. Learn the fundamentals of Statistics
  2. Learn SQL
  3. Learn Python for Data Analysis
  4. Learn Data Manipulation and Visualization
  5. Learn Statistical Analysis
  6. Learn Data Visualization Tools
  7. Work on Projects
  8. Learn Data Storytelling

Whilst preparing this post (gathering all the different sections), I realised just how many topics could be included within a series on ‘Learn Python’. Although I enjoy the shallowness of a broad concept like ‘Learn [this entire language, its syntax, implementations, and situational contexts for its applications]’, I think the healthier approach is to show some general, key understandings around Python for Data Analysis, rather than reinventing the wheel. That wheel has been developed numerous times, over numerous years, all across the globe. I’m just learning Python so I can apply it to things that I either find interesting or would like to develop further skills (or education) in.

We have some basics, like: numbers (integers, floats), variables, strings, lists, tuples, dictionaries, and sets. We also have logical operators (and, or, not) and control-flow statements (while, for). Then, we can start to use a mixture of these basics to create simple programs like this one, which outputs a numbered list of the tools that I use for Data Analysis:

Enumeration function:

listOfTools = ["Excel", "SQL", "PowerQuery", "Python", "R"]

i = 0
tools_list = []

while i < len(listOfTools):
    tools_list.append((i, listOfTools[i]))  # pair each tool with its position
    i += 1

print(tools_list)

[(0, 'Excel'), (1, 'SQL'), (2, 'PowerQuery'), (3, 'Python'), (4, 'R')]
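As an aside, Python’s built-in enumerate function produces the same index–value pairs without the manual counter; a minimal equivalent of the loop above:

tools_list = list(enumerate(listOfTools))
print(tools_list)  # same output as the while loop above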

Alongside these basic operations, you can import packages and utilise pre-made (and specific) libraries for particular use cases. The following are some useful ones:

  • Pandas – a software library written for data manipulation and analysis; it provides data structures and operations for manipulating numerical tables and time series,
  • NumPy – a software library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays,
  • Matplotlib – a plotting library for Python and its numerical mathematics extension, NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits, and
  • Seaborn – a data visualisation library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Clerical Necessities

Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib.pyplot import figure

plt.style.use('ggplot')
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 8)  # sets the default size of the created plots
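The sections that follow assume the movie dataset has already been read into a DataFrame named df. A minimal sketch (the file name movies.csv is an assumption; adjust it to wherever your data lives):

df = pd.read_csv('movies.csv')  # hypothetical file name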

Inspect First Few Lines of Data

df.head()

So that I can understand the type of data that I am analysing, I call the DataFrame’s head() method to briefly view the columns and the data located within them. From here, I can commence the data cleaning process.

Data Cleaning

Check for Missing Data

Given the amount of data located in this analysis file, I want to check whether any data is missing. For this, I can use the following for loop to iterate through each column, searching for null values.

for col in df.columns:
    pct_missing = np.mean(df[col].isnull())  # fraction of null values in this column
    print('{} – {:.1%}'.format(col, pct_missing))

The output shows the percentage of null values in each column.
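pandas can also produce the same summary in a single vectorised call, which avoids the explicit loop:

print(df.isnull().mean())  # fraction of missing values per column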

Check Data Types for Columns

df.dtypes

I can use the dtypes attribute to check the data type of each column. Identifying the data types means that I can apply the appropriate and necessary analysis, whether it is quantitative or qualitative.

Change Data Types for ‘budget’ and ‘gross’

df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].astype('int64')
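Note that astype('int64') will raise an error if either column still contains missing values. One common workaround is to fill them first (filling with zero is an assumption, not necessarily the right choice for this dataset):

df['budget'] = df['budget'].fillna(0).astype('int64')
df['gross'] = df['gross'].fillna(0).astype('int64')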

Create Correct Year column

df['yearcorrect'] = df['released'].astype(str).str[:4]  # assumes 'released' begins with a four-digit year

Order by Gross Revenue

df.sort_values(by=['gross'], inplace=False, ascending=False).head()  # returns a sorted copy; df itself is unchanged

Remove duplicates

df['company'].drop_duplicates().sort_values(ascending=False).head()
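The line above only inspects the unique company names. To actually remove fully duplicated rows from the whole DataFrame, pandas provides drop_duplicates at frame level:

df = df.drop_duplicates()  # drops rows that are exact duplicates of an earlier row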

Correlations

Which attribute is more correlated to gross revenue?

plt.scatter(x=df['budget'], y=df['gross'])
plt.title('Budget vs Gross Earnings')
plt.xlabel('Budget for Film (million, USD)')
plt.ylabel('Gross Earnings (million, USD)')
plt.show()

sns.regplot(x='budget', y='gross', data=df, scatter_kws={"color": "red"}, line_kws={"color": "blue"})

Numerical Perspective of Correlation Matrix

In a previous post, 7 – 1/8 Learn the Fundamentals of Statistics, I outlined how to calculate the correlation between two variables (x, y). This is a multivariate representation of this same calculation.
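The numerical matrix itself can be produced directly from the DataFrame (recent pandas versions may require passing numeric_only=True so that text columns are skipped):

correlation_matrix = df.corr(method='pearson')
print(correlation_matrix)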

Colourised Correlation Matrix

correlation_matrix = df.corr(method='pearson')

sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix')
plt.xlabel('Movie Features')
plt.ylabel('Movie Features')
plt.show()

The resulting output is a colourised version of the numerical matrix above. This version gives the observer a quick view of how strongly each variable relates to the others. Since each variable is perfectly correlated with itself, the diagonal always reads 1.

Using the legend on the right, we can see that the lighter the colour, the more positively correlated the variables are to one another. For example, ‘gross revenue’ has a high correlation with ‘budget’: a coefficient of roughly 0.71, which means that, in this dataset, films with larger budgets tend, on average, to generate more revenue. (Note that a correlation coefficient measures linear association; it is not a probability.)

The darker the colour, the less correlated the variables are. For example, ‘budget’ has a low correlation with ‘score’ (or rating): a coefficient of roughly 0.04, which suggests that, in this dataset, a film’s budget has almost no linear relationship with the score it receives.
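If you would rather scan the matrix as a ranked list of feature pairs, it can be unstacked and sorted; a minimal sketch:

corr_pairs = correlation_matrix.unstack().sort_values(ascending=False)
print(corr_pairs)  # every feature pair, ordered from most to least correlated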
