To recap last week’s post: I was preparing to build a Data Analysis Pipeline, and this was the macro plan the project would be built around. I’m going to use that plan to outline this blog post.
- Finalised Dashboard
- Gathering all of the Necessary Information
  - Data
  - Data Dictionary
- Planning the Model
  - Jupyter Notebook (Python) & Lucidchart
  - Datetime Dimension
  - Passenger Count and Trip Distance Dimensions
  - Pickup and Dropoff Dimensions
  - Ratecode Dimension
  - Payment Type Dimension
  - Final Dimension Model
- Reviewing the Data
  - Preparing the Data
  - Creating the Dimensions
    - Datetime Dimension
    - Passenger Count and Trip Distance Dimensions
    - Pickup and Dropoff Dimensions
    - Ratecode Dimension
    - Payment Type Dimension
    - Fact Table
- Migrating the Local Data to Google Cloud Storage
- Construct the ETL Components
  - Compute Engine
  - MAGE
  - Open the Necessary Port(s)
  - Checking Availability
  - Working in MAGE
- Creating the Fact Table in BigQuery
  - BigQuery
  - MAGE
  - SSH in Browser
  - MAGE
  - BigQuery
  - MAGE
  - BigQuery
- Developing the Dashboard in Looker
  - Connect Data to the Dashboard
Finalised Dashboard
https://lookerstudio.google.com/s/gJ2pMR8EdWY
Gathering all of the Necessary Information
Data
Whilst thinking about the various types of data that are freely available on the internet, I considered how different countries track their information and, not only that, how they display it publicly. Given the amount of transaction information generated in a city that is overrun by taxis, I set out to find the data for this sector in New York City. The dataset was accessed on 6 November 2023.
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Data Dictionary
Before I build this pipeline, I need to build the model, and I can do that using the data dictionary located on the website.
https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
From the available information, I knew it would be beneficial to create a data warehouse, as I can continually add to the model to support quick, efficient, business-oriented analysis and decisions. This means I will need to convert the flat table format (.csv) into a dimensional modelling approach: a data warehouse built from Fact and Dimension tables.
Creating these models is an iterative process. Some of the understanding comes over time from modelling a variety of projects, and some comes from positive and negative modelling experiences: events that were caught or missed in production because of the type of modelling that was planned. This current iteration, though, is the sum of the decisions made while comparing the flat table data (the .csv file) against the available data dictionary.
Planning the Model
Jupyter Notebook (Python) & Lucidchart
To best manipulate the data, I am going to use a Jupyter Notebook (Python). Firstly, I will install the numpy library, whose mathematical functions I will use later. Loading the data and inspecting the first few rows (using the head method) will allow me to plan the approach for the model’s dimensions.
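As a rough sketch of this step (assuming the TLC extract has been saved locally as uber_data.csv, the filename used for the Cloud Storage upload later in this post):

import numpy as np  # imported now for the mathematical functions used later
import pandas as pd

# Load the flat TLC trip extract
df = pd.read_csv('uber_data.csv')

# Inspect the first few rows to plan the model's dimensions
df.head()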
From this output, I can see that I will be constructing a Fact table using the VendorID. I will introduce a datetime dimension (table) and integrate the available information from columns 3 and 4.
I will be using Lucidchart to model this pipeline as a means to think out the logic and address any holes before implementing the plan.
Datetime Dimension
For the datetime, I could choose to create separate pickup and dropoff dimensions, but at this point I will integrate them into a single dimension and monitor it later to determine whether I change the normal form level.
Passenger Count and Trip Distance Dimensions
I will next create dimensions for both the passenger count and trip distance columns. I could include these in the fact table, as they are part of each trip’s transaction; however, I prefer to break them out into their own dimensions.
Pickup and Dropoff Dimensions
The next dimensions to be added will be the pickup and dropoff locations (latitude and longitude).
Ratecode Dimension
The next dimension that I’ll add is the rate code (the final rate code in effect at the end of the trip). Within the data dictionary, the values that can be assigned to the RateCodeID column are 1-6.
However, when developing this model, I will include additional information, i.e., a readable name for each rate (mostly location names). This makes the model a bit easier for us, and anyone else, to understand.
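As a sketch of what this could look like in the notebook (the labels come from the TLC data dictionary; the exact column spelling, e.g. RatecodeID, should be checked against the CSV header):

# RateCodeID labels (values 1-6) from the TLC data dictionary
rate_code_type = {
    1: 'Standard rate',
    2: 'JFK',
    3: 'Newark',
    4: 'Nassau or Westchester',
    5: 'Negotiated fare',
    6: 'Group ride',
}

# Sketch: build the dimension from the trip data and attach the readable name
rate_code_dim = df[['RatecodeID']].reset_index(drop=True)
rate_code_dim['rate_code_id'] = rate_code_dim.index
rate_code_dim['rate_code_name'] = rate_code_dim['RatecodeID'].map(rate_code_type)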
Payment Type Dimension
I will also follow the same concept for the payment type.
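A similar mapping sketch for the payment type, with the labels again taken from the data dictionary; it is applied to the payment type dimension in the same way as the rate code mapping above:

# payment_type labels from the TLC data dictionary
payment_type_name = {
    1: 'Credit card',
    2: 'Cash',
    3: 'No charge',
    4: 'Dispute',
    5: 'Unknown',
    6: 'Voided trip',
}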
Within the Fact table, we are interested in the transactions for each trip. This means that we can create attributes in the table using the following columns:
- Fare_amount
- Extra (miscellaneous extras and surcharges)
- MTA_tax
- Improvement_surcharge
- Tip_amount
- Tolls_amount
- Total_amount
Final Dimension Model
Reviewing the Data
Preparing the Data
Before I convert this flat file into the dimension model, I want to check that I have the correct data types for each of the attributes.
I can see that both the pickup_ and dropoff_ datetimes are in the ‘object’ format, so I will need to convert them. I can use pandas to convert the object format to datetime.
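Roughly, the conversion and a first cut of the datetime dimension look like this (a sketch; the column names match those used in the BigQuery query later in the post):

# Convert the pickup and dropoff columns from 'object' (strings) to datetimes
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# First attempt at the datetime dimension: the two columns with duplicates removed
datetime_dim = df[['tpep_pickup_datetime', 'tpep_dropoff_datetime']].drop_duplicates().reset_index(drop=True)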
This line of code outputs what will become the datetime dimension, without any duplicate values in the two columns.
After running this code, I realised that I would need to run .drop_duplicates().reset_index(drop=True) each time I created a new dimension. So, I decided to run these methods once on the entire dataset and then index the rows against a new column, ‘trip_id’.
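Roughly, that tidy-up step looks like this:

# De-duplicate the whole dataset once, rebuild the index, and keep it as an
# explicit surrogate key for each trip
df = df.drop_duplicates().reset_index(drop=True)
df['trip_id'] = df.index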
Creating the Dimensions
Datetime Dimension
Now, to create the datetime dimension. I am comparing the model that I created earlier, using Lucidchart, with the code that I am generating in Python.
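A sketch of how the datetime dimension might be built; the broken-out attribute names (e.g. pick_hour, drop_weekday) are my assumptions based on the Lucidchart model, while datetime_id matches the key used in the BigQuery joins later:

datetime_dim = df[['tpep_pickup_datetime', 'tpep_dropoff_datetime']].reset_index(drop=True)

# Break the pickup timestamp into its parts
datetime_dim['pick_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
datetime_dim['pick_day'] = datetime_dim['tpep_pickup_datetime'].dt.day
datetime_dim['pick_month'] = datetime_dim['tpep_pickup_datetime'].dt.month
datetime_dim['pick_year'] = datetime_dim['tpep_pickup_datetime'].dt.year
datetime_dim['pick_weekday'] = datetime_dim['tpep_pickup_datetime'].dt.weekday

# And the same for the dropoff timestamp
datetime_dim['drop_hour'] = datetime_dim['tpep_dropoff_datetime'].dt.hour
datetime_dim['drop_day'] = datetime_dim['tpep_dropoff_datetime'].dt.day
datetime_dim['drop_month'] = datetime_dim['tpep_dropoff_datetime'].dt.month
datetime_dim['drop_year'] = datetime_dim['tpep_dropoff_datetime'].dt.year
datetime_dim['drop_weekday'] = datetime_dim['tpep_dropoff_datetime'].dt.weekday

# Surrogate key used to join back to the fact table
datetime_dim['datetime_id'] = datetime_dim.index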
Passenger Count and Trip Distance Dimensions
Pickup and Dropoff Dimensions
Ratecode Dimension
Payment Type Dimension
Fact Table
To create the fact table, I need to join all of the newly created dimensions. From the Lucidchart visual plan, I need to include the fact table information (attributes), e.g., fare_amount, extra, mta_tax, etc., along with the *_id column from each dimension.
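A hedged sketch of that join, assuming each dimension was built one row per trip so that its surrogate key lines up with trip_id; the key and measure names match the dimension model and the BigQuery query later in this post:

fact_table = (
    df.merge(datetime_dim, left_on='trip_id', right_on='datetime_id')
      .merge(passenger_count_dim, left_on='trip_id', right_on='passenger_count_id')
      .merge(trip_distance_dim, left_on='trip_id', right_on='trip_distance_id')
      .merge(rate_code_dim, left_on='trip_id', right_on='rate_code_id')
      .merge(pickup_location_dim, left_on='trip_id', right_on='pickup_location_id')
      .merge(dropoff_location_dim, left_on='trip_id', right_on='dropoff_location_id')
      .merge(payment_type_dim, left_on='trip_id', right_on='payment_type_id')
    [['trip_id', 'VendorID', 'datetime_id', 'passenger_count_id', 'trip_distance_id',
      'rate_code_id', 'pickup_location_id', 'dropoff_location_id', 'payment_type_id',
      'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
      'improvement_surcharge', 'total_amount']]
)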
Migrating the Local Data to Google Cloud Storage
Using this image from last week, I can now move from step 0 (preparing the data) to step 1 and upload this model to Google Cloud Storage.
I started by creating a new project in the Google Cloud Console.
Next I created a bucket for me to upload the .csv file.
Then I uploaded the data into this newly created bucket, and granted it public (reader) access.
Construct the ETL Components
Next is to deploy MAGE onto Compute Engine.
Compute Engine
First, I need to create and deploy a VM instance on Compute Engine.
After this has been created, I want to be able to connect to it, and I can do this through SSH.
This will open a window, allowing me to communicate to this newly created VM using SSH.
These are the commands that I ran to update the OS and install all of the latest files, Python, and pip:
- sudo apt-get update
- sudo apt-get install python3-distutils
- sudo apt-get install python3-apt
- sudo apt-get install wget
- wget https://bootstrap.pypa.io/get-pip.py
- sudo python3 get-pip.py
MAGE
I navigated to the MAGE GitHub repository and used the pip install method to make MAGE available on the Google VM. https://github.com/mage-ai/mage-ai#using-pip-or-conda
Next is to create a new MAGE project on the VM by using the command:
mage start uber-data-analysis-jamesmiller-cv
This screenshot shows that MAGE is running over port 6789.
Open the Necessary Port(s)
Before I can publicly access this information, I need to make sure that this port is open.
After opening my VM, I scroll down and click on the nic0 network interface:
Checking Availability
To check that this is working, I can go to my VM, find its public (external) IP, copy and paste this IP into a new browser tab, and append the port (6789) to the end:
Working in MAGE
This is where I will be creating the pipeline. So, I am going to create a new Standard (batch) pipeline.
I’m going to create one from scratch, but first I need to load my data into it. Since I have made the data publicly available, I am going to load it via the API option:
This will create template Python code for loading the data:
I will need to paste in the URL of the .csv file that I uploaded to Google Cloud Storage. So, I can navigate back to that tab, drop the URL in, and run the code: https://storage.googleapis.com/uber-data-analysis-jamesmiller-cv/uber_data.csv
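For reference, this is roughly what the generated loader block looks like once the URL has been dropped in (the exact template can vary between MAGE versions):

import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_api(*args, **kwargs):
    # Public URL of the .csv uploaded to the Google Cloud Storage bucket
    url = 'https://storage.googleapis.com/uber-data-analysis-jamesmiller-cv/uber_data.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')


@test
def test_output(output, *args) -> None:
    # MAGE runs this after the block to confirm the load produced something
    assert output is not None, 'The output is undefined'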
After this is done loading the data, it provides some tests to see if there are any issues with this portion of the pipeline. This shows a successful load of the data.
Next will be to transform the data. So, I will scroll down and select Transformer > Python > Generic (no template).
On the screenshot below, we can see that we are slowly building up the pipeline. When you select one of the blocks on the left, the middle section of the screen shows the Python script you will be adjusting. For the transformation portion of this pipeline, this is where we can copy in the code that we developed earlier in the Jupyter Notebook.
Here we can see that we have successfully passed the data through the transformation portion:
From here, we need to export the data from this pipeline to BigQuery. To do this, I will need to pass the tables through as a dictionary.
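As a sketch of the shape of that return value, using a hypothetical helper (package_tables) and assuming the dimension and fact DataFrames from the notebook code have been built earlier in the transformer block:

def package_tables(datetime_dim, passenger_count_dim, trip_distance_dim, rate_code_dim,
                   pickup_location_dim, dropoff_location_dim, payment_type_dim, fact_table):
    # Each DataFrame becomes a plain dictionary keyed by its target table name;
    # this dictionary is what the transformer returns and what the BigQuery
    # exporter block later receives as `data`.
    tables = {
        'datetime_dim': datetime_dim,
        'passenger_count_dim': passenger_count_dim,
        'trip_distance_dim': trip_distance_dim,
        'rate_code_dim': rate_code_dim,
        'pickup_location_dim': pickup_location_dim,
        'dropoff_location_dim': dropoff_location_dim,
        'payment_type_dim': payment_type_dim,
        'fact_table': fact_table,
    }
    return {name: frame.to_dict(orient='dict') for name, frame in tables.items()}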
Now that this has been created, I can create the exporter to send to BigQuery; Data exporter > Python > Google BigQuery:
This creates the data exporter block for the pipeline. However, there are some aspects of the code that I need to adjust in order for a connection to be made:
- table_id = 'your-project.your_dataset.your_table_name'
Based on the name, it will create the table.
- config_path = path.join(get_repo_path(), 'io_config.yaml')
This .yaml file is located within MAGE and provides users with various service connection types, e.g., AWS, Google, MongoDB.
This information is obtainable through the Google Cloud Platform (GCP) console. Search: API & Services > Credentials > Create Credentials > Service Account (this means that we’re going to allow permissions for the VM to communicate with the other GCP services) > Name the service > Access: Role > Search: BigQuery Admin > Done.
Then a service account will be created.
I want to create a file where all of my credentials can be stored. So, I will click onto this account > Keys > Add Key > Create new key > JSON option. This will download the key file to the local machine. I will use the credential information from this newly created .json file to update the io_config.yaml file in the MAGE utils folder.
After updating the necessary fields, you can save the pipeline then ‘View pipeline’.
Creating the Fact Table in BigQuery
BigQuery
Now I’m going to create a connection to BigQuery. In the GCP console, I will search for it and open it.
Next I will create the dataset:
MAGE
With this newly created dataset, I will copy the name and paste it into the Data Explorer block in MAGE:
table_id = 'your-project.your_dataset.your_table_name'
My project name is: uber-project-404402 (1),
My dataset is: uber_data_analysis (2), and
My table name will be fact_table (from the MAGE script).
Since I have converted the dataframe (df) from the transformer into a dictionary, I need to update the BigQuery data exporter function to account for this. So, I will update its parameter from 'df: DataFrame' to 'data':
I will also update the first argument of the BigQuery export call from:
df
To:
DataFrame(data['fact_table'])
After running this code, an error highlighted that the Google Cloud libraries could not be found. This means that we need to install the Google Cloud and BigQuery packages.
SSH in Browser
I can install these packages by navigating back to the Google VM instance and running another SSH connection window with the following commands:
sudo pip3 install google-cloud
sudo pip3 install google-cloud-bigquery
MAGE
Now I can try running the pipeline in MAGE again, with success:
BigQuery
I can navigate back to BigQuery, refresh the page, and see the fact table has now been loaded into the project menu:
MAGE
Since I’ve only completed this for the one table, I will now have to complete it for the others (the dimensions). I will need to create a for loop in the Python code, cycling through all of the tables and uploading each to BigQuery. Original code:
table_id = 'uber-project-404402.uber_data_analysis.fact_table'
config_path = path.join(get_repo_path(), 'io_config.yaml')
config_profile = 'default'

BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
    DataFrame(data['fact_table']),
    table_id,
    if_exists='replace',
)
For loop code:
config_path = path.join(get_repo_path(), 'io_config.yaml')
config_profile = 'default'

for key, value in data.items():
    table_id = 'uber-project-404402.uber_data_analysis.{}'.format(key)
    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        DataFrame(value),
        table_id,
        if_exists='replace',
    )
Running this updated block shows that the export was successful:
BigQuery
We can see these tables are now available in BigQuery:
To test the queries that I will be performing on this dataset, I will select all of the attributes in the fact table for the first 10 rows.
Since this works, I will now create the code for querying all the columns found in the initial flat file, but joining the dimensions together through the fact table:
SELECT
f.VendorID,
d.tpep_pickup_datetime,
d.tpep_dropoff_datetime,
p.passenger_count,
t.trip_distance,
r.rate_code_name,
pick.pickup_latitude,
pick.pickup_longitude,
dropo.dropoff_latitude,
dropo.dropoff_longitude,
pay.payment_type_name,
f.fare_amount,
f.extra,
f.mta_tax,
f.tip_amount,
f.tolls_amount,
f.improvement_surcharge,
f.total_amount
FROM
`uber-project-404402.uber_data_analysis.fact_table` f
JOIN `uber-project-404402.uber_data_analysis.datetime_dim` d ON f.datetime_id=d.datetime_id
JOIN `uber-project-404402.uber_data_analysis.passenger_count_dim` p ON p.passenger_count_id=f.passenger_count_id
JOIN `uber-project-404402.uber_data_analysis.trip_distance_dim` t ON t.trip_distance_id=f.trip_distance_id
JOIN `uber-project-404402.uber_data_analysis.rate_code_dim` r ON r.rate_code_id=f.rate_code_id
JOIN `uber-project-404402.uber_data_analysis.pickup_location_dim` pick ON pick.pickup_location_id=f.pickup_location_id
JOIN `uber-project-404402.uber_data_analysis.dropoff_location_dim` dropo ON dropo.dropoff_location_id=f.dropoff_location_id
JOIN `uber-project-404402.uber_data_analysis.payment_type_dim` pay ON pay.payment_type_id=f.payment_type_id;
I will then create this final analysis table, to use for the dashboard, by entering the following code:
CREATE OR REPLACE TABLE `uber-project-404402.uber_data_analysis.tbl_analytics` AS (
[<query above>]
)
Developing the Dashboard in Looker
Connect Data to the Dashboard
I’ll start by opening Looker Studio, creating a blank report, and connecting to BigQuery to retrieve the data:
Then I can initiate the connection for the Uber dataset:
Starting with this blank canvas, I can slowly build up the dashboard:

