10 – 2/8 Learn SQL

  1. Learn the fundamentals of Statistics
  2. Learn SQL
  3. Learn Python for Data Analysis
  4. Learn Data Manipulation and Visualization
  5. Learn Statistical Analysis
  6. Learn Data Visualization Tools
  7. Work on Projects
  8. Learn Data Storytelling

To best serve this post, I’ve decided to list a range of SQL commands, functions, and processes and show when and how to use each one. When coding in SQL, my preference is to separate each clause with a new line. It makes queries far easier to read, especially when I want to check particular elements at a glance: which table is the data being retrieved from? What is the WHERE clause filtering?

From what I could discern from the website-scraping exercise I did for this post, the following are basic commands that most SQL users will need on a daily basis. I will be using different datasets for these exercises, as some are better suited to certain commands than others.

Dataset 1: eCommerce – a Kaggle-sourced dataset that includes three tables: OrderDetails, SalesTarget, and OrdersList.

Dataset 2: finance – a collection of Yahoo-sourced financial datasets, including Apple, Amazon, Meta, and Netflix.

Dataset 3: Zentopic – a collection of three tables that require various cleaning processes and linkages so that appointment times can be migrated into another system.

1. Basic Commands

a) CREATE DATABASE finance

Create the database:

Import the new dataset:
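The screenshots aside, the setup might look something like the sketch below (SQL Server syntax; the column types and the file path are placeholders, since the post doesn’t show them):

```sql
-- Create the database to hold the Yahoo-sourced tables
CREATE DATABASE finance;

-- Then, inside the new database, import each CSV
-- (illustrative column list; the real schema comes from the Yahoo export)
USE finance;
CREATE TABLE Apple (
    [Date]   DATE,
    [Open]   FLOAT,
    [High]   FLOAT,
    [Low]    FLOAT,
    [Close]  FLOAT,
    [Volume] BIGINT
);
BULK INSERT Apple
FROM 'C:\data\Apple.csv'          -- placeholder path
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
```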

b) SELECT & FROM: to retrieve attributes of data from the specified table,

Wanting a quick overview of the tables that I had just imported, I selected the top 5 rows across all attributes and printed the results for the three tables:
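In T-SQL, that quick look is just TOP 5 repeated per table:

```sql
SELECT TOP 5 * FROM Apple;
SELECT TOP 5 * FROM Amazon;
SELECT TOP 5 * FROM Meta;
```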

c) SELECT DISTINCT: it eliminates duplicate rows and displays only the unique records,

I’m interested in discovering more information about the Order Details (table 2). I think it would be interesting to note if each order only contained one (line) item as opposed to having multiple items in each order.

When selecting all rows in this table, SELECT * FROM OrderDetails, it yields 1,500 rows:

Yet, I can implement a DISTINCT command, which returns only the unique (or distinct) values of a particular column. In this instance, Order_Id.
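As a sketch (the 1,500-row count comes from the post; the distinct count depends on the data):

```sql
-- 1,500 line items in total...
SELECT * FROM OrderDetails;

-- ...but only the unique order IDs
SELECT DISTINCT Order_Id FROM OrderDetails;
```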

I will use an addendum of this example in the GROUP BY section below.

d) WHERE: it filters the records and shows only those that satisfy the given condition,

Should management be interested in those who are ordering above a certain quantity, this is where we can use the WHERE clause to act as a filter for us, similar to the filter option at the top of a column in an Excel table:
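A minimal sketch (the quantity threshold is illustrative):

```sql
SELECT *
FROM OrderDetails
WHERE Quantity > 11;
```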

e) AND, OR, NOT: NOT excludes records where the condition is true, while AND and OR are used to combine multiple conditions,

AND:

If I was interested in determining the results of purchases, based on two conditions being true, I can introduce the AND command. In this example, I am asking for all details about orders where orders have a quantity greater than 11 and generated profit.
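That query, as described:

```sql
SELECT *
FROM OrderDetails
WHERE Quantity > 11
  AND Profit > 0;
```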

OR:

If I was interested in finding out the details of either Phones or Stole sales (two of the sub-categories in this dataset), I can use the OR command to retrieve this data.
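A sketch, assuming the sub-category column is named Sub_Category (the post doesn’t show the exact column name):

```sql
SELECT *
FROM OrderDetails
WHERE Sub_Category = 'Phones'
   OR Sub_Category = 'Stole';
```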

NOT:

I was interested in discovering information around profit on orders, specifically between $1 and $9.99.
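One way to phrase that with NOT — exclude everything outside the $1–$9.99 band:

```sql
SELECT *
FROM OrderDetails
WHERE NOT (Profit < 1 OR Profit > 9.99);
```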

In SQL, you typically place the GROUP BY clause before the ORDER BY clause in a query. The reason for this order is that GROUP BY is used to group rows that have the same values in specified columns, while ORDER BY is used to sort the result set based on one or more columns.

f) GROUP BY: it groups identical data,

If I am interested in finding the number of line items that each Order_Id contains, I can GROUP BY Order_Id and count the rows in each group:
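In code:

```sql
SELECT Order_Id, COUNT(*) AS Num_Items
FROM OrderDetails
GROUP BY Order_Id;
```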

g) ORDER BY: it sorts the data in ascending or descending order

Looking at the screenshots above, it’s clear that the results are being ordered by the index number in the very first column. However, if I’d like to see the results ordered by another column, Profit for example, this is where the ORDER BY command enters the arena.

This image shows that the results are now ordered by Profit. However, ORDER BY sorts in ascending order by default. For this particular circumstance, I’d like to see the profit in descending order:
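The descending version:

```sql
SELECT *
FROM OrderDetails
ORDER BY Profit DESC;
```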

h) HAVING: data aggregated by Group By can be further filtered out here,

I was interested in the aggregated order totals for purchasers who had more than 9 line items. To do this, I needed to GROUP BY the aggregated variable (Order_Id) and then apply the HAVING clause after the grouping. This allows me to apply the condition after the data has been grouped.
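The GROUP BY example above, extended with a HAVING filter:

```sql
SELECT Order_Id, COUNT(*) AS Num_Items
FROM OrderDetails
GROUP BY Order_Id
HAVING COUNT(*) > 9;
```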

i) Aggregate functions: aggregate functions like COUNT() (see previous screenshot), MAX(), MIN(), AVG(), and SUM() are used to perform operations on the given data.

MAX: If I wanted to find the order with most profit:

MIN: If I wanted to find the order with the largest loss (i.e., the minimum profit):

AVG: Using the built-in AVG function, I calculated the average profit across all sales.

SUM: Since I didn’t want negative profit to skew this calculation, I added a WHERE clause to filter profits to a certain band, i.e., profit between $0 and $100.
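The four aggregates described above, sketched in one place:

```sql
SELECT MAX(Profit) AS Max_Profit  FROM OrderDetails;  -- most profitable order line
SELECT MIN(Profit) AS Min_Profit  FROM OrderDetails;  -- largest loss
SELECT AVG(Profit) AS Avg_Profit  FROM OrderDetails;  -- average across all sales
SELECT SUM(Profit) AS Band_Profit FROM OrderDetails
WHERE Profit BETWEEN 0 AND 100;                       -- sum within the $0-$100 band
```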

2. CASE WHEN

I like to think of ‘CASE … WHEN … THEN’ as the SQL equivalent of Excel’s IF(condition, true, false) function. For this case, it was necessary to transform the Category names on all Order Details: Clothing to Clothes, Electronics to Electrical, and Furniture to SittingImplements. I don’t question why people down the pipeline are referring to furniture in a particular way, I just fulfil the brief, man.
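The renaming described above, as a CASE expression:

```sql
SELECT Order_Id,
       CASE Category
           WHEN 'Clothing'    THEN 'Clothes'
           WHEN 'Electronics' THEN 'Electrical'
           WHEN 'Furniture'   THEN 'SittingImplements'
           ELSE Category                 -- leave anything else untouched
       END AS Category
FROM OrderDetails;
```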

3. Joins

Finance dataset:

a) Inner

An inner join returns the crossover (the intersection) of two data tables based on certain criteria. In this case, I have used the Amazon and Apple financial data to highlight the differences between the joins. I removed more dates from Amazon than from Apple. This first full join shows that difference:

Performing the inner join on the date column will only output results between 2018-09-11 and 2018-10-11 for both Amazon and Apple:
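A sketch of that join (assuming each table has Date and Close columns, as in a typical Yahoo export):

```sql
SELECT ap.[Date],
       ap.[Close] AS Apple_Close,
       am.[Close] AS Amazon_Close
FROM Apple AS ap
INNER JOIN Amazon AS am
    ON ap.[Date] = am.[Date];   -- only dates present in BOTH tables survive
```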

b) Left

For this case, I truncated the Amazon dataset back two years. You can see from this first, left, join that the truncated table was Amazon.

Shifting the tables around highlights how the left join is functioning: you join the table on the right (the second table) onto the table on the left (the first table).
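With the same assumed columns, the left join keeps every Apple date and fills Amazon with NULL where no match exists:

```sql
SELECT ap.[Date],
       ap.[Close] AS Apple_Close,
       am.[Close] AS Amazon_Close
FROM Apple AS ap
LEFT JOIN Amazon AS am
    ON ap.[Date] = am.[Date];   -- Apple rows always kept; Amazon may be NULL
```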

c) Right
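A right join is simply the mirror image: every row from the second (right) table is kept instead. A sketch with the same assumed columns:

```sql
SELECT ap.[Date],
       ap.[Close] AS Apple_Close,
       am.[Close] AS Amazon_Close
FROM Apple AS ap
RIGHT JOIN Amazon AS am
    ON ap.[Date] = am.[Date];   -- Amazon rows always kept; Apple may be NULL
```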

4. String Formatting

a) CONCAT

A very typical example for concat is to join first and last names.

Personally, I will commonly use the CONCAT function in Excel when developing code for renaming files through a Windows cmd prompt or Mac Terminal window.
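The names example might look like this (the Clients table and its column names are assumptions for illustration):

```sql
SELECT CONCAT(First_Name, ' ', Last_Name) AS Full_Name
FROM Clients;
```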

b) SUBSTR

The SUBSTR command is used when selecting particular parts of a varchar (or string) variable. Let’s take a look at the SalesTarget table.

Instead of seeing the year in the Month_of_Order_Date column, I’d only like to see the month.

I can see that all months are listed as a three-letter pattern. This means that I can use hard-coded numbers in the SUBSTRING command: start at character 1 and take 3 characters. (Note that SQL’s SUBSTRING is 1-based, unlike the 0-based indexing common in many programming languages.)
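In code:

```sql
SELECT SUBSTRING(Month_of_Order_Date, 1, 3) AS Order_Month
FROM SalesTarget;
```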

See the CAST command (last command on this post) for how I would search for the hyphen character ‘-’ rather than hard coding the number of characters to search through.

c) TRIM

Just like a bottle of milk, TRIM allows you to skim off particular parts. Similar to SUBSTRING, I could use TRIM to take off the hyphen-and-year pattern at the end of the date column.

Now that I’ve taken care of the year 2019, I need to sort out the 2018 pattern. For this, I can use a nested TRIM: I simply place one TRIM call inside another, so the inner TRIM strips the first year pattern and the outer TRIM then strips the next year I’ve specified.
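A sketch of the nested form (SQL Server 2017+ supports TRIM(characters FROM string); note that it strips any of the listed characters from the ends, not the exact substring):

```sql
-- inner TRIM handles 'Mar-19'-style values, outer TRIM handles 'Nov-18'-style
SELECT TRIM('-18' FROM TRIM('-19' FROM Month_of_Order_Date)) AS Order_Month
FROM SalesTarget;
```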

d) COALESCE

This function is used to return the first non-null expression in a list of expressions. It is often used to handle situations where you have multiple columns or values, and you want to retrieve the first non-null value among them. I’ll pull up some Client Appointment times and analyse for any Null values.

I was able to locate them, specifically for the Client Number 262.

Since I’m only interested in these Null values, I will pull those two attributes up.

Like TRIM, I can nest this into my query: COALESCE supplies a replacement for any Null value. I am aware that the client wanted the appointment to start at 19:47, so I will code that into this command. For the one Null value, I tell SQL to pass through the non-null [Start] value where one exists, and otherwise print ‘7:47:00 PM’ into the start time.
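A sketch of that fix (the Appointments table name and Client_Number column are assumptions; Client 262 comes from the post):

```sql
SELECT Client_Number,
       COALESCE([Start], '7:47:00 PM') AS [Start]  -- first non-null wins
FROM Appointments
WHERE Client_Number = 262;
```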

5. Subqueries

Having already covered nested functions, you now have the basis for the next concept: the subquery. Also known as a nested or inner query, a subquery is a query embedded within another SQL query. Subqueries are used to retrieve data that the main query uses as a condition to further restrict or enhance the results. They allow you to perform more complex and dynamic queries by breaking them down into smaller, more manageable parts.

I was interested in joining two tables together – OrdersList and OrderDetails – but didn’t want to deal with excessive amounts of irrelevant information to this query. So, I selected the DISTINCT order IDs from the OrdersList and joined them with the OrderDetails table.
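One way to sketch that, with the DISTINCT subquery doing the de-duplication before the join:

```sql
SELECT d.*
FROM OrderDetails AS d
INNER JOIN (
    SELECT DISTINCT Order_Id
    FROM OrdersList
) AS o
    ON d.Order_Id = o.Order_Id;
```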

6. Stored Procedures

Whilst writing a lot of these queries, I got tired of typing out the same multi-line queries over and over again. So, I decided to create a procedure for each of the tables: if I want to see all attributes from a table, a short procedure call now does it.
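A minimal sketch of one such procedure (the procedure name is my own placeholder):

```sql
CREATE PROCEDURE ShowOrderDetails
AS
BEGIN
    SELECT * FROM OrderDetails;
END;

-- The multi-line query is now a one-liner:
EXEC ShowOrderDetails;
```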

7. Cast

The CAST command is used to explicitly convert a value from one data type to another. This conversion is often necessary when you want to ensure that data is treated as a specific data type in SQL operations or when you need to compare or combine values of different data types.

Since the date data was listed in a particular way, Mmm-YY, I wanted to transform it into a proper date format for later manipulation. The attribute name denotes that the day has no impact on the data, so we can assume that adding any day to the date will have no dramatic impact on future calculations.

SELECT CAST(CAST(MONTH(SUBSTRING(Month_of_Order_Date, 1, CHARINDEX('-', Month_of_Order_Date, 0) - 1) + '1,1') AS VARCHAR(10))
    + '/01/'
    + CAST(SUBSTRING(Month_of_Order_Date, CHARINDEX('-', Month_of_Order_Date, 0) + 1, LEN(Month_of_Order_Date)) AS VARCHAR(10)) AS DATE) AS Date,
    Category, [Target]
FROM SalesTarget
ORDER BY 1, 3 DESC;

Breaking that query down, with a comment on each part:

SELECT CAST(CAST(
    MONTH(                 -- MONTH() returns the month number, i.e. Mar = 3
        SUBSTRING(         -- SUBSTRING() is similar to Excel's MID function
            Month_of_Order_Date, 1,               -- search this column's value, starting at the first character
            CHARINDEX(     -- CHARINDEX() is similar to Excel's SEARCH function
                '-', Month_of_Order_Date, 0) - 1  -- look for the '-' character; the -1 drops the hyphen itself
        ) + '1,1') AS VARCHAR(10))
    + '/01/'
    + CAST(SUBSTRING(Month_of_Order_Date, CHARINDEX('-', Month_of_Order_Date, 0) + 1, LEN(Month_of_Order_Date)) AS VARCHAR(10)) AS DATE) AS Date,
    Category, [Target]
FROM SalesTarget
ORDER BY 1, 3 DESC;
