- Learn the Fundamentals of Statistics
- Learn SQL
- Learn Python for Data Analysis
- Learn Data Manipulation and Visualization
- Learn Statistical Analysis
- Learn Data Visualization Tools
- Work on Projects
- Learn Data Storytelling
The world is full of numbers. How many people attended the FIFA Women’s World Cup? How many people watched the World Cup in that same city? How many people tuned in across that country? In neighbouring countries? In other nation-states? What makes statistics interesting is the story behind these numbers. Using some simple formulas, we can start to infer information about people (and the decisions that they make). From there, we can compare this information across different areas and time periods to see how it changes. This means that, with enough accurate evidence, it is possible to forecast (or predict) future numbers with reasonable accuracy.
Many of the fundamental statistics in the following set can be traced back to the grade 7 and 8 Australian mathematics curriculum. They are useful measures for understanding various aspects of datasets and delivering insights for business decisions. I have arranged a random array of 50 values (each between 1 and 1000) in a Google Spreadsheet. The following statistical measures all use this randomised dataset.
Mean: the average of a data set, i.e. the sum of all the values divided by the number of values.
This formula is the formal way of asking: how many values do we have in this set (50)? What is the total sum of the individual values? What is that total sum divided by the number of values?
Mean = (510 + 231 + 382 + 938 + …) / 50 = 26240 / 50 = 524.8
The values in this dataset are spread out, and their mean is 524.8.
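As a rough sketch, the same calculation can be reproduced in Python. The first part only re-checks the total and count quoted above. The `sample` list is a hypothetical stand-in (four values from the worked example plus two invented ones), not the 50-value spreadsheet dataset.

```python
from statistics import mean

# Check the arithmetic quoted above: a total of 26240 across 50 values.
reported_total = 26240
reported_count = 50
reported_mean = reported_total / reported_count
print(reported_mean)  # 524.8

# Hypothetical sample: the first four values appear in the post, the last
# two are invented for illustration. This is NOT the author's dataset.
sample = [510, 231, 382, 938, 44, 505]

by_hand = sum(sample) / len(sample)  # total sum divided by number of values
print(by_hand)                       # 435.0
print(mean(sample) == by_hand)       # True
```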
Use case: When running a service business, it can be helpful to measure the efficacy of customer support. For instance, a manager might be interested in the average number of calls that a customer support team makes per day. This average can be compared across business units or physical locations and investigated when necessary, for example when the underlying data looks skewed.
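Median: the middle value of a list of data when arranged in order. With an odd number of terms, the median is the single middle term; with an even number of terms, it is the average of the two middle terms.
Since we are working with an even number of terms (50), we need to use the second form.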
Median = (Middle term1 + Middle term2)/2 = (505+510)/2 = 507.5
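The odd/even rule can be sketched in a few lines of Python. The helper name `median_by_hand` and the short lists are assumptions made for illustration; the even-length list is chosen so that its two middle terms are the 505 and 510 used above, but it is not the full dataset.

```python
from statistics import median


def median_by_hand(values):
    """Return the median using the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of terms: take the single middle term.
        return ordered[mid]
    # Even number of terms: average the two middle terms.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical samples, not the author's 50-value dataset.
even_sample = [938, 505, 231, 510]
odd_sample = [938, 505, 231]

print(median_by_hand(even_sample))  # 507.5
print(median_by_hand(odd_sample))   # 505
print(median(even_sample))          # 507.5 (standard library agrees)
```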
Use case: The most commonly discussed use case is probably housing, i.e., prices and rent. The distribution of prices in this industry can be skewed, because house prices depend on many factors, including size, number of rooms, and location. In a skewed distribution, a few very expensive properties can pull the mean upwards, so the median often gives a more representative "typical" price.
Mode: the number in a set of numbers that appears most often.
Mode = 44
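One way to find the mode is to count how often each value occurs and take the most frequent one. The sketch below uses a hypothetical list in which 44 repeats most often; it is not the spreadsheet dataset.

```python
from collections import Counter
from statistics import mode

# Hypothetical sample, invented for illustration.
sample = [44, 510, 231, 44, 382, 938, 44, 510]

counts = Counter(sample)  # value -> number of occurrences
most_common_value, frequency = counts.most_common(1)[0]

print(most_common_value, frequency)  # 44 3
print(mode(sample))                  # 44
```

If several values tie for the highest count, `statistics.multimode` returns all of them rather than just the first.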
Use case: One application of the mode is finding the most frequently recurring buyers or users in a business unit.
Variance: measures variability from the average or mean. It is the average of the squared differences between each value and the mean.
Variance = 81,435.2
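The post does not say whether 81,435.2 is a population variance (divide by n) or a sample variance (divide by n − 1), so the sketch below shows both on a small hypothetical list. Spreadsheet functions and Python's `statistics` module both offer the two variants.

```python
from statistics import pvariance, variance

# Hypothetical sample, invented for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
mean_value = sum(sample) / n
squared_diffs = [(x - mean_value) ** 2 for x in sample]

population_var = sum(squared_diffs) / n    # divide by n
sample_var = sum(squared_diffs) / (n - 1)  # divide by n - 1

print(population_var)        # 4.0
print(round(sample_var, 4))  # 4.5714
print(pvariance(sample))     # population variance from the standard library
print(variance(sample))      # sample variance from the standard library
```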
Use case: Measuring or estimating variance is useful for analysis, especially for business decisions, because it shows how widely values are spread around the average. One example is users’ ratings of the same product: a high variance suggests that opinions are divided. Do their reviews reflect quality issues, advertising inaccuracies, delivery beyond expectations, or buyer’s remorse?
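Covariance: measures the directional relationship between two variables, for example the returns on two assets. A positive value means the two tend to move together; a negative value means they tend to move in opposite directions.
cov(x, y) = 83.32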
Use case: If we’re interested in how two complementary products sell together, such as movies and popcorn, pencils and notebooks, or coffee and muffins, we can use a covariance measure on a dataset to estimate whether their sales rise and fall together.
Standard Deviation: a measure of how dispersed the data is in relation to the mean. It is the square root of the variance.
Standard Deviation = 285.4
Use case: If a weather presenter has asked a freelancer for weather predictions, the freelancer’s prediction can have either a small standard deviation (higher probability of accurate forecasting) or a larger standard deviation (lower probability of accurate forecasting). If the provided weather data from a location yields very little change in temperatures (Bogota, Colombia) then the prediction will be quite accurate due to the low spread of temperature data for that location; a lower standard deviation. Yet, if the location is such that there are dramatic changes in temperature (Oklahoma, United States of America) then the prediction will not be as accurate due to the larger spread of temperature data for that location; a higher standard deviation.
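Two quick checks can be sketched in Python. The first confirms that the standard deviation reported above is the square root of the reported variance. The second compares two hypothetical temperature series, one stable and one volatile; these are invented numbers, not real readings from Bogota or Oklahoma.

```python
import math
from statistics import pstdev

# Figures reported in the post.
reported_variance = 81435.2
print(round(math.sqrt(reported_variance), 1))  # 285.4

# Hypothetical daily temperatures in degrees Celsius (invented data).
stable_city = [14, 15, 14, 15, 14, 15]
volatile_city = [2, 18, 30, 9, 25, -3]

stable_sd = pstdev(stable_city)      # population standard deviation
volatile_sd = pstdev(volatile_city)

print(stable_sd)                # 0.5
print(stable_sd < volatile_sd)  # True
```

The smaller spread for the stable series is what makes a forecast close to its mean more likely to be right.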
Correlation: covariance of two variables divided by the product of their standard deviations.
r = 0.0204 (2.04%)
Use case: Similar to covariance, a manager might be interested in the relationship between their social media mentions (X, Facebook, or Instagram) and website activity. A report can be generated for a time-series analysis, drawing a test sample over a two-week period. The quality of mentions can then be analysed against the level of activity the business observes.
Regression analysis: a statistical technique that relates a dependent variable to one or more independent (explanatory) variables.
| Statistic | First value | Second value |
| --- | --- | --- |
| Coeff | 0.0010 | 24.95 |
| Std err | 0.0074 | 4.39 |
| r^2; se_y | 0.04% | 14.73 |
| F; d_f | 0.0201 | 48 |
| ss_reg; ss_resid | 4.35 | 10408.15 |
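The figures in this table are internally consistent, which can be checked from the two sums of squares and the degrees of freedom. The sketch below assumes the table follows the usual spreadsheet regression-statistics layout with a single explanatory variable; that layout is an assumption, not something the post states.

```python
import math

# Figures copied from the table above.
ss_reg = 4.35        # regression sum of squares
ss_resid = 10408.15  # residual sum of squares
d_f = 48             # residual degrees of freedom

# r^2 is the share of total variation explained by the regression.
r_squared = ss_reg / (ss_reg + ss_resid)

# Standard error of the y estimate.
se_y = math.sqrt(ss_resid / d_f)

# F statistic, assuming one explanatory variable (1 regression d.f.).
f_stat = ss_reg / (ss_resid / d_f)

print(f"{r_squared:.2%}")  # 0.04%
print(round(se_y, 2))      # 14.73
print(round(f_stat, 4))    # 0.0201
```

An r^2 this close to zero says the explanatory variable accounts for almost none of the variation, which is what one would expect from randomly generated values.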
Use case: Building on the correlation (above), a manager may use those findings to forecast the results of potential marketing and/or product campaigns based on particular metrics. Using social media mentions, website activity, product promotions, and sales numbers as the independent variables in the model, a manager can relate these activities to potential future earnings.
Although this has been a more formula- and number-“heavy” post, I aim to integrate these learnings in a meaningful way as I move through the Data Analyst Roadmap.

