In the last post, we identified outliers in a dataset following specific steps. We then discussed the effects that these outliers have on data and the impacts they have on decision-making. This week is about continuing this foresight and seeing how we can forecast future sales for each individual product that we have been tracking.
Here’s a reminder of the last post:
- What is a Z-score?
- Calculate Average and Standard Deviation
- Standard Deviation
- Calculate Z-score
- What’s an Outlier?
- Identify Outliers
- The Issue with Z-score
- Product Level: Average and Standard Deviation
- Average of Product
- Standard Deviation of Product
- Product Level: Z-score
- Conditional Formatting
- Why Do We Care About Outliers?
Forecasting Monthly Sales using this Data
Now that you have a reminder of where we’ve come from (and where we now are), here is the spreadsheet that we will be working through. Note: this is the same document as last week, with an added sheet.
Transform the Data
To best analyse this data, I will create a matrix using product and date as dimensions. There are three variables that we are using: product, date, and sales. When using the INDEX-MATCH solution, it is often depicted by showing how to lookup (and match) a row and column. Although we will be organising our data in this manner (dates on the top row and products down the first column), we will need to match two columns.
The original data shows this:
We are interested in the Sales for each Product on a certain Date. However, there are some Products that will not see Sales on certain dates and we are accounting for those non-entries.
TRANSPOSE(UNIQUE(Dates))
Firstly, let’s automatically set up the number of columns we need by obtaining all of the ‘UNIQUE’ dates from the dataset. When performing this command, it will create a vertical list of those dates as the data is stored in a column:
So, let’s just add the TRANSPOSE function to the beginning of UNIQUE:
UNIQUE(Products)
Let’s do the same thing for the Products that we did for the Date, except not transform it:
I know that I am going to show two SPARKLINE formulas later, so let’s add the columns for those now:
Configuring Matrix
For this INDEX-MATCH formula, we will be looking to match two conditions from two different columns; INDEX(MATCH(condition_1,condition_2)). This is the formula that has been written for this matrix:
=ARRAYFORMULA(INDEX($D$6:$D$228,MATCH(1,(O$5=$C$6:$C$228)*($L6=$B$6:$B$228),0)))
Let’s arrange that in a nicer way:
=ARRAYFORMULA(
INDEX($D$6:$D$228,
MATCH(1,
(O$5=$C$6:$C$228)*($L6=$B$6:$B$228),0)
)
)
As I was mentioning earlier, this matrix is going to be transposing two columns (B and C) into a row of data (C) and column (B). Let’s start with the INDEX portion of this formula.
Index
Index will be the return value, after the conditions have been met. In this case, our return value(s) are the Sales (column D).
Match
To retrieve those Sales values, we need to match two columns. INDEX has the following syntax though: INDEX(reference, [row], [column]). So, we need to place two arrays within the row variable. To do this, we will integrate these two arrays into the MATCH formula. MATCH has the following syntax: MATCH(search_key, range, [search_type]). Let’s take a closer look at the MATCH formula.
MATCH(1,(O$5=$C$6:$C$228)*($L6=$B$6:$B$228),0)
Firstly, why are we attempting to match 1 as the search_key? When MATCH performs the operation on the arrays within the range, it actually outputs as a series of True and False values, depending on whether the lookups have valid values. If they do, it will return True (or 1), and if not, False (or 0).
Secondly, we want to match by two criteria, so we need to multiply the arrays by one another; does the date in cell O5 match the date in column C? Does the item in cell L6 match the item in column B? If so, then True (1), else False (0).
Lastly, we want exact matches, so we enter the 0 for the search_type.
Reducing Bias in the Calculation
=IFERROR(ARRAYFORMULA(INDEX($D$6:$D$228,MATCH(1,(O$5=$C$6:$C$228)*($L6=$B$6:$B$228),0))),)
When creating this matrix, we need to take care of the null entries for all Products on particular Dates where there were no sales. I have three options for no sales: place a zero (0), “N/A”, or no entry.
If I were to use a 0, this would create bias in the data and the forecasting would therefore understate the potential. Entering ‘N/A’ would require additional steps to analyse the data, which is superfluous. Therefore, I have decided to enter a blank/null cell in lieu of no sales.
Can you think of one last thing that we need to take care of before we continue with the forecasting analysis? We need to remove the outliers. These are, afterall, creating bias in the dataset and will skew our predicted sales over the next month. Manually combing through the data and deleting entries individual outlier entries would be time consuming and bad practice for future analysis on this dataset. So, let’s include it in the above formula.
Now that we know that the multiply function for these segmented arrays provides either a 1 or 0 for the matrix multiplication, let’s just add another condition for these outliers. So, I will add this next section to the end of the first MATCH function:
*(“No”=$O$6:$O$228)
This will remove any of the outlier entries in the table (Q5:AP27), reducing the biassed impact on the forecasting model whilst maintaining the original dataset.
Forecasting
Now that we have found all the necessary data for each product over the month, we can use this information to forecast sales for each of those products over the next ‘x’ amount of days. For this example, I have chosen 4 days. Before we forecast these sales though, let’s have a look at what the function is requiring of us:
FORECAST(x, data_y, data_x): Calculates the expected y-value for a specified x based on a linear regression of a dataset.
- x: The value on the x-axis to forecast.
- data_y: The range representing the array or matrix of dependent data.
- data_x: The range representing the array or matrix of independent data.
=FORECAST(AL$5,$O6:AK6,$O$5:AK$5)
For our circumstances, the independent variable, that we can reliably measure, is the date. So, this will be our x values. Since we have no other tracked business drives, we are measuring our Sales revenue, hence being the dependent variable.
Since I want to copy this formula both right (across 4 columns) and down (22 rows), I want to make sure that I don’t need to adjust any code to automate this. So, this is why I have set the relative and absolute referencing the way I have:
- AL$5: this is my date (independent) variable, so I don’t want this reference to move down/up rows, hence locking it into 5. However, I do want it to have the ability to move right along the date line.
- $O6:AK6: this is the historical dependent data (sales values) and it is what the forecasting values require. Since I want 02/02/2024 to use the forecasted data from 01/02/2024, I will keep AK6 relatively referenced, so that it can move across columns, and down rows. However, O is the first column for the sales data, and I don’t want it to have the ability to move (as this is necessary for accurate probability predictions). But, the 6 in O6 is required to move down/up the different rows.
Sparkline
Now that this actual and forecasted Sales data has been calculated, we can use the in-cell miniature chart tool to show us the trend for each product.
Actual
Forecast
For this sparkline, we are only using the last four (forecasted) days at the end of the table:
Chart
It’s all well and good to see the numbers, and two small sparklines charts of each product, but let’s break it out into a bigger visual representation. I want to make this a dynamic chart and allow users to interact with it. So, they will need to be able to pick a product to analyse and the line chart should update automatically so the product is visible.
First, I’ll create a new sheet named ‘Chart’ (to reduce confusion) and place the following headers in a new table:
- Product,
- Date,
- Sales, and
- Forecast.
Next, I will create a data validation, drop down list (from a range), so that a user can select individual Products.
Then, I will TRANSPOSE the data plot from the ‘Forecasting Sales’ sheet.
Next, I will create an INDEX-MATCH-MATCH function for the Sales column. I have created an IF statement to return 0 where there were no sales for a product on a certain date. I have chosen to do this so it doesn’t create visual discontinuity on the line graph:
=IF(
ISBLANK(
INDEX(‘Forecasting Sales’!$T$6:$AT$27,
MATCH($H$2,‘Forecasting Sales’!$Q$6:$Q$27,0),
MATCH($I2,‘Forecasting Sales’!$T$5:$AT$5,0)
)
),
0,
INDEX(‘Forecasting Sales’!$T$6:$AT$27,
MATCH($H$2,‘Forecasting Sales’!$Q$6:$Q$27,0),
MATCH($I2,‘Forecasting Sales’!$T$5:$AT$5,0)
)
)
This references all information on the ‘Chart’ sheet. (apart from the Product chosen in the drop down menu (H2) and Date (I2:I28)).
Finally, Forecast will be concerned with the last portion of the product matrix on the ‘Forecasting Sales’ sheet.
=IF(
ISBLANK(
INDEX(‘Forecasting Sales’!$T$6:$AT$27,
MATCH($H$2,‘Forecasting Sales’!$Q$6:$Q$27,0),
MATCH($I25,‘Forecasting Sales’!$T$5:$AT$5,0)
)
),
0,
INDEX(‘Forecasting Sales’!$T$6:$AT$27,
MATCH($H$2,‘Forecasting Sales’!$Q$6:$Q$27,0),
MATCH($I25,‘Forecasting Sales’!$T$5:$AT$5,0)
)
)
Next, I created the line graph chart:
In order to include an automated ‘Sales and Forecast for: <product name>’ chart title, I needed to create a script that updates the chart on any edit that is made on this ‘Chart’ sheet.
First, I initiated the function then created 4 variables; sheet (‘Chart’ sheet), (select the) chart, title to insert the value into, and final text for the chart title (concatenating static and dynamic text). Then I created the formula that tracks and updates the instance of the graph title.
Prior to visualising this data, we removed the outliers from the analysis. A question that might be flowing through your head is, ‘why?’ or ‘how much of an impact would it really have?’. Let’s have a look. This is the chart (for ‘Drinking Coco’) prior to removing the outliers:
And this is after removing the outliers:
The main difference that I can see are the slopes of the trend lines (R2). Although the correlation for the historical sales data is not extremely strong, the updated (no outliers present) has had a 3.55 increase in the trendline. Even though further analysis needs to be conducted on this dataset, it is already more accurate than before.
Conclusion
This can now give us some insights into how each product may perform over the next few days. We can use this table to track this performance and update our business unit performance. This allows us to create more accurate models for future reference, and reduce the variance between our budgeted and actual analysis.
Appendix: Updated Original Solution
When I had originally started this project, I used the following QUERY formula to retrieve the sales figures. However, this was going to create a few extra steps that could be avoided using the INDEX-MATCH method.
QUERY
=QUERY($C$6:$D,“SELECT Col2 WHERE Col1 = date ‘”&TEXT(L$6,“yyyy-mm-dd”)&“””)


2 responses to “31 – Forecasting Sales Data”
[…] Project 4 – Analysing Outliers with Z-Scores […]
[…] our good, old friend SPARKLINE? Well, we can actually get a bit more pop out of this function by integrating some colour. […]