52 – How I Constructed the Final Data Analyst Roadmap Milestone

Homework Review

It’s been two weeks since this post, but how did you go with the homework? Here is a link to the solutions I crafted, and the (shortened) implemented solutions below:

https://docs.google.com/spreadsheets/d/1SMesfxuEwgVPPm3O8UJgaeQ4RtWNrkd39y2UEkgb9jc/edit#gid=1362202755

  1. Default currency
  2. Conditional formatting:
    1. =IF(AND($B2>$D$2,$C2="Tulsa"),1,0)
  3. Referencing other sheets

=SUBSTITUTE(SheetB!A1,"want","here")

=SUBSTITUTE(SUBSTITUTE(SheetB!A2,"here","guilty"),"alibi","guilty")

=SUM(SheetB!A5:A12)

=AVERAGE(SheetB!A5:A12)

=MEDIAN(SheetB!A5:A12)

  4. Remove duplicates
  5. List of Unique IDs

=IFERROR(VLOOKUP(A3,$A$2:$B2,2,0),"EMP"&MAX(RIGHT($B$2:B2,LEN($B$2:B2)-SEARCH("p",$B$2:B2))+1))

  6. Multiple dependent dropdown menus
  7. Pivot tables
  8. REGEXREPLACE

=REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(A1, "\d+", "Y"), "e", "x"),"[.',]", "p")

  9. ARRAYFORMULAs

=IFERROR(ARRAYFORMULA({{1;2;3;4},{1;2;3;4}^2,{1;2;3;4}^(1/2),{1;2;3;4}/(RANDBETWEEN(0,4))}),"WHOOPSY!")

  10. View Google Sheets like SQL

=QUERY(QUERY('FAQ7'!A2:I25,"SELECT Col2, Col4, Col5, Col6",1),"SELECT Col1, Col2, Col3, Col4, Col3/((Col2*Col2)/100)*100")

=VLOOKUP(ROUND(E2,2),$E$26:$F$29,2,1)

Data Analyst Roadmap – Data Storytelling: …My Journey

As I stated at the beginning of last week’s post, it took a lot of mental energy to create. That’s because I had actually built two different projects before settling on the one I used; I spent a few weeks planning, preparing, developing, and analysing datasets before landing on the final one.

What I’ve decided to do in this post is provide a brief overview of how I approached these projects and highlight some major learning points along the way. I’ll present those learnings after the story of my approach; that way you have the context of the journey taken and can see how these problem-solution combinations fit into it.

I started by researching how to tell a story. It seems simple, but it is something I have struggled with my entire life. I totally understand the structure of a story (short or long), but creating anticipation, tension, and relief has always eluded me. Next was learning how to tell a story with data. Here are two videos that I found quite helpful: video 1, video 2. That seems simple in itself, but I often need to remind myself that not everyone looks at a spreadsheet in the same way, especially not the way that I would look at one.

After I had prepped my brain for planning a story ahead of time, I reminded myself of the roadmap context and refreshed what I had covered at each previous milestone. Whilst thinking about each of those milestones, I wanted to keep in mind what kind of story I wanted to tell. When you tell a story, you often talk about the things you know; you want to remain authentic. I stuck to this approach when I started looking around for different ideas. I had recently taken a trip to Queensland for work and was, naturally, comparing life in Tasmania to life there (the thing I knew, since I had grown up in Queensland).

As a job in education was/is still fresh in my mind, I thought about all of those studies that create links between education level and ‘quality of life’. Since I still participate in youth engagement, I was considering the services that are provided for young people in the Break O’Day region versus what is available to young people in Queensland. This got me thinking about the types of funding that governments (on various levels) provide to individuals and groups in different councils. For this portion of the post, here is the spreadsheet that I used:

https://docs.google.com/spreadsheets/d/18TDkH3TnAAccf7xDNR-y6sWBnRy3wU91W_27ynt-srY/edit?usp=sharing

I was able to track down publicly available data, provided by the Queensland government, that outlined the expenditure for the financial year and where it was allocated. I started digging through the data and saw how much detail was actually provided for each “transaction”. I began with a brief look by calculating how much was being allocated to each suburb, using a combination of UNIQUE() and SUMIF() (sheet: ‘2022-23-expenditure-consolidated’). Whilst analysing the set, I realised that not every suburb might be accounted for in this funding allocation, so I created a new sheet (sheet: ‘Sheet1’) where I could analyse the dataset from a broader perspective.
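If you want to replicate that allocation step, a minimal sketch looks something like the pair below. The column references are placeholders rather than the real layout (which lives in the linked spreadsheet): I’m assuming the suburb sits in column C and the amount in column F of the expenditure sheet, with the UNIQUE output spilling into column K of the analysis sheet.

=UNIQUE('2022-23-expenditure-consolidated'!C2:C)

=ARRAYFORMULA(SUMIF('2022-23-expenditure-consolidated'!C2:C,K2:K,'2022-23-expenditure-consolidated'!F2:F))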

A brief Google search yielded a website listing all Brisbane suburbs. Given what I knew about the IMPORTxxx functions in Google Sheets, I decided to try IMPORTHTML. After I imported the first few tables, I realised it wasn’t particularly automated and wouldn’t allow for the dynamism I wanted. I often use the analogy, when discussing “tech stuff” with people, that I’m a lazy tech user: I will gladly spend the time learning a program’s shortcuts if it means I can reduce my search-and-click time to a few keyboard combinations. Scroll below (Learnings; 1) to see how I configured the combination of QUERY and IMPORTXML for each suburb within Brisbane.

Following this parsimonious approach to learning, I noticed that these tables of Brisbane suburbs contained URL links to each location on Google Maps. Thinking ahead, I wanted to create a map of Brisbane showcasing the amount of money allocated to various areas (and even provide a time-series evaluation to see how this allocation changed over time, between suburbs and amounts). Scroll below (Learnings; 2) to see how I prepped the data for the map (that I didn’t end up making).

Next, I wanted to add some context for each suburb. In 2021, I was a field officer for the Census, so I knew the type of information that is collected about people. Remembering that the Australian Bureau of Statistics (ABS) reports on various statistics, including household income, dependents, number of cars, etc., I used the ABS website to find datasets for each suburb throughout Queensland. I found the DataPacks to be extremely useful for this task. After importing ‘2021Census_G02_QLD_SA2.csv’ into Google Sheets (sheet: ABS_Income_data), I could look up each suburb and pull its average weekly household income:

=IFERROR(VLOOKUP("*"&D3&"*",ABS_Income_data!$B$2:$I,8,0),)

This IFERROR was an important component for the later analysis, as a blank versus a 0 value would affect whether or not the results were statistically significant. Some suburbs were not included in this DataPack, which can be seen in the spreadsheet highlighted light yellow. For those, I manually adjusted the cell using an IMPORTXML function that searches for the suburb. Scroll below (Learnings; 3) to see how I implemented this process.

Now that I had some good data to work with, I also wanted something to correlate between suburbs, funding, income, and disadvantage. Thinking back to the myriad studies I’ve read, I decided to look into the Index of Relative Socio-economic Disadvantage (IRSD). Wouldn’t you know it, the ABS also holds IRSD data for all of Australia. From there, I could perform a VLOOKUP for each suburb in column D.

After some lateral thinking, and given funding limitations, some suburbs were excluded from the analysis:

  • Brisbane Airport: non-residential,
  • Lytton: was often combined with Port of Brisbane/Brisbane Airport given its proximity, and it was difficult to accurately separate the suburbs,
  • Banks Creek: zero funding, but 11 population,
  • Lake Manchester: zero funding and 0 population,
  • Mt Coot-tha: zero funding and 0 population — primarily public land,
  • Port of Brisbane: zero funding, but 8 population — primarily shipping,
  • Enoggera Reservoir: zero funding, but 39 population — primarily public land.

Note:

  • England Creek: received $13,872,330 in funding, but has 0 population.

Some suburbs needed to be adjusted throughout the analysis, as there was inconsistency on the ABS website (and in its data) regarding suburbs named ‘Mount [xxx]’ or ‘Mt [xxx]’. I thought Brisbane City was an interesting ‘suburb’ to consider: people do live in the city, but the funding provided to it acts more as a public good for the greater Brisbane region, since many of those who enter the suburb do not live in it. Funding here is therefore likely to impact more people from outside the suburb than funding in any other suburb would, simply because more people enter Brisbane City than enter (and reside in) any other suburb.

With the income and IRSD for each suburb now accounted for, I started to analyse the two. If income is related to IRSD, we would expect to see a linear relationship and a relatively strong R². If that holds, we would then also expect a strong correlation between low-income areas and high government funding expenditure. Let’s start with the income and IRSD analysis.
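The regression tables further below come out of LINEST with its verbose flag switched on, which returns the coefficients, standard errors, R², F statistic, and sums of squares as a 5×2 array. A sketch of the call, assuming annual income sits in column U and the IRSD score in column T (swap in the real ranges from the spreadsheet):

=LINEST(U3:U191, T3:T191, TRUE, TRUE)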

Since some IRSD values were not listed, I originally assigned them a 0:

=IFERROR(VLOOKUP("*"&D3&"*",IRSD2!$B$2:$K,3,0),0)

I then plotted the two columns to view the correlation:

As you can see, the 0 assignment distorts the scatter plot. I will copy the LINEST result below (for later comparison):

Annual Inc on IRSD
                   IRSD            Annual Inc
Coeff              80.85117647     29658.62529
Std err            10.39083285     10958.27116
r^2; se_y          0.2445791072    23596.73637
F; d_f             60.54411983     187
ss_reg; ss_resid   33711327221     104122715931

A valid method of handling null values is to assign the average of the entries before it:

=IFERROR(VLOOKUP("*"&D5&"*",IRSD2!$B$2:$K,3,0),AVERAGE($T$3:T4))

Annual Inc on IRSD
                   IRSD            Annual Inc
Coeff              311.0916051     -217164.9896
Std err            22.38579476     23861.32164
r^2; se_y          0.5080525614    19042.18571
F; d_f             193.1219101     187
ss_reg; ss_resid   70026938673     67807104478

By assigning an average IRSD value rather than a 0, the model’s R² roughly doubles (from 24% to 50%). To my eye, the model still looks as though it exhibits heteroscedasticity, so I decided to manually run a White test:

White Test
LM (R² × obs)   95.00582898
df              2
p-value         0
F               36159722330
df1             2
df2             185
p-value         0.000000000
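For anyone unfamiliar with how that table is built: the manual White test regresses the squared residuals from the original model on the regressor and its square (plus cross products when there is more than one regressor), then compares n·R² of that auxiliary regression against a chi-squared distribution with as many degrees of freedom as there are auxiliary regressors (two here, hence df = 2):

\hat{u}_i^2 = \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 + v_i, \qquad LM = n \cdot R^2_{aux} \sim \chi^2_{2}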

This test shows that the model did exhibit heteroscedasticity. There are numerous approaches to correcting for this, but I decided to implement Heteroscedasticity-Consistent Standard Errors (HCSE). Since the number blew out of proportion (825,662,008.4), this is where I decided to stop the analysis (for now). Dividing the original RSS by this value gave the following results for the White test:

White Test
LM (R² × obs)   95.00582898
df              2
p-value         0
F               43.79482399
df1             2
df2             185
p-value         0.022572277

This shows that heteroscedasticity exists at the 5% level of significance, but not at 2%.
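For reference, the HCSE (White/HC0) correction doesn’t change the coefficients at all; it only replaces the usual homoscedastic covariance estimate with a ‘sandwich’ form built from the squared residuals. This is the textbook expression rather than anything specific to my sheet:

\widehat{\mathrm{Var}}(\hat{\beta}) = (X^\top X)^{-1} \left( \sum_i \hat{u}_i^2 \, x_i x_i^\top \right) (X^\top X)^{-1}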

Overall, this statistical analysis is starting to show some statistically significant insights. Yet further analysis is needed to develop a deeper understanding of, and draw inferences about, Brisbane suburbs and potential funding.

Learnings

1. Using QUERY and IMPORTXML to List all Brisbane Suburbs

=QUERY(IMPORTXML("https://www.brisbane.qld.gov.au/about-council/council-information-and-rates/brisbane-suburbs","//tr"),"SELECT Col1 WHERE Col1 <> 'Suburb'", 0)

When I approached this problem, I finally had to put aside my fear of learning IMPORTXML in order to get the efficiency and dynamism I wanted from this process. I found a fantastic YouTube video that clicked with me very easily. Whether the video/person is good at explaining the concept is up to each viewer, but I had enough prior contextual understanding to make the jump and work out how to code for my own context. If you try just the IMPORTXML("https://www.brisbane.qld.gov.au/about-council/council-information-and-rates/brisbane-suburbs","//tr") portion of this formula, you will notice that the output contains ‘Suburb’ for each table that is imported. This makes sense, as the IMPORTXML function is pulling every table row (//tr), including each table’s header row; the outer QUERY then filters those ‘Suburb’ header rows out.

2. Pulling the GPS coordinates from the Brisbane Suburbs table data

I wanted to keep track of which suburb I was working with, so I imported the same column of suburbs as above. Next, I wanted to make sure I was understanding the XML structure correctly, so exporting each column and row element was important for my brain to comprehend the process. In the adjacent column I used the code below:

=IMPORTXML("https://www.brisbane.qld.gov.au/about-council/council-information-and-rates/brisbane-suburbs","//table//tr//td[1]")

I replaced the last element td[1] with td[2]. This provided me with the text ‘Map’ for each suburb. Great success!

In the next adjacent column, I wanted to pull the URL link for each suburb. This meant that I needed to adjust the (above) XML code to td[2]/a/@href. The output was the Google Maps URL for each suburb.
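In case that is hard to picture, the adjusted formula looked roughly like this (same page as before, with the XPath now drilling into the anchor’s href attribute):

=IMPORTXML("https://www.brisbane.qld.gov.au/about-council/council-information-and-rates/brisbane-suburbs","//table//tr//td[2]/a/@href")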

I looked at a few of those links and noticed that they contained latitude and longitude information. I recognised it from my time working with a renewable energy company, where I had done some work with coordinates, so -27 and 152 stood out to me. I used a REGEXEXTRACT formula to retrieve both latitude and longitude for each suburb, and a SPLIT function to automatically place them into two columns:

=SPLIT(IFERROR(
  "-"&LEFT(REGEXEXTRACT(AG3,"[0-9].*"),SEARCH("&",REGEXEXTRACT(AG3,"[0-9].*"))-1),
  "-"&LEFT(REGEXEXTRACT(AG3,"[0-9].*"),SEARCH("&",REGEXEXTRACT(AG3,"[0-9].*"))-12)-1),",")

I was then able to use a VLOOKUP function for my “main” analysis table:

Here is the Latitude code:

=IF(LEFT(VLOOKUP(D3,$AE$3:$AI$192,4,0),4)<>"-27.",,VLOOKUP(D3,$AE$3:$AI$192,4,0))

Here is the Longitude code:

=IF(LEFT(VLOOKUP(D3,$AE$3:$AI$192,5,0),2)<>"15",,VLOOKUP(D3,$AE$3:$AI$192,5,0))

There were some entries on the original suburbs list site, e.g. Stones Corner and Teneriffe, that had slightly different Google Maps URLs or linked to a Google Drive folder. Because I stopped at a certain point in this project and didn’t continue with the mapping concept (mainly because default Google Sheets doesn’t provide this kind of mapping), I didn’t write code to handle these cases.

3. “Manually” Importing Income Statistics for Suburbs

For those cells that did not contain the suburb information on ‘Sheet1’, I manually adjusted the cell by using an IMPORTXML function that would search for each suburb — let’s use the Tennyson example:

=IFERROR(VLOOKUP("*"&D101&"*",ABS_Income_data!$B$2:$I,8,0),IMPORTXML("https://abs.gov.au/census/find-census-data/quickstats/2021/"&INDEX(ABS_Website_Lookup!$B$3:$E$3670,MATCH("*"&D101&" (QLD)"&"*",ABS_Website_Lookup!$D$3:$D,0),1),"/html/body/div[1]/main/div[3]/div/div/div[2]/div[1]/div[3]/div/table/tbody/tr[3]/td"))

=IFERROR(

Perform this search in the ABS_Income_data sheet:

VLOOKUP("*"&D101&"*",ABS_Income_data!$B$2:$I,8,0),

If there is an error, perform this XML search:

IMPORTXML("https://abs.gov.au/census/find-census-data/quickstats/2021/"&INDEX(ABS_Website_Lookup!$B$3:$E$3670,MATCH("*"&D101&" (QLD)"&"*",ABS_Website_Lookup!$D$3:$D,0),1),"/html/body/div[1]/main/div[3]/div/div/div[2]/div[1]/div[3]/div/table/tbody/tr[3]/td"))

Essentially, this just searches the QuickStats portion of the ABS website, using an INDEX-MATCH to insert the correct ABS Census code (which sits at the end of the URL):

Since the suburb ‘Tennyson’ isn’t located in the ABS_Income_data sheet, it performs a search for Tennyson in the ABS_Website_Lookup sheet: INDEX(ABS_Website_Lookup!$B$3:$E$3670,MATCH("*"&D101&" (QLD)"&"*",ABS_Website_Lookup!$D$3:$D,0),1)

Resulting in: SAL32771. Now we can search https://abs.gov.au/census/find-census-data/quickstats/2021/SAL32771:

Each suburb’s QuickStats page has the same structure, so I know I can IMPORTXML the same attribute with relatively good confidence that it will always sit in the same place. Hence I copied the full XPath of the ‘Median weekly household income’ cell on the website: /html/body/div[1]/main/div[3]/div/div/div[2]/div[1]/div[3]/div/table/tbody/tr[3]/td
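Putting that together, once the INDEX-MATCH has resolved Tennyson to SAL32771, the call effectively collapses to:

=IMPORTXML("https://abs.gov.au/census/find-census-data/quickstats/2021/SAL32771","/html/body/div[1]/main/div[3]/div/div/div[2]/div[1]/div[3]/div/table/tbody/tr[3]/td")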

Part 2

That is as far as I took the Queensland funding project before deciding to place it on the backburner; I will definitely revisit it in the future because I found it so interesting. I continued my internet trawl through some different websites, e.g., Kaggle, in search of datasets, and found this E-Commerce Summer Product Ratings and Sales one. I imported it into Google Sheets and started analysing it.

https://docs.google.com/spreadsheets/d/1r5TS66anemiqGcikvFCrIC_w3nYqdn4x0M5Zo1u6OhI/edit#gid=1781005015

I scanned through the headers and data types, then asked myself some questions to consider for various types of analysis. Some of these I attempted; others I did not:

  • Sales Performance Analysis

1. Analyse which products have the highest units sold and investigate if there’s a correlation with other factors like price, ratings, or use of ad boosts.

For the overall dataset, there doesn’t appear to be a high correlation between units sold and price, ratings, or ad boost.

But, when accounting only for the top 10, the model has a higher R² value:
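The correlation checks themselves can be done with simple CORREL calls; a sketch, assuming units sold sit in column F, price in column D, rating in column H, and the ad-boost flag in column G of the ‘data’ sheet (swap these for the real columns in the linked spreadsheet):

=CORREL(data!F2:F1574, data!D2:D1574)

=CORREL(data!F2:F1574, data!H2:H1574)

=CORREL(data!F2:F1574, data!G2:G1574)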

2. Look at the distribution of units sold across different product categories or tags to identify popular trends.

=TRANSPOSE(SPLIT(UNIQUE(data!S2:S),","))

=QUERY(K6:L29,"SELECT Col1, Col2 ORDER BY Col2 DESC LIMIT 10")
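If you want to rebuild the per-tag totals that sit next to those tag names (the second column that QUERY sorts on), a sketch along these lines works, assuming the raw tags string is in data!S, units sold in data!F, and the split-out tag names in K6:K29; the actual aggregation used is in the linked sheet:

=ARRAYFORMULA(SUMIF(data!S2:S,"*"&K6:K29&"*",data!F2:F))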

3. What is the distribution of units sold?

  • Price and Retail Price Comparison

1. Compare the listed price to the retail price to see how much products are being marked up.

2. Identify products with the highest and lowest markups and explore possible reasons.

  • Rating and Review Analysis

1. Examine the relationship between ratings and sales to see if higher-rated products sell better.

=QUERY(data!$B$2:$H,"SELECT Col1, Col5, Col7 WHERE Col7 >= 4.4 AND Col5 > 300 ORDER BY Col7 DESC, Col5 DESC")

2. Analyse the distribution of ratings and see if there are products that are highly rated but not selling well, which could indicate missed marketing opportunities.

=QUERY(data!$B$2:$H,"SELECT Col1, Col5, Col7, ((Col4-Col3)/Col4) WHERE Col7 >= 4 AND ((Col4-Col3)/Col4) > 0 AND ((Col4-Col3)/Col4) < .1 AND Col5 < 1000 ORDER BY ((Col4-Col3)/Col4) DESC")

  • Impact of Ad Boosts
    • How are sales impacted by ad boosts?

For this question, I was interested in investigating the quantity, revenue, and ad boost, but also what the image of the item looked like. I would then perform a comparative analysis (potentially some visual analysis through an AI metric) of the top 10 vs the bottom 10. I started by outputting all available information, but I was unable to see the image of the item, so I used the IMAGE() function to view it:

=QUERY(data!$A$2:$AN$1574,"SELECT Col6, Col6*Col5, Col7, Col40 WHERE Col7=1 ORDER BY Col6 DESC LIMIT 10")

=ARRAYFORMULA(IMAGE(AH13:AH))

Again, appealing to the dynamism I was trying to achieve, I updated the QUERY to incorporate the IMAGE() function, and also allowed for user input around the ad boost (or not) and how many of the top/bottom rows to view (5, 10, 20, etc.):

=ARRAYFORMULA({
  ({"QTY";
    QUERY({data!A2:AN1574}, "SELECT Col6 WHERE Col7="&AH24&" ORDER BY Col6 DESC LIMIT "&AF24)
  }),
  ({
    QUERY({data!A2:AN1574}, "SELECT Col6*Col5 WHERE Col7="&AH24&" ORDER BY Col6 DESC LIMIT "&AF24)
  }),
  ({"Boost";
    QUERY({data!A2:AN1574}, "SELECT Col7 WHERE Col7="&AH24&" ORDER BY Col6 DESC LIMIT "&AF24)
  }),
  ({"IMG";
    IMAGE(QUERY({data!A2:AN1574}, "SELECT Col40 WHERE Col7="&AH24&" ORDER BY Col6 DESC LIMIT "&AF24), 2)
  })
})

I then considered some managerial questions:

  • Which products should be prioritised for marketing campaigns?
  • Is there a significant impact of ad boosts on sales, and is the investment justifiable?
  • How does pricing strategy affect sales and customer perception?
  • What is the optimal inventory level for different products to minimise stockouts and overstock situations?
  • How do product variations (colour, size) affect sales?
  • What is the return on investment for products with quality badges versus those without?
  • How can the company improve customer satisfaction and ratings?
  • Which geographic markets are most lucrative for different product categories?
  • Are there seasonal trends in product sales that can inform inventory and marketing strategies?
  • What strategic adjustments are needed based on competitor pricing and product offerings?

I then considered some questions from more unique perspectives:

  • Visual Appeal and Product Colour Psychology
    • Is there a relationship between product colours and their sales performance?
  • Psychological Pricing Impact
    • Is there a relationship between price and purchase amount?

By this point I was somewhat struggling with the project, but I still retained interest in it. I then decided to plot the cumulative sum of quantities purchased across all products in the dataset.

=QUERY(data!$A$2:$F$1574,"SELECT Col1, Col6, Col5, Col5*Col6 ORDER BY Col6 ASC")
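To get a cumulative (Lorenz-style) curve out of that sorted output, a running-share column can sit alongside it; a sketch, assuming the sorted quantities from the QUERY land in column N starting at row 2:

=ARRAYFORMULA(SUMIF(ROW(N2:N1574),"<="&ROW(N2:N1574),N2:N1574)/SUM(N2:N1574))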

This was the resulting image:

I noticed that this looked similar to a Lorenz curve, the basis of the Gini coefficient. I thought this was an interesting way to analyse the data, so I decided to follow that train of thought. I did some digging and found (well, remembered) that you can get Google Sheets to estimate the equation of a trendline fitted to the data. I used this equation and Symbolab to calculate the integral for me, which helped me calculate the data’s Gini coefficient.

=0.5-((0.000501*(2.71828182845904)^(0.00474))/2)

It estimated that the coefficient was 0.4997, which suggests high (almost very high) inequality.
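For context, the standard relationship between the Gini coefficient and the Lorenz curve L(x), with both axes scaled to [0, 1], is:

G = \frac{A}{A+B} = 1 - 2\int_0^1 L(x)\,dx

where A is the area between the line of equality and the Lorenz curve, and B is the area underneath it. The spreadsheet formula above is essentially that calculation, plugged with the trendline equation Google fitted and the integral from Symbolab.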

I then did the same for revenue, plotting the cumulative sum of revenue across the dataset with the intent of comparing the two:

This yielded 0.4995; =0.5-((0.000972*(2.71828182845904)^(0.00427))/2)

Although I could see where this analysis was headed, I didn’t feel like it was moving far enough in the direction I was hoping for. This is where I finished this analysis and moved on to the third, and final, project that you saw in last week’s post.

Conclusion

Although this was a lengthy post, I think it highlights that not every published item on the internet, or within science, represents a ‘win’. What I enjoy about this industry is the trial and error, determination, and persistence that it takes. If this isn’t a nice little prep for next week’s post, then I don’t know what is!
Do you have any projects that are incomplete? Do you have any Fancy Projects, Templates, and Amazing Solutions!? I would love to see what is going on in your head, or where you have some roadblocks.
