87 – How to use IMPORTXML in Google Sheets

Have you come up against having to copy over a lot of information across various different parts of a website and it just takes…too long? I’ve spent years thinking about ways to quickly get all of this information. Many of these processes involved fine-tuned keyboard shortcuts heuristics. This can be a mixture of Find (text), learning the number of carriage returns throughout the process, Command+Tab and Command+Tab then Tilde (squiggly or twiddle), etc. However, once I understood how to read and pull particular information key values from a website, it made things so much easier to view data structures in various forms and gather this information.

I will be skipping over some fundamental website development and computing concepts, but from what I’ll be exhibiting, it should be able to get you started on some type of information gathering cough web scraping cough that’s suitable to your needs. Within the Google Sheets context, IMPORTHTML only allows you to import one of two element types; tables and lists. This can be really helpful for financial analysis from websites like Yahoo or Marketwatch (although there is a shift towards dynamism on websites; discussed later). IMPORTXML allows you to dig a little deeper into individual elements on the website.

HTML stands for Hypertext Markup Language, and it is the language built for the web. It structures and displays content on web pages. XML stands for Extensible Markup Language, and stores and transports data. It is essentially like the database for a website. When you write a novel or a report, you can write sentences and paragraphs. Here’s the first page of Jane Austen’s Pride and Prejudice:

Thing is, we can start to view this as XML. If we wanted to view these sentences as XML, we would add the <p> and </p> tags. Can you guess what ‘p’ stands for? Paragraph. You start the tag (open) with this syntax <> and close the tag with this syntax </>. Let’s turn the first two sentences into paragraph tags.

<p>It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.</p>

<p>However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.</p>

The reason that you open and close the paragraph tags is for formatting. You want the code interpreter to understand the way the text should be presented on the screen. There’s two other things that we can actually transform into XML code from this Austen extract. Can you guess that they are? There are the chapter and the book titles. A synonym for title is heading. Heading is the terminology that XML uses and there are 6 in total; H1 to H6. Can you guess how we code ‘Pride and Prejudice’ as Heading 1 and ‘Chapter 1’ as Heading 2?

<h1>‘Pride and Prejudice’</h1> as Heading 1 and <h2>‘Chapter 1’</h2>

Headers

Let’s take this website, https://en.wikipedia.org/wiki/The_Office_(American_TV_series), and put this into practice:

https://docs.google.com/spreadsheets/d/12UkibdqXzUFvNelUccZOFDH9p9LgyAIbv4QCq2cSMk0/edit?gid=0#gid=0

How about we output all of the first headings (h1) in the website. When initiating an external link in Google Sheets, you must first grant access for this connection:

=IMPORTXML(A2,“//h1”)

This function has returned ‘The Office’ and ‘(American TV Series)’:

Let’s confirm this by finding it in the code of the site. I’ll right-click on The Office, then select Inspect. This will open up a panel on the right side of the screen:

In the website’s code, we can see the <h1 …> tag with The Office inside of it:

Let’s try the same with h2:

=IMPORTXML(A2,“//h2”)

Same thing, let’s find it on the website. Instead of looking for this through the website, then right-clicking on it to find the code, I’m actually going to do a ‘find’ on the heading h2:

This search has yielded 21 results:

But I’m just going to look at two:

I’m not going to follow the turtles all the way down here, but if you were to review what you’ve seen then look at the word Crew, what would you guess its heading number would be? Let’s have a look by right-clicking on it and selecting Inspect from the menu:

Paragraphs

Okay, last one. I want to make a point here and distinguish between a single slash, /, and a double slash, //. I’m going to call on the URL in A2 then search for a /p (paragraph) tag:

=IMPORTXML(A2,“/p”)

This screenshot shows an error of ‘Imported content is empty.’ If we look at the code for this site, we can see that there’s no p tag in the class or head:

However, if we set the tag to search recursively (worded definition, visual representation, mapped representation) throughout the website code, then it should output all p tags. So, let’s call, *gulp*, all of the paragraph tags:

=IMPORTXML(A2,“//p”)

A matrix of 86×58 (4,988 cells) is why I said gulp. Regardless, you’re getting the point. We can start to pipe in particular parts of the website that we’re looking for. Let’s say I know that I want to get just the h2 tag Production. I can actually grab that by using the Inspect tool again. I’ll right-click on Production then select Inspect from the menu to find it in the code:

XPATH

Next, I’m going to right-click on the code word Production, hover over Copy then select Copy full xpath:

I receive this text in my clipboard:

/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/div[3]/h2

This is what I receive when I add it to my IMPORTXML function in Google Sheets:

So, what do all of these slashes mean? Let me put it this way: where did you store a certain document on your computer? How about the file ‘20240302_ProjectX_Analysis.docx’? If you were asking me to find it on your computer, how would you explain it to me? Some might say, “go to my desktop, you’ll see a folder named ‘Work’. In there, you will see year numbers; click 2024. Next, click March,” etc. This is what those slashes mean. They are the website equivalent of a universal naming convention (UNC) path. I’ll show you only a few levels before you get the idea of what’s happening.

I’m going to start at the top level of the code; /html/:

I’ll go to the next level down:

/html/body/

Then I’ll find the second div tag:

/html/body/div[2]/

Then the first, and only, div tag:

/html/body/div[2]/div/

All the way down to the h2 tag; /html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/div[3]/h2:

IMPORTHTML

As a contrast to what IMPORTXML retrieves, this is what IMPORTHTML pulls from a website. I’m going to set up a second sheet, IMPORTHTML, with the same wiki link. As mentioned, IMPORTHTML only retrieves 1 or 2 datatypes; lists (ol and ul) and tables.

Lists

List 1:

=IMPORTHTML($A$2,“list”,A4)

List 2:

=IMPORTHTML($A$2,“list”,B4)

Tables

Looking at the website, you can see a few different tables; left-hand side, show main information:

Season Synopsis:

As you work your way down the page, you can essentially count the number of tables. So,, let’s use that same logic to pull these tables into Google Sheets:

Table 1

=IMPORTHTML($A$2,“table”,C4)

Table 2:

=IMPORTHTML($A$2,“table”,E4)

Difficulty with Some Websites

Have you ever come up across this issue when attempting to retrieve data on a website?

Even though it’s clearly there:

This is due to the dynamic nature of JavaScript on these websites. Sometimes websites will function from static (push) data and others are constantly looking to pull updates. Sometimes these sites make the transition from static to dynamic at random times…even the day before you release a blog post about pulling this exact type of data.

I love this stuff, but how can I use it?

Well, this is how I use it. Since the success of the remastered Tony Hawk’s Pro Skater 1 + 2 (Forbes, Blizzard, True Achievements), there has been great anticipation for subsequent remakes (I’m sure by Tony Hawk too) from the expansive series, especially 3+4, THUG 1&2, and American Wasteland. There was murmuring over social media about the obvious next step to remake the subsequent games, however, the next best alternative (leading the way) was located within the Call of Duty franchise. As the CoD franchise has accumulated $33 billion since 2003 and there’s an opportunity cost of the fastest selling game to reach 1 million units, this no doubt would have been a difficult decision.

There are many contributing factors for why THPS had the success that it did, but I believe some of the main ones, split between skate fans and gamers, are:

Timing of THPS1+2 release (4^th September, 2020),
Lack of groundbreaking arcade skate games in the market since Skate 3 (2010)
- This is not ignoring the existence of skate games for skaters; Session and Skater XL,
The growth of Nostalgia in a bull entertainment market since the 2016 release of Stranger Things.
- This was a bit of a left turn, especially popular since the wave of dark political hyperrealism that was taking control over the market with shows like House of Cards (2013) and Black Mirror (2011). I would also posit that this was perfectly timed with the newly appointed American president of the time.

I believe that all of these factors came to a head for the release of THPS1+2, hence its rapid success. The obvious next remake 3+4 was ultimately canned after a focus was placed on Call of Duty, which left fans sunken back in their Secret Labs chairs. After the acquisition of Blizzard by Microsoft, fans shifted back to the edges of their seats, and have been left waiting. Not long has there been mass anticipation for the next logical instalment onslaught of other games in the series (3 + 4, THUG 1 + 2, American Wasteland, etc). I always wondered what Tony would think about his shifting back and forth, and the politics around the business side of opportunities.

I’ve been back and forth on my opinion of Tony Hawk for years. Coolest dude ever in 1999 (all the way through to 2015). Heard different things about him through various sources, but a big one for me was the fantastic documentary All This Mayhem. I looked at all the evidence that was built against Tony, then I just looked at his consistency throughout the years. A massive nerd. A massive skate nerd. Skating was a counterculture that was reserved for the most extreme in the 1970s when it really boomed. It was even less popular in late 70s and early 80s when Tony was in high school (and was just about to buy his first house). He was the outcast of the outcasts. Why? He was a ‘circus skater’; gangley like any teenager who’s going through their young deer phase yet was always making new tricks. That said, throughout this entire time, he’s shown consistency in pushing the culture and presenting it to a mass audience. He was all for the original game because he wanted skating to be in the homes of people who didn’t skate. He didn’t ask to carry the torch for skateboarding, but he did regardless of his countless haters. I digress.

On 23rd February, 2025, I went to the official Tony Hawk game website and saw an announcement was coming out:

Funnily enough, this was also teased through CoD: Black Ops 6 multiplayer map ‘Grind’. People have already flooded reddit in droves, showcasing their Photoshop skills and cleaned up the extreme bokeh image of what appears to be Skater Island. People have also attempted to make connections between the original 3 and 4 covers that culminate in the new potential 3+4 cover; 3 is an Indy nosebone and 4 is a crossbone:

Source: https://www.reddit.com/r/THPS/comments/1iwwqbl/new_logo_merges_both_cover_tricks/

How these are related to a Rock ‘n’ Roll, is beyond me, but I am more than willing to tilt my head to the left and squint to make it work though.

Evidence

Background Image Skate Location

The background image is a pretty close match for Skater Island, from THPS 3. Original:

Impending Release:

You can clearly see the funbox on the right, with the squared down rail (potentially a ledge in the update), the rafters above on the right, long funbox on the left with the round rail, and an updated skylight in lieu of the luminescent light bulbs. This updated game even includes the water splashed on the ground (bottom left/middle) like its 1+2 counterpart.

Literally on the Website

Through my excitement, and the little amount of time that I had to spend selfishly digging into this project, I was able to find someone, Quelanight2324, on a Reddit post who was already down this rabbit hole and was able to locate a class on the website; ‘thps-3-4-countdown’:

I also wanted to dig into the website to see what I could uncover and also add to the public string theory boards with my own two cents:

Level Names

So, let’s use IMPORTXML and the inspector function in Chrome to dig into it a bit deeper. Since I know that the background image is a photo of Skater Island. I’m going to put forward a hypothesis that there is a level name somewhere in the website code. Let’s start by finding all of the THPS3 and 4 levels names:

Source: https://tonyhawkgames.fandom.com/wiki/Category:Tony_Hawk%27s_Pro_Skater_3_levels

Source: https://tonyhawkgames.fandom.com/wiki/Category:Tony_Hawk%27s_Pro_Skater_4_levels

I’m going to state the obvious here – yes, I could just copy, paste, and format these names. It would save me time, but my ultimate goal here is to understand process/programming efficiency (to the best of my current abilities) such that when I apply this to a project at a larger scale, it saves me, or someone else, time.

Before I was able to post this blog, I was unable to successfully apply the appropriate logic to this formula, but I got close. This was my resulting output from the following two formulae:

=LET(data,UNIQUE(TOCOL(ARRAYFORMULA(TRIM(SPLIT(REGEXREPLACE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(TOCOL(IMPORTXML(“https://tonyhawkgames.fandom.com/wiki/Category:Tony_Hawk%27s_Pro_Skater_3_levels”,“/html/body/div[4]/div[4]/div[2]/main/div[3]/div/div[3]”)),” (THPS3)”,),“Tony Hawk’s Pro Skater”,),“Category: 3 levels”,), “([BCDFLPSTW])”, “|$1”), “|”))),1,1),1,1),FILTER(UNIQUE(data),LEN(UNIQUE(data))>2))

=TOCOL(IMPORTXML(“https://tonyhawkgames.fandom.com/wiki/Category:Tony_Hawk%27s_Pro_Skater_4_levels”,“/html/body/div[4]/div[4]/div[2]/main/div[3]/div/div[3]”))

I ended up biting the bullet, and manually cleaned up the data to fit in the cells:

That said, I can now use the following formula to search through the website to see if any of these levels show up:

=IFERROR(MID(IMPORTXML(“https://www.tonyhawkthegame.com/”, “//*”),SEARCH(H6,IMPORTXML(“https://www.tonyhawkthegame.com/”, “//*”)),LEN(H6)),“Not on Website”)

Let’s highlight those positive results with some conditional formatting:

Rio ended up being located within the Vicarious Visions name, so we have to rule this portion of the hypothesis out:

Alcatraz can be observed in 6 different results, but all referred to the same link:

Vindication: https://www.tonyhawkthegame.com/3+4

Finally, Chicago has 32 entries on the website:

Vindication: https://www.tonyhawkthegame.com/34

Release Date

Let’s also not forget the fact that it’s being released on, albeit American date format, 03/04/2025 — 3+4. The numerologist in me is just saying.

Tyshawn Jones

…and then, I mean, there’s this interview with Tyshawn Jones on the Breakfast Club on 19th February, 2025:

Conclusion

This is just an apology to my work colleagues as I have a team meeting during this time (tomorrow). This is a confession to say that I will not be paying attention during the meeting, but I will be on the website, waiting to view the announcement. I can only assume that this will remind me of waiting for tickets to some big event like Soundwave back in 2009 and the website crashing. That, or I could just change the countdown timer and wait 5 seconds:

…the longest 5 seconds of my life.