23 – Having Issues Finding Peoples’ Names?

In a previous post, I had to anonymise some names before posting. During that process, I realised that it was the perfect opportunity to show how I performed this task at the moment’s notice. Here’s the spreadsheet if you want to follow along:

This is an example of the original dataset (with the original names removed):

Original Data Process

Before I could publish this data, I needed to anonymise the names in Column F (Participant). To do this, I followed the following steps:

Create a New Spreadsheet,
Pull all the unique entries from the dataset into the renaming sheet using the ‘UNIQUE’ function,

As there are numerous players who have played multiple rounds, it is necessary to account for these “duplicate” names within the dataset. This means that one player will not be entered under two different, unique names. This will lead to an increase in accurate data integrity and analysis.

Create a small statistics table based on the following bits of information:
1. Total names in the original dataset

=COUNTA(Original!F2:F)

Total unique (Participant) names, and

=COUNTA(A2:A)

Percentage of unique names (this is a statistical check so that performing this process is validated, especially for datasets with more than 15,000 entries).

=E2/E3

Create the list of anonymised ‘Player #’ names in Column B,

=ARRAYFORMULA(“Player “&SEQUENCE(E2))

Navigate back to the original dataset and create a column or anonymised names. Then, use the VLOOKUP function to replace the names in that new column.

Step 4 (Revised) – IMPORTHTML Function

To explain the IMPORTHTML portion of this process, let’s go back to step 4. Instead of using the SEQUENCE function and automatically assigning players (as they appear) in the original dataset as Player 1, Player 2, etc, I decided to import some names from the internet using the Google Sheets function IMPORTHTML. This function requires 3 (well, an arguably 4th option) variables to operate:

URL: The URL of the page to examine, including protocol (e.g. ‘http://’).
query: Either “list” or “table” depending on what type of structure contains the desired data.
index: The index, starting at ‘1’, which identifies which table or list as defined in the HTML source should be returned.
locale: A language and region locale code to use when parsing the data. If unspecified, the document locale will be used.

For this example, I’ll be using the following link for the list of characters names from The Office:

https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_characters

It seems like SO many people who work in data or IT love The Office. I don’t get it. I don’t argue it. I just fall into line with it. When you look at the website for that list of characters, you can see that there’s a paragraph and then some tables with character information.

When you look under the hood at the website’s code you can see this code that signals that this information is held within tables (css table class). The first listed table element shows the main characters.

The second listed table element shows the supporting characters.

Since there are multiple tables in this website, i.e., the main cast in table 1 and the supporting cast in table 2, we will want to retrieve the data from each of these tables, and put it into Google Sheets, but we will need to do so one at a time. To achieve this, I know that I need three things: the URL (web address), the element that I’m looking for (a table), and the index of that element (table 1, 2, or 5, etc). Entering the following code will yield the results in the screenshot below:

=IMPORTHTML(“https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_characters”,“table”,1)

This data can be interacted with on an individual cell level. So, I’m going to copy this character data over to Column B (removing the sequence formula):

I’m going to need more names to fill out the number of players in the original dataset. So, let’s get table 2 from the wiki page:

=IMPORTHTML(“https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_characters”,“table”,2)

Now that we’ve filled out the rest of the pseudonyms for all of the unique names in the dataset, we start the “replacing” portion of the cleaning process. So, this is now step 5 of the process outlined earlier.

Step 5 – VLOOKUP Function

In this sheet, I am going to insert a blank column on the right of the player names; column F (Participant):

I will title the column ‘Anonymised Player’.

Now, I can create the VLOOKUP formula for each matching participant’s name:

=VLOOKUP(F2,Renaming!$A$2:$B,2,0)

This VLOOKUP function is looking at the participant’s name on the ‘Original’ sheet, searches the “database” of names that we’ve created on the ‘Renaming’ sheet, and substitutes the name on the left (John Smith) with the name in the second column of the selected database (A:B) in the ‘Renaming’ sheet (Michael Scott). I have created absolute referencing as I don’t want the selection moving down the ‘Renaming’ spreadsheet when copying the formula down the column (with auto-fill):

As you can see from the screenshot, there are numerous entries that have been duplicated due to the number of rounds that have been played. As mentioned earlier, this was necessary to account for, hence pulling all ‘unique’ names from the original dataset so that one participant was not allocated multiple different names. This allows for accurate data processing and analysis for individual player scores and times.

Homework for next week

Using a combination of the following Excel functions, can you create a list of characters without any blanks, ‘Main characters’, or ‘Characters’: IMPORTHTML, UNIQUE, and FILTER? Answer in the next blog post.

Conclusion

What I enjoy about these processes is that I will end up finding a singular, or use only a few number of functions (nested) to complete the task. I enjoy that I can use a collection of projects that I’ve completed in the past, combined to create a solution.

3 responses to “23 – Having Issues Finding Peoples’ Names?”

50 – 10 Google Sheets Troubleshooting with some FAQs and not so FAQs – James Miller CV says:

June 17, 2024 at 8:24 am

[…] (newly found) datasets: https://github.com/tdmitch/DunderMifflin. Like I said in the past, ‘It seems like SO many people who work in data or IT love The Office.’ And this is how we’re going to start. We’re going to do a brief quality check on this […]
28 – How to Grab Those Little Bits – James Miller CV says:

January 15, 2024 at 9:41 am

[…] few posts ago, I was showing how you can use IMPORTHTML to enter a bunch of names into Google Sheets at a […]
24 – What IF We Look at it Like This? – James Miller CV says:

December 18, 2023 at 10:22 am

[…] Answer from last week’s blog: […]