56 – How to Test for Collusion in Google Sheets

Whilst considering ways to implement statistics and algorithms into my workplace tasks, I came across the idea of collusion. The definition of collusion is ‘secret or illegal cooperation or conspiracy in order to deceive others’, but I’m essentially looking for a way to detect whether there has been interference between two or more people with regard to their responses to questions. An element of one of my current roles requires me to assess applications that people submit. Although there is no reason to assume that any of those applications have been cross-referenced with anyone else who submits them within the same year, I started to wonder about the possibilities. Different types of interventions could possibly occur, such as:

  • an uptick in people utilising AI tools for daily task assistance
    • this would require a hybrid of qualitative and quantitative analysis, including semantic similarity, distribution analysis, and different language model evaluation metrics (BLEU, ROUGE, or METEOR),
  • people who have successfully applied in the past might have had an impact to some degree on current applicants
    • this would require some kind of qualitative time-series analysis.

https://docs.google.com/spreadsheets/d/17sRjCqmBrJIk2kT3RrJoYrhUxjKZ3WS1Mu4It0I3EXE/edit#gid=0

Steps for Analysis

Here are the steps that I followed to produce the following processes. They are essentially the seven steps of hypothesis testing found in many statistics textbooks:

  1. State the Null Hypothesis (H0),
  2. State the Alternative Hypothesis (H1)
  3. Set the significance level for this hypothesis test (α),
  4. Collect Data,
  5. Calculate a test statistic (t),
  6. Construct Acceptance/Rejection regions,
  7. Draw a conclusion about H0.

Collect (Create) Data

To be honest with you, I wasn’t sure how this post was going to turn out, but I was still interested in developing it. What I enjoy about posts like this is that I get to learn something, find problems along the way, and work towards creating a solution to them. To get started, I created a matrix with 10 people (“test takers”) and 10 questions:

I then filled in the empty cells with answers (between A and E), created an answer key at the bottom, then got excited at the prospect of being able to detect potential collusion:

Well…can you see it? Can you spot the potential collusion? After I created the matrix and entered the collusive participants, I legitimately had a hard time locating them after I turned away. Let’s use some basic statistical methods to uncover potential collusion in a dataset.

I will be covering this in a bit more detail in the automation portion of this blog (below), but as an FYI, I created this by implementing a combination of:

  • A table on another sheet with 1-5 and A-E in the respective A and B columns,
  • VLOOKUP(RANDBETWEEN(1,5),Sheet2!$A$1:$B$5,2,0) and dragged this over the matrix, then
  • Copy-pasting, and cleaning the data such that I could create collusive entries.

If we attempt a primary search for collusion using COUNTIF, we can see the number of correct responses for each question. This doesn’t give us much insight on its own, but it does tell us that at least three people answered every question correctly, so there are at least three people we should potentially be investigating:

=COUNTIF(K2:K11,K12)
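Outside Sheets, the same COUNTIF check can be sketched in a few lines of Python. The answer data below is made up for illustration, not the actual spreadsheet contents:

```python
# Hypothetical answers for three test takers (A-E per question) and the key.
# This mirrors =COUNTIF(range, key) applied to each question column.
answers = {
    "Person A": "BCADECABDE",
    "Person B": "ACADBDABCE",
    "Person C": "BCBDECABDA",
}
key = "BCADECABDE"

# For each question, count how many people matched the answer key.
correct_per_question = [
    sum(person[q] == key[q] for person in answers.values())
    for q in range(len(key))
]
print(correct_per_question)  # [2, 3, 2, 3, 2, 2, 3, 3, 2, 2]
```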

Transform Letters into Numbers

Since we want to perform a numerical statistical analysis on this MCQ test, it will be helpful to transform the letters into numbers. I made a lookup table where I can translate from numbers to letters using VLOOKUP (columns A and B) or from letters to numbers using VLOOKUP (columns B and C):

Since this transformation is from letters to numbers, I will duplicate the matrix, paste it below, then implement a VLOOKUP function using columns B and C:

=VLOOKUP(K12,lookupValues!$B$2:$C$6,2,0)
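The letter-to-number transformation is simple to sketch outside the sheet; here is a minimal Python version of the lookup (the example row is hypothetical):

```python
# Build the lookup that maps letters A-E onto 1-5, mirroring the sheet's
# lookup table (columns B and C) used by VLOOKUP.
letter_to_number = {letter: i + 1 for i, letter in enumerate("ABCDE")}

# A hypothetical answer row, transformed to numbers for numerical analysis.
row = ["B", "C", "A", "D", "E"]
numeric = [letter_to_number[answer] for answer in row]
print(numeric)  # [2, 3, 1, 4, 5]
```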

Create a Correlation Matrix

Next, I want to create a correlation matrix that shows how correlated each person’s answer set is with the other participants and with the answer set. I will use the CORREL() function with two inputs (y and x): y will be each other person’s test, and x will be the person in question. This means I will use absolute referencing for the person in question (x; variable 2) and relative referencing for each person that we’re comparing against. This first picture shows how correlated Person A is with Person A. This should be a 100% correlation, as all of the answers in the row match one another:

=CORREL(B17:K17,$B$17:$K$17)

When we drag this function down the column, it shows how correlated Person A’s answers are with the rest of the participants in this test. It shows that Person A is 23% correlated with Person B, 48% with Person C, 19% with Person D, etc:

=CORREL(B26:K26,$B$17:$K$17)

We can now do the same for Person B. You will notice that the first result for Person B is 23% (C31). Note that this is the exact same value as cell B32. This is because we have already computed this correlation between Person A and Person B. For those who haven’t worked through matrices before, can you see something that we will have to account for later on (so that we don’t mistake it for a strong correlation)? Write down your thoughts now and see if you’re correct or close:

=CORREL(B17:K17,$B$18:$K$18)

As with Person A, let’s drag (copy) this function down the column to observe all of the correlations. It shows that Person B has a 23% correlation with Person A, 100% with themselves, -29% with Person C, -14% with Person D, etc.:

=CORREL(B26:K26,$B$18:$K$18)

Carry on this process for the rest of the matrix:

=CORREL(B17:K17,$B$19:$K$19)

We can perform this same process for the Answer column. This will show us how correlated each person is with the answer set. It shows that Person A is 48% correlated with the answer set, Person B is -29% correlated with the answer set, and Person C is 100% correlated with the answer set:

=CORREL(B26:K26,$B$19:$K$19)

Next, I’m going to add some conditional formatting so that it’s easier to see what might be going on by selecting the numbers in the matrix then selecting Format > Conditional Formatting:

I’ll make sure that I’ve selected the range that I want (B31:L40) and I’ll select this colour palette — green for least correlated and red for most correlated:

We now have a clearer view of the correlation between each person’s answers against other people within the dataset:
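For readers more comfortable outside spreadsheets, the whole person-by-person correlation matrix that the sheet builds cell by cell can be produced with NumPy in one call. The encoded responses below are hypothetical:

```python
import numpy as np

# Hypothetical responses, already encoded 1-5; one row per person.
responses = np.array([
    [2, 3, 1, 4, 5, 3, 1, 2, 4, 5],  # Person A
    [1, 3, 1, 4, 2, 4, 1, 2, 3, 5],  # Person B
    [2, 3, 2, 4, 5, 3, 1, 2, 4, 1],  # Person C
])

# np.corrcoef treats each row as a variable, so this single call yields the
# full Pearson correlation matrix: 1s on the diagonal, symmetric elsewhere.
corr = np.corrcoef(responses)
print(np.round(corr, 2))
```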

Create a Visual Correlation Matrix Checker

Sometimes looking at numbers can be a bit overwhelming. In this case, I’m going to create a verbal matrix to check particular people. I’m going to, again, copy and paste the previous matrix, deleting the values within it:

Next, I’m going to implement a function that takes in correlation percentages and reports back the person that should be checked. People should only be flagged if their correlation is greater than a particular threshold, in this case 90%. It will only output one person of each pair (the one whose column it isn’t), but that’s something to consider for reporting purposes later (enter foreshadowing here). The following function is going to reference the cells in the range A30:K40 (the correlation matrix).

First, I want to obtain the contents of a cell within that correlation matrix, based on the cell reference. I can use the INDIRECT() function for this purpose. Because I’m essentially doing a right-looking lookup, which doesn’t exist for VLOOKUP, I need to only account for the row changes and maintain the “A” column. So, I can use ROW() to provide us the cell locations that we want to look up, e.g. “A31”, “A32”, etc. INDIRECT() will then provide us with the information in that cell, e.g. [Person] “A”, [Person] “B”, [Person] “C”, etc.

INDIRECT("A"&ROW($B31:$K31))

This is where we can create the logical condition that will either tell us to check particular test responders or indicate there is nothing to consider. I will set the threshold to 90%; if a correlation is greater than 90%, the formula will advise the assessor to check these two people and, otherwise, output “null”:

IF(B31>0.9,"Check: "&INDIRECT("A"&ROW($B31:$K31)),"null")

This is where your efforts in writing down your thoughts are going to pay off. Did you think of something that we needed to account for to make our visuals easier to read? The diagonal of that matrix (all 100%) would start to interfere with our analysis. As each person is going to be 100% correlated with themselves (for any finalised entry), we won’t need that information in the final stages. I want to take care of the diagonal because these metrics are redundant for our analysis. To make it obvious why: the diagonal is saying that Person A is 100% correlated with Person A, Person B is 100% correlated with Person B, etc. To handle this redundancy, I’m going to use the COUNTA function to count the number of rows down and the number of columns across. If the two COUNTA results are equal, then I am on the diagonal. For example, let’s use the top two people, A and B:

COUNTA(rows) = COUNTA(columns)

COUNTA(Person A) = COUNTA(Person A) are equivalent; this is trivially true. Let’s put some values into that logic: COUNTA(Person A = 1) = COUNTA(Person A = 1) are equivalent. Let’s move to the next row, but stick to the first column.

COUNTA(Person A + Person B) = COUNTA(Person A) are not equivalent. Let’s put some values into that logic: COUNTA(Person A + Person B = 2) = COUNTA(Person A = 1). Now, let’s stick with the same number of rows, but move to the second column.

COUNTA(Person A + Person B) = COUNTA(Person A + Person B) are equivalent. Let’s put some values into that logic: COUNTA(Person A + Person B = 2) = COUNTA(Person A + Person B = 2). Can you see how we are now accounting for the diagonal of the matrix?

=IF(COUNTA($A$44:$A44)=COUNTA($B$43:B$43),"null",
IF(B31>0.9,"Check: "&INDIRECT("A"&ROW($B31:$K31)),"null"))

Before I apply this to the entire matrix, I want to revisit that 90% threshold in the logical condition of the second (first nested) IF function. I’m going to replace the hard-coded 90% correlation tolerance with a user-friendly input cell:

=IF(COUNTA($A$44:$A44)=COUNTA($B$43:B$43),"null",
IF(B31>$M$44,"Check: "&INDIRECT("A"&ROW($B31:$K31)),"null"))
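The diagonal-skipping and threshold logic can be sketched in Python as well. Because a correlation matrix is symmetric, the sketch below also reports each pair only once, which addresses the duplicate-reporting issue foreshadowed earlier. The matrix values and threshold are hypothetical:

```python
import numpy as np

people = ["A", "B", "C", "D"]
# Hypothetical correlation matrix: entry (i, j) is CORREL(person i, person j).
corr = np.array([
    [1.00, 0.23, 0.48, 0.95],
    [0.23, 1.00, -0.29, 0.10],
    [0.48, -0.29, 1.00, 0.31],
    [0.95, 0.10, 0.31, 1.00],
])

threshold = 0.9  # the user-set tolerance (cell M44 in the sheet)

# Skip the diagonal (everyone is 100% correlated with themselves) and, since
# the matrix is symmetric, report each pair only once (i < j).
flags = [
    f"Check: {people[i]} & {people[j]}"
    for i in range(len(people))
    for j in range(i + 1, len(people))
    if corr[i, j] > threshold
]
print(flags)  # ['Check: A & D']
```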

Implement a Right-Tailed t-Test

Now that we know which people we want to check, we can set up a matrix that tests the t-distribution for all of these values. Let’s remind ourselves of the statistical test we want to perform against this dataset.

t = r * SQRT(df / (1 - r^2))

I will be taking this t-statistic and implementing it into the T.DIST.2T() formula to output a p-value for each cell in the matrix. The resulting matrix will display the p-values for the correlation between each of the test takers:

r = B31, and

df = (COUNTA($A$1:$A$12)-4), given that I need to account for ‘Person’, ‘Answer’ and then the number of variables (2).

=IFERROR(
IF(T.DIST.2T(ABS(B31*SQRT((COUNTA($A$1:$A$12)-4)/(1-(B31^2)))),(COUNTA($A$1:$A$12)-4))=0,
1,
T.DIST.2T(ABS(B31*SQRT((COUNTA($A$1:$A$12)-4)/(1-(B31^2)))),(COUNTA($A$1:$A$12)-4)))
,1)

Again, we can copy the formula down the column:

And then copy the formula across the columns:

To make this analysis easy on the eyes, I will incorporate conditional formatting for the cells B57:K66:

=IF($B$57:$K$66<(1-$M$44),1,0)

This now highlights all p-values below 1 minus the threshold placed in cell M44 (here, below 10%):
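The t-statistic and two-tailed p-value computed by the sheet formula can be sketched in Python, assuming SciPy is available; the correlation values below are hypothetical:

```python
import math

from scipy.stats import t as t_dist


def correlation_p_value(r: float, n: int) -> float:
    """Two-tailed p-value for a Pearson correlation r over n paired answers,
    mirroring the sheet's T.DIST.2T(ABS(t), df) with df = n - 2."""
    df = n - 2
    if abs(r) >= 1.0:
        # The sheet's IFERROR/zero checks map perfect correlations to 1 so the
        # diagonal doesn't get flagged; mirror that behaviour here.
        return 1.0
    t_stat = r * math.sqrt(df / (1 - r ** 2))
    return 2 * t_dist.sf(abs(t_stat), df)


# Hypothetical correlations over 10 questions (df = 8).
print(correlation_p_value(0.48, 10))  # moderate correlation: p well above 0.05
print(correlation_p_value(0.95, 10))  # very strong correlation: tiny p
```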

Construct an Answer Easy-Checker

Sometimes, this can even be too much to comprehend, so let’s make it a bit easier on the assessor. This is going to output the answers that a person entered into their quiz, into a table below. First, I’ll create a table that outlines each question and people to be checked:

VLOOKUP returns a value based on an input. The input for this purpose is the person’s “name”, e.g. C, and the range is the original data input table ($A$2:$K$11). I want the process to automate for me, so that I don’t have to drag the formula across, which means that the index will be an array of numbers. I’ll use a combination of SEQUENCE() and COUNTA(). SEQUENCE takes the number of rows, the number of columns, and a starting number. COUNTA counts the number of cells in a range where some form of data is present. I know that the SEQUENCE function will only require 1 row, but the number of columns will depend on the number of questions, so I will use COUNTA to count the questions that are present; COUNTA($A$2:$J$2) = 10. Since I don’t want the person’s name to appear in that lookup again, I’m going to start at column 2; SEQUENCE(1,COUNTA($A$2:$J$2),2). Since I’m using an array function, SEQUENCE(), within this VLOOKUP function, I’m going to wrap the entire formula in an ARRAYFORMULA function:

=ARRAYFORMULA(VLOOKUP(B69,$A$2:$K$11,SEQUENCE(1,COUNTA($A$2:$J$2),2),0))

Underneath the two people’s answers table, I’m going to create a matching row. I’ll use an ARRAYFORMULA() function again because I want to avoid dragging the formula across. My IF function’s logical statement includes the array elements; C69:L69=C70:L70:

=ARRAYFORMULA(IF(C69:L69=C70:L70,"MATCH",))

It would be useful to see how many answer matches the two people have:

=COUNTIF(D71:L71,"MATCH")/10
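The match-counting step can be sketched in Python too; the two answer strings below are hypothetical:

```python
# Hypothetical answer rows for two people; compare element-wise and report
# the fraction of matching answers, like =COUNTIF(range,"MATCH")/10.
person_c = "ABDCEABDCE"
person_j = "BBDCEABDCE"

matches = [a == b for a, b in zip(person_c, person_j)]
match_rate = sum(matches) / len(matches)
print(match_rate)  # 0.9
```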

Instead of copying and pasting the questions from the top, I’m going to implement the ARRAYFORMULA function and select the questions cells:

=ARRAYFORMULA(B12:K12)

Now that the test answers are located directly below the combination of people, we can easily compare the answers for our p-value analysis. Maintaining the ARRAYFORMULA approach, we can use an array within an IF function to identify all of the answer matches:

=ARRAYFORMULA(IF(C69:L69=$C$73:$L$73,"MATCH",))

Now, we can copy the formula down, given the relative and absolute cell referencing:

Using this approach we can now compare the various cases where the p-value is 5% or less. Let’s test: (A,G), (A,I), (C,J).

(A,G); 2.57%:

This shows that the two exhibit a 30% match of questions (Q1, Q3, Q5, Q7). With respect to the answer set, the only correct answer they share is an E.

(A,I); 2.05%:

This shows that the two exhibit a 10% match of questions (Q10). With respect to the answer set, they don’t have any correct answers in common.

Initially, you might think that these two compared pairs exhibit a Type I error (a false positive). However, when we constructed the original p-value matrix, we used the ABS() function to account for negative values. So, these two pairs, (A,G) and (A,I), actually belong in the left tail of the distribution, given the distance between their answer sets.

Finally, let’s investigate (C,J); 0.00007%:

This shows that the two exhibit a 90% match of questions (Q2-10). With respect to the answer set, the only answer they don’t have in common is Q1. This suggests that these two people should be investigated for potential collusion.

This process of detecting collusion in Google Sheets displays the fusion of statistical methodologies, correlation matrices, and investigative techniques to unveil potential irregularities and fraudulent activities. This post assumes that certain data preparation has already taken place (data cleaning, a fixed number of participants and questions). However, the overall flow of this post has taken us through the data analysis. We also explored how we can implement some basic statistical hypothesis testing and draw conclusions based on collected data.

The journey continued with practical steps, such as data creation, transformation, and correlation matrix construction. We delved into the significance of correlation coefficients in identifying patterns of collusion among test takers’ responses. The utilisation of conditional formatting and visual representations facilitated a comprehensive understanding of data correlations, enhancing interpretability.

Moreover, the implementation of right-tailed t-tests enabled statistical inference, determining the significance of correlations and potential collusion occurrences. The integration of automation, verbal matrices, and threshold-based checks streamlined the investigative process, aiding in the identification of suspicious correlations exceeding predefined thresholds.

Ethical considerations, including data integrity, fraud detection, and risk assessment, stress transparency and compliance. Continuous improvement is key, urging vigilance against collusion and promoting fairness and interpretability. Robust analytical techniques and ethical safeguards are vital for collusion detection. Leveraging statistical methods, automation, and visualisation enhances fraud detection, ensuring organisational integrity and trust in data-driven decisions.
