Rik Voorhaar

Bias in figure skating judging

My wife is very enthusiastic about figure skating. She often mentions that the judging is biased, in the sense that many judges give higher scores to athletes from their own country, and lower scores to athletes from other countries.

This doesn’t sound too surprising, but I wondered if it is actually true. Can I show that there is a statistically significant bias in figure skating scoring?

The data

To answer this question I need a dataset with the scores of many skaters, including the nationalities of all the skaters and judges. The ISU publishes these results as PDF files on its website. Before the 2016/2017 season the order of the judges’ scores was randomized for each skater, so those earlier seasons are not usable. Additionally, the ISU stopped publishing the nationalities of judges as of the current 2019/2020 season. I therefore downloaded all the scores of the 2016/2017 through 2018/2019 seasons from the ISU website. In a separate post I will go into more detail on how to mine PDF data like this.

A typical piece of a PDF file looks like this:

[Figure: Example of ISU scores]

Additionally on the website we get the following information regarding the judges presented in a table like this:

[Figure: Example of judge scores from ISU]

Results

This is all we need. We can now build a table comparing the scoring of a particular judge with the average of all judges together. We split this between scoring for technical elements (which should be more objective) and scoring for program components (which should be more subjective). In the example above the skater is from Russia, and we see that Judge No. 1 (who is from the Netherlands) gives an average GOE of 1.003 for the technical elements, compared to an average over all judges of 0.796. This means that her scores are 0.287 above the average. This in itself doesn’t mean much, but if we observe a consistent deviation over many different cases where a Dutch judge is judging a Russian skater, then we have identified a bias.

So this is what we do: for each pair of countries we collect all the cases where a judge from country A judged a skater from country B. We then record how that judge’s scoring compared to the average scoring of all the judges. Finally we look at the distribution of these deviations from the average. Taking the example above, in my data there were 49 cases where a Dutch judge judged a Russian skater, and the distribution looks like this:
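As a rough illustration of this bookkeeping, here is a minimal Python sketch. It is not the code behind the results in this post: the record layout, the function name `deviations_by_pair` and the toy numbers are all hypothetical, and it assumes the panel average includes the judge in question.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, one per (performance, judge). The layout is an
# assumption for illustration, not the format of the ISU data.
# (performance_id, skater_country, judge_country, judge_mean_goe)
records = [
    ("p1", "RUS", "NED", 1.0),
    ("p1", "RUS", "RUS", 1.4),
    ("p1", "RUS", "USA", 0.6),
    ("p2", "USA", "NED", 0.5),
    ("p2", "USA", "RUS", 0.2),
    ("p2", "USA", "USA", 0.8),
]


def deviations_by_pair(records):
    """Map (judge_country, skater_country) -> list of deviations from the panel mean."""
    # Step 1: average score of the whole panel for each performance.
    scores_per_performance = defaultdict(list)
    for performance, _skater, _judge, score in records:
        scores_per_performance[performance].append(score)
    panel_mean = {p: mean(s) for p, s in scores_per_performance.items()}

    # Step 2: each judge's deviation from that average, grouped by country pair.
    pairs = defaultdict(list)
    for performance, skater, judge, score in records:
        pairs[(judge, skater)].append(score - panel_mean[performance])
    return dict(pairs)


pairs = deviations_by_pair(records)
for (judge, skater), deviations in sorted(pairs.items()):
    print(judge, "judging", skater, [round(d, 3) for d in deviations])
```

Each list in `pairs` is then the raw material for one histogram of score deviations.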

[Figure: Distribution of score deviation from average for NED judging RUS]

Here we see that the distribution is roughly normal. Furthermore the mean is not statistically different from 0; the p-value is 0.845. For us to conclude that there is a statistically significant bias, this value should at the very least be below 0.05. Since we have many pairs of countries (2586 to be precise), we might even set this criterion significantly lower (say 1/1000) to avoid false positives.
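The post does not say which test produced these p-values. One natural choice for "is the mean different from 0?" is a one-sample t-test, so the sketch below assumes that and uses `scipy.stats.ttest_1samp`. The helper names and the toy deviations are hypothetical. The second helper shows why a stricter threshold helps: if no pair had any real bias, testing 2586 pairs at 0.05 would still flag roughly 129 of them by chance, against roughly 2.6 at 0.001.

```python
from scipy import stats


def bias_test(deviations, alpha=0.001):
    """One-sample t-test of the null hypothesis 'the mean deviation is 0'.

    Returns (t statistic, two-sided p-value, significant at level alpha?).
    """
    result = stats.ttest_1samp(deviations, popmean=0.0)
    return float(result.statistic), float(result.pvalue), bool(result.pvalue < alpha)


def expected_false_positives(n_tests, alpha):
    """Expected number of chance 'discoveries' if every null hypothesis were true."""
    return n_tests * alpha


# Toy data (made up): one sample centred on zero, one clearly shifted upwards.
unbiased = [-0.2, -0.1, 0.0, 0.1, 0.2]
biased = [0.20, 0.30, 0.25, 0.22, 0.28, 0.31, 0.19, 0.27]

print(bias_test(unbiased))
print(bias_test(biased))
print(expected_false_positives(2586, 0.05))   # roughly 129
print(expected_false_positives(2586, 0.001))  # roughly 2.6
```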

We can thus conclude that there is no statistically significant bias when it comes to Dutch judges scoring Russian skaters. But what if we look at, say, Russian judges scoring their own skaters? Well, then we see a very different story. We have 519 cases of this happening, and the distribution of score deviations looks like this:

[Figure: Distribution of score deviation from average for RUS judging RUS]

We see that the vast majority of the time (82%), the Russian judges gave Russian skaters higher scores than their peers did. In fact the mean deviation is 0.242 (which is quite substantial), with a p-value of $4.13\times 10^{-68}$, which is most certainly statistically significant. So there you have it: Russian judges tend to score their own athletes significantly higher. But Russia is not the only country doing this; every major figure skating country has such a bias. Among those, the bias of Japan is the smallest at 0.16 points, and that of France the largest at 0.26. All of this is for the technical scores, but the component scores paint a very similar picture.
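The summary numbers quoted here (number of cases, mean deviation, share of cases above the panel average) are simple to compute once the deviations for a country pair are collected. A minimal sketch follows; the function name and the toy input are hypothetical, and it assumes the sample standard deviation (the post does not say which variant was used).

```python
from statistics import mean, stdev


def summarize(deviations):
    """Basic summary of one country pair's deviations from the panel average."""
    n = len(deviations)
    return {
        "n": n,
        "mean": mean(deviations),
        "std": stdev(deviations),  # sample standard deviation (assumption)
        "share_above": sum(d > 0 for d in deviations) / n,
    }


# Toy deviations (made up): three of four cases lie above the panel average.
print(summarize([0.3, 0.1, -0.1, 0.5]))
```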

And we don’t just see that some countries like themselves; we also see that many countries tend to score their rivals significantly lower. If we set the threshold for statistical significance at a p-value of 0.001, then we find 29 country pairs with a mean deviation significantly below 0 (and also 29 pairs significantly above 0). With very few exceptions, the cases where a country gives significantly lower scores to another country are between a former Warsaw Pact country and a non-Warsaw Pact country. One can thus see that Cold War politics are still very much alive in the world of figure skating.

For reference, here is a table with the pairs of countries where the p-value is less than 0.001, sorted by the average deviation in GOE scores. If we raise the p-value threshold to 0.05, the number of country pairs with a negative/positive deviation increases to 100/127 respectively, but this likely also includes some false positives.

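| Country A | Country B | GOE Deviation | # Samples | std | p-value |
|---|---|---|---|---|---|
| GEO | GER | -0.423535 | 13 | 0.291439 | 0.000292112 |
| FIN | POL | -0.394558 | 7 | 0.129526 | 0.000298878 |
| GEO | JPN | -0.391477 | 23 | 0.369426 | 5.66014e-05 |
| NOR | SVK | -0.347513 | 15 | 0.304064 | 0.000767944 |
| DEN | JPN | -0.287993 | 19 | 0.256627 | 0.000156087 |
| GEO | CAN | -0.278147 | 23 | 0.321245 | 0.000519587 |
| GEO | FRA | -0.262415 | 17 | 0.236376 | 0.000411082 |
| USA | POL | -0.234252 | 37 | 0.319543 | 9.28026e-05 |
| FIN | RUS | -0.223771 | 116 | 0.459913 | 8.11565e-07 |
| CAN | UKR | -0.212031 | 54 | 0.294583 | 2.83725e-06 |
| USA | BLR | -0.210608 | 54 | 0.30442 | 5.83674e-06 |
| USA | ESP | -0.209474 | 31 | 0.269294 | 0.000185776 |
| GEO | USA | -0.199728 | 48 | 0.303747 | 4.34335e-05 |
| USA | HUN | -0.198829 | 32 | 0.301136 | 0.000890425 |
| UKR | CAN | -0.196541 | 72 | 0.327044 | 3.11877e-06 |
| ITA | UKR | -0.190304 | 27 | 0.249708 | 0.000628967 |
| NOR | RUS | -0.167886 | 42 | 0.265585 | 0.000223648 |
| GER | RUS | -0.146194 | 216 | 0.365862 | 1.73015e-08 |
| BLR | CAN | -0.144693 | 57 | 0.273774 | 0.00021738 |
| HKG | RUS | -0.140813 | 19 | 0.147647 | 0.000757625 |
| USA | RUS | -0.127477 | 487 | 0.346152 | 3.90599e-15 |
| RUS | KOR | -0.11664 | 69 | 0.246956 | 0.000226837 |
| CZE | USA | -0.103901 | 131 | 0.327618 | 0.000426771 |
| RUS | JPN | -0.093866 | 227 | 0.289868 | 2.11539e-06 |
| KOR | RUS | -0.0924231 | 199 | 0.329138 | 0.000108097 |
| CZE | CAN | -0.0915781 | 117 | 0.282166 | 0.000671065 |
| RUS | USA | -0.088714 | 400 | 0.293915 | 3.75673e-09 |
| CHN | CAN | -0.0817519 | 167 | 0.278467 | 0.000216408 |
| RUS | CAN | -0.072195 | 313 | 0.268176 | 3.0339e-06 |
| CZE | CZE | 0.10548 | 70 | 0.231123 | 0.000317855 |
| FRA | JPN | 0.111209 | 102 | 0.285243 | 0.000162456 |
| HUN | RUS | 0.132175 | 89 | 0.327803 | 0.000282576 |
| JPN | JPN | 0.156523 | 237 | 0.327147 | 3.22684e-12 |
| GER | GER | 0.163652 | 84 | 0.248169 | 4.80504e-08 |
| AUT | AUT | 0.168284 | 41 | 0.272225 | 0.000348661 |
| BLR | UKR | 0.172673 | 27 | 0.234309 | 0.000876659 |
| RUS | BLR | 0.172728 | 48 | 0.261251 | 4.0039e-05 |
| ITA | ITA | 0.191333 | 83 | 0.257742 | 2.20598e-09 |
| FRA | SUI | 0.196021 | 15 | 0.153326 | 0.000291446 |
| USA | USA | 0.201305 | 390 | 0.344327 | 1.16234e-26 |
| CAN | CAN | 0.202964 | 309 | 0.328201 | 1.87699e-23 |
| SLO | SLO | 0.220573 | 14 | 0.183307 | 0.00080383 |
| CHN | CHN | 0.237789 | 126 | 0.27782 | 1.30817e-16 |
| RUS | RUS | 0.242122 | 519 | 0.270732 | 4.12722e-68 |
| ISR | ISR | 0.246661 | 36 | 0.278365 | 7.70606e-06 |
| KOR | KOR | 0.255873 | 65 | 0.302727 | 4.86167e-09 |
| FRA | FRA | 0.255881 | 149 | 0.334381 | 1.63481e-16 |
| LTU | LTU | 0.264828 | 18 | 0.183077 | 1.53876e-05 |
| GEO | GEO | 0.31548 | 18 | 0.180445 | 1.46255e-06 |
| UZB | UZB | 0.327713 | 14 | 0.180962 | 1.91363e-05 |
| ESP | ESP | 0.339558 | 27 | 0.278435 | 1.40513e-06 |
| KAZ | KAZ | 0.345734 | 30 | 0.314278 | 1.9616e-06 |
| MEX | MEX | 0.34907 | 15 | 0.247822 | 0.000118384 |
| EST | EST | 0.369731 | 28 | 0.258569 | 5.41857e-08 |
| TUR | TUR | 0.435335 | 25 | 0.278468 | 6.76828e-08 |
| BLR | BLR | 0.455432 | 38 | 0.316918 | 1.57049e-10 |
| HUN | HUN | 0.471011 | 34 | 0.344248 | 4.63145e-09 |
| UKR | UKR | 0.505353 | 41 | 0.363255 | 6.7634e-11 |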
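A table of this shape can be produced by testing every country pair, keeping those below the significance threshold, and sorting by mean deviation. The sketch below is only an illustration under stated assumptions: it uses a one-sample t-test via `scipy.stats.ttest_1samp` (the post does not name the test), the sample standard deviation, a hypothetical function name `significant_pairs`, and made-up country codes and deviations.

```python
from statistics import mean, stdev

from scipy import stats


def significant_pairs(pairs, alpha=0.001):
    """Rows (judge, skater, mean, n, std, p) with p < alpha, sorted by mean deviation."""
    rows = []
    for (judge_country, skater_country), deviations in pairs.items():
        if len(deviations) < 2:
            continue  # a t-test needs at least two samples
        p_value = float(stats.ttest_1samp(deviations, popmean=0.0).pvalue)
        if p_value < alpha:
            rows.append((judge_country, skater_country, mean(deviations),
                         len(deviations), stdev(deviations), p_value))
    return sorted(rows, key=lambda row: row[2])


# Made-up deviations for three hypothetical country pairs.
example = {
    ("AAA", "AAA"): [0.20, 0.30, 0.25, 0.22, 0.28, 0.31, 0.19, 0.27],
    ("AAA", "BBB"): [-0.20, -0.30, -0.25, -0.22, -0.28, -0.31, -0.19, -0.27],
    ("AAA", "CCC"): [-0.2, -0.1, 0.0, 0.1, 0.2],  # centred on zero, gets filtered out
}

for row in significant_pairs(example):
    print(row)
```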

Published on June 20, 2020
