Rik Voorhaar

Modeling uncertainty in exam scores

We widely use exams in education to gauge the level of students. The result of an exam is really only indicator of the students actual level, and has a certainly level of uncertainty. In this article we will try to model the uncertainty in the grades of an exam through a Bayesian model, using the amount of points obtained in each question. Such an analysis may be useful in designing good exams, as in principal this error is something we would wish to minimize.

The data

We will use scores for 8 individual questions of a university math exam. We recorded scores of 100 students. Each question is worth 12 points for a total of 96 points, and students could obtain anywhere between 0 and 12 points for each question.

Notice: this data is modified from the original in multiple ways due to privacy reasons. The modifications preserve most qualitative statistical properties._

Modeling the data

Let SS denote the total exam score, and let GjG_j denote the score obtained for the jjth question. Obviously these variables are not independent, high scores in one question are a indicator the student will get high scores in another question as well. For simplicity we want to model the question scores as conditionally independent given the total score SS. The idea behind this is that SS is an unbiased measurement of the students abilities, and the score obtained for a question should only depend on the student’s abilities. This ignores the fact that some questions test similar aspects of the course, and so are more correlated compared to others. If we plot the covariance matrix of the scores we see that there doesn’t seem to be a strong correlation between question scores. We do however see that the 6th question has a higher variance than the other questions

png

Now we need to obtain a good model for the conditional probability P(GjS)P(G_j|S). We will model GjG_j by a binomial distribution, which corresponds to assuming GjG_j consists of 1212 ‘subproblems’, of which the student has equal chance of solving. While this does not seem like a natural assumption, it is very convenient. If mjm_j denotes the total score obtainable for question GjG_j this gives model

P(GjS)P(Gjpj,S)P(pjS), P(G_j|S) \propto P(G_j|p_j,S)P(p_j|S),

with

Gjpj,SBin(pj,mj). G_j|p_j,S \sim \mathrm{Bin}(p_j,m_j).

Next we need to model the conditional distribution P(pjS)P(p_j|S). We tried linear and quadratic regression on logit(pj)\mathrm{logit}(p_j), but this tends to perform badly near the limits S=0S=0 and S=96S=96, and in general tends to shift the predictions towards the global mean. Instead we want a model such that if S=0S=0 then pjp_j is concentrated near zero. To do this we model pjp_j with a beta distribution, whose parameters depend quadratically on SS. More precisely if S^=S/96\hat S= S/96 (which supported on [0,1][0,1]), then we set

pjS,α0,α1j,β0j,β1jBeta(eα0jS+eα1jS2+τ,eβ0j(1S)+eβ1j(1S)2+τ), p_j|S,\alpha_0,\alpha^j_1,\beta^j_0,\beta^j_1 \sim \mathrm{Beta}\left(e^{\alpha^j_0} S+e^{\alpha^j_1} S^2+\tau,\, e^{\beta^j_0} (1-S)+e^{\beta^j_1} (1-S)^2+\tau\right),

where τ=0.01\tau=0.01 is a small regularization parameter. This model has pjp_j constrained close to 00 if SS is small (and similarly pj1p_j\simeq 1 if SS close to 96), and also has enough flexibility to fit our data reasonably well. The exponentials are used both to force the parameters to be positive, and this parametrization also improves the fit. The parameters αi,βi\alpha_i,\beta_i are given normal priors with mean 00 and large sigma, although the fit is not sensitive on this prior.

We will model all the parameters in this model using a Monte Carlo Markov Chain (MCMC) simulation using pymc3. With pymc3 it is very straightforward to implement this model, and requires only a couple lines of code. In the end we obtain samples for the variables GjG_j, and we can model the error in the total score by looking at the distribution of the posterior j=18Gj\sum_{j=1}^{8} G_j.

Results

First we compare the distribution of the predicted exam scores to the actual exam scores. The graph below shows the real scores on the horizontal axis, and predicted scores on vertical axis. The vertically shaded area indicates one standard deviation. The model appears to by unbiased for lower scores, but picks up a slight bias for higher scores. In the original unmodified data this behavior was not apparent, and is therefore likely a byproduct of how the data was modified due to privacy concerns.

png

The goal of this project is to simulate the error in the exam score. Below we plot the the standard deviations of the predicted scores as function of total score. We compare this to a theoretical variance coming from a binomial distribution with 96 trials, and we see that the error is close to this theoretical variance. This means that most of the variance in posterior comes directly from the binomial distributions modeling the question scores GjG_j. We also plot the the variance in the posterior predictive, which is significantly higher. This makes sense, because for the posterior predictive we do not just model the error in each question score, we also have to predict the scores for each question given just the total score.

png

To measure how well our model describes the data, we compare the posterior predictive for each question to the data for each question as function of total score. Here we see that our model describes the distribution quite well for most questions, although in some questions the model underestimates the spread of the data. This is likely because the binomial model is not accurate. To model the questions scores more accurately we would need to step away from the binomial model. However this would make the model more complicated, and the current model does give reasonable predictions for the most part.

png

Evaluation the qualities of questions

We can also use the posterior predictive to qualitatively evaluate the quality of questions. Let us plot the standard deviation and mean of the posterior predictive for each individual question as function of the total score SS below. In a perfect question the variation is as low as possible, and the mean should be relatively close to the straight line. All in all this means that the question provides as much information about the students abilities. Based on this it seems that questions 2,4,8 are quite good, whereas 1,3,6 are not as good. In particular the points obtained for question 6 has the weakest correlation with the total score. Question 5 seems to have been relatively easy, whereas question 3 appears to have been difficult. Analyzing questions in such a manner can be useful for designing future tests, as well as determining what aspects of the course the students found easier or more difficult.

png

png

Published on November 9, 2020

data-science statistics education

Keep reading

Teaser for Time series analysis of my email traffic
Time series analysis of my email traffic

13th of February, 2021

I have 15 years worth of email traffic data, let's take a closer look and discover some fascinating patterns.

read more →
data-science statistics
Teaser for How big should my validation set be?
How big should my validation set be?

26th of August, 2020

Cross validation is extremely important, but how should we choose the size of our validation and test sets?

read more →
data-science statistics
Teaser for Is my data normal?
Is my data normal?

10th of August, 2020

Normally distributed data is great, but how do you know whether your data is normally distributed?

read more →
data-science statistics
Teaser for On Kalman filters and how I made them 20x faster using Rust
On Kalman filters and how I made them 20x faster using Rust

7th of October, 2023

In my first dive into Rust, I implemented an unscented Kalman filter in and made it 20x faster than the equivalent Python implementation.

read more →
website data-science tools
Teaser for Dev log: interactive website dashboard
Dev log: interactive website dashboard

1st of May, 2023

I made an interactive dashboard for this website, and here is the story of how I did it.

read more →
website data-science tools
Teaser for How do my music preferences evolve?
How do my music preferences evolve?

12th of August, 2020

I use last.fm to track my music listening. Let's look at my data to discover how my musical preferences evolve over time.

read more →
data-science music
Teaser for Bias in figure skating judging
Bias in figure skating judging

20th of June, 2020

Judging in figure skating is biased. Let's use data science to figure out just how bad the issue is.

read more →
data-science
Teaser for Introducing the IJ Programming Language
Introducing the IJ Programming Language

15th of January, 2025

I made an array programming language as a language extension to Rust

read more →
coding
Teaser for My self-hosting journey
My self-hosting journey

1st of August, 2024

Self-hosting your own cloud services not only saves money, it is also a great way to learn

read more →
website tools
Teaser for My thesis in a nutshell
My thesis in a nutshell

26th of February, 2023

Read this blog post if you're curious what I worked on during my PhD!

read more →
math
Teaser for GMRES: or how to do fast linear algebra
GMRES: or how to do fast linear algebra

29th of March, 2022

Linear least-squares system pop up everywhere, and there are many fast way to solve them. We'll be looking at one such way: GMRES.

read more →
mathematics linear-algebra code
Teaser for Machine learning with discretized functions and tensors
Machine learning with discretized functions and tensors

10th of March, 2022

We recently made a paper about supervised machine learning using tensors, here's the gist of how this works.

read more →
machine-learning mathematics linear-algebra code
Teaser for Low-rank matrices: using structure to recover missing data
Low-rank matrices: using structure to recover missing data

26th of September, 2021

A lot of data is naturally of 'low rank'. I will explain what this means, and how to exploit this fact.

read more →
machine-learning mathematics linear-algebra code
Teaser for How to edit Microsoft Word documents in Python
How to edit Microsoft Word documents in Python

29th of August, 2021

Parsing and editing Word documents automatically can be extremely useful, but doing it in Python is not that straightforward.

read more →
data-mining code
Teaser for Blind deconvolution #4: Blind deconvolution
Blind deconvolution #4: Blind deconvolution

31st of May, 2021

Finally, let's look at how we can automatically sharpen images, without knowing how they were blurred in the first place.

read more →
machine-learning computer-vision
Teaser for Blind Deconvolution #3: More about non-blind deconvolution
Blind Deconvolution #3: More about non-blind deconvolution

2nd of May, 2021

Deconvolving and sharpening images is actually pretty tricky. Let's have a look at some more advanced methods for deconvolution.

read more →
machine-learning signal-processing computer-vision
Teaser for Blind Deconvolution #2: Image Priors
Blind Deconvolution #2: Image Priors

9th of April, 2021

In order to automatically sharpen images, we need to first understand how a computer can judge how 'natural' an image looks.

read more →
machine-learning signal-processing computer-vision
Teaser for Blind Deconvolution #1: Non-blind Deconvolution
Blind Deconvolution #1: Non-blind Deconvolution

13th of March, 2021

Deconvolution is one of the cornerstones of image processing. Let's take a look at how it works.

read more →
machine-learning signal-processing computer-vision
Teaser for First post
First post

19th of June, 2020

My first post in this blog

read more →
jekyll