
A Better Judging Algorithm for the Largest Hackathon in Wisconsin

Why Existing Judging Solutions Wouldn't Work for MadHacks and What We Did Instead

Prize winners and non-prize winners: this is the distinction every hackathon must make. To determine prize winners at MadHacks, projects are judged by a variety of judges, including professors, alumni, and sponsor company representatives. While determining prize winners is an important part of a hackathon, existing solutions have flaws that mean the best projects rarely make it into the final judging round. In our simulations, if we had used Gavel, the popular hackathon judging algorithm and platform, the top project would only make it to the final round 6% of the time; under our new solution it would make it 51% of the time.

Rubric-Based Judging

We have historically given judges rubrics asking them to score roughly six projects each from 1-10 across a variety of dimensions. The top five projects then advance to a second round where they present to everyone in the hackathon, and the judges agree on a final ranking. However, this approach has many problems (many of which have been documented elsewhere): individual judges tend to give very similar scores across a variety of projects, so the differences between places often come down to very few points, and if one judge is less generous with scoring than another it can skew the results entirely. Per-judge normalization can eliminate some of this bias, but judges' estimates of how much better one project is than another are still potentially unreliable.
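For illustration, here is a minimal sketch of what per-judge score normalization could look like; the data layout and judge names are hypothetical, not our actual rubric pipeline.

```python
# Per-judge z-score normalization sketch (hypothetical data, not our rubric code):
# each judge's raw scores are rescaled so a harsh judge and a generous judge
# contribute on the same scale.
import statistics

# raw_scores[judge][project] = rubric total that judge gave that project
raw_scores = {
    "judge_a": {"p1": 7, "p2": 8, "p3": 6},
    "judge_b": {"p1": 3, "p2": 5, "p3": 2},
}

normalized = {}
for judge, scores in raw_scores.items():
    mean = statistics.mean(scores.values())
    stdev = statistics.pstdev(scores.values()) or 1.0  # avoid dividing by zero
    normalized[judge] = {p: (s - mean) / stdev for p, s in scores.items()}

# Even after normalization, the relative gaps each judge assigns still drive the
# final ordering, which is the part we argue is unreliable.
print(normalized)
```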

Comparison-Based Judging

Instead of having our judges grade projects on a rubric, we could instead ask them to simply compare two projects. We can then assume these comparisons are the result of a latent project 'quality', which I'll call q-factor: a project with a much higher q-factor than another has a high chance of being ranked above it by a judge, while two projects with similar q-factors have roughly equal chances of being ranked higher. The probability that project A is ranked above project B can be given mathematically using a Bradley-Terry model:

P(p_A > p_B) = \frac{e^{q_A}}{e^{q_A} + e^{q_B}}

Some observations about this model, assuming projects are sampled from an approximately normal q-factor distribution: in expectation, the probability that the highest q-factor project is ranked above the second highest by a judge is greater than the probability that the second highest is ranked above the third (and so on through the top half of projects), because the gaps between order statistics of a normal sample are wider in the tails than near the median. This means it is 'easier' to infer which project should be first than which project should be second.
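To make this concrete, here is a small simulation sketch of my own (synthetic data, not code from our repository) that averages the Bradley-Terry win probability between adjacent-ranked projects over many synthetic hackathons, using the 1.5 standard deviation assumed later in this post.

```python
# Adjacent comparisons near the top of a normally distributed field are, on
# average, more lopsided than adjacent comparisons near the middle.
import math
import random

random.seed(0)
N_PROJECTS = 80
Q_SD = 1.5  # q-factor standard deviation assumed later in the post

def p_a_beats_b(q_a, q_b):
    """Bradley-Terry probability that project A is ranked above project B."""
    return math.exp(q_a) / (math.exp(q_a) + math.exp(q_b))

def avg_adjacent_win_prob(rank, trials=2000):
    """Average P(rank-th best beats the next best) over many synthetic hackathons."""
    total = 0.0
    for _ in range(trials):
        q = sorted((random.gauss(0, Q_SD) for _ in range(N_PROJECTS)), reverse=True)
        total += p_a_beats_b(q[rank], q[rank + 1])
    return total / trials

print("E[P(1st beats 2nd)]   ~", round(avg_adjacent_win_prob(0), 3))
print("E[P(2nd beats 3rd)]   ~", round(avg_adjacent_win_prob(1), 3))
print("E[P(40th beats 41st)] ~", round(avg_adjacent_win_prob(39), 3))
```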

Existing Comparison-Based Solutions

A variety of solutions exist for inferring q-factor from the comparison results generated by judging. An obvious approach is to simply pre-assign judges a set of project comparisons, then infer q-factor using either maximum likelihood estimation or a Bayesian approach (such as the one detailed in a PyMC blog post). However, these approaches require a large number of comparisons to be effective.
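Here is a minimal sketch (not the post's actual code) of the pre-assigned-comparisons idea, fit with a lightly regularized maximum likelihood (effectively a MAP estimate with a weak normal prior, which keeps the scores identified and finite); all data below is synthetic.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_projects, n_comparisons, q_sd = 80, 160, 1.5
true_q = rng.normal(0, q_sd, n_projects)

# Pre-assigned random pairs, decided according to Bradley-Terry.
winners, losers = [], []
for _ in range(n_comparisons):
    a, b = rng.choice(n_projects, size=2, replace=False)
    p_a_wins = 1 / (1 + np.exp(true_q[b] - true_q[a]))
    w, l = (a, b) if rng.random() < p_a_wins else (b, a)
    winners.append(w)
    losers.append(l)
winners, losers = np.array(winners), np.array(losers)

def neg_log_posterior(q):
    # -log P(winner > loser) = log(1 + exp(q_loser - q_winner)), plus a weak prior.
    likelihood = np.sum(np.logaddexp(0.0, q[losers] - q[winners]))
    prior = 0.5 * np.sum(q**2) / q_sd**2
    return likelihood + prior

fit = minimize(neg_log_posterior, np.zeros(n_projects), method="L-BFGS-B")
print("estimated top five:", np.argsort(fit.x)[::-1][:5])
print("true top five:     ", np.argsort(true_q)[::-1][:5])
```

With only ~2 comparisons per project (as in the hackathon setting), the estimated top five rarely matches the true top five, which is the point of the next sections.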

Gavel

Gavel is a judging platform and algorithm created for use at HackMIT that promises a better approach. Gavel is based on Crowd-BT and can be conceptually thought of as similar to a chess Elo system: every project starts with the same initial q-factor and an initial standard deviation representing the uncertainty in that estimate. This normal distribution represents Gavel's belief about where the project's true q-factor lies. After each pairwise comparison, the q-factors and uncertainties of the two projects involved are updated according to the Bayesian formulas derived in the Crowd-BT paper.

In an attempt to improve accuracy, Gavel actively tells judges where to go as they are judging. Judges visit a series of projects and compare each project they visit to the previous one. To pick the next project to send a judge to, Gavel picks whichever project (that the judge hasn't visited and that isn't currently being visited by another judge) maximizes the expected information gain when compared against the current project: for each of the two possible outcomes, the current expected probability of that outcome times the KL divergence between the post-update and pre-update q-factor distributions of both projects under that outcome, summed over both outcomes.
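To illustrate the structure of this selection rule, here is a conceptual sketch. The belief update below is a crude Elo-style stand-in, not the actual Crowd-BT formulas, and the numbers are made up; only the shape of the expected-information computation mirrors the description above.

```python
import math

def kl_normal(mu1, sd1, mu0, sd0):
    """KL( N(mu1, sd1) || N(mu0, sd0) ) for univariate normals."""
    return math.log(sd0 / sd1) + (sd1**2 + (mu1 - mu0) ** 2) / (2 * sd0**2) - 0.5

def p_win(mu_a, mu_b):
    return 1 / (1 + math.exp(mu_b - mu_a))

def toy_update(winner, loser, lr=0.3, shrink=0.95):
    """Crude stand-in update: nudge means toward the observed outcome, shrink sds."""
    (mu_w, sd_w), (mu_l, sd_l) = winner, loser
    surprise = 1 - p_win(mu_w, mu_l)
    return (mu_w + lr * surprise, sd_w * shrink), (mu_l - lr * surprise, sd_l * shrink)

def expected_info_gain(bel_a, bel_b):
    """E[ KL(updated beliefs || current beliefs) ] over the two possible outcomes."""
    p_a = p_win(bel_a[0], bel_b[0])
    gain = 0.0
    for p_outcome, (winner, loser) in [(p_a, (bel_a, bel_b)), (1 - p_a, (bel_b, bel_a))]:
        new_w, new_l = toy_update(winner, loser)
        gain += p_outcome * (kl_normal(*new_w, *winner) + kl_normal(*new_l, *loser))
    return gain

# beliefs[i] = (mean, sd) of project i's q-factor; the judge just left project 0.
beliefs = {0: (0.0, 1.0), 1: (0.2, 1.0), 2: (-1.5, 0.4), 3: (0.1, 0.9)}
current = 0
candidates = [i for i in beliefs if i != current]
next_project = max(candidates, key=lambda i: expected_info_gain(beliefs[current], beliefs[i]))
print("send the judge to project", next_project)
```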

Fake Data Simulations

Andrew Gelman, a statistics professor at Columbia, considers it immoral to gather data without first testing your methods with synthetic data.[1] To validate the effectiveness of existing solutions, I ran simulations assuming judges make decisions exactly according to a Bradley-Terry model, creating synthetic sets of project q-factors sampled from a normal distribution. The results were extremely underwhelming: assuming 27 judges each making 6 comparisons, 80 total projects, and a q-factor standard deviation of 1.5,[2] Gavel only included the highest q-factor project in the top five (and therefore the final judging round) in 8% of 100 simulations, only slightly above picking winners at random (5/80 = 6.25%). While it is possible I made mistakes adapting the Gavel code for automated testing, I don't believe they were significant, since Gavel performed similarly to doing random comparisons and fitting a Bayesian model afterwards (which isn't a usable real-world algorithm: because the comparisons are random instead of against the previous team, it would require judges to visit roughly twice as many projects as Gavel does).
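The simulation loop looked roughly like the sketch below (a paraphrase of the setup described above; the real harness is in the GitHub repo linked in the conclusion). `run_judging` is a placeholder for whichever algorithm is under test, shown here with a trivial win-counting baseline.

```python
import math
import random

N_PROJECTS, N_JUDGES, COMPARISONS_PER_JUDGE, Q_SD = 80, 27, 6, 1.5

def judge_prefers_a(q_a, q_b):
    """Judges are assumed to follow the Bradley-Terry model exactly."""
    return random.random() < 1 / (1 + math.exp(q_b - q_a))

def run_judging(true_q):
    """Placeholder: random pairwise comparisons plus simple win counting.
    Swap in Gavel, a Bayesian fit, etc. to evaluate a real algorithm."""
    wins = [0] * N_PROJECTS
    for _ in range(N_JUDGES * COMPARISONS_PER_JUDGE):
        a, b = random.sample(range(N_PROJECTS), 2)
        wins[a if judge_prefers_a(true_q[a], true_q[b]) else b] += 1
    return sorted(range(N_PROJECTS), key=lambda i: wins[i], reverse=True)

hits = 0
for _ in range(100):
    true_q = [random.gauss(0, Q_SD) for _ in range(N_PROJECTS)]
    best = max(range(N_PROJECTS), key=lambda i: true_q[i])
    hits += best in run_judging(true_q)[:5]
print(f"highest q-factor project reached the final round in {hits}/100 simulations")
```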

Why the Disappointing Results?

I think there are a number of potential reasons why Gavel underperformed in these simulations:

  1. Judging result quality is simply not visible, and these results may be reasonable: with real data we never know the true result, so hackathons could have inaccurate judging that has gone unnoticed, since the winners are usually going to be high-percentile teams even if theirs isn't the true highest q-factor project.
  2. Information gain maximization is less powerful than it sounds: an ideal algorithm would maximize the information gained about the resultant top five projects, but that is difficult, so a different expected information metric is used. The metric used in Gavel can assign high weight to relatively unimportant information (such as a comparison between two teams with high uncertainty but low estimated q-factors). Additionally, the presence of many outstanding judgments (a judge has been sent but hasn't decided) at any given time could reduce the reliability of even these estimates. Finally, only considering the immediate next project could lead to local maxima.
  3. Judge accuracy inference: while I didn't mention it previously, Gavel also tries to infer how accurate judges are and weight their judgments accordingly. I don't think this makes much of a difference because of the low-data environment and high starting assumed accuracy, but it is possible that distrust of judges slightly slows convergence.
  4. Deviations from Crowd-BT: Crowd-BT was originally designed for data labeling via crowdsourcing services. To adapt it to a hackathon, reasonable restrictions were placed on the set of allowed comparisons (must compare against the previous project, can't revisit a project), which could be limiting the effectiveness of the information gain maximization.
  5. Lack of information: determining q-factor based on a project only being visited ~2 times is simply very difficult.

Beyond Pairwise

It seems we simply need more information to get reasonable results, but we lack the human cloning machine necessary to get a large number of comparisons in a reasonable amount of time. To obtain more information while having judges visit the same number of teams, we can have judges compare multiple teams at once. Intuitively, the three-way comparison result (A > B > C), which claims ((A > (B and C)) and (B > C)), is more informative than the two pairwise comparison results (A > B) and (B > C), yet doesn't require visiting any additional projects. This does have the disadvantage of requiring the judges to make one difficult decision instead of many easier ones, but it allows higher information gain in roughly the same amount of time. Our Bradley-Terry model can be generalized into a Plackett-Luce model, which allows n-wise comparisons (the probability of a given ranking is the probability of the first-ranked project being ranked above the rest, times the probability of the second being ranked above the remaining projects, and so on):

P(p_1 > p_2 > \dots > p_n) = \frac{e^{q_1}}{e^{q_1} + e^{q_2} + \dots + e^{q_n}} \times \frac{e^{q_2}}{e^{q_2} + \dots + e^{q_n}} \times \dots \times \frac{e^{q_n}}{e^{q_n}}
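As a small illustrative sketch (not the code we used), here are the two Plackett-Luce pieces we need: the log probability of an observed n-wise ranking, and a way to simulate a judge producing one. The project names and q-factors are hypothetical.

```python
import math
import random

def plackett_luce_log_prob(ranking, q):
    """log P(ranking[0] > ranking[1] > ...) under Plackett-Luce with scores q."""
    log_p = 0.0
    for i, project in enumerate(ranking):
        remaining = ranking[i:]
        log_p += q[project] - math.log(sum(math.exp(q[j]) for j in remaining))
    return log_p

def simulate_judge_ranking(assigned_projects, q):
    """Sample a ranking from Plackett-Luce: repeatedly pick the next-best project
    with probability proportional to exp(q) among those not yet ranked."""
    remaining = list(assigned_projects)
    ranking = []
    while remaining:
        weights = [math.exp(q[p]) for p in remaining]
        pick = random.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking

q = {"A": 1.2, "B": 0.4, "C": -0.3, "D": -1.0}  # hypothetical q-factors
ranking = simulate_judge_ranking(["A", "B", "C", "D"], q)
print(ranking, round(plackett_luce_log_prob(ranking, q), 3))
```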

This raises a new question: how many projects should judges compare at once? Given that larger comparisons are more informative, and given the potential issues with information gain algorithms outlined previously, simply pre-assigning judges six projects and having them stack rank them all seemed like a reasonable choice. Using a Bayesian Plackett-Luce model[3] in the previous scenario, with the judges now making decisions according to a Plackett-Luce model (and given realistic static judging assignments[4]), the highest q-factor project appeared in the top five rankings in 50% of 100 simulations while the judges still visited the same number of projects. Surprisingly, simply awarding a 'point' to the top-ranked project of each comparison resulted in the highest q-factor project appearing in the top five in 39% of 100 simulations. While I would of course like the accuracy of our judging to be higher, given the constraints I think having judges rank all the projects they visit and then fitting a Bayesian Plackett-Luce model afterwards is a better solution for our hackathon than both our previous rubrics and other existing solutions.
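For the curious, here is a hedged sketch of what a Bayesian Plackett-Luce fit could look like in PyMC, in the spirit of the Bradley-Terry PyMC post mentioned earlier; our actual model lives in the GitHub repo and may differ, and the `judge_rankings` data here is hypothetical.

```python
import numpy as np
import pymc as pm

n_projects = 80
# Each entry is one judge's stack ranking (best first) of the project ids visited.
judge_rankings = [
    [12, 3, 47, 8, 61, 22],   # judge 1 (hypothetical)
    [5, 47, 19, 70, 3, 33],   # judge 2, etc.
]

with pm.Model() as model:
    q = pm.Normal("q", mu=0.0, sigma=1.5, shape=n_projects)

    # Plackett-Luce log likelihood: for each position, the chosen project's score
    # minus the log-sum-exp of the scores of everything still unranked.
    total = 0.0
    for ranking in judge_rankings:
        idx = np.array(ranking)
        for i in range(len(idx)):
            total = total + q[idx[i]] - pm.math.logsumexp(q[idx[i:]])
    pm.Potential("plackett_luce", total)

    # MCMC is the more principled option; pm.find_MAP() is the fast one (see footnote 3).
    idata = pm.sample(1000, tune=1000, chains=2)

posterior_mean_q = idata.posterior["q"].mean(dim=("chain", "draw")).values
print("estimated top five projects:", np.argsort(posterior_mean_q)[::-1][:5])
```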

Murphy's Law

Theorizing about how to run judging is one thing; actually running it is another. This year was our largest MadHacks ever, with over 400 participants, and as the submission deadline on Sunday approached, the total project count rapidly climbed to 114 projects! We wanted every project to be visited at least twice to ensure fairness, but with six visits per judge this would have required about 40 judges. We had already planned to have organizers judge in the 'worst case', but even with the present organizers judging and each judge visiting 7 projects, we did not have the 33 judges required for every project to be visited at least twice, and we had to scramble to find additional judges.

While simultaneously trying to figure out where to find a cloning machine, we decided to print out our judging forms. Another organizer had designed an algorithm to assign teams to judges such that multiple judges would be unlikely to visit a team at once and judges could avoid certain teams (to avoid conflicts of interest). Unfortunately, when we went to run this algorithm during the event, it hung and didn't output anything. I had to write a simpler algorithm on the spot so we could print out judge assignments.[5]

[Photo: rewriting the judge assignment code]

We then went to print the signs for all the teams with their locator (section letter + team number) and team name, when Andrew noticed that the hand-drawn map telling judges where to go had both a missing number and a duplicated number (and was too difficult to modify significantly). He then had to adjust the assigned numbers, adding a "B75ALT" and skipping the missing number.

After all of the judges turned in their papers, we ran our model and the following teams made it into the final round: Factify, 3Docs, Sub2Lease, Collaboard, and Lexon. These teams then presented in front of everyone, and after the final judging round 3Docs placed first, Factify placed second, and Collaboard placed third.

Conclusion

While assessing the quality of hackathon results is difficult for reasons I hope are apparent by now, I'm happy with how the judging results felt this year. Anecdotal reports from judges that ranking the middle projects was more difficult suggest our theoretical model of judge behavior is reasonable. In 1,000 simulations with our actual judge and project counts, assuming the same 1.5 standard deviation and the correctness of the Plackett-Luce model, there was a 51% chance the highest q-factor project made it to the final round, compared to 6% if we had used Gavel (though this standard deviation is probably slightly optimistic: with a very uninformative prior, I found a 94% credible interval of [0.626, 1.547] for the actual standard deviation of the projects at MadHacks).

This is by no means the final say on hackathon judging, and there may very well be ways to improve upon this algorithm further. There is still no substitute for having as many judges as possible; a project only being visited twice means the results will be sensitive to individual judge behavior under any reasonable algorithm. The code and data for the analysis can be found on my GitHub. In the future we will probably devise an electronic way to collect judgments, similar to the one in Gavel, to improve our result turnaround time and reduce the manual labor required. I would like to thank Anish Athalye for their many blog posts about judging at HackMIT that served as a starting point for this research. Additional thanks to Martin Ingram for their blog post on Bradley-Terry models in PyMC that we based our code on. I would also like to thank all the corporate sponsors, organizers, and volunteers who helped make the hackathon successful this year. Finally, I would like to thank Richard McElreath for their book Statistical Rethinking and Andrew Gelman for their excellent statistics blog.

  1. ^ "I think it's now immoral to gather data without first simulating fake data and writing the code and fitting the model to the fake data." At 31:12 of Learning Bayesian Statistics #20: Regression and Other Stories, with Andrew Gelman, Jennifer Hill & Aki Vehtari.
  2. ^ I determined that a standard deviation of 1.5 seemed reasonable based on an MLE Bradley-Terry model I fit on the raw 2015 HackMIT data, plus intuition. With this standard deviation, a 95th percentile project has a 92% chance of being ranked above an average (50th percentile) project.
  3. ^ Using Markov chain Monte Carlo (MCMC) to do the estimation is probably the most theoretically sound option, but it is slow. Maximum a posteriori (MAP) estimation is less theoretically justified, but it is considerably faster, which made it much easier to test and left us more confident in our approach.
  4. ^ The assignment algorithm I originally used in testing repeated the same set of projects across multiple judges due to a programming error, which performs slightly worse than an assignment with better randomization, such as the one we ended up using. The numbers in this blog post are from a rerun with our final judge assignment algorithm.
  5. ^ The author of the original algorithm said it worked completely fine when he tried it after the event. When I programmed the replacement on his laptop, I also proved 8 <= 7 in Python. While this is probably due to me not saving my modified code because Apple's laptops have strange keyboards, I choose to believe his laptop is haunted.