Mel Harbour on the British Rowing personal ranking index part 2

Mel Harbour follows up on why the evidence suggests that the British Rowing Personal Ranking Index is flawed

British Rowing expects all competitions to be using the new points and ranking system by April 2018. Mel Harbour argued in Part One that the new system has fundamental flaws and that the maths underlying the system they’ve designed doesn’t work. Part Two seeks to further explain the flaws and answers some of the questions that have been posed following the first article.

So it seems that my first article has inspired a good deal of discussion around the concepts of points, rating, ranking and competition structure. I’ve had lots of feedback and questions, and I thought it might be useful to go through some of them. I firmly subscribe to the theory that it’s important to allow yourself to be questioned as it encourages you to work through any potential issues in what you’re proposing before they become a problem.

Since I wrote that article, the Scullers Head has also run, which nicely serves as a practical illustration of exactly the problems I’ve been describing. I’ll discuss this first, what it highlights and some possible solutions. I’ll then discuss some of the points and questions that people have raised.

Scullers Head

For those that haven’t been following, Scullers Head ran in December last year on a flood tide rather than the normal ebb tide, and stream conditions appear to have varied significantly through the race. This means that there appears to have been a significant advantage in starting in certain parts of the draw, so not all results are directly comparable. This led the race committee to decide not to issue an overall set of results, but rather to restrict them to being ordered within each category only. I’m not going to discuss whether that’s the right or the wrong decision – that’s a matter for another day. I’m going to look at the PRI implications of the result. Given the way the system works, you have a limited set of options for how you award PRI points based on the result. And they’re all terrible for various reasons:

Award points based on the overall time. This is the assumption that PRI works on – it’s a single division head race, so you get compared against everyone else in the same boat type as you. The problem with this is that it doesn’t take into account the changing conditions. You’re handing out points based on results that the race organisers consider to not be ‘fair’. You can argue legitimately that it’s an outdoor sport, and the result on the day should stand, however in terms of coming up with a good assessment of an individual’s standard, it means that the PRI is at the mercy of the conditions, and becomes much more random.

Reclassify the race as a multi-division race and make each category its own division. Now you face the problem that the division sizes are spectacularly different. To pick an example, Men’s Elite Lwt had four competitors, so the winner would get 4 points, while Men’s Masters E had 26 competitors, so the winner would get 45 points (see Table B in the Reference Book). Does anyone really think that the winner of Masters E is more than 11 times better than the winner of Elite Lwt? You also break the assumption that the bigger races by competitor number are more prestigious and of a higher standard.

Again divisionalise the race, but by marshalling division. For those that aren’t aware, races on the Tideway are usually marshalled into ‘divisions’. These help the marshals control a large number of crews on the water with more structure, but they are usually started immediately following one another, rather than after the previous division has already completed the course. The problem here comes when people who are in the same category start in different divisions. A good example from the Scullers Head results comes from Men’s Masters C. Most of the competitors have start numbers in the 170-190 range, however one person started 480th. If that is in a different division, then you have a situation where people aren’t being compared to their direct competitors, and you can potentially wind up giving more points to someone coming second in a head race than to the person coming first. I believe that this exact scenario has already happened in real races run under PRI.

So you’ve got really only three options under PRI and they’re all terrible. None of them serve to grade the people by standard fairly, and it’s a natural product of the fundamental mechanics discussed in the first article – the fact you’re scoring people on the number of people they’re beating, rather than the quality of those people. Add a measure of quality back in, and you can go some way to addressing the issue. I will say that there’s no method of completely solving it, in the same way that you couldn’t use seat racing results if conditions changed between races. But you can at least put yourself back in the realm of being able to do some fair grading.

If you go with option 1 of grading people overall, there are ways you can use all the results, but consider them ‘less authoritative’ if their start numbers are further apart. This sort of thing is standard practice in polling – you weight results based on how truthful you believe them to be. It’s complicated, and would be buried away in the maths underlying the system, but could be critiqued by people with appropriate skills.
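As a rough illustration of that weighting idea, here is a minimal sketch. It assumes the trust placed in a comparison halves for every 50 places between two start numbers. The function name, the `half_gap` parameter and the exponential shape are all hypothetical choices for illustration, not part of PRI or of any published system.

```python
def comparison_weight(start_a: int, start_b: int, half_gap: float = 50.0) -> float:
    """Weight in (0, 1] for a pairwise result, halving every `half_gap` start places.

    Assumption: the further apart two scullers started, the more the
    conditions may have differed, so the less their comparison is trusted.
    """
    gap = abs(start_a - start_b)
    return 0.5 ** (gap / half_gap)


# Two scullers who started close together: the comparison counts almost fully.
print(comparison_weight(172, 180))

# A sculler starting around 175th against one starting 480th: counts for very little.
print(comparison_weight(175, 480))
```

A real system would need to choose the shape of that curve from data (for instance, how quickly the stream actually changed), which is exactly the sort of detail that people with the appropriate skills could critique.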

If you go with either of the latter options of saying that you will grade people within some sort of division-based structure, the way such systems work is that they may cause a little bit of over/undershoot due to not having a full ranking, but that ought to be limited. The algorithm can say that person 1 beat person 2, but if person 1 keeps winning the amount they get pushed ahead of person 2 decreases dramatically. In order to keep progressing, person 1 would have to race ‘harder’ people.
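To see the diminishing returns in action, here is a small sketch using an Elo-style update. The constants (a 400-point scale and a K factor of 32) are the conventional chess values and are used purely for illustration; the function names are my own, and a rowing system would need its own tuning.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo-style probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def repeated_wins(rating_a: float, rating_b: float, races: int, k: float = 32.0) -> list[float]:
    """Points gained by A in each of `races` consecutive wins over B."""
    gains = []
    for _ in range(races):
        # The more the system already expects A to win, the smaller the gain.
        gain = k * (1.0 - expected_score(rating_a, rating_b))
        rating_a += gain
        rating_b -= gain
        gains.append(gain)
    return gains


gains = repeated_wins(1500.0, 1500.0, races=10)
print([round(g, 1) for g in gains])
```

Each successive win over the same opponent is worth less than the one before, so the only way for person 1 to keep climbing is to find stronger opposition.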

Head of the River Fours on the Tideway

Why didn’t you raise all this earlier?

A mixture of reasons. Firstly, this isn’t my day job. I don’t sit there going over every change to the British Rowing rules of racing. I, like most people, rely on the various checks and balances that should be in place, and assume that if British Rowing were commissioning a new system like this they would engage people who have the expertise required to understand the problem. As I understand it, other people have been raising similar concerns about the system for some time (think years), but the view seems to be that since they didn’t represent a majority, there was no issue. Of course this misses the fairly obvious point that not everyone is equally well equipped to understand the problems – that’s not a criticism, just an observation. Ask me about a medical matter and I wouldn’t suggest that I would know better than a doctor!

Personally, I’ve been attempting to raise this with British Rowing for over 12 months now. Their response is that they are fully committed to implementing the new system and aren’t willing to accept any form of review or discussion about it until they have finished rolling it out. In their responses they consider much of what I’m presenting to be opinion. Unfortunately, it’s not (ask some more mathematicians if you’re unsure of that fact!).

PRI is in the implementation phase. We need real data both to see whether it works and then to tweak it to work.

As I’ve explained, the problem is with the fundamentals of how the system works. The core concepts are broken, so the ‘tweak’ required would be to pull the whole thing down and then recalculate ratings from scratch using a better system. Maybe that’s what some people might call a ‘tweak’, but that’s not what I’d call it!

As far as the ‘real data’ point goes, it’s unfortunately complete rubbish. Of course, any rating system dealing with lots of competitors and lots of input data is going to be complicated. But while you, a human, might think that this is a lot of data, in computing terms it really isn’t. So we can use a technique called Monte Carlo Simulation fairly easily. It’s a flashy term, but all it really means is that we put in random test race results and see what happens at the end. You can make your simulated inputs completely random, or you can try some probability distributions to see what happens (for example, you can assume that an international is fairly likely to win their races). The computer tries a huge number of scenarios for you – it rolls some virtual dice lots of times (hence the name of the method). You would then look at the results. Let me be really clear – this isn’t some sort of niche mathematical concept – it’s very standard, and something that is done day-in, day-out to model how lots of things are going to behave.
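To give a flavour of how little code this takes, here is a minimal sketch of such a simulation. Everything in it is an assumption made for illustration: the number of competitors and races, the noise levels, and above all the award rule, which is a placeholder (one point per entry, to the winner only) and not the real Table B. The point is the shape of the experiment, not the numbers it prints.

```python
import random


def simulate_season(n_competitors=50, n_races=200, seed=0):
    """Simulate one season of races and return (true abilities, points awarded)."""
    rng = random.Random(seed)
    # Hidden "true" standard of each competitor; a rating system never sees this.
    ability = [rng.gauss(0.0, 1.0) for _ in range(n_competitors)]
    points = [0] * n_competitors
    for _ in range(n_races):
        field_size = rng.randint(2, 20)
        field = rng.sample(range(n_competitors), field_size)
        # Performance on the day is true standard plus some random noise.
        performance = {c: ability[c] + rng.gauss(0.0, 0.5) for c in field}
        winner = max(field, key=performance.get)
        # Placeholder award rule (an assumption, NOT Table B): one point per entry.
        points[winner] += field_size
    return ability, points


def rank_agreement(ability, points):
    """Fraction of competitor pairs that the points order the same way as true ability."""
    pairs = agree = 0
    for i in range(len(ability)):
        for j in range(i + 1, len(ability)):
            pairs += 1
            if (ability[i] - ability[j]) * (points[i] - points[j]) > 0:
                agree += 1
    return agree / pairs


# Roll the virtual dice for many seasons and average the outcome.
scores = [rank_agreement(*simulate_season(seed=s)) for s in range(200)]
print(f"mean agreement with true standard: {sum(scores) / len(scores):.2f}")
```

Swap the placeholder award rule for any candidate system and the same harness tells you how well its output tracks the standard you fed in, long before a single real race is run.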

I’ve read your previous article, but I still don’t really get the concept.

That’s fine. Some of this builds on statistics and probability and I know that many people find maths challenging. That said, with a bit of time, almost anyone should be able to get their heads round the idea of how it works. You don’t need to understand every last detail to use it, in the same way that most people can drive a car, but very few of them can actually explain exactly how a modern engine works; lots of you will be reading this article on a computer but very few will understand exactly how it got to you via the internet!

At the highest level, rating systems all work the same, whether they’re the old British Rowing system, the new PRI system, Elo, TrueSkill or anything else. This procedure is what mathematicians and computer scientists call an ‘algorithm’, but is essentially just a sequence of steps. In the case of rating systems, rating a head to head race, the steps are:

  • Estimate the standard of the two participants
  • Compare the standard of the two participants
  • Based on that comparison, estimate what the outcome is going to be
  • Run the race
  • Compare the result to the estimate of the outcome
  • Revise the estimate of the standard of the two participants.

What’s different between the different systems is the methods of estimating and comparing the standards of the participants. The better the system, the better its estimate of the standard and the closer its estimate of the outcome of the race.
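The steps above can be sketched in a few lines using an Elo-style update as the example system. This is an illustration of the sequence, not a proposal for the exact formula: the 400-point scale and K factor of 32 are the conventional chess constants, and the function name is hypothetical.

```python
def rate_head_to_head(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the revised (rating_a, rating_b) after one side-by-side race."""
    # Steps 1-3: take the current estimates, compare them, and predict the outcome
    # as the probability that A wins.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    # Step 4 happens on the water; `a_won` is the result.
    actual_a = 1.0 if a_won else 0.0
    # Step 5: compare the result to the prediction.
    surprise = actual_a - expected_a
    # Step 6: revise both estimates in proportion to how surprising the result was.
    return rating_a + k * surprise, rating_b - k * surprise


# The favourite wins, as expected: ratings barely move.
print(rate_head_to_head(1700.0, 1500.0, a_won=True))

# The underdog wins: a surprising result, so ratings move a lot.
print(rate_head_to_head(1700.0, 1500.0, a_won=False))
```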

Doesn’t Elo have flaws as well?

Yes. It does. In fact I’m pretty convinced that there’s no such thing as a perfect rating system. It will always be possible to construct an example that forces any rating system into an odd state. The advantage of sticking with a system that’s been well studied is that most of the work of understanding these flaws will have already been done. Rather than me copy other people’s work, I’ll refer you to an in-depth article on rating systems here: http://www.lifewithalacrity.com/2006/01/ranking_systems.html

If you read through the article you’ll see some discussion of potential flaws, and also how they have been fixed in evolutions. To be clear, when I discussed a pure Elo type system in the previous article, what I was doing was illustrating the principle of how these systems work, rather than saying that any individual system should be used. It would be important to consider the alternatives and the specific situations we find in rowing.

The other key thing to consider in any discussion of systems is ‘what happens when it does go wrong’. If we assume that all systems have their flaws, and will, from time to time, rate someone incorrectly, we can think about what’s going to happen. Some examples illustrate this:

PRI – a person has too few points compared with their actual standard. In this case, they may be able to race ‘below their standard’ and pick up a good result. When they do so, they may, or may not, get enough points to push them up in standard. It depends on the number of entries. If we’re lucky, the system begins to correct itself. If not, they stay where they are.

PRI – a person has too many points compared with their actual standard. In this case, in order to correct the error, the system would have to reduce their points total. The only way it can do that (since it takes your 8 highest scoring races) is to wait for the points totals to reduce enough. So it will only correct if you manage to wait long enough for the points total to reduce while not accumulating any other high scoring but low standard races. It is guaranteed to keep your points too high for at least 12 months, since points do not reduce within the first year they are earnt.

TrueSkill or similar – whichever way the misvaluation of points has occurred, when a person next races, there will be comparison with others, and the system will begin to correct itself in whichever direction is required.
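The contrast for the over-rated competitor can be sketched as follows. The first function models only the ‘8 highest scoring races’ rule described above and ignores the decay over time (which, as noted, does not start within the first year). The second is an Elo-style stand-in for a comparison-based system, with the opponent’s rating held fixed for simplicity. Both function names and all the numbers are hypothetical.

```python
def best_eight_total(scores: list[float]) -> float:
    """Sum of the eight highest race scores (decay over time is ignored here)."""
    return sum(sorted(scores, reverse=True)[:8])


def rating_after_losses(rating: float, opponent_rating: float, losses: int, k: float = 32.0) -> float:
    """Elo-style rating after a run of losses to an opponent of fixed rating."""
    for _ in range(losses):
        expected = 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))
        rating += k * (0.0 - expected)  # a loss always pulls the rating down
    return rating


# Eight flattering results, then five poor ones: the total does not move.
flattering = [40.0] * 8
print(best_eight_total(flattering), best_eight_total(flattering + [2.0] * 5))

# An over-rated 1800 losing five times to 1500-rated opposition starts to correct.
print(round(rating_after_losses(1800.0, 1500.0, losses=5), 1))
```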

Won’t people always find a way to dodge points?

Possibly, yes. As I mentioned above, no system is perfect. But this is where you have to look at a wider question of why people were avoiding gaining points under the old system. One of the stated aims of the new system was:

“We need a competition system that discourages the practice of crews and individuals actively trying to avoid gaining points (and by extension, avoiding racing) so they could compete at a low level” (source https://www.britishrowing.org/events/competition-framework/)

I think this is a subtle misunderstanding of the problem. Competitors in all sports like to win things. Of course they do. But they also like to test themselves against people of their own standard. It’s not actually much fun for a GB squad athlete to turn up to a small regional regatta and win a pot against no opposition. So what’s happening? The reality is that people aren’t just completely avoiding points in order to race at a low standard. They’re trying to avoid winning cheap points: ones that they didn’t have to work very hard (for their standard) to earn. IM2 at a small regatta was easier to win than IM2 at a big Dorney regatta, for example. The new system actually makes that problem even worse by creating more scenarios in which you wind up winning cheap points. In most fields where a rating system is used that truly reflects standard, a different behaviour emerges, where people actively want to increase their rating in order to prove their standard and race against people who are also of that standard.

How would it work for crew boats?

This is a question that the developers of the TrueSkill algorithm I linked to at the end of the previous article have already tackled. See here for details: https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/

Essentially you add up the crew members’ ratings and then after they race, you distribute any new points amongst the crew members. You can’t know which of the crew members were most responsible for the win without more information, but at least you continue to estimate their standard.
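A minimal sketch of that idea is below. It is not the TrueSkill update itself; it swaps in a simple Elo-style calculation on the summed crew ratings and splits the change equally between crew members. The `scale` and `k` parameters and the function name are assumptions for illustration only.

```python
def rate_crews(crew_a: dict, crew_b: dict, a_won: bool, k: float = 32.0, scale: float = 400.0):
    """Return updated {name: rating} dicts for two crews after one race."""
    # Add up the crew members' ratings to get a rating for each boat.
    total_a = sum(crew_a.values())
    total_b = sum(crew_b.values())
    expected_a = 1.0 / (1.0 + 10 ** ((total_b - total_a) / scale))
    change_a = k * ((1.0 if a_won else 0.0) - expected_a)
    # Split the change equally, since we can't tell who was most responsible.
    new_a = {name: r + change_a / len(crew_a) for name, r in crew_a.items()}
    new_b = {name: r - change_a / len(crew_b) for name, r in crew_b.items()}
    return new_a, new_b


double_a = {"bow": 1520.0, "stroke": 1480.0}
double_b = {"bow": 1500.0, "stroke": 1500.0}
new_a, new_b = rate_crews(double_a, double_b, a_won=True)
print(new_a)
print(new_b)
```

Over time, as people race in different combinations, the individual ratings separate out even though any single race only tells you about the crew as a whole.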

Again, I would draw your attention to the fact that people have thought about these questions long and hard. There’s no need to start a new rating system from scratch when many of the problems have already been solved.

You talk about an expected standard – doesn’t that mean you have to time every race?

In short, no. But the maths explanation for “why not” requires a little bit of thinking (sorry!). Recall that what we’re really trying to do is estimate what the result is going to be. Most people do this intuitively when they watch a race without even realising it. What you’re doing is the same thing – estimating the time you think both crews are going to finish in, and then comparing the two. You then compare the actual result to your expectation. It doesn’t matter how long the race actually turns out to be – you’re only interested in the comparison, not what a mathematician would call the absolute value.

That being said, a side effect of running a rating system as I’ve described is that you do actually start acquiring the ability to estimate the time differences between competitors. To my mind, this is an advantage for both competitors and race organisers. As the competitor, you get a strong sense that you ‘ought’ to be a certain time off someone else, so even if you still lose to them, you can get a sense of achievement/progress from having closed the gap to them. As the race organiser, it allows you to understand better who is likely to catch whom up. If you’ve got a tricky course where overtaking is difficult, you might be able to better seed your races to avoid that, while not having to leave such big time gaps between competitors.

But people won’t be able to calculate their own points with your idea?

Indeed this is true to an extent. I would argue that most people also aren’t going to be able to accurately calculate the number of PRI points they’re going to get either. As I said in the original article, you can give people a strong idea of roughly what’s going to happen though:

  • Beat someone, you’re getting more points
  • Lose to someone, you’re getting fewer points
  • The more surprising the result given the gap between their points and yours, the more your points will move

Since the algorithms are entirely repeatable – given the same inputs, they will generate the same outputs – it would be perfectly possible to construct ‘calculators’ to let people work out how many extra points they would get for beating another competitor.

How does it work for head races?

Effectively you’d race every other crew out there. A bit like a massive matrix. A single head race gives you a massive amount of data from which to work.
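To make the ‘massive matrix’ concrete, here is a short sketch that turns one set of head race times into every pairwise result it implies. The scullers and times are made up, and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_results(times: dict) -> list:
    """Every (faster, slower) pair implied by one head race; dead heats are skipped."""
    results = []
    for a, b in combinations(times, 2):
        if times[a] < times[b]:
            results.append((a, b))
        elif times[b] < times[a]:
            results.append((b, a))
    return results


# Finishing times in seconds for four hypothetical scullers.
times = {"A": 1200.0, "B": 1215.5, "C": 1190.2, "D": 1230.0}
results = pairwise_results(times)
print(len(results))
print(results)
```

With n finishers you get n(n-1)/2 comparisons, so a head race with 400 finishers gives 79,800 of them from a single afternoon’s racing.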

As it happens, I think there’s an opportunity to improve upon a vanilla TrueSkill or similar implementation by taking into account head race times (or rather margins of victory). TrueSkill only looks at the result (win/draw/loss), but there’s no reason that it couldn’t be refined further if you believe that the times are representative. The same clearly isn’t true of a side-by-side race, since it’s perfectly reasonable for the person in the lead to not want to show their full hand for various reasons, but if that data is solid for a head race, then you ought to be able to use it.
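One possible way to fold margins in, sketched under stated assumptions: scale an Elo-style update by a multiplier that grows slowly with the winning margin. The logarithmic shape, the 10-second `scale` and both function names are hypothetical choices of mine, not something TrueSkill provides.

```python
import math


def margin_multiplier(margin_seconds: float, scale: float = 10.0) -> float:
    """1.0 for a dead-level win, growing slowly (logarithmically) with the margin."""
    return 1.0 + math.log1p(margin_seconds / scale)


def update_with_margin(rating_winner: float, rating_loser: float,
                       margin_seconds: float, k: float = 32.0):
    """Return the revised (winner, loser) ratings, weighting the change by the margin."""
    expected_win = 1.0 / (1.0 + 10 ** ((rating_loser - rating_winner) / 400.0))
    change = k * margin_multiplier(margin_seconds) * (1.0 - expected_win)
    return rating_winner + change, rating_loser - change


# The same pairing decided by 1 second and by 30 seconds.
print(update_with_margin(1500.0, 1500.0, margin_seconds=1.0))
print(update_with_margin(1500.0, 1500.0, margin_seconds=30.0))
```

The logarithm is there so that a huge margin counts for more than a narrow one, but not without limit. Whether that is the right shape is something you would want to test against real head race data.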

Surely you can’t look at a rating system in isolation? It’s part and parcel of the competition structure.

I totally agree. And that’s another reason why, although I can describe the fundamentals of a system here fairly easily, I wouldn’t suggest that it could be used without a fair chunk more work. I wouldn’t like to claim to be an expert on how the competition structure should work in general, beyond generally being of the opinion that individual races are in the best position to know their own ‘markets’, and whatever structure you put in place should support them with running in a format that suits them. Cambridge, with a river that’s so full at times that you could practically walk from one bank to the other across the amassed eights, has a completely different set of requirements to a small head race in the middle of nowhere (no offence to anyone!). Any system needs to work just as well for those who never row in the Thames Region as it does for those who only ever row there.

Mel Harbour studied mathematics and computer science at Cambridge University where he was introduced to the sport through Peterhouse Boat Club and Cambridge University Lightweight Rowing Club. Mel now works as a Software Development Manager for Redgate Software and coaches in his spare time. You can follow Mel on twitter @melharbour.

January 10th, 2018 | Categories: News