Below is the link to the most recent week’s comparison of the Rich Kern Poll, Pablo Ranking System, and AVCA Poll for D1 Women’s Volleyball:
For those unfamiliar with the Pablo Ranking System, here’s a brief description (edited for length and clarity from Rich Kern’s website):
I. HOW PABLO WORKS
1. An algorithm with a probabilistic approach
The underlying model for the Pablo ranking system is probabilistic: essentially, the larger the difference in ability between the two teams in a match, the more likely the better team is to win. For example, if we could clone a team and have it play itself, so that the two sides were perfectly evenly matched, the chance that either side wins is 50%. At the other extreme, in a total mismatch, the probability that the better team wins approaches 100%. In between are matchups whose differences translate into a 75% chance of winning, or some other value.
What do these probabilities mean? Pablo interprets them as any good probability should be interpreted: as the number of trials gets very large, the fraction of wins approaches the probability. Well, that’s all well and good, but it helps us little. We don’t have a very large number of trials; we need to assess the difference between two teams on the basis of only 1 or 2 (or sometimes 3) head-to-head matches. Hence the challenge of evaluating teams.
2. How to interpret the ratings
In the end, the ratings are set on an absolute scale. In the past, Pablo set the #1 team to 10000 points and rated everyone relative to that. In recent years Pablo switched to setting the average team to 5000. The thinking is that whereas the quality of the top might vary over the year(s), the average (median) teams are far more consistent. This is certainly true over the course of a season, and probably also from season to season. With this switch, you can more easily track the absolute progress of your team over the course of the season, and even year to year.
However, the absolute scale of the ratings does not matter for individual matches. What matters there is the difference between the two teams, which indicates the probability that a given team will win. In the Pure Points model, the relationship between difference and probability is an integrated (cumulative) normal distribution of diff/1000, and is fairly easily handled in Excel. However, because Pablo no longer uses the Pure Points model alone (see section I.4), calculating the probability is a little more complicated. There isn’t a rigorous expression we can use, but empirically the probability still appears to follow the integrated normal distribution.
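As a concrete sketch of that relationship, the win probability implied by a rating difference can be computed as the cumulative normal of diff/1000. The function name here is illustrative, not Pablo’s actual code:

```python
from math import erf, sqrt

def win_probability(rating_diff):
    """Chance that a team rated `rating_diff` points above its opponent
    wins the match: the cumulative normal distribution of diff/1000."""
    return 0.5 * (1 + erf(rating_diff / 1000 / sqrt(2)))
```

A 1350-point edge gives about a 91% chance of winning, and 2500 points about 99.4%, matching the figures quoted later in this article.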
3. How does Pablo determine the probabilities?
If team A beats team B in a single match, that indicates that team A is probably better than team B. The more lopsided the match, the larger the difference between the two teams, on average. Thus, each outcome is scored according to how lopsided it is. In the past, the Pablo developer broke the outcomes into 11 different scenarios:
Normal 3-gamer; Lopsided 3-gamer; 3-Game Blowout; Close 3-gamer; Very Close 3-gamer; Normal 4-gamer; 4-Game Blowout; Close 4-gamer; Normal 5-gamer; 5-Game Blowout; Close 5-gamer.

Pablo would allow each of these outcomes to be scored differently (although they were not required to be). More recently, however, the Pablo developer discovered through match simulation that there is a very fundamental relationship between the probability of scoring points in a match and the probability of winning the match. Amazingly, the relationship between point probability and winning percentage is exactly the same as the relationship between Pablo ratings and winning probability. This means there should be a linear relationship between point probability and Pablo ratings. Unfortunately, we don’t really know the point probability in each match, so it is estimated by the point percentage. Each match is therefore scored based on the percentage of points the team scored. It is a simple conversion, with
game score = 25650*(point% – 0.5)
Therefore, if a team scores 55% of the points in a match (e.g. wins by scores of 25-21, 25-20, 25-20), it will be rated 1350 points ahead of the losing team (teams that have a 55% probability of scoring win 91% of the time). A team that scores 52% of the points (e.g. 25-21, 20-25, 25-21, 25-21) is rated 630 points ahead of the losing team (winning about 75% of the time). The maximum score for any match is 2500, which corresponds to a point percentage of about 0.59. Thus, all matches more lopsided than e.g. 25-17, 25-17, 25-17 are scored the same. 2500 points indicates a win probability of about 99.4%; the system doesn’t recognize any probabilities higher than that. (The proportionality constant of 25650 is slightly smaller than that used in years past; when sets were played to 30, the value was 27700.)
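The conversion can be sketched directly from set scores. `game_score` is a hypothetical helper, with the 2500-point cap applied as described:

```python
def game_score(set_scores, cap=2500):
    """Match score from a list of (points_won, points_lost) per set:
    25650 * (point percentage - 0.5), capped at +/- 2500."""
    won = sum(w for w, l in set_scores)
    total = sum(w + l for w, l in set_scores)
    raw = 25650 * (won / total - 0.5)
    return max(-cap, min(cap, raw))
```

For the 25-21, 25-20, 25-20 sweep, the exact point percentage is 75/136 ≈ 55.1%, which this formula scores at about 1320; small differences from the round figures quoted above presumably come from rounding.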
An interesting conclusion is that it is more important to know how many points are scored than how many sets are won: a team that outscores its opponent but loses the match will be rated higher than the winner. However, this is the same conclusion Pablo had drawn previously, when the match weights were determined empirically, so it is not just a consequence of the modeling. In fact, the average ratings Pablo finds for the different match types in the old breakdown are essentially the same as the ratings obtained when they were determined empirically. The point % relationship discovered through simulation therefore does not just work in theory; it is confirmed by actual results.
To get the rankings, Pablo simply varies the team ratings so as to minimize the deviation between the actual match results and the calculated rating differences. There’s more to it, of course, but that’s the basis. In a perfect model, the score for each match would equal the rating difference between the two teams, with a deviation of 0. In reality, because of the natural variation of performance and the nature of probability itself, a perfect fit is impossible.
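The fitting step can be sketched as a least-squares problem: pick ratings that minimize the squared gap between each match’s score and the rating difference, then shift everything so the average is 5000. This toy gradient-descent solver, with team names and match scores invented for illustration, is not Pablo’s actual optimizer:

```python
def fit_ratings(matches, teams, lr=0.2, iters=2000, mean=5000.0):
    """matches: (team_a, team_b, game_score) with score from a's view.
    Minimizes sum((r_a - r_b - score)^2) by gradient descent, then
    shifts all ratings so they average `mean`."""
    r = {t: 0.0 for t in teams}
    for _ in range(iters):
        grad = {t: 0.0 for t in teams}
        for a, b, score in matches:
            resid = (r[a] - r[b]) - score
            grad[a] += resid
            grad[b] -= resid
        for t in teams:
            r[t] -= lr * grad[t]
    shift = mean - sum(r.values()) / len(r)
    return {t: v + shift for t, v in r.items()}
```

With three invented results, A over B by 800, B over C by 600, and A over C by 1200, no assignment of ratings satisfies all three exactly; the least-squares fit splits the disagreement, putting A about 733 points above B.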
4. Is that all there is to it?
It used to be. Past versions of Pablo rankings were based solely on the points approach. However, this year, after extensive testing, Pablo has modified things slightly. The rankings now are based on a model that combines this “Pure Points” approach with a “W/L” approach, in which only wins and losses are taken into account, and not the match scores. Although it is possible to run each model separately, in the current version they are carefully integrated together in a single package. Tests have shown that the accuracy of the combined model is similar to that of the pure points approach. The W/L model alone performs significantly worse. Performance issues are addressed in the next section.
II. PERFORMANCE QUESTIONS
1. Do you account for the home court?
Yes. Home court advantage is an adjustable parameter in the system. It varies, but is usually worth about 200 points. Some have suggested improving the home court model by accounting for the fact that, for example, not all neutral sites are really neutral, and that travel distance can affect the advantage. These observations are probably correct, but they are unlikely to affect the quality of the rankings as much as many other factors. Pablo may consider them later, after a few other issues are addressed.
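In practice this simply shifts the effective rating difference before the win probability is computed; the constant and function name here are illustrative assumptions, not Pablo’s code:

```python
HOME_ADVANTAGE = 200  # typical value; an adjustable parameter in the model

def effective_diff(home_rating, away_rating, advantage=HOME_ADVANTAGE):
    """Rating difference from the home team's perspective, with the
    home team credited as if rated `advantage` points higher."""
    return (home_rating + advantage) - away_rating
```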
2. Is it correct to include the “lopsidedness” of the match when calculating probabilities?
It is all the rage in NCAA football to disregard ranking systems that include the point spread of a game, on the grounds that “winning is all that matters.” However, the motivation behind such a movement is suspect. In fact, Jeff Sagarin has had major conflicts with the BCS (the football ranking system) over this very issue. Despite the complaints, Sagarin points out that excluding the point spread from his ranking system leads to less accurate predictions of future games. Therefore, if the goal of a ranking system is to predict which team is better, and which team will win, it is better to include the point spread in the ratings analysis. The Pablo developer believes the real reason people object to including point spread in ranking systems is that doing so gives results that differ from the opinion polls.
3. But my team wins a lot of blowouts and uses a lot of the bench, making the matches closer than they would be with just the starters. Doesn’t that bias the results?
It might, but we don’t know to what extent. If it were a consistent feature that happened every match, it could make a difference, but it’s not clear that it happens that often. Regardless, it is probably too rare an occurrence (affecting only a couple of teams a year, at most) to justify changing the model to accommodate it. Some have argued that the Ballicora model, in which a match is evaluated only on whether it goes 3, 4, or 5 sets, is better because it does not penalize a team for subbing (provided the subbing doesn’t cost the team a set). However, the Pablo developer has tried a model that considers only whether a match went 3, 4, or 5 sets, and found it to perform (much) worse overall than the full model.
4. There’s no way that (insert name here) is the XXth best team!
OK, it’s not a question, but it is a common complaint. The answer depends heavily on the specifics. However, there are a few points worth emphasizing:
a) Oftentimes, people are surprised to learn that the team in question is not as bad as they believe. This is especially the case when the team does not have a reputation for being good. The Pablo developer’s first step on hearing this complaint is to look up the record of the team in question. Often it is better than expected.
b) As with any probabilistic model, Pablo does a better job with more data. Therefore, the Pablo developer doesn’t put too much stock in rankings that come out early in the season. As the season progresses, the rankings become better.
c) One of the problems early in the season is the disparity in schedules. Because Pablo puts a limit on team differences, a lot of wins in lopsided matches are not very useful from an evaluation standpoint. Therefore, even a dozen matches into the season, the ranking can be overly dependent on one or two matches. The Pablo developer’s belief is that the system does best when teams play many matches against teams close to them in ability. The biggest errors show up for teams with lots of lopsided matches (up or down). Florida A&M is particularly notorious in this regard: a very good team that plays in a very weak conference. Hence, a very large number of their matches end up being blowouts against very weak competition, which are essentially ignored in the ranking (see section II.5). When this is the case, their ranking should be viewed with caution even at later stages of the season.
d) Teams are ranked solely on what they have done, not what they were predicted to do, or what fans or coaches expect them to do for the rest of the season. However, there is some predictive value in the rankings.
5. To what extent do you account for strength of schedule?
Pablo doesn’t have a specific “strength of schedule” parameter in its rankings, but it is included indirectly. A team that consistently plays and beats teams ranked within the top 50 will be ranked a lot higher than a team that only beats up on teams in the bottom 100. Wins against good teams are duly rewarded, whereas wins against the worst teams do not help much.
At some point there are diminishing returns: matches can be only so lopsided, while differences between teams can be very, very large. Therefore, Pablo caps the rating difference it considers, beyond which it doesn’t pay attention to the actual value. The cap is currently 2500 points; if two teams are separated by 3000 points, Pablo uses 2500 in its fitting.
6. My team beat that team. How can they be ranked higher?
The system tries to find the optimal values for ratings to give the best fit to the data, and it uses all the available data (except, to an extent, those data points for extremely lopsided matches, as noted above). While head-to-head data is useful, it is not clear that a single head-to-head is more informative than 5 matches with common opponents, or lots of matches against a common field.
7. Is it fair to compare teams by comparing results against common opponents? Or is there too much variation?
Transitivity is a tough game, and it gets even more challenging three and four opponents removed. But it’s not bad. The Pablo developer did an exercise a couple of years ago in which he looked at the first half of the Big 10 schedule and tried to predict the outcomes of head-to-head matches based solely on the differences against common opponents. He found that he could predict about 85% of the head-to-head results using the common-opponent approach, essentially the same as what he got from the Pablo rating system itself. As expected, there were upsets. (BTW, don’t try this exercise with the 2001 Big 10 football season; that was one wacky season with lots of upsets.)
8. My team’s best player missed a match for some reason. Doesn’t that affect their ranking?
Sure does, and this is one place where the Pablo rating system runs into problems. It assumes a constant level of play over the entire season. If certain players are missing, that will affect the outcomes and hence the rating. However, the Pablo developer prefers this approach, in which all matches over the entire season are integrated.
9. Do you count more recent matches more heavily?
Yes. The Pablo developer has implemented a procedure in which more recent matches are weighted more heavily. He has found that the best approach is to weight matches played 42 days ago at half the weight of those played today. While a six-week half-life sounds appealing, don’t read too much into it; the Pablo developer says he is not sure what it means physically.
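That scheme is an exponential decay with a 42-day half-life. A minimal sketch (the function name is illustrative):

```python
HALF_LIFE_DAYS = 42  # a match 42 days old counts half as much as one today

def recency_weight(days_ago):
    """Weight applied to a match played `days_ago` days in the past."""
    return 0.5 ** (days_ago / HALF_LIFE_DAYS)
```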
10. How do you determine Conference Ratings?
Over the years, the Pablo developer has struggled to figure out the best way to rate conferences. For conferences with a normal distribution of teams, a simple average works pretty well. But then again, do you use average rating or average ranking? Since ranking is not linear with rating (the difference between the #1 and #10 teams is a lot bigger than the difference between #170 and #180), average rating is probably a more realistic measure. However, this only works for a conference with a normal distribution of teams, and does not work well for conferences that are very top heavy or bottom heavy.
Instead of using the average, the Pablo developer has come up with an approach that he thinks better reflects the quality of a conference, including its distribution: the 50/50 rating, which is the rating a team would need in order to have a 0.500 record against the conference’s teams. From a Pablo perspective, the conference 50/50 rating is the rating that would be needed to have an overall 50% chance of winning matches against the conference teams on neutral courts. The beauty of the 50/50 rating is that, for a conference whose teams are normally distributed, it is just the conference average. For a conference of 8 teams rated 3000, 3500, 4000, 4500, 5000, 5500, 6000, and 6500, the 50/50 rating is simply the average, 4750. Moreover, in this distribution you can clearly see that a team rated 4750 is as likely to win as to lose (there are just as many conference teams above 4750, and by the same amounts, as below).
However, compare that conference to one where the 8 teams are rated 7200, 5400, 5200, 4700, 4400, 4000, 3600, and 3500. The average is still 4750, but a team rated 4750 will not fare the same. It will be rated higher than 5 of the 8 teams, yet the average gap between it and the teams it is favored against is smaller than the average gap between it and the teams against whom it is an underdog. The 50/50 rating balances these two effects. The 50/50 rating for this conference is 4617.3, which indicates that the average overrates the overall strength of the conference.
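The 50/50 rating can be found numerically. Assuming the win probability is the cumulative normal of diff/1000 (as in section I.2), bisect for the rating at which expected wins against the conference equal half the matches. This is a sketch under that assumption, not Pablo’s code:

```python
from math import erf, sqrt

def win_prob(diff):
    # cumulative normal of diff/1000, per the Pure Points model
    return 0.5 * (1 + erf(diff / 1000 / sqrt(2)))

def fifty_fifty(ratings, lo=0.0, hi=10000.0):
    """Rating at which expected wins against every conference team
    (one match each, on a neutral court) equal half the matches."""
    target = len(ratings) / 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if sum(win_prob(mid - r) for r in ratings) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For the balanced conference above this returns the average, 4750; for the top-heavy one it returns roughly 4617, matching the 4617.3 quoted.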
The net result is that the 50/50 rating gives us an indication of the balance of a conference. When the 50/50 rating is lower than the average, as is generally found for the WAC, it indicates that the conference is top heavy. When the 50/50 rating is higher than the average, it indicates a conference with a team out of line at the bottom.
It should be noted that there is nothing inherently special about the 50/50 rating as a measure of conference strength. For example, one could use a 75/25 rating: the rating that would be required to win 75% of matches. The reason the Pablo developer likes the 50/50 rating, however, is that it reduces to the average for a normal distribution of teams.