Quarterback evaluation is a tricky thing. It seems as though every website has its own way of quantifying how good a quarterback is. The approaches range from an attempt at a raw, numerical formula (Football Outsiders) to subjective gameplay analysis (PFF scores). Regardless, every stat seems to be conveying the same message: “Rank quarterbacks by me and you’ll see who the best one is!” But when you rely on a single statistic, very odd things can happen. There are tons of unincorporated variables that can skew any given stat, which raises several questions. Who were they throwing to? What was the pressure like? What was the down and distance? Are any of these metrics even good?
About a month ago, I posted on Reddit asking what the best QB stat was. The answers ranged from TD:INT ratio to DVOA to wins to cup size. In short, the community is insanely divided. Are metrics like passer rating, which account for completions, attempts, TDs, etc., reliable enough? Or the ratio of “big plays” or TD:INT? How do you compare stats that are made on a different scale?
Before I even go into the complex nature of evaluating QB stats against each other, I should note my opinion that most QB stats (i.e. the ones that don’t include pressure statistics, drops, and other team-oriented adjustments) are team stats. Passer rating, TAY/P, TD:INT and almost every other metric reward QBs with prolific talent around them. None of these statistics, except arguably DVOA, adjust for poor play elsewhere on the offense.
Putting all that aside, another purpleptsd blogger, Justicht, did a metric butt ton of math and we tried to figure it out. Justicht took several stats and compared them using their coefficient of determination, or “r squared” (I’m going to call it RS). I went to film school, so I’ll just copy this sentence from the Wikipedia page I just linked: “It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.” Basically, it measures how much of the variation in one set of numbers is explained by another, and therefore how reliably one can predict the other. Take a stat like DVOA, which claims to measure how much a QB contributes to his team relative to what would happen if you replaced him with an average QB. If DVOA has a good RS, that means it’s pretty good at doing that consistently. If a QB ranks highly in several of these stats, he’s almost certainly a good QB. This calculation relies heavily on the logic that multiple stats paint a good picture, because it compares different stats to each other and measures how often they agree on a quarterback’s ranking.
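For anyone who wants the formula behind that Wikipedia sentence, the standard textbook definition looks like this. The symbols are generic notation, not anything taken from Justicht’s spreadsheet: $y_i$ are the observed values, $\hat{y}_i$ are the values the model predicts, and $\bar{y}$ is the average of the observed values.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

When the “model” is just a straight line fitted through one set of numbers plotted against another, this works out to the square of the ordinary (Pearson) correlation between the two, which is why a value near 1 means the two line up closely and a value near 0 means they barely relate.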
For this exercise, we chose DVOA, TD:INT Ratio, Y/A, TAY/P, Comp %, Passer Rating (the traditional QB rating) and ESPN QBR. We didn’t include raw stats like yards or touchdowns, because it’s pretty easy to discount those as all-encompassing QB metrics. We ranked the quarterbacks from different years and teams based on the stat we’re comparing. Then we charted that against the percentage of value of the same stat on a scale from 0 to 1. Using that correlation data, we get to the RS, which is charted below. Higher is better. The more high numbers in a stat’s row, the better the stat is doing.
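If you want to try something like this yourself, here is a rough Python sketch of the process described above. It is an interpretation, not Justicht’s actual spreadsheet: it assumes “percentage of value” means min-max scaling to the 0 to 1 range, it assumes RS is the squared Pearson correlation between the rank and the scaled value, and the function names and sample numbers are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank 1 = highest value. Ties are broken by input order (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def percent_of_value(values):
    """Assumed meaning of 'percentage of value': min-max scale to 0..1.
    Fails if every value is identical (division by zero)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def rs(rank_stat, value_stat):
    """R squared between QB rank in one stat and scaled value in another."""
    r = pearson_r(ranks(rank_stat), percent_of_value(value_stat))
    return r * r


# Made-up numbers for five hypothetical QB seasons, in the same order.
yards_per_attempt = [8.1, 7.6, 7.2, 6.8, 6.1]
td_int_ratio = [2.0, 4.5, 1.8, 3.0, 1.2]

print("Y/A rank vs Y/A value:   ", round(rs(yards_per_attempt, yards_per_attempt), 3))
print("Y/A rank vs TD:INT value:", round(rs(yards_per_attempt, td_int_ratio), 3))
```

Comparing a stat against itself is what fills the diagonal of a chart like this, and comparing it against every other stat fills in the rest of the row.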
Here is that data:
It should be noted that ESPN QBR only dates back to 2006 while the rest of the data is dated back to 1998.
The purple cells are how well stats do when their ranking is compared to their own percentage of value. Looking across any given row, a string of low numbers means that it’s very hard to predict that statistic no matter which metric you use. So a stat that none of the other stats agrees with probably means it’s not very reliable. It should be noted that this isn’t a percentage of probability, just a scale from 0 to 1 of how likely a statistic is to be accurate within a standard deviation.
The first thing I notice is that TD:INT ratio is a hot mess, like that friend you didn’t want to invite to your wedding party. Not only is it the only stat that has less than 0.9 (even 0.8) RS, but none of the other stats have any good RS values with it either. Nothing outside of comparing to itself produces a result better than 0.5. What that means is that when you compare the best QBs by TD:INT to the best QBs by something like passer rating, it will be a total mess. And it makes sense: TD:INT measures entire seasons with hundreds of attempts by looking at tens of plays. In fact, the best single-season TD:INT ratio since 1998 belongs to 2013 Nick Foles. While that was a good season, I’d be hard pressed to agree that it was the best season since John Elway retired. Also in the top ten were Josh McCown, David Garrard and Damon Huard.
I also notice that completion % produces some incredibly variable results, namely when compared to TD:INT. While it seems to be relatively predictive of itself, it doesn’t seem to line up with the others. When compared to passer rating, it scores the lowest RS in the entire chart, not to mention the strange results that come from comparing completion % to TD:INT. Logically, this was also expected: completion % may be a good measure of consistency, but it doesn’t account for difficulty. Checking down is a completion, but not necessarily a good play. 2004 Brian Griese and 2012 Alex Smith score in the top ten of single-season completion %, and Kirk Cousins has the best completion % of the 2015 season. Those names aren’t associated with gunslinging. You can, however, say that completion % is good at measuring consistency. Most of the data backs that up if you separate consistency from high production.
I expected completion % and TD:INT to rank poorly because they’re fairly reductive metrics. Passer rating, however, is right down there with them, and that surprises me. I always thought that the more factors a metric includes (provided it’s weighted accurately), the more accurate it can be. Since passer rating incorporates TDs, INTs, completions and yards, you’d think it was a pretty good stat. Meanwhile, a simpler stat, Y/A, scored well. This challenges the notion I had going into this piece, but I can understand where I went wrong. Incorporating more factors gives you a better idea, but incorporating “big play” factors with huge weights on them can be dangerous. If you can agree that TD:INT is a bad metric and Y/A is a good one, passer rating marries the two while weighting TDs and INTs way higher. This is not to say that TDs and INTs aren’t important, but it does make a pretty strong argument that TDs and INTs aren’t predictive of QB performance. They aren’t even predictive of themselves.
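To see how heavily passer rating leans on those big plays, here is the standard NFL passer rating formula written out in Python. The formula itself is the published one; the function name and the sample stat lines are made up for illustration.

```python
def passer_rating(completions, attempts, yards, touchdowns, interceptions):
    """Standard NFL passer rating. Each of the four components is
    capped to the range 0 to 2.375 before they are combined."""

    def clamp(x):
        return max(0.0, min(x, 2.375))

    a = clamp((completions / attempts - 0.3) * 5)        # completion % part
    b = clamp((yards / attempts - 3) * 0.25)             # yards per attempt part
    c = clamp((touchdowns / attempts) * 20)              # touchdown % part
    d = clamp(2.375 - (interceptions / attempts) * 25)   # interception % part
    return (a + b + c + d) / 6 * 100


# Hypothetical stat lines, all on 60-of-100 passing.
base = passer_rating(60, 100, 700, 4, 2)
one_more_td = passer_rating(60, 100, 700, 5, 2)
eighty_more_yards = passer_rating(60, 100, 780, 4, 2)
one_fewer_int = passer_rating(60, 100, 700, 4, 1)
hundred_more_yards = passer_rating(60, 100, 800, 4, 2)

print(round(base, 1))
print(round(one_more_td, 1), round(eighty_more_yards, 1))
print(round(one_fewer_int, 1), round(hundred_more_yards, 1))
```

As long as none of the four components hits its cap, one extra touchdown moves the rating exactly as much as 80 extra passing yards, and one interception costs as much as 100 yards. That is the “way higher” weighting in action.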
Predictably, TAY/P and DVOA, the two most comprehensive metrics, score very well. They and Y/A are very closely related, so a good DVOA probably means a good Y/A. It would be very hard to argue against a quarterback with high marks in all three stats, and I feel no need to. Such QBs are 2004 Manning, 2007 Brady and 2011 Rodgers, three of the best QB performances in NFL history. Aside from ESPN QBR, TAY/P is the only stat that attempts to account for game situations by incorporating first downs into its formula. This seems to have a very positive impact on the stat’s accuracy, which makes sense. First downs lead to better drives, which lead to scoring and momentum, which inflate all of the other stats.
Perhaps the most fascinating result is how well ESPN QBR ranks. It’s a stat well known for being completely ridiculous. However, looking at its comparison to TAY/P and DVOA, two very well-respected metrics, it scores very well. This doesn’t mean that ESPN QBR is immune to strange outliers, but it does mean that it’s hard to criticize it while also praising TAY/P and DVOA. I did not expect to go into this article defending ESPN’s much-maligned metric, but here we are. There’s no subjectivity in numbers. This calculation says there is a high likelihood that a good ESPN QBR predicts a good TAY/P, DVOA and Y/A, and it’s almost impossible to argue with that level of consistency across metrics. So how can this be? First, we have to set aside the notion that the stat is bad because we don’t know the formula. Whatever it is, it’s working well. Here’s ESPN’s explanation of the statistic. It uses “expected points” to measure how much a QB contributes to a win and how well they perform relative to the pressure of the situation. With a 0.993 rating against itself, it does a remarkable job of that. It incorporates a lot of situational data such as down and distance, score at the time of the play, etc. This means that it has a good chance of adjusting for things like QBs racking up garbage time stats in blowouts or the value of a 3-yard pass on 3rd and 2 (something a stat like passer rating ignores). Perhaps there is more to this statistic than we thought when we all dismissed it.
So Crown A Winner Already!
I’d be remiss if I wrote a whole article about finding the best QB stat and then didn’t actually pick one. By the logic on which we’ve been leaning so far, the winner of this contest is the statistic that you can look at and most accurately say “QB1 is good here, therefore QB1 will be good everywhere else.” First, we can throw out TD:INT, Completion % and Passer Rating because their scores were so low they’d skew the data. Think of it like not making the playoffs. To figure out who wins between the rest, we averaged out the percentages of DVOA, TAY/P, ESPN QBR and Y/A, ranked QBs on that average, and took the RS of each stat compared to the average. The metric that came out on top is… TAY/P! I’m not entirely shocked that the most inclusive stat is also the most conclusive stat. It’s a very good metric on the surface and does not produce any strange results. As for the rest of the RS rankings (we ran the bottom 3 stats through this too for fun), you may want to sit down:
- TAY/P (0.948)
- DVOA (0.897)
- ESPN QBR (0.886)
- Y/A (0.872)
- Completion % (0.498)
- Passer Rating (0.472)
- TD:INT (0.389)
No, you aren’t having a stroke: ESPN QBR actually scored a close third to DVOA. Y/A is in 4th, and then there’s a huge dropoff to the bottom three stats. So basically, a good ESPN QBR is pretty likely to mean a good DVOA and a good TAY/P. And considering how closely related ESPN QBR is to DVOA and TAY/P, I find it hard to dismiss it like I did before looking into this research. Looking into actual names, all three metrics rank Carson Palmer as the best QB of 2015, along with putting Brees, Roethlisberger, Wilson and Dalton in the top ten. These metrics mostly agree on QB play in 2015, so they’re tied together. To disagree with ESPN QBR, you have to disagree with TAY/P and DVOA, which is a tough nut to crack based on nothing but a preconceived notion that ESPN QBR is bad. It’s also worth noting the huge gap between the top 4 and bottom 3 statistics. The more comprehensive stats seem to do a significantly better job at predicting QB play, which is exactly what they were intended to do.
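For the curious, the averaging step behind those rankings can be sketched roughly like this in Python. This is a guess at the mechanics rather than the actual calculation: it assumes each stat is min-max scaled to 0 to 1, that the composite is a plain average of those scaled values, and that RS is the squared correlation between a QB’s composite rank and his scaled value in each stat. The function names and the sample numbers are invented.

```python
def scale01(values):
    """Min-max scale a list to the 0..1 range (assumed 'percentage of value')."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


def score_against_composite(stats):
    """stats maps a stat name to a list of values, one per QB season,
    with every list in the same QB order. Returns an RS value per stat."""
    scaled = {name: scale01(vals) for name, vals in stats.items()}
    n = len(next(iter(scaled.values())))
    # Average the scaled stats for each QB season.
    composite = [sum(vals[i] for vals in scaled.values()) / len(scaled)
                 for i in range(n)]
    # Rank QB seasons on the composite, 1 = best.
    order = sorted(range(n), key=lambda i: composite[i], reverse=True)
    rank = [0] * n
    for position, index in enumerate(order, start=1):
        rank[index] = position
    return {name: r_squared(rank, vals) for name, vals in scaled.items()}


# Invented numbers for six hypothetical QB seasons.
sample = {
    "stat_a": [9.1, 8.4, 7.7, 7.0, 6.2, 5.5],
    "stat_b": [30.0, 22.0, 25.0, 10.0, 4.0, -8.0],
    "stat_c": [71.0, 74.0, 60.0, 55.0, 58.0, 41.0],
}
for name, value in sorted(score_against_composite(sample).items(),
                          key=lambda item: item[1], reverse=True):
    print(name, round(value, 3))
```

The stat with the highest score is the one that tracks the consensus ranking most closely, which is the sense in which TAY/P “wins” here.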
Using the Data
In terms of league-wide data, TAY/P is fairly predictable. Carson Palmer tops the league, followed by Dalton’s incredible season, Wilson, Brady and Roethlisberger. Matt Cassel was the worst in the league, below even Johnny Manziel, and Andrew Luck rated above Matt Hasselbeck. Aaron Rodgers was an enigma, ranking 23rd in TAY/P, 17th in DVOA and 10th in ESPN QBR. Perhaps surprising to some, Kirk Cousins and Tyrod Taylor ranked top ten in TAY/P, DVOA and ESPN QBR. This article has a sortable table for TAY/P if you want to delve deeper. Also here are ESPN QBR’s rankings as well as the DVOA Rankings for 2015.
Since this is a Vikings blog, I should mention a few fun tidbits my fellow Skoldiers would like. The best QB since 1998 was in 1998: Randall Cunningham, winning in TAY/P, DVOA, and Y/A. Our current guy, Teddy Bridgewater, ranks 7th and 8th in TAY/P in that span, below Cunningham, 09 Favre, 3 Culpepper seasons and Jeff George. In ESPN QBR, Bridgewater (both years) is only bested by 09 Favre since that time; however, that data only dates back to 2006. Hilariously, 2011 Ponder has a better TAY/P than 2010 Favre, which puts that season into perspective. Daunte Culpepper has at least two seasons in the top 5 of each of TAY/P, DVOA and Y/A, in case you forgot that he was good. In 2015, Teddy ranked 24th in TAY/P, just 0.02 units below Aaron Rodgers. He was 13th in ESPN QBR and 22nd in DVOA.
I think it’s important to mention that no statistical analysis is without flaw. A comprehensive picture of TAY/P, ESPN QBR and DVOA will always paint a better picture than one of those stats on its own. Take, for example, Teddy Bridgewater, who ranked below 20th in TAY/P and DVOA but 13th in ESPN QBR, pointing to mediocrity over an entire 60 minutes, but an uptick in quality in tight game situations. Another example would be Aaron Rodgers, whose rankings are all over the place. Considering QBR’s situational weighting, it points to Rodgers overcoming poor offensive performance at the most important times. Anecdotally, this is supported by several high-profile Packers games, such as the divisional playoff game and the Hail Mary at home field. It’s also important to mention that when there is a consensus among the top stats, it’s very hard to argue against it, so you can confidently bring home the sizzling hot take that Kirk Cousins and Tyrod Taylor are top 10 QBs. Any stat can be misleading, but if you’re looking for a numerical way to evaluate QB production, some combination of these top four stats will be reliable. I would not be comfortable using TD:INT, Completion % or Passer Rating in a definitive sense unless they were backed by the more successful statistics. Finally, as I’m sure many will bring up anecdotal evidence against stats they don’t like, I’d like to emphasize that a comprehensive package of 17 years of data includes those bad examples. If a stat scores well, it means those bad examples happen less often, regardless of how much attention a certain stat receives. The all-encompassing nature of this study is what makes it reliable.
Thanks for reading!