Thursday, March 28, 2013

Why the Stats Say Martin Truex Jr and Jimmie Johnson are More Similar Than You Think

By digging deeper into NASCAR's Loop Data statistics, we can now run more distinctive analyses, measuring driver performance from a smarter angle.

One way of presenting this information is by visualizing a correlation matrix.

Very quickly: the correlation between a pair of drivers measures how their performances move together. High-correlation drivers perform well at the same tracks and poorly at the same tracks; they go up and down together. Low-correlation drivers have performances that are unrelated to each other. A correlation matrix simply puts into one grid the correlations between every pair of drivers.
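For readers who like to see the mechanics, here is a minimal sketch of building a correlation matrix. It assumes plain Pearson correlation, which may or may not be the exact measure behind the charts in this post, and the driver names and numbers are made up purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(stats_by_driver):
    """Return a nested dict: matrix[a][b] is the correlation of drivers a and b."""
    drivers = sorted(stats_by_driver)
    return {
        a: {b: pearson(stats_by_driver[a], stats_by_driver[b]) for b in drivers}
        for a in drivers
    }

# Made-up data: one value per track (say, average fastest laps), same track order.
example = {
    "Driver A": [10, 2, 8, 1, 6],
    "Driver B": [9, 3, 7, 2, 5],
    "Driver C": [1, 8, 2, 9, 4],
}

matrix = correlation_matrix(example)
for a, row in matrix.items():
    print(a, {b: round(r, 2) for b, r in row.items()})
```

In this toy data, Driver A and Driver B rise and fall at the same tracks, so their cell would be shaded dark; Driver C moves the opposite way.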

I spent time analyzing driver performances throughout the entire history of NASCAR's Loop Data statistics (starting in 2005). By using correlation matrices, we can quickly see which pairs of drivers have similar profiles in their results from track to track.

The most amazing insight that comes out of this research is how well Martin Truex, Jr. compares to certain championship drivers.

In the charts below, the darker boxes mean the two drivers that intersect have a high correlation. The diagonal line of solid purple boxes just means that a driver is 100% correlated with himself.

One note: In the charts below, we include all drivers that ran a minimum of 250 races during 2005-2012.

Look below at the correlation matrix for accumulating Fastest Laps during a race:

Greg Biffle and Matt Kenseth
Focus on the squares shaded in dark-blue. These are the pairs of drivers with the highest correlations:

1) Greg Biffle and Matt Kenseth have a high correlation. The data suggests they are both fast or both slow together. This all makes sense because they both drove for Roush Racing, and they clearly must have benefited from having the same equipment, crew chief notes, and setups.

2) Similarly, another pair of high correlation drivers are teammates Jeff Gordon and Dale Earnhardt, Jr (notice their dark intersecting square). They both drive for Hendrick, and so it makes sense they would have a high correlation in fastest laps. Also from Hendrick, notice the Jeff Gordon / Jimmie Johnson pairing also shows a high correlation. Both of these Hendrick pairs suggest that their equipment and teamwork are causing them to rise and fall together. When one driver has a lot of speed, the others will too.

3) We also see a high correlation between Ryan Newman and Martin Truex, Jr. They tend to accumulate fastest laps at the same tracks, which is interesting because they drive for different teams. Is there something about their driving style that explains why they perform similarly?

The next chart focuses on Laps Led per race:
We see one pair of dark squares that sticks out: a high correlation between Ryan Newman and Martin Truex, Jr. Let's think about some reasons why this could be true:

  • They have a similar driving style.
  • They prefer the same types of tracks.
  • Their crew chiefs have a similar style of setup.

Finally, our last chart is Pass Differential per race:
For those of you who don't know, the pass differential stat counts passes for position during green flag runs: it adds how many times a driver passed others, and subtracts how many times other drivers passed him. The pass differential number can end up being negative or positive by the end of a race.
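As a quick illustration of that definition, here is the arithmetic in code. The function name and the sample counts are hypothetical, not taken from NASCAR's Loop Data.

```python
def pass_differential(passes_made, times_passed):
    """Green-flag passes a driver made minus the times he was passed."""
    return passes_made - times_passed

# Made-up race: the driver passed 45 cars under green and was passed 52 times.
race_total = pass_differential(45, 52)
print(race_total)  # negative: he lost more positions than he gained
```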

Jimmie Johnson and Martin Truex, Jr.
Martin Truex Jr. has a high correlation with both Jimmie Johnson and with Tony Stewart. 

We know Stewart and Johnson are both champions, but what is it about Truex and his driving skill that gives him a similar relationship to these other two? Truex does not have the same team or equipment as Johnson or Stewart, so that can't be the reason. We know Truex doesn't win as often as Stewart and Johnson, but now we know from the data that Truex's profile of passing cars is very similar to the profiles of Stewart and Johnson.

The data suggests three theories:
  1. Truex is a very similar driver to these champions, more so than we realize.
  2. If Truex were in the same equipment as Johnson and Stewart, he could match their results.
  3. Perhaps Truex is a championship-caliber driver like Stewart and Johnson, and we will see him get that result in time, if he can benefit from good luck, fast equipment, and the right circumstances.
  4. Or email me your theories at and I can throw some math at it for next time.

If you want to get crazy with correlation matrices, you could use them in multiple ways:

Is it the Car or the Driver? We can better answer this question by looking at how drivers within the same team are correlated with each other, and how these correlations shift when drivers change teams.

Hiring Drivers for Your Fantasy Team: Each week, do you want to load up on drivers that will perform similarly together (taking a risk they will all do badly), or find drivers that will hedge each other out (one driver's success can offset the failures of others)? Correlation matrices give you a way to attack that problem and customize your team.

Hiring Drivers for Your REAL Team: If you are a team owner, you can use driver correlations to see how drivers perform, better analyzing who has breakout potential. This works both for Cup free agents and for minor-leaguers moving up the ranks. You can look beyond just their finishing position, and match up their performance with "benchmark" drivers who you would like them to emulate. You can also figure out which drivers might have a better fit or driving style that works with your equipment and setups.
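For the fantasy-team idea above, one rough way to put a correlation matrix to work is to rank every pair of drivers from least to most correlated. This is only a sketch: the matrix values below are invented, and a real lineup decision would also weigh how good each driver is, not just how they move together.

```python
from itertools import combinations

def rank_pairs(matrix):
    """Sort all driver pairs from lowest to highest correlation."""
    drivers = sorted(matrix)
    pairs = [((a, b), matrix[a][b]) for a, b in combinations(drivers, 2)]
    return sorted(pairs, key=lambda item: item[1])

# Invented correlations for three hypothetical drivers.
made_up = {
    "Driver A": {"Driver A": 1.0, "Driver B": 0.8, "Driver C": -0.2},
    "Driver B": {"Driver A": 0.8, "Driver B": 1.0, "Driver C": 0.1},
    "Driver C": {"Driver A": -0.2, "Driver B": 0.1, "Driver C": 1.0},
}

ranked = rank_pairs(made_up)
best_hedge = ranked[0][0]    # pair that moves together the least
most_alike = ranked[-1][0]   # pair that rises and falls together
print("Hedge pair:", best_hedge)
print("All-in pair:", most_alike)
```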

Tuesday, March 26, 2013

Latest BSports video discussing the Predictability of Statistics

Tuesday, March 19, 2013

The Stability of NASCAR Stats: Which numbers are luck and which are real?

If you are trying to predict how a driver will do over time, and you only have past data to work with, how do you know which stats are going to be the most accurate predictors of future performance? What are the most stable factors year-in and year-out?

And conversely, how do you know which performance measures are the least helpful predictors? Which ones are most susceptible to noise, randomness, and luck?

By running a linear regression of all the major performance stats, focusing on NASCAR's "modern era" (since 1972), and only including drivers who started at least 25 races in consecutive years, we can calculate the slope parameter for each statistic, and can order them by their predictive power.
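To make that concrete, here is a minimal sketch of one way to set up such a regression. It assumes each stat is regressed on its own prior-season value (this year's number as x, next year's as y), which is my reading of the approach rather than a full specification. The 25-start filter is left out for brevity, and all of the numbers are made up.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def consecutive_pairs(seasons):
    """seasons maps year -> stat value; return (this year, next year) pairs."""
    return [(seasons[y], seasons[y + 1]) for y in sorted(seasons) if y + 1 in seasons]

def stability(stat_by_driver):
    """Pool every driver's consecutive-season pairs and fit a single slope."""
    pairs = [p for seasons in stat_by_driver.values() for p in consecutive_pairs(seasons)]
    xs = [this_year for this_year, _ in pairs]
    ys = [next_year for _, next_year in pairs]
    return ols_slope(xs, ys)

# Made-up seasons for two hypothetical drivers.
avg_start = {
    "Driver A": {2010: 8.0, 2011: 9.0, 2012: 8.5},
    "Driver B": {2010: 22.0, 2011: 21.0, 2012: 23.0},
}
wins = {
    "Driver A": {2010: 5, 2011: 0, 2012: 3},
    "Driver B": {2010: 1, 2011: 4, 2012: 0},
}

slopes = {"Average Start": stability(avg_start), "Wins": stability(wins)}
for name in sorted(slopes, key=slopes.get, reverse=True):
    print(name, round(slopes[name], 2))
```

A slope near 1 means last year's number carries over almost fully into this year; a slope near 0 (or below) means last year tells you very little.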

In this chart, we see all the major performance stats, ranked by their stability across consecutive seasons. Most stable at the top, least stable at the bottom. The most consistent stats from year to year are Average Start and Lead Lap Finishes, while the least consistent stats are Wins and Poles.

Here are some interesting conclusions we can infer from the chart:

1) The two measures of starting position are at the two extremes of the chart: Average starting position is the most stable measurement over time, but poles per year is the least stable stat. A driver's average starting position this year will be very close to their average starting position last year. But poles vary widely from year to year and are much tougher to predict. What does that mean in racing terms? A driver's starting ability, averaged over the course of a season and a career, is well-defined: a property of who that driver is and what their driving style is. They are who they are, and you see that year after year. Poles, however, are more about randomness: The margins are so close at the top of the qualifying leaderboard that luck plays a large part in who gets the pole. As we have seen before, pole winners most often do not win races anyway, because the factors that go into winning a pole are generally unrelated to those involved in winning the race.

2) Winning races is the second hardest-to-predict measure. Any fan will know this is true, as win numbers can change drastically from year to year (Remember when Carl Edwards had 9 wins in 2008 and then 0 in 2009? Or when Mark Martin had 5 wins in 2009 after several years of 0?) There are many lucky winners (think about lucky fuel mileage gambles), and of course a plethora of drivers who unluckily lost races they "coulda, shoulda, woulda" won.

3) Crashes, failures and bad luck are a major source of randomness. Notice that Laps Completed and Races Running at the Finish are near the bottom of the list. Both of these are related to the concept of keeping your car clean and getting it to the finish line in one piece. Think about crashes, engine failures, flat tires, and getting pulled into accidents caused by others. Drivers can have a good year where everything goes their way, and a bad year where they seem to hit everything around them. Of course these stats are going to be hard to forecast, because effectively you are trying to predict how many accidents a driver will have, and this is very hard to do when most of these events are out of their control.

Alright, so how can I use this table?
  • If you are in a fantasy league, think about which past stats are actually going to be the most useful for you to forecast future performance. Wins and poles don't really help you that much.
  • If you are in the media and discussing driver performance, perhaps Lead Lap Finishes is a stat to consider as something that can be repeated over time.
  • If you are an owner or sponsor looking to hire a new driver, remember to be careful when considering that driver's past performance statistics. Think about those stats where the driver is doing well: Are they repeatable over time (higher on the table), or perhaps just the result of some good luck (lower on the table)?
Readers, what else do you see in here?

Sunday, March 17, 2013

Kasey Kahne Won At Bristol! How Have My Other Predictions Fared?

Today, Kasey Kahne won his first race at Bristol. In my Bristol preview video two days ago for BSports, I said he would be a strong candidate to get his first win.

That being said, let's go back and see how all my previous picks have done in the 3 BSports videos.

Remember, these are the goals:

  • Highlight drivers with strengths and weaknesses at the specific track that weekend.
  • Identify drivers who, based on my statistical models, have a greater-than-expected chance to shine each week.
  • Predict an overall winner.

For Bristol:
Obvious Favorites to Win
Kyle Busch (Led 56 laps and finished second)
Brad Keselowski (Led 62 laps and finished third)
Jeff Gordon (Led 66 laps late before having a flat tire and crash)

Drivers Who Could Get Their First Win
Kasey Kahne (WON the race today and led 109 laps)
Greg Biffle (Finished 11th)

For Phoenix:
Drivers with a Strong Chance of Winning
Jimmie Johnson (Finished 2nd and led a lap)
Denny Hamlin (Finished 3rd)
Kyle Busch (Finished 23rd)

Non-Obvious Winning Picks
Mark Martin (Won the Pole and led 75 laps early before having problems)
Greg Biffle (Led 39 Laps early before fading to 17th)
Kurt Busch (Finished 27th)

Drivers Who Would Not Win
Brad Keselowski (Finished 4th)
Clint Bowyer (Finished 6th)
Kasey Kahne (Finished 19th)

And for Daytona:
Favorites to Win
Jimmie Johnson (WON the race and led 17 laps)
Kevin Harvick (Crashed out very early and finished 42nd)

Friday, March 15, 2013

Video of Aging Curves Plus a Bristol Race Preview

This 5-minute video previews the race this Sunday at Bristol, along with a discussion of the driver performance aging curves I wrote about earlier this week.

Wednesday, March 13, 2013

What's Age Got to Do With It? Can 37-Year-Old Jimmie Johnson Beat the Odds?

Let's take a look at how driver performance changes throughout the course of their careers. Is there an age at which drivers are at their highest potential? Do they peak?

We will examine an aging curve to try to answer this question.

We've already seen a lot of work with aging curves in baseball, specifically by Bill James and others who followed him. We can apply this same concept in NASCAR.
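For anyone who wants to try this at home, here is a bare-bones sketch of an aging curve. It assumes the simplest possible construction (average a stat across all driver-seasons at each age), which is not necessarily how the figures in this post were built. The sample seasons are made up.

```python
from collections import defaultdict

def aging_curve(driver_seasons):
    """driver_seasons: iterable of (age, value) pairs, one per driver-season.
    Returns a dict mapping each age to the mean value at that age."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for age, value in driver_seasons:
        totals[age] += value
        counts[age] += 1
    return {age: totals[age] / counts[age] for age in sorted(totals)}

def peak_age(curve):
    """Age with the highest average value."""
    return max(curve, key=curve.get)

# Made-up (age, wins) records for a handful of hypothetical driver-seasons.
sample = [(25, 1), (25, 3), (30, 5), (30, 3), (40, 2), (40, 0)]

curve = aging_curve(sample)
print(curve)
print("Peak age:", peak_age(curve))
```

One caveat with straight averaging: the drivers still racing at 45 are not the same group as the drivers racing at 25, so the older end of the curve can be flattered by the survivors.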

Top Speed
First, let's focus on the statistics that represent the most extreme end of high performance: wins, poles, and laps led.

We see in Figure 1 that driver wins peak at age 30. Based on this plot, a typical driver shows consistent improvement through their 20s, peaks at 30, remains steady in their 30s, and then drifts slowly downward after 40.

The amazing thing about Figure 1 is how poles, wins, and laps led are so closely interrelated: They appear and disappear exactly at the same ages.

Also relevant is the slope from 20 to 30, relative to the slope from 30 to 50. Drivers show rapid improvement in their 20s, and a much more gradual decline as they grow older.

Just to be clear with more supporting evidence, you see the same age characteristic for Top 5s and Top 10s, here in Figure 2, again both peaking around age 30.

Driver Examples
Jimmie Johnson demonstrates performance consistent with our aging curve. He won his first title in his age 30 season, his highest win total came at age 31, and his most poles came at age 32. In the past two seasons (ages 35+36), he did not win titles, and averaged 3.5 wins and 2.0 poles. That's a decrease from his average of 7.0 wins and 3.4 poles per year in his five title seasons aged 30-34. Has Johnson peaked? It's hard to imagine how that could actually be possible given how much of a threat he still is every week, but the past two seasons statistically may foreshadow a downward trajectory in career performance as Johnson approaches his 40s.

By contrast, Mark Martin, who defies all age-related statistics, peaked in wins (7) when he was 39, with his next two best seasons (5 wins each) coming at ages 34 and 50.

Matt Kenseth, who won this past weekend on his 41st birthday, peaked in wins (5) when he was 30, exactly as the aging curve suggests. Given that, is it too much to expect a career resurgence from him?

Looking to the Future
Look to the young guns who, still in their 20s, should keep getting better: Ricky Stenhouse, Aric Almirola, Joey Logano, Brian Vickers. We haven't seen a lot from them in the past, but there is so much upside to their careers still. Even true stars like Kyle Busch and Brad Keselowski, as excellent as they have been already, still have near-term upside in the next couple of years.

Hope is Not Lost for the Veterans
The conventional wisdom about veteran drivers is they know how to get things done, save their equipment, and get to the end of the race. They are smarter about how to handle situations because they can use their mind and experience, not just pure speed. Well, the good thing about all that wisdom is seeing the numbers match up:

Figure 3 shows that a driver's ability to be running at the finish increases all the way until about the age of 40, and only barely drops off until the age of 45. This is a very different profile than the winning curve. Older drivers are actually better about getting their equipment to the end of the race.

Figure 4 says the same thing, that drivers in their 40s are doing a good job of getting Lead Lap Finishes. They know how to stay in the race until the end, and be around for a decent finish. Again, the peak for Lead Lap Finishes is near age 40, with decent performance into the mid 40s. These older drivers may not be able to win as much, but they still find ways to drive smartly, avoid crashes, and make it to the end of the race.

The Summary
If you are looking for extreme high-performance, look for drivers aged around 30. The potential upside comes from the young guns in their 20s. Drivers who have hit 40 should not be expected to do better than their past glory days. But they can still drive well enough to get their car to the finish line.