Assessing Team Popularity Using Google Trends Data

I recently completed a small project in which I analyzed the popularity of teams that compete in the four major sports. In the analysis, I looked at metropolitan areas that have at least one team in each of the four major sports. Using Google Trends, I gathered weekly data from 01/02/05 – 11/01/14 to assess the relative popularity of teams in cities that met the inclusion criteria. Each data point represents how frequently a team was queried on Google relative to the Miami Heat in the week of 06/16/13 – 06/22/13. This data point was set to 100. More importantly, higher numbers indicate a higher search frequency and, for all intents and purposes, greater popularity. Below, I’ve included a heat map that illustrates the contrasts in average popularity among city-sport combinations over the time period. Note that for cities home to more than one team in a given sport, I took the maximum team score each week. For instance, if the Yankees had a popularity score of 10 in one week and the Mets had a score of 5 in the same week, I would set the Yankees’ score to be the New York baseball score for that week.
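
That max-per-week rule is trivial, but for concreteness, here is a sketch of it in Python (the team names and numbers are the hypothetical Yankees/Mets example above, not real Trends values):

```python
# Hypothetical weekly popularity scores for two New York baseball teams
weekly_scores = {
    "Yankees": [10, 8, 12],
    "Mets":    [5, 9, 7],
}

# The city's weekly baseball score is the maximum across its teams
ny_baseball = [max(week) for week in zip(*weekly_scores.values())]
# ny_baseball is [10, 9, 12]
```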

[Heatmap: average Google Trends popularity by city and sport]

R code I used to produce the heatmap (requires the gplots library):

library(gplots)  #provides heatmap.2()

row.names(all.data) <- all.data$City
all.data <- all.data[,-1]
scaleyellowred <- colorRampPalette(c("lightyellow", "red"), space = "rgb")(100)
heatmap.2(as.matrix(all.data), Rowv = NA, Colv = NA, col = scaleyellowred, density.info="none",
          trace="none", cexRow=0.8, cexCol=1, margins=c(6,6))

In my statistical analysis, I used three-way between-subjects ANOVA (with a season variable as a block) to assess whether certain cities and sports attract significantly greater levels of interest. I performed this analysis using data only from the year 2013. Furthermore, to account for violations of the normality assumption, I performed a log transformation on the data. This did not affect the interpretability of results because, as alluded to above, the numeric quantities themselves added little value to the analysis. Additionally, the ordinal relationships remained intact after the log transformation. Here are slides with results from the project and the raw data from Google Trends.
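
As a side note, the claim that the log transformation leaves ordinal relationships intact is easy to check: log is strictly increasing on positive numbers, so any set of positive popularity scores keeps its ranking. A quick illustration with made-up scores:

```python
from math import log

scores = [3.0, 100.0, 12.5, 47.0]   # hypothetical popularity scores
logged = [log(s) for s in scores]

# Ranks (indices sorted by value) are identical before and after the transform
rank = lambda xs: sorted(range(len(xs)), key=lambda i: xs[i])
same_order = rank(scores) == rank(logged)
```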

Advice from Peyton Manning

In regard to constructing elegant proofs, one of my mathematics professors says, “It is important to know what to skip.” I write this because it is not necessary for me to statistically analyze Peyton Manning’s career in order to prove his greatness. This has been done time and time again, and it is virtually fact that he is one of the greatest quarterbacks in NFL history. However, I did want to offer a quick thought on a recent, atypical Peyton Manning interview, in which The MMQB’s Peter King tried to ask the oft-interviewed Manning ten questions he had never answered before.

In particular, I thought Manning’s response to King’s question regarding what advice he would give to young NFL quarterbacks was terrific. Here is part of it:

‘Don’t ever go to a meeting to watch a practice or a game without having already watched it by yourself.’ That’s one thing that I have always done. When the coach is controlling the remote control, he’s gonna rewind when he wants to rewind. He’s gonna skip certain plays. He’s not watching every single detail. When you can control the rewind button, you can go in there and you watch—first, you better watch your mechanics. Watch what you’re doing. Is your drop good? How’s your throw? OK, now rewind it again. Now you better watch your receivers. OK, looks like Demaryius Thomas ran a good route here. Not sure what Julius Thomas was doing here. Then you better rewind it again and watch what the defense is doing. So, there’s time in that deal. You have to know what they were doing so you can help them. So that has helped me. When I go in and watch it with the coach, I’m watching it for the third, fourth, fifth time. That’s when you start learning.

The other thing I would tell them: ‘To ever watch film without a pen and paper in your hand is a complete waste of time.’ You do it that way, you’re only watching it, as I call it, to please the coach. If you’re in the QB room and you leave the door open so they can see you in there, don’t. Shut the door. You ought to have the door shut. Whether they know you’re in there or not, they’re gonna know by the way you play out on the field. Don’t go showing off.

I think Manning’s advice can be summed up in a way that makes it applicable to many people, not just NFL quarterbacks. Namely, Manning seems to be getting at the following: ‘Don’t do something–whether it is schoolwork, a research project, or an assignment at a job–just because you are obligated to and your boss, instructor, or coach will be watching; instead, take complete ownership of it and do it to better yourself’. I think that too many times, people (myself certainly included) do things just because they look good (like Manning says, leaving the door open so the coaches can see you studying or, perhaps, making sure that your boss sees that you came in early). In reality, we (to a certain extent) should not care about the external impression we make when completing a task. I think the best personal growth and outcomes come when we don’t do something for the sake of doing it, but do something with the Manningesque mindset of conquering it, regardless of whether people see us during the process. As Manning says, “Whether they know you’re in there or not, they’re gonna know by the way you play out on the field.”

Afternote: This is Manning when he is not mastering his craft.

Monte Carlo Estimation of Pi

Yesterday, I came across a neat way to approximate π using Monte Carlo simulation. I hadn’t seen this exercise before, but I think it is understandable and illustrative, so I decided to give it a try using both Excel and Python.

How it Works

Imagine we have a unit circle inscribed within a 2×2 square.

[Plot: unit circle inscribed within a 2×2 square]

Now, let’s zoom into the top-right quadrant of this plot.

[Plot: top-right quadrant of the inscribed circle]

The area of this entire quadrant is equal to r² (since the base of the square equals the circle’s radius). The area in blue is a quarter-circle and is equal to (1/4)πr². Thus, if one were to plot points within this quadrant at random, we would expect that π/4 (approximately 78.54%) of them would lie within the blue region. To estimate π, then, one can generate a bunch of coordinate pairs with x and y values between 0 and 1. If the distance of a particular point from the origin is greater than 1 (greater than the radius of the circle), it is classified as outside the circle. Otherwise, it is contained within the circle. With a large enough sample, the proportion of points within the circle should be close to π/4. We can then multiply this proportion by 4 to estimate π.

To get the distance of a point from the origin, we use the Pythagorean theorem:

Distance = sqrt(x² + y²).
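
In code, the classification step amounts to a one-liner (`math.hypot` computes the same Pythagorean distance):

```python
import math

def inside_circle(x, y):
    """True if (x, y) lies within the unit quarter-circle (distance from origin <= 1)."""
    return math.hypot(x, y) <= 1  # equivalent to sqrt(x**2 + y**2) <= 1

# (0.5, 0.5) is inside the circle; (0.9, 0.9) is outside
```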

Executing this using Excel

To carry out this process in Excel, I created two separate sets of 11,175 random numbers between 0 and 1 using the =RAND() function. I then calculated the distance for each pair. Next, I used an =IF() function to generate indicator variables corresponding to whether or not the calculated distance was greater than 1. Finally, I used =COUNTIF() to count those distances less than or equal to 1, divided this count by the total number of distances calculated (11,175), and multiplied this fraction by 4 to get an estimate of π.

Once all of this is in place, it is quick and easy to get estimates by pressing F9 to refresh the random numbers.

Here are 10 of the estimates I generated (accurate to 5 digits):

3.14917, 3.12447, 3.14309, 3.13879, 3.14452, 3.14130, 3.12197, 3.12841, 3.13128, 3.15347

For reference, the true value of π to 5 digits is 3.14159.

Here is a spreadsheet setup to carry out this simulation exercise.

Executing this using Python

I also replicated this process using Python. Doing it programmatically makes it much easier to increase the scale (i.e., the number of distance calculations).

Here is my Python code:

#single point estimate
import random
from math import sqrt

n = 100000  #number of random number pairs

my_randoms = []  #create empty list to store distance calculations

for _ in range(n):
    my_randoms.append(sqrt(random.random()**2 + random.random()**2))  #distance calculation

pi_estimate = (sum(x <= 1 for x in my_randoms) / n) * 4

print(pi_estimate)

Using 100,000 random number pairs, here are 10 estimates that I got:
3.12924, 3.16036, 3.14132, 3.14204, 3.13932, 3.13904, 3.1464, 3.14028, 3.14208, 3.13856

Simulating this process programmatically also allowed me to go one step further. Again, I used random number pairs to generate an estimate of π (this time “only” 10,000 pairs, to spare my computer the extra calculations). Then, using a nested for-loop, I repeated this process 10,000 times! One can take the average of the 10,000 π estimates to get an even better approximation of π. Additionally, one can check whether, as the Central Limit Theorem suggests, the 10,000 estimates are approximately normally distributed.

Here’s my Python code to do just this:

#average of 10,000 estimates; takes some time to run
import random
from math import sqrt
import matplotlib.pyplot as plt

n1 = 10000  #number of pi estimates
n2 = 10000  #number of random number pairs per pi estimate

pi_list = []  #create empty list to store pi estimates

for _ in range(n1):  #n1 iterations
    my_randoms = []  #create empty list to store distance calculations
    for _ in range(n2):
        my_randoms.append(sqrt(random.random()**2 + random.random()**2))  #distance calculation
    pi_list.append((sum(x <= 1 for x in my_randoms) / n2) * 4)

print(sum(pi_list) / len(pi_list))  #average pi estimate

plt.hist(pi_list)  #histogram of pi estimates
plt.show()

Using this program, my distribution of π estimates looked like this:

[Histogram of the 10,000 π estimates]

The estimates certainly appear to follow a normal distribution.

After only one execution of this program, I received a π approximation (based on the average of 10,000 estimates) of 3.14168, which is just .00009 off from the true value of π.

Super(hero) Statistics, Pt. II

Building off my previous post, in which I introduced a neat superhero dataset and presented graphical descriptions of the data, in this post I would like to address differences in traits between superheroes and supervillains.

1. Descriptive Statistics

Below, I’ve included mean ratings for both superheroes and supervillains.

Variable       Mean for Supervillains (n=123)   Mean for Superheroes (n=304)
Intelligence   67.46                            60.22
Strength       46.11                            38.17
Speed          37.03                            38.26
Durability     62.00                            56.28
Power          59.67                            55.48
Combat         59.87                            60.50
Total          332.15                           308.91

Source: http://www.superherodb.com

Surprisingly, we can see that, on average, the villains possess higher total ratings (the sum of the six ratings) than do the heroes, even though we (or at least I) typically think of superheroes as being superior to their rivals. On the other hand, creating villains to be more extraordinary creatures than their counterparts makes the heroes’ ultimate triumphs even more impressive. This makes for good underdog stories.

Now, let’s add variance into the equation and look at boxplots of the data.

[Boxplots of the six ratings, split by alignment]

Note: “bad” refers to villains and “good” refers to heroes

Without any formal discussion, I’ll make a quick remark about the spread of the ratings. There is more spread (as measured by standard deviation) in ratings for villains in 5 of the 6 categories (all but combat). Practically, this leads me to believe that villains are more diverse in makeup than heroes. For an anecdotal example, Joker is a very different villain from Blob.

Name         Intelligence   Strength   Speed   Durability   Power   Combat
Joker        100            10         12      56           22      90
Blob         10             83         23      95           26      72
Difference   90             -73        -11     -39          -4      18

2. Incorporating Uncertainty

From the analysis so far, one can easily see that there are differences in the average scores between villains and heroes. Now, I was tempted to take these differences as “real” but, upon further consideration, second-guessed myself.

I faced a statistical conundrum that statisticians working with real-world data rarely, if ever, face. The reason I was tempted to say, for instance, that villains are about 4 points more powerful than heroes is that I thought my sample was essentially the population. Upon further examination, however, I recalled that I had cut 171 subjects from my sample because of missing values and that there are probably superheroes and supervillains out there not included in the superhero database. So, since my sample is just that, a sample, I decided to incorporate some uncertainty into my estimates. (Now, if this were a real-world problem, I would need to be sure that my missing data were missing at random to ensure that I was working with a random sample. In this case, I will make that assumption and proceed, though it is probably not valid. I think that the results will still be interesting and worthy of discussion.)
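
The intervals below come from two-sample t-tests in SAS (code at the end of the post). As a rough sketch of the underlying computation, here is the same idea in Python using a normal approximation in place of the t distribution; the standard deviations are hypothetical stand-ins, not the actual sample values:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Approximate CI for mean1 - mean2 (normal approximation, unequal variances)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = mean1 - mean2
    return (diff - z * se, diff + z * se)

# Hypothetical standard deviations paired with the intelligence means from above
lo, hi = diff_ci(67.46, 22.0, 123, 60.22, 21.0, 304)
```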

Below are estimated ranges for the true differences between supervillains and superheroes for each of the six ratings categories.

Variable 95% CI of Mean (Villain-Hero)
Intelligence (2.5336, 11.9591)
Strength (1.2240, 14.6616)
Speed (-6.0548, 3.5935)
Durability (-0.5996, 12.0338)
Power (-1.5122, 9.8915)
Combat (-5.4797, 4.2130)

After incorporating uncertainty, we only detect significant disparities in two (intelligence and strength) of the six categories. If I were to account for multiple comparisons, then only intelligence would differ significantly between the two groups.
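
The “two of six” claim can be read directly off the interval table: a 95% interval that excludes zero signals a detectable difference. In code:

```python
# 95% confidence intervals for the villain-minus-hero mean differences (from the table above)
cis = {
    "Intelligence": (2.5336, 11.9591),
    "Strength":     (1.2240, 14.6616),
    "Speed":        (-6.0548, 3.5935),
    "Durability":   (-0.5996, 12.0338),
    "Power":        (-1.5122, 9.8915),
    "Combat":       (-5.4797, 4.2130),
}

# A difference is detectable when the interval excludes zero
significant = [trait for trait, (lo, hi) in cis.items() if lo > 0 or hi < 0]
```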

3. P (Hero | Characteristics)

Now that we have been exposed to at least some evidence to suggest that there are detectable differences in traits between superheroes and supervillains, I am interested in answering the following question:

Given that a character possesses a particular set of ratings, can we predict if that character will be a hero or a villain? 

“Practically”, this would allow one to predict the alignment of a new member to the world of superheroes and supervillains.

To address this question, I created a logistic regression model. Overall, I would say that the model does not do a great job of predicting alignment (perhaps due to the considerable amount of variability among both heroes and villains), but there is still some useful information which I will share.  I’ll present these results in terms of odds ratios.

Effect         Point Estimate   95% Wald Confidence Limits
Intelligence   0.982            (0.971, 0.993)
Strength       0.992            (0.982, 1.001)
Speed          1.012            (1.000, 1.023)
Durability     0.998            (0.987, 1.008)
Power          0.998            (0.989, 1.008)
Combat         1.008            (0.998, 1.019)

When controlling for other ratings, intelligence and speed play a significant role in predicting alignment. Here are their interpretations:

  • Holding other factors fixed, a one unit increase in intelligence rating decreases the odds that a character is a superhero by 1.8%.
  • Holding other factors fixed, a one unit increase in speed rating increases the odds that a character is a superhero by 1.2%.
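
Those percentages follow directly from the odds ratios: an odds ratio of p corresponds to a (p − 1) × 100% change in the odds per one-unit increase in the predictor. As a sanity check:

```python
def pct_change_in_odds(odds_ratio):
    """Percent change in the odds per one-unit increase in the predictor."""
    return (odds_ratio - 1) * 100

intelligence = pct_change_in_odds(0.982)  # roughly -1.8%
speed = pct_change_in_odds(1.012)         # roughly +1.2%
```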

4. Conclusions

In my mind, the biggest takeaway from this analysis is the role that intelligence plays in distinguishing superheroes from supervillains. The villains possess significantly higher levels of intelligence than their enemies, the heroes. I guess it makes for a good story when the good guy outwits the evil mastermind.

******************************************************************************

As in Pt. I, I was working with this dataset.

Below is the code I used, which, for this particular post, I feel is rather straightforward.

R code for the boxplots:

#create a panel of boxplots; separate by alignment
hero_vill <- read.csv("~/Blog/hero_vill.csv", stringsAsFactors=TRUE)
par(mfrow=c(3,2))
boxplot(hero_vill$Intelligence~hero_vill$Alignment, col=rainbow(2), main="Intelligence Ratings")
boxplot(hero_vill$Strength~hero_vill$Alignment, col=rainbow(2), main="Strength Ratings")
boxplot(hero_vill$Speed~hero_vill$Alignment, col=rainbow(2), main="Speed Ratings")
boxplot(hero_vill$Durability~hero_vill$Alignment, col=rainbow(2), main="Durability Ratings")
boxplot(hero_vill$Power~hero_vill$Alignment, col=rainbow(2), main="Power Ratings")
boxplot(hero_vill$Combat~hero_vill$Alignment, col=rainbow(2), main="Combat Ratings")

SAS code for t-tests & logistic regression:


*data processing;
data hero_vill;
infile "~/Blog/hero_vill_1.csv" dlm=',' dsd missover;
input Name :$32. Alignment $ Intelligence Strength Speed Durability Power Combat Total;
if Alignment='good' then Hero=1;
else if Alignment='bad' then Hero=0;
else delete;
run;
*t-tests;
proc ttest data=hero_vill plots=none;
class alignment;
var intelligence;
run;
proc ttest data=hero_vill plots=none;
class alignment;
var strength;
run;
proc ttest data=hero_vill plots=none;
class alignment;
var speed;
run;
proc ttest data=hero_vill plots=none;
class alignment;
var durability;
run;
proc ttest data=hero_vill plots=none;
class alignment;
var power;
run;
proc ttest data=hero_vill plots=none;
class alignment;
var combat;
run;
*logistic regression;
proc logistic data=hero_vill;
model Hero(event='1')=Intelligence Strength Speed Durability Power Combat;
run;

Super(hero) Statistics, Pt. I

[Image: superheroes]

Today, I will post the first of several entries relating to a unique dataset that I’ve put together–a dataset that contains over 600 rows of data on a topic of great importance… superheroes and supervillains! Okay, analyzing superhero data may not be the most urgent matter that we as a society face, but:

1) I think that it is often useful to include entertaining applications when demonstrating what may otherwise be dry statistical methodology (though, admittedly, I sometimes find the “dry” statistics entertaining as well).

2) I find it fun to casually talk to friends about superheroes (especially those in recent movies) simply because, well, it’s just that: fun! And much like sabermetricians use statistics to challenge what is thought to be conventional baseball wisdom, I hope to use statistics to address pertinent topics relating to superheroes and supervillains (let’s call this discipline supermetrics?).

Setting the Stage

Before I get into the statistical analysis of the data, I must put a few things on the table.

1) Disclaimer: I do not claim to be a superhero enthusiast, expert, or anything of the kind. I don’t think I have ever read a comic book in my life. I am, however, a fan of superhero movies (e.g., Man of Steel, The Dark Knight, Captain America, The Avengers) along with the television show Arrow. So what does this tidbit have to do with anything? For one, with regard to my statistical analyses, it’s probably a good thing that I’m not a huge supporter of any given hero or villain; this should help produce unbiased results. On the other hand, my unfamiliarity with the comic book universe is a disadvantage in terms of data familiarity. Namely, I am fairly confident that I would not be able to recognize a superhero who is over- or underrated (more on the data below), making it harder for me to spot erroneous results. With this in mind, if any comic book experts out there see any peculiar results, they are certainly welcome to leave a comment or drop me an email.

2) Data: The data I will use to conduct my analyses come courtesy of the Superhero Database. This site houses a collection of information about 611 superheroes and villains. Each character (with data) is given a score between 0 and 100 (except in a few rare cases, where characters are deemed worthy of scores greater than 100) for six different traits: Intelligence, Strength, Speed, Durability, Power, and Combat. For example, here are the ratings for Batman:

Intelligence   Strength   Speed   Durability   Power   Combat
100            18         27      42           37      100

Correction: each character actually has two sets of these ratings. One set comes from the site itself (the website was created to log and rate superheroes), while the other is an average of user ratings. I chose to use the site’s ratings (the first set) because it allowed for a more complete dataset (many heroes and villains were not rated by users). In the event that the users rated a character and the site itself did not, I simply used the users’ ratings. After applying this strategy, I came out with 440 superheroes/villains on which to perform analyses (171 subjects lacked usable ratings from either the site or its users).
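
The fallback rule described above (prefer the site’s rating, fall back to the user average, and drop the character if neither exists) can be sketched as:

```python
def resolve_rating(site_rating, user_rating):
    """Prefer the site's rating; otherwise fall back to the user average.
    Returning None means the character is dropped from the dataset."""
    if site_rating is not None:
        return site_rating
    return user_rating
```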

I have one final note on the data. It is obviously highly subjective. When a human being is rating anything, there is always an element of subjectivity. Superheroes pose a particular challenge because they are fictional (so we are assigning subjective ratings to characters that were developed based on an author’s subjectivity). So please do not take the analyses too seriously, but certainly have fun critiquing and debating them! (Note: I do like how ratings can take on any discrete value between 0 and 100. This large range allows raters to address subtle differences among superheroes/villains.)

Visualizing Differences in Superhero Traits

The first items that I will present in analyzing this dataset are heatmaps.

Heroes

I created a non-random sample of superheroes to include big names. I did this using my opinions of which superheroes are popular and by surfing the web for lists of popular superheroes. After I had a list of about 20 heroes, I generated a heatmap in R that color-codes the six attributes for each hero.

[Heatmap: ratings for selected superheroes]

Superheroes and their ratings. A darker shade corresponds to a higher rating relative to peers.

I really like this form of visualization. One is able to view 126 individual data points in this graphic and I can assure you that it is easier to gain a grasp of the data using colors in the matrix rather than numeric values. Does anything stand out to you? Martian Manhunter, Superman, and Thor all have relatively high ratings across the board. And as we saw above, while Batman is exceptionally intelligent and excels in combat, he could use work in the other four departments.

Villains

I replicated the heatmap visualization using popular supervillains in place of superheroes. (Note: This graphic is probably a bit biased toward Batman villains, because those are the most familiar to me and I felt compelled to include them!)

[Heatmap: ratings for selected supervillains]

Supervillains and their ratings. A darker shade corresponds to a higher rating relative to peers.

Here we can see that General Zod, whom you may have recently seen in Man of Steel, is, for lack of a better term, a beast. He possesses an attribute score of at least 94 in each of the six categories. Interestingly, it becomes apparent that many of the famous Batman villains (e.g., Joker, Penguin, Riddler, and Two-Face) rely on a high level of intellect to challenge Batman. Then, however, we have Ra’s Al Ghul and Bane; both villains possess extraordinary combat ratings.

Closing Remarks

This post was devoted to introducing the superhero/villain dataset with which I will be working in subsequent posts. Additionally, I took a first step in statistically analyzing the data by visualizing differences in characteristics among heroes and villains. I’d be interested to hear whether anything jumps out to you when studying those graphics!

In future posts, I aim to move beyond visualizing the data by applying various statistical methods to the data. Hopefully you find the superhero data and the relevant questions that can be addressed both fun and entertaining, but also of practical importance with regard to the statistical methods being employed.

******************************************************************************

Download the data (.xls): Superhero & Supervillain Data

R code:

#source used: http://flowingdata.com/2010/01/21/how-to-make-a-heatmap-a-quick-and-easy-solution/
#will need to install RColorBrewer
library(RColorBrewer)  #provides brewer.pal()

#generate heatmap for select, non-random, heroes
super <- read.csv("~/Blog/select.csv", stringsAsFactors=FALSE) #read-in CSV file; insert appropriate path
super <- super[order(super$Name),] #sort heroes alphabetically
row.names(super) <- super$Name
super <- super[,3:8] #using these 6 columns for analysis
super_matrix <- data.matrix(super)
super_heatmap <- heatmap(super_matrix, Rowv=NA, Colv=NA,
			col = colorRampPalette(brewer.pal(9,"Greens"))(1000), scale="column",
			margins=c(5,10))

#generate heatmap for select, non-random, villains
vill <- read.csv("~/Blog/selectvill.csv", stringsAsFactors=FALSE) #read-in CSV file; insert appropriate path
vill <- vill[order(vill$Name),] #sort villains alphabetically
row.names(vill) <- vill$Name
vill <- vill[,3:8] #using these 6 columns for analysis
vill_matrix <- data.matrix(vill)
vill_heatmap <- heatmap(vill_matrix, Rowv=NA, Colv=NA,
                         col = colorRampPalette(brewer.pal(9,"Reds"))(1000), scale="column",
                         margins=c(5,10))

Visualizing Differences in NBA Team Construction

Recently, both on television broadcasts of games and on talk radio shows, I’ve heard several discussions relating to roster composition in sports. In the NBA, much of the discussion concerns the notion of needing to use free agency in order to acquire a “big three” and compete for an NBA championship. And with the NBA finals, featuring Miami’s well-publicized “big three”, tipping off tonight, I decided to look into how each NBA team acquired its players.

Without any complementary analysis (I would like to get into that another time), I’ve included a radar chart and table in this post. (Note: the table illustrates the percentage of each team’s 2013–2014 roster, according to RealGM Basketball, acquired through each talent acquisition avenue.) Again, it would be fascinating to conduct a detailed study concerning optimal roster construction (and to add the money devoted to each method of acquiring players into the analysis), but for now, let’s note that both NBA finals participants (particularly Miami) have relied heavily upon free agency in crafting their rosters.

[Radar chart: roster composition by talent acquisition avenue (click to enlarge)]

[Table: percentage of each team’s 2013–14 roster by acquisition avenue]

The Physics Behind Josh Beckett’s No-Hitter

Josh Beckett of the Los Angeles Dodgers tossed the first no-hitter of the 2014 MLB season on Sunday. Over 9 innings of work, the 6’5″, 230 lb. right-hander threw 128 pitches, striking out 6 Phillies batters and facing 30 batters in all, just 3 over the minimum.

Boxscore statistics can only tell us so much about Beckett’s gem. Here are some graphics that can help us in analyzing the underpinnings of Beckett’s success on Sunday:

[Plot: average pitch velocity by pitch type, no-hitter vs. 2014 season]

[Plot: average horizontal movement by pitch type, no-hitter vs. 2014 season]

[Plot: average vertical pitch location by pitch type, no-hitter vs. 2014 season]

All of the data used in producing these graphics were retrieved from the excellent baseball research resource, http://www.brooksbaseball.net. The site lists Beckett as having featured a four-seam fastball, sinker, changeup, curveball, cutter, and splitter at one point or another throughout the 2014 campaign (note, however, that Beckett did not throw his splitter during the no-hitter).

Interpreting the Graphs

During the no-hitter, Beckett’s average velocity on each of his pitches was essentially the same as it has been all season. He garnered more horizontal movement on each of his pitches, though, which leads me to believe that both his “stuff” and his command were sharper than normal. It is neat that analyzing this type of data allows one to quantify abstract ideas such as the effectiveness of a pitcher’s “stuff” and the command he holds over his pitches. Perhaps the most telltale sign that Beckett was at the top of his game is that he was keeping his pitches down in the zone, which is illustrated in the third plot. Notice that the average location of each of Beckett’s pitches (except for the cutter, which he threw only 8 times on Sunday) was noticeably lower than it has been all season.

Further Reading

I have only scratched the surface on what can be done with this sort of PITCHf/x data. Dr. Alan Nathan has developed volumes of neat research on the physics of baseball. FanGraphs has a lot of statistics and graphics that are developed using PITCHf/x data. Finally, as I mentioned above, BrooksBaseball is also a tremendous resource.

MLB Team Balance after 47 Games or So: The Unbelievable Start in Oakland

On this past Thursday’s edition of Baseball Tonight, Jonah Keri was called upon to discuss MLB team run differentials (he also wrote about it here). He pointed out that this early in the season, run differential may be even more powerful in predicting end-of-season win-loss records than current win-loss records themselves. This is not the first time I’ve heard this tidbit, and Sean Forman does a nice job of explaining the rationale in this piece. In sum, run differential says more about the quality of a team than does win-loss record because, especially early in the season, win-loss record can be skewed by chance. Hence, a team’s 162-game win-loss record is best explained by a combination of runs scored, runs allowed, and luck.
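
One classic way to translate runs scored and allowed into an expected winning percentage, in the same spirit as the run-differential argument, is Bill James’s Pythagorean expectation (not a formula Keri or Forman computes in the linked pieces, just the standard sabermetric version of the idea):

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2):
    """Bill James's Pythagorean expectation: RS^k / (RS^k + RA^k)."""
    rs, ra = runs_scored**exponent, runs_allowed**exponent
    return rs / (rs + ra)

# A team that scores and allows at the same rate projects to .500;
# outscoring opponents projects above .500
even = pythagorean_pct(200, 200)
```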

Keri pointed out that this is about the time of the season when we can start trusting run differential to tell us about the quality of a team. Russell A. Carleton illustrates why this is so here, though his more advanced calculation, which employs Cronbach’s alpha, seems to suggest we should wait a few more weeks.

Nevertheless, this topic got me thinking about which MLB teams are most balanced (a lot of runs scored and few runs allowed, or, in other words, solid hitting, pitching and fielding).

First, I looked into team OPS numbers:

[Scatterplot: team OPS vs. OPS allowed]

Quick Comments:

  • As is to be expected, the Rockies, who play well above sea level in Denver, CO, have the highest OPS (though I would argue that the number is still impressive even after accounting for geography). I did not adjust for park factors in this particular analysis.
  • The Angels, Tigers, and (especially) the A’s all have impressive places within the plot (a team that has a better OPS than that which it surrenders to opponents is of course below the 45-degree line).

I then plotted runs/game vs. runs allowed/game and calculated a standardized version of run differential ((runs scored − runs allowed) / games played), putting the resulting numbers in tabular form:

[Scatterplot: runs scored per game vs. runs allowed per game]

[Table: per-game run differential by team]

Quick Comments:

  • Again, the Oakland A’s post a very, very impressive run differential figure.
  • The 17-28 Cubs post a record not at all reflective of their run-differential.
  • The D’backs could use some pitching help.
  • Though Colorado scores the most runs in baseball (see discussion above), they still hold opponents to far fewer runs than they score. Thus, seeing that they play their games in the same ballpark as their opponents, this analysis bodes very well for the Rockies.

With all of this in mind, the most impressive teams thus far in terms of balance (which is vital to sustained success in any sport) and the ones that would seem to be in it for the long haul are:

  1. Oakland
  2. Detroit
  3. San Francisco
  4. LA Angels
  5. Colorado Rockies

You can of course interpret the data however you’d like. These are my five most balanced teams as of now. Oakland’s season is off to an unbelievable start and they deserve particular recognition (at least double the per-game run differential of every other team!) but, of course, who knows what the future holds. This list is very likely to change as the season progresses, but, for now, realize that the A’s are playing tremendous (dare I say historic?) baseball.