Category Archives: Superheroes

What is the best superhero TV show?

Super(hero) Statistics, Pt. II

Building off my previous post, in which I introduced a neat superhero dataset and presented graphical descriptions of the data, in this post I would like to address differences in traits between superheroes and supervillains.

1. Descriptive Statistics

Below, I’ve included mean ratings for both superheroes and supervillains.

Variable	Mean for Supervillains (n=123)	Mean for Superheroes (n=304)
Intelligence	67.46	60.22
Strength	46.11	38.17
Speed	37.03	38.26
Durability	62.00	56.28
Power	59.67	55.48
Combat	59.87	60.50
Total	332.15	308.91
Source: http://www.superherodb.com

Surprisingly, the villains have higher average total ratings (the sum of the six ratings) than the heroes do. We (or at least I) typically think of superheroes as being superior to their rivals. On the other hand, creating villains to be more extraordinary creatures than their counterparts makes the heroes’ ultimate triumphs even more impressive, which makes for good underdog stories.
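As a quick check on the table above, the Total row and the per-trait gaps can be recomputed from the six reported means. This is only a sketch that retypes the published means; the vector names are illustrative and not from the original analysis.

```r
# Mean ratings copied from the table above (villains n=123, heroes n=304)
traits   <- c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")
villains <- c(67.46, 46.11, 37.03, 62.00, 59.67, 59.87)
heroes   <- c(60.22, 38.17, 38.26, 56.28, 55.48, 60.50)
names(villains) <- traits
names(heroes)   <- traits

# Per-trait gap (villain minus hero) and the total for each group
gap    <- villains - heroes
totals <- c(villains = sum(villains), heroes = sum(heroes))

print(round(gap, 2))
print(round(totals, 2))
```

The rounded villain means sum to 332.14, one hundredth below the reported 332.15, which is what one would expect if the Total row was computed before rounding. Villains lead in four of the six traits; heroes lead slightly in speed and combat.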

Now, let’s add variance into the equation and look at boxplots of the data.

Note: “bad” refers to villains and “good” refers to heroes

Without any formal discussion, I’ll make a quick remark about the spread of the ratings. There is more spread (as measured by standard deviation) in ratings for villains in 5 of the 6 categories (all but combat). Practically, this leads me to believe that villains are more diverse in makeup than heroes. For an anecdotal example, Joker is a very different villain than Blob.

Name	Intelligence	Strength	Speed	Durability	Power	Combat
Joker	100	10	12	56	22	90
Blob	10	83	23	95	26	72
Difference	90	-73	-11	-39	-4	18
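The Difference row can be reproduced directly from the two rating vectors. The sketch below also adds each character's total, which is not in the table but helps make the point about makeup.

```r
# Ratings copied from the Joker/Blob table above
traits <- c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")
joker  <- setNames(c(100, 10, 12, 56, 22, 90), traits)
blob   <- setNames(c(10, 83, 23, 95, 26, 72), traits)

# Trait-by-trait difference (Joker minus Blob), matching the "Difference" row
difference <- joker - blob
print(difference)

# Totals for comparison: similar sums can hide very different profiles
print(c(Joker = sum(joker), Blob = sum(blob)))
```

The totals are 290 for Joker and 309 for Blob, so two villains with similar overall ratings can still be built very differently.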

2. Incorporating Uncertainty

From the analysis so far, one can easily see that there are differences in the average scores between villains and heroes. Now, I was tempted to take these differences as “real” but, upon further consideration, second-guessed myself.

I faced a statistical conundrum that statisticians working with real-world data rarely, if ever, face. I was tempted to say, for instance, that villains are about 4 points more powerful than heroes because I thought my sample was essentially the population. Upon further examination, however, I recalled that I had cut 171 subjects from my sample because of missing values, and that there are probably superheroes and supervillains out there who are not included in the superhero database. So, since my sample is just that, a sample, I decided to incorporate some uncertainty into my estimates. (If this were a real-world problem, I would need to be sure that my missing data were missing at random to ensure that I was working with a random sample. In this case, I will make that assumption and proceed, though it is probably not valid. I think that the results will still be interesting and worthy of discussion.)

Below are estimated ranges for the true differences between supervillains and superheroes for each of the six ratings categories.

Variable	95% CI of Mean (Villain-Hero)
Intelligence	(2.5336, 11.9591)
Strength	(1.2240, 14.6616)
Speed	(-6.0548, 3.5935)
Durability	(-0.5996, 12.0338)
Power	(-1.5122, 9.8915)
Combat	(-5.4797, 4.2130)

After incorporating uncertainty, we only detect significant disparities in two (intelligence and strength) of the six categories. If I were to account for multiple comparisons, then only intelligence would differ significantly between the two groups.
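The multiple-comparisons remark can be illustrated from the published intervals alone. The sketch below backs an approximate standard error out of each 95% interval, converts it to a two-sided p-value, and applies a Bonferroni adjustment. Two assumptions are mine rather than the post's: a normal critical value stands in for the exact t value (the degrees of freedom are not reported), and Bonferroni stands in for whichever correction the author had in mind.

```r
# 95% confidence limits for (villain - hero), copied from the table above
ci <- data.frame(
  trait = c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat"),
  lower = c(2.5336, 1.2240, -6.0548, -0.5996, -1.5122, -5.4797),
  upper = c(11.9591, 14.6616, 3.5935, 12.0338, 9.8915, 4.2130),
  stringsAsFactors = FALSE
)

# ASSUMPTION: normal critical value instead of the exact t-based one
crit <- qnorm(0.975)

ci$estimate <- (ci$lower + ci$upper) / 2           # interval midpoint
ci$se       <- (ci$upper - ci$lower) / (2 * crit)  # implied standard error
ci$p        <- 2 * pnorm(-abs(ci$estimate / ci$se))

# ASSUMPTION: Bonferroni correction across the six comparisons
ci$p_bonf   <- p.adjust(ci$p, method = "bonferroni")

ci$sig_raw  <- ci$p < 0.05
ci$sig_bonf <- ci$p_bonf < 0.05
print(ci[, c("trait", "estimate", "p", "p_bonf", "sig_raw", "sig_bonf")], digits = 3)
```

Under these assumptions, intelligence and strength are flagged before adjustment and only intelligence remains afterward, which agrees with the statement above.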

3. P (Hero | Characteristics)

Now that we have been exposed to at least some evidence to suggest that there are detectable differences in traits between superheroes and supervillains, I am interested in answering the following question:

Given that a character possesses a particular set of ratings, can we predict if that character will be a hero or a villain?

“Practically”, this would allow one to predict the alignment of a new member to the world of superheroes and supervillains.

To address this question, I created a logistic regression model. Overall, I would say that the model does not do a great job of predicting alignment (perhaps due to the considerable amount of variability among both heroes and villains), but there is still some useful information which I will share. I’ll present these results in terms of odds ratios.

Effect	Point Estimate	95% Wald Confidence Limits
Intelligence	0.982	(0.971, 0.993)
Strength	0.992	(0.982, 1.001)
Speed	1.012	(1, 1.023)
Durability	0.998	(0.987, 1.008)
Power	0.998	(0.989, 1.008)
Combat	1.008	(0.998, 1.019)

When controlling for the other ratings, intelligence and speed play a significant role in predicting alignment, although speed is borderline: the lower limit of its interval rounds to 1. Here are their interpretations:

Holding other factors fixed, a one-unit increase in intelligence rating is associated with a 1.8% decrease in the odds that a character is a superhero.
Holding other factors fixed, a one-unit increase in speed rating is associated with a 1.2% increase in the odds that a character is a superhero.
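The percentages come straight from the odds ratios in the table above: subtract 1 and multiply by 100. Because odds ratios multiply, the effect of a larger change is the odds ratio raised to a power, not the one-unit percentage times the number of units. A small sketch, with a 10-point change chosen purely for illustration:

```r
# Odds ratios copied from the logistic regression table above
or <- c(Intelligence = 0.982, Speed = 1.012)

# Percent change in the odds of being a hero per one-point increase
pct_per_point <- (or - 1) * 100

# A k-point increase scales the odds by OR^k (k = 10 is an arbitrary example)
k <- 10
pct_per_k <- (or^k - 1) * 100

print(round(pct_per_point, 1))
print(round(pct_per_k, 1))
```

So a character rated 10 points higher in intelligence has roughly 17% lower odds of being a hero, and one rated 10 points higher in speed has roughly 13% higher odds, holding the other ratings fixed.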

4. Conclusions

In my mind, the biggest takeaway from this analysis is the role that intelligence plays in distinguishing superheroes from supervillains. The villains possess significantly higher levels of intelligence than their enemies, the heroes. I guess it makes for a good story when the good guy outwits the evil mastermind.

******************************************************************************

As in Pt. I, I was working with this dataset.

Below is the code I used, which, for this particular post, I feel is rather straightforward.

R code for the boxplots:

#read in the CSV file; insert appropriate path
hero_vill <- read.csv("~/Blog/hero_vill.csv", stringsAsFactors=TRUE)
#create a 3x2 panel of boxplots; separate by alignment
par(mfrow=c(3,2))
boxplot(hero_vill$Intelligence~hero_vill$Alignment, col=rainbow(2), main="Intelligence Ratings")
boxplot(hero_vill$Strength~hero_vill$Alignment, col=rainbow(2), main="Strength Ratings")
boxplot(hero_vill$Speed~hero_vill$Alignment, col=rainbow(2), main="Speed Ratings")
boxplot(hero_vill$Durability~hero_vill$Alignment, col=rainbow(2), main="Durability Ratings")
boxplot(hero_vill$Power~hero_vill$Alignment, col=rainbow(2), main="Power Ratings")
boxplot(hero_vill$Combat~hero_vill$Alignment, col=rainbow(2), main="Combat Ratings")

SAS code for t-tests & logistic regression:

*data processing;
data hero_vill;
  infile "~/Blog/hero_vill_1.csv" dlm=',' dsd missover;
  input Name :$32. Alignment $ Intelligence Strength Speed Durability Power Combat Total;
  if Alignment='good' then Hero=1;
  else if Alignment='bad' then Hero=0;
  else delete;
run;

*t-tests;
proc ttest data=hero_vill plots=none;
  class alignment;
  var intelligence;
run;

proc ttest data=hero_vill plots=none;
  class alignment;
  var strength;
run;

proc ttest data=hero_vill plots=none;
  class alignment;
  var speed;
run;

proc ttest data=hero_vill plots=none;
  class alignment;
  var durability;
run;

proc ttest data=hero_vill plots=none;
  class alignment;
  var power;
run;

proc ttest data=hero_vill plots=none;
  class alignment;
  var combat;
run;

*logistic regression;
proc logistic data=hero_vill;
  model Hero(event='1')=Intelligence Strength Speed Durability Power Combat;
run;
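For readers without SAS, the same two steps can be sketched in R. This is not the code behind the tables above. The data frame below is simulated so that the sketch runs on its own, so its numbers will not match the post. R's t.test defaults to the Welch (unequal variance) interval, and the post does not say which of the intervals reported by PROC TTEST was used. confint.default gives Wald limits, the type of interval named in the odds ratio table.

```r
# ASSUMPTION: simulated stand-in data so the sketch is self-contained;
# replace with the real file (e.g. via read.csv) to work with actual ratings.
set.seed(1)
n <- 200
traits <- c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")
hero_vill <- as.data.frame(matrix(sample(0:100, n * 6, replace = TRUE),
                                  ncol = 6, dimnames = list(NULL, traits)))
hero_vill$Alignment <- factor(sample(c("bad", "good"), n, replace = TRUE))
hero_vill$Hero <- as.integer(hero_vill$Alignment == "good")

# Two-sample t-tests, one per trait; interval is for villain minus hero
is_bad <- hero_vill$Alignment == "bad"
ci_table <- t(sapply(traits, function(v) {
  t.test(hero_vill[[v]][is_bad], hero_vill[[v]][!is_bad])$conf.int
}))
colnames(ci_table) <- c("lower", "upper")
print(round(ci_table, 2))

# Logistic regression modelling the probability that Hero == 1
fit <- glm(Hero ~ Intelligence + Strength + Speed + Durability + Power + Combat,
           data = hero_vill, family = binomial)

# Odds ratios with 95% Wald limits; the intercept row is dropped
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))[-1, ]
print(round(or_table, 3))
```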

Super(hero) Statistics, Pt. I


Today, I will post the first of several entries relating to a unique dataset that I’ve put together: a dataset that contains over 600 rows of data on a topic of great importance… superheroes and supervillains! Okay, analyzing superhero data may not be the most urgent matter that we as a society face, but:

1) I think that it is often useful to include entertaining applications when demonstrating what may otherwise be dry statistical methodology (though, admittedly, I sometimes find the “dry” statistics entertaining as well).

2) I find it fun to casually talk to friends about superheroes (especially those in recent movies) simply because, well, it’s just that: fun! And much like sabermetricians use statistics to challenge what is thought to be conventional baseball wisdom, I hope to use statistics to address pertinent topics relating to superheroes and supervillains (let’s call this discipline supermetrics?).

Setting the Stage

Before I get into the statistical analysis of the data, I must put a few things on the table.

1) Disclaimer: I do not claim to be a superhero enthusiast, expert, or anything of the kind. I don’t think I have ever read a comic book in my life. I am, however, a fan of superhero movies (e.g., Man of Steel, The Dark Knight, Captain America, The Avengers, etc.) along with the television show Arrow. So what does this tidbit have to do with anything? For one, with regard to my statistical analyses, it’s probably a good thing that I’m not a huge supporter of any given hero or villain. This should help in producing unbiased results. On the other hand, my unfamiliarity with the comic book universe is a disadvantage in terms of data familiarity. Namely, I am fairly confident that I would not be able to recognize any superhero that is over- or underrated (more on the data below), which makes it more likely that erroneous calculations will go unnoticed. With this in mind, if any comic book experts out there see any peculiar results, they are certainly welcome to leave a comment or drop me an email.

2) Data: The data that I will use to conduct my analyses come courtesy of the Superhero Database. This site houses a collection of information about 611 superheroes and villains. Each character (with data) is given a score between 0 and 100 (except in a few rare cases, where characters are deemed to be worthy of scores greater than 100) for six different traits: Intelligence, Strength, Speed, Durability, Power, and Combat. For example, here are the ratings for Batman:

Intelligence	Strength	Speed	Durability	Power	Combat
100	18	27	42	37	100

Correction: each character actually has two sets of these ratings. One set comes from the website itself (the website was created to log and rate superheroes), while the other set is an average of user ratings. I chose to use the site’s ratings (the first set) because it allowed for a more complete dataset (many heroes and villains were not rated by users). In the event that the users rated the character and the site itself did not, I simply used the users’ ratings. After using this strategy, I came out with 440 superheroes/villains on which to perform analyses (the remaining 171 subjects had no usable ratings from either the site or its users).
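The fallback rule described above (use the site rating, otherwise the user rating, otherwise drop the character) can be sketched as follows. The column names and the four rows are hypothetical placeholders, not taken from the Superhero Database, and only one trait is shown.

```r
# HYPOTHETICAL layout and made-up values: site rating and averaged user
# rating for a single trait, with NA marking a missing rating.
ratings <- data.frame(
  name              = c("A", "B", "C", "D"),
  site_intelligence = c(88, NA, 50, NA),
  user_intelligence = c(75, 63, NA, NA),
  stringsAsFactors  = FALSE
)

# Prefer the site rating; fall back to the user rating when it is missing
ratings$intelligence <- ifelse(is.na(ratings$site_intelligence),
                               ratings$user_intelligence,
                               ratings$site_intelligence)

# Characters with no rating from either source are excluded from analysis
usable <- ratings[!is.na(ratings$intelligence), ]
print(usable[, c("name", "intelligence")])
```

Here character A keeps the site rating, B falls back to the user rating, C keeps the site rating, and D is dropped.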

I have one final note on the data: they are obviously highly subjective. When a human being is rating anything, there is always an element of subjectivity. Superheroes pose a particular challenge because they are fictional (so we are assigning subjective ratings to characters that were developed based on an author’s subjectivity). So please do not take the analyses too seriously, but certainly have fun critiquing and debating them! (Note: I do like how ratings can take on any discrete value between 0 and 100. This large range allows raters to address subtle differences among superheroes/villains.)

Visualizing Differences in Superhero Traits

The first items that I will present in analyzing this dataset are heatmaps.

Heroes

I created a non-random sample of superheroes to include big names. I did this using my opinions of which superheroes are popular and by surfing the web for lists of popular superheroes. After I had a list of about 20 heroes, I generated a heat matrix using R that color-coded the six attributes for each hero.

Superheroes and their ratings. A darker shade corresponds to a higher rating relative to peers.

I really like this form of visualization. One is able to view 126 individual data points in this graphic and I can assure you that it is easier to gain a grasp of the data using colors in the matrix rather than numeric values. Does anything stand out to you? Martian Manhunter, Superman, and Thor all have relatively high ratings across the board. And as we saw above, while Batman is exceptionally intelligent and excels in combat, he could use work in the other four departments.
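What "relative to peers" means in practice: with scale="column", R's heatmap centers and scales each trait column before coloring, so a shade reflects how a character compares with the other characters in the same plot. It does not reflect the raw score. The sketch below applies the same kind of column standardization with scale() to three sets of ratings quoted on this page. Mixing Batman with two villains is only for illustration and is not how the plots were built.

```r
# Ratings quoted on this page; combining them in one matrix is illustrative
traits <- c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")
m <- rbind(
  Batman = c(100, 18, 27, 42, 37, 100),
  Joker  = c(100, 10, 12, 56, 22, 90),
  Blob   = c(10, 83, 23, 95, 26, 72)
)
colnames(m) <- traits

# Center each column at 0 and divide by its standard deviation
z <- scale(m)
print(round(z, 2))
```

Within this small group, Blob has the largest standardized strength even though his intelligence is far below the others, so the same character can sit at opposite ends of the color scale in different columns.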

Villains

I replicated the heatmap visualization using popular supervillains in place of superheroes. (Note: This graphic is probably a bit biased toward Batman villains, because those are the most familiar to me and I felt compelled to include them!)

Supervillains and their ratings. A darker shade corresponds to a higher rating relative to peers.

Here we can see that General Zod, whom you may have recently seen in Man of Steel, is, for lack of a better term, a beast. He possesses an attribute score of at least 94 in each of the six categories. Interestingly, it becomes apparent that many of the famous Batman villains (e.g., Joker, Penguin, Riddler, and Two-Face) rely on a high level of intellect to challenge Batman. Then, however, we have Ra’s Al Ghul and Bane; both villains possess extraordinary combat ratings.

Closing Remarks

This post was devoted to introducing the superhero/villain dataset with which I will be working in subsequent posts. Additionally, I took a first step in statistically analyzing the data by visualizing differences in characteristics among heroes and villains. I’d be interested to hear whether anything jumps out to you when studying those graphics!

In future posts, I aim to move beyond visualizing the data by applying various statistical methods. Hopefully you will find the superhero data and the questions they can address fun and entertaining, and also of practical value in illustrating the statistical methods being employed.

******************************************************************************

Download the data (.xls): Superhero & Supervillain Data

R code:

#source used: http://flowingdata.com/2010/01/21/how-to-make-a-heatmap-a-quick-and-easy-solution/
#will need to install RColorBrewer, which provides brewer.pal
library(RColorBrewer)

#generate heatmap for select, non-random, heroes
super <- read.csv("~/Blog/select.csv", stringsAsFactors=FALSE) #read-in CSV file; insert appropriate path
super <- super[order(super$Name),] #sort heroes alphabetically
row.names(super) <- super$Name
super <- super[,3:8] #using these 6 columns for analysis
super_matrix <- data.matrix(super)
super_heatmap <- heatmap(super_matrix, Rowv=NA, Colv=NA,
                         col = colorRampPalette(brewer.pal(9,"Greens"))(1000), scale="column",
                         margins=c(5,10))

#generate heatmap for select, non-random, villains
vill <- read.csv("~/Blog/selectvill.csv", stringsAsFactors=FALSE) #read-in CSV file; insert appropriate path
vill <- vill[order(vill$Name),] #sort villains alphabetically
row.names(vill) <- vill$Name
vill <- vill[,3:8] #using these 6 columns for analysis
vill_matrix <- data.matrix(vill)
vill_heatmap <- heatmap(vill_matrix, Rowv=NA, Colv=NA,
                         col = colorRampPalette(brewer.pal(9,"Reds"))(1000), scale="column",
                         margins=c(5,10))

Jon's Jibber-Jabber

My thoughts on statistics, sports, television shows, movies, books, science, economics, and whatever else piques my curiosity.


Episode	Rating
1	8.9
2	8.7
3	8.9
4	8.9
5	8.9
6	8.6
7	8.6
8	9.1
9	9.1
10	8.5
11	8.8
12	8.9
13	8.8
14	8.9
15	9.1
16	8.7
17	8.5
18	9.1
19	9.1
20	9.1
21	9.1
22	9.1
23	9.1