Category Archives: Superheroes

What is the best superhero TV show?


As evidenced in some of my old posts (like this one and this one), I’ve become something of a fan of superhero movies and television shows (though, as I have said before, not of the original comic books themselves). The fall 2014 television season featured a number of comic-based programs. Interested in comparing the popularity and success of the shows in this genre, I gathered some data from IMDb (accurate as of 01/12/2015) and created some charts (included below). I focused on the five shows discussed in this article: Agents of S.H.I.E.L.D., Arrow, Constantine, The Flash, and Gotham. Personally, I am very much a fan of both Arrow and The Flash (both CW products) and am (at least for the time being) a semi-fan of Gotham (Fox). I have never watched S.H.I.E.L.D. (ABC) or Constantine (NBC). As an interesting side note, ABC recently started airing episodes of its newest Marvel program, Agent Carter. There are some other shows with (arguably less obvious) origins in the comics, but I decided to look at the five major ones listed above.

The overall IMDb ratings for the five shows look like this (as of 01/13/2015):

Agents of S.H.I.E.L.D.: 7.5/10

Arrow: 8.2/10

Constantine: 7.6/10

The Flash: 8.3/10

Gotham: 8.1/10

However, I did not create charts with overall ratings in mind. Instead, I was interested in an analysis of the ratings for each episode of each show. I think that this sort of analysis is better suited to evaluating a show’s success because, in my mind, a good show is a consistent one. In other words, each episode is worth watching. I would prefer a show with consistent 8/10 episodes to one with a handful of 6s and a handful of 10s. This is why I enjoyed the second season of Arrow so much. It seemed as though each episode contributed significantly to the overarching plot of the showdown between Oliver and Slade or to a subplot that was just as riveting. Just take a look at the episode ratings for that season. The average rating is 8.9/10 and all episodes are within 0.4 of the average.

Arrow Season Two IMDb Episode Ratings

Episode Rating
1 8.9
2 8.7
3 8.9
4 8.9
5 8.9
6 8.6
7 8.6
8 9.1
9 9.1
10 8.5
11 8.8
12 8.9
13 8.8
14 8.9
15 9.1
16 8.7
17 8.5
18 9.1
19 9.1
20 9.1
21 9.1
22 9.1
23 9.1
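As a quick check on the summary above, the average and the largest deviation from it can be recomputed from the table. This is only a minimal sketch: the variable names (s2, s2_mean, s2_max_dev) are my own, and the values are typed in by hand from the table.

```r
# Arrow season two episode ratings, copied from the table above (episodes 1 to 23)
s2 <- c(8.9, 8.7, 8.9, 8.9, 8.9, 8.6, 8.6, 9.1, 9.1, 8.5, 8.8, 8.9,
        8.8, 8.9, 9.1, 8.7, 8.5, 9.1, 9.1, 9.1, 9.1, 9.1, 9.1)

s2_mean    <- mean(s2)                  # average episode rating
s2_max_dev <- max(abs(s2 - s2_mean))    # largest distance from the average

round(s2_mean, 1)       # 8.9
round(s2_max_dev, 2)    # 0.39
```

The largest gap (about 0.39) comes from the two 8.5 episodes, which is where the "within 0.4" figure comes from.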

The first season, while still worth watching, posts an average episode rating of 8.5 and all episodes fall within 0.6 of the average.

Anyway, here are the charts that I created.

imdb beeswarm

This plot displays the ratings of all episodes in each series. I used the beeswarm package in R to allow for ‘jittering’ as well as to experiment with alternate displays.

 

imdb boxplot

A box plot including individual points.

R code used to produce the two charts above:

#beeswarm package required--install.packages('beeswarm')
library(beeswarm)

#read in the IMDb dataset (one row per episode, with columns Show and Rating); insert appropriate path
tvraw <- read.csv("Super IMDb.csv")

#bee swarm plot overlaying box plot
boxplot(Rating ~ Show, data = tvraw,
        outline = FALSE, #do not draw outliers twice; the bee swarm plots every episode
        main = 'IMDb Episode Ratings', xlab = "",
        ylim = c(7, 10))
beeswarm(Rating ~ Show, data = tvraw, col = 4, add = TRUE, method = "center") #plot on top of boxplot

#standalone bee swarm plot
beeswarm(Rating ~ Show, data = tvraw, pch = 16, col = 1:5,
         main = 'IMDb Episode Ratings', xlab = "")

I also created an interactive scatterplot that can be viewed by clicking the link.

Here is the IMDb dataset used in this analysis: Super IMDb.csv

Update (01/14/2015): In comparing seasons 1 and 2 of Arrow, it may be more instructive to compare the coefficient of variation (CV) in each sample. The CV for season 1 is 3.7% while the CV for season 2 is just 2.3%. 
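For anyone who wants to reproduce the season 2 figure, here is a minimal sketch of the CV calculation. The helper name cv_percent is my own, and I am assuming the CV is the sample standard deviation divided by the mean. The season 1 ratings are not listed in this post, so only season 2 is shown.

```r
# Coefficient of variation: sample standard deviation as a percentage of the mean
cv_percent <- function(x) 100 * sd(x) / mean(x)

# Arrow season two episode ratings, copied from the table earlier in the post
s2 <- c(8.9, 8.7, 8.9, 8.9, 8.9, 8.6, 8.6, 9.1, 9.1, 8.5, 8.8, 8.9,
        8.8, 8.9, 9.1, 8.7, 8.5, 9.1, 9.1, 9.1, 9.1, 9.1, 9.1)

round(cv_percent(s2), 1)    # 2.3
```

Because the CV divides by the mean, it puts the two seasons on a common scale even though their average ratings differ.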

Super(hero) Statistics, Pt. II

Building off my previous post, in which I introduced a neat superhero dataset and presented graphical descriptions of the data, in this post I would like to address differences in traits between superheroes and supervillains.

1. Descriptive Statistics

Below, I’ve included mean ratings for both superheroes and supervillains.

Variable       Mean for Supervillains (n=123)   Mean for Superheroes (n=304)
Intelligence   67.46                            60.22
Strength       46.11                            38.17
Speed          37.03                            38.26
Durability     62.00                            56.28
Power          59.67                            55.48
Combat         59.87                            60.50
Total          332.15                           308.91
Source: http://www.superherodb.com

Surprisingly, the villains possess, on average, higher total ratings (the sum of the six ratings) than the heroes do. This runs against my intuition, since we (or at least I) typically think of superheroes as being superior to their rivals. On the other hand, creating villains to be more extraordinary creatures than their counterparts makes the heroes’ ultimate triumphs even more impressive. This makes for good underdog stories.
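To make the comparison concrete, here is a small sketch that rebuilds the totals and the trait-by-trait gaps from the means in the table. The vector names are my own, and the numbers are copied from the table rather than recomputed from the raw data.

```r
traits <- c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")

# Group means copied from the table above
villain_means <- setNames(c(67.46, 46.11, 37.03, 62.00, 59.67, 59.87), traits)
hero_means    <- setNames(c(60.22, 38.17, 38.26, 56.28, 55.48, 60.50), traits)

# Totals are the sums of the six trait means
sum(villain_means)    # 332.14 (the table shows 332.15 because the means are rounded)
sum(hero_means)       # 308.91

# Villain minus hero gap for each trait; negative values favour the heroes
gaps <- villain_means - hero_means
names(gaps)[gaps < 0] # "Speed" "Combat"
```

Heroes come out ahead only on speed and combat, and by small margins in both cases.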

Now, let’s add variance into the equation and look at boxplots of the data.

boxplots

Note: “bad” refers to villains and “good” refers to heroes

Without any formal discussion, I’ll make a quick remark about the spread of the ratings. There is more spread (as measured by standard deviation) in ratings for villains in 5 of the 6 categories (all but combat). Practically, this leads me to believe that villains are more diverse in makeup than heroes. For an anecdotal example, Joker is a very different villain than Blob.

Name                      Intelligence   Strength   Speed   Durability   Power   Combat
Joker                     100            10         12      56           22      90
Blob                      10             83         23      95           26      72
Difference (Joker-Blob)   90             -73        -11     -39          -4      18

2. Incorporating Uncertainty

From the analysis so far, one can easily see that there are differences in the average scores between villains and heroes. Now, I was tempted to take these differences as “real” but, upon further consideration, second-guessed myself.

I faced a statistical conundrum that statisticians working with real-world data rarely, if ever, face. The reason I was tempted to say, for instance, that villains are about 4 points more powerful than heroes is that I thought my sample was essentially the population. Upon further examination, however, I recalled that I had cut 171 subjects from my sample because of missing values and that there are probably superheroes and supervillains out there who are not included in the superhero database. So, since my sample is just that, a sample, I decided to incorporate some uncertainty into my estimates. (Now, if this were a real-world problem, I would need to be sure that my missing data were missing at random to ensure that I was working with a random sample. In this case, I will make that assumption and proceed, though it is probably not valid. I think that the results will still be interesting and worthy of discussion.)

Below are estimated ranges for the true differences between supervillains and superheroes for each of the six ratings categories.

Variable       95% CI for Difference in Means (Villain - Hero)
Intelligence   (2.5336, 11.9591)
Strength       (1.2240, 14.6616)
Speed          (-6.0548, 3.5935)
Durability     (-0.5996, 12.0338)
Power          (-1.5122, 9.8915)
Combat         (-5.4797, 4.2130)

After incorporating uncertainty, we only detect significant disparities in two (intelligence and strength) of the six categories. If I were to account for multiple comparisons, then only intelligence would differ significantly between the two groups.
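The multiple-comparisons remark can be roughly reconstructed from the intervals alone. The sketch below is not necessarily the adjustment I had in mind at the time: it assumes a Bonferroni correction and a normal critical value (the intervals themselves came from t-tests), and the names ci, p_raw and p_bonf are made up for illustration.

```r
# 95% confidence limits for the villain-minus-hero differences, copied from the table above
ci <- data.frame(
  lower = c(2.5336, 1.2240, -6.0548, -0.5996, -1.5122, -5.4797),
  upper = c(11.9591, 14.6616, 3.5935, 12.0338, 9.8915, 4.2130),
  row.names = c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")
)

estimate <- (ci$upper + ci$lower) / 2                    # midpoint of each interval
std_err  <- (ci$upper - ci$lower) / (2 * qnorm(0.975))   # assumption: normal critical value
p_raw    <- 2 * pnorm(-abs(estimate / std_err))          # approximate two-sided p-values
p_bonf   <- p.adjust(p_raw, method = "bonferroni")       # multiply by 6, cap at 1

rownames(ci)[p_raw < 0.05]     # "Intelligence" "Strength"
rownames(ci)[p_bonf < 0.05]    # "Intelligence"
```

Under these assumptions the strength difference no longer clears the adjusted threshold, which matches the statement above.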

3. P (Hero | Characteristics)

Now that we have been exposed to at least some evidence to suggest that there are detectable differences in traits between superheroes and supervillains, I am interested in answering the following question:

Given that a character possesses a particular set of ratings, can we predict if that character will be a hero or a villain? 

“Practically”, this would allow one to predict the alignment of a new member to the world of superheroes and supervillains.

To address this question, I created a logistic regression model. Overall, I would say that the model does not do a great job of predicting alignment (perhaps due to the considerable amount of variability among both heroes and villains), but there is still some useful information which I will share.  I’ll present these results in terms of odds ratios.

Effect         Odds Ratio Point Estimate   95% Wald Confidence Limits
Intelligence   0.982                       (0.971, 0.993)
Strength       0.992                       (0.982, 1.001)
Speed          1.012                       (1.000, 1.023)
Durability     0.998                       (0.987, 1.008)
Power          0.998                       (0.989, 1.008)
Combat         1.008                       (0.998, 1.019)

When controlling for the other ratings, intelligence and speed play a significant role in predicting alignment (speed only marginally, since the lower limit of its interval rounds to 1.000). Here are their interpretations:

  • Holding other factors fixed, a one-unit increase in intelligence rating is associated with a 1.8% decrease in the odds that a character is a superhero.
  • Holding other factors fixed, a one-unit increase in speed rating is associated with a 1.2% increase in the odds that a character is a superhero.
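The percentages above follow directly from the odds ratios, and the same arithmetic extends to larger changes in a rating. Here is a minimal sketch with variable names of my own choosing; the 10-point comparison is my own illustration, not something reported in the model output.

```r
# Odds ratios copied from the table above
odds_ratio <- c(Intelligence = 0.982, Speed = 1.012)

# Percent change in the odds of being a hero for a one-point increase
pct_change_1 <- (odds_ratio - 1) * 100
round(pct_change_1, 1)     # Intelligence -1.8, Speed 1.2

# Odds ratios multiply, so a k-point increase scales the odds by odds_ratio^k
pct_change_10 <- (odds_ratio^10 - 1) * 100
round(pct_change_10, 1)    # Intelligence -16.6, Speed 12.7
```

Seen this way, a per-point effect that looks tiny adds up quickly on a 0 to 100 scale.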

4. Conclusions

In my mind, the biggest takeaway from this analysis is the role that intelligence plays in distinguishing superheroes from supervillains. The villains possess significantly higher levels of intelligence than their enemies, the heroes. I guess it makes for a good story when the good guy outwits the evil mastermind.

******************************************************************************

As in Pt. I, I was working with this dataset.

Below is code used, which, for this particular post, I feel is rather straightforward.

R code for the boxplots:

#create a panel of boxplots; separate by alignment
hero_vill <- read.csv("~/Blog/hero_vill.csv", stringsAsFactors=TRUE)
par(mfrow=c(3,2))
boxplot(hero_vill$Intelligence~hero_vill$Alignment, col=rainbow(2), main="Intelligence Ratings")
boxplot(hero_vill$Strength~hero_vill$Alignment, col=rainbow(2), main="Strength Ratings")
boxplot(hero_vill$Speed~hero_vill$Alignment, col=rainbow(2), main="Speed Ratings")
boxplot(hero_vill$Durability~hero_vill$Alignment, col=rainbow(2), main="Durability Ratings")
boxplot(hero_vill$Power~hero_vill$Alignment, col=rainbow(2), main="Power Ratings")
boxplot(hero_vill$Combat~hero_vill$Alignment, col=rainbow(2), main="Combat Ratings")

SAS code for t-tests & logistic regression:


*data processing;
data hero_vill;
infile "~/Blog/hero_vill_1.csv" dlm=',' dsd missover;
input Name :$32. Alignment $ Intelligence Strength Speed Durability Power Combat Total;
if Alignment='good' then Hero=1;
else if Alignment='bad' then Hero=0;
else delete;
run;
*t-tests (one analysis per variable listed in the var statement);
proc ttest data=hero_vill plots=none;
class alignment;
var intelligence strength speed durability power combat;
run;
*logistic regression;
proc logistic data=hero_vill;
model Hero(event='1')=Intelligence Strength Speed Durability Power Combat;
run;
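For readers without SAS, the same two steps can be sketched in R. To keep the example self-contained it runs on simulated stand-in data, not the real ratings, so the numbers it prints mean nothing; only the column names match the SAS step above. It also assumes Welch (unequal variance) intervals, which may not be the variant reported in the post.

```r
# Simulated stand-in data (NOT the real ratings): 200 characters with random scores
set.seed(1)
n <- 200
traits <- c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")
hero_vill <- as.data.frame(matrix(sample(0:100, n * 6, replace = TRUE), ncol = 6,
                                  dimnames = list(NULL, traits)))
hero_vill$Alignment <- sample(c("bad", "good"), n, replace = TRUE)
hero_vill$Hero <- as.integer(hero_vill$Alignment == "good")   # 1 = hero, 0 = villain

# Welch t-test per trait; each interval is for mean(villain) - mean(hero)
is_bad <- hero_vill$Alignment == "bad"
ci <- t(sapply(traits, function(v) {
  t.test(hero_vill[[v]][is_bad], hero_vill[[v]][!is_bad])$conf.int
}))
colnames(ci) <- c("lower", "upper")

# Logistic regression for P(Hero = 1), reported as odds ratios with Wald limits
fit <- glm(Hero ~ Intelligence + Strength + Speed + Durability + Power + Combat,
           data = hero_vill, family = binomial)
odds_ratios <- exp(cbind(estimate = coef(fit), confint.default(fit)))[-1, ]

ci
odds_ratios
```

Swapping the simulated data frame for the real file (for example via read.csv) should be the only change needed, provided the column names match.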

Super(hero) Statistics, Pt. I

Superheroes

Today, I will post the first of several entries relating to a unique dataset that I’ve put together, one that contains over 600 rows of data on a topic of great importance... superheroes and supervillains! Okay, analyzing superhero data may not be the most urgent matter that we as a society face, but:

1) I think that it is often useful to include entertaining applications when demonstrating what may otherwise be dry statistical methodology (though, admittedly, I sometimes find the “dry” statistics entertaining as well).

2) I find it fun to casually talk to friends about superheroes (especially those in recent movies) simply because, well, it’s just that: fun! And much like sabermetricians use statistics to challenge what is thought to be conventional baseball wisdom, I hope to use statistics to address pertinent topics relating to superheroes and supervillains (let’s call this discipline supermetrics?).

Setting the Stage

Before I get into the statistical analysis of the data, I must put a few things on the table.

1) Disclaimer: I do not claim to be a superhero enthusiast, expert, or anything of the kind. I don’t think I have ever read a comic book in my life. I am, however, a fan of superhero movies (e.g., Man of Steel, The Dark Knight, Captain America, The Avengers) along with the television show Arrow. So what does this tidbit have to do with anything? For one, with regard to my statistical analyses, it’s probably a good thing that I’m not a huge supporter of any given hero or villain; this should help keep the results unbiased. On the other hand, my unfamiliarity with the comic book universe is a disadvantage in terms of data familiarity. Namely, I am fairly confident that I would not be able to recognize a superhero that is over- or underrated (more on the data below), which makes it more likely that errors go unnoticed. With this in mind, if any comic book experts out there see any peculiar results, they are certainly welcome to leave a comment or drop me an email.

2) Data: The data that I will use to conduct my analyses come courtesy of the Superhero Database. This site houses a collection of information about 611 superheroes and villains. Each character (with data) is given a score between 0 and 100 (except in a few rare cases, where characters are deemed worthy of scores greater than 100) for six different traits: Intelligence, Strength, Speed, Durability, Power, and Combat. For example, here are the ratings for Batman:

Intelligence Strength Speed Durability Power Combat
100 18 27 42 37 100

Correction. Each character actually has two sets of these ratings. One set comes from the website itself (which was created to log and rate superheroes), while the other set is an average of user ratings. I chose to use the site’s ratings (the first set) because it allowed for a more complete dataset (many heroes and villains were not rated by users). In the event that the users rated the character and the site itself did not, I simply used the users’ ratings. After using this strategy, I came out with 440 superheroes/villains on which to perform analyses (the remaining 171 had no usable ratings from either the site or its users).
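The fallback rule described above can be sketched in a few lines of R. Everything in this example is hypothetical: the character names, the column names (site_intelligence, user_intelligence) and the values are placeholders for illustration, and only one of the six traits is shown.

```r
# Hypothetical layout: one site rating and one user rating per character (NA = not rated)
ratings <- data.frame(
  Name = c("Character A", "Character B", "Character C", "Character D"),
  site_intelligence = c(100, NA, 75, NA),
  user_intelligence = c(95, 60, NA, NA)
)

# Prefer the site's rating; fall back to the users' rating when the site has none
ratings$Intelligence <- ifelse(is.na(ratings$site_intelligence),
                               ratings$user_intelligence,
                               ratings$site_intelligence)

# Drop characters with no usable rating from either source
usable <- ratings[!is.na(ratings$Intelligence), ]
usable$Intelligence    # 100 60 75
nrow(usable)           # 3
```

In this toy example Character D is dropped, which is the situation the 171 excluded subjects were in.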

I have one final note on the data. It is obviously highly subjective. When a human being is rating anything, there is always an element of subjectivity. Superheroes pose a particular challenge because they are fictional (so we are assigning subjective ratings to characters that were developed based on an author’s subjectivity). So please do not take the analyses too seriously, but certainly have fun critiquing and debating them! (Note: I do like how ratings can take on any discrete value between 0 and 100. This large range allows raters to address subtle differences among superheroes/villains.)

Visualizing Differences in Superhero Traits

The first items that I will present in analyzing this dataset are heatmaps.

Heroes

I created a non-random sample of superheroes to include big names. I did this using my opinions of which superheroes are popular and by surfing the web for lists of popular superheroes. After I had a list of about 20 heroes, I generated a heat matrix using R that color-coded the six attributes for each hero.

heroes

Superheroes and their ratings. A darker shade corresponds to a higher rating relative to peers.

I really like this form of visualization. One is able to view 126 individual data points in this graphic and I can assure you that it is easier to gain a grasp of the data using colors in the matrix rather than numeric values. Does anything stand out to you? Martian Manhunter, Superman, and Thor all have relatively high ratings across the board. And as we saw above, while Batman is exceptionally intelligent and excels in combat, he could use work in the other four departments.
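The "relative to peers" part of the shading comes from the scale="column" argument of heatmap(), which centres each column at zero and divides it by its standard deviation before colours are assigned. Here is a minimal sketch of that step using only the three characters whose ratings are printed in these posts (Batman here, Joker and Blob in Pt. II). Mixing a hero with two villains is purely for illustration and is not how the actual heatmaps were grouped.

```r
traits <- c("Intelligence", "Strength", "Speed", "Durability", "Power", "Combat")

# Ratings copied from the tables in these posts
m <- rbind(
  Batman = c(100, 18, 27, 42, 37, 100),
  Joker  = c(100, 10, 12, 56, 22, 90),
  Blob   = c(10, 83, 23, 95, 26, 72)
)
colnames(m) <- traits

# Same standardisation that heatmap(..., scale = "column") applies:
# subtract each column's mean, then divide by its standard deviation
z <- scale(m)
round(z, 2)
```

A positive value means the character rates above the average of the characters in the matrix for that trait, so the shading depends on who else is included.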

Villains

I replicated the heatmap visualization using popular supervillains in place of superheroes. (Note: This graphic is probably a bit biased toward Batman villains, because those are the most familiar to me and I felt compelled to include them!)

villains

Supervillains and their ratings. A darker shade corresponds to a higher rating relative to peers.

Here we can see that General Zod (whom you may have recently seen in Man of Steel) is, for lack of a better term, a beast. He possesses an attribute score of at least 94 in each of the six categories. Interestingly, it becomes apparent that many of the famous Batman villains (e.g., Joker, Penguin, Riddler, and Two-Face) rely on a high level of intellect to challenge Batman. Then, however, we have Ra’s Al Ghul and Bane; both villains possess extraordinary combat ratings.

Closing Remarks

This post was devoted to introducing the superhero/villain dataset with which I will be working in subsequent posts. Additionally, I took a first step in statistically analyzing the data by visualizing differences in characteristics among heroes and villains. I’d be interested to hear whether anything jumps out to you when studying those graphics!

In future posts, I aim to move beyond visualizing the data by applying various statistical methods to them. Hopefully you find the superhero data and the questions that they can address not only fun and entertaining but also of practical importance with regard to the statistical methods being employed.

******************************************************************************

Download the data (.xls): Superhero & Supervillain Data

R code:

#source used: http://flowingdata.com/2010/01/21/how-to-make-a-heatmap-a-quick-and-easy-solution/
#will need to install RColorBrewer (heatmap() itself is part of base R)
library(RColorBrewer) #provides brewer.pal()

#generate heatmap for select, non-random, heroes
super <- read.csv("~/Blog/select.csv", stringsAsFactors=FALSE) #read-in CSV file; insert appropriate path
super <- super[order(super$Name),] #sort heroes alphabetically
row.names(super) <- super$Name
super <- super[,3:8] #using these 6 columns for analysis
super_matrix <- data.matrix(super)
super_heatmap <- heatmap(super_matrix, Rowv=NA, Colv=NA,
                         col = colorRampPalette(brewer.pal(9,"Greens"))(1000), scale="column",
                         margins=c(5,10))

#generate heatmap for select, non-random, villains
vill <- read.csv("~/Blog/selectvill.csv", stringsAsFactors=FALSE) #read-in CSV file; insert appropriate path
vill <- vill[order(vill$Name),] #sort villains alphabetically
row.names(vill) <- vill$Name
vill <- vill[,3:8] #using these 6 columns for analysis
vill_matrix <- data.matrix(vill)
vill_heatmap <- heatmap(vill_matrix, Rowv=NA, Colv=NA,
                         col = colorRampPalette(brewer.pal(9,"Reds"))(1000), scale="column",
                         margins=c(5,10))