The following packages are used in this analysis:

library(dplyr)
library(ggplot2)
library(jsonlite)
library(knitr)
library(readr)
library(stringr)
library(tidyr)
library(zoo)

Introduction

Hearthstone is a popular collectible card game published by Blizzard Entertainment in 2014, which is based on the Warcraft series by the same company. The goal of the game is to build a deck of 30 cards and defeat the opponent who also has a deck of 30 cards.

Cards can be classified according to the following categories:

  • Class: Neutral cards can be used by all nine classes, while Class cards can only be used by the indicated class.
  • Rarity: How often cards can be found when opening card packs.
    • Cards are indicated as Free, Common, Rare, Epic and Legendary, ordered by increasing rarity.
  • Type: Different types of cards have different effects in the game:
    • Minions are played on the game board and can attack Heroes or other minions.
    • Spells are Class abilities that generate a variety of effects on the board.
    • Weapons are items that Heroes can equip to attack Heroes or other minions.
    • Heroes represent the player; the player loses when the Hero’s health reaches 0.
  • Set: Cards are released in designated sets, which are usually based on a certain theme in the Warcraft universe.
    • The core game consists of two sets, Basic and Classic, which are always available in the game.
    • Expansion sets add newer cards to the game, and are released at regular intervals

Approaching the Question

Which are the most popular cards used in Ranked decks?

Which are the most popular cards used in Ranked decks?

We focus on the Ranked format where players get to decide which cards to include in their deck, therefore the cards’ popularity are more accurately represented, and the gameplay is not subject to additional constraints that other game modes (like Tavern Brawls and Adventures) may impose.

How do we determine popularity?

We determine popularity by the number of decks that include at least 1 copy in the starting 30 cards (not generated by other effects).

A deck can include at most 2 copies of any card (1 for Legendary cards), thus a card’s popularity is not heavily influenced by the number of copies players wish to use.

Possible biases to consider

  • Decks and cards presented in this data may not necessarily be used by players in the game itself. However, without the means to track actual matches being played in the game itself, we instead use the list of decks submitted by players as an approximation of the decks and cards often used by the playerbase.

  • Since Neutral cards can be used by multiple classes, they should be more popular than Class cards.

  • For the Wild format, cards from the older sets may be more popular simply because they have been in the game longer.

  • For the Standard format, cards from the Basic and Classic sets will be more popular because they do not rotate out of the format unlike expansion cards.

Why address such a question?

If a certain card becomes too popular (i.e. the community thinks players must include it in their decks), it reduces the card variety in the metagame and makes gameplay frustrating for other players (amongst other consequences). In the long term, this may lead to player attrition and loss of potential revenue (when players purchase card packs or other cosmetics).

Historically, Blizzard has dealt with problematic cards in one of several ways:

Overview of Datasets used

There are three datasets we will use:

  • data.csv contains a list of decks submitted by players to HearthPwn from 2013 (pre-launch) to 2017,
  • refs.json contains detailed information about all cards (both collectible and non-collectible) up to March 2017.
  • cards_collectible.json contains detailed information about all the cards that are collectible in the game.

Loading the data

Assuming that the datasets are stored in a separate folder at the same subfolder level, the following code reads the datasets to be used:

data_path <- file.path("..", "data")

# data for decks
decks_raw <- read_csv(file.path(data_path, "data.csv"))
# data for collectible cards
cards_raw <- tbl_df(fromJSON(file.path(data_path, "cards_collectible.json"), flatten = TRUE))
# data for all cards
cards_all_raw <- fromJSON(file.path(data_path, "refs.json"), flatten = TRUE)

Decks data

The first few rows and columns of the raw data decks_raw is shown below:

craft_cost date deck_archetype deck_class deck_format deck_id deck_set deck_type rating title
9740 2016-02-19 Unknown Priest W 433004 Explorers Tavern Brawl 1 Reno Priest
9840 2016-02-19 Unknown Warrior W 433003 Explorers Ranked Deck 1 RoosterWarrior
2600 2016-02-19 Unknown Mage W 433002 Explorers Theorycraft 1 Annoying
15600 2016-02-19 Unknown Warrior W 433001 Explorers None 0 Standart pay to win warrior
7700 2016-02-19 Unknown Paladin W 432997 Explorers Ranked Deck 1 Palamix
5740 2016-02-19 Unknown Warrior W 432995 Explorers Ranked Deck 2 Kolento’s Elise Control Warrior

The decks_raw data has 346232 rows and 41 columns. The first 11 columns craft_cost to user describe the deck’s attributes (like date submitted, class, deck format) while the remaining 30 columns describe the cards each deck contains (based on the card’s unique ID which can be referenced from the cards_raw data).

Detailed information on the variables can be found on the Kaggle dataset.

names(decks_raw)
##  [1] "craft_cost"     "date"           "deck_archetype" "deck_class"    
##  [5] "deck_format"    "deck_id"        "deck_set"       "deck_type"     
##  [9] "rating"         "title"          "user"           "card_0"        
## [13] "card_1"         "card_2"         "card_3"         "card_4"        
## [17] "card_5"         "card_6"         "card_7"         "card_8"        
## [21] "card_9"         "card_10"        "card_11"        "card_12"       
## [25] "card_13"        "card_14"        "card_15"        "card_16"       
## [29] "card_17"        "card_18"        "card_19"        "card_20"       
## [33] "card_21"        "card_22"        "card_23"        "card_24"       
## [37] "card_25"        "card_26"        "card_27"        "card_28"       
## [41] "card_29"
glimpse(decks_raw) 
## Observations: 346,232
## Variables: 41
## $ craft_cost     <int> 9740, 9840, 2600, 15600, 7700, 5740, 1800, 1800...
## $ date           <date> 2016-02-19, 2016-02-19, 2016-02-19, 2016-02-19...
## $ deck_archetype <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Un...
## $ deck_class     <chr> "Priest", "Warrior", "Mage", "Warrior", "Paladi...
## $ deck_format    <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W...
## $ deck_id        <int> 433004, 433003, 433002, 433001, 432997, 432995,...
## $ deck_set       <chr> "Explorers", "Explorers", "Explorers", "Explore...
## $ deck_type      <chr> "Tavern Brawl", "Ranked Deck", "Theorycraft", "...
## $ rating         <int> 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,...
## $ title          <chr> "Reno Priest", "RoosterWarrior", "Annoying", "S...
## $ user           <chr> "FunKaliTy", "RooosterRooo", "Messalm", "KingSn...
## $ card_0         <int> 9, 75, 195, 75, 476, 75, 1007, 48, 272, 147, 11...
## $ card_1         <int> 237, 75, 195, 290, 476, 75, 90, 48, 613, 290, 3...
## $ card_2         <int> 279, 285, 315, 304, 679, 400, 285, 397, 613, 29...
## $ card_3         <int> 545, 401, 315, 338, 943, 400, 389, 753, 635, 41...
## $ card_4         <int> 613, 401, 531, 401, 890, 401, 1026, 753, 220, 4...
## $ card_5         <int> 791, 546, 555, 401, 756, 401, 523, 1014, 825, 4...
## $ card_6         <int> 797, 546, 555, 546, 1073, 546, 531, 1029, 825, ...
## $ card_7         <int> 841, 596, 614, 546, 1073, 546, 642, 1029, 1186,...
## $ card_8         <int> 1186, 596, 614, 581, 1167, 596, 654, 1092, 1650...
## $ card_9         <int> 1189, 635, 662, 785, 1721, 596, 757, 1098, 1650...
## $ card_10        <int> 1363, 635, 662, 785, 1753, 636, 825, 1940, 1753...
## $ card_11        <int> 1365, 785, 1084, 810, 1783, 785, 921, 1997, 175...
## $ card_12        <int> 1367, 785, 1084, 825, 1786, 785, 985, 2012, 179...
## $ card_13        <int> 1650, 810, 1659, 825, 1786, 912, 994, 2012, 179...
## $ card_14        <int> 1933, 1023, 1659, 912, 1793, 1023, 1037, 2033, ...
## $ card_15        <int> 1938, 1186, 1753, 1023, 1793, 1023, 1073, 2033,...
## $ card_16        <int> 2040, 1657, 1753, 1023, 1804, 1657, 1098, 2035,...
## $ card_17        <int> 2067, 1659, 1783, 1074, 1804, 1659, 1109, 2046,...
## $ card_18        <int> 1935, 1659, 1783, 1074, 2026, 1659, 1182, 2046,...
## $ card_19        <int> 2262, 1721, 1793, 1186, 2026, 1793, 1940, 2065,...
## $ card_20        <int> 374, 1781, 1793, 1657, 2027, 1793, 2050, 2067, ...
## $ card_21        <int> 2280, 1781, 1801, 1721, 2029, 1805, 2064, 2071,...
## $ card_22        <int> 2511, 2021, 1801, 2018, 2029, 1805, 2064, 2071,...
## $ card_23        <int> 2555, 2021, 2037, 2296, 2064, 1808, 2257, 2073,...
## $ card_24        <int> 2566, 2064, 2037, 2262, 2078, 1808, 2490, 2093,...
## $ card_25        <int> 2582, 2064, 2064, 336, 374, 2021, 2495, 2306, 2...
## $ card_26        <int> 2683, 2078, 2064, 2729, 2717, 2729, 2495, 2490,...
## $ card_27        <int> 2736, 2510, 2078, 2729, 2717, 2729, 2521, 2507,...
## $ card_28        <int> 2568, 2729, 38710, 2736, 2889, 2736, 2576, 2520...
## $ card_29        <int> 2883, 2736, 38710, 2760, 2889, 2951, 2580, 2752...

There are 8 rows that contain missing data. The missing values are in the 10th column, which contains the decks’ title as submitted by the users, so they can be safely ignored.

which(is.na(decks_raw), arr.ind = TRUE)
##         row col
## [1,]  16747  10
## [2,] 175608  10
## [3,] 216047  10
## [4,] 238021  10
## [5,] 278491  10
## [6,] 326192  10
## [7,] 329285  10
## [8,] 329286  10

Cards data

The first few rows and columns of the cards_raw data is shown below (some columns are shown truncated):

col_trunc <- function(col, width, side = c("right", "left", "center"), 
                      ellipsis = "..."){
    if (class(col) == "character"){
        col <- str_trunc(col, width, side, ellipsis)
        return(col)
    } else {
        return(col)
    }
}

cards_raw[1:12] %>% 
    lapply(col_trunc, width = 10, side = "right") %>%
    as_tibble() %>% 
    head()
artist cardClass collectible cost dbfId flavor id name rarity set text type
Nutthap… MAGE TRUE 5 2539 It’s on… AT_001 Flame L… COMMON TGT Deal $8… SPELL
Tooth MAGE TRUE 3 2541 Burning… AT_002 Effigy RARE TGT Secr… SPELL
Arthur … MAGE TRUE 2 2545 And he … AT_003 Fallen … RARE TGT Your He… MINION
Gabor S… MAGE TRUE 1 2572 Now wit… AT_004 Arcane … EPIC TGT Deal $2… SPELL
Mike Sass MAGE TRUE 3 2542 It’s al… AT_005 Polymor… RARE TGT Transfo… SPELL
Dan Scott MAGE TRUE 4 2549 Is he a… AT_006 Dalaran… COMMON TGT Insp… MINION

This dataset has 1751 rows and 65 columns. The first 32 columns artist to questReward describe the characteristics of each card. Some of the characteristics, including name, Mana cost, race, health and text, can also been seen on the card itself.

Each card has two unique identifiers: a character/string id and an integer dbfId. The integer dbfId is used across both the deck and card datasets, which can be joined later.

names(cards_raw[1:32])
##  [1] "artist"             "cardClass"          "collectible"       
##  [4] "cost"               "dbfId"              "flavor"            
##  [7] "id"                 "name"               "rarity"            
## [10] "set"                "text"               "type"              
## [13] "mechanics"          "attack"             "health"            
## [16] "referencedTags"     "race"               "elite"             
## [19] "targetingArrowText" "durability"         "overload"          
## [22] "spellDamage"        "armor"              "faction"           
## [25] "howToEarn"          "howToEarnGolden"    "collectionText"    
## [28] "classes"            "multiClassGroup"    "entourage"         
## [31] "hideStats"          "questReward"

Some of the characteristics, including name, Mana cost, race, health and text, can also been seen on the card itself. Key mechanics are highlighted bold in the card text:

The remaining columns are in fact nested under the playRequirements field, which specifies how certain cards can only be played in the game. These requirements are also implicitly/explicitly stated in the card text. (Column names are shown truncated)

names(cards_raw)[33:65] %>% 
    str_trunc(25, "left")
##  [1] "...ents.REQ_MINION_TARGET" "...nts.REQ_TARGET_TO_PLAY"
##  [3] "...ments.REQ_ENEMY_TARGET" "...s.REQ_TARGET_WITH_RACE"
##  [5] "..._MINIMUM_ENEMY_MINIONS" "...s.REQ_TARGET_FOR_COMBO"
##  [7] "...ts.REQ_FRIENDLY_TARGET" "...EQ_TARGET_IF_AVAILABLE"
##  [9] "...s.REQ_NUM_MINION_SLOTS" "...NIMUM_FRIENDLY_MINIONS"
## [11] "...ARGET_WITH_DEATHRATTLE" "...OF_RACE_DIED_THIS_GAME"
## [13] "..._MINION_DIED_THIS_GAME" "...s.REQ_LEGENDARY_TARGET"
## [15] "...BLE_AND_DRAGON_IN_HAND" "....REQ_TARGET_MAX_ATTACK"
## [17] "...nts.REQ_NONSELF_TARGET" "...s.REQ_STEALTHED_TARGET"
## [19] "...ements.REQ_HERO_TARGET" "...T_OR_MANA_CRYSTAL_SLOT"
## [21] "...s.REQ_UNDAMAGED_TARGET" "...ts.REQ_WEAPON_EQUIPPED"
## [23] "...nts.REQ_DAMAGED_TARGET" "...EQ_MUST_TARGET_TAUNTER"
## [25] "....REQ_TARGET_MIN_ATTACK" "..._MINIMUM_TOTAL_MINIONS"
## [27] "...ments.REQ_DRAG_TO_PLAY" "...ONE_CAP_FOR_NON_SECRET"
## [29] "...ents.REQ_FROZEN_TARGET" "...NO_3_COST_CARD_IN_DECK"
## [31] "...NIMUM_FRIENDLY_SECRETS" "...s.REQ_CANNOT_PLAY_THIS"
## [33] "...ENTAL_PLAYED_LAST_TURN"

Additional information on the variables can be found on HearthstoneJSON.

The following card, based on its text, would require certain conditions to be played (a minion on the board, that has not taken any damage):

The following computes the number of missing values in each field, with the exception of those that are present as lists or data frames (mechanics, referencedTags, classes, entourage).

Many of the columns contain significant amounts of missing values. However, some key attributes that define the card such as cardClass, collectible, dbfId, name, set, and type do not contain any missing values.

# code is not run
sapply(cards_raw, function(x) sum(is.na(x)))

Pre-processing

Decks data

The raw dataset is split into two, decks_attr containing the deck attributes and decks_comp containing the deck composition (cards), joined by the deck_id identifier. We also exclude decks created before launch (there were many card changes in the alpha and beta stages, making card popularity very volatile).

The decks_comp data will be pivoted to long format later on, thus excluding fields that are not related to the cards will minimize the size of the dataset.

launch_date <- as.Date("2014-03-11")

decks_attr <- decks_raw %>% 
    filter(date >= launch_date) %>% 
    select(deck_id, craft_cost:deck_format, deck_set:user)

decks_comp <- decks_raw %>% 
    filter(date >= launch_date) %>% 
    select(deck_id, card_0:card_29)

Within decks_attr, the factor columns are identified and converted to the appropriate type.

fct_cols_attr <- c("deck_archetype", "deck_class", "deck_format", "deck_set", "deck_type")

decks_attr[fct_cols_attr] <- lapply(decks_attr[fct_cols_attr], factor)
rm(fct_cols_attr)

Years and Months

While each deck has a submission date, we may also be interested in grouping the decks by month (which corresponds to Ranked seasons) and by year (which is marked by expansion release dates instead of calendar dates).

Years in the game based on a time period that:

  • Starts with the release of the first card set of each year, which usually falls around April.
  • Ends with the release of the first card set the next year (non-inclusive).

So based on the release dates (taken from the above link), the years would be:

  • 2014-03-11 to 2015-04-01 (Live, Naxxramas, Goblin vs Gnomes)
  • 2015-04-02 to 2016-04-25 (Blackrock, Grand Tournament, League of Explorers)
  • 2016-04-26 to 2017-03-19 (Old Gods, Karazhan, Gadgetzan)
# month (with year)
decks_attr$hsmonth <- as.yearmon(decks_attr$date)

# year (not by calendar)
decks_attr$hsyear <- case_when(
        decks_attr$date <= as.Date("2015-04-01") ~ "2014",
        decks_attr$date >= as.Date("2015-04-02") & 
            decks_attr$date <= as.Date("2016-04-25") ~ "2015",
        decks_attr$date >= as.Date("2016-04-26") ~ "2016"
    ) %>% 
        factor()

Deck Format

The Standard and Wild formats were formally introduced into the game on 2016-04-26 with the release of Whispers of the Old Gods; however the plot below shows that many decks from June 2014 to April 2016 were mistakenly labelled as Wild.

We can simply relabel all decks created before 2016-04-26 as Standard since all cards before then are not separated by format:

decks_attr$deck_format[decks_attr$date < as.Date("2016-04-26")] <- "S"

A summary of the processed data is shown below:

## Observations: 324,104
## Variables: 13
## $ deck_id        <int> 433004, 433003, 433002, 433001, 432997, 432995,...
## $ craft_cost     <int> 9740, 9840, 2600, 15600, 7700, 5740, 1800, 1800...
## $ date           <date> 2016-02-19, 2016-02-19, 2016-02-19, 2016-02-19...
## $ deck_archetype <fct> Unknown, Unknown, Unknown, Unknown, Unknown, Un...
## $ deck_class     <fct> Priest, Warrior, Mage, Warrior, Paladin, Warrio...
## $ deck_format    <fct> S, S, S, S, S, S, S, S, S, S, S, S, S, S, S, S,...
## $ deck_set       <fct> Explorers, Explorers, Explorers, Explorers, Exp...
## $ deck_type      <fct> Tavern Brawl, Ranked Deck, Theorycraft, None, R...
## $ rating         <int> 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,...
## $ title          <chr> "Reno Priest", "RoosterWarrior", "Annoying", "S...
## $ user           <chr> "FunKaliTy", "RooosterRooo", "Messalm", "KingSn...
## $ hsmonth        <S3: yearmon> Feb 2016, Feb 2016, Feb 2016, Feb 2016,...
## $ hsyear         <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,...
##     deck_id         craft_cost         date           
##  Min.   : 36923   Min.   :    0   Min.   :2014-03-11  
##  1st Qu.:253573   1st Qu.: 2840   1st Qu.:2015-05-26  
##  Median :428597   Median : 5120   Median :2016-02-09  
##  Mean   :419989   Mean   : 5745   Mean   :2015-12-21  
##  3rd Qu.:603508   3rd Qu.: 7840   3rd Qu.:2016-08-09  
##  Max.   :749548   Max.   :48000   Max.   :2017-03-19  
##                                                       
##          deck_archetype     deck_class    deck_format
##  Unknown        :220501   Mage   :42230   S:307743   
##  Midrange Shaman:  5472   Priest :41756   W: 16361   
##  Control Priest :  5135   Paladin:39368              
##  Control Warrior:  4939   Warlock:35598              
##  Tempo Mage     :  4545   Druid  :35488              
##  Midrange Hunter:  4371   Shaman :33969              
##  (Other)        : 79141   (Other):95695              
##              deck_set              deck_type          rating        
##  Explorers       : 57307   Arena        :  8178   Min.   :   0.000  
##  Old Gods        : 49895   None         : 75120   1st Qu.:   1.000  
##  Blackrock Launch: 38900   PvE Adventure:  9059   Median :   1.000  
##  Gadgetzan       : 31329   Ranked Deck  :202104   Mean   :   2.777  
##  Naxx Launch     : 22283   Tavern Brawl :  6360   3rd Qu.:   1.000  
##  Yogg Nerf       : 22175   Theorycraft  : 19686   Max.   :4016.000  
##  (Other)         :102215   Tournament   :  3597                     
##     title               user              hsmonth      hsyear      
##  Length:324104      Length:324104      Min.   :2014   2014: 65119  
##  Class :character   Class :character   1st Qu.:2015   2015:128062  
##  Mode  :character   Mode  :character   Median :2016   2016:130923  
##                                        Mean   :2016                
##                                        3rd Qu.:2017                
##                                        Max.   :2017                
## 

Cards data

Many columns in the cards_raw data pertain to the card’s stats, mechanics and play requirements, which is better explained by the text on the card image. So we choose to only include the columns that we consider are the core properties of the card:

  • ID
  • Name
  • Class
  • Rarity
  • Type
  • Card Set
cards_simple <- select(cards_raw,
                       dbfId,  name, cost,   cardClass, 
                       rarity, type,  set, collectible, id)
## Observations: 1,751
## Variables: 9
## $ dbfId       <int> 2539, 2541, 2545, 2572, 2542, 2549, 2571, 2544, 25...
## $ name        <chr> "Flame Lance", "Effigy", "Fallen Hero", "Arcane Bl...
## $ cost        <int> 5, 3, 2, 1, 3, 4, 3, 6, 8, 5, 4, 4, 1, 3, 2, 2, 4,...
## $ cardClass   <chr> "MAGE", "MAGE", "MAGE", "MAGE", "MAGE", "MAGE", "M...
## $ rarity      <chr> "COMMON", "RARE", "RARE", "EPIC", "RARE", "COMMON"...
## $ type        <chr> "SPELL", "SPELL", "MINION", "SPELL", "SPELL", "MIN...
## $ set         <chr> "TGT", "TGT", "TGT", "TGT", "TGT", "TGT", "TGT", "...
## $ collectible <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
## $ id          <chr> "AT_001", "AT_002", "AT_003", "AT_004", "AT_005", ...

The set column contains abbreviated names or nicknames and is not necessarily informative; we create a new column that uses the actual names of the card sets:

unique(cards_simple$set)
##  [1] "TGT"          "BOOMSDAY"     "BRM"          "GANGS"       
##  [5] "CORE"         "EXPERT1"      "HOF"          "NAXX"        
##  [9] "GILNEAS"      "GVG"          "HERO_SKINS"   "ICECROWN"    
## [13] "KARA"         "LOE"          "LOOTAPALOOZA" "OG"          
## [17] "UNGORO"
# list for recoding card sets
cardset_lst <- list(
    "CORE" = "Basic",
    "EXPERT1" = "Classic",
    # 2014 sets
    "NAXX" = "Curse of Naxxramas",
    "GVG" = "Goblins vs Gnomes",
    # 2015 sets
    "BRM" = "Blackrock Mountain",
    "TGT" = "The Grand Tournament",
    "LOE" = "League of Explorers",
    # 2016 sets
    "OG" = "Whispers of the Old Gods",
    "KARA" = "One Night in Karazhan",
    "GANGS" = "Mean Streets of Gadgetzan",
    "HOF" = "Hall of Fame",
    # 2017 sets
    "UNGORO" = "Journey to Un'Goro",
    "ICECROWN" = "Knights of the Frozen Throne",
    "LOOTAPALOOZA" = "Kobolds & Catacombs",
    # 2018 sets
    "GILNEAS" = "The Witchwood",
    "BOOMSDAY" = "The Boomsday Project"
)

cards_simple$card_set <- recode(cards_simple$set, !!!cardset_lst,
                                .default = "Other")

# drop the set column to avoid confusion (it remains in the cards_raw dataset)
cards_simple$set <- NULL
# remove list (no longer needed)
rm(cardset_lst)

Some columns are entirely uppercase, which we convert to title case for readability:

titlecase_cols <- c("cardClass", "rarity", "type")
cards_simple[titlecase_cols] <- lapply(cards_simple[titlecase_cols], 
                                       str_to_title)
rm(titlecase_cols)

The factor columns are then identified and converted to the appropriate type:

fct_cols_cards <- c("cardClass", "rarity", "type", "card_set")

cards_simple[fct_cols_cards] <- lapply(cards_simple[fct_cols_cards], factor)
rm(fct_cols_cards)

A summary of the processed data is shown below:

##      dbfId           name                cost          cardClass  
##  Min.   :    7   Length:1751        Min.   : 0.000   Neutral:657  
##  1st Qu.: 1987   Class :character   1st Qu.: 2.000   Paladin:123  
##  Median :38957   Mode  :character   Median : 4.000   Hunter :122  
##  Mean   :25375                      Mean   : 3.856   Mage   :122  
##  3rd Qu.:43163                      3rd Qu.: 5.000   Warlock:122  
##  Max.   :53187                      Max.   :20.000   Druid  :121  
##                                     NA's   :22       (Other):484  
##        rarity        type      collectible         id           
##  Common   :612   Hero  :  33   Mode:logical   Length:1751       
##  Epic     :298   Minion:1192   TRUE:1751      Class :character  
##  Free     :142   Spell : 471                  Mode  :character  
##  Legendary:253   Weapon:  55                                    
##  Rare     :446                                                  
##                                                                 
##                                                                 
##                          card_set  
##  Classic                     :236  
##  Basic                       :142  
##  Journey to Un'Goro          :135  
##  Knights of the Frozen Throne:135  
##  Kobolds & Catacombs         :135  
##  The Boomsday Project        :135  
##  (Other)                     :833

Mislabelled cards

Card IDs in the decks_comp data may be incorrect, due to the following reasons:

  • Deck submissions are generated by user input.
  • Users choose cards based on card names, not card IDs.
  • However, there are cards that share the same name, but have different IDs.
    • Amongst each set of replicates, only one is collectible in-game.

Specifically, we are looking for the version of each card that is collectible (since all cards used in Ranked decks must be collectible). This step requires using the full card data cards_all_raw (which contains non-collectible cards).

The first step is to compile a list of unique card IDs found in decks_comp:

# IDs of all cards used
cards_used  <- decks_comp %>%
        select(card_0:card_29) %>%
        unlist(use.names = FALSE) %>%  # flatten into a vector
        unique() %>%
        sort()

The following code filters for those IDs that are not found in our collectible cards data cards_raw:

# IDs of missing cards (using dbfId)
missing_cards <- cards_used[!cards_used %in% cards_raw$dbfId]

Using the above vector of IDs, we filter out the relevant non-collectible cards from the cards_all_raw dataset.

# filter for those missing cards only
mssng_cards <- cards_all_raw %>% 
    select(dbfId, name, cardClass, type, collectible) %>% 
    filter(dbfId %in% missing_cards)
dbfId name cardClass type collectible
40341 Cleave NEUTRAL SPELL NA
2177 Dark Wispers DRUID SPELL NA
42146 Doppelgangster NEUTRAL MINION NA
38319 Druid of the Claw DRUID MINION NA
2230 Druid of the Fang DRUID MINION NA
2310 Druid of the Flame DRUID MINION NA

To find the correct IDs, we join the cards by name to the cards_simple dataset. The dbfID.x on the left would be replaced by the dbfID.y on the right:

mislabelled <- mssng_cards %>% 
    select(dbfId, name) %>% 
    # also drops uncollectible mislabelled cards (which are not in simple_cards)
    inner_join(cards_simple, by = "name") %>% 
    arrange(name)
mislabelled
dbfId.x name dbfId.y cost cardClass rarity type collectible id card_set
40341 Cleave 940 2 Warrior Free Spell TRUE CS2_114 Basic
2177 Dark Wispers 2009 6 Druid Epic Spell TRUE GVG_041 Goblins vs Gnomes
42146 Doppelgangster 40953 5 Neutral Rare Minion TRUE CFM_668 Mean Streets of Gadgetzan
38319 Druid of the Claw 692 5 Druid Common Minion TRUE EX1_165 Classic
2230 Druid of the Fang 2048 5 Druid Common Minion TRUE GVG_080 Goblins vs Gnomes
2310 Druid of the Flame 2292 3 Druid Common Minion TRUE BRM_010 Blackrock Mountain
40402 Evolve 38266 1 Shaman Rare Spell TRUE OG_027 Whispers of the Old Gods
41409 Jade Idol 40372 1 Druid Rare Spell TRUE CFM_602 Mean Streets of Gadgetzan
468 Mark of Nature 151 3 Druid Common Spell TRUE EX1_155 Classic
41609 Nefarian 2261 9 Neutral Legendary Minion TRUE BRM_030 Blackrock Mountain
38113 Raven Idol 13335 1 Druid Common Spell TRUE LOE_115 League of Explorers
1161 Starfall 86 5 Druid Rare Spell TRUE NEW1_007 Classic
38710 Unstable Portal 1929 2 Mage Rare Spell TRUE GVG_003 Goblins vs Gnomes
38653 Wisp 179 0 Neutral Common Minion TRUE CS2_231 Classic
137 Wrath 836 2 Druid Common Spell TRUE EX1_154 Classic

We create a named list that can be used within recode():

# list values are correct ids
mislab_recode <- as.list(mislabelled$dbfId.y)
# list names are mislabelled ids
names(mislab_recode) <- mislabelled$dbfId.x

The following step applies the recode() function to all the card columns card_0 to card_29 in our decks_comp data, which will relabel all the incorrect IDs with their respective IDs (dbfID.x with dbfID.y based on the table above).

# relabel ids in data
decks_comp <- mutate_at(decks_comp, vars(starts_with("card")), 
                        recode, !!!mislab_recode)
# intermediate objects no longer needed
rm(cards_used, missing_cards, cards_all_raw, mssng_cards)

Other objects

The following objects are used for plotting and visualization:

# class colours
class_colors <- c(
    "Druid" = "#FF7D0A",
    "Hunter" = "#569B56", #"#ABD473",
    "Mage" = "#40C7EB",
    "Paladin" = "#F58CBA",
    "Priest" = "#FFFFFF",
    "Rogue" = "#FFF569",
    "Shaman" = "#0070DE",
    "Warlock" = "#8787ED",
    "Warrior" = "#C79C6E",
    "Neutral" = "#777777"
)

# rarity colours
rarity_colors <- c(
    "Free" = NA,
    "Common" = "#000000",
    "Rare" = "#1E90FF",
    "Epic" = "#9932CC",
    "Legendary" = "#FFB90F"
)

# release dates of card sets
release_dates <- c(
    "Launch" = "2014-03-11", 
    "Naxxramas" = "2014-07-22", 
    "GvG" = "2014-12-08",
    "Blackrock" = "2015-04-02",
    "TGT" = "2015-08-24",
    "Explorers" = "2015-11-12",
    "Old Gods" = "2016-04-26",
    "Karazhan" = "2016-08-11",
    "Gadgetzan" = "2016-12-01"
)

# darker plot theme for this doc
theme_doc <- theme(panel.spacing = unit(1, "points"),
                   panel.grid.major.y = element_blank(), 
                   plot.title = element_text(size = 12),
                   axis.text.y = element_text(size = 9),
                   panel.background = element_rect(fill = "#C3C3C3"))

Summary and Reflection

So far, we have looked at cards that users tend to include in decks in Standard format for Ranked play, which is also used for official Hearthstone tournaments - making these popular cards highly visible to a wide audience. We have also looked at card popularity when broken down by various categories, such as class, time period and card set.

Are there any limitations to the data that may have affected our analysis?

The major limitation of this data is that it only looks at decks submitted to a third-party website, which brings up the following issues:

  • Decks submitted may not necessarily be played in the game itself, either because other players think it is too weak, or because the user submits a joke deck that contains absurd combinations of cards and is not meant to be taken seriously.
  • There is no data on how often decks and cards are actually played in the Ranked format.
  • Likewise, there is no information on how effective the decks are at winning games. While the rating attribute may reflect how strong other players consider a deck, it is also biased towards the popularity of the user as well as the date of submission:
    • A deck may be initially strong and highly rated, but as new cards are introduced and old cards are removed from Standard format, the deck may wane in strength, but users are unlikely to retract their votes by this point in time.

How can we expand on this analysis?

  • Examine popular combination of cards that complement each other well.
  • Examine whether the crafting cost (in dust) of decks has any relation to its popularity (rating).