What Review Scores Mean for Games

Tyler Honeywell

In recent years, public opinion on the media we consume has been increasingly dominated by the idea of scores. Where once we might have judged a movie or video game by its promotional material, box art, and description, consumers now tend to seek out secondhand opinions before a purchase, or even a free download. Those opinions might come from friends or from full reviews online, but generally the easiest way to decide whether a piece of media is worth your time or money is to look at the scores it receives. Reviews can be found in hundreds of places online, and take many forms. Some score with stars, some with points, some with percentages. Some take the form of recommend/don't recommend, and some have no scoring system at all. In this analysis, we'll take a look at two platforms that handle reviews for games:

Metacritic

Metacritic is a review aggregation site, which compiles reviews from numerous partnered outlets as well as from its own users. This takes the form of two separate scores - a "metascore", compiled from the weighted average of reviews from a number of partnered sites, and a user score, from reviews submitted by users of the site. The metascore is on a scale of 0 to 100, while the user score is on a scale of 0 to 10, with one decimal place. Multiplying the user score by 10 puts the two on essentially the same scale.

Steam

Steam is a PC-based game launcher and marketplace with a user review feature. Reviews take the form of "Recommend" or "Don't Recommend", compiled into a percentage and an accompanying descriptor, such as "Mostly Positive," "Mixed," or "Overwhelmingly Negative." Steam also displays a separate score for recent reviews - a measure to better inform the potential buyer of how opinions are trending, implemented alongside systems that detect large volumes of reviews in a short time. Games which are being "review bombed" - receiving a large number of negative reviews due to some controversy in the community - are somewhat protected by this system, since the buyer is told that the game's score may not reflect its actual quality. This also helps games which are being continuously developed while available for purchase - as the game grows and improves, recent reviews will become more positive, even though older reviews no longer reflect the current state of the game.


Goals of Our Analysis

Using data science, we can explore the following questions about game reviews:

  1. Do review scores correlate with how many copies a game sells?
  2. Are critics more generous in their scoring than users?
  3. Do platform, publisher, genre, and ESRB rating correlate with how well a game reviews?
  4. How have review scores changed over the years?
  5. Can we predict a game's score from its attributes alone?

Starting Off

For this analysis, we'll be using the following Python libraries:

Pandas - Lets us store data in DataFrames and perform most of our data processing operations.
NumPy - The mathematical backbone of our analysis.
Matplotlib - The backbone for making plots of our data.
Seaborn - A plotting utility with a wide array of plot types to choose from.
scikit-learn (sklearn) - A machine learning library that enables predictions and regressions.
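As a rough sketch, the setup looks something like this:

```python
# Core libraries for the analysis; the scikit-learn pieces we need
# are imported up front as well.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
```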

Data Collection

A vital part of the data science pipeline is acquiring usable data. For this analysis, I searched for existing datasets on Kaggle, an online service hosting a large number of publicly available datasets. This was much easier than scraping these sites directly or using an API, since we would be working with a monumental amount of data. Additionally, websites have become increasingly protective of their data and hostile to scraping in recent years.
I located this Metacritic dataset from 2016, as well as this Steam dataset from 2019. Both are stored as CSV (comma-separated values) files, which we can read into DataFrames using Pandas.

Let's take a look at our two datasets.
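A minimal sketch of loading them, assuming the downloaded files are named metacritic_games.csv and steam_games.csv (your filenames may differ):

```python
# Load both datasets into DataFrames; the filenames are placeholders
# for wherever you saved the Kaggle downloads.
meta = pd.read_csv("metacritic_games.csv")
steam = pd.read_csv("steam_games.csv")

# Peek at the first few rows of each.
print(meta.head())
print(steam.head())
```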

There is a lot to work with here.

In our Metacritic dataset, aside from the reviews we are primarily interested in, we have releases on each platform, genres, publishers, developers, sales in four different regions, and ESRB ratings. However, we also have missing data, which is represented as "NaN" in the table.

In the Steam dataset, we have the unique app ID used by Steam, developers and publishers, the age required to see the game's listing, and a handful of other things like tags.

Tidying Data

Let's begin tidying up our first dataset - the Metacritic reviews.

We aren't interested in the sales in different regions, so we can drop them from the dataset. We also won't be able to do any useful analysis with the data that is missing, so we drop any entry with missing data. Lastly, it will be helpful to convert the user score to be on the same scale as the critic score so we can compare them later on.
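A sketch of that cleanup - the column names here (NA_Sales, User_Score, and so on) are assumptions based on the Kaggle dataset, so adjust them to match your copy:

```python
# Drop the four regional sales columns; we keep only global sales.
meta = meta.drop(columns=["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"])

# Drop every entry with missing data.
meta = meta.dropna()

# User scores are 0-10 with one decimal (sometimes stored as text),
# so coerce to numbers and multiply by 10 to match the 0-100 metascore.
meta["User_Score"] = pd.to_numeric(meta["User_Score"], errors="coerce") * 10
meta = meta.dropna(subset=["User_Score"])
```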

For the second dataset, there are a lot more attributes we don't care about, and we can drop them to get things looking a lot cleaner. We also need to do some conversions. To align with our first dataset, let's simplify "release-date" down to the year of release, and "owners" down to the midpoint of the estimated ownership range, divided by one million so the numbers line up with the sales figures.
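Something along these lines, again assuming the column layout of the 2019 Kaggle dataset, where owners is stored as a range string like "10000000-20000000":

```python
# Keep only the attributes we care about.
steam = steam[["appid", "name", "release_date", "developer", "publisher",
               "genres", "positive_ratings", "negative_ratings",
               "owners", "price"]].copy()

# Simplify the full release date down to the year.
steam["year"] = pd.to_datetime(steam["release_date"]).dt.year

# "owners" is an estimated range; take its midpoint, in millions,
# so the numbers line up with the Metacritic sales figures.
bounds = steam["owners"].str.split("-", expand=True).astype(float)
steam["owners"] = (bounds[0] + bounds[1]) / 2 / 1_000_000
```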

Exploratory Data Analysis

With our data nicely polished, let's move on to some analysis.
In the EDA phase of data science, we want to learn more about our data through the use of data visualization and statistics. This is to help us get a grasp of what we can do with our data, as well as some of its basic properties.

A personal favorite of mine is the scatter plot. It gives a sense of the general layout of your data, and of whether two variables are correlated. Pandas makes scatter plots very easy to produce via DataFrame.plot.scatter(). Let's see how critic score stacks up against sales figures.
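With the assumed column names, it's a one-liner:

```python
# Critic score vs. global sales (millions of copies).
meta.plot.scatter(x="Critic_Score", y="Global_Sales")
plt.show()
```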

Looks like a stray datapoint is making for an ugly graph - Wii Sports blows every other game out of the water in terms of overall sales. Let's take it out and try again.
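One way to do that is to filter out the single largest seller:

```python
# Remove the extreme outlier (Wii Sports) and plot again.
meta_trim = meta[meta["Global_Sales"] < meta["Global_Sales"].max()]
meta_trim.plot.scatter(x="Critic_Score", y="Global_Sales")
plt.show()
```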

And just from this basic visualization, we can see that critical reception seems to correlate with number of copies sold. But what about user reception? They are the ones going out and buying copies, after all.
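Swapping in the rescaled user score:

```python
# User score (rescaled to 0-100) vs. global sales.
meta_trim.plot.scatter(x="User_Score", y="Global_Sales")
plt.show()
```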

Now we start to see something a bit unexpected - and this is part of why EDA is helpful for getting to know your data. Better-selling games seem unable to reach extremely high user scores - the 96-100 range - a ceiling that was not present in the critic scores.

Let's see if we find the same to be true of Steam games.
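Steam doesn't give a single score, so as a stand-in we can compute the percentage of positive reviews per game, assuming the positive_ratings and negative_ratings columns from the Kaggle dataset:

```python
# Percentage of positive reviews vs. estimated owners (millions).
steam["pct_positive"] = steam["positive_ratings"] / (
    steam["positive_ratings"] + steam["negative_ratings"]) * 100
steam.plot.scatter(x="pct_positive", y="owners")
plt.show()
```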

Another ugly-looking graph. Due to the nature of the data we used, our values of "owners" are very rough. And what are the outliers here? Our highest data point sits at roughly 150 million owners - more than our previous record holder, Wii Sports. Can that be right? Let's look at our games with the most users.
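A quick way to check:

```python
# The eight games with the largest estimated player bases.
print(steam.nlargest(8, "owners")[["name", "owners", "price"]])
```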

Now we get a glimpse of why our "sales" are so high: six of the eight most owned games are free-to-play. That is, they don't cost money to start playing. Since free-to-play games behave quite differently from purchased ones, let's look only at games that require a purchase (whose price != 0). We'll also exclude any games with more than 10 million owners to get a closer look at the bottom of the graph.
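With owners already expressed in millions, that filter looks like:

```python
# Paid games only, capped at 10 million owners.
paid = steam[(steam["price"] != 0) & (steam["owners"] <= 10)]
paid.plot.scatter(x="pct_positive", y="owners")
plt.show()
```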

Still oddly shaped, but there is a very clear positive correlation! Additionally, we see a hint of the same phenomenon - only games with a low number of purchases receive near-perfect user scores.


Let's move on to more intriguing exploration - we're now fairly confident that better-scoring games sell better. What about the difference between user scores and critic scores? We might think to simply take the mean of both columns in our DataFrame, but that isn't the true distribution of scores - it counts a game with a single reviewer the same as one with 100. To obtain a true estimate of the average score, we have to weight each game's score by its number of reviews. It's important to remember what your data actually represents when trying to draw conclusions.
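NumPy's weighted average does exactly this; the review-count columns (Critic_Count, User_Count) are again assumed names from the Kaggle dataset:

```python
# Weight each game's score by its number of reviews, so a game with
# 100 reviewers counts 100 times as much as one with a single reviewer.
critic_avg = np.average(meta["Critic_Score"], weights=meta["Critic_Count"])
user_avg = np.average(meta["User_Score"], weights=meta["User_Count"])
print(f"Weighted critic average: {critic_avg:.1f}")
print(f"Weighted user average:   {user_avg:.1f}")
```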

From this simple metric, we can infer that critics are, in fact, more generous than users in their reviews.

An excellent way to view the overall distribution of data across a number of categories is the violin plot, which we can produce with Seaborn. Violin plots are very similar to box plots, with extensions that represent the distribution of the data more finely - thicker parts of the "violin" mark locations where more data points lie. You can take a look at violin plots and Seaborn's other utilities on their site. Let's look at how our Metacritic data is distributed among the then-current-generation consoles, a few well-known publishers, and the most common ratings.
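One plot per category; here's the platform version as a sketch (the platform codes are example values from the dataset):

```python
# Critic score distributions for a few platforms; swap in
# x="Publisher" or x="Rating", filtered to a few values, for the others.
consoles = meta[meta["Platform"].isin(["PC", "PS4", "XOne", "WiiU"])]
sns.violinplot(data=consoles, x="Platform", y="Critic_Score")
plt.show()
```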

We can see significant differences in each plot. PC games rate above average, as do M-rated games. Games published by Valve have much, much better scores than those published by the other large publishers; Nintendo also scores well, and Ubisoft scores poorly.

Using the same technique, why don't we take a look at which genres score well?
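Genres fit in a single plot:

```python
# Critic score distribution per genre.
sns.violinplot(data=meta, x="Genre", y="Critic_Score")
plt.xticks(rotation=45)
plt.show()
```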

Interesting. It looks like Action and Adventure games both do noticeably worse than other genres. Sports games do much better.

We should keep all this in mind for our later analysis.


Now we want to explore trends in user scoring over time. Do they rise or fall? And when? What was the best year for games?

The data that we have access to is rather patchy in earlier years - Steam only found popularity after around 2007, and there aren't many games from before the turn of the century to plot on a graph. Let's pick some arbitrary cutoffs where we are confident our data is decently complete: 2006-2019 for Steam, and 2000-2016 for Metacritic.
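Grouping by year and averaging gives us one line per platform:

```python
# Average score per release year, within the chosen cutoffs.
meta_years = meta[meta["Year_of_Release"].between(2000, 2016)]
meta_years.groupby("Year_of_Release")["User_Score"].mean().plot(
    label="Metacritic user score")

steam_years = steam[steam["year"].between(2006, 2019)]
steam_years.groupby("year")["pct_positive"].mean().plot(
    label="Steam % positive")
plt.legend()
plt.show()
```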

This is rather inconclusive - average scores wobble from year to year, but neither platform shows a clear long-term trend.

Regression Analysis

Through EDA we can get a general idea of what our data looks like and draw a few conclusions about it. However, to prove something about our data rigorously we need to apply some statistics. Namely, we can use linear regression with scikit-learn to model the data and predict what outcome we might see for a given set of inputs. Say we want to make the best game possible, and need to choose a platform, rating, and genre. Let's take the platforms, ESRB ratings, and genres we were previously interested in, and see which we should aim for to get our hypothetical game the best scores.
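A sketch of that model - the categorical columns are one-hot encoded so each platform, rating, and genre gets its own coefficient:

```python
# One-hot encode the categorical attributes, and keep release year
# as a numeric feature so we can ask about future years.
features = pd.get_dummies(meta[["Platform", "Rating", "Genre"]])
features["Year_of_Release"] = meta["Year_of_Release"]

model = LinearRegression()
model.fit(features, meta["Critic_Score"])

# Which attributes push predicted scores up or down?
print(pd.Series(model.coef_, index=features.columns).sort_values())
```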

A positive coefficient here means that scores go up with the presence of a given variable. For example, being a Racing game gives you 2 points compared to not being a Racing game. Looking at these values, we can see that, in our data, it is best to be an M-rated puzzle game for PC. If we were to release it in 2020, what would our model anticipate the review score to be?
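We can build a single input row with just those indicators set; the dummy column names (Platform_PC and so on) come from get_dummies, so they depend on the dataset's exact values:

```python
# An M-rated puzzle game for PC, released in 2020.
game = pd.DataFrame(0, index=[0], columns=features.columns)
game.loc[0, ["Platform_PC", "Rating_M", "Genre_Puzzle"]] = 1
game.loc[0, "Year_of_Release"] = 2020
print(model.predict(game))
```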

It seems like this score is fairly average. This might be tipping us off to the fact that our model does not fit the data very tightly. We can calculate the R^2 value of our regression to get a numerical estimate of how well our model fits the data it was trained on. Higher values of R^2 indicate a better fit.
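scikit-learn computes this directly:

```python
# R^2 of the fit on our data; 1.0 is a perfect fit, values near 0
# mean the features explain little of the variance in scores.
print(model.score(features, meta["Critic_Score"]))
```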

This is a very low value for R^2, indicating that these parameters alone are not good estimators of how well a game will be scored.

Let's try predicting how well a game will do from its publisher. We have to be cautious of overfitting here: it's entirely possible to add so many attributes that our model becomes extremely accurate on the data we have but inaccurate on new data. To that end, we should choose publishers with a large number of published games. Since the dataset we have is organized by popularity, the first 10 publishers should fit this requirement nicely.
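One way to express that (here I pick the ten publishers with the most games via value_counts, which satisfies the same requirement):

```python
# Restrict to the ten highest-volume publishers and one-hot encode them.
top_pubs = meta["Publisher"].value_counts().head(10).index
pubs = meta[meta["Publisher"].isin(top_pubs)]

pub_features = pd.get_dummies(pubs["Publisher"])
pub_model = LinearRegression()
pub_model.fit(pub_features, pubs["Critic_Score"])

# Each coefficient is the model's estimate of a publisher's impact.
print(pd.Series(pub_model.coef_, index=pub_features.columns).sort_values())
```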

This gives us an idea of what our model thinks of these studios' impact. What is our R^2 score?
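Same call as before:

```python
print(pub_model.score(pub_features, pubs["Critic_Score"]))
```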

Well, it looks like this isn't an effective predictor of a game's critical reception, either. Presumably a game's score is determined mostly by its quality, and this analysis is consistent with that hypothesis.

Conclusions

We've been able to determine a number of things about review scores in this analysis.

  1. Review scores correlate well with how many purchases a game will have.
  2. Critics are more generous to games than consumers.
  3. Platform, publisher, and ESRB rating all correlate with how well a game reviews.
  4. Review scores have stayed generally constant over the full range of published games over the years.
  5. We can't accurately predict a game's score without actually playing it.

What does this tell us about video game reviews as a whole? It seems that reviews are necessary precisely because consumers can't rely on consistency from publishers, platforms, or genres. At the same time, scores track how many purchases a game will get, which suggests that reviews are, as a whole, accurate.
Therefore, we can conclude that review scores really are a helpful guide for determining a game's value, and consumers are justified in turning to them to inform their purchasing decisions.

There is much more that could be done with these datasets. Perhaps you might look at the playtime and price statistics to determine how many hours of entertainment a consumer receives per dollar spent. There are many other datasets available online as well, so I encourage you to seek out any data that interests you.