Rating the Ratings: Understanding Book Review Scores
How Genre, Review Count, and Time Shape Ratings on a 5-Star Scale
Breaking Down the Numbers: How Readers Rate Books
Last week, I told you that we want to calibrate how much confidence we place in a book’s reviews based on how many reviews it has. Now we’re going to look at the reviews themselves. Just to be clear, each review is a rating on a scale of 0 to 5, with 5 reflecting higher satisfaction with the book.
Do Fiction and Non-Fiction Differ? Not in Ratings
By now, you’re probably noticing that I begin with the simplest analysis for each dependent variable: whether fiction or non-fiction matters. Well, this one is no different. Except actually it is. There is no difference in review quality between fiction and non-fiction: they both get a median of exactly 4.6 stars out of 5.
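If you want to run this kind of check yourself, here’s a minimal sketch of the fiction vs. non-fiction comparison using a Mann-Whitney U test (a reasonable nonparametric choice for bounded rating data; the post doesn’t say which test was actually used). The column names and the tiny stand-in table are my own assumptions.

```python
# Minimal sketch of the fiction vs. non-fiction comparison.
# The column names ("category", "rating") and the stand-in data are assumptions;
# swap in your actual book-review table.
import pandas as pd
from scipy.stats import mannwhitneyu

# Illustrative stand-in data, not the real dataset.
books = pd.DataFrame({
    "category": ["fiction", "fiction", "fiction", "non-fiction", "non-fiction", "non-fiction"],
    "rating":   [4.6, 4.7, 4.4, 4.6, 4.5, 4.8],
})

fiction = books.loc[books["category"] == "fiction", "rating"]
nonfiction = books.loc[books["category"] == "non-fiction", "rating"]

# Mann-Whitney U: a nonparametric test suited to bounded, skewed rating data.
stat, p = mannwhitneyu(fiction, nonfiction, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
print(f"median fiction = {fiction.median():.1f}, median non-fiction = {nonfiction.median():.1f}")
```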
Do Genres Matter? Yes, But Only Slightly
Just because the broad category (fiction vs. non-fiction) doesn’t matter, that doesn’t mean genre doesn’t. In fact, genre is still significant (H(17) = 64.11, p < .001), with a medium effect size (r² = 0.06). This doesn’t produce a particularly interesting graph, though, because the median values only range from 4.4 to 4.7 across genres.
Crime is the only genre at 4.4. Remember that it also had a lower median number of reviews, so it might not stay there as more reviews come in.
While this result is statistically significant, I don’t believe it’s remotely interesting in practical terms. Note that the medians for fiction and non-fiction (both 4.6) fall within the range of the individual genres.
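For the curious, here’s a rough sketch of how a genre comparison like this could be run. Kruskal-Wallis is a standard choice for comparing a rating across many groups; the eta-squared-style effect size below is one common formula and may not match exactly how the quoted effect size was computed. Column names and the stand-in data are assumptions.

```python
# Sketch of the genre comparison: Kruskal-Wallis across groups plus a rough
# effect size. "genre" and "rating" are assumed column names, and the data
# below are invented for illustration.
import pandas as pd
from scipy.stats import kruskal

books = pd.DataFrame({
    "genre":  ["crime", "crime", "romance", "romance", "fantasy", "fantasy", "history", "history"],
    "rating": [4.3, 4.4, 4.7, 4.6, 4.7, 4.5, 4.6, 4.6],
})

groups = [g["rating"].values for _, g in books.groupby("genre")]
H, p = kruskal(*groups)

# One common effect size for Kruskal-Wallis: eta^2 = (H - k + 1) / (n - k),
# where k is the number of groups and n the total number of observations.
k, n = len(groups), len(books)
eta_sq = (H - k + 1) / (n - k)
print(f"H({k - 1}) = {H:.2f}, p = {p:.3f}, eta^2 ≈ {eta_sq:.2f}")
```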
More Reviews, More Stability: The 4.6 Phenomenon
There’s a significant (depending on how you define significance) correlation between the number of reviews a book has and its rating, p = .002, but it’s a very weak one. I’m not showing you the graph because I honestly had to stare at it for a while to make sense of it (it takes a while to notice that the relationship is a curve). Basically, most books have 1,000 or fewer reviews, and most of those books are rated between 3.5 and 5. There are about seven books with almost no reviews, and those get very low ratings.
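The post doesn’t say which correlation was computed; given the curved relationship and the heavy skew in review counts, a rank-based (Spearman) correlation is one sensible sketch. The column names and numbers below are purely illustrative.

```python
# Rank-based correlation between review count and rating.
# Column names and values are invented for illustration.
import pandas as pd
from scipy.stats import spearmanr

books = pd.DataFrame({
    "review_count": [3, 12, 85, 200, 650, 1200, 4000, 9000],
    "rating":       [2.0, 3.5, 4.2, 4.5, 4.6, 4.6, 4.7, 4.6],
})

rho, p = spearmanr(books["review_count"], books["rating"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```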
Essentially, as reviews accumulate, books tend to end up with a rating of about 4.6, which is exactly what we expect statistically. I wasn’t expecting 4.6 specifically, but I was expecting extreme reviews to cancel out over time. To me, the effect looks weak because most ratings are pretty close to 4.6 to begin with.
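Here’s a toy simulation of that cancelling-out effect, assuming (purely for illustration) a fixed, heavily right-skewed distribution of individual star ratings: the average settles near the distribution’s mean as the review count grows.

```python
# Toy simulation: with a fixed skewed rating distribution, a book's average
# rating converges to the distribution's mean as reviews accumulate.
# The distribution below is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
stars = np.array([1, 2, 3, 4, 5])
probs = np.array([0.02, 0.03, 0.10, 0.25, 0.60])  # skewed toward 5, like the data

for n_reviews in (5, 50, 500, 5000):
    ratings = rng.choice(stars, size=n_reviews, p=probs)
    print(f"{n_reviews:>5} reviews -> average {ratings.mean():.2f}")
# Expected value of this toy distribution: sum(stars * probs) = 4.38
```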
Year of Publication: A Predictable Flatline in Ratings
Last but not least, is there any relationship between the year a book was published and the ratings it gets? It turns out there is a significant relationship here too (H(33) = 73.97, p < .001), and it even has a medium effect size (r² = 0.07).
And, surprise, surprise, it’s not actually that interesting a result. It’s almost a flat line, and it becomes a literal flat line from 2019 onward. For context, we don’t have enough books per year to be very confident in the median rating until about 2017. Basically, as we move toward the present day, there are more books per year in the data. I’ve talked about this before: the more books that go into determining the median, the closer that median gets to 4.6.
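If you wanted to reproduce that flattening picture, a per-year median alongside the per-year book count is the quickest way to see it. The column names and the tiny example table below are assumptions.

```python
# Median rating and book count per publication year.
# "pub_year" and "rating" are assumed column names; data are illustrative.
import pandas as pd

books = pd.DataFrame({
    "pub_year": [2015, 2016, 2017, 2017, 2018, 2018, 2019, 2019, 2020, 2020],
    "rating":   [3.9, 4.8, 4.5, 4.7, 4.6, 4.6, 4.6, 4.7, 4.6, 4.6],
})

per_year = books.groupby("pub_year")["rating"].agg(["median", "count"])
print(per_year)
```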
This is saying the exact same thing as the last section.
What We’ve Learned About Ratings and Review Dynamics
This data shows three things. One: I was right. No matter how you look at it (by year, directly by review count, or by comparing genres with more reviews), more reviews means the rating moves closer to the overall median. Like I said before, that’s just statistics.
Two: when you have a restricted range, you get less variability. A 0 to 5 scale, where raters must pick whole values, severely restricts variability. That gets you this convergence on a single value faster, and it also means small differences show up as statistically significant. If people gave a percentage rating instead, I suspect it would still converge across genres, years, and review counts, but maybe not as tightly. Some genres, for example, might turn out to be a little different from each other. But we can’t know that from this data.
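Here’s a toy illustration of that restricted-range point: two groups with genuinely different underlying satisfaction levels can end up with identical whole-star medians once everything is squeezed onto a 0 to 5 integer scale. All the numbers are invented.

```python
# Toy illustration: a whole-star 0-5 scale can hide a real underlying
# difference between two groups. Everything here is invented for illustration.
import numpy as np

rng = np.random.default_rng(1)

def to_stars(pct):
    """Map a 0-100 satisfaction percentage onto a whole-star 0-5 rating."""
    return np.clip(np.round(pct / 100 * 5), 0, 5)

genre_a = rng.normal(loc=88, scale=6, size=5_000)  # slightly happier readers
genre_b = rng.normal(loc=84, scale=6, size=5_000)

print("percentage medians:", np.median(genre_a).round(1), np.median(genre_b).round(1))
print("star medians:      ", np.median(to_stars(genre_a)), np.median(to_stars(genre_b)))
```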
Lastly, people are weird about scales. That the distribution is skewed the way it is isn’t surprising. I believe there’s research on this, but I (and others) have noticed that people tend to select values toward the high end of Likert scales and scales like this one. You might think that extreme values averaging out would get you something closer to 2.5 here; I suspect this tendency toward the high end is part of why it doesn’t. And also, these are published books, so they should be at least decent in quality.