Today I embarked on analysing the top 50 Amazon bestselling books from 2009 – 2019. It is a well analysed dataset, but, as I love books, I thought it would be fun to have a look at it myself. I actually found some of the results quite surprising.

Tools I used:

-python pandas

-MS Excel

The dataset is not actually that large (550 records, duh), so a filtered and sorted spreadsheet would suffice to perform most of the analysis. Pandas would still prove useful for gathering and summarizing certain statistics.

To start, I wanted to get straight down to which authors dominated the bestseller list during the eleven-year period. The value_counts() method in pandas provides a quick and easy way of obtaining this. As the name suggests, it counts how many times each value appears in the specified column, and returns the counts sorted from most to least frequent. We assign the result to ‘author_counts’, and use the head method to look at the first 10 rows:

import pandas as pd

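# read_csv already returns a DataFrame, so no separate pd.DataFrame() call is needed
df = pd.read_csv('amazon_top50_books.csv')

# count how many times each author appears on the list
author_counts = df['Author'].value_counts()
print(author_counts.head(10))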
Author
Jeff Kinney 12
Gary Chapman 11
Rick Riordan 11
Suzanne Collins 11
American Psychological Association 10
Dr. Seuss 9
Gallup 9
Rob Elliott 8
Stephen R. Covey 7
Stephenie Meyer 7

Et voila. I actually didn’t know who half of these were! (Honestly). In my defence, once I looked people up, I can say at least I was aware of their work, and aware that it was in mainstream popularity. For anyone else who is in the dark, in descending order by rank these are:

  1. the author of the kids series ‘Diary of a Wimpy Kid’
  2. the author of the relationship guide, ‘The 5 Love Languages’
  3. the author of the young adult fantasy series, ‘Percy Jackson and the Olympians’
  4. the author of the young adult dystopian fantasy series, ‘The Hunger Games’
  5. the publisher of the APA style guide. Now, I was surprised by this one, as I actually studied psychology between 2011 – 2014, and I never needed to buy this! I did some research, and it turns out that an increasing number of disciplines outside of Psychology actually do use APA style for their academic publications. Many libraries and academic institutions are probably required to order it in bulk. And it got them to no.5 on the list (with a book that is probably as dry as hell). So there you go.
  6. Good old Dr. Seuss! It is actually one book in particular, ‘Oh, the Places You'll Go!’, that has done him proud, although ‘What Pet Should I Get?’ did also win a place on the bestsellers list in 2015.
  7. publisher of ‘Strengthsfinder 2.0’
  8. the author of a couple of kids’ joke books
  9. the author of ‘The 7 Habits of Highly Effective People’
  10. the author of the ‘Twilight’ series

Now, it occurred to me that just because an author makes the most appearances on the bestsellers list, that does not necessarily make them the ‘best’ or most successful author. Many of the above are authors of a series, and so have multiple titles that could each earn them a place on the list.
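One way to tell repeat appearances apart from the number of different books is to count both per author. This is only a rough sketch: the rows below are made up to stand in for the real file, and it assumes the same 'Name' and 'Author' columns used elsewhere in this post.

```python
import pandas as pd

# Made-up stand-in for the real CSV; these rows are illustrative only.
df = pd.DataFrame({
    'Name':   ['Book A', 'Book A', 'Book A', 'Book B', 'Book C', 'Book D'],
    'Author': ['Author X', 'Author X', 'Author X',
               'Author Y', 'Author Y', 'Author Y'],
    'Year':   [2009, 2010, 2011, 2009, 2010, 2011],
})

# appearances     = rows on the list (non-missing names) per author
# distinct_titles = number of different book names per author
author_summary = (
    df.groupby('Author')['Name']
      .agg(appearances='count', distinct_titles='nunique')
      .sort_values('appearances', ascending=False)
)
print(author_summary)
```

Here both made-up authors appear three times, but one does so with a single book and the other with three, which is exactly the distinction that value_counts() on its own hides.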

Going back to my spreadsheet, I determined that the highest user rating for any bestseller was 4.9/5 (no such thing as a perfect ‘5’). The next question is which books were awarded that rating. In pandas, I created a new dataframe containing only the rows where the user rating equals that top value of 4.9. Some books received the top rating in multiple years, so I created a subset of that dataframe with duplicate titles dropped, so that each book is listed only once:

top_rated_books = df[df['User Rating'] == 4.9]
unique_top_rated_books = top_rated_books.drop_duplicates(subset=['Name'])

print(unique_top_rated_books)

40            Brown Bear, Brown Bear, What Do You See?        Bill Martin Jr.   
81   Dog Man and Cat Kid: From the Creator of Capta...             Dav Pilkey   
82   Dog Man: A Tale of Two Kitties: From the Creat...             Dav Pilkey   
83   Dog Man: Brawl of the Wild: From the Creator o...             Dav Pilkey   
85   Dog Man: Fetch-22: From the Creator of Captain...             Dav Pilkey   
86   Dog Man: For Whom the Ball Rolls: From the Cre...             Dav Pilkey   
87   Dog Man: Lord of the Fleas: From the Creator o...             Dav Pilkey   
146  Goodnight, Goodnight Construction Site (Hardco...   Sherri Duskey Rinker   
151                           Hamilton: The Revolution     Lin-Manuel Miranda   
153  Harry Potter and the Chamber of Secrets: The I...           J.K. Rowling   
155  Harry Potter and the Goblet of Fire: The Illus...          J. K. Rowling   
156  Harry Potter and the Prisoner of Azkaban: The ...           J.K. Rowling   
157  Harry Potter and the Sorcerer's Stone: The Ill...           J.K. Rowling   
174                       Humans of New York : Stories        Brandon Stanton   
187  Jesus Calling: Enjoying Peace in His Presence ...            Sarah Young   
207  Last Week Tonight with John Oliver Presents A ...             Jill Twiss   
219                                  Little Blue Truck         Alice Schertle   
244                        Obama: An Intimate Portrait             Pete Souza   
245                          Oh, the Places You'll Go!              Dr. Seuss   
288  Rush Revere and the Brave Pilgrims: Time-Trave...          Rush Limbaugh   
289  Rush Revere and the First Patriots: Time-Trave...          Rush Limbaugh   
303             Strange Planet (Strange Planet Series)         Nathan W. Pyle   
420               The Legend of Zelda: Hyrule Historia         Patrick Thorpe   
431                                 The Magnolia Story            Chip Gaines   
476                        The Very Hungry Caterpillar             Eric Carle   
486                   The Wonderful Things You Will Be  Emily Winfield Martin   
521                             Unfreedom of the Press          Mark R. Levin   
545       Wrecking Ball (Diary of a Wimpy Kid Book 14)            Jeff Kinney
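Rather than reading the top rating off the spreadsheet and hard-coding 4.9, the maximum can also be looked up in pandas. A minimal sketch, again with made-up rows standing in for the real file, and assuming the 'Name', 'Author' and 'User Rating' columns used above:

```python
import pandas as pd

# Made-up stand-in for the real CSV; these rows are illustrative only.
df = pd.DataFrame({
    'Name':        ['Book A', 'Book A', 'Book B', 'Book C'],
    'Author':      ['Author X', 'Author X', 'Author Y', 'Author Z'],
    'User Rating': [4.9, 4.9, 4.7, 4.9],
    'Year':        [2018, 2019, 2019, 2019],
})

# Look up the highest rating instead of typing it in by hand.
top_rating = df['User Rating'].max()

# Keep only rows with that rating, then list each title once.
top_rated_books = df[df['User Rating'] == top_rating]
unique_top_rated_books = top_rated_books.drop_duplicates(subset=['Name'])

print(top_rating)
print(unique_top_rated_books[['Name', 'Author']])
```

The filter and drop_duplicates steps are the same as before; the only change is that the threshold now comes from the data itself.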

So J.K. Rowling is still going strong. Interesting to note that a majority of these would appear to be children’s books!
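One thing the output above hints at: the same author can appear under slightly different spellings (‘J.K. Rowling’ and ‘J. K. Rowling’), which would split their tally in value_counts(). Below is a rough sketch of one way to tidy this up. The normalise_author helper is my own invention for illustration, and it assumes that closing the gap between initials is safe for this dataset, which I have not checked for every name.

```python
import re
import pandas as pd

# A few author strings as they appear in the output above.
authors = pd.Series(['J.K. Rowling', 'J. K. Rowling', 'J.K. Rowling', 'Dr. Seuss'])

def normalise_author(name: str) -> str:
    """Hypothetical helper: turn 'J. K. Rowling' into 'J.K. Rowling'."""
    # Remove whitespace after a single-letter initial only when
    # another initial (capital letter + full stop) follows it.
    return re.sub(r'\b([A-Z])\.\s+(?=[A-Z]\.)', r'\1.', name.strip())

clean_counts = authors.map(normalise_author).value_counts()
print(clean_counts)
```

With the spellings merged, all three Rowling rows are counted under one name, while ‘Dr. Seuss’ is left untouched.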

To finish, I thought it would be nice to do a visualization. I wanted to investigate with a bar chart how the popularity of fiction vs. non-fiction has changed over the eleven-year period.

First, I did a group by in pandas by year, then genre to get the aggregate statistics of interest:

genre_counts_per_year = df.groupby(['Year', 'Genre']).size().unstack(fill_value=0)

print(genre_counts_per_year)

Genre  Fiction  Non Fiction
Year                       
2009        24           26
2010        20           30
2011        21           29
2012        21           29
2013        24           26
2014        29           21
2015        17           33
2016        19           31
2017        24           26
2018        21           29
2019        20           30

Now, using this output, I created my visualization with a chart in MS Excel:

And there you go! So, it looks like with the exception of 2014, non-fiction features more than fiction in every year, but in most years the difference is not that big. Adding up the table above, the split over the whole period is 310 non-fiction entries to 240 fiction.
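For anyone who would rather stay in Python than switch to Excel, here is a sketch of how a similar grouped bar chart could be drawn with pandas and matplotlib. The counts are typed in from the table above so that the snippet runs on its own; the labels, figure size and output filename are my own arbitrary choices.

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so this also runs without a display
import pandas as pd

# Counts copied from the printed groupby table above.
years = list(range(2009, 2020))
genre_counts_per_year = pd.DataFrame(
    {
        'Fiction':     [24, 20, 21, 21, 24, 29, 17, 19, 24, 21, 20],
        'Non Fiction': [26, 30, 29, 29, 26, 21, 33, 31, 26, 29, 30],
    },
    index=pd.Index(years, name='Year'),
)

# Totals per genre over the whole period, and non-fiction's share each year.
totals = genre_counts_per_year.sum()
non_fiction_share = (
    genre_counts_per_year['Non Fiction'] / genre_counts_per_year.sum(axis=1)
)
print(totals)
print(non_fiction_share.round(2))

# One group of two bars per year.
ax = genre_counts_per_year.plot(kind='bar', figsize=(10, 5))
ax.set_ylabel('Books in the top 50')
ax.set_title('Fiction vs. non-fiction bestsellers, 2009-2019')
ax.figure.tight_layout()
ax.figure.savefig('genre_counts_per_year.png')  # arbitrary filename
```

With the real file loaded, the hand-typed table would simply be replaced by the groupby result from earlier.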

There are lots of reasons why you might expect this: people don’t just buy books for entertainment, which tends to be the primary purpose of fiction. Non-fiction books serve a variety of purposes, including entertainment but also study, instruction (food and cookery books come to mind), and self-help. A limitation of the dataset is that it only categorizes ‘genre’ in terms of whether a book is fiction or non-fiction. It would be interesting to look in depth at which subgenres (e.g., fantasy, crime, self-help, what have you) feature on the list. Part of the challenge of course is that identifying what genre a book falls into can be subjective, and people might have differing opinions on the matter.

Follow-up considerations:

  • I thought it was interesting that there was a trend for the most highly rated books to be children’s books. Is there an explanation for that? Is it natural to assume that adult literature is just held up to a higher standard, and so you would expect the reviewers’ response to be more critical and harsh? Are children’s books genuinely just more enjoyable? Could it be because a lot of children’s books are actually bought as gifts for children, so it is not the children themselves who give the ratings but the adults, whose rating is distorted somewhat by the elevated response of their youthful recipient?
  • I noticed that all the authors considered were native English-language authors. It turns out that this is because the data only considers the top-selling English-language titles! There are bestseller lists for other languages as well. It might be interesting to compare, and find out what the most popular titles in other languages are.
  • Could I attempt creating my own sub-categories in the original dataset, to look more specifically at what kinds of books dominated the list?
  • The data only goes up to 2019. It would be interesting to scrape data up to the present date, to find out which books are now in vogue. Would we see that some of the top contenders have continued to hold their place?
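On the sub-category idea, a crude first pass might be to tag titles by keyword. The sketch below is purely illustrative: the keyword rules, the sub-genre labels and the guess_subgenre helper are all my own guesses rather than anything in the dataset, and a real attempt would need far more rules plus some manual checking.

```python
import pandas as pd

# Hypothetical keyword -> sub-genre rules (my own labels, not from the data).
SUBGENRE_KEYWORDS = {
    'Harry Potter': 'Fantasy',
    'Dog Man': "Children's comics",
    'Cookbook': 'Food and cookery',
}

def guess_subgenre(title: str) -> str:
    """Return the first sub-genre whose keyword appears in the title."""
    for keyword, subgenre in SUBGENRE_KEYWORDS.items():
        if keyword.lower() in title.lower():
            return subgenre
    return 'Unclassified'  # left for manual review

# Shortened versions of a few titles from the list.
titles = pd.Series([
    'Dog Man: Lord of the Fleas',
    'Harry Potter and the Goblet of Fire',
    'Hamilton: The Revolution',
])

subgenres = titles.map(guess_subgenre)
print(pd.DataFrame({'Name': titles, 'Subgenre': subgenres}))
```

Anything that falls through to ‘Unclassified’ would still need a human decision, which is where the subjectivity mentioned above comes back in.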
