Have you ever visited https://www.reddit.com/r/dataisbeautiful/ and wondered how people create such beautiful graphs? This tutorial will teach you how to create some of your own. In this project, we will create interactive graphs that show the ratings of all episodes in a TV show.

We will learn how to use the cinemagoer package to retrieve information from IMDb, how to do basic file I/O, how to process a CSV file using pandas, and how to create interactive plots with Plotly. This knowledge is easily transferable to any other data science project.

All of the code for this project can be found here: https://github.com/mjdargen/tutorials/tree/main/plot_tv_ratings


Step 1: Setting Up Environment & Dependencies

In order to run the code required for this project, you will need to install Python 3 and git. The installation steps can vary slightly based on your operating system, so I will link you to the official resources which will provide the most up-to-date guides for your operating system.

Git Project Files

Now you will need to retrieve the source files. If you would like to run git from the command line, you can install git by following the instructions here: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git. However, you can just as easily download the files in your browser.

To download the files required, you can navigate to the repository in your browser and download a zip file or you can use your git client. The link and the git command are described below.

https://github.com/mjdargen/tutorials/tree/main/plot_tv_ratings

git clone https://github.com/mjdargen/tutorials.git

Installing Dependencies

You need to install the following programs on your computer in order to run this program:

Python 3 with pip: https://www.python.org/downloads/
Install cinemagoer (imdbpy), pandas, & plotly: `pip3 install -r requirements.txt`
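
If you want to confirm that the install worked before moving on, a quick check like the one below can help. This is only a sketch: the helper name missing_modules is made up for this example, and it assumes the three packages are imported as imdb (the import name cinemagoer uses), pandas, and plotly.

```python
import importlib.util

# Import names assumed for this tutorial; cinemagoer is imported as "imdb".
REQUIRED_MODULES = ["imdb", "pandas", "plotly"]


def missing_modules(names):
    """Return the top-level module names that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```

If anything is reported as missing, re-run the pip3 command above from the project folder.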

Step 2: Cinemagoer (IMDbPY)

IMDb (Internet Movie Database) is an online database of information related to movies and TV shows. It includes information about the productions including cast, crew, biographies, plot summaries, trivia, ratings, and reviews. It's a great resource for looking up who appeared in a given movie or for finding related films. In this project, we will use cinemagoer (known as IMDbPY until its recent rename) to retrieve data from IMDb. Cinemagoer is a Python package for retrieving and managing IMDb data about movies, people, and companies. Full documentation is available here: https://imdbpy.readthedocs.io/en/latest/.

Step 3: Pandas & Plotly

Pandas is a Python software library created for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Think of it as the programmatic equivalent of Excel. We will be using it to parse the data from a CSV (comma-separated values) file and extract the information needed for plotting.

pandas Documentation - pandas can do a lot so this site may seem very overwhelming. You can search this website or use a search engine to find the specific information you need.
pandas Tutorials
Intro to Dataframes
Reading in CSV Data
Manipulating Data
pandas Cheatsheet
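
To get a feel for pandas before we use it on real data, here is a small sketch that reads CSV text into a DataFrame and pulls out a few values. The episode names and ratings are made up for illustration; the two-column Episode/Rating layout matches the file this project writes later.

```python
import io

import pandas as pd

# Made-up sample data in the same two-column layout this project uses.
csv_text = '''Episode,Rating
"S1E1 Pilot",8.1
"S1E2 Second Episode",7.4
"S1E3 Third Episode",9.0
'''

# read_csv accepts a file path or any file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)               # (number of rows, number of columns)
print(df['Rating'].mean())    # average rating across all episodes
print(df[df['Rating'] > 8])   # only the rows rated above 8
```

In the real project, the StringIO object is simply replaced by the path to the CSV file.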

Plotly is a software library created for data analytics and visualizations. We will be using it to make beautiful interactive graphs that let you zoom in and select the data you want to inspect. The Plotly documentation is a lot less overwhelming and has a number of helpful examples.

Step 4: Retrieving Data From IMDb

This project uses cinemagoer to retrieve TV show ratings information from IMDb. get_ratings() is the function that retrieves the ratings and writes them to a CSV file. It takes input from the user and looks it up using the IMDb search functionality. It then filters the results so that only TV shows remain. Because the IMDb search may rank a similarly named series first, the function displays the title and year of the top result and asks the user to confirm it. If the user rejects it, the function steps through the subsequent results until the user confirms the correct show or the results run out.

    ia = IMDb()  # create instance of imdb class

    # search for show to get id
    results = ia.search_movie(show)
    # search results not the best, filter and check with user
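    results = [res for res in results if res['kind'] == 'tv series']
    # stop early if the search returned no TV series at all
    if not results:
        print('No TV series found. Exiting program...')
        quit()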
    i = 0
    r = input(
        f"Top result is {results[i]['title']} ({results[i]['year']}). Is that correct (y/n)? ")
    while 'y' not in r.lower():
        i += 1
        try:
            r = input(
                f"Next result is {results[i]['title']} ({results[i]['year']}). Is that correct (y/n)? ")
        except IndexError:
            print('No more results. Exiting program...')
            quit()
    id = results[i].getID()
    show = results[i]['title']
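
The filter-and-confirm logic can also be tried out without touching IMDb at all. The sketch below is not the project's code: the helper names filter_tv_series and choose_show are made up, and plain dictionaries stand in for the search results. It also accepts an answer only if it starts with "y", which is slightly stricter than the check above.

```python
def filter_tv_series(results):
    """Keep only results whose 'kind' is 'tv series'."""
    return [res for res in results if res.get('kind') == 'tv series']


def choose_show(results, ask=input):
    """Step through results until one is confirmed; return it, or None."""
    for position, res in enumerate(results):
        label = 'Top' if position == 0 else 'Next'
        year = res.get('year', 'year unknown')
        answer = ask(f"{label} result is {res['title']} ({year}). "
                     "Is that correct (y/n)? ")
        if answer.strip().lower().startswith('y'):
            return res
    return None  # ran out of results without a confirmation


# Made-up search results standing in for what a search might return.
sample = [
    {'title': 'Example Movie', 'kind': 'movie', 'year': 2001},
    {'title': 'Example Show', 'kind': 'tv series', 'year': 2005},
    {'title': 'Example Show: The Reboot', 'kind': 'tv series'},
]

# Scripted answers replace keyboard input so the sketch runs unattended.
answers = iter(['n', 'y'])
picked = choose_show(filter_tv_series(sample), ask=lambda prompt: next(answers))
print(picked['title'])
```

Passing the prompt function in as a parameter makes the selection easy to test; leaving it at its default uses the normal input() prompt.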

Once the program has confirmed the TV show, it retrieves the series information using the ID. To get the ratings information for each episode, there are two nested loops within the program. The outer loop steps through each season and the inner loop steps through each episode in a season. Some extra error handling is needed for shows that are still airing and for double episodes. Shows that are still airing may have future episodes without ratings, so the function stops as soon as it reaches an episode with an air date in the future. There may also be double episodes where only the first half has a rating, so any episode with missing information is skipped.

    # use id to retrieve episode data
    series = ia.get_movie(id)
    ia.update(series, 'episodes')

    # open csv file to save episode title and rating
    filename = show
    for punc in string.punctuation:
        filename = filename.replace(punc, '')
    filename = filename.replace(' ', '_').lower()
    f = open(f'{DIR_PATH}/input/{filename}.csv', 'w')
    f.write('Episode,Rating\n')
    # for each season
    for i in range(1, series['number of seasons']+1):
        # for each episode in season
        for j in range(1, len(series['episodes'][i].keys())+1):
            try:
                # check air dates for show still presently airing
                air = series['episodes'][i][j]['original air date']
                air = air.replace('.', '')
                try:
                    air = datetime.datetime.strptime(air, '%d %b %Y')
                except ValueError:
                    air = datetime.datetime.strptime(air, '%Y')
                if air > datetime.datetime.now():
                    f.close()
                    return show, filename
                # write data to file
                # double any quotes in the title so the CSV field stays valid
                title = series['episodes'][i][j]['title'].replace('"', '""')
                f.write(f"\"S{i}E{j} {title}\","
                        + f"{series['episodes'][i][j]['rating']}\n")
            except KeyError:
                pass  # typically issue with episode 0 or double ep
    f.close()
    return show, filename
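
The air date check is the trickiest part of this step, so here it is pulled out on its own. This is a sketch rather than the project's code: the function names are made up, the extra month-and-year format is an assumption about what IMDb might return, and the month names are assumed to be English abbreviations such as "Jan." parsed under the default locale.

```python
import datetime


def parse_air_date(text):
    """Parse strings like '20 Jan. 2008' or '2008'; return None on failure."""
    cleaned = text.replace('.', '').strip()
    # Try the most specific format first, then fall back to coarser ones.
    # '%b %Y' is an assumed extra format; the project only tries the other two.
    for fmt in ('%d %b %Y', '%b %Y', '%Y'):
        try:
            return datetime.datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


def has_aired(text, now=None):
    """Return True if the air date parses and is not in the future."""
    now = now or datetime.datetime.now()
    air = parse_air_date(text)
    return air is not None and air <= now


print(parse_air_date('20 Jan. 2008'))
print(parse_air_date('2008'))  # a year alone becomes January 1 of that year
print(has_aired('1 Jan. 2000'))
```

Returning None for an unreadable date lets the caller decide whether to skip the episode or stop, instead of crashing on an unexpected format.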

Step 5: Extracting Best/Worst Episodes

We want to extract the 5 highest-rated and the 5 lowest-rated episodes to display them on the graph. The functions get_best() and get_worst() are used for this. We use the built-in pandas DataFrame methods nlargest and nsmallest to easily retrieve the episode titles from the CSV file.

# returns a list containing the titles of the 5 highest rated episodes
def get_best(filename):
    df = pd.read_csv(f'{DIR_PATH}/input/{filename}.csv', encoding='latin')
    return list(df.nlargest(5, 'Rating')['Episode'])


# returns a list containing the titles of the 5 lowest rated episodes
def get_worst(filename):
    df = pd.read_csv(f'{DIR_PATH}/input/{filename}.csv', encoding='latin')
    return list(df.nsmallest(5, 'Rating')['Episode'])
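
To see what nlargest and nsmallest return, here is a small sketch on made-up data that asks for the top 2 and bottom 2 instead of 5. The DataFrame is built in memory so the example runs without a CSV file.

```python
import pandas as pd

# Made-up ratings, in the same Episode/Rating layout as the project's CSV.
df = pd.DataFrame({
    'Episode': ['S1E1 Pilot', 'S1E2 Two', 'S1E3 Three', 'S1E4 Four'],
    'Rating': [8.1, 7.4, 9.0, 6.8],
})

# Each call returns the matching rows already sorted by Rating,
# so selecting the Episode column gives the titles in ranked order.
best = list(df.nlargest(2, 'Rating')['Episode'])
worst = list(df.nsmallest(2, 'Rating')['Episode'])

print(best)   # ['S1E3 Three', 'S1E1 Pilot']
print(worst)  # ['S1E4 Four', 'S1E2 Two']
```

When several episodes share the same rating, the default keep='first' setting picks the ones that appear earliest in the file.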

Step 6: Plotting Data

We plot the data on a bar chart using Plotly. plot_ratings() is the function that is responsible for retrieving information from the CSV file, plotting the data, showing the graph, and saving the graph file. In addition to plotting the ratings by episode, we also add annotations listing our best and worst episodes. To prevent long episode titles from overflowing and shrinking the graph, we truncate each episode label on the graph to 40 characters.

Once the graph is ready, we show the plot. The plot is generated as an HTML file that will open in your browser for you to interact with. It also saves a local copy of the HTML for you to use later.

# plot the ratings on an interactive graph using plotly
def plot_ratings(show, filename, best, worst):
    df = pd.read_csv(f'{DIR_PATH}/input/{filename}.csv', encoding='latin')
    df['Episode'] = df['Episode'].str.slice(0, 40)
    fig = px.bar(df, title=f'{show} IMDb Ratings', x='Episode', y='Rating',
                 color='Rating', color_continuous_scale='bluyl')
    fig.add_annotation(
        text='<b>Highest Rated Episodes:</b><br>' + '<br>'.join(best),
        xref="paper", yref="paper", x=0.1, y=0.02,
        align='left', bgcolor="#f1f1f1", showarrow=False
        )
    fig.add_annotation(
        text='<b>Lowest Rated Episodes:</b><br>' + '<br>'.join(worst),
        xref="paper", yref="paper", x=0.9, y=0.02,
        align='left', bgcolor="#f1f1f1", showarrow=False
        )
    fig.show()
    fig.write_html(f'{DIR_PATH}/output/{filename}.html')

Step 7: Run!

Now, it's time to run the program! Run the program from a terminal using the command below or run it by double-clicking the executable Python script. There will be two user prompts: one asking for the name of the TV show and the other asking you to confirm the search result.

python3 main.py

There are some additional modifications you can make to the code if you want to customize it. For example, you can change the color gradient of the bar graph by passing a different value for color_continuous_scale in plot_ratings(). To see which gradients are available, look at the built-in continuous color scales in the Plotly documentation.