Decoding IMDB

A Data Science Approach to Movie Ratings

Author

Riley Svensson

Published

April 15, 2025

Act I: Setting the Stage


You’re killing me, Smalls

Run, Forrest, Run!

May the Force be with you


Cover


The avid movie watcher instantly recognizes these iconic lines, each one imprinting smiles and memories across generations, etched into film history. But are movies like this dying? Are we, as audiences, to blame for growing too selective? Has filmmaking itself changed? Or is there something more complex at play?

This analysis aims to answer these questions using data and machine learning [1]. While some may fear that analytics could turn filmmaking into an equation, understanding the key factors that influence audience perception will always be important. The role of AI in filmmaking, like in any industry, is not to replace creativity but to enhance understanding. These tools are not blueprints for high-quality storytelling, but lenses through which we can analyze trends, audience preferences, and the shifting landscape of cinema.


Data Collection & Cleaning

A strong analysis starts with a strong dataset. At first, we considered using a pre-scraped IMDb dataset of 1,000 movies, but the limitations quickly became clear. The sample was too small, and there was no way to confirm that the movies had been selected randomly (we can’t simply pick the best or most recent films). To ensure a dataset built for real insights, we needed to collect our own data. This gave us full control over what was included, minimized cleanup like fixing formatting inconsistencies, and let us generalize our results to the broader movie population with confidence.

Initially, our scraping [2] approach was straightforward: pull random IMDb movie IDs and filter out bad data afterward. But as we worked through the dataset, a clear pattern emerged: movies with missing ratings and descriptions almost always had low vote counts. Instead of collecting first and filtering later, we restructured the process. Our scraper first checked a movie’s vote count using IMDbPY. Only if the votes exceeded 5,000 did it proceed to retrieve the full movie data. This ensured our dataset was built on movies with significant audience engagement, eliminating the need for excessive wrangling [3] and reducing the likelihood of bias from underseen, low-quality films.
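Below is a minimal sketch of that vote-count gate, assuming IMDbPY’s standard `IMDb()` interface. Our actual scraper split the check and the full retrieval into two stages; this condensed version shows the same filter logic in one call, and the helper name is illustrative.

from imdb import IMDb  # IMDbPY, now published on PyPI as "cinemagoer"

ia = IMDb()

def fetch_if_popular(movie_id, min_votes=5000):
    """Return full movie data only when the title clears the vote threshold."""
    movie = ia.get_movie(movie_id)         # accepts the numeric part of a "tt..." ID
    if movie.get("votes", 0) < min_votes:  # Movie objects are dict-like
        return None                        # skip underseen titles entirely
    return movie

record = fetch_if_popular("0109830")       # Forrest Gump's IMDb ID
if record:
    print(record["title"], record.get("rating"), record.get("votes"))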

After cleaning and reshaping the data (engineering a few new features and standardizing variables for analysis), each row represented a single movie, ready for exploration, in a dataset totaling over 9,660 movies. Here’s a snapshot of Forrest Gump, classified as an Epic, with over 2.3 million votes, a 34-word description, and a standout IMDb rating of 8.8.

Show Code
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
import plotly.io as pio
pio.renderers.default = "iframe_connected"  # allows for stable plotly rendering for HTML export

# Set pandas to display the full content of columns
pd.set_option('display.max_colwidth', None) 

# Read-in data into our dataframe
df = pd.read_csv("/Users/rileysvensson/Desktop/CAA - Data Analyst/Projects/imdb_data/final/IMDB_clean/IMDB.csv")

# Clean the 'Genres' column by removing "Back to top" text from our scraping
df["Genres"] = df["Genres"].str.replace("Back to top", "", regex=False).str.strip()


# Define the set of well-known genres to extract from niche genres
standard_genres = {
    "Action", "Adult", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Documentary",
    "Drama", "Family", "Fantasy", "Film Noir", "Game Show", "History", "Horror", "Musical",
    "Music", "Mystery", "News", "Reality-TV", "Romance", "Sci-Fi", "Short", "Sport",
    "Talk-Show", "Thriller", "War", "Western"
}

# Function to strictly extract standard genres if they appear anywhere in the genre string
def extract_standard_genres(genre_str):
    if pd.isna(genre_str):
        return None  # Handle missing values
    genre_str_lower = genre_str.lower()  # Convert to lowercase for matching
    filtered_genres = {std_genre for std_genre in standard_genres if std_genre.lower() in genre_str_lower}
    return ", ".join(filtered_genres) if filtered_genres else None

# Apply the strict genre extraction function
df["Cleaned_Genres"] = df["Genres"].apply(extract_standard_genres)

# Create a binary column for "Highly Rated" movies (IMDb rating >= 7.0)
df["Highly_Rated"] = (df["Rating"] >= 7).astype(int)

# Function to clean and deduplicate actor names within each row, handling NAN's
def clean_stars_column(stars):
    if pd.isna(stars):
        return ""  
    
    # Split names by commas, strip spaces, and deduplicate (sorted for stable order)
    unique_names = sorted(set(name.strip() for name in stars.split(",")))
    
    # Rejoin into a single, clean string
    return ", ".join(unique_names)

# Apply function to "Stars" column
df["Stars"] = df["Stars"].apply(clean_stars_column)

# Create function to count # of words in the movie description
def count_words(description):
    
    # Check if "Read all", "..." or "--" is in the text and remove truncated parts
    truncated = False
    for marker in ["Read all", "...", "--"]:
        if marker in description:
            description = description.split(marker)[0].strip()
            truncated = True

    # Count words in the cleaned description
    word_count = len(description.split())

    # Indicate if the description was truncated, by adding +
    return word_count if not truncated else f"{word_count}+"  

# Apply function to Description column
df["Description Word Count"] = df["Description"].apply(count_words)

# Create function to bin word count
def categorize_word_count(word_count):
    count = int(str(word_count).replace("+", ""))  # strip the "+" truncation marker
    for upper, label in [(5, "<5"), (10, "5-10"), (15, "10-15"), (20, "15-20"), (25, "20-25"), (30, "25-30")]:
        if count < upper:
            return label
    return "30+"

# Apply binning function
df["Word_Count_Binned"] = df["Description Word Count"].apply(categorize_word_count)

# Count number of unique genres
df["Num_Genres"] = df["Cleaned_Genres"].apply(lambda x: len(x.split(",")) if pd.notna(x) and x.strip() else 0)

# Keep only rows where 'Year' starts with 19 or 20
df = df[df["Year"].astype(str).str.match(r"^(19|20)\d{2}$")]

# Convert Year to numeric
df["Year"] = pd.to_numeric(df["Year"])

# Function to extract the first genre from "Cleaned_Genres"
def extract_primary_genre(genre_str):
    if pd.isna(genre_str) or genre_str.strip() == "":
        return None  # Handle missing values
    return genre_str.split(",")[0].strip()  # Take the first genre

# Apply function to create the "Cleaned_Primary_Genre" column
df["Cleaned_Primary_Genre"] = df["Cleaned_Genres"].apply(extract_primary_genre)

# Create function to create new column 1 if the Writer of a movie is also the Director, 0 otherwise
def director_is_writer(row):
    director = str(row['Director']).strip().lower()
    writers = str(row['Writers']).split(',')
    writers_cleaned = [w.strip().lower() for w in writers]
    return int(director in writers_cleaned)

df['Director_Is_Writer'] = df.apply(director_is_writer, axis=1)

# Drop movies with a duration greater than 300 minutes, as these were very obscure outliers
df = df[df["Duration"] <= 300]

# Display Forrest Gump as an example record
display_movie_ex = df[df['Title'] == 'Forrest Gump']

display_movie_ex
IMDB_ID: tt0109830
Title: Forrest Gump
Stars: Gary Sinise, Robin Wright, Tom Hanks
Year: 1994
Genres: Epic, Drama, Romance
Description: The history of the United States from the 1950s to the '70s unfolds from the perspective of an Alabama man with an IQ of 75, who yearns to be reunited with his childhood sweetheart.
Certificate: PG-13
Writers: Winston Groom, Eric Roth
Director: Robert Zemeckis
Primary Genre: Epic
Votes: 2359386
Rating: 8.8
Duration: 142.0
Cleaned_Genres: Romance, Drama
Highly_Rated: 1
Description Word Count: 34
Word_Count_Binned: 30+
Num_Genres: 2
Cleaned_Primary_Genre: Romance
Director_Is_Writer: 0

Get to Know the Cast: Data Definition

Every film has its cast of characters, and so does every dataset. This section introduces the key variables behind each movie in the analysis, from obvious stars like IMDb rating and genre to quieter supporting roles like description length and certificate. Understanding these features [4] lays the groundwork for later modeling, helping us interpret which factors might truly shape audience opinion.


Show Code
import pandas as pd
from IPython.display import HTML

# Define glossary as a list of dictionaries
glossary_data = [
    {"Variable": "Title", "Description": "The name of the movie"},
    {"Variable": "Year", "Description": "The year the movie was released"},
    {"Variable": "Genres", "Description": "All genres from IMDb"},
    {"Variable": "Cleaned_Genres", "Description": "Filtered version of `Genres` with only standard genres"},
    {"Variable": "Cleaned_Primary_Genre", "Description": "First genre listed in `Cleaned_Genres`"},
    {"Variable": "Rating", "Description": "IMDb rating out of 10"},
    {"Variable": "Votes", "Description": "Total number of IMDb votes received (as of March 2025)"},
    {"Variable": "Duration", "Description": "Runtime in minutes"},
    {"Variable": "Certificate", "Description": "Age suitability rating (e.g., PG, PG-13, R)"},
    {"Variable": "Stars", "Description": "List of lead actors"},
    {"Variable": "Writers", "Description": "List of lead writers"},
    {"Variable": "Director", "Description": "Main director"},
    {"Variable": "Description", "Description": "IMDb plot summary"},
    {"Variable": "Description Word Count", "Description": "Word count of the description"},
    {"Variable": "Word_Count_Binned", "Description": "Binned version of description length (e.g., <5, 10-15, 30+)"},
    {"Variable": "Num_Genres", "Description": "Count of genres per movie"},
    {"Variable": "Highly_Rated", "Description": "1 if movie’s rating is above 7.0, otherwise 0"},
    {"Variable": "Director_Is_Writer", "Description": "1 if the director is also a writer (creative control)"},
]


# Build the DataFrame from the glossary entries, then convert to HTML without index
glossary_df = pd.DataFrame(glossary_data)
glossary_html = glossary_df.to_html(index=False, escape=False)

# Display
display(HTML(glossary_html))
Variable Description
Title The name of the movie
Year The year the movie was released
Genres All genres from IMDb
Cleaned_Genres Filtered version of `Genres` with only standard genres
Cleaned_Primary_Genre First genre listed in `Cleaned_Genres`
Rating IMDb rating out of 10
Votes Total number of IMDb votes received (as of March 2025)
Duration Runtime in minutes
Certificate Age suitability rating (e.g., PG, PG-13, R)
Stars List of lead actors
Writers List of lead writers
Director Main director
Description IMDb plot summary
Description Word Count Word count of the description
Word_Count_Binned Binned version of description length (e.g., <5, 10-15, 30+)
Num_Genres Count of genres per movie
Highly_Rated 1 if movie’s rating is 7.0 or higher, otherwise 0
Director_Is_Writer 1 if the director is also a writer (creative control)


Exploratory Data Analysis

Through exploratory data analysis (EDA), we let the data speak for itself while also applying our intuition to guide initial insights. EDA is a crucial first step in any analysis, as it helps surface hidden patterns, validate assumptions, and frame the questions worth asking as we dig deeper into the data. With this approach in mind, we turned to the IMDb dataset to uncover what shapes audience ratings. One of the first questions we asked was whether the sheer volume of films produced each year could be impacting how audiences perceive and rate them.

Show Code
import plotly.graph_objects as go
import pandas as pd
import plotly.io as pio

# Use the notebook-friendly renderer inside Quarto or Jupyter
pio.renderers.default = "plotly_mimetype+notebook"

# Filter out invalid years (keeping only reasonable movie years)
df_movie_count_over_time = df[(df['Year'] >= 1940) & (df['Year'] <= 2023)]

# Count movies per year
movie_counts = df_movie_count_over_time["Year"].value_counts().sort_index()
df_movies = pd.DataFrame({'Year': movie_counts.index, 'Count': movie_counts.values})

fig = go.Figure(go.Bar(
    x=df_movies['Year'], 
    y=df_movies['Count'], 
    marker_color="#E6B91E"
))

fig.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title=dict(text="🎬 Movies by Year", font=dict(size=26, color="gold"), x=0.5),
    xaxis=dict(
        title="Year",
        tickmode='linear',
        dtick=5,
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white'
    ),
    yaxis=dict(
        title="Number of Movies",
        gridcolor="white",
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white'
    ),
    showlegend=False
)

fig
Show Code
import plotly.graph_objects as go
import pandas as pd
import plotly.io as pio

# Use this to render inside Quarto or Jupyter Notebook
pio.renderers.default = "plotly_mimetype+notebook"

# Filter out invalid years
df_rating_over_time = df[(df['Year'] >= 1940) & (df['Year'] <= 2023)]

# Calculate average rating per year
avg_rating_per_year = df_rating_over_time.groupby("Year")["Rating"].mean().reset_index()

# Create line chart
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=avg_rating_per_year["Year"],
    y=avg_rating_per_year["Rating"],
    mode="lines+markers",
    marker=dict(color="red", size=6),
    line=dict(color="red", width=2),
    name="Average Rating"
))

# Update layout to match the bar chart theme
fig.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title=dict(text="⭐ Average IMDb Rating Over Time", font=dict(size=26, color="gold"), x=0.5),
    xaxis=dict(
        title="Year",
        tickmode='linear',
        dtick=5,
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white',
        gridcolor='white'
    ),
    yaxis=dict(
        title="Average Rating",
        range=[5, 9],
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white',
        gridcolor='white'
    ),
    showlegend=False
)

fig


Our intuition wasn’t far off.

These figures paint a compelling story. As more movies are made each year, ratings have in turn steadily declined. While correlation doesn’t imply causation, this pattern raises several key questions:

  • Has the rapid increase in film production diluted overall quality?
  • Have audiences become more critical & selective in a world of endless choice?
  • Or have audiences grown indifferent to the time, effort, and money put into every film?

It’s also possible that modern films face a tougher climb. Older classics that remain in circulation today may reflect a kind of survivorship bias, as only the best from past decades endure in public memory and have votes to show for it. Meanwhile, today’s releases are flooded into a saturated market, fighting for attention and cultural relevance. In this environment, even strong films may struggle to stand out, and average ones fade quickly into obscurity.


Movie Ratings Increase with Number of Votes

While ratings offer a snapshot of how a film is received, they don’t exist in a vacuum. Behind every number is an audience—watching, reacting, and choosing whether or not to engage. To better understand what drives these ratings and how they gain traction, it’s helpful to look not just at the scores themselves, but at the level of audience participation behind them.

Show Code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Define asymptotic model
def asymptotic_model(x, L=9, k=0.00001):
    return L - np.exp(-k * x)

# Data
votes = df["Votes"]
ratings = df["Rating"]

# Asymptotic curve
x_vals = np.linspace(votes.min(), votes.max(), 1000)
y_asymptote = asymptotic_model(x_vals)

# Main plot
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
main_ax = plt.gca()
main_ax.set_facecolor("black")
plt.gcf().patch.set_facecolor("black")

# All data
sns.scatterplot(x=votes, y=ratings, alpha=0.6, color="#E6B91E", s=30, ax=main_ax, label="All Movies")
main_ax.plot(x_vals, y_asymptote, color="green", linestyle="--", linewidth=2, label="Asymptotic Fit (→ 9)")

# Main plot styling
main_ax.set_xlim(0, votes.max())
main_ax.set_ylim(0, 10)
main_ax.set_xlabel("Number of Votes (mil)", fontsize=14, color="white")
main_ax.set_ylabel("Rating", fontsize=14, color="white")
main_ax.set_title("IMDb Rating vs. Number of Votes", fontsize=20, color="gold", weight="bold")
main_ax.tick_params(colors='white', labelsize=12)
for spine in main_ax.spines.values():
    spine.set_edgecolor('white')
main_ax.grid(True, linestyle='--', alpha=0.5, color='white')
main_ax.legend(facecolor="black", edgecolor="white", fontsize=12, labelcolor='white')

# Layout adjustment (no tight_layout)
plt.subplots_adjust(top=0.92, bottom=0.1, left=0.1, right=0.95)
plt.show()


The figure above illustrates a clear and expected trend: movies with higher ratings tend to receive a significantly greater number of votes. This positive, logarithmic [5] relationship (which levels off around 9.0, as no film has ever reached a perfect 10) suggests that top-rated films don’t just flourish on release. They also inspire wider attention and engagement. Acclaimed movies often benefit from word-of-mouth, rewatchability, and cultural impact, all of which contribute to a snowball effect of visibility and votes.
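To make that leveling-off concrete, plug the constants hard-coded in the fit above (L = 9, k = 0.00001) into the asymptotic model:

$$f(x) = 9 - e^{-kx}, \qquad f(100{,}000) = 9 - e^{-1} \approx 8.63, \qquad \lim_{x \to \infty} f(x) = 9$$

The curve climbs steeply through the low-vote range and then flattens just below the 9.0 ceiling, matching the plateau visible in the scatter plot.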

In other words, ratings and votes often reinforce one another: the better a movie is perceived, the more people are inclined to watch and rate it.


Act II: What Makes a High-Rated Movie?

Building a Model to Predict IMDb Rating

This brings us to the heart of the analysis: modeling [6].

This model is not designed to decode the perfect formula for producing a high-rated movie, as doing so would strip the art of its soul. Instead, its purpose is to explore the patterns behind audience perception, helping us understand how films resonate across time. Just as AI assists in medicine, finance, and countless other fields, here it serves not as a creator, but as an observer—one that can uncover insights without dictating creativity.

To evaluate model performance, we used two standard scoring [7] metrics:

Show Code
from IPython.display import HTML

display(HTML('''
<table style="width:100%; border-collapse: collapse; margin: 2em 0; font-size:15px;">
  <thead style="background-color:#f9f9f9;">
    <tr>
      <th style="padding:10px; border: 1px solid #999;">Metric</th>
      <th style="padding:10px; border: 1px solid #999;">What It Means</th>
      <th style="padding:10px; border: 1px solid #999;">Why It Matters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="padding:10px; border: 1px solid #999;"><strong>RMSE</strong> (Root Mean Squared Error)</td>
      <td style="padding:10px; border: 1px solid #999;">Shows how far off the model’s predictions are from the actual movie ratings, on average.</td>
      <td style="padding:10px; border: 1px solid #999;">An RMSE of 0.89 means the model is usually within 1 star of the actual IMDb rating (on a 10-star scale).</td>
    </tr>
    <tr>
      <td style="padding:10px; border: 1px solid #999;"><strong>R² Score</strong> (Coefficient of Determination)</td>
      <td style="padding:10px; border: 1px solid #999;">Tells us how well the model explains the differences in movie ratings.</td>
      <td style="padding:10px; border: 1px solid #999;">A higher R² means the model does a better job at understanding what drives audience scores.</td>
    </tr>
  </tbody>
</table>
'''))
Metric What It Means Why It Matters
RMSE (Root Mean Squared Error) Shows how far off the model’s predictions are from the actual movie ratings, on average. An RMSE of 0.89 means the model is usually within 1 star of the actual IMDb rating (on a 10-star scale).
R² Score (Coefficient of Determination) Tells us how well the model explains the differences in movie ratings. A higher R² means the model does a better job at understanding what drives audience scores.

Together, these metrics provide a lens for assessing how well the model understands audience sentiment. Not perfectly, but well enough to reveal patterns beneath the noise.
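As a quick illustration of both metrics, here is a toy example with made-up ratings (not output from the models below), using the same sklearn functions the analysis relies on:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual vs. predicted IMDb ratings for five movies
y_true = np.array([8.8, 6.5, 7.2, 5.9, 7.8])
y_pred = np.array([8.1, 6.9, 7.5, 6.4, 7.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # average miss, in stars
r2 = r2_score(y_true, y_pred)                       # share of rating variance explained

print(f"RMSE: {rmse:.2f}")  # ≈ 0.50 stars off on average
print(f"R²: {r2:.2f}")      # ≈ 0.76 for this toy data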


Show Code
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from tensorflow.keras import layers, models
from tensorflow.keras.regularizers import l2
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import Markdown, display

# Suppress TensorFlow logs
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# ========================
# 1. Feature Configuration
# ========================

categorical_features = ["Certificate", "Word_Count_Binned","Director", "Cleaned_Primary_Genre", "Director_Is_Writer"]
numerical_features = ["Duration", "Num_Genres", "Year"]

drop_cols = ["IMDB_ID", "Title", "Highly_Rated", "Stars","Writers",
             "Genres", "Description", "Primary Genre", 
             "Description Word Count", "Votes"]

# ========================
# 2. Data Preparation
# ========================

df_movies_features = df.drop(columns=drop_cols, errors="ignore").dropna()

X = df_movies_features[categorical_features + numerical_features]
y = df_movies_features["Rating"].values.reshape(-1, 1)

# Normalize target
scaler_y = MinMaxScaler()
y = scaler_y.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ========================
# 3. Preprocessing Pipeline
# ========================

ct_all = ColumnTransformer([
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
    ]), categorical_features),

    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numerical_features)
]).set_output(transform="pandas")

X_train = ct_all.fit_transform(X_train)
X_test = ct_all.transform(X_test)

X_train = np.array(X_train, dtype=np.float32)
X_test = np.array(X_test, dtype=np.float32)

# ========================
# 4. Neural Network Training
# ========================

model = models.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(256, activation="relu", kernel_regularizer=l2(0.01)),
    layers.Dropout(0.3),
    layers.Dense(128, activation="relu", kernel_regularizer=l2(0.01)),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1)
])

model.compile(optimizer="adam", loss="mse", metrics=["mae"])

history = model.fit(X_train, y_train, epochs=200, batch_size=16, validation_data=(X_test, y_test), verbose=0)

# ========================
# 5. Evaluate Model
# ========================

test_loss, test_mae = model.evaluate(X_test, y_test, verbose=1)

y_pred = model.predict(X_test)
y_pred = scaler_y.inverse_transform(y_pred)
y_test = scaler_y.inverse_transform(y_test)

nn_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
nn_r2 = r2_score(y_test, y_pred)

display(Markdown(f"""
### 🧠 Neural Network Performance
**Features used:** {', '.join(categorical_features + numerical_features)}  
**Neural Network RMSE:** {nn_rmse:.2f}  
**Neural Network R² Score:** {nn_r2:.2f}
"""))
91/91 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0153 - mae: 0.0890
91/91 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step

🧠 Neural Network Performance

Features used: Certificate, Word_Count_Binned, Director, Cleaned_Primary_Genre, Director_Is_Writer, Duration, Num_Genres, Year
Neural Network RMSE: 0.95
Neural Network R² Score: 0.17

Show Code
import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import Markdown, display

# Suppress TensorFlow logs (in case you're switching between models)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# ========================
# 1. Feature Configuration
# ========================

categorical_features = ["Certificate", "Word_Count_Binned", "Director", "Cleaned_Primary_Genre", "Director_Is_Writer"]
numerical_features = ["Duration", "Num_Genres", "Year"]

drop_cols = ["IMDB_ID", "Title", "Highly_Rated", "Stars", "Writers",
             "Genres", "Description", "Primary Genre",
             "Description Word Count", "Votes"]

# ========================
# 2. Data Preparation
# ========================

df_movies_features = df.drop(columns=drop_cols, errors="ignore").dropna()

X = df_movies_features[categorical_features + numerical_features]
y = df_movies_features["Rating"].values.reshape(-1, 1)

scaler_y = MinMaxScaler()
y_scaled = scaler_y.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_scaled, test_size=0.3, random_state=43)

# ========================
# 3. Preprocessing Pipeline
# ========================

ct_all = ColumnTransformer([
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
    ]), categorical_features),

    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numerical_features)
]).set_output(transform="pandas")

X_train = ct_all.fit_transform(X_train)
X_test = ct_all.transform(X_test)

X_train = np.array(X_train, dtype=np.float32)
X_test = np.array(X_test, dtype=np.float32)

# ========================
# 4. Train Random Forest
# ========================

rf_model = RandomForestRegressor(n_estimators=200, random_state=2, n_jobs=-1)
rf_model.fit(X_train, y_train.ravel())

y_pred_rf = rf_model.predict(X_test)

y_pred_rf = scaler_y.inverse_transform(y_pred_rf.reshape(-1, 1))
y_test_original = scaler_y.inverse_transform(y_test)

# ========================
# 5. Evaluate Model (Train/Test Split)
# ========================

rf_rmse = np.sqrt(mean_squared_error(y_test_original, y_pred_rf))
rf_r2 = r2_score(y_test_original, y_pred_rf)

display(Markdown(f"""
### 🌳 Random Forest Performance
**Features used:** {', '.join(categorical_features + numerical_features)}  
**Random Forest RMSE:** {rf_rmse:.2f}  
**Random Forest R² Score:** {rf_r2:.2f}
"""))

# ========================
# 6. Cross-Validation (Optional but Robust)
# ========================

# Use original unscaled y for meaningful R²
y_unscaled = y.ravel()

# Create full pipeline using the existing ColumnTransformer
pipeline_rf = make_pipeline(
    ct_all,
    RandomForestRegressor(n_estimators=200, random_state=2, n_jobs=-1)
)

# Set up cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2_scores = cross_val_score(pipeline_rf, X, y_unscaled, cv=kf, scoring="r2")

# Display results
display(Markdown(f"""
#### Cross-Validation Performance (5-Fold)
**Average R² Score:** {cv_r2_scores.mean():.3f}  
**R² Scores per Fold:** {', '.join(f'{score:.3f}' for score in cv_r2_scores)}
"""))

🌳 Random Forest Performance

Features used: Certificate, Word_Count_Binned, Director, Cleaned_Primary_Genre, Director_Is_Writer, Duration, Num_Genres, Year
Random Forest RMSE: 0.87
Random Forest R² Score: 0.27

Cross-Validation Performance (5-Fold)

Average R² Score: 0.289
R² Scores per Fold: 0.272, 0.315, 0.302, 0.294, 0.263

Show Code
import os
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import Markdown, display

# Suppress TensorFlow logs
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# ========================
# 1. Feature Configuration
# ========================

categorical_features = ["Certificate", "Word_Count_Binned", "Director", "Cleaned_Primary_Genre", "Director_Is_Writer"]
numerical_features = ["Duration", "Num_Genres", "Year"]

drop_cols = ["IMDB_ID", "Title", "Highly_Rated", "Stars", "Writers",
             "Genres", "Description", "Primary Genre",
             "Description Word Count", "Votes"]

# ========================
# 2. Data Preparation
# ========================

df_movies_features = df.drop(columns=drop_cols, errors="ignore").dropna()

X = df_movies_features[categorical_features + numerical_features]
y = df_movies_features["Rating"].values.reshape(-1, 1)

scaler_y = MinMaxScaler()
y = scaler_y.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ========================
# 3. Preprocessing Pipeline
# ========================

ct_all = ColumnTransformer([
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
    ]), categorical_features),

    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numerical_features)
]).set_output(transform="pandas")

X_train = ct_all.fit_transform(X_train)
X_test = ct_all.transform(X_test)

X_train = np.array(X_train, dtype=np.float32)
X_test = np.array(X_test, dtype=np.float32)

# ========================
# 4. Train XGBoost Regressor
# ========================

xgb_model = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1, random_state=42, n_jobs=-1)
xgb_model.fit(X_train, y_train.ravel())

# Predict
y_pred_xgb = xgb_model.predict(X_test)

# Rescale predictions
y_pred_xgb = scaler_y.inverse_transform(y_pred_xgb.reshape(-1, 1))
y_test_original = scaler_y.inverse_transform(y_test)

# ========================
# 5. Evaluate Model
# ========================

xgb_rmse = np.sqrt(mean_squared_error(y_test_original, y_pred_xgb))
xgb_r2 = r2_score(y_test_original, y_pred_xgb)

display(Markdown(f"""
### 🔋 Extreme Gradient (XG) Boost Performance
**Features used:** {', '.join(categorical_features + numerical_features)}  
**XGB RMSE:** {xgb_rmse:.2f}  
**XGB R² Score:** {xgb_r2:.2f}
"""))

🔋 Extreme Gradient (XG) Boost Performance

Features used: Certificate, Word_Count_Binned, Director, Cleaned_Primary_Genre, Director_Is_Writer, Duration, Num_Genres, Year
XGB RMSE: 0.89
XGB R² Score: 0.28


Show Code
from IPython.display import HTML

display(HTML('''
<h3 style="text-align:center; margin-top:30px;">Model Showdown: A Tight Cut</h3>
<table style="width:100%; border-collapse: collapse; font-size:15px; margin: 2em 0;">
  <thead style="background-color:#f9f9f9;">
    <tr>
      <th style="padding:10px; border: 1px solid #999;">Model Type</th>
      <th style="padding:10px; border: 1px solid #999;">Strengths</th>
      <th style="padding:10px; border: 1px solid #999;">Weaknesses</th>
      <th style="padding:10px; border: 1px solid #999;">RMSE</th>
      <th style="padding:10px; border: 1px solid #999;">R² Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="padding:10px; border: 1px solid #999;">Random Forest</td>
      <td style="padding:10px; border: 1px solid #999;">Easy to interpret, handles mixed data well</td>
      <td style="padding:10px; border: 1px solid #999;">Slightly less flexible than XGBoost</td>
      <td style="padding:10px; border: 1px solid #999;">0.87</td>
      <td style="padding:10px; border: 1px solid #999;">0.289</td>
    </tr>
    <tr>
      <td style="padding:10px; border: 1px solid #999;">XGBoost</td>
      <td style="padding:10px; border: 1px solid #999;">Captures complex nonlinear interactions</td>
      <td style="padding:10px; border: 1px solid #999;">More complex to interpret</td>
      <td style="padding:10px; border: 1px solid #999;">0.89</td>
      <td style="padding:10px; border: 1px solid #999;">0.28</td>
    </tr>
  </tbody>
</table>
'''))

Model Showdown: A Tight Cut

Model Type Strengths Weaknesses RMSE R² Score
Random Forest Easy to interpret, handles mixed data well Slightly less flexible than XGBoost 0.87 0.289
XGBoost Captures complex nonlinear interactions More complex to interpret 0.89 0.28

Predicting movie ratings isn’t just a technical challenge; it’s a human one. Preferences are personal, emotional, and unpredictable. Even Netflix’s famed $1 million competition only improved RMSE by about 0.09 points over their baseline, reinforcing a natural ceiling for accuracy. This project’s 0.89 RMSE lands within that same range, despite a far simpler modeling stack. It speaks to a key truth: complexity doesn’t always win, especially when interpretability matters.

While XGBoost is often favored for squeezing out marginal gains, Random Forest edged it out here, with a lower RMSE (0.87 vs. 0.89) and a slightly higher R² (0.289 vs. 0.28). With R² values above 0.25 often considered respectable in noisy, human-preference domains, Random Forest offers the best trade-off between performance and interpretability, making it the most practical choice here.


Feature Selection & Importance

In machine learning, a black box model refers to an algorithm that can make highly accurate predictions, but doesn’t easily reveal how it arrives at those predictions. Unlike simpler models where each variable’s effect is clearly laid out, black box models like Random Forest and XGBoost rely on layered decision rules that are often hidden from direct view.

But “black box” doesn’t have to mean unknowable. With the right tools, we can begin to unpack the inner logic of these models and pull meaning from complexity. Instead of diving into dense technical plots (see Appendix), we’ve highlighted a few of the most influential factors that shaped predictions, focusing on patterns around runtime and creative control.
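As one example of such a tool (added here for illustration, and not part of the original analysis): sklearn’s permutation importance shuffles one feature at a time and measures how much the model’s score drops, which is often more reliable than impurity-based importances. A sketch, assuming the fitted rf_model, transformed X_test, scaled y_test, and ct_all from the Random Forest cell are still in scope:

from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the drop in R²; large drops mean the
# model leaned heavily on that feature
result = permutation_importance(
    rf_model, X_test, y_test.ravel(), scoring="r2", n_repeats=10, random_state=42
)

feature_names = ct_all.get_feature_names_out()
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(f"{feature_names[idx]}: {result.importances_mean[idx]:.4f}")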


Show Code
import numpy as np

# Define the threshold for "Top Rated" movies (e.g., IMDb rating >= 7.0)
top_rated_movies = df[df["Highly_Rated"] == 1]["Duration"].dropna()

# Confidence intervals to calculate
confidence_intervals = [75, 90, 99]

# Dictionary to store results
duration_ranges = {}

for confidence in confidence_intervals:
    # Compute lower and upper bounds dynamically based on confidence interval
    lower_percentile = (100 - confidence) / 2  # e.g., for 75%, this gives 12.5%
    upper_percentile = 100 - lower_percentile  # e.g., for 75%, this gives 87.5%

    lower_bound, upper_bound = np.percentile(top_rated_movies, [lower_percentile, upper_percentile])

    # Convert minutes to hours and minutes
    lower_hours, lower_minutes = divmod(int(lower_bound), 60)
    upper_hours, upper_minutes = divmod(int(upper_bound), 60)

    # Store result
    duration_ranges[confidence] = (f"{lower_hours}h {lower_minutes}m", f"{upper_hours}h {upper_minutes}m")

# 75% of top-rated movies are between 1h 32m and 2h 22m
# 90% of top-rated movies are between 1h 24m and 2h 42m
# 99% of top-rated movies are between 1h 5m and 3h 21m

While most top-rated movies fall within the typical runtime sweet spot, a few of the highest-rated films, like The Godfather and Lord of the Rings, break this mold with runtimes of over 3 hours. These exceptions show that while longer films can be the best of the best, they tend to be rare and must truly earn their length.


Act III: Show Me the Money (Data)


Sometimes, descriptive analytics [8] are just as telling and valuable to an organization as a well-built predictive model. So when we have both, why not use them? In a landscape where it’s difficult to gather accurate, representative data on the film industry, given the sheer volume of creative works released every day, these types of insights can be just as impactful.

Intuiting who the most popular actors and directors are from public perception is useful in itself, but attaching a measurable value to that perception is even more powerful. These creatives appear to elevate the rating of a feature film simply by having their name in the credits, and their average rating across the works they’ve been part of supports that claim.

Show Code
import pandas as pd
from IPython.display import HTML

# Ensure 'Rating' is numeric
df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")

# --- 1️⃣ Actors with Highest Average IMDb Rating (Min 10 Movies) ---
df_actors = df.assign(Actor=df["Stars"].str.split(", ")).explode("Actor")
actor_stats = df_actors.groupby("Actor").agg(
    avg_rating=("Rating", "mean"),
    num_movies=("Actor", "count")
).reset_index()

# Filter, round, and sort
top_actors = actor_stats[actor_stats["num_movies"] >= 10].copy()
top_actors["avg_rating"] = top_actors["avg_rating"].round(2)
top_actors = top_actors.sort_values("avg_rating", ascending=False)

top_actors_display = top_actors.head(20).rename(columns={
    "Actor": "Actor",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Add a ranking column from 1 to 20
top_actors_display = top_actors_display.reset_index(drop=True)
top_actors_display.index += 1
top_actors_display.index.name = "Rank"
top_actors_display_md = top_actors_display.reset_index()

# Display as HTML table
display(HTML('<h3 style="text-align: center;">🎭 Top 20 Actors by Average Rating</h3>'))
display(HTML(top_actors_display_md.to_html(index=False, classes="table", border=0)))

🎭 Top 20 Actors by Average Rating

Rank Actor Avg. Rating Number of Movies (min 10)
1 Humphrey Bogart 7.70 11
2 James Stewart 7.64 12
3 Leonardo DiCaprio 7.58 20
4 Orson Welles 7.55 10
5 Henry Fonda 7.54 11
6 Ingrid Bergman 7.50 10
7 Ian McKellen 7.50 13
8 Tony Leung Chiu-wai 7.46 11
9 Song Kang-ho 7.45 10
10 Cary Grant 7.37 17
11 Joan Allen 7.31 10
12 Laurence Olivier 7.28 10
13 Brad Pitt 7.28 30
14 Patrick Stewart 7.27 11
15 James Mason 7.27 10
16 Anthony Quinn 7.25 11
17 Kirk Douglas 7.22 12
18 Chris Cooper 7.21 10
19 Marlon Brando 7.21 19
20 Irrfan Khan 7.19 12
Show Code
# --- 2️⃣ Directors with Highest Average IMDb Rating (Min 10 Movies) ---
df_directors = df.groupby("Director").agg(
    avg_rating=("Rating", "mean"),
    num_movies=("Director", "count")
).reset_index()

top_directors = df_directors[df_directors["num_movies"] >= 10].sort_values("avg_rating", ascending=False)
top_directors_display = top_directors.head(20).rename(columns={
    "Director": "Director",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Round "Avg. Rating" to 2 decimal places
top_directors_display["Avg. Rating"] = top_directors_display["Avg. Rating"].round(2)

# Reset index and add rank
top_directors_display = top_directors_display.reset_index(drop=True)
top_directors_display.index += 1
top_directors_display.index.name = "Rank"
top_directors_display_md = top_directors_display.reset_index()

# Display as markdown
display(HTML('<h3 style="text-align: center;">🎬 Top 20 Directors by Average Rating</h3>'))
display(HTML(top_directors_display_md.to_html(index=False, classes="table", border=0)))

🎬 Top 20 Directors by Average Rating

Rank Director Avg. Rating Number of Movies (min 10)
1 Christopher Nolan 8.15 11
2 Hayao Miyazaki 7.97 11
3 Stanley Kubrick 7.87 10
4 Billy Wilder 7.74 10
5 Peter Jackson 7.68 12
6 David Fincher 7.66 11
7 Martin Scorsese 7.56 24
8 Alfred Hitchcock 7.53 21
9 Joel Coen 7.50 10
10 Howard Hawks 7.43 12
11 Steven Spielberg 7.40 31
12 Yimou Zhang 7.40 12
13 Guy Ritchie 7.32 11
14 James Mangold 7.27 11
15 Alan Parker 7.25 11
16 Kevin Macdonald 7.22 10
17 Ken Loach 7.21 11
18 Robert Zemeckis 7.18 16
19 John Huston 7.16 12
20 Roman Polanski 7.16 14
Show Code
# --- 3️⃣ Actors Who Appeared in the Most Movies ---
most_movies_actors = actor_stats.copy()
most_movies_actors["avg_rating"] = most_movies_actors["avg_rating"].round(2)
most_movies_actors = most_movies_actors.sort_values("num_movies", ascending=False)

most_movies_display = most_movies_actors.head(20).rename(columns={
    "Actor": "Actor",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Round "Avg. Rating" to 2 decimal places
most_movies_display["Avg. Rating"] = most_movies_display["Avg. Rating"].round(2)

# Reset index and add rank
most_movies_display = most_movies_display.reset_index(drop=True)
most_movies_display.index += 1
most_movies_display.index.name = "Rank"
most_movies_display_md = most_movies_display.reset_index()

# Display as HTML with rank
display(HTML('<h3 style="text-align: center;">⭐️ Most Active Actors</h3>'))
display(HTML(most_movies_display_md.to_html(index=False, classes="table", border=0)))

⭐️ Most Active Actors

Rank Actor Avg. Rating Number of Movies (min 10)
1 Robert De Niro 6.73 63
2 Samuel L. Jackson 6.45 60
3 Nicolas Cage 6.07 54
4 Bruce Willis 6.15 53
5 Liam Neeson 6.43 47
6 Johnny Depp 6.83 45
7 Dennis Quaid 6.43 45
8 Nicole Kidman 6.46 43
9 Morgan Freeman 6.53 42
10 Denzel Washington 6.94 41
11 Tom Hanks 7.14 41
12 Anthony Hopkins 6.65 40
13 Woody Harrelson 6.53 39
14 Sylvester Stallone 6.07 38
15 Matt Damon 7.02 38
16 Gene Hackman 6.74 38
17 Ethan Hawke 6.66 37
18 Jeff Bridges 6.73 37
19 Clint Eastwood 6.94 37
20 Ewan McGregor 6.58 36
Show Code
# --- 4️⃣ Directors Who Have Directed the Most Movies ---
most_movies_directors = df_directors.sort_values("num_movies", ascending=False)
most_directors_display = most_movies_directors.head(20).rename(columns={
    "Director": "Director",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Round "Avg. Rating" to 2 decimal places
most_directors_display["Avg. Rating"] = most_directors_display["Avg. Rating"].round(2)

# Add a new 1-based index for display
most_directors_display = most_directors_display.reset_index(drop=True)
most_directors_display.index += 1
most_directors_display.index.name = "Rank"
most_directors_display_md = most_directors_display.reset_index()

# Display as HTML with rank
display(HTML('<h3 style="text-align: center;">🎥 Most Active Directors</h3>'))
display(HTML(most_directors_display_md.to_html(index=False, classes="table", border=0)))

🎥 Most Active Directors

Rank Director Avg. Rating Number of Movies (min 10)
1 Clint Eastwood 6.92 33
2 Steven Spielberg 7.40 31
3 Ridley Scott 6.99 26
4 Martin Scorsese 7.56 24
5 Steven Soderbergh 6.81 22
6 Alfred Hitchcock 7.53 21
7 Woody Allen 7.00 21
8 Brian De Palma 6.73 20
9 Tim Burton 6.99 20
10 Ron Howard 7.02 20
11 Sidney Lumet 6.88 19
12 Francis Ford Coppola 7.12 19
13 Walter Hill 6.47 19
14 Barry Levinson 6.56 18
15 Oliver Stone 6.83 18
16 Renny Harlin 5.58 17
17 Joel Schumacher 6.39 17
18 Blake Edwards 6.52 16
19 Robert Zemeckis 7.18 16
20 Tony Scott 6.81 16
Show Code
from IPython.display import HTML, Markdown, display

df["Actor_List"] = df["Stars"].apply(lambda x: [actor.strip() for actor in x.split(",") if actor.strip()])
df_collab = df.explode("Actor_List")
df_collab["Actor_Director_Collab"] = df_collab["Actor_List"] + " & " + df_collab["Director"]

collab_stats = (
    df_collab.groupby("Actor_Director_Collab")
    .agg(
        Num_Collaborations=("Title", "count"),
        Avg_IMDB_Rating=("Rating", "mean"),
        Percent_Highly_Rated_Movies=("Highly_Rated", "mean")
    )
    .reset_index()
)

collab_stats = collab_stats[collab_stats["Num_Collaborations"] >= 3].copy()
collab_stats["Avg_IMDB_Rating"] = collab_stats["Avg_IMDB_Rating"].round(2)
collab_stats["Percent_Highly_Rated_Movies"] = (collab_stats["Percent_Highly_Rated_Movies"] * 100).round(2)

top_collabs = collab_stats.sort_values(by="Avg_IMDB_Rating", ascending=False).head(20).reset_index(drop=True)
top_collabs.index += 1
top_collabs = top_collabs.reset_index().rename(columns={
    "index": "Rank",
    "Actor_Director_Collab": "Actor–Director Duo",
    "Num_Collaborations": "Number of Collaborations",
    "Avg_IMDB_Rating": "Avg. IMDb Rating",
    "Percent_Highly_Rated_Movies": "% Highly Rated Movies"
})


# Display as HTML with rank
display(HTML('<h3 style="text-align: center;">☯️ Best-Rated Actor–Director Duos</h3>'))
display(HTML(top_collabs.to_html(index=False, classes="table", border=0)))

☯️ Best-Rated Actor–Director Duos

Rank Actor–Director Duo Number of Collaborations Avg. IMDb Rating % Highly Rated Movies
1 Marlon Brando & Francis Ford Coppola 3 8.93 100.00
2 Elijah Wood & Peter Jackson 3 8.90 100.00
3 Al Pacino & Francis Ford Coppola 4 8.75 100.00
4 Tarik Akan & Ertem Egilmez 3 8.53 100.00
5 Christian Bale & Christopher Nolan 4 8.52 100.00
6 Uma Thurman & Quentin Tarantino 4 8.45 100.00
7 Brad Pitt & David Fincher 3 8.40 100.00
8 Ian McKellen & Peter Jackson 5 8.38 100.00
9 Clint Eastwood & Sergio Leone 3 8.30 100.00
10 James Caan & Francis Ford Coppola 3 8.23 66.67
11 Joe Pesci & Martin Scorsese 4 8.20 100.00
12 Robert Downey Jr. & Anthony Russo 3 8.20 100.00
13 Tatsuya Nakadai & Akira Kurosawa 3 8.17 100.00
14 Samuel L. Jackson & Quentin Tarantino 3 8.07 100.00
15 Toshirô Mifune & Akira Kurosawa 4 8.07 100.00
16 Grace Kelly & Alfred Hitchcock 3 8.03 100.00
17 James Stewart & Alfred Hitchcock 4 8.02 100.00
18 Arnold Schwarzenegger & James Cameron 3 8.00 100.00
19 Michael Biehn & James Cameron 3 8.00 100.00
20 Leonardo DiCaprio & Martin Scorsese 5 7.98 100.00

Comparing Originals to Their Successors

Show Code
import pandas as pd
import re
from rapidfuzz import process, fuzz
from collections import defaultdict

# Step 1: Clean title
def clean_title(title):
    title = title.lower()
    title = re.sub(r'[^\w\s]', '', title)
    return re.sub(r'\s+', ' ', title).strip()

df_titles = df[["Title", "Rating", "Year"]].copy()
df_titles["clean_title"] = df_titles["Title"].apply(clean_title)

# Step 2: Extract first strong word (excluding filler words)
filler_words = {"the", "a", "an", "of", "and", "in", "on", "at", "to", "for"}
def get_main_word(title):
    words = title.split()
    for word in words:
        if word not in filler_words:
            return word
    return words[0] if words else "unknown"

df_titles["block_key"] = df_titles["clean_title"].apply(get_main_word)

# Step 3: Group by block_key
df_titles["cluster_id"] = -1
matched = set()
clusters = defaultdict(list)
cluster_id = 0

for key, group in df_titles.groupby("block_key"):
    if len(group) == 1:
        idx = group.index[0]
        df_titles.at[idx, "cluster_id"] = cluster_id
        cluster_id += 1
        continue  # singleton titles form their own one-movie cluster
    
    titles = group["clean_title"].tolist()
    index_map = dict(zip(titles, group.index))
    
    for i, title in enumerate(titles):
        if index_map[title] in matched:
            continue
        results = process.extract(title, titles, scorer=fuzz.token_set_ratio, score_cutoff=70)
        matched_idxs = [index_map[r[0]] for r in results if index_map[r[0]] not in matched]
        
        for idx in matched_idxs:
            matched.add(idx)
            clusters[cluster_id].append(idx)
        cluster_id += 1

# Step 4: Assign back
for cid, indices in clusters.items():
    df_titles.loc[indices, "cluster_id"] = cid

df_titles = df_titles.sort_values(["cluster_id", "Year"])

# Filter to only clusters with more than 1 member
non_singleton_clusters = df_titles["cluster_id"].value_counts()
non_singleton_clusters = non_singleton_clusters[non_singleton_clusters > 1].index

df_clustered = df_titles[
    df_titles["cluster_id"].isin(non_singleton_clusters) & (df_titles["cluster_id"] != -1)
].sort_values(["cluster_id", "Year"])

We wanted to see how the original movie of a series (whether a decades-later follow-up, like Top Gun and Top Gun: Maverick, or a direct sequel, like Cars and Cars 2) compares to its “follower,” for lack of a better term.

To do this, we first used fuzzy matching [9] to cluster movies with similar titles. After running the algorithm, we manually checked each grouping to confirm the titles belonged to the same franchise or storyline. This comparison also accounts for situations where, due to scraping limitations, only the second through fourth movies in a series were captured; in those cases, the earliest available film was treated as the “original.”

Simply put, we compared each cluster’s earliest release to all following films within that cluster [10], allowing for a larger sample size of comparisons. The results were striking, but sensible.
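To make the matching concrete, here is the scorer in isolation; the title pairs are illustrative, but token_set_ratio is the same rapidfuzz function used in the clustering step:

from rapidfuzz import fuzz

# token_set_ratio ignores word order and repeated tokens, so a title that is a
# superset of another (a sequel or subtitle) scores a perfect 100
print(fuzz.token_set_ratio("top gun", "top gun maverick"))  # 100: subset match
print(fuzz.token_set_ratio("cars", "cars 2"))               # 100: subset match
print(fuzz.token_set_ratio("top gun", "gun smoke"))         # below the 70 cutoff

This is exactly why a 70-point cutoff groups sequels with their originals while leaving unrelated titles unclustered.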

Show Code
import pandas as pd
import plotly.graph_objects as go
from scipy.stats import ttest_ind

# Load and clean
df_series_remakes = pd.read_csv("fuzzy_matching_clusters_sanity_check_2.csv")
df_series_remakes = df_series_remakes[df_series_remakes['cluster_id'] != -1]
df_series_remakes = df_series_remakes.sort_values(by=['cluster_id', 'Year'])

# Tag original vs. follower
def tag_role(group):
    group = group.sort_values(by='Year')
    group['role'] = ['Originals'] + ['Followers'] * (len(group) - 1)
    return group

df_tagged = df_series_remakes.groupby('cluster_id', group_keys=False).apply(tag_role).reset_index(drop=True)

# Extract ratings
originals = df_tagged[df_tagged['role'] == 'Originals']['Rating']
followers = df_tagged[df_tagged['role'] == 'Followers']['Rating']

# T-test
t_stat, p_val = ttest_ind(originals, followers, equal_var=False)

# Average ratings
avg_ratings = df_tagged.groupby('role')['Rating'].mean().reindex(['Originals', 'Followers'])

# IMDb yellow & Netflix red
bar_colors = ['#F5C518', '#E50914']

# Plot
fig = go.Figure()

fig.add_trace(go.Bar(
    x=avg_ratings.index,
    y=avg_ratings.values,
    text=[f"{val:.2f}" for val in avg_ratings.values],
    textposition='outside',
    marker_color=bar_colors,
    hoverinfo='x+y',
))

# Annotate t-test result on the chart
fig.add_annotation(
    x=0.0,
    y=1.0,
    text=f"T-stat: {t_stat:.2f} | P-value: {p_val:.4f}",
    showarrow=False,
    font=dict(size=12, color="white"),
    align='center',
    bgcolor="#333333",
    bordercolor="#888",
    borderwidth=1
)

# Layout
fig.update_layout(
    title=dict(
        text='Average IMDb Rating: Originals vs. Followers 🍿',
        x=0.5,
        xanchor='center',
        font=dict(color='#F5C518', size=22)
    ),
    yaxis=dict(title='Average Rating', range=[0, 10]),
    xaxis=dict(title='Film Type'),
    plot_bgcolor='#1C1C1C',
    paper_bgcolor='#1C1C1C',
    font=dict(family="Arial", size=14, color="#FFFFFF")
)

fig.show()


The comparison wasn’t just anecdotal; it was statistically significant. Originals consistently outperformed their sequels, remakes, and follow-ups. A two-sample t-test confirmed that the difference in ratings was not due to chance.
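For reference, the statistic computed by `ttest_ind(..., equal_var=False)` in the chart code above is Welch’s t, which does not assume equal variances between the two groups:

$$t = \frac{\bar{x}_O - \bar{x}_F}{\sqrt{\frac{s_O^2}{n_O} + \frac{s_F^2}{n_F}}}$$

where $\bar{x}$, $s^2$, and $n$ are the sample mean, variance, and size of the Originals ($O$) and Followers ($F$) groups. The p-value annotated on the chart comes from comparing this statistic against a t-distribution with Welch-adjusted degrees of freedom.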

Show Code
import pandas as pd
from IPython.display import Markdown, display

# Start fresh with a copy of the clustered series/remakes DataFrame
df_clustered = df_series_remakes.copy()

# Filter out unclustered (-1)
df_clustered = df_clustered[df_clustered['cluster_id'] != -1]

# Sort by cluster and year
df_clustered = df_clustered.sort_values(by=['cluster_id', 'Year'])

# Label roles: first = original, rest = follower
def assign_roles(group):
    group = group.sort_values(by='Year')
    group['role_new'] = ['original'] + ['follower'] * (len(group) - 1)
    return group

df_clustered_labeled = df_clustered.groupby('cluster_id', group_keys=False).apply(assign_roles)

# Compare followers to original
comparison_rows_new = []

for cluster_id, group in df_clustered_labeled.groupby('cluster_id'):
    group = group.sort_values(by='Year')
    if len(group) < 2:
        continue
    original = group.iloc[0]
    for _, follower in group.iloc[1:].iterrows():
        comparison_rows_new.append({
            'cluster_id': cluster_id,
            'original_title': original['Title'],
            'original_rating': original['Rating'],
            'follower_title': follower['Title'],
            'follower_rating': follower['Rating'],
            'follower_same_or_better': follower['Rating'] >= original['Rating']
        })

# Create new comparison DataFrame
df_follower_vs_original = pd.DataFrame(comparison_rows_new)

# Summary stats
total_comparisons_new = len(df_follower_vs_original)
better_or_equal_new = df_follower_vs_original['follower_same_or_better'].sum()
percentage_new = (better_or_equal_new / total_comparisons_new) * 100

display(Markdown(f"""
 Out of **{total_comparisons_new}** remake comparisons,  
 **{better_or_equal_new}** were rated the same or better than the original.  
 That’s **{percentage_new:.2f}%** of the time.
"""))

Out of 494 remake comparisons,
89 were rated the same or better than the original.
That’s 18.02% of the time.

This confirms a broader trend: while follow-ups may have higher budgets or better effects, they rarely capture the audience approval earned by the original.


Act IV: The Verdict


So, why does any of this matter?

We weren’t just modeling for the sake of prediction; we were trying to understand what really connects with audiences, and why some films resonate while others fade. Here’s what stood out:

Creative Choices that Resonate

  • Runtime: There’s a consistent sweet spot: the majority of best-rated films run between 1h 32m and 2h 22m.

  • Director-Writer Control: When the same person directs and writes, ratings tend to be higher, likely the result of a more unified creative vision.

  • Realism Resonates: Genres like Biography and Documentary outperform others, pointing to a strong audience appetite for real, grounded storytelling.

  • Followers Rarely Outperform Originals: Sequels and remakes are common, but only about 18% rate the same as or better than the original, suggesting a follow-up must be done right to earn the approval the original story did.

Simple is Special - Modeling

  • We didn’t need budget data to get meaningful results. Using only surface-level features like director, runtime, genre, and creative roles, we were able to predict IMDb ratings with reasonable accuracy (an RMSE under 0.9 stars).

  • We even tested the models on upcoming releases (see Credits) to see if the results still hold.

The Mystery Remains — But We’re Closer

We opened with the question: Are movies getting worse or are they simply getting missed?

And while there may never be a definitive answer, this report gets us closer.

Streaming has dramatically increased content volume, and with it, audience saturation. Today’s well-made films often disappear quickly, overshadowed by aggressive release schedules and the pressure to feed platforms with new content. Even movies that follow the “successful” formula struggle to break through the noise. Ultimately, the challenge may not be that we’re making worse movies; it’s that we’re making too many, with too little space for any of them to breathe.

The art is still there. It’s just harder to notice.


Credits: A Real-World Application

Show Code
# ========================
# 6. Predict for All Movies in File
# ========================

# Load the file (CSV or Excel)
df_predict = pd.read_excel("/Users/rileysvensson/Desktop/CAA - Data Analyst/Projects/imdb_data/final/predict_movies.xlsx")  

# Drop unneeded columns
X_new = df_predict.drop(columns=drop_cols, errors="ignore")

# Transform using the pipeline
X_new_trans = ct_all.transform(X_new[categorical_features + numerical_features])
X_new_trans = np.array(X_new_trans, dtype=np.float32)

# Predict and rescale
y_preds_scaled = rf_model.predict(X_new_trans)
y_preds_original = scaler_y.inverse_transform(y_preds_scaled.reshape(-1, 1))

# Attach predictions to original dataframe
df_predict["Predicted_Rating"] = y_preds_original

# Sort by predicted rating (highest to lowest)
df_predict_sorted = df_predict.sort_values("Predicted_Rating", ascending=False)

# Print title and predicted rating for each movie
for title, rating in zip(df_predict_sorted["Title"], df_predict_sorted["Predicted_Rating"]):
    print(f"🎬 {title}: {rating:.2f}")
🎬 Sinners: 6.74
🎬 Until Dawn: 6.60
🎬 Lilo & Stitch: 6.44
🎬 Sneaks: 6.31
🎬 Final Destination: Bloodlines: 6.16
🎬 Magic Farm: 6.10
🎬 A Minecraft Movie: 5.95
🎬 The Wedding Banquet: 5.89
🎬 The Lost Princess: 4.94
🎬 Bears on a Ship: 4.71

Glossary

1. Machine Learning – Algorithms that learn patterns from data to make predictions or decisions without being explicitly programmed.

2. Scraping – The automated process of extracting data from websites or online sources.

3. Data Wrangling – Cleaning, restructuring, and enriching raw data into a usable format.

4. Features – Individual measurable properties of the data; a synonym for the variables used as inputs to a model.

5. Logarithmic – A growth pattern in which values increase rapidly at first, then gradually slow and level off; often used to model relationships that rise quickly before approaching a natural ceiling.

6. Modeling – The process of training a machine learning algorithm on data to learn patterns and relationships, with the goal of making predictions or gaining insights.

7. Scoring – The use of evaluation metrics to measure a model’s predictive performance.

8. Descriptive Analytics – Analyzing historical data to understand what has happened.

9. Fuzzy Matching – A technique for finding approximate matches between strings, useful for identifying similar but not identical text.

10. Cluster – A group of related records, typically grouped by a shared ID or characteristic; in this case, movies that belong to the same franchise or series.

Appendix

Show Code
import seaborn as sns
import matplotlib.pyplot as plt

from tabulate import tabulate  # used to generate the commented markdown table below

# Get feature names after transformation
feature_names = ct_all.get_feature_names_out()

# Get feature importances
importances = rf_model.feature_importances_

# Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

# Convert top 20 to markdown table
top_20_features = feature_importance_df.head(20)

#
#| Feature                                |   Importance |
#|----------------------------------------|--------------|
#| num__Duration                          |   0.175685   |
#| num__Year                              |   0.117457   |
#| num__Num_Genres                        |   0.0316729  |
#| cat__Cleaned_Primary_Genre_Drama       |   0.0293566  |
#| cat__Cleaned_Primary_Genre_Documentary |   0.0179949  |
#| cat__Cleaned_Primary_Genre_Biography   |   0.0126313  |
#| cat__Cleaned_Primary_Genre_Action      |   0.0086014  |
#| cat__Word_Count_Binned_30+             |   0.00728996 |
#| cat__Certificate_Not Rated             |   0.00728064 |
#| cat__Certificate_R                     |   0.00695221 |
#| cat__Certificate_PG-13                 |   0.006874   |
#| cat__Director_Is_Writer_1              |   0.00664878 |
#| cat__Director_Is_Writer_0              |   0.00639114 |
#| cat__Word_Count_Binned_20-25           |   0.0063814  |
#| cat__Word_Count_Binned_25-30           |   0.00635609 |
#| cat__Cleaned_Primary_Genre_Adult       |   0.00603386 |
#| cat__Word_Count_Binned_15-20           |   0.00577349 |
#| cat__Cleaned_Primary_Genre_Sport       |   0.00572772 |
#| cat__Director_Jon M. Chu               |   0.00543463 |
#| cat__Certificate_PG                    |   0.00534882 |
#

# IMDb theme colors
BACKGROUND = '#000000'  
TEXT = '#f5f5f5'        
HIGHLIGHT = '#f5c518'  

# Set overall seaborn and matplotlib style
sns.set_style("darkgrid")
plt.rcParams.update({
    'axes.facecolor': BACKGROUND,
    'figure.facecolor': BACKGROUND,
    'axes.labelcolor': TEXT,
    'xtick.color': TEXT,
    'ytick.color': TEXT,
    'text.color': TEXT,
    'axes.edgecolor': TEXT,
    'grid.color': '#444444',  # soft gridlines
    'axes.titleweight': 'bold',
    'axes.titlepad': 15,
    'axes.titlesize': 14
})

# 1️⃣ Duration vs IMDb Rating — Sweet spot check
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df['Duration'], y=df['Rating'], alpha=0.4, color='white')  # Data points
sns.lineplot(x='Duration', y='Rating', data=df, estimator='mean', errorbar=None, color=HIGHLIGHT, linewidth=2)
plt.title('IMDb Rating by Duration')
plt.xlabel('Duration (min)')
plt.ylabel('Rating')
plt.grid(True)
plt.tight_layout()
plt.show()

# 2️⃣ Director also being the writer (1 = Yes, 0 = No)
plt.figure(figsize=(6, 4))
ax = sns.barplot(x='Director_Is_Writer', y='Rating', data=df, palette=[TEXT, HIGHLIGHT])
plt.title('Average Rating: Director Also Writer')
plt.ylim(6.0, 8.5)
plt.xlabel('Is Director Writer?')
plt.ylabel('Avg IMDb Rating')
plt.tight_layout()
plt.show()