How I Built a Sub-Second Movie Similarity Engine With a 10-Line SQL Query

Published: February 17, 2026

On my IMDb dashboard (tigzig.com/movie-explorer), you can click a lightning icon next to any movie and get a list of similar movies. Match percentages, shared factors, the whole thing. It runs in under a second.

No ML - just SQL

The Problem

IMDb has 12 million titles and 97 million person-to-title credit records. I wanted a 'movies like this' feature that could compare any movie against all others and return results fast enough to feel instant. The constraint was that I only had public metadata - no user ratings or watch history. Just titles, genres, cast, crew, year, and runtime.

I brainstormed approaches with Claude Code - everything from ML-based recommender systems to simpler methods - weighing trade-offs, deployment constraints, and so on. Here's what I went with.

The Approach: Jaccard Similarity With Weighted Tokens

Jaccard similarity is simple. Take two sets. Count what they share. Divide by the total unique items across both sets. Two movies that share 5 out of 20 total unique factors get a 25% match score.
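That arithmetic is small enough to sanity-check in a few lines of Python (the factor names here are made up):

```python
def jaccard(a, b):
    """Shared items divided by total unique items across both sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# 15 factors vs. 10 factors, 5 of them shared: 20 unique total, 5 common.
movie_a = {f"factor{i}" for i in range(15)}
movie_b = {f"factor{i}" for i in range(10, 20)}

print(jaccard(movie_a, movie_b))  # 5 / 20 = 0.25
```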

The trick is what you put in those sets. I capture each movie's attributes as a list of tokens - text labels that describe its properties. A movie might have tokens like genre:Drama, director:nm0000229, star:nm0000151, decade:1990s, runtime:standard, rating_band:excellent.

The problem is that standard Jaccard treats every token equally. Sharing a lead actor should matter more than sharing a runtime category. But Jaccard has no built-in weighting mechanism.

The workaround is token duplication: I repeat important tokens multiple times in the list. A top-billed star gets about 10 tokens - star:, lead:, and actor: variants of their ID, repeated. A director gets about 5 tokens. A writer, the runtime category, and each genre get 1.

When two movies share a star, those ~10 duplicate tokens all appear in the intersection, pushing the Jaccard score up far more than sharing a single runtime:medium token would. The weighting is baked into the data at table creation time - the similarity query itself is completely unweighted. It just counts matching tokens against total tokens. The weighting already happened.
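To make the duplication concrete, here's the weighting idea as a Python sketch. The weights match the rough numbers above, but the suffix scheme is my illustration, not the exact production encoding - the point is that a shared star contributes ~10 matching tokens while a shared runtime contributes 1:

```python
# Illustrative weights; the pipeline uses ~10 for a star, ~5 for a director.
WEIGHTS = {"star": 10, "director": 5, "writer": 1, "genre": 1, "runtime": 1}

def weighted_tokens(attrs):
    """Expand (kind, value) pairs into duplicated tokens.

    Each copy gets a distinct suffix so every copy survives a set
    intersection and counts toward the score."""
    return {f"{kind}:{value}#{i}"
            for kind, value in attrs
            for i in range(WEIGHTS[kind])}

def jaccard(a, b):
    return len(a & b) / len(a | b)

base = [("star", "nm0000151"), ("director", "nm0000229"),
        ("genre", "Drama"), ("runtime", "standard")]
shares_star = [("star", "nm0000151"), ("director", "nm9999999"),
               ("genre", "Drama"), ("runtime", "short")]
shares_runtime = [("star", "nm1111111"), ("director", "nm9999999"),
                  ("genre", "Comedy"), ("runtime", "standard")]

a = weighted_tokens(base)
print(round(jaccard(a, weighted_tokens(shares_star)), 2))     # shared star + genre: big score
print(round(jaccard(a, weighted_tokens(shares_runtime)), 2))  # shared runtime only: tiny score
```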

The factors I settled on: genre, directors, actors (with star, lead, and supporting distinctions), writers, producers, decade, runtime category, and rating band.

The Implementation

Running this comparison on the raw data isn't feasible. The cast and crew information lives in title_principals - 97 million rows. Genres and runtime are in title_basics - 12 million rows. Ratings are in title_ratings - 1.6 million rows. Building the token set for one movie means joining across all three tables; comparing one movie against all others on the fly would mean hitting that 97-million-row table repeatedly.

What I did was build a pre-computed table called movie_tokens. One row per movie, with all similarity factors already assembled from those three source tables and encoded as a DuckDB list (array) column.

The key design choice: I don't tokenize all 12 million titles

I filter to titleType = 'movie' (drops TV shows, shorts, episodes, etc.) and then to movies with 10,000+ votes. That takes 12 million titles down to about 12,000 movies. These are movies that enough people have actually watched and rated. Nobody needs a similarity match for a zero-vote short film from 1927. This filter is what makes the whole thing fast - the similarity query compares against 12,000 rows, not 12 million.
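As a toy illustration of that filter (the sample rows here are invented, standing in for title_basics joined to title_ratings):

```python
# Invented sample rows standing in for title_basics joined to title_ratings.
titles = [
    {"tconst": "tt0111161", "titleType": "movie",    "numVotes": 2_800_000},
    {"tconst": "tt0000001", "titleType": "short",    "numVotes": 2_000},
    {"tconst": "tt0903747", "titleType": "tvSeries", "numVotes": 2_100_000},
    {"tconst": "tt9999999", "titleType": "movie",    "numVotes": 42},
]

# Same predicate as the pipeline: feature films with a real audience.
eligible = [t["tconst"] for t in titles
            if t["titleType"] == "movie" and t["numVotes"] >= 10_000]
print(eligible)  # only tt0111161 survives
```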

So instead of joining across 97 million + 12 million + 1.6 million rows at query time, I join them once during the data pipeline and store the result. The movie_tokens table is about 10MB. Each row has a tconst, title, year, rating, votes, genres, and a tokens array.

The query that powers the whole feature

WITH source AS (
  SELECT tokens FROM movie_tokens WHERE tconst = 'tt0111161'
)
SELECT
  m.tconst, m.title, m.year, m.rating, m.votes, m.genres,
  LEN(list_intersect(m.tokens, s.tokens)) * 1.0 /
  LEN(list_distinct(list_concat(m.tokens, s.tokens))) as jaccard_score,
  list_intersect(m.tokens, s.tokens) as matching_tokens
FROM movie_tokens m, source s
WHERE m.tconst <> 'tt0111161'
  AND LEN(list_intersect(m.tokens, s.tokens)) > 0
ORDER BY jaccard_score DESC, m.votes DESC
LIMIT 30

But building the pre-computed table is only half the problem

When a user clicks that lightning icon, the query still has to take one movie's token list and compare it against all 12,000 other movies in real time. That's 12,000 set comparisons - find the intersection, find the union, compute the ratio - all in one query.

This is where DuckDB earns its keep. The CTE (WITH clause) grabs the source movie's tokens. Then the main query cross-joins that against every other row in movie_tokens. For each pair, list_intersect finds the shared tokens, list_concat + list_distinct gives the union, and the division produces the Jaccard score. The matching_tokens column returns exactly which tokens matched - those are the tags you see in the "Why Similar" column on the UI.

DuckDB is built for this kind of columnar, analytical workload. It processes all 12,000 comparisons - each involving array intersection and union operations - in under a second. It's not doing 12,000 separate queries. It's one vectorized scan across the entire table. That's the difference between an analytical database like DuckDB and a row-oriented database.
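For readers who don't speak DuckDB's list functions, here's the per-pair logic as a pure-Python sketch. The three-movie "table" and its token lists are invented for illustration; the real movie_tokens table has ~12,000 rows and much longer token lists:

```python
# Pure-Python analogue of the DuckDB query, over an invented three-movie table.
movie_tokens = {
    "tt0111161": ["genre:Drama", "decade:1990s",
                  "director:nm0001104", "star:nm0000209"],
    "tt0068646": ["genre:Drama", "genre:Crime", "decade:1970s",
                  "director:nm0000338"],
    "tt0109830": ["genre:Drama", "genre:Romance", "decade:1990s",
                  "director:nm0000709"],
}

def similar(source, limit=30):
    src = set(movie_tokens[source])
    results = []
    for tconst, toks in movie_tokens.items():
        if tconst == source:
            continue
        shared = src & set(toks)   # list_intersect
        union = src | set(toks)    # list_distinct(list_concat(...))
        if shared:                 # mirrors the intersection > 0 filter
            results.append((tconst, len(shared) / len(union), sorted(shared)))
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:limit]

for tconst, score, why in similar("tt0111161"):
    print(tconst, round(score, 2), why)  # `why` plays the matching_tokens role
```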

The backend cache means repeated lookups for the same movie are instant after the first hit.
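The cache can be as simple as memoizing on the tconst. A minimal sketch - `run_similarity_query` is a hypothetical stand-in for the real DuckDB call, instrumented here just to show it only runs once per movie:

```python
from functools import lru_cache

query_runs = 0  # instrumentation for this sketch only

def run_similarity_query(tconst):
    """Hypothetical stand-in for the real DuckDB query."""
    global query_runs
    query_runs += 1
    return (("tt0068646", 0.24),)  # fabricated placeholder result

@lru_cache(maxsize=1024)
def similar_movies(tconst):
    # First call per tconst hits the database; repeats come from memory.
    return run_similarity_query(tconst)

similar_movies("tt0111161")
similar_movies("tt0111161")
print(query_runs)  # the expensive query executed only once
```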

Jaccard on token lists is not the most sophisticated similarity algorithm

But it has properties I care about: it's explainable (you can show exactly which factors matched), it runs in pure SQL (no external libraries or services), and it's fast enough for a real-time feature.

The matching_tokens field is what makes this useful. When a user sees '24% match' next to a movie, they can also see Actor, 1990s, Drama, Lead, Star - the actual reasons. That transparency matters more than a slightly better algorithm that feels like a black box.

The full implementation is open source. Frontend code, backend code, the token generation query, the similarity SQL - all in the repos linked on the dashboard. The data pipeline docs walk through exactly how the movie_tokens table is built.