Optimizing Apache Spark Performance for Skewed Data - Advanced Techniques and Case Study

53 minutes. That’s how long the Spark job ran. For a chain of joins on a table that was big but not that big — the kind of job you expect to finish while you grab coffee, not while you eat lunch and wonder if something broke.

The Spark UI told the story before I understood it. Two hundred and fifty tasks would breeze through in seconds. Then one task — a single task — would sit there, unmoving, for the remaining 50-plus minutes. One partition had somehow inherited nearly all the data. The rest of the cluster sat idle while that one executor burned.

This is skewed data: when one or a handful of keys in your dataset has a wildly disproportionate number of rows compared to every other key. Spark splits work by partitioning — it divides rows into chunks, each processed by a single task on a single machine. When the split is even, everything hums. When it’s not, one task chokes while the rest of the cluster twiddles its thumbs.

What Skew Actually Breaks

The symptoms are predictable once you’ve seen them:

Imbalanced workload. Uneven partitions mean most tasks finish fast while a few drag. The job runs at the speed of the slowest task.
Out-of-memory errors. The overloaded partition can exceed the memory allocated to its executor — especially if you’re caching intermediate results for iterative processing.
Wasted resources. CPU and memory sit idle on the under-loaded machines. You’re paying for the full cluster but using a fraction.
Shuffle pain. Operations like joins and aggregations that require data to be redistributed across the network (a shuffle) are particularly sensitive — skewed keys concentrate the shuffle onto a single machine.
Job failures. In extreme cases, the overloaded task exceeds memory limits or execution timeouts, and the whole job dies.

The Standard Toolkit

Before I get to what actually fixed the 53-minute job, here’s what the literature tells you to try:

Salting. Append a random value to each key before the shuffle, so the data fans out across more partitions. After the shuffle, strip the salt. Think of it as artificially diversifying a key that’s too popular. It adds some overhead but prevents any single key from dominating a partition.

Co-partitioning. When joining two datasets, ensure both are already partitioned the same way on the join key. Spark can then join them locally within each partition without reshuffling either side across the network.

Skew join optimization. Spark has built-in detection for skewed joins. When it finds a key with disproportionate data, it splits that key’s processing across multiple tasks automatically — though the effectiveness varies by Spark version and configuration.

Configuration tuning. More partitions, more memory per executor, more CPU. Sometimes brute force works. Usually it doesn’t, and you’re just burning money on a cluster that’s still 90% idle.

These are all real techniques. None of them solved this job.

What Actually Happened

The job ran as part of a batch processing pipeline. Multiple consecutive joins, large tables, high traffic — and every run, the same single-partition bottleneck.

First wrong turn: blaming the file format. The input was compressed GZip files. GZip isn’t splittable — Spark can’t hand different chunks of a single gzipped file to different tasks. It creates one partition per file by default, which kills parallelism at the read stage.

I installed the SplittableGZipCodec by Niels Basjes (Rahul Singha has a good walkthrough for Databricks). This codec feeds the same gzip file to multiple Spark tasks. Each task seeks to a specific byte offset and begins decompressing from there — a “fast-forward” into the compressed stream. It runs multiple tasks on the same file, cutting wall-clock time at the cost of slightly higher total core-hours.

To install it in a Databricks notebook: cluster details → Libraries tab → Install new → Maven → Search packages → “splittablegzip” → install. (Figures 3 and 4 in the Spark UI walkthrough show this.)

And it worked — sort of. Parallelism increased (Figure 5). But writes were still slow. Stage 251 still had a long-tail task (Figure 6). The Spark UI confirmed what I should have checked first: one task was processing dramatically more data than the others (Figure 7). The parallelism fix at the read layer didn’t touch the real problem.

The real problem was the DAG. Spark represents every transformation — filter, map, join, group-by — as a node in a Directed Acyclic Graph, the lineage that tracks how each piece of data was derived. When you chain 18 joins without materializing intermediate results, Spark has to carry the entire lineage forward. Each new operation adds to the graph. By the 18th join, the DAG is so deep I had to zoom my browser to 30% to screenshot it (Figure 8).

Long lineages don’t just slow things down in the abstract. Spark uses the DAG for fault tolerance — if a partition is lost, it can recompute from any ancestor node. But a monster lineage means every task carries the overhead of tracking 18 ancestors, and memory pressure compounds at each step.

The fix: checkpointing. Cut the lineage after each join by materializing the intermediate DataFrame. In Spark, “checkpointing” writes the intermediate result to disk and severs the DAG — the next join starts from a clean slate rather than chaining onto 17 prior joins. You can also use .persist() with the DISK_ONLY storage level for a similar effect without the full checkpoint overhead.

Figure 9 shows the PySpark modification to the perform_joins function that does exactly this. And Figure 10 shows the result: a dramatically shorter DAG, each join isolated from the tangle.

53 minutes dropped to 11 seconds.

That’s not a typo. 53 minutes → 11 seconds. The joins themselves were cheap. The overhead of carrying 18 generations of lineage through every task was what burned the time.

What I’d Do Differently Next Time

The SplittableGZipCodec was the right instinct — parallelism at the read layer matters — but it was the second problem to solve, not the first. I should have opened the Spark UI DAG visualization before touching the file format. The long-tail task in Stage 251 was visible immediately; I just didn’t look.

The broader lesson isn’t about Spark specifically. It’s that distributed systems make you pay for complexity in surprising places. The joins themselves were small. The lineage tracking was enormous. The cost wasn’t in the compute — it was in the bookkeeping.

The standard advice — salt your keys, tune your partitions, bump your memory — would have been useless here. No amount of salting fixes a DAG that’s 18 joins deep. The UI tells you the answer if you’re willing to read it. I wasn’t, at first. Next time I will be.