Monitoring & Debugging - Spark UI, execution plans, debugging failed jobs

The Debugging Mental Model

Before diving into individual tools, anchor yourself in this hierarchy:

Cluster --> Driver --> Executor --> Stage --> Task --> Table (Delta)

Level	What to Check
Cluster	Infrastructure, lifecycle events, autoscaling, init script failures
Driver	Job orchestration, SparkContext errors, DAG planning failures
Executor	Task execution, OOM, garbage collection, shuffle issues
Stage	Shuffle boundaries, data distribution, spill
Task	Fine-grained execution, skew, straggler tasks
Table (Delta)	Data operations, file count, table history, small file problem

Rule of thumb:

Debug top-down (start broad, narrow down)
Optimize bottom-up (fix root causes at the task level)

Spark UI

The Spark UI is the primary tool for understanding job performance. It gives you a live, structured view of every job, stage, and task running on the cluster.

How to access: Compute page --> Select your cluster --> Spark UI tab

Spark UI Tabs

Tab	What It Shows	When to Use
Jobs	Lists every Spark action triggered. Shows duration, status, number of stages per job.	Start here to find which job is slow or failed.
Stages	Breaks each job into stages. Shows shuffle read/write, spill to disk, task distribution.	Key for finding data skew and expensive shuffles.
Storage	Cached RDDs and DataFrames -- size in memory, on-disk spill, fraction cached.	Check if caching is working or wasting memory.
Executors	Executor activity: cores, active/failed tasks, GC time, shuffle read/write, memory usage.	Identify under-utilized or overloaded executors.
SQL	Shows query execution plans visually as a DAG. Displays scan, filter, join, aggregate nodes with metrics.	Understand how SQL/DataFrame queries are executed.
Environment	All Spark configurations, JVM info, system properties, classpath.	Verify configuration settings are applied correctly.
Streaming	Streaming job metrics: input rate, processing time, batch duration, completed batches.	Monitor streaming pipeline health.

Jobs Tab - What to Look For

Job duration: Which job is taking the longest?
Status: Failed jobs show red -- click to drill into the failing stage.
Number of stages: Many stages may indicate excessive shuffles from wide transformations (joins, groupBy, repartition).

Stages Tab - What to Look For

This is where most debugging happens.

Metric	What It Means	Action
Shuffle Read/Write	Data exchanged between stages (wide transformations).	High shuffle = expensive. Consider broadcast joins or pre-partitioning.
Spill (Memory)	Data that didn't fit in executor memory during shuffle.	Increase `spark.sql.shuffle.partitions` or executor memory.
Spill (Disk)	Data spilled from memory to disk.	Same as above -- disk spill is slower than memory spill.
Task Duration (Summary)	Min, 25th percentile, median, 75th percentile, max task duration.	Large gap between median and max = data skew.
Input Size	Data read by the stage.	Unexpectedly large input = missing filters or predicate pushdown.

Tasks Tab - What to Look For

Drill into a specific stage to see individual task metrics.

Symptom	Diagnosis
A few tasks taking 10x-100x longer than others	Data skew -- one partition has far more data
All tasks slow uniformly	Resource issue (too few executors, small instances) or I/O bottleneck
Many failed tasks with retries	OOM, corrupt data, or unstable spot instances
High GC time on tasks	Executor memory pressure -- increase memory or reduce partition size

Executors Tab - What to Look For

Column	Healthy	Problematic
Active Tasks	Spread evenly across executors	All tasks on 1-2 executors = skew or bad partitioning
Failed Tasks	0	> 0 = investigate OOM, data corruption
GC Time	< 5% of total time	> 10% = memory pressure, increase executor memory
Shuffle Read/Write	Balanced across executors	One executor with 10x more = data skew
Dead Executors	0	> 0 = OOM kills, spot eviction, or resource limits

DAG Visualization

Available in the Jobs tab (click a job) or SQL tab.
Shows the directed acyclic graph of transformations.
Narrow stages (map, filter): no shuffle, fast.
Wide stages (join, groupBy, repartition): shuffle boundary, expensive.
Look for unnecessarily wide stages and redundant shuffles.

Execution Plans

What is an Execution Plan?

When you write a DataFrame transformation or SQL query, Spark does not execute it immediately (lazy evaluation). The Catalyst Optimizer transforms your code through multiple stages before execution:

Your Code --> Parsed Logical Plan --> Analyzed --> Optimized --> Physical Plan --> Execution

Catalyst Optimizer Pipeline

Stage	What Happens
Parsed Logical Plan	Raw AST from your code. Table/column names are unresolved strings.
Analyzed Logical Plan	Resolves table names, column names, data types against the catalog. Throws `AnalysisException` if references are invalid.
Optimized Logical Plan	Catalyst applies optimization rules: predicate pushdown, constant folding, column pruning, join reordering.
Physical Plan	Determines actual execution strategy: join algorithms (broadcast vs sort-merge vs shuffle-hash), data access methods, shuffle strategies. Multiple physical plans are generated; Spark picks the most efficient based on cost.

EXPLAIN Modes

# PySpark
df.explain("simple")     # Only physical plan (default)
df.explain("extended")   # All 4 plans: parsed, analyzed, optimized, physical
df.explain("codegen")    # Physical plan + generated Java code
df.explain("cost")       # Optimized plan with CBO cost estimates
df.explain("formatted")  # Clean tree with node IDs and splits

-- SQL
EXPLAIN SELECT * FROM table;
EXPLAIN EXTENDED SELECT * FROM table;
EXPLAIN FORMATTED SELECT * FROM table;

Mode	When to Use
simple	Quick check of what Spark will actually execute
extended	Full end-to-end debugging; understand Catalyst transformations
codegen	Investigate code generation problems or fused operator performance
cost	Understand why Spark chose one plan over another (requires CBO enabled)
formatted	Clean, readable output for documentation and sharing

Key Things to Look For in Execution Plans

What to Look For	Good Sign	Bad Sign
Predicate Pushdown	`PushedFilters: [IsNotNull(col), EqualTo(col, value)]`	Filters applied after full scan
Column Pruning	Only required columns in `ReadSchema`	Reading all columns when only a few are needed
Broadcast Join	`BroadcastHashJoin` for small tables (< 10MB default)	`SortMergeJoin` on a small table = missed broadcast opportunity
Shuffle	Minimal `Exchange` nodes	Multiple `Exchange hashpartitioning` = expensive shuffles
Whole-Stage CodeGen	`*(n)` prefix on operators = fused into single Java function	Missing `*` = not code-generated (UDFs, complex types)

Example: Reading an Execution Plan

df = spark.read.table("sales")
result = df.filter(df.region == "US").groupBy("product").sum("amount")
result.explain("extended")

Output walkthrough:

Parsed: Unresolved sales, region, product, amount
Analyzed: Resolved to catalog table with typed columns
Optimized: Filter pushed down to scan, only product and amount columns read (column pruning)
Physical: FileScan with pushed filters --> HashAggregate (partial) --> Exchange hashpartitioning --> HashAggregate (final)

Debugging Failed Jobs

Step-by-Step Debugging Workflow

1. Find the failed run/task
       |
2. Read the error message
       |
3. Check driver and executor logs
       |
4. Reproduce in a notebook
       |
5. Fix and prevent

Step 1: Find the Failed Run

Go to Jobs UI (Lakeflow Jobs) --> Click the job name --> Runs tab
The matrix view shows a history grid of all task runs -- red cells = failures
Hover over a failed task to see: start/end time, duration, cluster name, error summary
Click the failed task to open Task Run Details with full error message and logs

Step 2: Read the Error Message

Error Type	Common Message	Root Cause
OOM (Driver)	`java.lang.OutOfMemoryError: Java heap space`	`collect()` on large dataset, too many broadcast variables, or under-sized driver
OOM (Executor)	`Container killed by YARN for exceeding memory limits`	Data skew, large shuffle partitions, insufficient executor memory
Schema Error	`AnalysisException: cannot resolve column name`	Schema drift in source data, renamed/dropped columns
File Not Found	`FileNotFoundException`	Source file moved/deleted, mount point expired, stale Delta table cache
Permission Error	`AccessDeniedException`	IAM/UC permissions not configured for the storage location
Timeout	`SparkException: Job aborted due to stage failure`	Cluster under-provisioned, network issues, or data source timeout
Corrupt Data	`SparkException: Malformed records`	Bad input files (truncated CSV, invalid JSON)

Step 3: Check Logs

Log Type	Where to Find	What It Shows
Driver stdout/stderr	Cluster --> Driver Logs tab --> stdout / stderr	Print statements, application-level errors, Python exceptions
Driver log4j	Cluster --> Driver Logs tab --> log4j	Spark internal logs, WARNING/ERROR messages
Executor logs	Spark UI --> Executors tab --> click executor --> stderr	Task-level errors, OOM, GC issues
Cluster Event Log	Compute --> Cluster --> Event Log tab	Cluster start/stop, autoscaling, init script success/failure
Spark Event Log	Configured via `spark.eventLog.dir`	Historical job analysis after cluster termination

Step 4: Reproduce in a Notebook

Attach to the same cluster type and runtime version
Run the failing logic in isolation
Use count() on intermediate DataFrames to identify which transformation fails
Use df.explain(True) to check the execution plan

Step 5: Fix and Prevent

Fix the root cause (not just the symptom)
Repair run: In the Jobs UI, click "Repair Run" to re-run only failed/skipped tasks (no need to re-run successful tasks)
Add monitoring: Set up email/Slack alerts for job failures
Add data quality checks: Use DLT expectations or custom assertions
Persist logs: Configure cluster log delivery to ADLS/S3 for post-mortem analysis

Log Types in Databricks

Cluster Event Logs

Event	What It Means
Cluster starting	Resources being provisioned
Cluster terminated	Job finished, idle timeout, or failure
Unexpected termination	Spot eviction, policy violation, or cloud quota exceeded
Scaling up	Workload requires more parallelism
Scaling down	Cluster underutilized
Frequent scale up/down	Unstable workload or poor partitioning
Init script failure	Dependency/setup issue (libraries, mounts, configs)

Driver Logs

stdout: Application output, print() statements
stderr: Python exceptions, tracebacks, Spark errors
log4j: Spark internal messages (INFO, WARN, ERROR)

The driver log is where you'll find the root cause error message for most job failures.

Executor Logs

Task-level errors and stack traces
OOM kills, GC overhead, shuffle failures
Each executor has its own stderr/stdout
Access via: Spark UI --> Executors tab --> click executor ID --> stderr

Cluster Log Delivery

For post-mortem analysis after cluster termination, configure log delivery to cloud storage:

Azure: ADLS Gen2 or Azure Blob Storage
AWS: S3 bucket
Logs are organized by cluster ID and date
Not supported on serverless compute (at the time of writing)

Best practice: Always enable log delivery for production clusters.

Common Debugging Patterns

1. Data Skew

Symptom: A few tasks take 10x-100x longer than the rest in the Stages tab.

Detection:

Spark UI --> Stages --> Task Duration Summary: large gap between median and max
One executor with disproportionately high shuffle read

Solutions:

Salting: Add a random prefix to the skewed key to distribute data evenly
Broadcast join: If the smaller table fits in memory (< 10MB default, configurable)
Adaptive Query Execution (AQE): Enable spark.sql.adaptive.enabled=true (auto-handles skew in Spark 3.0+)
Repartition: Increase spark.sql.shuffle.partitions to create smaller partitions

2. Shuffle Spill

Symptom: "Spill (Memory)" and "Spill (Disk)" metrics visible in the Stages tab.

Solutions:

Increase spark.sql.shuffle.partitions (default 200 -- try 400-2000 for large datasets)
Increase executor memory
Reduce data volume before shuffle (filter early, aggregate early)

3. Small File Problem

Symptom: Stage has thousands of tasks but each processes only a few KB of data.

Detection: DESCRIBE DETAIL table shows high numFiles with small sizeInBytes.

Solutions:

Run OPTIMIZE table to compact small files into larger ones
Enable Auto Optimize: delta.autoOptimize.optimizeWrite = true
Increase batch size for streaming ingestion

4. OOM Errors

OOM Location	Cause	Fix
Driver	`collect()` on large data, too many partitions in DAG	Avoid `collect()`, increase driver memory, reduce partition count
Executor	Large partitions, data skew, insufficient memory config	Increase executor memory, increase shuffle partitions, fix skew

5. Straggler Tasks

Symptom: Job mostly complete but waiting for 1-2 slow tasks.

Solutions:

Enable speculative execution: spark.speculation=true (re-launches slow tasks on other executors)
Check for data skew (the most common cause)
Check for spot instance evictions causing task retries

Symptom-to-Tool Quick Reference

Symptom	First Tool to Check
Job is running slow	Spark UI --> Jobs tab --> find longest job
Stage has massive data skew	Spark UI --> Stages tab --> Task Duration Summary
Need to see completed job history	Spark Event Logs
Cluster terminated unexpectedly	Cluster Event Logs
Library failed to install at startup	Cluster Logs --> Init script logs
Executor throwing OOM errors	Executor Logs (via Spark UI --> Executors)
Want to step through code interactively	Built-in Debugger in notebooks
UDF returning wrong values	Built-in Debugger + `display()` on intermediate results
Need to verify filter pushdown	`df.explain("formatted")`
Query using wrong join type	SQL tab in Spark UI or `df.explain("extended")`

Python Logging Best Practice in Databricks

For production pipelines, use Python's logging module instead of print():

import logging

logger = logging.getLogger("my_pipeline")
logger.setLevel(logging.INFO)

# File handler -- persist logs to Unity Catalog Volume
file_handler = logging.FileHandler("/Volumes/catalog/schema/volume/pipeline.log")
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
file_handler.setFormatter(formatter)

# Avoid duplicate handlers in interactive notebooks
if not logger.hasHandlers():
    logger.addHandler(file_handler)
    logger.addHandler(logging.StreamHandler())  # Also print to stdout

logger.info("Pipeline started")
logger.warning("Schema drift detected in source table")
logger.error("Failed to write to target table", exc_info=True)

Why not print()?

No log levels (can't distinguish INFO vs ERROR)
No timestamps or module names
Can't route to files or external systems
Not production-grade

Interview Question

#databricks #spark