Monitoring & Debugging - Spark UI, execution plans, debugging failed jobs
The Debugging Mental Model
Before diving into individual tools, anchor yourself in this hierarchy:
Cluster --> Driver --> Executor --> Stage --> Task --> Table (Delta)
| Level | What to Check |
|---|---|
| Cluster | Infrastructure, lifecycle events, autoscaling, init script failures |
| Driver | Job orchestration, SparkContext errors, DAG planning failures |
| Executor | Task execution, OOM, garbage collection, shuffle issues |
| Stage | Shuffle boundaries, data distribution, spill |
| Task | Fine-grained execution, skew, straggler tasks |
| Table (Delta) | Data operations, file count, table history, small file problem |
Rule of thumb:
- Debug top-down (start broad, narrow down)
- Optimize bottom-up (fix root causes at the task level)
Spark UI
The Spark UI is the primary tool for understanding job performance. It gives you a live, structured view of every job, stage, and task running on the cluster.
How to access: Compute page --> Select your cluster --> Spark UI tab
Spark UI Tabs
| Tab | What It Shows | When to Use |
|---|---|---|
| Jobs | Lists every Spark action triggered. Shows duration, status, number of stages per job. | Start here to find which job is slow or failed. |
| Stages | Breaks each job into stages. Shows shuffle read/write, spill to disk, task distribution. | Key for finding data skew and expensive shuffles. |
| Storage | Cached RDDs and DataFrames -- size in memory, on-disk spill, fraction cached. | Check if caching is working or wasting memory. |
| Executors | Executor activity: cores, active/failed tasks, GC time, shuffle read/write, memory usage. | Identify under-utilized or overloaded executors. |
| SQL | Shows query execution plans visually as a DAG. Displays scan, filter, join, aggregate nodes with metrics. | Understand how SQL/DataFrame queries are executed. |
| Environment | All Spark configurations, JVM info, system properties, classpath. | Verify configuration settings are applied correctly. |
| Streaming | Streaming job metrics: input rate, processing time, batch duration, completed batches. | Monitor streaming pipeline health. |
Jobs Tab - What to Look For
- Job duration: Which job is taking the longest?
- Status: Failed jobs show red -- click to drill into the failing stage.
- Number of stages: Many stages may indicate excessive shuffles from wide transformations (joins, groupBy, repartition).
Stages Tab - What to Look For
This is where most debugging happens.
| Metric | What It Means | Action |
|---|---|---|
| Shuffle Read/Write | Data exchanged between stages (wide transformations). | High shuffle = expensive. Consider broadcast joins or pre-partitioning. |
| Spill (Memory) | Data that didn't fit in executor memory during shuffle. | Increase spark.sql.shuffle.partitions or executor memory. |
| Spill (Disk) | Data spilled from memory to disk. | Same as above -- disk spill is slower than memory spill. |
| Task Duration (Summary) | Min, 25th percentile, median, 75th percentile, max task duration. | Large gap between median and max = data skew. |
| Input Size | Data read by the stage. | Unexpectedly large input = missing filters or predicate pushdown. |
Tasks Tab - What to Look For
Drill into a specific stage to see individual task metrics.
| Symptom | Diagnosis |
|---|---|
| A few tasks taking 10x-100x longer than others | Data skew -- one partition has far more data |
| All tasks slow uniformly | Resource issue (too few executors, small instances) or I/O bottleneck |
| Many failed tasks with retries | OOM, corrupt data, or unstable spot instances |
| High GC time on tasks | Executor memory pressure -- increase memory or reduce partition size |
Executors Tab - What to Look For
| Column | Healthy | Problematic |
|---|---|---|
| Active Tasks | Spread evenly across executors | All tasks on 1-2 executors = skew or bad partitioning |
| Failed Tasks | 0 | > 0 = investigate OOM, data corruption |
| GC Time | < 5% of total time | > 10% = memory pressure, increase executor memory |
| Shuffle Read/Write | Balanced across executors | One executor with 10x more = data skew |
| Dead Executors | 0 | > 0 = OOM kills, spot eviction, or resource limits |
DAG Visualization
- Available in the Jobs tab (click a job) or SQL tab.
- Shows the directed acyclic graph of transformations.
- Narrow stages (map, filter): no shuffle, fast.
- Wide stages (join, groupBy, repartition): shuffle boundary, expensive.
- Look for unnecessarily wide stages and redundant shuffles.
Execution Plans
What is an Execution Plan?
When you write a DataFrame transformation or SQL query, Spark does not execute it immediately (lazy evaluation). The Catalyst Optimizer transforms your code through multiple stages before execution:
Your Code --> Parsed Logical Plan --> Analyzed --> Optimized --> Physical Plan --> Execution
Catalyst Optimizer Pipeline
| Stage | What Happens |
|---|---|
| Parsed Logical Plan | Raw AST from your code. Table/column names are unresolved strings. |
| Analyzed Logical Plan | Resolves table names, column names, data types against the catalog. Throws AnalysisException if references are invalid. |
| Optimized Logical Plan | Catalyst applies optimization rules: predicate pushdown, constant folding, column pruning, join reordering. |
| Physical Plan | Determines actual execution strategy: join algorithms (broadcast vs sort-merge vs shuffle-hash), data access methods, shuffle strategies. Multiple physical plans are generated; Spark picks the most efficient based on cost. |
EXPLAIN Modes
# PySpark
df.explain("simple") # Only physical plan (default)
df.explain("extended") # All 4 plans: parsed, analyzed, optimized, physical
df.explain("codegen") # Physical plan + generated Java code
df.explain("cost") # Optimized plan with CBO cost estimates
df.explain("formatted") # Clean tree with node IDs and splits
-- SQL
EXPLAIN SELECT * FROM table;
EXPLAIN EXTENDED SELECT * FROM table;
EXPLAIN FORMATTED SELECT * FROM table;
| Mode | When to Use |
|---|---|
| simple | Quick check of what Spark will actually execute |
| extended | Full end-to-end debugging; understand Catalyst transformations |
| codegen | Investigate code generation problems or fused operator performance |
| cost | Understand why Spark chose one plan over another (requires CBO enabled) |
| formatted | Clean, readable output for documentation and sharing |
Key Things to Look For in Execution Plans
| What to Look For | Good Sign | Bad Sign |
|---|---|---|
| Predicate Pushdown | PushedFilters: [IsNotNull(col), EqualTo(col, value)] |
Filters applied after full scan |
| Column Pruning | Only required columns in ReadSchema |
Reading all columns when only a few are needed |
| Broadcast Join | BroadcastHashJoin for small tables (< 10MB default) |
SortMergeJoin on a small table = missed broadcast opportunity |
| Shuffle | Minimal Exchange nodes |
Multiple Exchange hashpartitioning = expensive shuffles |
| Whole-Stage CodeGen | *(n) prefix on operators = fused into single Java function |
Missing * = not code-generated (UDFs, complex types) |
Example: Reading an Execution Plan
df = spark.read.table("sales")
result = df.filter(df.region == "US").groupBy("product").sum("amount")
result.explain("extended")
Output walkthrough:
- Parsed: Unresolved
sales,region,product,amount - Analyzed: Resolved to catalog table with typed columns
- Optimized: Filter pushed down to scan, only
productandamountcolumns read (column pruning) - Physical:
FileScanwith pushed filters -->HashAggregate(partial) -->Exchange hashpartitioning-->HashAggregate(final)
Debugging Failed Jobs
Step-by-Step Debugging Workflow
1. Find the failed run/task
|
2. Read the error message
|
3. Check driver and executor logs
|
4. Reproduce in a notebook
|
5. Fix and prevent
Step 1: Find the Failed Run
- Go to Jobs UI (Lakeflow Jobs) --> Click the job name --> Runs tab
- The matrix view shows a history grid of all task runs -- red cells = failures
- Hover over a failed task to see: start/end time, duration, cluster name, error summary
- Click the failed task to open Task Run Details with full error message and logs
Step 2: Read the Error Message
| Error Type | Common Message | Root Cause |
|---|---|---|
| OOM (Driver) | java.lang.OutOfMemoryError: Java heap space |
collect() on large dataset, too many broadcast variables, or under-sized driver |
| OOM (Executor) | Container killed by YARN for exceeding memory limits |
Data skew, large shuffle partitions, insufficient executor memory |
| Schema Error | AnalysisException: cannot resolve column name |
Schema drift in source data, renamed/dropped columns |
| File Not Found | FileNotFoundException |
Source file moved/deleted, mount point expired, stale Delta table cache |
| Permission Error | AccessDeniedException |
IAM/UC permissions not configured for the storage location |
| Timeout | SparkException: Job aborted due to stage failure |
Cluster under-provisioned, network issues, or data source timeout |
| Corrupt Data | SparkException: Malformed records |
Bad input files (truncated CSV, invalid JSON) |
Step 3: Check Logs
| Log Type | Where to Find | What It Shows |
|---|---|---|
| Driver stdout/stderr | Cluster --> Driver Logs tab --> stdout / stderr | Print statements, application-level errors, Python exceptions |
| Driver log4j | Cluster --> Driver Logs tab --> log4j | Spark internal logs, WARNING/ERROR messages |
| Executor logs | Spark UI --> Executors tab --> click executor --> stderr | Task-level errors, OOM, GC issues |
| Cluster Event Log | Compute --> Cluster --> Event Log tab | Cluster start/stop, autoscaling, init script success/failure |
| Spark Event Log | Configured via spark.eventLog.dir |
Historical job analysis after cluster termination |
Step 4: Reproduce in a Notebook
- Attach to the same cluster type and runtime version
- Run the failing logic in isolation
- Use
count()on intermediate DataFrames to identify which transformation fails - Use
df.explain(True)to check the execution plan
Step 5: Fix and Prevent
- Fix the root cause (not just the symptom)
- Repair run: In the Jobs UI, click "Repair Run" to re-run only failed/skipped tasks (no need to re-run successful tasks)
- Add monitoring: Set up email/Slack alerts for job failures
- Add data quality checks: Use DLT expectations or custom assertions
- Persist logs: Configure cluster log delivery to ADLS/S3 for post-mortem analysis
Log Types in Databricks
Cluster Event Logs
| Event | What It Means |
|---|---|
| Cluster starting | Resources being provisioned |
| Cluster terminated | Job finished, idle timeout, or failure |
| Unexpected termination | Spot eviction, policy violation, or cloud quota exceeded |
| Scaling up | Workload requires more parallelism |
| Scaling down | Cluster underutilized |
| Frequent scale up/down | Unstable workload or poor partitioning |
| Init script failure | Dependency/setup issue (libraries, mounts, configs) |
Driver Logs
- stdout: Application output,
print()statements - stderr: Python exceptions, tracebacks, Spark errors
- log4j: Spark internal messages (INFO, WARN, ERROR)
The driver log is where you'll find the root cause error message for most job failures.
Executor Logs
- Task-level errors and stack traces
- OOM kills, GC overhead, shuffle failures
- Each executor has its own stderr/stdout
- Access via: Spark UI --> Executors tab --> click executor ID --> stderr
Cluster Log Delivery
For post-mortem analysis after cluster termination, configure log delivery to cloud storage:
- Azure: ADLS Gen2 or Azure Blob Storage
- AWS: S3 bucket
- Logs are organized by cluster ID and date
- Not supported on serverless compute (at the time of writing)
Best practice: Always enable log delivery for production clusters.
Common Debugging Patterns
1. Data Skew
Symptom: A few tasks take 10x-100x longer than the rest in the Stages tab.
Detection:
- Spark UI --> Stages --> Task Duration Summary: large gap between median and max
- One executor with disproportionately high shuffle read
Solutions:
- Salting: Add a random prefix to the skewed key to distribute data evenly
- Broadcast join: If the smaller table fits in memory (< 10MB default, configurable)
- Adaptive Query Execution (AQE): Enable
spark.sql.adaptive.enabled=true(auto-handles skew in Spark 3.0+) - Repartition: Increase
spark.sql.shuffle.partitionsto create smaller partitions
2. Shuffle Spill
Symptom: "Spill (Memory)" and "Spill (Disk)" metrics visible in the Stages tab.
Solutions:
- Increase
spark.sql.shuffle.partitions(default 200 -- try 400-2000 for large datasets) - Increase executor memory
- Reduce data volume before shuffle (filter early, aggregate early)
3. Small File Problem
Symptom: Stage has thousands of tasks but each processes only a few KB of data.
Detection: DESCRIBE DETAIL table shows high numFiles with small sizeInBytes.
Solutions:
- Run
OPTIMIZE tableto compact small files into larger ones - Enable Auto Optimize:
delta.autoOptimize.optimizeWrite = true - Increase batch size for streaming ingestion
4. OOM Errors
| OOM Location | Cause | Fix |
|---|---|---|
| Driver | collect() on large data, too many partitions in DAG |
Avoid collect(), increase driver memory, reduce partition count |
| Executor | Large partitions, data skew, insufficient memory config | Increase executor memory, increase shuffle partitions, fix skew |
5. Straggler Tasks
Symptom: Job mostly complete but waiting for 1-2 slow tasks.
Solutions:
- Enable speculative execution:
spark.speculation=true(re-launches slow tasks on other executors) - Check for data skew (the most common cause)
- Check for spot instance evictions causing task retries
Symptom-to-Tool Quick Reference
| Symptom | First Tool to Check |
|---|---|
| Job is running slow | Spark UI --> Jobs tab --> find longest job |
| Stage has massive data skew | Spark UI --> Stages tab --> Task Duration Summary |
| Need to see completed job history | Spark Event Logs |
| Cluster terminated unexpectedly | Cluster Event Logs |
| Library failed to install at startup | Cluster Logs --> Init script logs |
| Executor throwing OOM errors | Executor Logs (via Spark UI --> Executors) |
| Want to step through code interactively | Built-in Debugger in notebooks |
| UDF returning wrong values | Built-in Debugger + display() on intermediate results |
| Need to verify filter pushdown | df.explain("formatted") |
| Query using wrong join type | SQL tab in Spark UI or df.explain("extended") |
Python Logging Best Practice in Databricks
For production pipelines, use Python's logging module instead of print():
import logging
logger = logging.getLogger("my_pipeline")
logger.setLevel(logging.INFO)
# File handler -- persist logs to Unity Catalog Volume
file_handler = logging.FileHandler("/Volumes/catalog/schema/volume/pipeline.log")
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
file_handler.setFormatter(formatter)
# Avoid duplicate handlers in interactive notebooks
if not logger.hasHandlers():
logger.addHandler(file_handler)
logger.addHandler(logging.StreamHandler()) # Also print to stdout
logger.info("Pipeline started")
logger.warning("Schema drift detected in source table")
logger.error("Failed to write to target table", exc_info=True)
Why not print()?
- No log levels (can't distinguish INFO vs ERROR)
- No timestamps or module names
- Can't route to files or external systems
- Not production-grade