Monitoring & Debugging - Spark UI, execution plans, debugging failed jobs

The Debugging Mental Model

Before diving into individual tools, anchor yourself in this hierarchy:

Cluster --> Driver --> Executor --> Stage --> Task --> Table (Delta)
Level What to Check
Cluster Infrastructure, lifecycle events, autoscaling, init script failures
Driver Job orchestration, SparkContext errors, DAG planning failures
Executor Task execution, OOM, garbage collection, shuffle issues
Stage Shuffle boundaries, data distribution, spill
Task Fine-grained execution, skew, straggler tasks
Table (Delta) Data operations, file count, table history, small file problem

Rule of thumb:


Spark UI

The Spark UI is the primary tool for understanding job performance. It gives you a live, structured view of every job, stage, and task running on the cluster.

How to access: Compute page --> Select your cluster --> Spark UI tab

Spark UI Tabs

Tab What It Shows When to Use
Jobs Lists every Spark action triggered. Shows duration, status, number of stages per job. Start here to find which job is slow or failed.
Stages Breaks each job into stages. Shows shuffle read/write, spill to disk, task distribution. Key for finding data skew and expensive shuffles.
Storage Cached RDDs and DataFrames -- size in memory, on-disk spill, fraction cached. Check if caching is working or wasting memory.
Executors Executor activity: cores, active/failed tasks, GC time, shuffle read/write, memory usage. Identify under-utilized or overloaded executors.
SQL Shows query execution plans visually as a DAG. Displays scan, filter, join, aggregate nodes with metrics. Understand how SQL/DataFrame queries are executed.
Environment All Spark configurations, JVM info, system properties, classpath. Verify configuration settings are applied correctly.
Streaming Streaming job metrics: input rate, processing time, batch duration, completed batches. Monitor streaming pipeline health.

Jobs Tab - What to Look For

Stages Tab - What to Look For

This is where most debugging happens.

Metric What It Means Action
Shuffle Read/Write Data exchanged between stages (wide transformations). High shuffle = expensive. Consider broadcast joins or pre-partitioning.
Spill (Memory) Data that didn't fit in executor memory during shuffle. Increase spark.sql.shuffle.partitions or executor memory.
Spill (Disk) Data spilled from memory to disk. Same as above -- disk spill is slower than memory spill.
Task Duration (Summary) Min, 25th percentile, median, 75th percentile, max task duration. Large gap between median and max = data skew.
Input Size Data read by the stage. Unexpectedly large input = missing filters or predicate pushdown.

Tasks Tab - What to Look For

Drill into a specific stage to see individual task metrics.

Symptom Diagnosis
A few tasks taking 10x-100x longer than others Data skew -- one partition has far more data
All tasks slow uniformly Resource issue (too few executors, small instances) or I/O bottleneck
Many failed tasks with retries OOM, corrupt data, or unstable spot instances
High GC time on tasks Executor memory pressure -- increase memory or reduce partition size

Executors Tab - What to Look For

Column Healthy Problematic
Active Tasks Spread evenly across executors All tasks on 1-2 executors = skew or bad partitioning
Failed Tasks 0 > 0 = investigate OOM, data corruption
GC Time < 5% of total time > 10% = memory pressure, increase executor memory
Shuffle Read/Write Balanced across executors One executor with 10x more = data skew
Dead Executors 0 > 0 = OOM kills, spot eviction, or resource limits

DAG Visualization


Execution Plans

What is an Execution Plan?

When you write a DataFrame transformation or SQL query, Spark does not execute it immediately (lazy evaluation). The Catalyst Optimizer transforms your code through multiple stages before execution:

Your Code --> Parsed Logical Plan --> Analyzed --> Optimized --> Physical Plan --> Execution

Catalyst Optimizer Pipeline

Stage What Happens
Parsed Logical Plan Raw AST from your code. Table/column names are unresolved strings.
Analyzed Logical Plan Resolves table names, column names, data types against the catalog. Throws AnalysisException if references are invalid.
Optimized Logical Plan Catalyst applies optimization rules: predicate pushdown, constant folding, column pruning, join reordering.
Physical Plan Determines actual execution strategy: join algorithms (broadcast vs sort-merge vs shuffle-hash), data access methods, shuffle strategies. Multiple physical plans are generated; Spark picks the most efficient based on cost.

EXPLAIN Modes

# PySpark
df.explain("simple")     # Only physical plan (default)
df.explain("extended")   # All 4 plans: parsed, analyzed, optimized, physical
df.explain("codegen")    # Physical plan + generated Java code
df.explain("cost")       # Optimized plan with CBO cost estimates
df.explain("formatted")  # Clean tree with node IDs and splits
-- SQL
EXPLAIN SELECT * FROM table;
EXPLAIN EXTENDED SELECT * FROM table;
EXPLAIN FORMATTED SELECT * FROM table;
Mode When to Use
simple Quick check of what Spark will actually execute
extended Full end-to-end debugging; understand Catalyst transformations
codegen Investigate code generation problems or fused operator performance
cost Understand why Spark chose one plan over another (requires CBO enabled)
formatted Clean, readable output for documentation and sharing

Key Things to Look For in Execution Plans

What to Look For Good Sign Bad Sign
Predicate Pushdown PushedFilters: [IsNotNull(col), EqualTo(col, value)] Filters applied after full scan
Column Pruning Only required columns in ReadSchema Reading all columns when only a few are needed
Broadcast Join BroadcastHashJoin for small tables (< 10MB default) SortMergeJoin on a small table = missed broadcast opportunity
Shuffle Minimal Exchange nodes Multiple Exchange hashpartitioning = expensive shuffles
Whole-Stage CodeGen *(n) prefix on operators = fused into single Java function Missing * = not code-generated (UDFs, complex types)

Example: Reading an Execution Plan

df = spark.read.table("sales")
result = df.filter(df.region == "US").groupBy("product").sum("amount")
result.explain("extended")

Output walkthrough:

  1. Parsed: Unresolved sales, region, product, amount
  2. Analyzed: Resolved to catalog table with typed columns
  3. Optimized: Filter pushed down to scan, only product and amount columns read (column pruning)
  4. Physical: FileScan with pushed filters --> HashAggregate (partial) --> Exchange hashpartitioning --> HashAggregate (final)

Debugging Failed Jobs

Step-by-Step Debugging Workflow

1. Find the failed run/task
       |
2. Read the error message
       |
3. Check driver and executor logs
       |
4. Reproduce in a notebook
       |
5. Fix and prevent

Step 1: Find the Failed Run

Step 2: Read the Error Message

Error Type Common Message Root Cause
OOM (Driver) java.lang.OutOfMemoryError: Java heap space collect() on large dataset, too many broadcast variables, or under-sized driver
OOM (Executor) Container killed by YARN for exceeding memory limits Data skew, large shuffle partitions, insufficient executor memory
Schema Error AnalysisException: cannot resolve column name Schema drift in source data, renamed/dropped columns
File Not Found FileNotFoundException Source file moved/deleted, mount point expired, stale Delta table cache
Permission Error AccessDeniedException IAM/UC permissions not configured for the storage location
Timeout SparkException: Job aborted due to stage failure Cluster under-provisioned, network issues, or data source timeout
Corrupt Data SparkException: Malformed records Bad input files (truncated CSV, invalid JSON)

Step 3: Check Logs

Log Type Where to Find What It Shows
Driver stdout/stderr Cluster --> Driver Logs tab --> stdout / stderr Print statements, application-level errors, Python exceptions
Driver log4j Cluster --> Driver Logs tab --> log4j Spark internal logs, WARNING/ERROR messages
Executor logs Spark UI --> Executors tab --> click executor --> stderr Task-level errors, OOM, GC issues
Cluster Event Log Compute --> Cluster --> Event Log tab Cluster start/stop, autoscaling, init script success/failure
Spark Event Log Configured via spark.eventLog.dir Historical job analysis after cluster termination

Step 4: Reproduce in a Notebook

Step 5: Fix and Prevent


Log Types in Databricks

Cluster Event Logs

Event What It Means
Cluster starting Resources being provisioned
Cluster terminated Job finished, idle timeout, or failure
Unexpected termination Spot eviction, policy violation, or cloud quota exceeded
Scaling up Workload requires more parallelism
Scaling down Cluster underutilized
Frequent scale up/down Unstable workload or poor partitioning
Init script failure Dependency/setup issue (libraries, mounts, configs)

Driver Logs

The driver log is where you'll find the root cause error message for most job failures.

Executor Logs

Cluster Log Delivery

For post-mortem analysis after cluster termination, configure log delivery to cloud storage:

Best practice: Always enable log delivery for production clusters.


Common Debugging Patterns

1. Data Skew

Symptom: A few tasks take 10x-100x longer than the rest in the Stages tab.

Detection:

Solutions:

2. Shuffle Spill

Symptom: "Spill (Memory)" and "Spill (Disk)" metrics visible in the Stages tab.

Solutions:

3. Small File Problem

Symptom: Stage has thousands of tasks but each processes only a few KB of data.

Detection: DESCRIBE DETAIL table shows high numFiles with small sizeInBytes.

Solutions:

4. OOM Errors

OOM Location Cause Fix
Driver collect() on large data, too many partitions in DAG Avoid collect(), increase driver memory, reduce partition count
Executor Large partitions, data skew, insufficient memory config Increase executor memory, increase shuffle partitions, fix skew

5. Straggler Tasks

Symptom: Job mostly complete but waiting for 1-2 slow tasks.

Solutions:


Symptom-to-Tool Quick Reference

Symptom First Tool to Check
Job is running slow Spark UI --> Jobs tab --> find longest job
Stage has massive data skew Spark UI --> Stages tab --> Task Duration Summary
Need to see completed job history Spark Event Logs
Cluster terminated unexpectedly Cluster Event Logs
Library failed to install at startup Cluster Logs --> Init script logs
Executor throwing OOM errors Executor Logs (via Spark UI --> Executors)
Want to step through code interactively Built-in Debugger in notebooks
UDF returning wrong values Built-in Debugger + display() on intermediate results
Need to verify filter pushdown df.explain("formatted")
Query using wrong join type SQL tab in Spark UI or df.explain("extended")

Python Logging Best Practice in Databricks

For production pipelines, use Python's logging module instead of print():

import logging

logger = logging.getLogger("my_pipeline")
logger.setLevel(logging.INFO)

# File handler -- persist logs to Unity Catalog Volume
file_handler = logging.FileHandler("/Volumes/catalog/schema/volume/pipeline.log")
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
file_handler.setFormatter(formatter)

# Avoid duplicate handlers in interactive notebooks
if not logger.hasHandlers():
    logger.addHandler(file_handler)
    logger.addHandler(logging.StreamHandler())  # Also print to stdout

logger.info("Pipeline started")
logger.warning("Schema drift detected in source table")
logger.error("Failed to write to target table", exc_info=True)

Why not print()?

Interview Question


#databricks #spark