Databricks Interview Questions
Here I will be adding the questions as I learn and work on the databricks.
Question I think are important
1. Difference b/w Serverless Compute Plane and Classic Compute Plane
The primary difference between a serverless and a classic compute plane comes down to infrastructure ownership, startup latency, and billing granularity. In a classic compute plane, the virtual machines run inside my own cloud account, giving me full control over network topology and instance tuning at the cost of management overhead. In a serverless compute plane, the infrastructure runs in the vendor’s account, abstracting away the hardware layer completely so I can focus purely on data processing and transformation code.
2. What is a Databricks Workspace?
A workspace is the central collaborative environment where users organize and manage all Databricks assets -- notebooks, files, folders, dashboards, queries, libraries, experiments, and repos. It provides a unified web UI to access every Databricks feature.
3. What are the workspace folder types?
- Users/ -- Private per-user space
- Shared/ -- Team collaboration space
- Repos/ -- Git-connected folders (GitHub, Azure DevOps, GitLab)
4. What is the difference between a Workspace and a Cluster?
Workspace is the organizational layer (where you manage assets). Cluster is the compute layer (where code actually runs). A notebook in the workspace must be attached to a cluster to execute.
5. (IMP) What is the difference between %run and dbutils.notebook.run()?
| ~ | %run |
dbutils.notebook.run() |
|---|---|---|
| Execution context | Shared -- variables from called notebook are available in caller | Isolated -- separate execution context |
| Return value | No explicit return | Returns a string via dbutils.notebook.exit() |
| Parameters | No parameter passing | Supports parameter passing via arguments={} |
| Use case | Importing shared functions/configs | Orchestrating pipelines, chaining notebooks |
6. What are notebook permission levels?
| Level | View | Comment | Run | Edit | Manage |
|---|---|---|---|---|---|
| NO PERMISSIONS | |||||
| CAN READ | x | x | |||
| CAN RUN | x | x | x | ||
| CAN EDIT | x | x | x | x | |
| CAN MANAGE | x | x | x | x | x |
- Creators get
CAN MANAGEby default - Notebooks inherit permissions from parent folder
7. What is a Temporary View vs Global Temporary View?
| ~ | Temporary View | Global Temporary View |
|---|---|---|
| Scope | Session-scoped -- available only in the notebook/session that created it | Application-scoped -- available across all notebooks on the same cluster |
| Namespace | Default namespace | global_temp schema |
| Syntax | CREATE OR REPLACE TEMP VIEW v AS ... |
CREATE OR REPLACE GLOBAL TEMP VIEW v AS ... |
| Access | SELECT * FROM v |
SELECT * FROM global_temp.v |
| Lifetime | Ends when session/notebook detaches | Ends when cluster terminates |
8. What are Databricks Widgets? Why are they used?
Widgets add interactive input parameters to notebooks (text fields, dropdowns, comboboxes, multiselect). Used to parameterize notebooks for reusable pipelines -- e.g., passing env, date, region at runtime.
# to create a dropdown widget
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])
# get value from a widget
env = dbutils.widgets.get("env")
9. What are the key dbutils modules?
| Module | Purpose | Key Commands |
|---|---|---|
dbutils.fs |
File system (DBFS) | ls, cp, mv, rm, mkdirs, head, put, mount/unmount |
dbutils.secrets |
Secure secret access | listScopes, list, get |
dbutils.widgets |
Notebook parameterization | text, dropdown, get, removeAll |
dbutils.notebook |
Notebook orchestration | run, exit |
dbutils.jobs |
Job features | Task-level values |
10. (IMP) How do you manage secrets in Databricks?
- Create a secret scope (backed by Databricks or Azure Key Vault)
- Store secrets in the scope
- Access in notebooks via
dbutils.secrets.get("scope", "key") - Secret values are redacted in notebook output -- never printed in plain text
11. What is DBFS? Is it still recommended?
DBFS (Databricks File System) is a distributed file system mounted into the workspace. It abstracts cloud storage. However, DBFS root is now considered legacy. Unity Catalog with External Locations and Volumes is the recommended approach for data access.
12. (IMP) What is the difference between mounting storage and using Unity Catalog External Locations?
| ~ | Mount Points (Legacy) | UC External Locations (Recommended) |
|---|---|---|
| Setup | dbutils.fs.mount(...) |
Configured via Unity Catalog |
| Governance | No fine-grained access control | Full UC governance (ACLs, lineage, audit) |
| Scope | Workspace-level | Cross-workspace |
| Credentials | Stored in mount config | Managed via Storage Credentials |
13. How does Databricks integrate with Azure?
Azure Databricks is a first-party Azure service jointly developed by Databricks and Microsoft. It is deployed as an Azure resource and integrates natively with the Azure ecosystem.
14. What Azure services does Databricks integrate with?
| Category | Service | How It Integrates |
|---|---|---|
| Storage | ADLS Gen2 | Primary lakehouse storage |
| Batch Ingestion | Azure Data Factory | Orchestrate batch ETL into ADLS |
| Streaming | Event Hubs / IoT Hub | Real-time data ingestion |
| Identity | Entra ID + SCIM | SSO, user provisioning |
| Secrets | Azure Key Vault | Backed secret scopes |
| Governance | Microsoft Purview | Data discovery, classification, lineage export from UC |
| BI | Power BI | Optimized connector + Direct Lake via Fabric |
| Monitoring | Azure Monitor | Diagnostics, telemetry |
| CI/CD | Azure DevOps / GitHub | Deployment automation |
| Cost | Cost Management | Track Databricks spend |
| Networking | VNet, Private Link, NSG | Secure classic workspace isolation |
15. (IMP) What is the difference between Azure Databricks and Databricks on AWS/GCP?
The core platform is the same. Azure-specific differences:
- Deployed via Azure Portal as a native Azure resource
- Uses ADLS Gen2 (not S3/GCS) as primary storage
- Identity via Microsoft Entra ID (not AWS IAM)
- Governance exports to Microsoft Purview
- BI via Power BI optimized connector
- Networking uses Azure VNet injection and Private Link
16. How does data flow in an Azure Databricks architecture?
Sources --> Ingestion --> Databricks (Medallion) --> Serving
| | | |
|-- Batch -> ADF --------> Bronze (raw) |-> Power BI
|-- Stream-> Event Hubs -> Silver (clean) |-> DBSQL
|-- CDC ---> Auto Loader-> Gold (aggregated) |-> ML Models
|
ADLS Gen2 (Delta Lake)
17. How do you connect Azure Databricks to ADLS Gen2?
Three methods (know all three):
- Unity Catalog External Locations (recommended) -- governed, cross-workspace
- Service Principal + OAuth -- via Spark config with client ID/secret
- Mount Points (legacy) --
dbutils.fs.mount()with OAuth config
18. What is the difference between a Driver Node and Worker Nodes in a Spark cluster?
| Driver Node | Worker Nodes | |
|---|---|---|
| Count | Exactly 1 per cluster | 0 to many |
| Role | Coordinator - holds SparkContext, plans the execution DAG, distributes tasks across workers, collects results | Executors -- process data partitions in parallel, store intermediate results |
| Failure impact | Entire job fails | Task is retried on another worker (Spark's fault tolerance) |
| Spot instance? | Never - always use on-demand (eviction kills the whole job) | Yes -- recommended for cost savings on fault-tolerant workloads |
| Sizing | Needs enough memory to hold the DAG plan, broadcast variables, and collected results | Needs enough memory/CPU for the data partitions being processed |
Key follow-up: The driver is the single point of failure. That's why you never put it on a spot instance, and why collect() on large datasets can crash the driver (all data is pulled to one node).
19. What is the difference between All-Purpose Clusters and Job Clusters? When would you use each?
| All-Purpose Cluster | Job Cluster | |
|---|---|---|
| Creation | Manual (UI, CLI, API) | Automatic (created when a job is triggered) |
| Lifecycle | Stays running until manually stopped or auto-terminated | Auto-terminates when the job completes |
| Sharing | Multiple users/notebooks can attach simultaneously | Isolated to a single job run |
| Cost (Premium) | ~$0.55/DBU | ~$0.15/DBU - 3-4x cheaper |
| Restart | Can be restarted by users | Cannot restart - new cluster per run |
| Use case | Development, exploration, ad-hoc analysis, collaboration | Production ETL, scheduled workflows, ML training |
The most common follow-up: "Why not use All-Purpose for everything?"
Answer: Cost. At 3-4x the price, running production pipelines on All-Purpose clusters is the single biggest source of unnecessary Databricks spend. All-Purpose is for humans iterating; Job Clusters are for automated workloads.
20. How does Databricks autoscaling work? What are the key considerations?
Databricks autoscaling monitors the pending task queue in Spark:
- When pending tasks exceed available executor cores - scales up (adds worker nodes from the cloud provider or instance pool)
- When workers sit idle beyond a threshold - scales down (removes workers)
You configure two parameters:
- Min workers: baseline floor (keeps cluster responsive, avoids cold start delays)
- Max workers: budget ceiling (caps cost)
Key considerations:
| Consideration | Guidance |
|---|---|
| Narrow range (e.g., 4-8) | Latency-sensitive workloads -- predictable performance |
| Wide range (e.g., 2-50) | Throughput-heavy batch jobs -- maximize parallelism |
| Ultra-short jobs | Don't autoscale -- startup overhead exceeds benefit |
| Instance Pools | Combine with autoscaling for faster node acquisition (seconds vs minutes) |
| Streaming workloads | Keep min workers higher -- scaling latency can cause processing delays |
| Scale-down behavior | Databricks is conservative on scale-down to avoid thrashing (removing then re-adding nodes) |
21. What factors do you consider when creating a Databricks cluster?
| Factor | What to Decide |
|---|---|
| Compute type | All-Purpose (dev) vs Job Cluster (prod) vs SQL Warehouse (SQL/BI) vs Serverless |
| Access mode | Standard (shared, UC-governed) vs Dedicated (single user, RDD/GPU/R support) |
| Runtime version | Latest for dev; LTS for production stability |
| Photon | Enable for SQL/DataFrame-heavy workloads (2-4x speedup) |
| Node type | General Purpose (balanced), Memory Optimized (ML, large joins), Compute Optimized (CPU-heavy ETL), GPU (deep learning) |
| Workers | Number based on data volume; enable autoscaling with sensible min/max |
| Spot instances | Use for workers on fault-tolerant jobs; never for the driver |
| Auto-termination | Always enable for interactive clusters (10-30 min recommended) |
| Instance pool | Use for fast startup on frequently used clusters |
| Cluster policy | Apply admin-defined policies for governance and cost guardrails |
| Init scripts | For custom library installs or environment setup at cluster startup |
| Tags | Add for cost tracking by team/project/environment |
22. What is Photon and how does it improve performance?
Photon is Databricks' native vectorized query engine written in C++. It replaces the standard JVM-based Spark execution for SQL and DataFrame operations.
How it improves performance:
- Vectorized execution: processes data in batches (columnar) instead of row-by-row
- Native C++ execution: bypasses JVM overhead (no garbage collection, no serialization)
- Optimized for I/O: better scan, filter, and aggregation performance on Delta Lake
- Result: 2-4x faster (and up to 10x in some benchmarks) for SQL-heavy workloads
When Photon helps:
- Large aggregations, joins, and scans
- SQL Warehouse queries (Photon is enabled by default)
- DataFrame transformations on wide tables
When Photon may NOT help:
- Pure Python/RDD workloads (Photon operates at the SQL/DataFrame layer)
- Ultra-short jobs completing in under 2 seconds
- UDF-heavy pipelines (UDFs bypass Photon's optimized execution)
Cost implication: Photon uses more DBUs per hour, but jobs complete faster -- net cost is often lower.
23. What are the different SQL Warehouse types? When would you use each?
| Type | Startup | Management | When to Use |
|---|---|---|---|
| Serverless | 2-6 seconds | Fully managed by Databricks | Recommended default. Best for BI dashboards, ad-hoc SQL, ETL. Rapid autoscaling, no infra config. |
| Pro | Minutes | Customer-managed compute | When serverless is unavailable in your region, or you need custom networking (VNet injection, firewall rules, federation). |
| Classic | Minutes | Customer-managed compute | Entry-level. Basic interactive SQL exploration when Pro/Serverless are not options. Least features. |
All SQL Warehouses come with Photon enabled by default.
Key differences from clusters:
- SQL Warehouses are SQL-only (no Python, Scala, R)
- Optimized for concurrent BI queries (Power BI, Tableau, Looker)
- Serverless warehouses have auto-suspend -- no idle cost when not querying
24. How would you reduce Databricks costs in a production environment?
Ordered by impact:
| # | Strategy | Savings | Effort |
|---|---|---|---|
| 1 | Switch production workloads from All-Purpose to Job Clusters | 3-4x DBU reduction | Low |
| 2 | Enable auto-termination on all interactive clusters (10-30 min) | Eliminates idle waste (often 20-40% of bill) | Low |
| 3 | Use Spot Instances for worker nodes | Up to 90% infra savings | Low |
| 4 | Enable autoscaling with tight min/max boundaries | Pay only for what you use | Low |
| 5 | Enforce Cluster Policies | Prevent over-provisioning org-wide | Medium |
| 6 | Enable Photon for SQL/DataFrame workloads | Faster execution = fewer DBUs consumed | Low |
| 7 | Right-size instance types | Avoid paying for unused CPU/memory | Medium |
| 8 | Use Serverless SQL Warehouses for SQL workloads | Auto-suspend = zero idle cost | Low |
| 9 | Use Instance Pools for shared compute | Faster startup, better resource utilization | Medium |
| 10 | Tag everything for cost visibility | Identifies waste by team/project | Low |
The first three alone typically cut costs by 50-70%.
25. Your cluster cost increased significantly last month. How would you investigate and optimize?
Step-by-step investigation:
- Check cluster uptime -- Are all-purpose clusters running idle overnight or over weekends? This is the most common culprit.
- Identify compute type -- Are production jobs running on All-Purpose ($0.55/DBU) instead of Job Clusters ($0.15/DBU)?
- Review auto-termination settings -- Is it disabled or set too high (e.g., 120 min default)?
- Inspect autoscaling config -- Are max workers set unreasonably high? Are clusters scaling to max and staying there?
- Check instance types -- Are teams using expensive GPU or memory-optimized VMs when general-purpose would suffice?
- Review cluster policies -- Are guardrails in place, or can any user create unrestricted clusters?
- Look at job history -- Any failed jobs retrying repeatedly? A retry loop on a 50-node cluster burns cost fast.
- Check tags -- Identify which team/project drove the spike.
- SQL Warehouse usage -- Is a serverless warehouse running expensive, unoptimized queries at scale?
- Use cloud cost tools -- Azure Cost Management / AWS Cost Explorer to trace the infrastructure portion of the bill.
Optimization actions: Apply the strategies from Q24 based on findings. Typically, enforcing auto-termination + switching to Job Clusters resolves 60-70% of cost spikes.
26. A production ETL job that usually completes in 20 minutes is now taking 2 hours. The cluster configuration hasn't changed. How would you debug this?
Systematic approach:
-
Check Spark UI -- Look at the stage timeline. Is one stage taking disproportionately long? That signals a data skew (one partition has far more data than others).
-
Check input data volume -- Did the source data grow significantly? A table that was 10 GB last week might be 100 GB now due to upstream changes.
-
Check for shuffle spills -- In the Spark UI, look for "Spill (Memory)" and "Spill (Disk)". Spills mean workers don't have enough memory for the shuffle, causing disk I/O.
-
Check for small file problem -- Too many small files in Delta Lake cause excessive task overhead. Run
OPTIMIZEto compact files. -
Check for missing VACUUM/OPTIMIZE -- Delta Lake tables accumulate old files over time. Without maintenance, read performance degrades.
-
Check upstream schema changes -- Did a column type change? Did new columns get added? Schema evolution can trigger full table scans.
-
Check cluster health -- Are any workers in an unhealthy state? Check for spot instance evictions causing task retries.
-
Check concurrent workloads -- Is another heavy job sharing the same cluster and competing for resources?
-
Compare execution plans -- Run
EXPLAINon the query and compare with a previous successful run. Look for missing predicate pushdown or broadcast join changes. -
Check Delta Lake stats -- Run
DESCRIBE DETAILon the table. ChecknumFiles,sizeInBytes. If file count exploded, that's the problem.
27. Your team uses All-Purpose clusters for everything -- dev, test, and production. Management wants to cut Databricks costs by 50%. What changes would you propose?
Phased proposal:
Phase 1 -- Immediate wins (Week 1-2, ~40% savings):
- Move all scheduled/production jobs to Job Clusters ($0.15 vs $0.55/DBU = 3-4x savings)
- Set auto-termination to 15 minutes on all remaining All-Purpose clusters
- Enable Spot Instances on worker nodes for all non-critical workloads
Phase 2 -- Governance (Week 3-4, ~10-15% additional savings):
- Implement Cluster Policies restricting max workers, instance types, and enforcing auto-termination
- Migrate SQL/BI workloads to Serverless SQL Warehouses (auto-suspend = zero idle cost)
- Set up Instance Pools for dev teams needing fast startup
Phase 3 -- Optimization (Month 2, ~5-10% additional savings):
- Enable Photon on SQL-heavy pipelines
- Right-size VMs based on actual utilization metrics
- Implement tagging for cost attribution and set up alerts for budget thresholds
- Schedule dev clusters to auto-terminate outside business hours
Expected result: 50-65% cost reduction without any loss in functionality or performance.
The key insight: The bulk of savings (40%) comes from one single change -- switching production from All-Purpose to Job Clusters. Everything else is incremental.
28. How do you monitor and debug a failed Spark job in Databricks?
Follow a structured 5-step workflow:
Step 1: Find the failed run and task
- Go to Jobs UI (Lakeflow Jobs) --> click the job --> Runs tab
- Use the matrix view to spot red (failed) and grey (skipped) tasks
- Hover over the failed task for metadata (duration, cluster, error summary)
Step 2: Read the error message properly
- Click the failed task to open Task Run Details
- Don't stop at the first line -- scroll to the root cause (often buried in the stack trace)
- Classify the error: OOM? Schema drift? Permission issue? Timeout?
Step 3: Check logs
- Driver logs (stdout/stderr): Application-level errors, Python tracebacks
- Executor logs (via Spark UI --> Executors tab --> stderr): Task-level OOM, GC pressure
- Cluster Event Log: Unexpected terminations, init script failures, autoscaling issues
Step 4: Reproduce in a notebook
- Attach to the same cluster type and runtime version
- Run the failing logic in isolation
- Use
count()on intermediate DataFrames to pinpoint the failing transformation - Use
df.explain(True)to inspect the execution plan
Step 5: Fix and prevent
- Fix the root cause, not the symptom
- Use "Repair Run" in Jobs UI to re-run only failed/skipped tasks
- Set up email/Slack alerts for failures
- Enable cluster log delivery to ADLS/S3 for post-mortem analysis
29. What are the key tabs in the Spark UI and what does each tell you?
| Tab | What It Shows | Key Use |
|---|---|---|
| Jobs | Every Spark action triggered, with duration, status, and stage count | Find which job is slow or failed |
| Stages | Stage-level metrics: shuffle read/write, spill (memory/disk), task distribution | Primary debugging tab -- find skew, spills, expensive shuffles |
| Storage | Cached RDDs/DataFrames, memory and disk usage, fraction cached | Verify caching is working or identify wasted memory |
| Executors | Per-executor metrics: cores, active/failed tasks, GC time, shuffle I/O | Identify overloaded executors, OOM, spot evictions |
| SQL | Visual query execution plan DAG with per-operator metrics | Understand how SQL/DataFrame queries are physically executed |
| Environment | All Spark configs, JVM info, classpath, system properties | Verify configuration settings are correctly applied |
| Streaming | Input rate, processing time, batch duration, completed batches | Monitor streaming pipeline throughput and lag |
The most important tab for performance debugging is Stages. Start there, then drill into Tasks for skew analysis.
30. Your Databricks job is running 4x slower in production. How would you identify and fix the bottleneck?
Systematic approach:
1. Identify the slow stage:
- Open Spark UI --> Jobs tab --> find the longest-running job
- Drill into the longest stage in that job
- Check the Task Duration Summary: is there a large gap between median and max duration?
2. Check for data skew:
- If median task = 2 sec but max task = 5 min, one partition has disproportionately more data
- Fix: Enable AQE (
spark.sql.adaptive.enabled=true), salt the skewed key, or use broadcast join
3. Check for shuffle spill:
- In the Stages tab, look for "Spill (Memory)" and "Spill (Disk)"
- Fix: Increase
spark.sql.shuffle.partitions, increase executor memory, or filter data earlier
4. Check for small file problem:
- Run
DESCRIBE DETAIL table-- ifnumFilesis very high with small average size - Fix: Run
OPTIMIZE table, enable auto-optimize
5. Check input data volume:
- Compare current input size vs historical. Did source data grow significantly?
- Fix: Add incremental processing, partition pruning, or Z-Order on filter columns
6. Check execution plan:
- Run
df.explain("formatted")and compare with a previous successful run - Look for: missing predicate pushdown, SortMergeJoin where BroadcastHashJoin was expected, extra Exchange nodes
7. Check cluster health:
- Executors tab: any dead executors? High GC time? Spot evictions?
- Cluster Event Log: unexpected scaling or termination events?
31. What is a Spark Execution Plan? How do you read it?
When you write a transformation, Spark's Catalyst Optimizer transforms your code through four stages before executing:
Parsed Logical Plan --> Analyzed --> Optimized --> Physical Plan
| Stage | What Happens |
|---|---|
| Parsed | Raw AST. Table/column names are unresolved strings. |
| Analyzed | Resolves all references against the catalog. Throws AnalysisException if invalid. |
| Optimized | Applies rules: predicate pushdown, constant folding, column pruning, join reordering. |
| Physical | Chooses execution strategy: join algorithms, shuffle methods, scan approaches. |
How to view:
df.explain("simple") # Physical plan only
df.explain("extended") # All 4 stages
df.explain("formatted") # Clean tree with node IDs
What to look for in the physical plan:
BroadcastHashJoin= efficient (small table broadcast).SortMergeJoin= expensive shuffle on both sides.PushedFiltersinFileScan= predicate pushdown is working.Exchange hashpartitioning= shuffle happening. Minimize these.*(n)prefix = whole-stage code generation (good -- operators fused into native code).- Missing
*= operators running outside codegen (often UDFs).
32. How do you identify and handle data skew in Spark?
Detection:
- Spark UI --> Stages tab --> Task Duration Summary
- If a few tasks take 10x-100x longer than the median, you have skew
- Also check: one executor with disproportionately high shuffle read
Common causes:
- Joining on a key where one value dominates (e.g.,
country = "US"is 90% of data) - GroupBy on a low-cardinality column
- Null values in join keys (all nulls go to one partition)
Solutions (ordered by preference):
| Solution | How | When |
|---|---|---|
| AQE Skew Join | spark.sql.adaptive.enabled=true + spark.sql.adaptive.skewJoin.enabled=true |
First option -- automatic, no code changes |
| Broadcast Join | spark.sql.autoBroadcastJoinThreshold=10485760 or broadcast(df) hint |
When one side of the join is small enough to fit in memory |
| Salting | Add random prefix to the skewed key, join on salted key, then remove prefix | When both tables are large and AQE isn't sufficient |
| Pre-aggregation | Aggregate the skewed data before the join to reduce volume | When downstream logic allows it |
| Filter nulls | Handle null join keys separately | When nulls are the source of skew |
| Repartition | Increase spark.sql.shuffle.partitions to create smaller partitions |
General mitigation, not a targeted fix |
33. How would you debug intermittent (non-deterministic) job failures in production?
Intermittent failures are harder than consistent failures because they can't be reliably reproduced. Approach:
1. Check for spot instance evictions:
- Cluster Event Log --> look for "NODE_LOST" or "SPOT_INSTANCE_TERMINATION"
- If workers are being evicted mid-task, Spark retries tasks but some may fail
- Fix: Use on-demand for workers on critical jobs, or configure fallback to on-demand
2. Check for resource contention:
- Is the cluster shared? Other workloads might compete for memory/CPU
- Fix: Use isolated job clusters for critical pipelines
3. Check for upstream data changes:
- Source files may arrive late, be partially written, or have inconsistent schemas
- Fix: Add retry logic, data validation checks, and schema enforcement
4. Check for API/external dependency timeouts:
- Network issues, rate limiting on external APIs, database connection pool exhaustion
- Fix: Add retry with exponential backoff, increase connection timeouts
5. Check for race conditions:
- Concurrent writes to the same Delta table without proper isolation
- Fix: Use Delta Lake's ACID guarantees, avoid concurrent
OPTIMIZEon the same table
6. Implement structured logging:
- Log job parameters, input record counts, and key metrics at each step
- When the failure recurs, logs provide context that the error message alone doesn't
7. Enable retry policies:
- Configure job-level retries in Workflows (e.g., 2 retries with 5-minute delay)
- This handles transient failures automatically
34. What are common executor issues and how do you troubleshoot them?
| Issue | Symptom in Spark UI | Root Cause | Fix |
|---|---|---|---|
| OOM (Out of Memory) | Dead executors, failed tasks, Container killed by YARN for exceeding memory limits |
Large partitions, data skew, or insufficient memory config | Increase executor memory, increase shuffle partitions, fix skew |
| High GC Time | GC Time > 10% of total task time in Executors tab | Too many objects in memory, large caches, or small heap | Increase executor memory, reduce cache size, tune GC settings |
| Shuffle Issues | High shuffle read/write, spill to disk | Wide transformations on large datasets | Increase shuffle partitions, use broadcast joins, filter early |
| Too Many Dead Executors | Dead executor count > 0, tasks retried | Spot evictions, OOM kills, or cloud resource limits | Use on-demand instances, increase memory, check cloud quotas |
| Uneven Task Distribution | Active tasks concentrated on 1-2 executors | Data skew or bad partitioning | Repartition data, salt keys, enable AQE |
Debugging steps:
- Spark UI --> Executors tab: Check active vs dead, GC time, shuffle metrics
- Click a dead executor --> stderr: Read the full stack trace
- Spark UI --> Stages tab: Find the stage with failed tasks, check task duration distribution
- Adjust configuration:
spark.executor.memory-- increase for OOMspark.sql.shuffle.partitions-- increase for spillspark.dynamicAllocation.enabled-- disable for short jobs with frequent executor churn
35. A streaming pipeline's processing time exceeds the batch interval. How would you diagnose and fix this?
When processing time > batch interval, the pipeline falls behind and a backlog builds up. Eventually, it can crash or produce severely delayed results.
Diagnosis:
- Spark UI --> Streaming tab: Check the Processing Time graph. If it consistently exceeds the batch interval (e.g., 10-second batch, 15-second processing), the pipeline is falling behind.
- Check Input Rate: Is the source sending more events than expected? A sudden spike in input can overwhelm the pipeline.
- Check Stages tab for the streaming job: Look for data skew, shuffle spill, or long-running tasks.
- Check for micro-batch overhead: Too many small files being written per batch.
Fixes:
- Scale up: Add more workers or increase executor cores to process more partitions in parallel.
- Increase batch interval: If latency tolerance allows, increase
processingTimetrigger (e.g., from 10s to 30s). - Optimize the query: Remove unnecessary shuffles/joins. Use
mapPartitionsinstead of row-level UDFs. - Increase shuffle partitions: Default 200 may be too few for high-volume streams.
- Enable Auto Loader with cloud files: More efficient file discovery than
fileStream. - Compact output files: Enable
autoOptimizeon the Delta sink to prevent small file accumulation. - Use trigger.availableNow: For catch-up scenarios, processes all available data in one batch.
36. After deploying a code change, your Delta Lake MERGE job started failing with OOM errors. The data volume hasn't changed. What would you investigate?
The data volume hasn't changed, so the problem is in the code change or execution plan, not the data.
Investigation:
1. Compare execution plans (before vs after):
df.explain("extended")
- Did a broadcast join change to a SortMergeJoin (or vice versa)?
- Did predicate pushdown stop working due to a code change?
- Is the MERGE condition now matching more rows than intended?
2. Check the MERGE condition:
- A poorly written MERGE condition (e.g., missing a key column) can cause a cross-join, exploding the result set
- Verify the join cardinality:
source.join(target, condition).count()-- is it much larger than expected?
3. Check for new columns or schema changes:
- Adding wide columns (large strings, nested structs) increases row size, which means more memory per partition
- Check if schema evolution was triggered
4. Check for changes in partitioning:
- Did the code change alter
repartition()orcoalesce()calls? - Fewer partitions = larger partitions = higher memory per task
5. Check for removed filters:
- Was a filter accidentally removed? Without partition pruning, MERGE scans the entire target table.
- Run
DESCRIBE DETAIL target_tableand check how many files are being read.
6. Fix:
- Restore the correct MERGE condition
- Add explicit partition filters to limit the scan
- Increase
spark.executor.memoryas a short-term mitigation - Run
OPTIMIZEon the target table if file fragmentation is a factor
37. Your team has no visibility into why jobs fail. Management asks you to set up a monitoring strategy for Databricks. What would you propose?
A comprehensive monitoring strategy has three layers:
Layer 1: Built-in Databricks Monitoring (Day 1)
| Tool | What It Covers |
|---|---|
| Job Alerts | Email/Slack/webhook notifications on job success, failure, or duration threshold |
| Jobs UI Matrix View | Visual history of all task runs across jobs |
| Spark UI | Live performance diagnostics (stages, tasks, executors) |
| Cluster Event Logs | Cluster lifecycle events, autoscaling, init scripts |
Layer 2: Centralized Logging (Week 1-2)
| Action | How |
|---|---|
| Enable Cluster Log Delivery | Configure all production clusters to write driver/executor logs to ADLS Gen2 or S3 |
| Structured Logging | Replace print() with Python logging module using JSON format for machine-parseable logs |
| Centralize in Delta Table | Ingest cluster logs into a Delta table for querying, filtering, trend analysis |
| Persist to Unity Catalog Volume | Store application logs in Volumes for long-term retention |
Layer 3: Dashboards & Alerting (Month 1)
| Action | How |
|---|---|
| AI/BI Dashboard | Build a Databricks dashboard showing: job success/failure rates, average duration trends, top error messages, cost per job |
| Azure Monitor Integration | Forward diagnostic telemetry to Azure Monitor for cross-service correlation |
| Custom Alerts | Set up alerts for: job duration > 2x historical average, > 3 consecutive failures, cluster cost exceeding budget threshold |
| System Tables | Query Databricks system tables (system.billing, system.access) for cost tracking and audit logs |
Result: The team goes from "we don't know why jobs fail" to "we see failures in real-time, can diagnose root causes in minutes, and have historical trend data for capacity planning."