Databricks Interview Questions

Here I will be adding the questions as I learn and work on the databricks.

Question I think are important

1. Difference b/w Serverless Compute Plane and Classic Compute Plane

The primary difference between a serverless and a classic compute plane comes down to infrastructure ownership, startup latency, and billing granularity. In a classic compute plane, the virtual machines run inside my own cloud account, giving me full control over network topology and instance tuning at the cost of management overhead. In a serverless compute plane, the infrastructure runs in the vendor’s account, abstracting away the hardware layer completely so I can focus purely on data processing and transformation code.

2. What is a Databricks Workspace?

A workspace is the central collaborative environment where users organize and manage all Databricks assets -- notebooks, files, folders, dashboards, queries, libraries, experiments, and repos. It provides a unified web UI to access every Databricks feature.

3. What are the workspace folder types?

Users/ -- Private per-user space
Shared/ -- Team collaboration space
Repos/ -- Git-connected folders (GitHub, Azure DevOps, GitLab)

4. What is the difference between a Workspace and a Cluster?

Workspace is the organizational layer (where you manage assets). Cluster is the compute layer (where code actually runs). A notebook in the workspace must be attached to a cluster to execute.

5. (IMP) What is the difference between `%run` and `dbutils.notebook.run()`?

~	`%run`	`dbutils.notebook.run()`
Execution context	Shared -- variables from called notebook are available in caller	Isolated -- separate execution context
Return value	No explicit return	Returns a string via `dbutils.notebook.exit()`
Parameters	No parameter passing	Supports parameter passing via `arguments={}`
Use case	Importing shared functions/configs	Orchestrating pipelines, chaining notebooks

6. What are notebook permission levels?

Level	View	Comment	Run	Edit	Manage
NO PERMISSIONS
CAN READ	x	x
CAN RUN	x	x	x
CAN EDIT	x	x	x	x
CAN MANAGE	x	x	x	x	x

Creators get CAN MANAGE by default
Notebooks inherit permissions from parent folder

7. What is a Temporary View vs Global Temporary View?

~	Temporary View	Global Temporary View
Scope	Session-scoped -- available only in the notebook/session that created it	Application-scoped -- available across all notebooks on the same cluster
Namespace	Default namespace	`global_temp` schema
Syntax	`CREATE OR REPLACE TEMP VIEW v AS ...`	`CREATE OR REPLACE GLOBAL TEMP VIEW v AS ...`
Access	`SELECT * FROM v`	`SELECT * FROM global_temp.v`
Lifetime	Ends when session/notebook detaches	Ends when cluster terminates

8. What are Databricks Widgets? Why are they used?

Widgets add interactive input parameters to notebooks (text fields, dropdowns, comboboxes, multiselect). Used to parameterize notebooks for reusable pipelines -- e.g., passing env, date, region at runtime.

# to create a dropdown widget
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"]) 

# get value from a widget
env = dbutils.widgets.get("env")

9. What are the key dbutils modules?

Module	Purpose	Key Commands
`dbutils.fs`	File system (DBFS)	`ls`, `cp`, `mv`, `rm`, `mkdirs`, `head`, `put`, `mount/unmount`
`dbutils.secrets`	Secure secret access	`listScopes`, `list`, `get`
`dbutils.widgets`	Notebook parameterization	`text`, `dropdown`, `get`, `removeAll`
`dbutils.notebook`	Notebook orchestration	`run`, `exit`
`dbutils.jobs`	Job features	Task-level values

10. (IMP) How do you manage secrets in Databricks?

Create a secret scope (backed by Databricks or Azure Key Vault)
Store secrets in the scope
Access in notebooks via dbutils.secrets.get("scope", "key")
Secret values are redacted in notebook output -- never printed in plain text

11. What is DBFS? Is it still recommended?

DBFS (Databricks File System) is a distributed file system mounted into the workspace. It abstracts cloud storage. However, DBFS root is now considered legacy. Unity Catalog with External Locations and Volumes is the recommended approach for data access.

12. (IMP) What is the difference between mounting storage and using Unity Catalog External Locations?

~	Mount Points (Legacy)	UC External Locations (Recommended)
Setup	`dbutils.fs.mount(...)`	Configured via Unity Catalog
Governance	No fine-grained access control	Full UC governance (ACLs, lineage, audit)
Scope	Workspace-level	Cross-workspace
Credentials	Stored in mount config	Managed via Storage Credentials

13. How does Databricks integrate with Azure?

Azure Databricks is a first-party Azure service jointly developed by Databricks and Microsoft. It is deployed as an Azure resource and integrates natively with the Azure ecosystem.

14. What Azure services does Databricks integrate with?

Category	Service	How It Integrates
Storage	ADLS Gen2	Primary lakehouse storage
Batch Ingestion	Azure Data Factory	Orchestrate batch ETL into ADLS
Streaming	Event Hubs / IoT Hub	Real-time data ingestion
Identity	Entra ID + SCIM	SSO, user provisioning
Secrets	Azure Key Vault	Backed secret scopes
Governance	Microsoft Purview	Data discovery, classification, lineage export from UC
BI	Power BI	Optimized connector + Direct Lake via Fabric
Monitoring	Azure Monitor	Diagnostics, telemetry
CI/CD	Azure DevOps / GitHub	Deployment automation
Cost	Cost Management	Track Databricks spend
Networking	VNet, Private Link, NSG	Secure classic workspace isolation

15. (IMP) What is the difference between Azure Databricks and Databricks on AWS/GCP?

The core platform is the same. Azure-specific differences:

Deployed via Azure Portal as a native Azure resource
Uses ADLS Gen2 (not S3/GCS) as primary storage
Identity via Microsoft Entra ID (not AWS IAM)
Governance exports to Microsoft Purview
BI via Power BI optimized connector
Networking uses Azure VNet injection and Private Link

16. How does data flow in an Azure Databricks architecture?

Sources --> Ingestion --> Databricks (Medallion) --> Serving
  |            |              |                        |
  |-- Batch -> ADF --------> Bronze (raw)              |-> Power BI
  |-- Stream-> Event Hubs -> Silver (clean)            |-> DBSQL
  |-- CDC ---> Auto Loader-> Gold (aggregated)         |-> ML Models
                              |
                        ADLS Gen2 (Delta Lake)

17. How do you connect Azure Databricks to ADLS Gen2?

Three methods (know all three):

Unity Catalog External Locations (recommended) -- governed, cross-workspace
Service Principal + OAuth -- via Spark config with client ID/secret
Mount Points (legacy) -- dbutils.fs.mount() with OAuth config

18. What is the difference between a Driver Node and Worker Nodes in a Spark cluster?

	Driver Node	Worker Nodes
Count	Exactly 1 per cluster	0 to many
Role	Coordinator - holds SparkContext, plans the execution DAG, distributes tasks across workers, collects results	Executors -- process data partitions in parallel, store intermediate results
Failure impact	Entire job fails	Task is retried on another worker (Spark's fault tolerance)
Spot instance?	Never - always use on-demand (eviction kills the whole job)	Yes -- recommended for cost savings on fault-tolerant workloads
Sizing	Needs enough memory to hold the DAG plan, broadcast variables, and collected results	Needs enough memory/CPU for the data partitions being processed

Key follow-up: The driver is the single point of failure. That's why you never put it on a spot instance, and why collect() on large datasets can crash the driver (all data is pulled to one node).

19. What is the difference between All-Purpose Clusters and Job Clusters? When would you use each?

	All-Purpose Cluster	Job Cluster
Creation	Manual (UI, CLI, API)	Automatic (created when a job is triggered)
Lifecycle	Stays running until manually stopped or auto-terminated	Auto-terminates when the job completes
Sharing	Multiple users/notebooks can attach simultaneously	Isolated to a single job run
Cost (Premium)	~$0.55/DBU	~$0.15/DBU - 3-4x cheaper
Restart	Can be restarted by users	Cannot restart - new cluster per run
Use case	Development, exploration, ad-hoc analysis, collaboration	Production ETL, scheduled workflows, ML training

The most common follow-up: "Why not use All-Purpose for everything?"
Answer: Cost. At 3-4x the price, running production pipelines on All-Purpose clusters is the single biggest source of unnecessary Databricks spend. All-Purpose is for humans iterating; Job Clusters are for automated workloads.

20. How does Databricks autoscaling work? What are the key considerations?

Databricks autoscaling monitors the pending task queue in Spark:

When pending tasks exceed available executor cores - scales up (adds worker nodes from the cloud provider or instance pool)
When workers sit idle beyond a threshold - scales down (removes workers)

You configure two parameters:

Min workers: baseline floor (keeps cluster responsive, avoids cold start delays)
Max workers: budget ceiling (caps cost)

Key considerations:

Consideration	Guidance
Narrow range (e.g., 4-8)	Latency-sensitive workloads -- predictable performance
Wide range (e.g., 2-50)	Throughput-heavy batch jobs -- maximize parallelism
Ultra-short jobs	Don't autoscale -- startup overhead exceeds benefit
Instance Pools	Combine with autoscaling for faster node acquisition (seconds vs minutes)
Streaming workloads	Keep min workers higher -- scaling latency can cause processing delays
Scale-down behavior	Databricks is conservative on scale-down to avoid thrashing (removing then re-adding nodes)

21. What factors do you consider when creating a Databricks cluster?

Factor	What to Decide
Compute type	All-Purpose (dev) vs Job Cluster (prod) vs SQL Warehouse (SQL/BI) vs Serverless
Access mode	Standard (shared, UC-governed) vs Dedicated (single user, RDD/GPU/R support)
Runtime version	Latest for dev; LTS for production stability
Photon	Enable for SQL/DataFrame-heavy workloads (2-4x speedup)
Node type	General Purpose (balanced), Memory Optimized (ML, large joins), Compute Optimized (CPU-heavy ETL), GPU (deep learning)
Workers	Number based on data volume; enable autoscaling with sensible min/max
Spot instances	Use for workers on fault-tolerant jobs; never for the driver
Auto-termination	Always enable for interactive clusters (10-30 min recommended)
Instance pool	Use for fast startup on frequently used clusters
Cluster policy	Apply admin-defined policies for governance and cost guardrails
Init scripts	For custom library installs or environment setup at cluster startup
Tags	Add for cost tracking by team/project/environment

22. What is Photon and how does it improve performance?

Photon is Databricks' native vectorized query engine written in C++. It replaces the standard JVM-based Spark execution for SQL and DataFrame operations.

How it improves performance:

Vectorized execution: processes data in batches (columnar) instead of row-by-row
Native C++ execution: bypasses JVM overhead (no garbage collection, no serialization)
Optimized for I/O: better scan, filter, and aggregation performance on Delta Lake
Result: 2-4x faster (and up to 10x in some benchmarks) for SQL-heavy workloads

When Photon helps:

Large aggregations, joins, and scans
SQL Warehouse queries (Photon is enabled by default)
DataFrame transformations on wide tables

When Photon may NOT help:

Pure Python/RDD workloads (Photon operates at the SQL/DataFrame layer)
Ultra-short jobs completing in under 2 seconds
UDF-heavy pipelines (UDFs bypass Photon's optimized execution)

Cost implication: Photon uses more DBUs per hour, but jobs complete faster -- net cost is often lower.

23. What are the different SQL Warehouse types? When would you use each?

Type	Startup	Management	When to Use
Serverless	2-6 seconds	Fully managed by Databricks	Recommended default. Best for BI dashboards, ad-hoc SQL, ETL. Rapid autoscaling, no infra config.
Pro	Minutes	Customer-managed compute	When serverless is unavailable in your region, or you need custom networking (VNet injection, firewall rules, federation).
Classic	Minutes	Customer-managed compute	Entry-level. Basic interactive SQL exploration when Pro/Serverless are not options. Least features.

All SQL Warehouses come with Photon enabled by default.

Key differences from clusters:

SQL Warehouses are SQL-only (no Python, Scala, R)
Optimized for concurrent BI queries (Power BI, Tableau, Looker)
Serverless warehouses have auto-suspend -- no idle cost when not querying

24. How would you reduce Databricks costs in a production environment?

Ordered by impact:

#	Strategy	Savings	Effort
1	Switch production workloads from All-Purpose to Job Clusters	3-4x DBU reduction	Low
2	Enable auto-termination on all interactive clusters (10-30 min)	Eliminates idle waste (often 20-40% of bill)	Low
3	Use Spot Instances for worker nodes	Up to 90% infra savings	Low
4	Enable autoscaling with tight min/max boundaries	Pay only for what you use	Low
5	Enforce Cluster Policies	Prevent over-provisioning org-wide	Medium
6	Enable Photon for SQL/DataFrame workloads	Faster execution = fewer DBUs consumed	Low
7	Right-size instance types	Avoid paying for unused CPU/memory	Medium
8	Use Serverless SQL Warehouses for SQL workloads	Auto-suspend = zero idle cost	Low
9	Use Instance Pools for shared compute	Faster startup, better resource utilization	Medium
10	Tag everything for cost visibility	Identifies waste by team/project	Low

The first three alone typically cut costs by 50-70%.

25. Your cluster cost increased significantly last month. How would you investigate and optimize?

Step-by-step investigation:

Check cluster uptime -- Are all-purpose clusters running idle overnight or over weekends? This is the most common culprit.
Identify compute type -- Are production jobs running on All-Purpose ($0.55/DBU) instead of Job Clusters ($0.15/DBU)?
Review auto-termination settings -- Is it disabled or set too high (e.g., 120 min default)?
Inspect autoscaling config -- Are max workers set unreasonably high? Are clusters scaling to max and staying there?
Check instance types -- Are teams using expensive GPU or memory-optimized VMs when general-purpose would suffice?
Review cluster policies -- Are guardrails in place, or can any user create unrestricted clusters?
Look at job history -- Any failed jobs retrying repeatedly? A retry loop on a 50-node cluster burns cost fast.
Check tags -- Identify which team/project drove the spike.
SQL Warehouse usage -- Is a serverless warehouse running expensive, unoptimized queries at scale?
Use cloud cost tools -- Azure Cost Management / AWS Cost Explorer to trace the infrastructure portion of the bill.

Optimization actions: Apply the strategies from Q24 based on findings. Typically, enforcing auto-termination + switching to Job Clusters resolves 60-70% of cost spikes.

26. A production ETL job that usually completes in 20 minutes is now taking 2 hours. The cluster configuration hasn't changed. How would you debug this?

Systematic approach:

Check Spark UI -- Look at the stage timeline. Is one stage taking disproportionately long? That signals a data skew (one partition has far more data than others).
Check input data volume -- Did the source data grow significantly? A table that was 10 GB last week might be 100 GB now due to upstream changes.
Check for shuffle spills -- In the Spark UI, look for "Spill (Memory)" and "Spill (Disk)". Spills mean workers don't have enough memory for the shuffle, causing disk I/O.
Check for small file problem -- Too many small files in Delta Lake cause excessive task overhead. Run OPTIMIZE to compact files.
Check for missing VACUUM/OPTIMIZE -- Delta Lake tables accumulate old files over time. Without maintenance, read performance degrades.
Check upstream schema changes -- Did a column type change? Did new columns get added? Schema evolution can trigger full table scans.
Check cluster health -- Are any workers in an unhealthy state? Check for spot instance evictions causing task retries.
Check concurrent workloads -- Is another heavy job sharing the same cluster and competing for resources?
Compare execution plans -- Run EXPLAIN on the query and compare with a previous successful run. Look for missing predicate pushdown or broadcast join changes.
Check Delta Lake stats -- Run DESCRIBE DETAIL on the table. Check numFiles, sizeInBytes. If file count exploded, that's the problem.

27. Your team uses All-Purpose clusters for everything -- dev, test, and production. Management wants to cut Databricks costs by 50%. What changes would you propose?

Phased proposal:

Phase 1 -- Immediate wins (Week 1-2, ~40% savings):

Move all scheduled/production jobs to Job Clusters ($0.15 vs $0.55/DBU = 3-4x savings)
Set auto-termination to 15 minutes on all remaining All-Purpose clusters
Enable Spot Instances on worker nodes for all non-critical workloads

Phase 2 -- Governance (Week 3-4, ~10-15% additional savings):

Implement Cluster Policies restricting max workers, instance types, and enforcing auto-termination
Migrate SQL/BI workloads to Serverless SQL Warehouses (auto-suspend = zero idle cost)
Set up Instance Pools for dev teams needing fast startup

Phase 3 -- Optimization (Month 2, ~5-10% additional savings):

Enable Photon on SQL-heavy pipelines
Right-size VMs based on actual utilization metrics
Implement tagging for cost attribution and set up alerts for budget thresholds
Schedule dev clusters to auto-terminate outside business hours

Expected result: 50-65% cost reduction without any loss in functionality or performance.

The key insight: The bulk of savings (40%) comes from one single change -- switching production from All-Purpose to Job Clusters. Everything else is incremental.

28. How do you monitor and debug a failed Spark job in Databricks?

Follow a structured 5-step workflow:

Step 1: Find the failed run and task

Go to Jobs UI (Lakeflow Jobs) --> click the job --> Runs tab
Use the matrix view to spot red (failed) and grey (skipped) tasks
Hover over the failed task for metadata (duration, cluster, error summary)

Step 2: Read the error message properly

Click the failed task to open Task Run Details
Don't stop at the first line -- scroll to the root cause (often buried in the stack trace)
Classify the error: OOM? Schema drift? Permission issue? Timeout?

Step 3: Check logs

Driver logs (stdout/stderr): Application-level errors, Python tracebacks
Executor logs (via Spark UI --> Executors tab --> stderr): Task-level OOM, GC pressure
Cluster Event Log: Unexpected terminations, init script failures, autoscaling issues

Step 4: Reproduce in a notebook

Attach to the same cluster type and runtime version
Run the failing logic in isolation
Use count() on intermediate DataFrames to pinpoint the failing transformation
Use df.explain(True) to inspect the execution plan

Step 5: Fix and prevent

Fix the root cause, not the symptom
Use "Repair Run" in Jobs UI to re-run only failed/skipped tasks
Set up email/Slack alerts for failures
Enable cluster log delivery to ADLS/S3 for post-mortem analysis

29. What are the key tabs in the Spark UI and what does each tell you?

Tab	What It Shows	Key Use
Jobs	Every Spark action triggered, with duration, status, and stage count	Find which job is slow or failed
Stages	Stage-level metrics: shuffle read/write, spill (memory/disk), task distribution	Primary debugging tab -- find skew, spills, expensive shuffles
Storage	Cached RDDs/DataFrames, memory and disk usage, fraction cached	Verify caching is working or identify wasted memory
Executors	Per-executor metrics: cores, active/failed tasks, GC time, shuffle I/O	Identify overloaded executors, OOM, spot evictions
SQL	Visual query execution plan DAG with per-operator metrics	Understand how SQL/DataFrame queries are physically executed
Environment	All Spark configs, JVM info, classpath, system properties	Verify configuration settings are correctly applied
Streaming	Input rate, processing time, batch duration, completed batches	Monitor streaming pipeline throughput and lag

The most important tab for performance debugging is Stages. Start there, then drill into Tasks for skew analysis.

30. Your Databricks job is running 4x slower in production. How would you identify and fix the bottleneck?

Systematic approach:

1. Identify the slow stage:

Open Spark UI --> Jobs tab --> find the longest-running job
Drill into the longest stage in that job
Check the Task Duration Summary: is there a large gap between median and max duration?

2. Check for data skew:

If median task = 2 sec but max task = 5 min, one partition has disproportionately more data
Fix: Enable AQE (spark.sql.adaptive.enabled=true), salt the skewed key, or use broadcast join

3. Check for shuffle spill:

In the Stages tab, look for "Spill (Memory)" and "Spill (Disk)"
Fix: Increase spark.sql.shuffle.partitions, increase executor memory, or filter data earlier

4. Check for small file problem:

Run DESCRIBE DETAIL table -- if numFiles is very high with small average size
Fix: Run OPTIMIZE table, enable auto-optimize

5. Check input data volume:

Compare current input size vs historical. Did source data grow significantly?
Fix: Add incremental processing, partition pruning, or Z-Order on filter columns

6. Check execution plan:

Run df.explain("formatted") and compare with a previous successful run
Look for: missing predicate pushdown, SortMergeJoin where BroadcastHashJoin was expected, extra Exchange nodes

7. Check cluster health:

Executors tab: any dead executors? High GC time? Spot evictions?
Cluster Event Log: unexpected scaling or termination events?

31. What is a Spark Execution Plan? How do you read it?

When you write a transformation, Spark's Catalyst Optimizer transforms your code through four stages before executing:

Parsed Logical Plan --> Analyzed --> Optimized --> Physical Plan

Stage	What Happens
Parsed	Raw AST. Table/column names are unresolved strings.
Analyzed	Resolves all references against the catalog. Throws `AnalysisException` if invalid.
Optimized	Applies rules: predicate pushdown, constant folding, column pruning, join reordering.
Physical	Chooses execution strategy: join algorithms, shuffle methods, scan approaches.

How to view:

df.explain("simple")     # Physical plan only
df.explain("extended")   # All 4 stages
df.explain("formatted")  # Clean tree with node IDs

What to look for in the physical plan:

BroadcastHashJoin = efficient (small table broadcast). SortMergeJoin = expensive shuffle on both sides.
PushedFilters in FileScan = predicate pushdown is working.
Exchange hashpartitioning = shuffle happening. Minimize these.
*(n) prefix = whole-stage code generation (good -- operators fused into native code).
Missing * = operators running outside codegen (often UDFs).

32. How do you identify and handle data skew in Spark?

Detection:

Spark UI --> Stages tab --> Task Duration Summary
If a few tasks take 10x-100x longer than the median, you have skew
Also check: one executor with disproportionately high shuffle read

Common causes:

Joining on a key where one value dominates (e.g., country = "US" is 90% of data)
GroupBy on a low-cardinality column
Null values in join keys (all nulls go to one partition)

Solutions (ordered by preference):

Solution	How	When
AQE Skew Join	`spark.sql.adaptive.enabled=true` + `spark.sql.adaptive.skewJoin.enabled=true`	First option -- automatic, no code changes
Broadcast Join	`spark.sql.autoBroadcastJoinThreshold=10485760` or `broadcast(df)` hint	When one side of the join is small enough to fit in memory
Salting	Add random prefix to the skewed key, join on salted key, then remove prefix	When both tables are large and AQE isn't sufficient
Pre-aggregation	Aggregate the skewed data before the join to reduce volume	When downstream logic allows it
Filter nulls	Handle null join keys separately	When nulls are the source of skew
Repartition	Increase `spark.sql.shuffle.partitions` to create smaller partitions	General mitigation, not a targeted fix

33. How would you debug intermittent (non-deterministic) job failures in production?

Intermittent failures are harder than consistent failures because they can't be reliably reproduced. Approach:

1. Check for spot instance evictions:

Cluster Event Log --> look for "NODE_LOST" or "SPOT_INSTANCE_TERMINATION"
If workers are being evicted mid-task, Spark retries tasks but some may fail
Fix: Use on-demand for workers on critical jobs, or configure fallback to on-demand

2. Check for resource contention:

Is the cluster shared? Other workloads might compete for memory/CPU
Fix: Use isolated job clusters for critical pipelines

3. Check for upstream data changes:

Source files may arrive late, be partially written, or have inconsistent schemas
Fix: Add retry logic, data validation checks, and schema enforcement

4. Check for API/external dependency timeouts:

Network issues, rate limiting on external APIs, database connection pool exhaustion
Fix: Add retry with exponential backoff, increase connection timeouts

5. Check for race conditions:

Concurrent writes to the same Delta table without proper isolation
Fix: Use Delta Lake's ACID guarantees, avoid concurrent OPTIMIZE on the same table

6. Implement structured logging:

Log job parameters, input record counts, and key metrics at each step
When the failure recurs, logs provide context that the error message alone doesn't

7. Enable retry policies:

Configure job-level retries in Workflows (e.g., 2 retries with 5-minute delay)
This handles transient failures automatically

34. What are common executor issues and how do you troubleshoot them?

Issue	Symptom in Spark UI	Root Cause	Fix
OOM (Out of Memory)	Dead executors, failed tasks, `Container killed by YARN for exceeding memory limits`	Large partitions, data skew, or insufficient memory config	Increase executor memory, increase shuffle partitions, fix skew
High GC Time	GC Time > 10% of total task time in Executors tab	Too many objects in memory, large caches, or small heap	Increase executor memory, reduce cache size, tune GC settings
Shuffle Issues	High shuffle read/write, spill to disk	Wide transformations on large datasets	Increase shuffle partitions, use broadcast joins, filter early
Too Many Dead Executors	Dead executor count > 0, tasks retried	Spot evictions, OOM kills, or cloud resource limits	Use on-demand instances, increase memory, check cloud quotas
Uneven Task Distribution	Active tasks concentrated on 1-2 executors	Data skew or bad partitioning	Repartition data, salt keys, enable AQE

Debugging steps:

Spark UI --> Executors tab: Check active vs dead, GC time, shuffle metrics
Click a dead executor --> stderr: Read the full stack trace
Spark UI --> Stages tab: Find the stage with failed tasks, check task duration distribution
Adjust configuration:
- spark.executor.memory -- increase for OOM
- spark.sql.shuffle.partitions -- increase for spill
- spark.dynamicAllocation.enabled -- disable for short jobs with frequent executor churn

35. A streaming pipeline's processing time exceeds the batch interval. How would you diagnose and fix this?

When processing time > batch interval, the pipeline falls behind and a backlog builds up. Eventually, it can crash or produce severely delayed results.

Diagnosis:

Spark UI --> Streaming tab: Check the Processing Time graph. If it consistently exceeds the batch interval (e.g., 10-second batch, 15-second processing), the pipeline is falling behind.
Check Input Rate: Is the source sending more events than expected? A sudden spike in input can overwhelm the pipeline.
Check Stages tab for the streaming job: Look for data skew, shuffle spill, or long-running tasks.
Check for micro-batch overhead: Too many small files being written per batch.

Fixes:

Scale up: Add more workers or increase executor cores to process more partitions in parallel.
Increase batch interval: If latency tolerance allows, increase processingTime trigger (e.g., from 10s to 30s).
Optimize the query: Remove unnecessary shuffles/joins. Use mapPartitions instead of row-level UDFs.
Increase shuffle partitions: Default 200 may be too few for high-volume streams.
Enable Auto Loader with cloud files: More efficient file discovery than fileStream.
Compact output files: Enable autoOptimize on the Delta sink to prevent small file accumulation.
Use trigger.availableNow: For catch-up scenarios, processes all available data in one batch.

36. After deploying a code change, your Delta Lake MERGE job started failing with OOM errors. The data volume hasn't changed. What would you investigate?

The data volume hasn't changed, so the problem is in the code change or execution plan, not the data.

Investigation:

1. Compare execution plans (before vs after):

df.explain("extended")

Did a broadcast join change to a SortMergeJoin (or vice versa)?
Did predicate pushdown stop working due to a code change?
Is the MERGE condition now matching more rows than intended?

2. Check the MERGE condition:

A poorly written MERGE condition (e.g., missing a key column) can cause a cross-join, exploding the result set
Verify the join cardinality: source.join(target, condition).count() -- is it much larger than expected?

3. Check for new columns or schema changes:

Adding wide columns (large strings, nested structs) increases row size, which means more memory per partition
Check if schema evolution was triggered

4. Check for changes in partitioning:

Did the code change alter repartition() or coalesce() calls?
Fewer partitions = larger partitions = higher memory per task

5. Check for removed filters:

Was a filter accidentally removed? Without partition pruning, MERGE scans the entire target table.
Run DESCRIBE DETAIL target_table and check how many files are being read.

6. Fix:

Restore the correct MERGE condition
Add explicit partition filters to limit the scan
Increase spark.executor.memory as a short-term mitigation
Run OPTIMIZE on the target table if file fragmentation is a factor

37. Your team has no visibility into why jobs fail. Management asks you to set up a monitoring strategy for Databricks. What would you propose?

A comprehensive monitoring strategy has three layers:

Layer 1: Built-in Databricks Monitoring (Day 1)

Tool	What It Covers
Job Alerts	Email/Slack/webhook notifications on job success, failure, or duration threshold
Jobs UI Matrix View	Visual history of all task runs across jobs
Spark UI	Live performance diagnostics (stages, tasks, executors)
Cluster Event Logs	Cluster lifecycle events, autoscaling, init scripts

Layer 2: Centralized Logging (Week 1-2)

Action	How
Enable Cluster Log Delivery	Configure all production clusters to write driver/executor logs to ADLS Gen2 or S3
Structured Logging	Replace `print()` with Python `logging` module using JSON format for machine-parseable logs
Centralize in Delta Table	Ingest cluster logs into a Delta table for querying, filtering, trend analysis
Persist to Unity Catalog Volume	Store application logs in Volumes for long-term retention

Layer 3: Dashboards & Alerting (Month 1)

Action	How
AI/BI Dashboard	Build a Databricks dashboard showing: job success/failure rates, average duration trends, top error messages, cost per job
Azure Monitor Integration	Forward diagnostic telemetry to Azure Monitor for cross-service correlation
Custom Alerts	Set up alerts for: job duration > 2x historical average, > 3 consecutive failures, cluster cost exceeding budget threshold
System Tables	Query Databricks system tables (`system.billing`, `system.access`) for cost tracking and audit logs

Result: The team goes from "we don't know why jobs fail" to "we see failures in real-time, can diagnose root causes in minutes, and have historical trend data for capacity planning."

Questions which are asked in the interview

#databricks #todo #interview