Databricks Interview Questions

Here I will be adding the questions as I learn and work on the databricks.

Question I think are important

1. Difference b/w Serverless Compute Plane and Classic Compute Plane

The primary difference between a serverless and a classic compute plane comes down to infrastructure ownership, startup latency, and billing granularity. In a classic compute plane, the virtual machines run inside my own cloud account, giving me full control over network topology and instance tuning at the cost of management overhead. In a serverless compute plane, the infrastructure runs in the vendor’s account, abstracting away the hardware layer completely so I can focus purely on data processing and transformation code.


2. What is a Databricks Workspace?

A workspace is the central collaborative environment where users organize and manage all Databricks assets -- notebooks, files, folders, dashboards, queries, libraries, experiments, and repos. It provides a unified web UI to access every Databricks feature.


3. What are the workspace folder types?


4. What is the difference between a Workspace and a Cluster?

Workspace is the organizational layer (where you manage assets). Cluster is the compute layer (where code actually runs). A notebook in the workspace must be attached to a cluster to execute.


5. (IMP) What is the difference between %run and dbutils.notebook.run()?

~ %run dbutils.notebook.run()
Execution context Shared -- variables from called notebook are available in caller Isolated -- separate execution context
Return value No explicit return Returns a string via dbutils.notebook.exit()
Parameters No parameter passing Supports parameter passing via arguments={}
Use case Importing shared functions/configs Orchestrating pipelines, chaining notebooks

6. What are notebook permission levels?

Level View Comment Run Edit Manage
NO PERMISSIONS
CAN READ x x
CAN RUN x x x
CAN EDIT x x x x
CAN MANAGE x x x x x

7. What is a Temporary View vs Global Temporary View?

~ Temporary View Global Temporary View
Scope Session-scoped -- available only in the notebook/session that created it Application-scoped -- available across all notebooks on the same cluster
Namespace Default namespace global_temp schema
Syntax CREATE OR REPLACE TEMP VIEW v AS ... CREATE OR REPLACE GLOBAL TEMP VIEW v AS ...
Access SELECT * FROM v SELECT * FROM global_temp.v
Lifetime Ends when session/notebook detaches Ends when cluster terminates

8. What are Databricks Widgets? Why are they used?

Widgets add interactive input parameters to notebooks (text fields, dropdowns, comboboxes, multiselect). Used to parameterize notebooks for reusable pipelines -- e.g., passing env, date, region at runtime.

# to create a dropdown widget
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"]) 

# get value from a widget
env = dbutils.widgets.get("env")

9. What are the key dbutils modules?

Module Purpose Key Commands
dbutils.fs File system (DBFS) ls, cp, mv, rm, mkdirs, head, put, mount/unmount
dbutils.secrets Secure secret access listScopes, list, get
dbutils.widgets Notebook parameterization text, dropdown, get, removeAll
dbutils.notebook Notebook orchestration run, exit
dbutils.jobs Job features Task-level values

10. (IMP) How do you manage secrets in Databricks?

  1. Create a secret scope (backed by Databricks or Azure Key Vault)
  2. Store secrets in the scope
  3. Access in notebooks via dbutils.secrets.get("scope", "key")
  4. Secret values are redacted in notebook output -- never printed in plain text

DBFS (Databricks File System) is a distributed file system mounted into the workspace. It abstracts cloud storage. However, DBFS root is now considered legacy. Unity Catalog with External Locations and Volumes is the recommended approach for data access.


12. (IMP) What is the difference between mounting storage and using Unity Catalog External Locations?

~ Mount Points (Legacy) UC External Locations (Recommended)
Setup dbutils.fs.mount(...) Configured via Unity Catalog
Governance No fine-grained access control Full UC governance (ACLs, lineage, audit)
Scope Workspace-level Cross-workspace
Credentials Stored in mount config Managed via Storage Credentials

13. How does Databricks integrate with Azure?

Azure Databricks is a first-party Azure service jointly developed by Databricks and Microsoft. It is deployed as an Azure resource and integrates natively with the Azure ecosystem.


14. What Azure services does Databricks integrate with?

Category Service How It Integrates
Storage ADLS Gen2 Primary lakehouse storage
Batch Ingestion Azure Data Factory Orchestrate batch ETL into ADLS
Streaming Event Hubs / IoT Hub Real-time data ingestion
Identity Entra ID + SCIM SSO, user provisioning
Secrets Azure Key Vault Backed secret scopes
Governance Microsoft Purview Data discovery, classification, lineage export from UC
BI Power BI Optimized connector + Direct Lake via Fabric
Monitoring Azure Monitor Diagnostics, telemetry
CI/CD Azure DevOps / GitHub Deployment automation
Cost Cost Management Track Databricks spend
Networking VNet, Private Link, NSG Secure classic workspace isolation

15. (IMP) What is the difference between Azure Databricks and Databricks on AWS/GCP?

The core platform is the same. Azure-specific differences:


16. How does data flow in an Azure Databricks architecture?

Sources --> Ingestion --> Databricks (Medallion) --> Serving
  |            |              |                        |
  |-- Batch -> ADF --------> Bronze (raw)              |-> Power BI
  |-- Stream-> Event Hubs -> Silver (clean)            |-> DBSQL
  |-- CDC ---> Auto Loader-> Gold (aggregated)         |-> ML Models
                              |
                        ADLS Gen2 (Delta Lake)

17. How do you connect Azure Databricks to ADLS Gen2?

Three methods (know all three):

  1. Unity Catalog External Locations (recommended) -- governed, cross-workspace
  2. Service Principal + OAuth -- via Spark config with client ID/secret
  3. Mount Points (legacy) -- dbutils.fs.mount() with OAuth config

18. What is the difference between a Driver Node and Worker Nodes in a Spark cluster?

Driver Node Worker Nodes
Count Exactly 1 per cluster 0 to many
Role Coordinator - holds SparkContext, plans the execution DAG, distributes tasks across workers, collects results Executors -- process data partitions in parallel, store intermediate results
Failure impact Entire job fails Task is retried on another worker (Spark's fault tolerance)
Spot instance? Never - always use on-demand (eviction kills the whole job) Yes -- recommended for cost savings on fault-tolerant workloads
Sizing Needs enough memory to hold the DAG plan, broadcast variables, and collected results Needs enough memory/CPU for the data partitions being processed

Key follow-up: The driver is the single point of failure. That's why you never put it on a spot instance, and why collect() on large datasets can crash the driver (all data is pulled to one node).


19. What is the difference between All-Purpose Clusters and Job Clusters? When would you use each?

All-Purpose Cluster Job Cluster
Creation Manual (UI, CLI, API) Automatic (created when a job is triggered)
Lifecycle Stays running until manually stopped or auto-terminated Auto-terminates when the job completes
Sharing Multiple users/notebooks can attach simultaneously Isolated to a single job run
Cost (Premium) ~$0.55/DBU ~$0.15/DBU - 3-4x cheaper
Restart Can be restarted by users Cannot restart - new cluster per run
Use case Development, exploration, ad-hoc analysis, collaboration Production ETL, scheduled workflows, ML training

The most common follow-up: "Why not use All-Purpose for everything?"
Answer: Cost. At 3-4x the price, running production pipelines on All-Purpose clusters is the single biggest source of unnecessary Databricks spend. All-Purpose is for humans iterating; Job Clusters are for automated workloads.


20. How does Databricks autoscaling work? What are the key considerations?

Databricks autoscaling monitors the pending task queue in Spark:

  1. When pending tasks exceed available executor cores - scales up (adds worker nodes from the cloud provider or instance pool)
  2. When workers sit idle beyond a threshold - scales down (removes workers)

You configure two parameters:

Key considerations:

Consideration Guidance
Narrow range (e.g., 4-8) Latency-sensitive workloads -- predictable performance
Wide range (e.g., 2-50) Throughput-heavy batch jobs -- maximize parallelism
Ultra-short jobs Don't autoscale -- startup overhead exceeds benefit
Instance Pools Combine with autoscaling for faster node acquisition (seconds vs minutes)
Streaming workloads Keep min workers higher -- scaling latency can cause processing delays
Scale-down behavior Databricks is conservative on scale-down to avoid thrashing (removing then re-adding nodes)

21. What factors do you consider when creating a Databricks cluster?

Factor What to Decide
Compute type All-Purpose (dev) vs Job Cluster (prod) vs SQL Warehouse (SQL/BI) vs Serverless
Access mode Standard (shared, UC-governed) vs Dedicated (single user, RDD/GPU/R support)
Runtime version Latest for dev; LTS for production stability
Photon Enable for SQL/DataFrame-heavy workloads (2-4x speedup)
Node type General Purpose (balanced), Memory Optimized (ML, large joins), Compute Optimized (CPU-heavy ETL), GPU (deep learning)
Workers Number based on data volume; enable autoscaling with sensible min/max
Spot instances Use for workers on fault-tolerant jobs; never for the driver
Auto-termination Always enable for interactive clusters (10-30 min recommended)
Instance pool Use for fast startup on frequently used clusters
Cluster policy Apply admin-defined policies for governance and cost guardrails
Init scripts For custom library installs or environment setup at cluster startup
Tags Add for cost tracking by team/project/environment

22. What is Photon and how does it improve performance?

Photon is Databricks' native vectorized query engine written in C++. It replaces the standard JVM-based Spark execution for SQL and DataFrame operations.

How it improves performance:

When Photon helps:

When Photon may NOT help:

Cost implication: Photon uses more DBUs per hour, but jobs complete faster -- net cost is often lower.


23. What are the different SQL Warehouse types? When would you use each?

Type Startup Management When to Use
Serverless 2-6 seconds Fully managed by Databricks Recommended default. Best for BI dashboards, ad-hoc SQL, ETL. Rapid autoscaling, no infra config.
Pro Minutes Customer-managed compute When serverless is unavailable in your region, or you need custom networking (VNet injection, firewall rules, federation).
Classic Minutes Customer-managed compute Entry-level. Basic interactive SQL exploration when Pro/Serverless are not options. Least features.

All SQL Warehouses come with Photon enabled by default.

Key differences from clusters:


24. How would you reduce Databricks costs in a production environment?

Ordered by impact:

# Strategy Savings Effort
1 Switch production workloads from All-Purpose to Job Clusters 3-4x DBU reduction Low
2 Enable auto-termination on all interactive clusters (10-30 min) Eliminates idle waste (often 20-40% of bill) Low
3 Use Spot Instances for worker nodes Up to 90% infra savings Low
4 Enable autoscaling with tight min/max boundaries Pay only for what you use Low
5 Enforce Cluster Policies Prevent over-provisioning org-wide Medium
6 Enable Photon for SQL/DataFrame workloads Faster execution = fewer DBUs consumed Low
7 Right-size instance types Avoid paying for unused CPU/memory Medium
8 Use Serverless SQL Warehouses for SQL workloads Auto-suspend = zero idle cost Low
9 Use Instance Pools for shared compute Faster startup, better resource utilization Medium
10 Tag everything for cost visibility Identifies waste by team/project Low

The first three alone typically cut costs by 50-70%.


25. Your cluster cost increased significantly last month. How would you investigate and optimize?

Step-by-step investigation:

  1. Check cluster uptime -- Are all-purpose clusters running idle overnight or over weekends? This is the most common culprit.
  2. Identify compute type -- Are production jobs running on All-Purpose ($0.55/DBU) instead of Job Clusters ($0.15/DBU)?
  3. Review auto-termination settings -- Is it disabled or set too high (e.g., 120 min default)?
  4. Inspect autoscaling config -- Are max workers set unreasonably high? Are clusters scaling to max and staying there?
  5. Check instance types -- Are teams using expensive GPU or memory-optimized VMs when general-purpose would suffice?
  6. Review cluster policies -- Are guardrails in place, or can any user create unrestricted clusters?
  7. Look at job history -- Any failed jobs retrying repeatedly? A retry loop on a 50-node cluster burns cost fast.
  8. Check tags -- Identify which team/project drove the spike.
  9. SQL Warehouse usage -- Is a serverless warehouse running expensive, unoptimized queries at scale?
  10. Use cloud cost tools -- Azure Cost Management / AWS Cost Explorer to trace the infrastructure portion of the bill.

Optimization actions: Apply the strategies from Q24 based on findings. Typically, enforcing auto-termination + switching to Job Clusters resolves 60-70% of cost spikes.


26. A production ETL job that usually completes in 20 minutes is now taking 2 hours. The cluster configuration hasn't changed. How would you debug this?

Systematic approach:

  1. Check Spark UI -- Look at the stage timeline. Is one stage taking disproportionately long? That signals a data skew (one partition has far more data than others).

  2. Check input data volume -- Did the source data grow significantly? A table that was 10 GB last week might be 100 GB now due to upstream changes.

  3. Check for shuffle spills -- In the Spark UI, look for "Spill (Memory)" and "Spill (Disk)". Spills mean workers don't have enough memory for the shuffle, causing disk I/O.

  4. Check for small file problem -- Too many small files in Delta Lake cause excessive task overhead. Run OPTIMIZE to compact files.

  5. Check for missing VACUUM/OPTIMIZE -- Delta Lake tables accumulate old files over time. Without maintenance, read performance degrades.

  6. Check upstream schema changes -- Did a column type change? Did new columns get added? Schema evolution can trigger full table scans.

  7. Check cluster health -- Are any workers in an unhealthy state? Check for spot instance evictions causing task retries.

  8. Check concurrent workloads -- Is another heavy job sharing the same cluster and competing for resources?

  9. Compare execution plans -- Run EXPLAIN on the query and compare with a previous successful run. Look for missing predicate pushdown or broadcast join changes.

  10. Check Delta Lake stats -- Run DESCRIBE DETAIL on the table. Check numFiles, sizeInBytes. If file count exploded, that's the problem.


27. Your team uses All-Purpose clusters for everything -- dev, test, and production. Management wants to cut Databricks costs by 50%. What changes would you propose?

Phased proposal:

Phase 1 -- Immediate wins (Week 1-2, ~40% savings):

Phase 2 -- Governance (Week 3-4, ~10-15% additional savings):

Phase 3 -- Optimization (Month 2, ~5-10% additional savings):

Expected result: 50-65% cost reduction without any loss in functionality or performance.

The key insight: The bulk of savings (40%) comes from one single change -- switching production from All-Purpose to Job Clusters. Everything else is incremental.

28. How do you monitor and debug a failed Spark job in Databricks?

Follow a structured 5-step workflow:

Step 1: Find the failed run and task

Step 2: Read the error message properly

Step 3: Check logs

Step 4: Reproduce in a notebook

Step 5: Fix and prevent


29. What are the key tabs in the Spark UI and what does each tell you?

Tab What It Shows Key Use
Jobs Every Spark action triggered, with duration, status, and stage count Find which job is slow or failed
Stages Stage-level metrics: shuffle read/write, spill (memory/disk), task distribution Primary debugging tab -- find skew, spills, expensive shuffles
Storage Cached RDDs/DataFrames, memory and disk usage, fraction cached Verify caching is working or identify wasted memory
Executors Per-executor metrics: cores, active/failed tasks, GC time, shuffle I/O Identify overloaded executors, OOM, spot evictions
SQL Visual query execution plan DAG with per-operator metrics Understand how SQL/DataFrame queries are physically executed
Environment All Spark configs, JVM info, classpath, system properties Verify configuration settings are correctly applied
Streaming Input rate, processing time, batch duration, completed batches Monitor streaming pipeline throughput and lag

The most important tab for performance debugging is Stages. Start there, then drill into Tasks for skew analysis.


30. Your Databricks job is running 4x slower in production. How would you identify and fix the bottleneck?

Systematic approach:

1. Identify the slow stage:

2. Check for data skew:

3. Check for shuffle spill:

4. Check for small file problem:

5. Check input data volume:

6. Check execution plan:

7. Check cluster health:


31. What is a Spark Execution Plan? How do you read it?

When you write a transformation, Spark's Catalyst Optimizer transforms your code through four stages before executing:

Parsed Logical Plan --> Analyzed --> Optimized --> Physical Plan
Stage What Happens
Parsed Raw AST. Table/column names are unresolved strings.
Analyzed Resolves all references against the catalog. Throws AnalysisException if invalid.
Optimized Applies rules: predicate pushdown, constant folding, column pruning, join reordering.
Physical Chooses execution strategy: join algorithms, shuffle methods, scan approaches.

How to view:

df.explain("simple")     # Physical plan only
df.explain("extended")   # All 4 stages
df.explain("formatted")  # Clean tree with node IDs

What to look for in the physical plan:


32. How do you identify and handle data skew in Spark?

Detection:

Common causes:

Solutions (ordered by preference):

Solution How When
AQE Skew Join spark.sql.adaptive.enabled=true + spark.sql.adaptive.skewJoin.enabled=true First option -- automatic, no code changes
Broadcast Join spark.sql.autoBroadcastJoinThreshold=10485760 or broadcast(df) hint When one side of the join is small enough to fit in memory
Salting Add random prefix to the skewed key, join on salted key, then remove prefix When both tables are large and AQE isn't sufficient
Pre-aggregation Aggregate the skewed data before the join to reduce volume When downstream logic allows it
Filter nulls Handle null join keys separately When nulls are the source of skew
Repartition Increase spark.sql.shuffle.partitions to create smaller partitions General mitigation, not a targeted fix

33. How would you debug intermittent (non-deterministic) job failures in production?

Intermittent failures are harder than consistent failures because they can't be reliably reproduced. Approach:

1. Check for spot instance evictions:

2. Check for resource contention:

3. Check for upstream data changes:

4. Check for API/external dependency timeouts:

5. Check for race conditions:

6. Implement structured logging:

7. Enable retry policies:


34. What are common executor issues and how do you troubleshoot them?

Issue Symptom in Spark UI Root Cause Fix
OOM (Out of Memory) Dead executors, failed tasks, Container killed by YARN for exceeding memory limits Large partitions, data skew, or insufficient memory config Increase executor memory, increase shuffle partitions, fix skew
High GC Time GC Time > 10% of total task time in Executors tab Too many objects in memory, large caches, or small heap Increase executor memory, reduce cache size, tune GC settings
Shuffle Issues High shuffle read/write, spill to disk Wide transformations on large datasets Increase shuffle partitions, use broadcast joins, filter early
Too Many Dead Executors Dead executor count > 0, tasks retried Spot evictions, OOM kills, or cloud resource limits Use on-demand instances, increase memory, check cloud quotas
Uneven Task Distribution Active tasks concentrated on 1-2 executors Data skew or bad partitioning Repartition data, salt keys, enable AQE

Debugging steps:

  1. Spark UI --> Executors tab: Check active vs dead, GC time, shuffle metrics
  2. Click a dead executor --> stderr: Read the full stack trace
  3. Spark UI --> Stages tab: Find the stage with failed tasks, check task duration distribution
  4. Adjust configuration:
    • spark.executor.memory -- increase for OOM
    • spark.sql.shuffle.partitions -- increase for spill
    • spark.dynamicAllocation.enabled -- disable for short jobs with frequent executor churn

35. A streaming pipeline's processing time exceeds the batch interval. How would you diagnose and fix this?

When processing time > batch interval, the pipeline falls behind and a backlog builds up. Eventually, it can crash or produce severely delayed results.

Diagnosis:

  1. Spark UI --> Streaming tab: Check the Processing Time graph. If it consistently exceeds the batch interval (e.g., 10-second batch, 15-second processing), the pipeline is falling behind.
  2. Check Input Rate: Is the source sending more events than expected? A sudden spike in input can overwhelm the pipeline.
  3. Check Stages tab for the streaming job: Look for data skew, shuffle spill, or long-running tasks.
  4. Check for micro-batch overhead: Too many small files being written per batch.

Fixes:


36. After deploying a code change, your Delta Lake MERGE job started failing with OOM errors. The data volume hasn't changed. What would you investigate?

The data volume hasn't changed, so the problem is in the code change or execution plan, not the data.

Investigation:

1. Compare execution plans (before vs after):

df.explain("extended")

2. Check the MERGE condition:

3. Check for new columns or schema changes:

4. Check for changes in partitioning:

5. Check for removed filters:

6. Fix:


37. Your team has no visibility into why jobs fail. Management asks you to set up a monitoring strategy for Databricks. What would you propose?

A comprehensive monitoring strategy has three layers:

Layer 1: Built-in Databricks Monitoring (Day 1)

Tool What It Covers
Job Alerts Email/Slack/webhook notifications on job success, failure, or duration threshold
Jobs UI Matrix View Visual history of all task runs across jobs
Spark UI Live performance diagnostics (stages, tasks, executors)
Cluster Event Logs Cluster lifecycle events, autoscaling, init scripts

Layer 2: Centralized Logging (Week 1-2)

Action How
Enable Cluster Log Delivery Configure all production clusters to write driver/executor logs to ADLS Gen2 or S3
Structured Logging Replace print() with Python logging module using JSON format for machine-parseable logs
Centralize in Delta Table Ingest cluster logs into a Delta table for querying, filtering, trend analysis
Persist to Unity Catalog Volume Store application logs in Volumes for long-term retention

Layer 3: Dashboards & Alerting (Month 1)

Action How
AI/BI Dashboard Build a Databricks dashboard showing: job success/failure rates, average duration trends, top error messages, cost per job
Azure Monitor Integration Forward diagnostic telemetry to Azure Monitor for cross-service correlation
Custom Alerts Set up alerts for: job duration > 2x historical average, > 3 consecutive failures, cluster cost exceeding budget threshold
System Tables Query Databricks system tables (system.billing, system.access) for cost tracking and audit logs

Result: The team goes from "we don't know why jobs fail" to "we see failures in real-time, can diagnose root causes in minutes, and have historical trend data for capacity planning."


Questions which are asked in the interview


#databricks #todo #interview