Compute - Clusters, Policies & Cost Optimization

Compute refers to the computing resources available to run your data engineering, data science, and analytics workloads. Every notebook, job, pipeline, or SQL query needs compute to execute.

A Spark cluster in Databricks consists of two types of nodes:

Node | Role
Driver Node | The coordinator -- holds the SparkContext, plans execution (DAG), distributes tasks to workers, collects results. There is always exactly one driver per cluster.
Worker Nodes | The executors -- perform actual data processing in parallel, store intermediate data, return results to the driver. A cluster can have zero or more workers.

+-----------+
|  Driver   |  <-- Coordinates, plans, collects results
+-----------+
   / | \
  /  |  \
+--------+ +--------+ +--------+
| Worker | | Worker | | Worker |  <-- Execute tasks in parallel
+--------+ +--------+ +--------+
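
As a rough analogy for the diagram above (plain Python threads, not Spark itself), the sketch below has a "driver" split the data into partitions, hand them to a pool of "workers", and combine the partial results. The names `run_on_worker` and `driver` are illustrative assumptions, not Spark or Databricks APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_worker(partition):
    """Worker side: process one partition and return a partial result."""
    return sum(x * x for x in partition)


def driver(data, num_workers=3):
    """Driver side: plan the partitions, distribute them, collect results."""
    # Plan: split the data into one partition per worker (round-robin).
    partitions = [data[i::num_workers] for i in range(num_workers)]
    # Distribute: the partitions are processed by the worker pool in parallel.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(run_on_worker, partitions))
    # Collect: combine the partial results back on the driver.
    return sum(partial_results)


total = driver(list(range(10)))
print(total)
```

With `num_workers=1` the same code still runs, with the single worker receiving all the data.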

Types of Compute

Databricks broadly groups compute into three categories:

1. Classic Compute

a) All-Purpose Clusters (Interactive)

b) Job Clusters (Automated)

2. SQL Warehouses

Compute resources specifically optimized for SQL analytics and BI workloads. They come with Photon enabled by default.

Type | Description
Serverless SQL Warehouse | Recommended. Startup in 2-6 seconds, rapid autoscaling, fully managed. Best for BI, ETL, exploratory analysis.
Pro SQL Warehouse | Use when serverless is unavailable or custom networking is needed (e.g., federation, hybrid).
Classic SQL Warehouse | Entry-level. Basic interactive exploration when serverless or Pro are not options.
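
To make the three warehouse types concrete, here is a minimal sketch of a request body for creating a warehouse. The field names (`warehouse_type`, `enable_serverless_compute`, `cluster_size`, `auto_stop_mins`) follow the SQL Warehouses REST API as I understand it and should be checked against the current documentation; the name, size, and timeout are made-up values.

```python
import json


def warehouse_payload(name, kind):
    """Build a create-warehouse request body for one of the three types."""
    if kind not in ("serverless", "pro", "classic"):
        raise ValueError(f"unknown warehouse kind: {kind}")
    return {
        "name": name,
        "cluster_size": "Small",   # assumed size, for illustration only
        "auto_stop_mins": 10,      # assumed idle timeout
        # Serverless is requested as a PRO warehouse plus the serverless flag.
        "warehouse_type": "CLASSIC" if kind == "classic" else "PRO",
        "enable_serverless_compute": kind == "serverless",
    }


payload = warehouse_payload("bi-dashboards", "serverless")
print(json.dumps(payload, indent=2))
```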

3. Serverless Compute

Fully managed by Databricks; no cluster configuration is needed. Available for:

Benefits: faster startup, automatic scaling, lower operational overhead, no infrastructure management.


Instance Pools

An Instance Pool is a set of pre-provisioned, idle VMs ("warm" instances) that clusters can draw from, reducing startup time.

Without Pools | With Pools
Cluster startup: 5-10 minutes (cold VM provisioning) | Cluster startup: seconds (VMs already warm)
Each cluster provisions fresh VMs | Shared pool of ready instances
Higher startup cost and latency | Lower latency, better resource reuse

Key properties:

Use when: teams run frequent short-lived clusters, need fast startup, or want to share resources.
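
A minimal sketch of how a pool and a pool-backed cluster might be described, assuming the field names used by the Instance Pools and Clusters REST APIs (`min_idle_instances`, `max_capacity`, `instance_pool_id`, and so on). The pool name, pool ID, instance type, and runtime version are hypothetical values.

```python
# Pool definition: keeps a few VMs warm and caps total capacity.
pool_spec = {
    "instance_pool_name": "shared-etl-pool",       # hypothetical name
    "node_type_id": "i3.xlarge",                   # assumed AWS instance type
    "min_idle_instances": 2,                       # VMs kept warm
    "max_capacity": 20,                            # cap on total instances
    "idle_instance_autotermination_minutes": 30,
}


def cluster_from_pool(pool_id, num_workers):
    """Cluster spec that draws its VMs from a pool instead of provisioning new ones."""
    return {
        "cluster_name": "short-lived-etl",         # hypothetical name
        "spark_version": "15.4.x-scala2.12",       # assumed runtime version key
        "instance_pool_id": pool_id,               # node type comes from the pool
        "num_workers": num_workers,
    }


cluster_spec = cluster_from_pool("pool-1234", 4)   # "pool-1234" is a placeholder ID
```

Note that the cluster spec carries no `node_type_id` of its own, since the instance type is a property of the pool.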


Cluster Modes

Mode | Workers | Use Case
Standard | 1 or more workers | General-purpose: ETL, batch, ML training. Shared JVM across jobs.
High Concurrency | 1 or more workers | Multi-user environments. Query isolation via Thrift Server. Fair scheduling pools. Credential passthrough support. Best for concurrent BI queries (Power BI, Tableau).
Single Node | 0 (driver only) | No workers. Spark runs locally on the driver. For lightweight dev, testing, small-scale ML, non-distributed workloads.

Note: High Concurrency mode is being replaced by Access Modes in newer Databricks versions.
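
For the Single Node case, a cluster spec could look roughly like the sketch below. The Spark settings and the `ResourceClass` tag follow the pattern I believe Databricks documents for single-node clusters; treat them, along with the cluster name, instance type, and runtime version, as assumptions to verify.

```python
single_node_spec = {
    "cluster_name": "dev-single-node",             # hypothetical name
    "spark_version": "15.4.x-scala2.12",           # assumed runtime version key
    "node_type_id": "i3.xlarge",                   # assumed instance type
    "num_workers": 0,                              # driver only, no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",                # Spark runs locally on the driver
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}


def is_single_node(spec):
    """True if the spec has zero workers and runs Spark in local mode."""
    master = spec.get("spark_conf", {}).get("spark.master", "")
    return spec.get("num_workers") == 0 and master.startswith("local")


print(is_single_node(single_node_spec))
```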


Access Modes

Access mode determines who can use the compute and what security features are enforced:

Access Mode | Description
Standard | Shared by multiple users and groups. Enforces user-level and group-level data access permissions via Unity Catalog. Recommended for most workloads.
Dedicated (formerly Single User) | Dedicated to a single user. Supports features not available in Standard mode: RDD APIs, GPU instances, R language, Databricks Container Services.
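
In cluster specs, the access mode is typically expressed through the `data_security_mode` field. The sketch below assumes the long-standing API values `USER_ISOLATION` (Standard) and `SINGLE_USER` (Dedicated); newer API versions may expose differently named values, so check the current Clusters API reference. The helper `with_access_mode` and the example user are hypothetical.

```python
# Assumed mapping from the access modes above to Clusters API values.
ACCESS_MODE_TO_API = {
    "standard": "USER_ISOLATION",
    "dedicated": "SINGLE_USER",
}


def with_access_mode(spec, mode, user=None):
    """Return a copy of the cluster spec with the access mode applied."""
    if mode not in ACCESS_MODE_TO_API:
        raise ValueError(f"unknown access mode: {mode}")
    updated = dict(spec)
    updated["data_security_mode"] = ACCESS_MODE_TO_API[mode]
    if mode == "dedicated":
        if user is None:
            raise ValueError("dedicated mode needs the assigned user")
        updated["single_user_name"] = user
    return updated


base = {"cluster_name": "analytics", "num_workers": 2}
shared = with_access_mode(base, "standard")
personal = with_access_mode(base, "dedicated", user="someone@example.com")
```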

Cluster Configuration Options

Databricks Runtime (DBR)

The runtime is the set of core components that run on your cluster. Includes Apache Spark, pre-installed libraries, and Databricks optimizations.

Runtime | Use Case
Standard Runtime | General data engineering
ML Runtime | Pre-installed ML libraries (TensorFlow, PyTorch, XGBoost, scikit-learn, MLflow)
Runtime for Genomics | Bioinformatics workloads
LTS (Long Term Support) | Recommended for production job clusters -- stability and tested compatibility
Latest (non-LTS) | Recommended for all-purpose clusters -- latest features and optimizations

Photon Engine

Photon is Databricks' native vectorized query engine. It can accelerate SQL and DataFrame workloads by roughly 2-4x, depending on the workload.
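
As a sketch, Photon is usually switched on per cluster through the `runtime_engine` field of the cluster spec (assumed here to accept `"PHOTON"`, per the Clusters API as I understand it). The second helper simply turns the 2-4x figure into a runtime range, under the simplifying assumption that the whole job speeds up uniformly.

```python
def enable_photon(spec):
    """Return a copy of the cluster spec with the Photon engine requested."""
    updated = dict(spec)
    updated["runtime_engine"] = "PHOTON"
    return updated


def runtime_range_hours(baseline_hours, low_speedup=2.0, high_speedup=4.0):
    """Return (best_case, worst_case) runtime for a 2-4x speedup of the whole job."""
    return baseline_hours / high_speedup, baseline_hours / low_speedup


spec = enable_photon({"cluster_name": "sql-heavy-etl", "num_workers": 4})
best, worst = runtime_range_hours(2.0)
print(spec["runtime_engine"], best, worst)
```

Real jobs rarely speed up uniformly, so the range is an upper bound on the benefit rather than a prediction.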

Node Type Selection

Instance Type | Best For
General Purpose | Balanced workloads, development, small ETL
Memory Optimized | ML, large in-memory datasets, heavy joins
Storage Optimized | Delta Lake caching, high I/O workloads
Compute Optimized | CPU-heavy ETL, complex transformations
GPU | Deep learning, model training (significantly more expensive)

Autoscaling

Dynamically adds or removes worker nodes based on pending tasks in Spark's task queue.
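
The idea can be sketched as a simple sizing rule: estimate how many workers the pending tasks need, then clamp that to the configured bounds. This is a deliberately simplified model, not the actual Databricks autoscaling algorithm; `target_workers` and `tasks_per_worker` are hypothetical names.

```python
import math


def target_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    """Pick a worker count that covers the pending tasks, clamped to [min, max]."""
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


# The bounds themselves are set in the cluster spec's "autoscale" block.
autoscale_fragment = {"autoscale": {"min_workers": 2, "max_workers": 10}}
bounds = autoscale_fragment["autoscale"]

for pending in (0, 40, 1000):
    print(pending, target_workers(pending, 8, bounds["min_workers"], bounds["max_workers"]))
```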

Best practices:

Auto-Termination

Automatically shuts down a cluster after a period of inactivity.
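
In a cluster spec this is controlled by `autotermination_minutes`. The check behind it can be sketched as a comparison between idle time and the timeout; `should_terminate` is a hypothetical helper, and the timestamps are made up.

```python
from datetime import datetime, timedelta

# Cluster spec fragment: shut down after 30 idle minutes.
spec_fragment = {"autotermination_minutes": 30}


def should_terminate(last_activity, now, idle_timeout_minutes):
    """True once the cluster has been idle for at least the timeout."""
    return now - last_activity >= timedelta(minutes=idle_timeout_minutes)


now = datetime(2024, 1, 1, 12, 0)
timeout = spec_fragment["autotermination_minutes"]
idle_40_min = should_terminate(datetime(2024, 1, 1, 11, 20), now, timeout)
idle_15_min = should_terminate(datetime(2024, 1, 1, 11, 45), now, timeout)
print(idle_40_min, idle_15_min)
```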

Tags

Custom key-value metadata attached to clusters, for example to track spend by team or project.
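
Cost attribution by team or project is one typical use. The sketch below assumes tags are set through the cluster spec's `custom_tags` field and that some billing export yields one cost record per cluster; the records, tag keys, and dollar amounts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-cluster cost records joined with their tags.
clusters = [
    {"custom_tags": {"team": "data-eng", "project": "ingest"}, "cost_usd": 120.0},
    {"custom_tags": {"team": "data-eng", "project": "reporting"}, "cost_usd": 80.0},
    {"custom_tags": {"team": "ml"}, "cost_usd": 200.0},
    {"custom_tags": {}, "cost_usd": 15.0},
]


def spend_by_tag(records, tag_key):
    """Sum cost per value of one tag; clusters without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for record in records:
        label = record["custom_tags"].get(tag_key, "untagged")
        totals[label] += record["cost_usd"]
    return dict(totals)


by_team = spend_by_tag(clusters, "team")
print(by_team)
```

The "untagged" bucket is often the most useful output, since it shows how much spend cannot yet be attributed.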


Compute Policies

Policies are admin-defined rules that limit what users can configure when creating clusters.

Benefits

Policy Families (Built-in Templates)

Family | Purpose
Personal Compute | Single-user, single-node development
Shared Compute | Multi-user interactive workloads
Job Compute | Automated production pipelines
Power User Compute | Advanced users needing custom configs
Legacy Shared Compute | Backward compatibility

Policies can be defined via UI dropdown menus or raw JSON. Libraries can also be attached to policies (auto-installed on cluster creation).
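
To illustrate the raw JSON form, here is a small policy definition written as a Python dict, plus a toy checker that mimics how such rules constrain a cluster request. The rule types (`range`, `allowlist`, `fixed`) and their keys follow the policy definition syntax as I understand it; the `violations` function is a hypothetical stand-in, not how Databricks enforces policies internally.

```python
policy_definition = {
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30,
    },
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "custom_tags.team": {"type": "fixed", "value": "data-eng"},
}


def violations(spec, definition):
    """Return human-readable policy violations for a flat cluster spec."""
    problems = []
    for attr, rule in definition.items():
        if attr not in spec:
            continue  # assume the policy's fixed or default value is applied
        value = spec[attr]
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{attr} must be {rule['value']!r}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{attr}={value!r} is not in the allowlist")
        elif rule["type"] == "range":
            low = rule.get("minValue", float("-inf"))
            high = rule.get("maxValue", float("inf"))
            if not low <= value <= high:
                problems.append(f"{attr}={value} is outside [{low}, {high}]")
    return problems


bad_request = {"autotermination_minutes": 120, "node_type_id": "p3.8xlarge"}
good_request = {"autotermination_minutes": 30, "node_type_id": "i3.xlarge"}
print(violations(bad_request, policy_definition))
print(violations(good_request, policy_definition))
```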


Pricing -- The Dual-Bill Model

Databricks cost consists of two separate bills:

Total Cost = Databricks DBU Fees + Cloud Infrastructure Fees

What is a DBU?

A Databricks Unit (DBU) is a normalized unit of processing power per hour. It is not tied to a specific VM size -- Databricks normalizes consumption across instance types.
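
A worked example of the dual-bill formula, as a sketch: the DBU fee depends on how many DBUs the nodes emit per hour and the per-DBU rate, while the infrastructure fee is the plain VM price. The DBU-per-node figure, the $0.15 rate, and the VM price below are all assumed numbers; real values depend on instance type, tier, and cloud.

```python
def hourly_cost(num_nodes, dbu_per_node_hour, dbu_rate_usd, vm_price_usd_hour):
    """Return (dbu_fee, infra_fee, total) for one hour of cluster uptime."""
    dbu_fee = num_nodes * dbu_per_node_hour * dbu_rate_usd   # paid to Databricks
    infra_fee = num_nodes * vm_price_usd_hour                # paid to the cloud provider
    return dbu_fee, infra_fee, dbu_fee + infra_fee


# Assumed cluster: 1 driver + 4 workers, 1.0 DBU per node-hour,
# $0.15 per DBU, $0.30 per VM-hour.
dbu_fee, infra_fee, total = hourly_cost(5, 1.0, 0.15, 0.30)
print(round(dbu_fee, 2), round(infra_fee, 2), round(total, 2))
```

In this made-up case the cloud bill is the larger of the two, which is why both halves need to be tracked.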

Approximate DBU Rates (Premium Tier, 2026)

Compute Type | AWS Rate | Azure Rate | Best For
Jobs Compute | ~$0.15/DBU | ~$0.15-0.20/DBU | Scheduled production workloads
All-Purpose Compute | ~$0.55/DBU | ~$0.55-0.65/DBU | Interactive notebooks, dev
SQL Serverless | ~$0.70/DBU | ~$0.70/DBU | Serverless SQL analytics
DLT Core | ~$0.15/DBU | ~$0.15/DBU | Basic pipelines
DLT Pro | ~$0.25/DBU | ~$0.25/DBU | CDC + data quality
DLT Advanced | ~$0.36/DBU | ~$0.36/DBU | Complex pipelines

Key insight: There is a 3-4x cost gap between All-Purpose Compute and Jobs Compute. This is the single biggest lever for cost reduction.
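
The gap can be checked directly from the approximate AWS rates quoted above. The monthly DBU volume is a hypothetical figure, and the calculation covers only the DBU side of the bill, not cloud infrastructure.

```python
# Approximate AWS rates from the table above (USD per DBU).
RATES = {"jobs": 0.15, "all_purpose": 0.55}


def monthly_dbu_fee(dbus, rate_usd):
    """DBU fee for a month's consumption at a given per-DBU rate."""
    return dbus * rate_usd


gap = RATES["all_purpose"] / RATES["jobs"]

# Hypothetical workload consuming 1,000 DBUs per month.
all_purpose_fee = monthly_dbu_fee(1000, RATES["all_purpose"])
jobs_fee = monthly_dbu_fee(1000, RATES["jobs"])
saving = all_purpose_fee - jobs_fee

print(round(gap, 2), round(all_purpose_fee, 2), round(jobs_fee, 2), round(saving, 2))
```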

The cloud infrastructure bill (VMs, storage, networking) is charged separately by your cloud provider (Azure, AWS, GCP).


Spot Instances

Type | Cost Savings | Availability | Eviction Risk
On-Demand | Baseline | Always available | None
Reserved | Up to 75% savings | Committed 1-3 years | None
Spot | Up to 90% savings | Surplus capacity | High -- 30 sec to 2 min notice

When to Use Spot Instances

Best Practice: Hybrid Approach
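
A hybrid setup is commonly understood as keeping the driver on on-demand capacity while running workers on spot with a fallback to on-demand; the notes here do not spell this out, so treat that reading as an assumption. On AWS it is usually expressed through `aws_attributes` (field names as I understand the Clusters API). The cost helper, prices, and the 70% discount are hypothetical.

```python
# Cluster spec fragment (AWS): first node, the driver, is on-demand;
# the remaining nodes use spot and fall back to on-demand if evicted.
hybrid_fragment = {
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
    }
}


def blended_hourly_infra_cost(on_demand_nodes, spot_nodes, on_demand_price, spot_discount):
    """Hourly VM cost for a mix of on-demand and spot nodes."""
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


# Assumed: $0.40 per VM-hour on-demand, spot at a 70% discount, 1 driver + 8 workers.
hybrid_cost = blended_hourly_infra_cost(1, 8, 0.40, 0.70)
all_on_demand_cost = blended_hourly_infra_cost(9, 0, 0.40, 0.70)
print(round(hybrid_cost, 2), round(all_on_demand_cost, 2))
```

Keeping the driver on-demand matters because losing the driver ends the whole job, whereas a lost worker only forces its tasks to be rerun.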


Cost Optimization Best Practices

Strategy | Impact | Effort
Use Job Clusters instead of All-Purpose for production | 3-4x cost reduction | Low
Enable auto-termination (10-30 min idle timeout) | Eliminates idle waste | Low
Use Spot Instances for workers | Up to 90% infra savings | Low
Enable autoscaling with proper min/max | Pay only for what you need | Low
Apply Cluster Policies to enforce guardrails | Prevents over-provisioning | Medium
Use latest LTS runtime | Performance improvements = shorter runtime | Low
Enable Photon for SQL/DataFrame workloads | 2-4x speedup = shorter runtime | Low
Right-size instance types based on workload | Avoid paying for unused resources | Medium
Use Instance Pools for frequently used clusters | Faster startup, shared resources | Medium
Use SQL Warehouses for SQL workloads | Optimized pricing for SQL | Low
Use Serverless Compute where supported | No idle cost, auto-scaling | Low
Tag clusters for cost tracking | Visibility into spend by team/project | Low
Only use GPUs for GPU-accelerated workloads | Avoid unnecessary premium | Low
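
The first strategy in the table can be audited mechanically. In a Jobs API style job definition (structure assumed from the Jobs API as I understand it), a task that runs on a job cluster carries a `new_cluster` block, while one pinned to an all-purpose cluster carries `existing_cluster_id`. The job, paths, and cluster ID below are invented, and `tasks_on_all_purpose` is a hypothetical helper.

```python
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/etl/ingest"},
            # Job cluster: created for the run, terminated afterwards.
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",   # assumed runtime version key
                "node_type_id": "i3.xlarge",           # assumed instance type
                "num_workers": 4,
            },
        },
        {
            "task_key": "report",
            "notebook_task": {"notebook_path": "/Workspace/etl/report"},
            # All-purpose cluster: billed at the higher interactive rate.
            "existing_cluster_id": "0101-120000-abcd123",   # placeholder ID
        },
    ],
}


def tasks_on_all_purpose(job):
    """Return the keys of tasks that run on an existing all-purpose cluster."""
    return [t["task_key"] for t in job["tasks"] if "existing_cluster_id" in t]


flagged = tasks_on_all_purpose(job_spec)
print(flagged)
```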

Summary Comparison Table

Feature | All-Purpose Cluster | Job Cluster | SQL Warehouse | Serverless Compute
Lifecycle | Manual start/stop | Auto-created, auto-terminated | Always-on or auto-suspend | On-demand, instant
Users | Multi-user | Single job | Multi-user (SQL) | Multi-user
Languages | Python, SQL, Scala, R | Python, SQL, Scala, R | SQL only | Python, SQL
Cost | ~$0.55/DBU | ~$0.15/DBU | ~$0.70/DBU (serverless) | Varies
Best For | Dev, exploration | Production ETL | BI, analytics | Quick tasks, no config
Autoscaling | Yes | Yes | Yes | Automatic
Photon | Optional | Optional | Default | Default

Interview Question

