Compute - Clusters, Policies & Cost Optimization

Compute refers to the computing resources available to run your data engineering, data science, and analytics workloads. Every notebook, job, pipeline, or SQL query needs compute to execute.

A Spark cluster in Databricks consists of two types of nodes:

Node	Role
Driver Node	The coordinator -- holds the SparkContext, plans execution (DAG), distributes tasks to workers, collects results. There is always exactly one driver per cluster.
Worker Nodes	The executors -- perform actual data processing in parallel, store intermediate data, return results to the driver. A cluster can have zero or more workers.

+-----------+
|  Driver   |  <-- Coordinates, plans, collects results
+-----------+
   / | \
  /  |  \
+--------+ +--------+ +--------+
| Worker | | Worker | | Worker |  <-- Execute tasks in parallel
+--------+ +--------+ +--------+
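
The coordination pattern in the diagram can be imitated in plain Python. This is only an analogy, not Spark code: the `run_on_cluster` function and its thread pool are stand-ins for the driver and the workers.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_cluster(data, num_workers=3):
    """Toy 'driver': split the data, fan tasks out, collect the results."""
    # The driver plans the work: one partition per worker (round-robin split).
    partitions = [data[i::num_workers] for i in range(num_workers)]
    # The workers process their partitions in parallel.
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partial_sums = list(workers.map(sum, partitions))
    # The driver combines the partial results into the final answer.
    return sum(partial_sums)

total = run_on_cluster(list(range(1, 101)))
print(total)
```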

Types of Compute

Databricks broadly groups compute into three categories:

1. Classic Compute

a) All-Purpose Clusters (Interactive)

Created manually via UI, CLI, or API
Long-running, persistent -- stays alive until manually stopped or auto-terminated
Supports Python, SQL, Scala, R
Multiple users can attach simultaneously
Use case: development, exploration, ad-hoc analysis, collaborative notebooks
More expensive (~$0.55/DBU on Premium tier)

b) Job Clusters (Automated)

Created automatically when a job is triggered
Terminated automatically when the job completes
Isolated per job run -- no shared state between jobs
Use case: production ETL, scheduled batch processing, ML training
Significantly cheaper (~$0.15/DBU on Premium tier)

2. SQL Warehouses

Compute resources specifically optimized for SQL analytics and BI workloads. They come with Photon enabled by default.

Type	Description
Serverless SQL Warehouse	Recommended. Startup in 2-6 seconds, rapid autoscaling, fully managed. Best for BI, ETL, exploratory analysis.
Pro SQL Warehouse	Use when serverless is unavailable or custom networking is needed (e.g., federation, hybrid).
Classic SQL Warehouse	Entry-level. Basic interactive exploration when serverless or Pro are not options.

3. Serverless Compute

Fully managed by Databricks; no cluster configuration is needed. Available for:

Notebooks
Jobs
Delta Live Tables (Lakeflow Declarative Pipelines)

Benefits: faster startup, automatic scaling, lower operational overhead, no infrastructure management.

Instance Pools

An Instance Pool is a set of pre-provisioned, idle VMs ("warm" instances) that clusters can draw from, reducing startup time.

Without Pools	With Pools
Cluster startup: 5-10 minutes (cold VM provisioning)	Cluster startup: seconds (VMs already warm)
Each cluster provisions fresh VMs	Shared pool of ready instances
Higher startup cost and latency	Lower latency, better resource reuse

Key properties:

Minimum idle instances (always warm)
Maximum capacity (cost cap)
Idle instance auto-termination
Can be shared across multiple clusters and teams

Use when: teams run frequent short-lived clusters, need fast startup, or want to share resources.
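
The key properties listed above map onto fields of an instance pool definition. The sketch below builds such a definition as a Python dict, using field names as I understand them from the Instance Pools REST API. The pool name, node type, and numbers are placeholders, and `check_pool_spec` is a hypothetical local helper, not part of any Databricks SDK.

```python
import json

# Placeholder values; adjust the node type and sizes for your cloud and workload.
instance_pool_spec = {
    "instance_pool_name": "shared-etl-pool",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,                      # always-warm instances
    "max_capacity": 20,                           # cost cap
    "idle_instance_autotermination_minutes": 15,  # release extra idle VMs
}

def check_pool_spec(spec):
    """Return a list of problems in a pool spec (an empty list means none found)."""
    problems = []
    if spec["min_idle_instances"] > spec["max_capacity"]:
        problems.append("min_idle_instances exceeds max_capacity")
    if spec["idle_instance_autotermination_minutes"] < 0:
        problems.append("idle_instance_autotermination_minutes is negative")
    return problems

payload = json.dumps(instance_pool_spec, indent=2)
print(payload)
```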

Cluster Modes

Mode	Workers	Use Case
Standard	1 or more workers	General-purpose: ETL, batch, ML training. Shared JVM across jobs.
High Concurrency	1 or more workers	Multi-user environments. Query isolation via Thrift Server. Fair scheduling pools. Credential passthrough support. Best for concurrent BI queries (Power BI, Tableau).
Single Node	0 (driver only)	No workers. Spark runs locally on the driver. For lightweight dev, testing, small-scale ML, non-distributed workloads.

Note: Cluster modes such as High Concurrency are legacy settings. Newer Databricks versions replace them with Access Modes.
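
For the Single Node row in the table above, a cluster definition sent to the Clusters API typically sets zero workers and tells Spark to run locally on the driver. A minimal sketch follows; the cluster name, runtime version, and node type are placeholders, and `is_single_node` is a hypothetical helper.

```python
single_node_spec = {
    "cluster_name": "dev-single-node",       # placeholder name
    "spark_version": "15.4.x-scala2.12",     # placeholder runtime key
    "node_type_id": "i3.xlarge",             # placeholder instance type
    "num_workers": 0,                        # driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",          # Spark runs locally on the driver
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

def is_single_node(spec):
    """True when the spec has no workers and Spark is set to local mode."""
    master = spec.get("spark_conf", {}).get("spark.master", "")
    return spec.get("num_workers") == 0 and master.startswith("local")

print(is_single_node(single_node_spec))
```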

Access Modes

Access mode determines who can use the compute and what security features are enforced:

Access Mode	Description
Standard	Shared by multiple users and groups. Enforces user-level and group-level data access permissions via Unity Catalog. Recommended for most workloads.
Dedicated (formerly Single User)	Dedicated to a single user. Supports features not available in Standard mode: RDD APIs, GPU instances, R language, Databricks Container Service.
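
In cluster definitions, the access mode is set through the `data_security_mode` field. The sketch below maps the two modes from the table to the values I believe the Clusters API accepts (`USER_ISOLATION` and `SINGLE_USER`). Treat the mapping as an assumption to verify against your workspace, since newer API versions may expose other names. `access_mode_fields` is a hypothetical helper.

```python
def access_mode_fields(mode, user=None):
    """Return the cluster-spec fragment for an access mode from the table."""
    if mode == "standard":
        return {"data_security_mode": "USER_ISOLATION"}
    if mode == "dedicated":
        if not user:
            raise ValueError("dedicated mode needs the assigned user's name")
        return {"data_security_mode": "SINGLE_USER", "single_user_name": user}
    raise ValueError(f"unknown access mode: {mode}")

print(access_mode_fields("dedicated", user="someone@example.com"))
```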

Cluster Configuration Options

Databricks Runtime (DBR)

The runtime is the set of core components that run on your cluster. It includes Apache Spark, pre-installed libraries, and Databricks optimizations.

Runtime	Use Case
Standard Runtime	General data engineering
ML Runtime	Pre-installed ML libraries (TensorFlow, PyTorch, XGBoost, scikit-learn, MLflow)
Runtime for Genomics	Bioinformatics workloads
LTS (Long Term Support)	Recommended for production job clusters -- stability and tested compatibility
Latest (non-LTS)	Recommended for all-purpose clusters -- latest features and optimizations

Photon Engine

Photon is Databricks' native vectorized query engine. It accelerates SQL and DataFrame workloads by 2-4x.

Written in C++ for high performance
Enabled by default on SQL Warehouses
Optional on clusters (toggle on during creation)
Best for: SQL-heavy workloads, large aggregations, joins, data scans
May not benefit: queries that already finish in under 2 seconds, pure Python/RDD workloads

Node Type Selection

Instance Type	Best For
General Purpose	Balanced workloads, development, small ETL
Memory Optimized	ML, large in-memory datasets, heavy joins
Storage Optimized	Delta Lake caching, high I/O workloads
Compute Optimized	CPU-heavy ETL, complex transformations
GPU	Deep learning, model training (significantly more expensive)

Autoscaling

Dynamically adds or removes worker nodes based on pending tasks in Spark's task queue.

Min workers: baseline floor for SLA
Max workers: budget ceiling
When pending tasks exceed available cores, workers are added
When workers sit idle, they are removed

Best practices:

Narrow min-max range for latency-sensitive jobs (e.g., 4-8)
Wider range for throughput-heavy batch (e.g., 2-50)
Not beneficial for ultra-short jobs (startup overhead > benefit)
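
The scaling rule described above can be approximated with a toy function. This is a deliberately simplified model, not Databricks' actual autoscaling algorithm: it sizes the cluster to cover pending tasks and clamps the result to the configured floor and ceiling. `target_workers` and its parameters are hypothetical names.

```python
import math

def target_workers(pending_tasks, cores_per_worker, min_workers, max_workers):
    """Workers needed to cover the pending tasks, clamped to [min, max]."""
    needed = math.ceil(pending_tasks / cores_per_worker)
    return max(min_workers, min(max_workers, needed))

# Narrow range for a latency-sensitive job: demand above the ceiling is capped.
print(target_workers(pending_tasks=100, cores_per_worker=8, min_workers=4, max_workers=8))
# Wide range for batch: the cluster follows demand between 2 and 50 workers.
print(target_workers(pending_tasks=200, cores_per_worker=8, min_workers=2, max_workers=50))

# The matching fragment of a cluster definition:
autoscale_fragment = {"autoscale": {"min_workers": 2, "max_workers": 50}}
```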

Auto-Termination

Automatically shuts down a cluster after a period of inactivity.

Default: 120 minutes for all-purpose clusters
Recommended: 10-30 minutes for interactive clusters
Job clusters auto-terminate by design when the job completes
Critical for cost control -- idle clusters are the #1 source of waste
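
To see why the idle timeout matters, the sketch below estimates what one idle period costs before the cluster shuts down. The DBU consumption and VM price are made-up illustration values and `idle_cost` is a hypothetical helper. Only the $0.55/DBU rate comes from this article.

```python
def idle_cost(idle_minutes, dbus_per_hour, dbu_rate, vm_cost_per_hour):
    """Cost of leaving a cluster idle until auto-termination kicks in."""
    hourly = dbus_per_hour * dbu_rate + vm_cost_per_hour  # DBU fees + infrastructure
    return hourly * idle_minutes / 60

# Assumed cluster: 4 DBUs/hour at $0.55/DBU plus $2.00/hour of VMs.
default_waste = idle_cost(120, dbus_per_hour=4, dbu_rate=0.55, vm_cost_per_hour=2.00)
tuned_waste = idle_cost(20, dbus_per_hour=4, dbu_rate=0.55, vm_cost_per_hour=2.00)
print(f"120-minute timeout: ${default_waste:.2f} per idle period")
print(f"20-minute timeout:  ${tuned_waste:.2f} per idle period")

# The timeout itself is a single field in the cluster definition:
cluster_fragment = {"autotermination_minutes": 20}
```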

Compute Policies

Policies are admin-defined rules that limit what users can configure when creating clusters.

Benefits

Enforce cost controls (max nodes, instance types, auto-termination)
Simplify cluster creation UI (fix/hide complex settings)
Prevent over-provisioning
Standardize configurations across teams
Limit per-cluster maximum cost (DBUs/hour)

Policy Families (Built-in Templates)

Family	Purpose
Personal Compute	Single-user, single-node development
Shared Compute	Multi-user interactive workloads
Job Compute	Automated production pipelines
Power User Compute	Advanced users needing custom configs
Legacy Shared Compute	Backward compatibility

Policies can be defined via UI dropdown menus or raw JSON. Libraries can also be attached to policies (auto-installed on cluster creation).
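
A raw-JSON policy is a map from cluster attributes to rules. The sketch below writes one as a Python dict, using rule types (`range`, `allowlist`, `fixed`) as I understand them from the policy definition format. The limits and node types are placeholders. The `allowed` function is a hypothetical local checker for illustration only; real enforcement happens inside Databricks.

```python
import json

policy_definition = {
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 30, "defaultValue": 20},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "dbus_per_hour": {"type": "range", "maxValue": 40},  # per-cluster cost cap
}

def allowed(rule, value):
    """Check one value against one policy rule (simplified local model)."""
    if rule["type"] == "range":
        low = rule.get("minValue", float("-inf"))
        high = rule.get("maxValue", float("inf"))
        return low <= value <= high
    if rule["type"] == "allowlist":
        return value in rule["values"]
    if rule["type"] == "fixed":
        return value == rule["value"]
    raise ValueError(f"rule type not modelled here: {rule['type']}")

policy_json = json.dumps(policy_definition, indent=2)
print(allowed(policy_definition["autotermination_minutes"], 120))
```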

Pricing -- The Dual-Bill Model

Databricks cost consists of two separate bills:

Total Cost = Databricks DBU Fees + Cloud Infrastructure Fees

What is a DBU?

A Databricks Unit (DBU) is a normalized unit of processing capability consumed per hour. It is not a VM size: Databricks assigns each instance type a DBU-per-hour rating, which lets consumption be compared across instance types.
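
The dual-bill formula can be worked through with a small estimator. The cluster size, VM price, and runtime below are invented for illustration. `estimate_cost` is a hypothetical helper that assumes a constant cluster size for the whole run.

```python
def estimate_cost(dbus_per_hour, dbu_rate, vm_cost_per_hour, hours):
    """Total Cost = Databricks DBU Fees + Cloud Infrastructure Fees."""
    dbu_fees = dbus_per_hour * dbu_rate * hours   # billed by Databricks
    infra_fees = vm_cost_per_hour * hours         # billed by the cloud provider
    return {"dbu_fees": dbu_fees, "infra_fees": infra_fees, "total": dbu_fees + infra_fees}

# Assumed: a cluster consuming 10 DBUs/hour at $0.15/DBU, on VMs costing $5.00/hour, for 100 hours.
bill = estimate_cost(dbus_per_hour=10, dbu_rate=0.15, vm_cost_per_hour=5.00, hours=100)
print({name: round(amount, 2) for name, amount in bill.items()})
```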

Approximate DBU Rates (Premium Tier, 2026)

Compute Type	AWS Rate	Azure Rate	Best For
Jobs Compute	~$0.15/DBU	~$0.15-0.20/DBU	Scheduled production workloads
All-Purpose Compute	~$0.55/DBU	~$0.55-0.65/DBU	Interactive notebooks, dev
SQL Serverless	~$0.70/DBU	~$0.70/DBU	Serverless SQL analytics
DLT Core	~$0.15/DBU	~$0.15/DBU	Basic pipelines
DLT Pro	~$0.25/DBU	~$0.25/DBU	CDC + data quality
DLT Advanced	~$0.36/DBU	~$0.36/DBU	Complex pipelines

Key insight: There is a 3-4x cost gap between All-Purpose Compute and Jobs Compute. This is the single biggest lever for cost reduction.
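
The gap can be checked directly from the approximate AWS rates in the table. In the sketch below, the monthly DBU volume is an assumed figure and `monthly_dbu_savings` is a hypothetical helper. The saving applies only to the DBU portion of the bill, since the cloud infrastructure charge is the same for the same VMs.

```python
ALL_PURPOSE_RATE = 0.55  # $/DBU, approximate, from the table
JOBS_RATE = 0.15         # $/DBU, approximate, from the table

ratio = ALL_PURPOSE_RATE / JOBS_RATE

def monthly_dbu_savings(dbus_per_month):
    """DBU fees saved by running the same DBU volume on Jobs Compute."""
    return dbus_per_month * (ALL_PURPOSE_RATE - JOBS_RATE)

print(f"All-Purpose costs about {ratio:.2f}x as much per DBU")
print(f"Moving 1,000 DBUs/month saves about ${monthly_dbu_savings(1000):.2f}")
```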

The cloud infrastructure bill (VMs, storage, networking) is charged separately by your cloud provider (Azure, AWS, GCP).

Spot Instances

Type	Cost Savings	Availability	Eviction Risk
On-Demand	Baseline	Always available	None
Reserved	Up to 75% savings	Committed 1-3 years	None
Spot	Up to 90% savings	Surplus capacity	High -- 30 sec to 2 min notice

When to Use Spot Instances

Fault-tolerant batch ETL jobs
Non-critical dev/test clusters
ML training with checkpointing
Worker nodes (never use spot for the driver node)

Best Practice: Hybrid Approach

Driver node: Always on-demand (no eviction risk)
Worker nodes: Spot instances (cost savings, tolerable if evicted)
Configure fallback to on-demand if spot capacity is unavailable
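
On AWS, the hybrid approach is usually expressed through the `aws_attributes` block of a cluster definition. The sketch below uses field names as I understand them from the Clusters API. Azure and GCP use different attribute blocks, so treat this as an AWS-only assumption. `driver_is_on_demand` is a hypothetical helper.

```python
hybrid_aws_attributes = {
    "first_on_demand": 1,                  # first node (the driver) stays on-demand
    "availability": "SPOT_WITH_FALLBACK",  # spot workers, on-demand if spot is unavailable
    "spot_bid_price_percent": 100,         # max spot price as a percent of on-demand
}

def driver_is_on_demand(aws_attributes):
    """True when at least the first node is requested as on-demand."""
    return aws_attributes.get("first_on_demand", 0) >= 1

cluster_fragment = {"aws_attributes": hybrid_aws_attributes}
print(driver_is_on_demand(hybrid_aws_attributes))
```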

Cost Optimization Best Practices

Strategy	Impact	Effort
Use Job Clusters instead of All-Purpose for production	3-4x cost reduction	Low
Enable auto-termination (10-30 min idle timeout)	Eliminates idle waste	Low
Use Spot Instances for workers	Up to 90% infra savings	Low
Enable autoscaling with proper min/max	Pay only for what you need	Low
Apply Cluster Policies to enforce guardrails	Prevents over-provisioning	Medium
Use latest LTS runtime	Performance improvements = shorter runtime	Low
Enable Photon for SQL/DataFrame workloads	2-4x speedup = shorter runtime	Low
Right-size instance types based on workload	Avoid paying for unused resources	Medium
Use Instance Pools for frequently used clusters	Faster startup, shared resources	Medium
Use SQL Warehouses for SQL workloads	Optimized pricing for SQL	Low
Use Serverless Compute where supported	No idle cost, auto-scaling	Low
Tag clusters for cost tracking	Visibility into spend by team/project	Low
Only use GPUs for GPU-accelerated workloads	Avoid unnecessary premium	Low
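
As an illustration of the tagging row, the sketch below groups DBU spend by a team tag. The usage records are invented sample data in a made-up shape, not the schema of any Databricks billing export, and `spend_by_tag` is a hypothetical helper. Only the `custom_tags` name mirrors the cluster-definition field.

```python
from collections import defaultdict

# Invented sample records: tags set on the cluster, DBUs consumed, and the rate paid.
usage_records = [
    {"custom_tags": {"team": "data-eng"}, "dbus": 120.0, "dbu_rate": 0.15},
    {"custom_tags": {"team": "analytics"}, "dbus": 40.0, "dbu_rate": 0.55},
    {"custom_tags": {"team": "data-eng"}, "dbus": 80.0, "dbu_rate": 0.15},
    {"custom_tags": {}, "dbus": 10.0, "dbu_rate": 0.55},
]

def spend_by_tag(records, tag_key):
    """Sum DBU fees per tag value; untagged usage is grouped separately."""
    totals = defaultdict(float)
    for record in records:
        owner = record["custom_tags"].get(tag_key, "untagged")
        totals[owner] += record["dbus"] * record["dbu_rate"]
    return dict(totals)

for owner, cost in spend_by_tag(usage_records, "team").items():
    print(f"{owner}: ${cost:.2f}")
```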

Summary Comparison Table

Feature	All-Purpose Cluster	Job Cluster	SQL Warehouse	Serverless Compute
Lifecycle	Manual start/stop	Auto-created, auto-terminated	Always-on or auto-suspend	On-demand, instant
Users	Multi-user	Single job	Multi-user (SQL)	Multi-user
Languages	Python, SQL, Scala, R	Python, SQL, Scala, R	SQL only	Python, SQL
Cost	~$0.55/DBU	~$0.15/DBU	~$0.70/DBU (serverless)	Varies
Best For	Dev, exploration	Production ETL	BI, analytics	Quick tasks, no config
Autoscaling	Yes	Yes	Yes	Automatic
Photon	Optional	Optional	Default	Default

Interview Question
