Compute - Clusters, Policies & Cost Optimization
Compute refers to the computing resources available to run your data engineering, data science, and analytics workloads. Every notebook, job, pipeline, or SQL query needs compute to execute.
A Spark cluster in Databricks consists of two types of nodes:
| Node | Role |
|---|---|
| Driver Node | The coordinator -- holds the SparkContext, plans execution (DAG), distributes tasks to workers, collects results. There is always exactly one driver per cluster. |
| Worker Nodes | The executors -- perform actual data processing in parallel, store intermediate data, return results to the driver. A cluster can have zero or more workers. |
               +-----------+
               |  Driver   |  <-- Coordinates, plans, collects results
               +-----------+
                 /   |   \
             /       |       \
    +--------+   +--------+   +--------+
    | Worker |   | Worker |   | Worker |  <-- Execute tasks in parallel
    +--------+   +--------+   +--------+
Types of Compute
Databricks broadly groups compute into three categories:
1. Classic Compute
a) All-Purpose Clusters (Interactive)
- Created manually via UI, CLI, or API
- Long-running, persistent -- stays alive until manually stopped or auto-terminated
- Supports Python, SQL, Scala, R
- Multiple users can attach simultaneously
- Use case: development, exploration, ad-hoc analysis, collaborative notebooks
- More expensive (~$0.55/DBU on Premium tier)
b) Job Clusters (Automated)
- Created automatically when a job is triggered
- Terminated automatically when the job completes
- Isolated per job run -- no shared state between jobs
- Use case: production ETL, scheduled batch processing, ML training
- Significantly cheaper (~$0.15/DBU on Premium tier)
2. SQL Warehouses
Compute resources specifically optimized for SQL analytics and BI workloads. They come with Photon enabled by default.
| Type | Description |
|---|---|
| Serverless SQL Warehouse | Recommended. Startup in 2-6 seconds, rapid autoscaling, fully managed. Best for BI, ETL, exploratory analysis. |
| Pro SQL Warehouse | Use when serverless is unavailable or custom networking is needed (e.g., federation, hybrid). |
| Classic SQL Warehouse | Entry-level. Basic interactive exploration when serverless or Pro are not options. |
3. Serverless Compute
Fully managed by Databricks, with no cluster configuration needed. Available for:
- Notebooks
- Jobs
- Delta Live Tables (Lakeflow Declarative Pipelines)
Benefits: faster startup, automatic scaling, lower operational overhead, no infrastructure management.
Instance Pools
An Instance Pool is a set of pre-provisioned, idle VMs ("warm" instances) that clusters can draw from, reducing startup time.
| Without Pools | With Pools |
|---|---|
| Cluster startup: 5-10 minutes (cold VM provisioning) | Cluster startup: seconds (VMs already warm) |
| Each cluster provisions fresh VMs | Shared pool of ready instances |
| Higher startup cost and latency | Lower latency, better resource reuse |
Key properties:
- Minimum idle instances (always warm)
- Maximum capacity (cost cap)
- Idle instance auto-termination
- Can be shared across multiple clusters and teams
Use when: teams run frequent short-lived clusters, need fast startup, or want to share resources.
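The key properties above map onto a pool definition roughly as follows. This is a minimal sketch: the field names follow the Databricks Instance Pools API as I understand it, while the pool name, node type, runtime version, and all numeric values are assumptions chosen for illustration.

```python
import json

# Field names follow the Databricks Instance Pools API; values are illustrative.
pool_spec = {
    "instance_pool_name": "shared-etl-pool",      # hypothetical name
    "node_type_id": "i3.xlarge",                  # assumed AWS node type
    "min_idle_instances": 2,                      # instances kept warm at all times
    "max_capacity": 20,                           # cap on idle + in-use instances
    "idle_instance_autotermination_minutes": 15,  # release idle VMs above the minimum
}

# A cluster draws from the pool by referencing its ID instead of a node type.
cluster_spec = {
    "cluster_name": "nightly-etl",                # hypothetical name
    "spark_version": "15.4.x-scala2.12",          # assumed LTS runtime version
    "instance_pool_id": "<pool-id>",              # placeholder for the ID returned at pool creation
    "num_workers": 4,
}

print(json.dumps(pool_spec, indent=2))
```

Note that the cluster spec carries no node type of its own; it inherits the instance type from the pool it references.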
Cluster Modes
| Mode | Workers | Use Case |
|---|---|---|
| Standard | 1 or more workers | General-purpose: ETL, batch, ML training. Shared JVM across jobs. |
| High Concurrency | 1 or more workers | Multi-user environments. Query isolation via Thrift Server. Fair scheduling pools. Credential passthrough support. Best for concurrent BI queries (Power BI, Tableau). |
| Single Node | 0 (driver only) | No workers. Spark runs locally on the driver. For lightweight dev, testing, small-scale ML, non-distributed workloads. |
Note: High Concurrency mode is being replaced by Access Modes in newer Databricks versions.
Access Modes
Access mode determines who can use the compute and what security features are enforced:
| Access Mode | Description |
|---|---|
| Standard (formerly Shared) | Shared by multiple users and groups. Enforces user-level and group-level data access permissions via Unity Catalog. Recommended for most workloads. |
| Dedicated (formerly Single User) | Dedicated to a single user. Supports features not available in Standard mode: RDD APIs, GPU instances, R language, Databricks Container Services. |
Cluster Configuration Options
Databricks Runtime (DBR)
The runtime is the set of core components that run on your cluster. Includes Apache Spark, pre-installed libraries, and Databricks optimizations.
| Runtime | Use Case |
|---|---|
| Standard Runtime | General data engineering |
| ML Runtime | Pre-installed ML libraries (TensorFlow, PyTorch, XGBoost, scikit-learn, MLflow) |
| Runtime for Genomics | Bioinformatics workloads |
| LTS (Long Term Support) | Recommended for production job clusters -- stability and tested compatibility |
| Latest (non-LTS) | Recommended for all-purpose clusters -- latest features and optimizations |
Photon Engine
Photon is Databricks' native vectorized query engine. It can accelerate SQL and DataFrame workloads by roughly 2-4x, depending on the workload.
- Written in C++ for high performance
- Enabled by default on SQL Warehouses
- Optional on clusters (toggle on during creation)
- Best for: SQL-heavy workloads, large aggregations, joins, data scans
- May not benefit: queries that already finish in under 2 seconds, pure Python/RDD workloads
Node Type Selection
| Instance Type | Best For |
|---|---|
| General Purpose | Balanced workloads, development, small ETL |
| Memory Optimized | ML, large in-memory datasets, heavy joins |
| Storage Optimized | Delta Lake caching, high I/O workloads |
| Compute Optimized | CPU-heavy ETL, complex transformations |
| GPU | Deep learning, model training (significantly more expensive) |
Autoscaling
Dynamically adds or removes worker nodes based on pending tasks in Spark's task queue.
- Min workers: baseline floor for SLA
- Max workers: budget ceiling
- When pending tasks exceed available cores, workers are added
- When workers sit idle, they are removed
Best practices:
- Narrow min-max range for latency-sensitive jobs (e.g., 4-8)
- Wider range for throughput-heavy batch (e.g., 2-50)
- Not beneficial for ultra-short jobs (startup overhead > benefit)
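The scaling rule described above can be illustrated with a toy model. This is not Databricks' actual autoscaling algorithm, which is not documented here; the function name `target_workers` and the assumption of one task per core are hypothetical simplifications.

```python
import math


def target_workers(pending_tasks: int, cores_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Toy model: size the cluster to cover pending tasks, clamped to [min, max]."""
    if cores_per_worker <= 0:
        raise ValueError("cores_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    # Assume one task occupies one core (a simplification).
    needed = math.ceil(pending_tasks / cores_per_worker)
    return max(min_workers, min(max_workers, needed))


# Throughput-heavy batch with a wide range (2-50): scales to fit the backlog.
print(target_workers(pending_tasks=100, cores_per_worker=4, min_workers=2, max_workers=50))   # 25
# Latency-sensitive job with a narrow range (4-8): capped at the budget ceiling.
print(target_workers(pending_tasks=1000, cores_per_worker=4, min_workers=4, max_workers=8))   # 8
# No pending work: falls back to the SLA floor.
print(target_workers(pending_tasks=0, cores_per_worker=4, min_workers=4, max_workers=8))      # 4
```

The clamp is the important part: the minimum protects the SLA when the queue is empty, and the maximum bounds spend no matter how large the backlog grows.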
Auto-Termination
Automatically shuts down a cluster after a period of inactivity.
- Default: 120 minutes for all-purpose clusters
- Recommended: 10-30 minutes for interactive clusters
- Job clusters auto-terminate by design when the job completes
- Critical for cost control -- idle clusters are the #1 source of waste
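To see why the idle timeout matters, here is a rough estimate of the DBU fees an all-purpose cluster accrues while idle before it shuts down. The DBU consumption of 10 DBUs/hour is a hypothetical figure; the $0.55/DBU rate is the approximate all-purpose rate quoted in this document. Cloud VM charges would come on top.

```python
def idle_dbu_cost(idle_minutes: float, dbus_per_hour: float, rate_per_dbu: float) -> float:
    """DBU fees accrued while a cluster sits idle before auto-termination."""
    return idle_minutes / 60 * dbus_per_hour * rate_per_dbu


DBUS_PER_HOUR = 10.0   # hypothetical cluster consumption
ALL_PURPOSE_RATE = 0.55  # approximate $/DBU from this document

default_timeout = idle_dbu_cost(120, DBUS_PER_HOUR, ALL_PURPOSE_RATE)
tuned_timeout = idle_dbu_cost(20, DBUS_PER_HOUR, ALL_PURPOSE_RATE)

print(f"120-minute timeout: ${default_timeout:.2f} per idle period")  # $11.00
print(f"20-minute timeout:  ${tuned_timeout:.2f} per idle period")    # $1.83
```

Under these assumptions, each time a user walks away from the cluster the shorter timeout saves about $9 in DBU fees alone.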
Tags
Custom key-value metadata attached to clusters for:
- Cost allocation and tracking (e.g., team: analytics, env: prod)
- Governance and compliance
- Filtering in cost reports
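A tagging standard is only useful if it is checked. The sketch below assumes a hypothetical convention that every cluster must carry `team` and `env` tags; `custom_tags` is the field the Clusters API uses for custom tags, while `REQUIRED_TAGS` and `missing_tags` are names invented for this example.

```python
REQUIRED_TAGS = frozenset({"team", "env"})  # hypothetical tagging standard


def missing_tags(cluster_spec: dict, required=REQUIRED_TAGS) -> set:
    """Return the required tag keys that the cluster spec does not define."""
    return set(required) - set(cluster_spec.get("custom_tags", {}))


tagged = {
    "cluster_name": "analytics-prod",  # hypothetical name
    "custom_tags": {"team": "analytics", "env": "prod"},
}
untagged = {"cluster_name": "scratch"}

print(sorted(missing_tags(tagged)))    # []
print(sorted(missing_tags(untagged)))  # ['env', 'team']
```

A check like this could run in CI against cluster definitions kept in source control, so untagged clusters are caught before they generate spend that cannot be attributed.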
Compute Policies
Policies are admin-defined rules that limit what users can configure when creating clusters.
Benefits
- Enforce cost controls (max nodes, instance types, auto-termination)
- Simplify cluster creation UI (fix/hide complex settings)
- Prevent over-provisioning
- Standardize configurations across teams
- Limit per-cluster maximum cost (DBUs/hour)
Policy Families (Built-in Templates)
| Family | Purpose |
|---|---|
| Personal Compute | Single-user, single-node development |
| Shared Compute | Multi-user interactive workloads |
| Job Compute | Automated production pipelines |
| Power User Compute | Advanced users needing custom configs |
| Legacy Shared Compute | Backward compatibility |
Policies can be defined via UI dropdown menus or raw JSON. Libraries can also be attached to policies (auto-installed on cluster creation).
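As an illustration of the raw JSON form, the sketch below builds a policy definition covering several of the cost controls listed above. The attribute paths and rule types (`range`, `allowlist`, `fixed`) follow the Databricks compute policy definition format as I understand it; the specific limits, node types, and tag value are assumptions.

```python
import json

# Each key is a cluster attribute path; each value constrains what users may set.
policy_definition = {
    # Idle timeout must be between 10 and 30 minutes, defaulting to 20.
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 30, "defaultValue": 20,
    },
    # Cap cluster size to prevent over-provisioning.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Restrict instance types to an approved list (assumed AWS node types).
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],
        "defaultValue": "i3.xlarge",
    },
    # Limit the per-cluster maximum cost in DBUs per hour.
    "dbus_per_hour": {"type": "range", "maxValue": 50},
    # Force a cost-allocation tag and hide it from the creation form.
    "custom_tags.team": {"type": "fixed", "value": "analytics", "hidden": True},
}

policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```

Fixed and hidden attributes are what simplify the creation UI: users never see the setting, and the policy supplies the value.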
Pricing -- The Dual-Bill Model
Databricks cost consists of two separate bills:
Total Cost = Databricks DBU Fees + Cloud Infrastructure Fees
What is a DBU?
A Databricks Unit (DBU) is a normalized unit of processing power, billed per hour of use. It is not tied to a specific VM size -- Databricks normalizes consumption across instance types, so larger or more powerful instances consume more DBUs per hour.
Approximate DBU Rates (Premium Tier, 2026)
| Compute Type | AWS Rate | Azure Rate | Best For |
|---|---|---|---|
| Jobs Compute | ~$0.15/DBU | ~$0.15-0.20/DBU | Scheduled production workloads |
| All-Purpose Compute | ~$0.55/DBU | ~$0.55-0.65/DBU | Interactive notebooks, dev |
| SQL Serverless | ~$0.70/DBU | ~$0.70/DBU | Serverless SQL analytics |
| DLT Core | ~$0.15/DBU | ~$0.15/DBU | Basic pipelines |
| DLT Pro | ~$0.25/DBU | ~$0.25/DBU | CDC + data quality |
| DLT Advanced | ~$0.36/DBU | ~$0.36/DBU | Complex pipelines |
Key insight: There is a 3-4x gap in the per-DBU rate between All-Purpose Compute and Jobs Compute (~$0.55 vs. ~$0.15 on AWS). This is the single biggest lever for cost reduction on the Databricks side of the bill.
The cloud infrastructure bill (VMs, storage, networking) is charged separately by your cloud provider (Azure, AWS, GCP).
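The dual-bill formula can be worked through with a small calculation. The per-DBU rates are the approximate figures from the table above; the workload shape (2 hours, 8 DBUs/hour, 5 VMs at $0.40/hour) is entirely hypothetical.

```python
def total_cost(hours: float, dbus_per_hour: float, rate_per_dbu: float,
               vm_count: int, vm_price_per_hour: float) -> float:
    """Total Cost = Databricks DBU Fees + Cloud Infrastructure Fees."""
    dbu_fees = hours * dbus_per_hour * rate_per_dbu
    infra_fees = hours * vm_count * vm_price_per_hour
    return dbu_fees + infra_fees


# Hypothetical workload: identical hardware and runtime, different compute type.
workload = dict(hours=2, dbus_per_hour=8, vm_count=5, vm_price_per_hour=0.40)

jobs = total_cost(rate_per_dbu=0.15, **workload)
all_purpose = total_cost(rate_per_dbu=0.55, **workload)

print(f"Jobs Compute:        ${jobs:.2f}")         # $6.40
print(f"All-Purpose Compute: ${all_purpose:.2f}")  # $12.80
```

In this example the total bill doubles rather than growing 3-4x, because the cloud infrastructure portion is the same either way. The rate gap applies only to the DBU fees, so the realized saving depends on how the two bills are split for a given workload.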
Spot Instances
| Type | Cost Savings | Availability | Eviction Risk |
|---|---|---|---|
| On-Demand | Baseline | Always available | None |
| Reserved | Up to 75% savings | Committed 1-3 years | None |
| Spot | Up to 90% savings | Surplus capacity | High -- 30 sec to 2 min notice |
When to Use Spot Instances
- Fault-tolerant batch ETL jobs
- Non-critical dev/test clusters
- ML training with checkpointing
- Worker nodes (never use spot for the driver node)
Best Practice: Hybrid Approach
- Driver node: Always on-demand (no eviction risk)
- Worker nodes: Spot instances (cost savings, tolerable if evicted)
- Configure fallback to on-demand if spot capacity is unavailable
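On AWS, the hybrid approach might be expressed in a cluster spec as sketched below. The `aws_attributes` fields (`first_on_demand`, `availability`, `spot_bid_price_percent`) follow the Databricks Clusters API as I understand it; the helper name `hybrid_spot_cluster`, the node type, and the runtime version are assumptions. Azure and GCP use different attribute blocks that are not shown here.

```python
def hybrid_spot_cluster(name: str, min_workers: int, max_workers: int) -> dict:
    """Build a cluster spec with an on-demand driver and spot workers (AWS)."""
    return {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",   # assumed LTS runtime version
        "node_type_id": "i3.xlarge",           # assumed AWS node type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # The first node placed is the driver, so 1 keeps it on-demand.
            "first_on_demand": 1,
            # Use spot for the rest; fall back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
            # Bid up to 100% of the on-demand price.
            "spot_bid_price_percent": 100,
        },
    }


spec = hybrid_spot_cluster("nightly-etl", min_workers=2, max_workers=10)
print(spec["aws_attributes"])
```

Keeping the driver on-demand matters because losing it ends the whole job, whereas a reclaimed worker only forces Spark to recompute that worker's tasks.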
Cost Optimization Best Practices
| Strategy | Impact | Effort |
|---|---|---|
| Use Job Clusters instead of All-Purpose for production | 3-4x cost reduction | Low |
| Enable auto-termination (10-30 min idle timeout) | Eliminates idle waste | Low |
| Use Spot Instances for workers | Up to 90% infra savings | Low |
| Enable autoscaling with proper min/max | Pay only for what you need | Low |
| Apply Cluster Policies to enforce guardrails | Prevents over-provisioning | Medium |
| Use latest LTS runtime | Performance improvements = shorter runtime | Low |
| Enable Photon for SQL/DataFrame workloads | 2-4x speedup = shorter runtime | Low |
| Right-size instance types based on workload | Avoid paying for unused resources | Medium |
| Use Instance Pools for frequently used clusters | Faster startup, shared resources | Medium |
| Use SQL Warehouses for SQL workloads | Optimized pricing for SQL | Low |
| Use Serverless Compute where supported | No idle cost, auto-scaling | Low |
| Tag clusters for cost tracking | Visibility into spend by team/project | Low |
| Only use GPUs for GPU-accelerated workloads | Avoid unnecessary premium | Low |
Summary Comparison Table
| Feature | All-Purpose Cluster | Job Cluster | SQL Warehouse | Serverless Compute |
|---|---|---|---|---|
| Lifecycle | Manual start/stop | Auto-created, auto-terminated | Always-on or auto-suspend | On-demand, instant |
| Users | Multi-user | Single job | Multi-user (SQL) | Multi-user |
| Languages | Python, SQL, Scala, R | Python, SQL, Scala, R | SQL only | Python, SQL |
| Cost | ~$0.55/DBU | ~$0.15/DBU | ~$0.70/DBU (serverless) | Varies |
| Best For | Dev, exploration | Production ETL | BI, analytics | Quick tasks, no config |
| Autoscaling | Yes | Yes | Yes | Automatic |
| Photon | Optional | Optional | Default | Default |