Databricks Architecture

At its core, Databricks follows a two-layer architecture:

  1. Control Plane (Managed by Databricks)
  2. Compute Plane (Customer's Cloud or Serverless)

Control Plane

The Control Plane runs as a SaaS offering fully managed by Databricks. It handles:

| Component | Purpose |
| --- | --- |
| Web Application / Workspace | UI for managing notebooks, clusters, jobs, and assets |
| Unity Catalog & Metastore | Data governance, access control, lineage tracking; stores metadata (table structures, schemas, partitions) |
| Access Control & Security | IAM, RBAC for managing user identities and permissions |
| Workflows / Job Scheduler | Automates job execution for data pipelines |
| Mosaic AI | Suite of AI/ML tools for advanced analytics |
| Git & CI/CD Integration | Version control via Git repos; CI/CD workflows for deployment |
| Notebooks & DBSQL | Collaborative coding (Python, Scala, SQL, R); serverless SQL engine for big data queries |

Compute Plane

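The Compute Plane is where the actual data processing and storage happen. It comes in two flavors: compute that runs in the customer's own cloud account, and serverless compute managed by Databricks.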

Key components in the Compute Plane


Core Platform Components

| Component | What It Does |
| --- | --- |
| Delta Lake | Open-source storage layer, ACID transactions, schema enforcement/evolution, time travel, streaming + batch on one layer |
| Unity Catalog | Centralized governance, fine-grained access control, auditing, lineage, data privacy across all workspaces |
| Delta Live Tables (DLT) | Declarative ETL framework, build reliable pipelines in SQL or Python with built-in data quality checks and lineage |
| Databricks SQL | Run interactive SQL queries; connect to Power BI, Tableau, Looker for dashboarding and reporting |
| MLflow | End-to-end ML lifecycle, experiment tracking, model registry, versioning, deployment |
| Workflows (Jobs) | Schedule and orchestrate ETL, ML training, and maintenance tasks with alerts, retries, and dependencies |
| Notebooks | Interactive, collaborative coding in Python, SQL, Scala, R with real-time visualizations |
| Repos | Git integration (GitHub, Azure DevOps, GitLab) for version control and CI/CD |
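
To make "dependencies" and "retries" in the Workflows row concrete, here is a rough sketch in plain Python. It is not the Databricks Jobs API: the task names, the `run_with_retries` helper, and the simulated transient failure are all made up for illustration. It only shows the general idea of running tasks in dependency order and retrying a flaky one.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "train_model": {"clean"},
    "refresh_dashboard": {"clean"},
}

# Counter used to simulate one transient failure in the "clean" task.
calls = {"clean": 0}


def make_action(name):
    """Return a callable standing in for the real work of a task."""
    def action():
        if name == "clean":
            calls["clean"] += 1
            if calls["clean"] == 1:
                raise RuntimeError("transient failure")
    return action


def run_with_retries(action, max_retries=2):
    """Call action, retrying on RuntimeError; return the number of attempts used."""
    for attempt in range(1, max_retries + 2):
        try:
            action()
            return attempt
        except RuntimeError:
            if attempt > max_retries:
                raise  # retries exhausted, surface the failure


def run_pipeline(deps, max_retries=2):
    """Run every task after its dependencies; return the order and attempt counts."""
    order = list(TopologicalSorter(deps).static_order())
    attempts = {task: run_with_retries(make_action(task), max_retries) for task in order}
    return order, attempts


order, attempts = run_pipeline(dependencies)
print(order)
print(attempts)
```

In this toy run, `ingest` always comes first and `clean` second, and `clean` succeeds on its second attempt. The relative order of `train_model` and `refresh_dashboard` is not significant because neither depends on the other.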

Medallion Architecture (Data Organization Pattern)

Databricks recommends the Medallion Architecture for organizing data into layers of progressively higher quality:

Raw Sources --> [Bronze] --> [Silver] --> [Gold] --> BI / ML / Apps

| Layer | Purpose | Example |
| --- | --- | --- |
| Bronze | Raw ingested data (as-is from source) | Raw CSV, JSON, logs, CDC streams |
| Silver | Cleaned, validated, enriched data | Deduped records, joined tables, type-casted columns |
| Gold | Aggregated, business-ready data | KPI dashboards, summary tables, ML feature tables |
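
The Bronze, Silver, and Gold steps can be sketched on a handful of in-memory records. This is a minimal illustration in plain Python, not Spark or Delta Lake code. The order records, the field names, and the `to_silver` / `to_gold` helpers are assumptions invented for the example. On Databricks, each layer would typically be a Delta table rather than a Python list.

```python
from collections import defaultdict

# Bronze: hypothetical raw order events, kept exactly as ingested (all strings).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "1", "region": "EU", "amount": "10.50"},  # duplicate record
    {"order_id": "2", "region": "US", "amount": "n/a"},    # fails validation
    {"order_id": "3", "region": "US", "amount": "4.00"},
    {"order_id": "4", "region": "EU", "amount": "5.25"},
]


def to_silver(rows):
    """Silver: deduplicate on order_id, cast types, drop rows that fail validation."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["order_id"] in seen:
            continue  # skip duplicates of an already accepted order
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount is not numeric, so the row is rejected
        seen.add(row["order_id"])
        cleaned.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return cleaned


def to_gold(rows):
    """Gold: aggregate to a business-ready summary (total amount per region)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)  # {'EU': 15.75, 'US': 4.0}
```

Bronze is left untouched, so the Silver and Gold steps can be rerun from the raw data if the cleaning rules change.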

Object Hierarchy

Databricks organizes resources in a clear hierarchy.
Pasted image 20260516145233.png

Pasted image 20260528115820.png
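
One practical consequence of a hierarchy like this is how objects get addressed. The sketch below assumes the hierarchy in question is Unity Catalog's catalog > schema > table nesting, where a table is referenced by a three-part name. The `TableRef` type, the `parse_table_name` helper, and the name `main.sales.orders` are hypothetical and exist only for illustration. Backtick-quoted identifiers containing dots are not handled.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """One table addressed by its position in the catalog > schema > table nesting."""
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name: str) -> TableRef:
    """Split a 'catalog.schema.table' string into its three levels."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableRef(*parts)


ref = parse_table_name("main.sales.orders")
print(ref.catalog, ref.schema, ref.table)
```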


High-Level Architecture Diagram (Simplified)

download.png


#todo #databricks