Databricks Introduction
All my notes on Databricks are linked here, so that I have one place to go when I want to revise or brush up on a concept.
Prerequisites:
Table of Contents
- Introduction
- Databricks Architecture
- Workspace, Notebooks & Azure Integration
- Compute - Clusters, Policies & Cost Optimization
- Monitoring & Debugging - Spark UI, execution plans, debugging failed jobs
- Delta Lake Deep Dive
- DLT / Lakeflow Spark Declarative Pipelines
Start with the introduction, which covers what Databricks is, which problems it addresses, and how it addresses them. Then read about the high-level architecture of Databricks, along with the workspace architecture and its two workspace types, classic and serverless.
For more on workspaces read here.
After workspaces comes another important component, Unity Catalog, which covers what data is governed and how it is managed.
Note: While making notes and working on projects, I will add anything that seems worth asking as an interview question here. Questions that I was asked in interviews, or that I found online while preparing for them, are here.
Introduction
Databricks is a cloud-based unified data and AI platform built on Apache Spark that implements the Lakehouse architecture. It provides a single environment for data engineering, analytics, BI, and machine learning.
Why Databricks exists
In traditional data architectures, organizations relied on multiple separate tools such as Data Warehouses, Data Lakes, ETL tools, orchestration systems, governance frameworks, and BI tools.
The main challenge was managing and integrating these tools, which led to:
- Vendor lock-in: data is stored in proprietary formats and tightly coupled to a specific processing engine, which makes it difficult to migrate or to access from outside the platform
- High engineering complexity
- Multiple integration points
- Increased maintenance overhead
- Slower development and debugging
What Databricks Provides
Databricks offers a unified data platform where storage, processing, governance, and analytics are seamlessly integrated, eliminating the need to manage multiple disconnected systems. This allows engineers to build and manage end-to-end data workflows in a single environment.
At its core, Databricks leverages the Data Lakehouse architecture, which combines the capabilities of data lakes and data warehouses into a single system. By doing so, it removes the need for separate platforms, reduces data duplication, and enables a single source of truth for both analytics and machine learning workloads.
In Databricks, the Data Lakehouse is powered by Delta Lake, an open-source storage layer. Tables stored in this format are called Delta tables.
Glossary
External Location
An external location is a securable object that combines a storage path with a storage credential that authorizes access to that path.
An external location's creator is its initial owner. An external location's owner and users with the MANAGE privilege can modify the external location's name, URI, and storage credential.
After an external location is created, you can grant access to it to account-level principals (users and groups).
A user or group with permission to use an external location can access any storage path within the location's path without direct access to the storage credential.
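The relationship between the storage path, the storage credential, and the grant can be sketched with a few helpers. This is a minimal sketch, not an official client: the helper names, the location name `raw_zone`, the credential `example_cred`, the group `data_engineers`, and the `abfss://` URL are all hypothetical. The generated statements follow the Databricks SQL `CREATE EXTERNAL LOCATION` and `GRANT ... ON EXTERNAL LOCATION` forms, and `path_is_covered` is only a simplified model of "any storage path within the location's path".

```python
def create_external_location_sql(name: str, url: str, credential: str) -> str:
    """Build the SQL that pairs a storage path with a storage credential."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential}`)"
    )


def grant_read_files_sql(name: str, principal: str) -> str:
    """Build the SQL that lets a user or group read files under the location."""
    return f"GRANT READ FILES ON EXTERNAL LOCATION `{name}` TO `{principal}`"


def path_is_covered(path: str, location_url: str) -> bool:
    """Simplified model: a path is covered if it is the location root or below it."""
    root = location_url.rstrip("/")
    return path.rstrip("/") == root or path.startswith(root + "/")


# Hypothetical values, for illustration only.
LOCATION_URL = "abfss://raw@examplestorage.dfs.core.windows.net/landing"

create_stmt = create_external_location_sql("raw_zone", LOCATION_URL, "example_cred")
grant_stmt = grant_read_files_sql("raw_zone", "data_engineers")

print(create_stmt)
print(grant_stmt)
print(path_is_covered(LOCATION_URL + "/2024/orders.csv", LOCATION_URL))
```

In a notebook, the two statements would typically be passed to `spark.sql(...)`. The grantee never sees the storage credential itself, only the external location.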
Jobs
In Databricks, a job is used to orchestrate and schedule tasks in a workflow. Common data processing workflows include ETL workflows, running notebooks and ML workflows, as well as integrating external systems like dbt.
A job consists of one or more tasks and supports custom control-flow logic such as branching (if/else) and looping (for each).
Task
A task is a specific unit of work within a job. Each task performs one kind of operation, for example running a notebook (a notebook task), a pipeline (a pipeline task), or a Python script (a Python script task), among many others.
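To make the job and task relationship concrete, here is a minimal sketch of a job definition as a Python dictionary, modeled on the shape of a Jobs API create-job payload (`tasks`, `task_key`, `depends_on`, `notebook_task`, `pipeline_task`, `spark_python_task`). The job name, paths, and pipeline ID are hypothetical placeholders, and `execution_order` is an illustrative helper, not part of any Databricks library.

```python
from graphlib import TopologicalSorter

# Hypothetical job with three tasks of different types.
job_settings = {
    "name": "daily_sales_etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/etl/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},
        },
        {
            "task_key": "report",
            "depends_on": [{"task_key": "transform"}],
            "spark_python_task": {"python_file": "/Workspace/etl/report.py"},
        },
    ],
}


def execution_order(settings: dict) -> list[str]:
    """Return one valid run order in which every task follows its dependencies."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in settings["tasks"]
    }
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(job_settings))
```

The helper only shows how `depends_on` turns a flat list of tasks into a workflow. In a real job, tasks that do not depend on each other can run in parallel.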
Trigger
A trigger is a mechanism that starts a job run based on specific conditions or events. A trigger can be time-based, such as running a job at a scheduled time (for example, every day at 2 AM), or event-based, such as running a job when new data arrives in cloud storage.
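The two kinds of trigger can be sketched as fragments of a job definition. This assumes the Jobs API field names `schedule` (with `quartz_cron_expression` and `timezone_id`) and `trigger.file_arrival` (with `url`). The storage URL is a hypothetical placeholder, and `trigger_kind` is an illustrative helper rather than a Databricks function.

```python
# Time-based: Quartz cron fields are
# seconds, minutes, hours, day-of-month, month, day-of-week.
# "0 0 2 * * ?" means 02:00:00 every day.
scheduled_job = {
    "name": "nightly_load",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}

# Event-based: start a run when new files land under a storage path.
# The URL below is a hypothetical example.
file_arrival_job = {
    "name": "on_new_files",
    "trigger": {
        "file_arrival": {
            "url": "abfss://raw@examplestorage.dfs.core.windows.net/landing/",
        },
    },
}

# No schedule and no trigger: the job only runs when started by hand or by API.
manual_job = {"name": "ad_hoc_backfill"}


def trigger_kind(settings: dict) -> str:
    """Classify a job definition by how its runs are started."""
    if "schedule" in settings:
        return "time-based"
    if "file_arrival" in settings.get("trigger", {}):
        return "event-based"
    return "manual"


for job in (scheduled_job, file_arrival_job, manual_job):
    print(job["name"], "->", trigger_kind(job))
```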
Workflow
Points to Remember
- In Databricks, schemas are sometimes called databases. For example, CREATE DATABASE is an alias for CREATE SCHEMA.
Scratchpad
- What is Lakeflow Spark Declarative Pipelines
- What are Lakehouse Federation and Lakeflow Connect
- COPY INTO, Auto Loader