Databricks Introduction

All my notes about Databricks will be linked here, so that when I want to revise or brush up on the concepts I do not have to search for them.

Prerequisites:

  1. Data Lakehouse

Table of Contents

  1. Introduction
  2. Databricks Architecture
  3. Workspace, Notebooks & Azure Integration
  4. Compute - Clusters, Policies & Cost Optimization
  5. Monitoring & Debugging - Spark UI, execution plans, debugging failed jobs
  6. Delta Lake Deep Dive
  7. DLT / Lakeflow Spark Declarative Pipelines

Start with the introduction, which covers what Databricks is, which problems it solves, and how it solves them. Then read about the high-level architecture of Databricks, the workspace architecture, and the architecture of the two workspace types: classic and serverless.

For more on workspaces read here.

After workspaces comes another important component, Unity Catalog, which covers which data is managed and how.

Note: While making notes and working on projects, if I think something is worth an interview question I will add it here. Questions that were asked in interviews, or that I found on the internet while preparing for interviews, are here.


Introduction

Databricks is a cloud-based unified data and AI platform built on Apache Spark that implements the Lakehouse architecture. It provides a single environment for data engineering, analytics, BI, and machine learning.

Why Databricks exists

In traditional data architectures, organizations relied on multiple separate tools such as Data Warehouses, Data Lakes, ETL tools, orchestration systems, governance frameworks, and BI tools.

The main challenge was managing and integrating these tools, which led to:

What Databricks Provides

Databricks offers a unified data platform where storage, processing, governance, and analytics are seamlessly integrated, eliminating the need to manage multiple disconnected systems. This allows engineers to build and manage end-to-end data workflows in a single environment.

At its core, Databricks leverages the Data Lakehouse architecture, which combines the capabilities of data lakes and data warehouses into a single system. By doing so, it removes the need for separate platforms, reduces data duplication, and enables a single source of truth for both analytics and machine learning workloads.

In Databricks, the Data Lakehouse is powered by Delta tables, which are stored in Delta Lake (an open-source storage format).


Glossary

External Location

An external location is a securable object that combines a storage path with a storage credential that authorizes access to that path.
An external location's creator is its initial owner. An external location's owner and users with the MANAGE privilege can modify the external location's name, URI, and storage credential.
After an external location is created, you can grant access to it to account-level principals (users and groups).
A user or group with permission to use an external location can access any storage path within the location's path without direct access to the storage credential.
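The "any storage path within the location's path" rule can be pictured with a small path check. This is only a conceptual sketch in plain Python, not how Unity Catalog enforces access; the function name `is_within_location` and the bucket names are hypothetical.

```python
from urllib.parse import urlparse


def is_within_location(location_url: str, path: str) -> bool:
    """Return True if `path` is the location itself or lies beneath it."""
    loc, target = urlparse(location_url), urlparse(path)
    # Scheme and bucket/container must match exactly.
    if (loc.scheme, loc.netloc) != (target.scheme, target.netloc):
        return False
    loc_parts = [p for p in loc.path.split("/") if p]
    target_parts = [p for p in target.path.split("/") if p]
    # Compare whole path segments so "/raw" does not match "/raw_archive".
    return target_parts[: len(loc_parts)] == loc_parts


# Hypothetical external location and candidate paths.
location = "s3://example-bucket/raw"
covered = is_within_location(location, "s3://example-bucket/raw/sales/2024.parquet")
sibling = is_within_location(location, "s3://example-bucket/raw_archive/old.parquet")
```

Comparing whole path segments rather than raw string prefixes is a deliberate choice here: a sibling directory that merely shares a name prefix is not treated as part of the location.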

Jobs

In Databricks, a job is used to orchestrate and schedule tasks in a workflow. Common data processing workflows include ETL workflows, running notebooks and ML workflows, as well as integrating external systems like dbt.
A job consists of one or more tasks and supports custom control-flow logic such as branching (if/else) and looping.

Task

A task is a specific unit of work within a job. Each task can perform one of a variety of operations, such as running a notebook (a notebook task), a pipeline (a pipeline task), or a Python script (a Python script task), among many others.
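One way to picture the relationship between a job and its tasks is as a small dependency graph. The sketch below uses only the Python standard library (not the Databricks Jobs API), and the task names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task name maps to the set of tasks it depends on.
job = {
    "ingest": set(),
    "transform": {"ingest"},
    "train_model": {"transform"},
    "refresh_dashboard": {"transform"},
}


def run_order(tasks: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order in which dependencies run first."""
    return list(TopologicalSorter(tasks).static_order())


order = run_order(job)
```

Any order returned respects the dependencies: `ingest` runs before `transform`, and `transform` runs before both downstream tasks. The relative order of the two downstream tasks is not fixed, since neither depends on the other.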

Trigger

A trigger is a mechanism that initiates a job run based on specific conditions or events. A trigger can be time-based, such as running a job at a scheduled time (for example, every day at 2 AM), or event-based, such as running a job when new data arrives in cloud storage.
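The time-based case can be sketched as a small calculation: given the current time, when does a "daily at 2 AM" trigger fire next? This is an illustration in plain Python, assuming naive (timezone-free) datetimes; the helper `next_daily_run` is hypothetical and is not part of any Databricks API.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, at: time = time(2, 0)) -> datetime:
    """Return the next moment strictly after `now` that falls at time `at`."""
    candidate = datetime.combine(now.date(), at)
    # If today's slot is already reached or passed, schedule tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


before_slot = next_daily_run(datetime(2024, 1, 1, 1, 30))
after_slot = next_daily_run(datetime(2024, 1, 1, 15, 0))
```

An event-based trigger would replace the clock comparison with a check for a condition, such as new files appearing under a storage path.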

Workflow


Points to Remember

  1. In Databricks, schemas are sometimes called databases. For example, CREATE DATABASE is an alias for CREATE SCHEMA.

Scratchpad

  1. What is Lakeflow Spark Declarative Pipelines
  2. What is lakehouse federation and lakehouse connect
  3. COPY INTO, Auto Loader

References


#databricks #data-engineering #interview