# Role Deep Dive: Azure Data Engineer

---

## Role Overview

Azure Data Engineers design and implement data pipelines, data storage, and data processing solutions on Azure. They build the infrastructure that moves data from source to destination — transforming, cleaning, and organizing it along the way. They are the plumbers of the data world.

**Alternative Titles:** Data Engineer, Cloud Data Engineer, ETL Engineer, Data Platform Engineer

**Typical Salary Range:** $100,000 – $165,000 (US)

---

## Core Responsibilities

### 1. Data Pipeline Design & Implementation (30% of role)
- Design ETL/ELT pipelines with Data Factory and Synapse Pipelines
- Implement real-time streaming with Stream Analytics and Event Hubs
- Build batch processing with Databricks and Synapse Spark
- Design data orchestration and scheduling
- Handle data quality and validation

**Granular Tasks:**
- **Data Factory Pipeline:**
  - Linked services: connections to sources (SQL Server, Blob, API, SFTP) and sinks (Synapse, ADLS, Cosmos DB)
  - Datasets: define data structure (schema, format, compression)
  - Activities: Copy Data (move data), Get Metadata (check existence), ForEach (iterate), If Condition (branch), Stored Procedure, Databricks Notebook
  - Triggers: Schedule (daily at 2 AM), Tumbling Window (hourly), Event-based (blob created)
  - Parameters: make pipelines reusable across environments
  - Error handling: set dependency conditions (on success, on failure, on completion)
  - Logging: pipeline run history, activity output, error messages

- **Streaming Pipeline:**
  - Source: IoT Hub / Event Hubs
  - Processing: Stream Analytics (SQL-like queries, windowing functions)
  - Sink: Cosmos DB (real-time), Blob/ADLS (archival), Power BI (dashboard), Synapse (analytics)
  - Windowing: Tumbling (fixed, non-overlapping), Hopping (fixed, overlapping), Session (variable, gap-based)

- **Batch Processing (Databricks):**
  - Read from ADLS (Bronze layer — raw data)
  - Transform with Spark (Silver layer — cleaned, deduplicated)
  - Aggregate and serve (Gold layer — business-ready aggregates)
  - Delta Lake: ACID transactions, MERGE (upsert), time travel
  - Optimize: Z-ORDER, VACUUM, OPTIMIZE

### 2. Data Storage Design (25% of role)
- Design data lake architecture (ADLS Gen2)
- Implement medallion architecture (Bronze/Silver/Gold)
- Design data warehouse (Synapse dedicated SQL pool)
- Choose appropriate storage (ADLS vs SQL vs Cosmos DB vs Redis)
- Implement data partitioning and organization

**Granular Tasks:**
- **Medallion Architecture:**
  - Bronze (Raw): ingest all data as-is. Partition by source/date. Parquet/Delta format.
  - Silver (Cleaned): deduplicated, validated, conformed. Schema enforced. Delta format.
  - Gold (Business): aggregated, business-ready. Star schema. Delta format. Serve to BI.
  - Each layer in separate ADLS container or folder hierarchy

- **ADLS Gen2 Organization:**
  ```
  /raw/{source}/{year}/{month}/{day}/  (Bronze)
  /cleaned/{source}/{year}/{month}/    (Silver)
  /curated/{domain}/{table}/           (Gold)
  /staging/                            (temporary)
  /archive/                            (old data, cold tier)
  ```

- **Synapse Dedicated SQL Pool Design:**
  - Table types: Clustered Columnstore (default for analytics), Heap (staging), Clustered Index (lookup-heavy)
  - Distribution: Hash (distribute by join key), Round Robin (even distribution), Replicated (small dimension tables)
  - Partitioning: partition large fact tables by date (daily/monthly)
  - PolyBase: load data from ADLS directly (fastest loading method)
  - CTAS (Create Table As Select): standard pattern for data transformation

### 3. Data Integration & Orchestration (15% of role)
- Integrate diverse data sources
- Implement change data capture (CDC)
- Design data refresh strategies
- Orchestrate complex data workflows

**Granular Tasks:**
- CDC patterns: read change feed (Cosmos DB), query CDC tables (SQL Server), use watermark columns (UpdatedDate)
- Incremental load: only process new/changed data (more efficient than full load)
- Full load: truncate and reload (simple but slower, use for small tables)
- SCD (Slowly Changing Dimension): Type 1 (overwrite), Type 2 (add row with valid dates), Type 3 (add column)
- Orchestrate: Data Factory pipeline chains activities, handles dependencies, retries on failure
- Self-hosted Integration Runtime: for on-prem data sources (install on on-prem server)

### 4. Data Quality & Governance (15% of role)
- Implement data validation rules
- Design data quality checks in pipelines
- Implement Microsoft Purview for governance
- Track data lineage

**Granular Tasks:**
- Data quality checks in pipeline:
  - Row count validation (source count ≈ destination count)
  - Null check on required columns
  - Schema validation (column exists, correct type)
  - Business rule validation (values within expected range)
  - Referential integrity (foreign keys exist)
- On failure: alert team, quarantine bad data, don't load to production
- Microsoft Purview: scan data sources, auto-classify sensitive data, track lineage (source → transform → destination)
- Data catalog: document datasets, owners, freshness, quality score

### 5. Data Processing with Databricks (15% of role)
- Write Spark jobs in Python/Scala/SQL
- Implement Delta Lake operations
- Optimize Spark performance
- Schedule and orchestrate jobs

**Granular Tasks:**
- Delta Lake operations:
  - MERGE INTO target USING source ON target.id = source.id WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT
  - Time travel: SELECT * FROM table VERSION AS OF 5
  - VACUUM: remove old versions (retain 168 hours minimum)
  - OPTIMIZE: compact small files into larger ones
  - Z-ORDER: cluster data by frequently filtered columns
- Spark optimization:
  - Partition data by frequently filtered column
  - Use broadcast joins for small-large table joins
  - Cache frequently accessed DataFrames
  - Avoid UDFs when possible (use built-in functions)
  - Right-size cluster: auto-scale, use job clusters (ephemeral)

---

## Data Architecture Patterns

### Pattern 1: Lambda Architecture (Batch + Real-time)
```
Source → Event Hubs → Stream Analytics → Cosmos DB (real-time serving)
                                    → ADLS (real-time archival)
Source → Data Factory → ADLS (Bronze) → Databricks (Silver) → Synapse (Gold) → Power BI
```

### Pattern 2: Modern Data Warehouse
```
Sources → Data Factory → ADLS Gen2 (raw) → Synapse/Spark (transform) → Synapse SQL (serve) → Power BI
```

### Pattern 3: Real-time Analytics
```
IoT Devices → IoT Hub → Event Hubs → Stream Analytics → Cosmos DB (hot path)
                                                  → ADLS (cold path) → Synapse (batch analytics)
```

---

## Certification Path

| Certification | Level | Focus |
|---|---|---|
| **DP-900** | Foundational | Azure Data fundamentals |
| **DP-203** | Associate | **Core cert** — Azure Data Engineer |
| **DP-300** | Associate | Azure Database Administrator |

### DP-203 Exam Breakdown
| Domain | Weight |
|---|---|
| Design and implement data storage | 10-15% |
| Design and develop data processing | 25-30% |
| Design and implement data security | 10-15% |
| Monitor and optimize data storage and data processing | 10-15% |
| Design and implement data governance | 10-15% |

---

## Interview Focus Areas

1. **Design a data pipeline from on-prem SQL to Azure.**
   → Self-hosted IR → Data Factory → Copy Activity → ADLS (Bronze) → Databricks (Silver/Gold) → Synapse SQL → Power BI. Incremental load using watermark.

2. **Explain the medallion architecture.**
   → Bronze (raw, as-is), Silver (cleaned, validated, conformed), Gold (business aggregates, star schema). Each layer adds quality and structure.

3. **How do you handle incremental data loads?**
   → Watermark column (UpdatedDate), store last watermark value, next run reads > last watermark. CDC for real-time. Delta Lake MERGE for upserts.

4. **When to use Synapse vs Databricks?**
   → Synapse: integrated platform (SQL + Spark + Pipelines), good for SQL-heavy teams. Databricks: superior Spark experience, Delta Lake native, collaborative notebooks, better for data science. Many use both.

5. **How do you ensure data quality in pipelines?**
   → Validation at each layer: row counts, null checks, schema validation, business rules. Quarantine bad data. Alert on failures. Track quality metrics over time.

6. **What is Delta Lake and why use it?**
   → Storage layer on data lake adding ACID transactions, time travel, schema enforcement, MERGE (upsert). Solves data reliability. Default in Databricks.

7. **How do you optimize Synapse dedicated SQL pool?**
   → Choose correct distribution (Hash for large facts, Replicated for small dims), use Columnstore indexes, partition by date, use PolyBase for loading, CTAS for transforms, avoid data movement.

8. **How do you design for real-time vs batch?**
   → Real-time: Event Hubs + Stream Analytics + Cosmos DB (sub-second latency). Batch: Data Factory + ADLS + Databricks + Synapse (minutes-hours latency). Lambda: both paths, merge at serving layer.
