Azure Data Engineer — Complete Learning Package
Hands-On Labs (1–50)
- Create ADLS Gen2 with hierarchical namespace
- Build Bronze/Silver/Gold layers
- Upload raw data (CSV/JSON)
- Create Data Factory pipelines (Copy + ForEach)
- Implement incremental loads (watermark)
- Run Databricks notebooks (PySpark)
- Transform Bronze → Silver (Delta)
- Implement Delta MERGE (upsert)
- Create Gold aggregations
- Implement SCD Type 2
- Run Synapse serverless queries
- Load data into Synapse SQL pool
- Create Stream Analytics jobs
- Implement windowing (tumbling/hopping)
- Set up Event Hubs and IoT Hub
- Build CDC pipelines
- Create mapping data flows
- Implement data validation
- Set up Purview (scan + classification + lineage)
- Implement Cosmos DB change feed
- Build real-time pipelines (IoT → Power BI)
- Create Delta Live Tables pipelines
- Enable schema evolution
- Optimize Synapse queries
- Create triggers and monitoring
- Build full medallion architecture
- Implement Lambda architecture
- Build data mesh platform
- Create Customer 360 system
- Implement governance at scale
- Full enterprise data platform
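The watermark and upsert labs above can be sketched without any Azure service, using plain Python structures. This is a minimal illustration under stated assumptions, not the Data Factory or Delta Lake implementation: the function name incremental_upsert, the row fields (id, name, modified), and the in-memory dict used as a target are all hypothetical. It shows the idea only: read rows newer than the stored watermark, upsert them by key, then advance the watermark.

```python
from datetime import datetime


def incremental_upsert(source_rows, target, watermark):
    """Copy rows modified after `watermark` into `target`, keyed by id.

    Returns the new watermark (the largest `modified` value seen).
    All names here are hypothetical; a real pipeline would persist the
    watermark in a control table and use Delta MERGE for the upsert.
    """
    new_watermark = watermark
    for row in source_rows:
        if row["modified"] > watermark:
            # Upsert: insert a new key or overwrite the existing row.
            target[row["id"]] = row
            new_watermark = max(new_watermark, row["modified"])
    return new_watermark


source = [
    {"id": 1, "name": "alice", "modified": datetime(2024, 1, 1)},
    {"id": 2, "name": "bob", "modified": datetime(2024, 1, 2)},
]
target = {}
watermark = datetime.min

# First run: everything is newer than the initial watermark.
watermark = incremental_upsert(source, target, watermark)

# Later, id 2 changes and id 3 is added in the source.
source[1] = {"id": 2, "name": "robert", "modified": datetime(2024, 1, 3)}
source.append({"id": 3, "name": "carol", "modified": datetime(2024, 1, 4)})

# Second run: only the two rows newer than the stored watermark are applied.
watermark = incremental_upsert(source, target, watermark)
```

In the labs, the same pattern would typically map to a Lookup activity that reads the watermark, a Copy activity filtered on the modified column, and a Delta MERGE keyed on the business key.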
Major Projects
Core Platforms
- Modern data warehouse (ADLS + Databricks + Synapse)
- Real-time analytics (IoT + Event Hubs + Stream Analytics)
- Data lake governance (Purview)
- ETL migration (SSIS → Data Factory)
- Streaming pipelines (Kafka/Event Hubs)
Advanced Architectures
- Data mesh architecture
- Customer 360 platform
- Predictive maintenance pipeline
- Compliance data platform (HIPAA/GDPR)
- Data API platform (Functions + Cosmos DB)
Industry Use Cases
- Financial analytics platform
- Retail analytics system
- Healthcare analytics platform
- IoT manufacturing data pipeline
- Real-time fraud detection
Advanced Systems
- Data catalog and discovery
- Data quality management system
- Data lineage tracking
- Multi-cloud data integration
- Full enterprise data platform
Gotchas & Common Mistakes
- ADLS hierarchical namespace can be enabled on an existing account only through a one-way upgrade, and can't be disabled afterward
- Delta Lake VACUUM default retention = 7 days (shorter values require disabling the retention safety check)
- Event Hubs partition count can't be changed after creation on the Basic and Standard tiers (Premium and Dedicated can only increase it)
- Cosmos DB change feed doesn't include deletes in its default (latest version) mode; use a soft-delete flag instead
- Synapse serverless charges per TB processed
- PolyBase requires exact file format match
- Delta MERGE creates small files (run OPTIMIZE to compact them)
- Data Factory Copy doesn't validate schema
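- Stream Analytics output errors follow the output error policy: Retry (default) retries indefinitely and can block later events; Drop discards events that hit data conversion errors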
- Purview scanning requires proper permissions
- Databricks DBFS root and mounts are discouraged for production data (use ADLS)
- Delta schema evolution must be enabled
- ADLS ACLs are per-object: default ACLs apply only to newly created children, and changes to a parent ACL don't propagate to existing items
- Synapse serverless result set size limit = 400 GB (shared between concurrent queries)
- Data Factory ForEach parallelism limit (batchCount max 50, default 20)
- Delta log retention default = 30 days, but time travel only reaches versions whose data files VACUUM hasn't removed (default 7 days)
- Event Hubs retention is per hub
- Data quality must be enforced at ingestion
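One common workaround for the change feed gotcha above is a soft delete: instead of removing a document, the writer updates it with a marker field, and the consumer treats that marker as a delete. The sketch below simulates this with plain Python lists and dicts and makes no Cosmos DB SDK calls. The field name "deleted", the function apply_change_feed, and the sample documents are assumptions made for illustration.

```python
def apply_change_feed(changes, materialized):
    """Apply a batch of change feed items to a downstream copy.

    `changes` stands in for a batch read from the change feed; each item
    is the latest version of a document. A document carrying the
    hypothetical "deleted" marker is removed downstream; anything else
    is inserted or overwritten.
    """
    for doc in changes:
        if doc.get("deleted"):
            # Soft-deleted upstream: drop it from the downstream copy.
            materialized.pop(doc["id"], None)
        else:
            materialized[doc["id"]] = doc
    return materialized


downstream = {}

# Batch 1: two documents are created.
apply_change_feed(
    [{"id": "a", "value": 1}, {"id": "b", "value": 2}],
    downstream,
)

# Batch 2: "a" is updated and "b" is soft-deleted by setting the marker.
apply_change_feed(
    [{"id": "a", "value": 10}, {"id": "b", "value": 2, "deleted": True}],
    downstream,
)
```

In practice the soft-deleted document is usually given a short TTL so that it expires from the container after consumers have had time to read the marker.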
Data Pipeline Development Playbook
- Define sources, sinks, and transformations
- Select services (ADF, Databricks, Synapse)
- Design medallion architecture
- Ingest raw data (Bronze)
- Clean and transform (Silver)
- Aggregate and model (Gold)
- Apply data quality checks
- Set up monitoring and alerts
- Test pipelines (failure + recovery)
- Document lineage, schema, SLA
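The Bronze, Silver, and Gold steps of the playbook, together with the data quality check, can be sketched in plain Python. This is an illustration only: in the labs these layers would be Delta tables processed with PySpark, and the column names (order_id, region, amount), the validation rules, and the function names to_silver and to_gold are assumptions chosen for the example.

```python
from collections import defaultdict

# Bronze: raw records kept as ingested, with every value still a string.
bronze = [
    {"order_id": "1", "region": "east", "amount": "10.50"},
    {"order_id": "2", "region": "west", "amount": "7.25"},
    {"order_id": "2", "region": "west", "amount": "7.25"},  # duplicate key
    {"order_id": "3", "region": "east", "amount": "oops"},  # unparseable amount
    {"order_id": "4", "region": "", "amount": "3.00"},      # missing region
]


def to_silver(rows):
    """Validate, type, and deduplicate Bronze rows.

    Returns (clean_rows, rejected_rows) so failures stay visible
    instead of being silently dropped.
    """
    clean, rejected, seen = [], [], set()
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)
            continue
        if not row["region"]:
            rejected.append(row)
            continue
        if row["order_id"] in seen:
            continue  # keep the first occurrence of each key
        seen.add(row["order_id"])
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "region": row["region"],
                "amount": amount,
            }
        )
    return clean, rejected


def to_gold(rows):
    """Aggregate Silver rows into total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
```

Returning rejected rows separately is one possible design choice; it gives the monitoring and alerting step something concrete to count and lets failed records be replayed after a fix.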