This 2-day course equips learners to design, build, secure, and automate reliable data pipelines. It uses Apache Spark on the Databricks Free Platform and moves beyond simple notebook execution to introduce rigorous software engineering practices. It also covers modern data architecture (Lakehouse vs. Data Mesh), operational excellence (FinOps, Observability), and data governance.
Learning Outcomes:
- Analyze distributed-systems and data-platform trade-offs.
- Evaluate storage and compute options for a specified workload (structured vs. unstructured).
- Build an end-to-end batch pipeline in Spark (Databricks), implementing data-quality validation and unit tests to verify transformation logic.
- Implement data governance and security controls (PII-aware handling, catalog concepts, RBAC, encryption, network isolation) to protect data across the pipeline.
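
To make the pipeline outcome above more concrete, the following is a minimal sketch of the pattern it names: a pure transformation, a data-quality gate, and a unit test of the transformation logic. It is written in plain Python with no Spark dependency so that it runs anywhere. The function names (`normalise_order`, `validate_orders`), the field names, and the quality rules are all hypothetical and are not taken from the course material.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def normalise_order(row: Row) -> Row:
    """Pure transformation: trim the customer id and convert cents to an amount."""
    return {
        "order_id": row["order_id"],
        "customer_id": str(row["customer_id"]).strip(),
        "amount": int(row["amount_cents"]) / 100,
    }


def validate_orders(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Data-quality gate: split rows into (valid, rejected).

    Hypothetical rules: order_id must be unique, customer_id must be
    non-empty, and amount must be non-negative.
    """
    valid: List[Row] = []
    rejected: List[Row] = []
    seen_ids = set()
    for row in rows:
        ok = (
            row["order_id"] not in seen_ids
            and row["customer_id"] != ""
            and row["amount"] >= 0
        )
        (valid if ok else rejected).append(row)
        seen_ids.add(row["order_id"])
    return valid, rejected


def test_normalise_order() -> None:
    """Unit test: checks the transformation in isolation from any I/O."""
    raw = {"order_id": 1, "customer_id": " c-42 ", "amount_cents": 1999}
    expected = {"order_id": 1, "customer_id": "c-42", "amount": 19.99}
    assert normalise_order(raw) == expected


if __name__ == "__main__":
    test_normalise_order()
    raw_rows = [
        {"order_id": 1, "customer_id": " c-42 ", "amount_cents": 1999},
        {"order_id": 1, "customer_id": "c-43", "amount_cents": 500},   # duplicate id
        {"order_id": 2, "customer_id": "   ", "amount_cents": 100},    # empty customer
        {"order_id": 3, "customer_id": "c-44", "amount_cents": -250},  # negative amount
    ]
    valid, rejected = validate_orders([normalise_order(r) for r in raw_rows])
    print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the transformation free of I/O is what makes it testable without a cluster. In a Spark setting the same separation would presumably apply to a function that takes and returns a DataFrame, though that is an assumption about how the course structures its exercises.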
Module 1: The Engineering Landscape & Theory
Module 2: Storage Strategy & Economics
Module 3: Engineering Lifecycle
Module 4: Advanced Modelling & Migration
Module 5: The Secure Data Lakehouse - Architecture, Data Warehouse Principles & Medallion
Module 6: Robust Pipeline Construction
Module 7: Observability & Automation
Module 8: Streaming & Real-Time Challenges