Subscribe and Access: 5200+ FREE Videos and 21+ Subjects like CRT, Soft Skills, JAVA, Hadoop, Microsoft .NET, Testing Tools, etc.
Batch Date: Sept 13th & 14th @ 4:00 PM
Faculty: Mr. N. Vijay Sunder Sagar (20+ Yrs of Exp.)
Duration: 10 Weekends Batch
Venue:
DURGA SOFTWARE SOLUTIONS,
Flat No: 202,
2nd Floor,
HUDA Maitrivanam,
Ameerpet, Hyderabad - 500038
Ph.No: +91 - 8885252627, 9246212143, 80 96 96 96 96
Syllabus:
GCP Data Engineering
Module 1: GCP Data Engineering Fundamentals
Module 2: Google Cloud Storage (GCS)
Module 3: Cloud SQL – Setting up a Database
Module 4: BigQuery – Building a Data Warehouse
Module 5: Dataproc – Big Data Processing
Module 6: Databricks – PySpark Processing
Module 7: Dataflow – Apache Beam Development
Module 8: Google Cloud Composer – Orchestration
Module 9: Data Fusion – Data Integration Service
Module 10: DBT, Airflow and Terraform
1. Introduction to Google Cloud Platform
- Overview of cloud platforms
- GCP services and regions
- IAM (Identity & Access Management) basics
- Resource hierarchy (organization, folders, projects)
- Billing & cost management
2. GCP Data Engineering Fundamentals
- Data engineering roles & responsibilities
- Batch vs real-time data processing
- Data lake vs data warehouse vs data marts
- ETL vs ELT
3. Getting Started with GCP
- Signing up for a GCP Account
- Create a New Google Account using a Non-Gmail Id
- Sign up for GCP using Google Account
- Overview of GCP Credits
- Overview of GCP Project and Billing
- Overview of Google Cloud Shell
- Install Google Cloud SDK on Windows
- Initialize gcloud CLI using GCP Project
- Reinitialize Google Cloud Shell with Project id
- Overview of Analytics Services on GCP
4. Storage & Databases in GCP
- Cloud Storage – buckets, lifecycle, versioning
- BigQuery – datasets, tables, partitioning, clustering, query optimization
- Cloud SQL & Cloud Spanner – relational databases
- Firestore & Bigtable – NoSQL databases
- Data modeling best practices
5. Google Cloud Storage (GCS): Setting up a Data Lake using GCS
- Getting Started with Google Cloud Storage or GCS
- Overview of Google Cloud Storage or GCS Web UI
- Create GCS Bucket using GCP Web UI
- Upload Folders and Files into GCS Bucket using GCP Web UI
- Review GCS Buckets and Objects using gsutil commands
- Delete GCS Bucket using Web UI
- Setup Data Repository in Google Cloud Shell
- Overview of Data Sets
- Managing Buckets in GCS using gsutil
- Copy Data Sets into GCS using gsutil
- Cleanup Buckets in GCS using gsutil
- Exercise to Manage Buckets and Files in GCS using gsutil
- Overview of Setting up Data Lake using GCS
- Setup Google Cloud Libraries in Python Virtual Environment
- Setup Bucket and Files in GCS using gsutil
- Getting Started to manage files in GCS using Python
- Setup Credentials for Python and GCS Integration
- Review Methods in Google Cloud Storage Python library
- Get GCS Bucket Details using Python
- Manage Blobs or Files in GCS using Python
- Project Problem Statement to Manage Files in GCS using Python
- Design to Upload multiple files into GCS using Python
- Get File Names to upload into GCS using Python glob and os
- Upload all Files to GCS as blobs using Python
- Validate Files or Blobs in GCS using Python
- Overview of Processing Data in GCS using Pandas
- Convert Data to Parquet and Write to GCS using Pandas
- Design to Upload multiple files into GCS using Pandas
- Get File Names to upload into GCS using Python glob and os
- Overview of Parquet File Format and Schemas JSON File
- Get Column Names for Dataset using Schemas JSON File
- Upload all Files to GCS as Parquet using Pandas
- Perform Validation of Files Copied using Pandas
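As a quick illustration of the Python and GCS integration topics above, here is a minimal sketch of uploading a local dataset folder into a GCS bucket as blobs. It assumes the google-cloud-storage library is installed and credentials are already configured (for example via gcloud auth application-default login); the bucket name, local folder and prefix are placeholders.

    import glob
    import os

    from google.cloud import storage

    def upload_files(bucket_name, local_dir, prefix):
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        # Walk the local dataset folder and upload every file
        for path in glob.glob(os.path.join(local_dir, "**", "*"), recursive=True):
            if os.path.isfile(path):
                # Keep the relative folder structure as the blob name (use / separators)
                blob_name = f"{prefix}/{os.path.relpath(path, local_dir)}".replace(os.sep, "/")
                bucket.blob(blob_name).upload_from_filename(path)
                print(f"Uploaded {path} -> gs://{bucket_name}/{blob_name}")

    if __name__ == "__main__":
        upload_files("my-datalake-bucket", "data/retail_db", "retail_db")

Validation can then be as simple as listing the uploaded blobs with client.list_blobs(bucket_name, prefix=prefix) and comparing the count against the local files.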
6. Cloud SQL: Set Up a Postgres Database using Cloud SQL
- Overview of GCP Cloud SQL
- Setup Postgres Database Server using GCP Cloud SQL
- Configure Network for Cloud SQL Postgres Database
- Install Postgres 14 on Windows 11
- Getting Started with pgAdmin on Windows
- Getting Started with pgAdmin on Mac
- Validate Client Tools for Postgres on Mac or PC
- Setup Database in GCP Cloud SQL Postgres Database Server
- Setup Tables in GCP Cloud SQL Postgres Database
- Validate Data in GCP Cloud SQL Postgres Database Table
- Integration of GCP Cloud SQL Postgres with Python
- Overview of Integration of GCP Cloud SQL Postgres with Pandas
- Read Data From Files to Pandas Data Frame
- Process Data using Pandas Dataframe APIs
- Write Pandas Dataframe into Postgres Database Table
- Validate Data in Postgres Database Tables using Pandas
- Getting Started with Secrets using GCP Secret Manager
- Configure Access to GCP Secret Manager via IAM Roles
- Install Google Cloud Secret Manager Python Library
- Get Secret Details from GCP Secret Manager using Python
- Connect to Database using Credentials from Secret Manager
- Stop GCP Cloud SQL Postgres Database Server
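To tie the Pandas and Secret Manager topics of this module together, here is a minimal sketch. It assumes google-cloud-secret-manager, pandas, SQLAlchemy and psycopg2 are installed, that a secret (named postgres-creds here, a placeholder) stores the Cloud SQL credentials as JSON, and that the Cloud SQL Postgres instance is reachable from the client network. File paths and column names are illustrative.

    import json

    import pandas as pd
    from google.cloud import secretmanager
    from sqlalchemy import create_engine

    PROJECT_ID = "my-gcp-project"  # placeholder

    def get_db_credentials(secret_id="postgres-creds"):
        client = secretmanager.SecretManagerServiceClient()
        name = f"projects/{PROJECT_ID}/secrets/{secret_id}/versions/latest"
        payload = client.access_secret_version(request={"name": name}).payload.data
        # Secret is assumed to hold JSON like {"user": ..., "password": ..., "host": ..., "db": ...}
        return json.loads(payload.decode("utf-8"))

    creds = get_db_credentials()
    engine = create_engine(
        f"postgresql+psycopg2://{creds['user']}:{creds['password']}@{creds['host']}:5432/{creds['db']}"
    )

    # Read a file into a Pandas dataframe, apply a simple transformation, and write to Postgres
    orders = pd.read_csv(
        "data/retail_db/orders/part-00000",
        names=["order_id", "order_date", "customer_id", "status"],
    )
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    orders.to_sql("orders", engine, if_exists="append", index=False)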
7. Data Ingestion & Integration
- Pub/Sub – message queues, subscriptions, streaming data ingestion
- Dataflow (Apache Beam) – batch & stream data pipelines
- Dataproc (Hadoop/Spark on GCP) – ETL with Spark/Hive
- Transfer Service & Storage Transfer – on-prem to cloud data movement
- APIs, connectors, and partner ETL tools (Informatica, Fivetran, etc.)
8. Data Processing & Transformation
- Designing pipelines with Dataflow
- Streaming analytics with Pub/Sub + Dataflow
- Data transformations using Dataproc (Spark/Presto/Hive)
- BigQuery transformations (SQL-based ELT)
- Using Dataprep (Trifacta) for no-code data wrangling
9. Data Warehousing & Analytics
- BigQuery:
- Schemas, partitioning, clustering
- Optimization (slots, caching, materialized views)
- Federated queries (Cloud Storage, Bigtable, Sheets)
- BigQuery ML basics
- Designing star/snowflake schemas
- Analytics & BI integration (Looker, Data Studio)
10. BigQuery: Building a Data Warehouse
- Overview of Google BigQuery
- Getting Started with Google BigQuery
- Overview of CRUD Operations in Google BigQuery
- Merge or Upsert into Google BigQuery Tables
- Create Dataset and Tables in Google BigQuery using UI
- Create Table in Google BigQuery using Command
- Exercise to create tables in Google BigQuery
- Overview of Loading Data from Files into BigQuery Tables
- Getting Started with Integration between Google BigQuery and Python
- Load Data from GCS Files into an Empty Table in Google BigQuery
- Run Queries in Google BigQuery using Python Applications
- Exercise to Load Data into BigQuery Tables
- Drop Tables from Google BigQuery
- Overview of External Tables in BigQuery
- Create Google BigQuery External Table on GCS Files using Web UI
- Create Google BigQuery External Table on GCS Files using Command
- Google BigQuery External Tables using AWS S3, Azure Blob or Google Drive
- Exercise to Create Google BigQuery External Tables
- Overview of SQL Capabilities of Google BigQuery
- Basic SQL Queries using Google BigQuery
- Cumulative Aggregations using Google BigQuery
- Compute Ranks using Google BigQuery
- Filter based on Ranks using Google BigQuery
- Overview of Key Integrations with Google BigQuery
- Python Pandas Integration with Google BigQuery
- Overview of Integration between BigQuery and RDBMS Databases
- Validate Cloud SQL Postgres Database for BigQuery Integration
- Create External Connections and Run External Queries from Google BigQuery
- Running External Queries using External Connections in Google BigQuery
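A minimal sketch of the BigQuery and Python integration covered above: loading Parquet files from GCS into a native table and then running a cumulative aggregation. It assumes the google-cloud-bigquery library is installed; the project, dataset, table and bucket names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Load Parquet files from GCS into a BigQuery table
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(
        "gs://my-datalake-bucket/retail_db_parquet/orders/*.parquet",
        "my-gcp-project.retail.orders",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish

    # Run a query with a cumulative aggregation and print the results
    query = """
    SELECT order_date,
           COUNT(*) AS order_count,
           SUM(COUNT(*)) OVER (ORDER BY order_date) AS cumulative_orders
    FROM `my-gcp-project.retail.orders`
    GROUP BY order_date
    ORDER BY order_date
    """
    for row in client.query(query).result():
        print(row.order_date, row.order_count, row.cumulative_orders)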
11. Dataproc: Big Data Processing
- Getting Started with GCP Dataproc
- Setup Single Node Dataproc Cluster for Development
- Validate SSH Connectivity to Master Node of Dataproc Cluster
- Allocate Static IP to the Master Node VM of Dataproc Cluster
- Setup VS Code Remote Window for Dataproc VM
- Setup Workspace using VS Code on Dataproc
- Getting Started with HDFS Commands on Dataproc
- Recap of gsutil to manage files and folders in GCS
- Review Data Sets setup on Dataproc Master Node VM
- Copy Local Files into HDFS on Dataproc
- Copy GCS Files into HDFS on Dataproc
- Validate Pyspark CLI in Dataproc Cluster
- Validate Spark Scala CLI in Dataproc Cluster
- Validate Spark SQL CLI in Dataproc Cluster
12. ELT Data Pipelines using Dataproc
- Overview of GCP Dataproc Jobs and Workflow
- Setup JSON Dataset in GCS for Dataproc Jobs
- Review Spark SQL Commands used for Dataproc Jobs
- Run Dataproc Job using Spark SQL
- Overview of Modularizing Spark SQL Applications for Dataproc
- Review Spark SQL Scripts for Dataproc Jobs and Workflows
- Validate Spark SQL Script for File Format Conversion
- Exercise to convert file format using Spark SQL Script
- Validate Spark SQL Script for Daily Product Revenue
- Develop Spark SQL Script to Cleanup Database
- Copy Spark SQL Scripts to GCS
- Run and Validate Spark SQL Scripts in GCS
- Limitations of Running Spark SQL Scripts using Dataproc Jobs
- Manage Dataproc Clusters using gcloud Commands
- Run Dataproc Jobs using Spark SQL Command or Query
- Run Dataproc Jobs using Spark SQL Scripts
- Exercises to Run Spark SQL Scripts as Dataproc Jobs using gcloud
- Delete Dataproc Jobs using gcloud commands
- Importance of using gcloud commands to manage dataproc jobs
- Getting Started with Dataproc Workflow Templates using Web UI
- Review Steps and Design to create Dataproc Workflow Template
- Create Dataproc Workflow Template and Add Cluster using gcloud Commands
- Review gcloud Commands to Add Jobs to Dataproc Workflow Templates
- Add Jobs to Dataproc Workflow Template using Commands
- Instantiate Dataproc Workflow Template to run the Data Pipeline
- Overview of Dataproc Operations and Deleting Workflow Runs
- Run and Validate ELT Data Pipeline using Dataproc
- Stop Dataproc Cluster
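As an illustration of the Spark SQL logic behind the Daily Product Revenue pipeline in this module, here is a sketch of a PySpark application that could be submitted as a Dataproc job (for example with gcloud dataproc jobs submit pyspark). The GCS paths, column names and order statuses reflect a typical retail dataset and are assumptions, not the course's exact code.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Daily Product Revenue").getOrCreate()

    orders = spark.read.parquet("gs://my-datalake-bucket/retail_db_parquet/orders")
    order_items = spark.read.parquet("gs://my-datalake-bucket/retail_db_parquet/order_items")

    orders.createOrReplaceTempView("orders")
    order_items.createOrReplaceTempView("order_items")

    daily_product_revenue = spark.sql("""
        SELECT o.order_date,
               oi.order_item_product_id,
               ROUND(SUM(oi.order_item_subtotal), 2) AS revenue
        FROM orders o
        JOIN order_items oi
          ON o.order_id = oi.order_item_order_id
        WHERE o.order_status IN ('COMPLETE', 'CLOSED')
        GROUP BY o.order_date, oi.order_item_product_id
    """)

    # Write the curated result back to the data lake
    daily_product_revenue.write.mode("overwrite").parquet(
        "gs://my-datalake-bucket/retail_gold/daily_product_revenue"
    )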
13. Databricks: PySpark Processing in GCP
- Overview of Databricks on GCP
- Signing up for Databricks on GCP
- Create Databricks Workspace on GCP
- Getting Started with Databricks Clusters on GCP
- Getting Started with Databricks Notebook
- High level architecture of Databricks
- Setup Databricks CLI on Mac or Windows
- Overview of Databricks CLI and other clients
- Configure Databricks CLI on Mac or Windows
- Troubleshoot issues to configure Databricks CLI
- Overview of Databricks CLI Commands
- Setup Data Repository for Data Sets
- Setup Data Sets in DBFS using Databricks CLI Commands
- Process Data in DBFS using Databricks Spark SQL
- Getting Started with Spark SQL Example using Databricks
- Create Temporary Views using Spark SQL
- Exercise to create temporary views using Spark SQL
- Spark SQL Query to compute Daily Product Revenue
- Save Query Result to DBFS using Spark SQL
- Overview of Pyspark Examples on Databricks
- Process Schema Details in JSON using Pyspark
- Create Dataframe with Schema from JSON File using Pyspark
- Transform Data using Spark APIs
- Get Schema Details for all Data Sets using Pyspark
- Convert CSV to Parquet with Schema using Pyspark
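A minimal sketch of the CSV-to-Parquet conversion with schemas described above, runnable in a Databricks notebook where a SparkSession already exists (getOrCreate simply reuses it). The DBFS paths and the layout of schemas.json are illustrative assumptions.

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()

    # schemas.json (assumed layout) maps each dataset to its ordered column definitions,
    # e.g. {"orders": [{"column_name": "order_id"}, {"column_name": "order_date"}, ...]}
    with open("/dbfs/public/retail_db/schemas.json") as fp:
        schemas = json.load(fp)

    def get_columns(schemas, ds_name):
        return [col["column_name"] for col in schemas[ds_name]]

    for ds_name in schemas:
        columns = get_columns(schemas, ds_name)
        df = (spark.read
              .csv(f"dbfs:/public/retail_db/{ds_name}", inferSchema=True)
              .toDF(*columns))
        df.write.mode("overwrite").parquet(f"dbfs:/public/retail_db_parquet/{ds_name}")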
14. ELT Pipeline using Databricks
- Overview of Databricks Workflows
- Pass Arguments to Databricks Python Notebooks
- Pass Arguments to Databricks SQL Notebooks
- Create and Run First Databricks Job
- Run Databricks Jobs and Tasks with Parameters
- Create and Run Orchestrated Pipeline using Databricks Job
- Import ELT Data Pipeline Applications into Databricks Environment
- Spark SQL Application to Cleanup Database and Datasets
- Review File Format Converter Pyspark Code
- Review Databricks SQL Notebooks for Tables and Final Results
- Validate Applications for ELT Pipeline using Databricks
- Build ELT Pipeline using Databricks Job in Workflows
- Run and Review Execution details of ELT Data Pipeline using Databricks Job
- Cleanup Databricks Environment on GCP
15. Integration of Spark on Dataproc and BigQuery
- Review Development Environment with VS Code using Dataproc Cluster
- Validate Google BigQuery Integration with Python on Dataproc
- Setup Native Tables in Google BigQuery
- Review Spark Google BigQuery Connector
- Integration of Spark on Dataproc and BigQuery using Pyspark CLI
- Integration of Spark on Dataproc and BigQuery using Notebook
- Review Design of Data Pipeline using Spark and BigQuery
- Review Spark Applications to compute daily product revenue
- Create Table for Daily Product Revenue in Google BigQuery
- Validate Parquet Files for Daily Product Revenue in GCS
- Develop Logic to Save Daily Product Revenue to BigQuery Table
- Reset Daily Product Revenue Table in Google BigQuery
- Review Spark Application Code to Write to BigQuery Table
- Submit Spark Application with BigQuery Integration using Client Mode
- Submit Spark Application with BigQuery Integration using Cluster Mode
- Deploy Spark Application with BigQuery Integration in GCS
- Switching to Local Development Environment from Dataproc
- Run Spark Application as Dataproc Job using Web UI
- Run Spark Application as Dataproc Job using Command
- Review Dataproc Jobs and Spark Application using Dataproc UI
- Overview of Orchestration using Dataproc Commands for Spark Applications
- Overview of ELT Pipeline using Dataproc Workflows
- Create Workflow Template with Spark SQL Applications
- Add Pyspark Application to Dataproc Workflow Template
- Run Dataproc Workflow Template using Dataproc Command
- Update Job Properties in Dataproc Workflow Template
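A minimal sketch of writing Spark output to a BigQuery table using the Spark BigQuery connector, as covered in this module. It assumes the connector jar is available on the Dataproc cluster (recent Dataproc images bundle it, or it can be supplied via --jars); the table, staging bucket and GCS paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Daily Product Revenue to BigQuery").getOrCreate()

    df = spark.read.parquet("gs://my-datalake-bucket/retail_gold/daily_product_revenue")

    (df.write
       .format("bigquery")
       .option("table", "my-gcp-project.retail.daily_product_revenue")
       .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket used by the connector
       .mode("overwrite")
       .save())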
16. Dataflow (Apache Beam Development)
- Introduction to Dataflow
- Use cases for Dataflow in real-time analytics and ETL.
- Understanding the difference between Apache Spark and Apache Beam
- How Dataflow is different from Dataproc
- Building Data Pipelines with Apache Beam
- Writing Apache Beam pipelines for batch and stream processing.
- Custom Pipelines and Pre-defined pipelines
- Transformations and windowing concepts.
- Integration with Other GCP Services
- Integrating Dataflow with BigQuery, Pub/Sub, and other GCP services.
- Real-time analytics and visualization using Dataflow and BigQuery.
- Workflow orchestration with Composer.
- End-to-End Streaming Pipeline using Apache Beam with Dataflow, Python app, Pub/Sub, BigQuery, GCS
- Template method of creating pipelines
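To show the shape of an Apache Beam pipeline as discussed in this module, here is a minimal batch sketch that counts orders per date. It assumes apache-beam[gcp] is installed; run it locally as-is, or pass --runner=DataflowRunner along with project, region and staging options to execute it on Dataflow. The GCS paths and CSV layout are placeholders.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_order(line):
        # Assumed orders CSV layout: order_id,order_date,customer_id,status
        order_id, order_date, customer_id, status = line.split(",")
        return (order_date, 1)

    options = PipelineOptions()  # add --runner/--project/--region here for Dataflow

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadOrders" >> beam.io.ReadFromText("gs://my-datalake-bucket/retail_db/orders/part-00000")
         | "ParseAndKeyByDate" >> beam.Map(parse_order)
         | "CountPerDate" >> beam.CombinePerKey(sum)
         | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
         | "Write" >> beam.io.WriteToText("gs://my-datalake-bucket/output/orders_per_date"))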
17. Cloud Pub/Sub
- Introduction to Pub/Sub
- Understanding the role of Pub/Sub in event-driven architectures.
- Key Pub/Sub concepts: topics, subscriptions, messages, and acknowledgments.
- Creating and Managing Topics and Subscriptions
- Using the GCP Console to create Pub/Sub topics and subscriptions.
- Configuring message retention policies and acknowledgment settings.
- Publishing and Consuming Messages
- Writing and deploying code to publish messages to a topic.
- Implementing subscribers to consume and process messages from subscriptions.
- Integration with Other GCP Services
- Connecting Pub/Sub with Cloud Functions for serverless event-driven computing.
- Integrating Pub/Sub with Dataflow for real-time stream processing.
- Streaming use-case using Dataflow
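A minimal sketch of publishing and consuming Pub/Sub messages in Python, assuming the google-cloud-pubsub library is installed and that the topic and subscription names shown (placeholders) already exist.

    import json
    from concurrent.futures import TimeoutError

    from google.cloud import pubsub_v1

    PROJECT_ID = "my-gcp-project"  # placeholder

    # Publish a message
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, "orders-topic")
    future = publisher.publish(
        topic_path, json.dumps({"order_id": 1, "status": "COMPLETE"}).encode("utf-8")
    )
    print("Published message id:", future.result())

    # Consume messages with a streaming pull subscriber
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, "orders-sub")

    def callback(message):
        print("Received:", message.data.decode("utf-8"))
        message.ack()  # acknowledge so the message is not redelivered

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=30)  # listen for 30 seconds, then stop
    except TimeoutError:
        streaming_pull.cancel()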
18. Google Cloud Composer: Data Pipeline Orchestration
- Orchestration & Workflow Management and DAG Creations
- Cloud Composer (Airflow on GCP) – DAGs, operators, scheduling pipelines
- Integration with Dataflow, Dataproc, BigQuery
- Workflow automation with Cloud Functions & Workflows
- Create Airflow or Cloud Composer Environment
- Review Google Cloud Composer Environment
- Development Process of Airflow DAGs for Cloud Composer
- Install Required Dependencies for Development of Airflow DAGs
- Run Airflow Commands in Cloud Composer using gcloud
- Overview of Airflow Architecture
- Deploy and Run First Airflow DAG in Google Cloud Composer Environment
- Understand Relationship between Python Scripts and Airflow DAGs
- Code Review of Airflow DAGs and Tasks
- Overview of Airflow Dataproc Operators
- Review Airflow DAG with GCP Dataproc Workflow Template Operator
- Deploy and Run GCP Dataproc Workflow using Airflow
- Using Variables in Airflow DAGs
- Deploy and Run Airflow DAGs with Variables
- Overview of Data Pipeline using Cloud Composer and Dataproc Jobs
- Review the Spark Applications related to the Data Pipeline
- Review Airflow DAG for Orchestrated Pipeline using Dataproc Jobs
- Deploy Data Pipeline or Airflow DAG using Dataproc Jobs
- Review Source and Target before Deployment of Airflow DAG
- Deploy and Run Airflow DAG with Dataproc Jobs
- Differences Between Dataproc Workflows and Airflow DAGs
- Cleanup Cloud Composer Environment and Dataproc Cluster
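A minimal sketch of a Cloud Composer DAG that instantiates a Dataproc workflow template, using the Google provider operators that ship with Composer. The DAG id, template id, project and region are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocInstantiateWorkflowTemplateOperator,
    )

    with DAG(
        dag_id="daily_product_revenue",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_workflow = DataprocInstantiateWorkflowTemplateOperator(
            task_id="run_dataproc_workflow",
            template_id="wf-daily-product-revenue",  # existing Dataproc workflow template
            project_id="my-gcp-project",
            region="us-central1",
        )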
19. Data Fusion:
- Introduction to Data Fusion
- Overview of Data Fusion as a fully managed data integration service.
- Use cases for Data Fusion in ETL and data migration.
- Building Data Integration Pipelines
- Creating ETL pipelines using the visual interface.
- Configuring data sources, transformations, and sinks.
- Using pre-built templates for common integration scenarios.
- Integration with GCP and External Services
- Integrating Data Fusion with BigQuery, Cloud Storage, and other GCP services.
- End-to-End pipeline using Data Fusion with Wrangler, GCS, BigQuery
20. Cloud Functions
- Cloud Functions Introduction
- Setting up Cloud Functions in GCP
- Event-driven architecture and use cases
- Writing and deploying Cloud Functions
- Triggering Cloud Functions:
- HTTP triggers
- Pub/Sub triggers
- Cloud Storage triggers
- Monitoring and logging Cloud Functions
- Use case 1: Loading files from GCS into BigQuery as soon as they are uploaded
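A minimal sketch of Use case 1 as an event-driven Python Cloud Function with a Cloud Storage trigger: when a file lands in the bucket, it is loaded into a BigQuery table. The event payload fields (bucket, name) follow the standard GCS finalize event for background functions; the dataset and table names are placeholders.

    from google.cloud import bigquery

    def gcs_to_bigquery(event, context):
        """Triggered by a file upload to the configured GCS bucket."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,  # infer the schema from the file
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        job = client.load_table_from_uri(
            uri, "my-gcp-project.retail.orders_raw", job_config=job_config
        )
        job.result()
        print(f"Loaded {uri} into retail.orders_raw")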
21. Terraform:
- Terraform Introduction
- Installing and configuring Terraform.
- Infrastructure Provisioning
- Terraform basic commands
- Init, plan, apply, destroy
- Create Resources in Google Cloud Platform
- GCS buckets
- Dataproc cluster
- BigQuery Datasets and tables
- And more resources as needed
22. Data Pipelines using DBT, Airflow and BigQuery
- Overview of Data Landscape of Large Enterprise
- DBT High Level Architecture
- Overview of DBT Cloud Features and DBT Adapters
- Airflow and DBT Pipeline Patterns
- Pre-requisites for Dev Environment using Airflow and DBT
- Setup Astro CLI on Windows or Mac
- Setup Workspace using VSCode
- Setup Local Airflow Environment using Astro CLI
- Setup Python Virtual Environment with Airflow
- Overview of Airflow Providers
- Manage Local Airflow Containers using Astro CLI
- Connect to Airflow Containers and Review Logs using Astro CLI
- Setup Datasets for Airflow Pipelines or DAGs
- Setup GCS Bucket and Upload Data Set
- Getting Started with Google BigQuery
- Create External Table using Google BigQuery
- Create GCP Service Account and Download Credentials
- Getting Started with DBT Cloud
- Setup DBT Cloud Project for Google BigQuery
- Review and Run Example DBT Pipeline using DBT Cloud
- Validate Google BigQuery Objects created by DBT Pipeline
- Overview of ELT Pipeline using DBT and Google BigQuery
- Change the DBT Project Structure from example
- Create Models for Orders and Order Items
- Define Denormalized Model for Order Details
- Query to compute daily product revenue
- Add Model for Daily Product Revenue
- Create and Run DBT Cloud Job
- Validate Airflow and Review DBT Cloud Provider
- Install Airflow DBT Cloud Provider
- Overview of End to End Orchestrated Data Pipeline using Airflow
- Create DBT Cloud Connection in Airflow
- Create DBT Job Variables in Airflow
- Develop Airflow DAG to trigger DBT Cloud Job
- Deploy Airflow DAG with DBT Cloud
- Run Airflow DAG with DBT Cloud Job
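A minimal sketch of the Airflow DAG that triggers a DBT Cloud job, assuming the apache-airflow-providers-dbt-cloud package is installed and an Airflow connection named dbt_cloud_default has been created as described above. The job id is a placeholder.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

    with DAG(
        dag_id="dbt_daily_product_revenue",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        trigger_dbt_job = DbtCloudRunJobOperator(
            task_id="trigger_dbt_cloud_job",
            dbt_cloud_conn_id="dbt_cloud_default",
            job_id=12345,  # placeholder DBT Cloud job id
            wait_for_termination=True,
            check_interval=60,
        )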
23. Machine Learning in GCP (for Data Engineers)
- Overview of AI/ML on GCP
- BigQuery ML – building ML models directly in SQL
- Vertex AI basics – training & deploying ML models
- Pipelines for ML (Vertex AI Pipelines, Kubeflow)
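As a small illustration of BigQuery ML from a data engineer's perspective, here is a sketch that trains a simple linear regression with a CREATE MODEL statement issued through the Python client and then scores rows with ML.PREDICT. The dataset, table, model and column names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    create_model_sql = """
    CREATE OR REPLACE MODEL `my-gcp-project.retail.revenue_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['revenue']) AS
    SELECT order_dow, product_id, revenue
    FROM `my-gcp-project.retail.daily_product_revenue_features`
    """
    client.query(create_model_sql).result()  # training runs as a query job

    # Score new rows with ML.PREDICT once the model is trained
    predict_sql = """
    SELECT *
    FROM ML.PREDICT(MODEL `my-gcp-project.retail.revenue_model`,
                    (SELECT order_dow, product_id
                     FROM `my-gcp-project.retail.daily_product_revenue_features`))
    """
    for row in client.query(predict_sql).result():
        print(dict(row))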
24. Security, Monitoring & Governance
- Data encryption (at rest, in transit, CMEK vs Google-managed keys)
- IAM roles for data services
- VPC Service Controls for data security
- Cloud Logging, Cloud Monitoring, and Cloud Trace
- Data Catalog for metadata management & lineage
- DLP (Data Loss Prevention) for sensitive data
25. Real-World GCP Data Engineering Scenarios
- Building a streaming pipeline (Pub/Sub → Dataflow → BigQuery → Looker)
- Building a batch pipeline (Cloud Storage → Dataproc → BigQuery)
- Data migration from on-prem to GCP
- Designing a hybrid data lakehouse (BigQuery + Dataplex + GCS)
- Project flow