Subscribe and Access: 5200+ FREE Videos and 21+ Subjects Like CRT, Soft Skills, JAVA, Hadoop, Microsoft .NET, Testing Tools etc.
Batch Date: Nov 8th & 9th @ 6:00 PM
Faculty: Mr. N. Vijay Sunder Sagar (20+ Yrs of Exp.)
Duration: 10 Weekends Batch
Venue: DURGA SOFTWARE SOLUTIONS, Flat No: 202, 2nd Floor, HUDA Maitrivanam, Ameerpet, Hyderabad - 500038
Ph.No: +91 - 8885252627, 9246212143, 80 96 96 96 96
                Syllabus:
            GCP Data Engineering 
Module 1: GCP Data Engineering Fundamentals
Module 2: Google Cloud Storage (GCS)
Module 3: Cloud SQL – Setting up a Database
Module 4: BigQuery – Building a Data Warehouse
Module 5: Dataproc – Big Data Processing
Module 6: Databricks – Pyspark Processing
Module 7: Dataflow – Apache Beam Development
Module 8: Google Cloud Composer – Orchestration
Module 9: Data Fusion – Data Integration Service
Module 10: DBT, Airflow and Terraform
1. Introduction to Google Cloud Platform
-  Overview of cloud platforms
-  GCP services and regions
-  IAM (Identity & Access Management) basics
-  Resource hierarchy (organization, folders, projects)
-  Billing & cost management
2. GCP Data Engineering Fundamentals
-  Data engineering roles & responsibilities
-  Batch vs real-time data processing
-  Data lake vs data warehouse vs data marts
-  ETL vs ELT
3. Getting Started with GCP
-  Signing up for a GCP Account
-  Create a New Google Account using a Non-Gmail Id
-  Sign up for GCP using Google Account
-  Overview of GCP Credits
-  Overview of GCP Project and Billing
-  Overview of Google Cloud Shell
-  Install Google Cloud SDK on Windows
-  Initialize gcloud CLI using GCP Project
-  Reinitialize Google Cloud Shell with Project id
-  Overview of Analytics Services on GCP
4. Storage & Databases in GCP
-  Cloud Storage – buckets, lifecycle, versioning
-  BigQuery – datasets, tables, partitioning, clustering, query optimization
-  Cloud SQL & Cloud Spanner – relational databases
-  Firestore & Bigtable – NoSQL databases
-  Data modeling best practices
5. Google Cloud Storage (GCS): Setting up a Data Lake using GCS
-  Getting Started with Google Cloud Storage or GCS
-  Overview of Google Cloud Storage or GCS Web UI
-  Create GCS Bucket using GCP Web UI
-  Upload Folders and Files into GCS Bucket using GCP Web UI
-  Review GCS Buckets and Objects using gsutil commands
-  Delete GCS Bucket using Web UI
-  Setup Data Repository in Google Cloud Shell
-  Overview of Data Sets
-  Managing Buckets in GCS using gsutil
-  Copy Data Sets into GCS using gsutil
-  Cleanup Buckets in GCS using gsutil
-  Exercise to Manage Buckets and Files in GCS using gsutil
-  Overview of Setting up Data Lake using GCS
-  Setup Google Cloud Libraries in Python Virtual Environment
-  Setup Bucket and Files in GCS using gsutil
-  Getting Started to manage files in GCS using Python
- Setup Credentials for Python and GCS Integration
-  Review Methods in Google Cloud Storage Python library
-  Get GCS Bucket Details using Python
-  Manage Blobs or Files in GCS using Python
-  Project Problem Statement to Manage Files in GCS using Python
-  Design to Upload multiple files into GCS using Python
-  Get File Names to upload into GCS using Python glob and os
-  Upload all Files to GCS as blobs using Python
-  Validate Files or Blobs in GCS using Python
-  Overview of Processing Data in GCS using Pandas
-  Convert Data to Parquet and Write to GCS using Pandas
-  Design to Upload multiple files into GCS using Pandas
-  Get File Names to upload into GCS using Python glob and os
-  Overview of Parquet File Format and Schemas JSON File
-  Get Column Names for Dataset using Schemas JSON File
-  Upload all Files to GCS as Parquet using Pandas
-  Perform Validation of Files Copied using Pandas
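
For reference, a minimal sketch of the kind of GCS and Pandas integration covered in this module, assuming the google-cloud-storage, pandas, pyarrow and gcsfs libraries are installed and gcloud credentials are active; the bucket name, paths and column names below are placeholders, not the course data set.

    # Illustrative only: upload a local CSV into a GCS bucket as a blob, then
    # rewrite the same data to GCS as Parquet using pandas.
    import pandas as pd
    from google.cloud import storage

    BUCKET_NAME = "my-datalake-bucket"                 # hypothetical bucket

    client = storage.Client()                          # uses active gcloud credentials
    bucket = client.bucket(BUCKET_NAME)

    # Upload a local file as a blob
    blob = bucket.blob("landing/orders/part-00000.csv")
    blob.upload_from_filename("data/retail_db/orders/part-00000")

    # Read the file with pandas and write it back to GCS as Parquet (needs gcsfs/pyarrow)
    columns = ["order_id", "order_date", "order_customer_id", "order_status"]
    df = pd.read_csv("data/retail_db/orders/part-00000", names=columns)
    df.to_parquet(f"gs://{BUCKET_NAME}/bronze/orders/orders.parquet", index=False)

    # Validate: list the blobs that were just written
    for b in client.list_blobs(BUCKET_NAME, prefix="bronze/orders/"):
        print(b.name, b.size)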
6. Cloud SQL: Set Up a Postgres Database using Cloud SQL
-  Overview of GCP Cloud SQL
-  Setup Postgres Database Server using GCP Cloud SQL
-  Configure Network for Cloud SQL Postgres Database
-  Install Postgres 14 on Windows 11
-  Getting Started with pgAdmin on Windows
-  Getting Started with pgAdmin on Mac
-  Validate Client Tools for Postgres on Mac or PC
-  Setup Database in GCP Cloud SQL Postgres Database Server
-  Setup Tables in GCP Cloud SQL Postgres Database
-  Validate Data in GCP Cloud SQL Postgres Database Table
-  Integration of GCP Cloud SQL Postgres with Python
-  Overview of Integration of GCP Cloud SQL Postgres with Pandas
-  Read Data From Files to Pandas Data Frame
-  Process Data using Pandas Dataframe APIs
-  Write Pandas Dataframe into Postgres Database Table
-  Validate Data in Postgres Database Tables using Pandas
-  Getting Started with Secrets using GCP Secret Manager
-  Configure Access to GCP Secret Manager via IAM Roles
-  Install Google Cloud Secret Manager Python Library
-  Get Secret Details from GCP Secret Manager using Python
-  Connect to Database using Credentials from Secret Manager
-  Stop GCP Cloud SQL Postgres Database Server
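
A minimal sketch of the Pandas-to-Cloud SQL flow in this module, assuming a Cloud SQL Postgres instance reachable from the client, a Secret Manager secret holding the database password, and the pandas, SQLAlchemy, psycopg2 and google-cloud-secret-manager libraries; the project id, secret name, host, user and table names are placeholders.

    import pandas as pd
    from google.cloud import secretmanager
    from sqlalchemy import create_engine

    PROJECT_ID = "my-gcp-project"          # placeholder project id
    SECRET_NAME = "cloudsql-postgres"      # placeholder secret holding the DB password

    # Fetch the database password from GCP Secret Manager
    sm = secretmanager.SecretManagerServiceClient()
    secret_path = f"projects/{PROJECT_ID}/secrets/{SECRET_NAME}/versions/latest"
    password = sm.access_secret_version(name=secret_path).payload.data.decode("utf-8")

    # Connect to the Cloud SQL Postgres database (host is the instance public IP)
    engine = create_engine(
        f"postgresql+psycopg2://retail_user:{password}@34.0.0.10:5432/retail_db"
    )

    # Read file data into a pandas dataframe and write it to a Postgres table
    orders = pd.read_csv(
        "data/retail_db/orders/part-00000",
        names=["order_id", "order_date", "order_customer_id", "order_status"],
    )
    orders.to_sql("orders", engine, if_exists="append", index=False)

    # Validate the load by reading the row count back
    print(pd.read_sql("SELECT count(*) AS order_count FROM orders", engine))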
7. Data Ingestion & Integration
-  Pub/Sub – message queues, subscriptions, streaming data ingestion
-  Dataflow (Apache Beam) – batch & stream data pipelines
-  Dataproc (Hadoop/Spark on GCP) – ETL with Spark/Hive
-  Transfer Service & Storage Transfer – on-prem to cloud data movement
-  APIs, connectors, and partner ETL tools (Informatica, Fivetran, etc.)
8. Data Processing & Transformation
-  Designing pipelines with Dataflow
-  Streaming analytics with Pub/Sub + Dataflow
-  Data transformations using Dataproc (Spark/Presto/Hive)
-  BigQuery transformations (SQL-based ELT)
-  Using Dataprep (Trifacta) for no-code data wrangling
9. Data Warehousing & Analytics
-  BigQuery:
    - Schemas, partitioning, clustering
    - Optimization (slots, caching, materialized views)
    - Federated queries (Cloud Storage, Bigtable, Sheets)
    - BigQuery ML basics
-  Designing star/snowflake schemas
-  Analytics & BI integration (Looker, Data Studio)
10. BigQuery: Building a Data Warehouse
-  Overview of Google BigQuery
-  Getting Started with Google BigQuery
-  Overview of CRUD Operations in Google BigQuery
-  Merge or Upsert into Google BigQuery Tables
-  Create Dataset and Tables in Google BigQuery using UI
-  Create Table in Google BigQuery using Command
-  Exercise to create tables in Google BigQuery
-  Overview of Loading Data from Files into BigQuery Tables
-  Getting Started with Integration between Google BigQuery and Python
-  Load Data from GCS Files into an Empty Table in Google BigQuery
-  Run Queries in Google BigQuery using Python Applications
-  Exercise to Load Data into BigQuery Tables
-  Drop Tables from Google BigQuery
-  Overview of External Tables in BigQuery
-  Create Google BigQuery External Table on GCS Files using Web UI
-  Create Google BigQuery External Table on GCS Files using Command
-  Google BigQuery External Tables using AWS S3, Azure Blob or Google Drive
-  Exercise to Create Google BigQuery External Tables
-  Overview of SQL Capabilities of Google BigQuery
-  Basic SQL Queries using Google BigQuery
-  Cumulative Aggregations using Google BigQuery
-  Compute Ranks using Google BigQuery
-  Filter based on Ranks using Google BigQuery
-  Overview of Key Integrations with Google BigQuery
-  Python Pandas Integration with Google BigQuery
-  Overview of Integration between BigQuery and RDBMS Databases
-  Validate Cloud SQL Postgres Database for BigQuery Integration
-  Create External Connections and Run External Queries from Google BigQuery
-  Running External Queries using External Connections in Google BigQuery
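
A short sketch (not the exact course code) of BigQuery integration with Python: loading Parquet files from GCS into a native table and running a query, assuming the google-cloud-bigquery library and active gcloud credentials; project, dataset, table and bucket names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-gcp-project.retail.orders"          # hypothetical dataset.table

    # Load Parquet files from GCS into a native table
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition="WRITE_TRUNCATE",
    )
    load_job = client.load_table_from_uri(
        "gs://my-datalake-bucket/bronze/orders/*.parquet", table_id, job_config=job_config
    )
    load_job.result()                                   # wait for the load job to finish
    print(client.get_table(table_id).num_rows, "rows loaded")

    # Run a query from a Python application
    query = """
        SELECT order_date, count(*) AS order_count
        FROM `my-gcp-project.retail.orders`
        GROUP BY order_date
        ORDER BY order_date
    """
    for row in client.query(query).result():
        print(row.order_date, row.order_count)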
11. Dataproc: Big Data Processing
-  Getting Started with GCP Dataproc
-  Setup Single Node Dataproc Cluster for Development
-  Validate SSH Connectivity to Master Node of Dataproc Cluster
-  Allocate Static IP to the Master Node VM of Dataproc Cluster
-  Setup VS Code Remote Window for Dataproc VM
-  Setup Workspace using VS Code on Dataproc
-  Getting Started with HDFS Commands on Dataproc
-  Recap of gsutil to manage files and folders in GCS
-  Review Data Sets setup on Dataproc Master Node VM
-  Copy Local Files into HDFS on Dataproc
-  Copy GCS Files into HDFS on Dataproc
-  Validate Pyspark CLI in Dataproc Cluster
-  Validate Spark Scala CLI in Dataproc Cluster
-  Validate Spark SQL CLI in Dataproc Cluster
12. ELT Data Pipelines using Dataproc
-  Overview of GCP Dataproc Jobs and Workflows
-  Setup JSON Dataset in GCS for Dataproc Jobs
-  Review Spark SQL Commands used for Dataproc Jobs
-  Run Dataproc Job using Spark SQL
-  Overview of Modularizing Spark SQL Applications for Dataproc
-  Review Spark SQL Scripts for Dataproc Jobs and Workflows
-  Validate Spark SQL Script for File Format Conversion
-  Exercise to convert file format using Spark SQL Script
-  Validate Spark SQL Script for Daily Product Revenue
-  Develop Spark SQL Script to Cleanup Database
-  Copy Spark SQL Scripts to GCS
-  Run and Validate Spark SQL Scripts in GCS
-  Limitations of Running Spark SQL Scripts using Dataproc Jobs
-  Manage Dataproc Clusters using gcloud Commands
-  Run Dataproc Jobs using Spark SQL Command or Query
-  Run Dataproc Jobs using Spark SQL Scripts
-  Exercises to Run Spark SQL Scripts as Dataproc Jobs using gcloud
-  Delete Dataproc Jobs using gcloud commands
-  Importance of using gcloud commands to manage dataproc jobs
-  Getting Started with Dataproc Workflow Templates using Web UI
-  Review Steps and Design to create Dataproc Workflow Template
-  Create Dataproc Workflow Template and Add Cluster using gcloud Commands
-  Review gcloud Commands to Add Jobs to Dataproc Workflow Templates
-  Add Jobs to Dataproc Workflow Template using Commands
-  Instantiate Dataproc Workflow Template to run the Data Pipeline
-  Overview of Dataproc Operations and Deleting Workflow Runs
-  Run and Validate ELT Data Pipeline using Dataproc
-  Stop Dataproc Cluster
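
As an illustration of the Spark SQL logic behind the daily product revenue pipeline in this module, here is a hypothetical PySpark script of the kind submitted as a Dataproc job (for example with gcloud dataproc jobs submit pyspark); the bucket paths, dataset layout and column names are assumptions, not the exact course files.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily_product_revenue").getOrCreate()

    # Register the JSON data sets in GCS as temporary views
    spark.read.json("gs://my-datalake-bucket/retail_db_json/orders") \
        .createOrReplaceTempView("orders")
    spark.read.json("gs://my-datalake-bucket/retail_db_json/order_items") \
        .createOrReplaceTempView("order_items")

    # Compute daily product revenue using Spark SQL
    daily_product_revenue = spark.sql("""
        SELECT o.order_date,
               oi.order_item_product_id,
               round(sum(oi.order_item_subtotal), 2) AS revenue
        FROM orders o
        JOIN order_items oi ON o.order_id = oi.order_item_order_id
        WHERE o.order_status IN ('COMPLETE', 'CLOSED')
        GROUP BY o.order_date, oi.order_item_product_id
    """)

    # Write the result back to GCS as Parquet
    daily_product_revenue.write.mode("overwrite") \
        .parquet("gs://my-datalake-bucket/gold/daily_product_revenue")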
13. Databricks: Pyspark Processing in GCP
-  Overview of Databricks on GCP
-  Signing up for Databricks on GCP
-  Create Databricks Workspace on GCP
-  Getting Started with Databricks Clusters on GCP
-  Getting Started with Databricks Notebook
-  High level architecture of Databricks
-  Setup Databricks CLI on Mac or Windows
-  Overview of Databricks CLI and other clients
-  Configure Databricks CLI on Mac or Windows
-  Troubleshoot issues to configure Databricks CLI
-  Overview of Databricks CLI Commands
-  Setup Data Repository for Data Sets
-  Setup Data Sets in DBFS using Databricks CLI Commands
-  Process Data in DBFS using Databricks Spark SQL
-  Getting Started with Spark SQL Example using Databricks
-  Create Temporary Views using Spark SQL
-  Exercise to create temporary views using Spark SQL
-  Spark SQL Query to compute Daily Product Revenue
-  Save Query Result to DBFS using Spark SQL
-  Overview of Pyspark Examples on Databricks
-  Process Schema Details in JSON using Pyspark
-  Create Dataframe with Schema from JSON File using Pyspark
-  Transform Data using Spark APIs
-  Get Schema Details for all Data Sets using Pyspark
-  Convert CSV to Parquet with Schema using Pyspark
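
A notebook-style sketch of the CSV-to-Parquet conversion idea at the end of this module, assuming a schemas.json file that maps each data set to its column definitions and that the data sets are already staged in DBFS; the paths and the JSON layout are assumptions, not the exact course files.

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()   # already provided in a Databricks notebook

    # DBFS paths are exposed on the driver under /dbfs via the FUSE mount
    with open("/dbfs/public/retail_db/schemas.json") as fp:
        schemas = json.load(fp)

    def get_columns(ds_name):
        # assumes each entry is a list of {"column_name": ..., "column_position": ...}
        details = sorted(schemas[ds_name], key=lambda col: col["column_position"])
        return [col["column_name"] for col in details]

    # Convert each CSV data set to Parquet with proper column names
    for ds_name in ["orders", "order_items", "customers"]:
        df = spark.read.csv(f"dbfs:/public/retail_db/{ds_name}", inferSchema=True) \
            .toDF(*get_columns(ds_name))
        df.write.mode("overwrite").parquet(f"dbfs:/public/retail_db_parquet/{ds_name}")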
14. ELT Pipeline using Databricks
-  Overview of Databricks Workflows
-  Pass Arguments to Databricks Python Notebooks
-  Pass Arguments to Databricks SQL Notebooks
-  Create and Run First Databricks Job
-  Run Databricks Jobs and Tasks with Parameters
-  Create and Run Orchestrated Pipeline using Databricks Job
-  Import ELT Data Pipeline Applications into Databricks Environment
-  Spark SQL Application to Cleanup Database and Datasets
-  Review File Format Converter Pyspark Code
-  Review Databricks SQL Notebooks for Tables and Final Results
-  Validate Applications for ELT Pipeline using Databricks
-  Build ELT Pipeline using Databricks Job in Workflows
-  Run and Review Execution details of ELT Data Pipeline using Databricks Job
-  Cleanup Databricks Environment on GCP
15. Integration of Spark on Dataproc and BigQuery
-  Review Development Environment with VS Code using Dataproc Cluster
-  Validate Google BigQuery Integration with Python on Dataproc
-  Setup Native Tables in Google BigQuery
-  Review Spark Google BigQuery Connector
-  Integration of Spark on Dataproc and BigQuery using Pyspark CLI
-  Integration of Spark on Dataproc and BigQuery using Notebook
-  Review Design of Data Pipeline using Spark and BigQuery
-  Review Spark Applications to compute daily product revenue
-  Create Table for Daily Product Revenue in Google BigQuery
-  Validate Parquet Files for Daily Product Revenue in GCS
-  Develop Logic to Save Daily Product Revenue to BigQuery Table
-  Reset Daily Product Revenue Table in Google BigQuery
-  Review Spark Application Code to Write to BigQuery Table
-  Submit Spark Application with BigQuery Integration using Client Mode
-  Submit Spark Application with BigQuery Integration using Cluster Mode
-  Deploy Spark Application with BigQuery Integration in GCS
-  Switching to Local Development Environment from Dataproc
-  Run Spark Application as Dataproc Job using Web UI
-  Run Spark Application as Dataproc Job using Command
-  Review Dataproc Jobs and Spark Application using Dataproc UI
-  Overview of Orchestration using Dataproc Commands for Spark Applications
-  Overview of ELT Pipeline using Dataproc Workflows
-  Create Workflow Template with Spark SQL Applications
-  Add Pyspark Application to Dataproc Workflow Template
-  Run Dataproc Workflow Template using Dataproc Command
-  Update Job Properties in Dataproc Workflow Template
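
A minimal sketch of writing Spark output to BigQuery with the Spark BigQuery connector discussed in this module (the connector jar ships with Dataproc); the project, dataset, table and bucket names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily_product_revenue_to_bq").getOrCreate()

    # Read the Parquet output produced by the earlier pipeline step
    daily_product_revenue = spark.read.parquet(
        "gs://my-datalake-bucket/gold/daily_product_revenue"
    )

    # Write the dataframe to a BigQuery table via the Spark BigQuery connector
    daily_product_revenue.write \
        .format("bigquery") \
        .option("table", "my-gcp-project.retail.daily_product_revenue") \
        .option("temporaryGcsBucket", "my-datalake-bucket") \
        .mode("overwrite") \
        .save()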
16. Dataflow (Apache Beam Development)
-  Introduction to Dataflow
-  Use cases for Dataflow in real-time analytics and ETL
-  Understanding the difference between Apache Spark and Apache Beam
-  How Dataflow is different from Dataproc
-  Building Data Pipelines with Apache Beam
    - Writing Apache Beam pipelines for batch and stream processing
    - Custom pipelines and pre-defined pipelines
    - Transformations and windowing concepts
-  Integration with Other GCP Services
    - Integrating Dataflow with BigQuery, Pub/Sub, and other GCP services
    - Real-time analytics and visualization using Dataflow and BigQuery
    - Workflow orchestration with Composer
-  End to End Streaming Pipeline using Apache Beam with Dataflow, Python app, Pub/Sub, BigQuery, GCS
-  Template method of creating pipelines
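
To give a flavour of the Apache Beam development covered here, a minimal batch pipeline sketch: read JSON order records from GCS, filter them, and write to BigQuery. The bucket, table and schema string are placeholders; switching the runner to DataflowRunner executes the same pipeline on Dataflow.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DirectRunner",               # use "DataflowRunner" to run on Dataflow
        project="my-gcp-project",
        temp_location="gs://my-datalake-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadOrders" >> beam.io.ReadFromText("gs://my-datalake-bucket/retail_db_json/orders/*")
            | "ParseJson" >> beam.Map(json.loads)
            | "CompletedOnly" >> beam.Filter(lambda o: o["order_status"] == "COMPLETE")
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-gcp-project:retail.orders_complete",
                schema="order_id:INTEGER,order_date:STRING,"
                       "order_customer_id:INTEGER,order_status:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )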
17. Cloud Pub/Sub
-  Introduction to Pub/Sub
-  Understanding the role of Pub/Sub in event-driven architectures
-  Key Pub/Sub concepts: topics, subscriptions, messages, and acknowledgments
-  Creating and Managing Topics and Subscriptions
    - Using the GCP Console to create Pub/Sub topics and subscriptions
    - Configuring message retention policies and acknowledgment settings
-  Publishing and Consuming Messages
    - Writing and deploying code to publish messages to a topic
    - Implementing subscribers to consume and process messages from subscriptions
-  Integration with Other GCP Services
    - Connecting Pub/Sub with Cloud Functions for serverless event-driven computing
    - Integrating Pub/Sub with Dataflow for real-time stream processing
-  Streaming use case using Dataflow
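
A minimal publish-and-pull sketch using the google-cloud-pubsub client, assuming the topic and subscription already exist; the project, topic and subscription names are placeholders.

    import json
    from google.cloud import pubsub_v1

    PROJECT_ID = "my-gcp-project"

    # Publish a message to a topic
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, "orders-topic")
    future = publisher.publish(
        topic_path, json.dumps({"order_id": 1, "order_status": "COMPLETE"}).encode("utf-8")
    )
    print("published message id:", future.result())

    # Pull messages from a subscription and acknowledge them
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, "orders-sub")
    response = subscriber.pull(subscription=subscription_path, max_messages=10)
    for msg in response.received_messages:
        print(msg.message.data.decode("utf-8"))
    if response.received_messages:
        subscriber.acknowledge(
            subscription=subscription_path,
            ack_ids=[m.ack_id for m in response.received_messages],
        )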
 
18. Google Cloud Composer: For Data Pipeline Orchestration
-  Orchestration & Workflow Management and DAG Creation
-  Cloud Composer (Airflow on GCP) – DAGs, operators, scheduling pipelines
-  Integration with Dataflow, Dataproc, BigQuery
-  Workflow automation with Cloud Functions & Workflows
-  Create Airflow or Cloud Composer Environment
-  Review Google Cloud Composer Environment
-  Development Process of Airflow DAGs for Cloud Composer
-  Install Required Dependencies for Development of Airflow DAGs
-  Run Airflow Commands in Cloud Composer using gcloud
-  Overview of Airflow Architecture
-  Deploy and Run First Airflow DAG in Google Cloud Composer Environment
-  Understand Relationship between Python Scripts and Airflow DAGs
-  Code Review of Airflow DAGs and Tasks
-  Overview of Airflow Dataproc Operators
-  Review Airflow DAG with GCP Dataproc Workflow Template Operator
-  Deploy and Run GCP Dataproc Workflow using Airflow
-  Using Variables in Airflow DAGs
-  Deploy and Run Airflow DAGs with Variables
-  Overview of Data Pipeline using Cloud Composer and Dataproc Jobs
-  Review the Spark Applications related to the Data Pipeline
-  Review Airflow DAG for Orchestrated Pipeline using Dataproc Jobs
-  Deploy Data Pipeline or Airflow DAG using Dataproc Jobs
-  Review Source and Target before Deployment of Airflow DAG
-  Deploy and Run Airflow DAG with Dataproc Jobs
-  Differences Between Dataproc Workflows and Airflow DAGs
-  Cleanup Cloud Composer Environment and Dataproc Cluster
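
For illustration, a minimal Airflow DAG of the kind deployed to Cloud Composer in this module: it instantiates an existing Dataproc workflow template via the Google provider operator. The DAG id, template id, project and region are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocInstantiateWorkflowTemplateOperator,
    )

    with DAG(
        dag_id="daily_product_revenue_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Trigger an existing Dataproc workflow template (the ELT pipeline)
        run_workflow = DataprocInstantiateWorkflowTemplateOperator(
            task_id="run_dataproc_workflow",
            template_id="wf-daily-product-revenue",    # hypothetical template id
            project_id="my-gcp-project",
            region="us-central1",
        )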
19. Data Fusion
-  Introduction to Data Fusion
    - Overview of Data Fusion as a fully managed data integration service
    - Use cases for Data Fusion in ETL and data migration
-  Building Data Integration Pipelines
    - Creating ETL pipelines using the visual interface
    - Configuring data sources, transformations, and sinks
    - Using pre-built templates for common integration scenarios
-  Integration with GCP and External Services
    - Integrating Data Fusion with BigQuery, Cloud Storage, and other GCP services
-  End to End pipeline using Data Fusion with Wrangler, GCS, BigQuery
20. Cloud Functions
-  Cloud Functions Introduction
-  Setting up Cloud Functions in GCP
-  Event-driven architecture and use cases
-  Writing and deploying Cloud Functions
-  Triggering Cloud Functions:
    - HTTP triggers
    - Pub/Sub triggers
    - Cloud Storage triggers
-  Monitoring and logging Cloud Functions
-  Use case 1: Loading files from GCS into BigQuery as soon as they are uploaded
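
A sketch of the idea behind Use case 1, written as a first-generation background Cloud Function triggered by a Cloud Storage finalize event; the destination table and the CSV layout assumptions are placeholders.

    from google.cloud import bigquery

    TABLE_ID = "my-gcp-project.retail.orders"          # hypothetical destination table

    def gcs_to_bigquery(event, context):
        """Triggered by a google.storage.object.finalize event on a bucket."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition="WRITE_APPEND",
        )
        # Load the newly uploaded file straight into BigQuery
        client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()
        print(f"Loaded {uri} into {TABLE_ID}")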
21. Terraform
-  Terraform Introduction
-  Installing and configuring Terraform
-  Infrastructure Provisioning
-  Terraform basic commands
    - init, plan, apply, destroy
-  Create Resources in Google Cloud Platform
    - GCS buckets
    - Dataproc cluster
    - BigQuery datasets and tables
    - And more resources as needed
22. Data Pipelines using DBT, Airflow and BigQuery
-  Overview of the Data Landscape of a Large Enterprise
-  DBT High Level Architecture
-  Overview of DBT Cloud Features and DBT Adapters
-  Airflow and DBT Pipeline Patterns
-  Pre-requisites for Dev Environment using Airflow and DBT
-  Setup Astro CLI on Windows or Mac
-  Setup Workspace using VSCode
-  Setup Local Airflow Environment using Astro CLI
-  Setup Python Virtual Environment with Airflow
-  Overview of Airflow Providers
-  Manage Local Airflow Containers using Astro CLI
-  Connect to Airflow Containers and Review Logs using Astro CLI
-  Setup Datasets for Airflow Pipelines or DAGs
-  Setup GCS Bucket and Upload Data Set
-  Getting Started with Google BigQuery
-  Create External Table using Google BigQuery
-  Create GCP Service Account and Download Credentials
-  Getting Started with DBT Cloud
-  Setup DBT Cloud Project for Google BigQuery
-  Review and Run Example DBT Pipeline using DBT Cloud
-  Validate Google BigQuery Objects created by DBT Pipeline
-  Overview of ELT Pipeline using DBT and Google BigQuery
-  Change the DBT Project Structure from example
-  Create Models for Orders and Order Items
-  Define Denormalized Model for Order Details
-  Query to compute daily product revenue
-  Add Model for Daily Product Revenue
-  Create and Run DBT Cloud Job
-  Validate Airflow and Review DBT Cloud Provider
-  Install Airflow DBT Cloud Provider
-  Overview of End to End Orchestrated Data Pipeline using Airflow
-  Create DBT Cloud Connection in Airflow
-  Create DBT Job Variables in Airflow
-  Develop Airflow DAG to trigger DBT Cloud Job
-  Deploy Airflow DAG with DBT Cloud
-  Run Airflow DAG with DBT Cloud Job
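
A minimal sketch of the final orchestration step: an Airflow DAG that triggers a DBT Cloud job through the dbt Cloud provider's operator, assuming the Airflow connection and DBT Cloud job described above have been configured; the connection id and job id here are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

    with DAG(
        dag_id="dbt_daily_product_revenue",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Trigger the DBT Cloud job and wait for it to finish
        run_dbt_job = DbtCloudRunJobOperator(
            task_id="trigger_dbt_cloud_job",
            dbt_cloud_conn_id="dbt_cloud_default",     # Airflow connection created earlier
            job_id=12345,                              # hypothetical DBT Cloud job id
            wait_for_termination=True,
        )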
23. Machine Learning in GCP (for Data Engineers)
-  Overview of AI/ML on GCP
-  BigQuery ML – building ML models directly in SQL
-  Vertex AI basics – training & deploying ML models
-  Pipelines for ML (Vertex AI Pipelines, Kubeflow)
24. Security, Monitoring & Governance
-  Data encryption (at rest, in transit, CMEK vs Google-managed keys)
-  IAM roles for data services
-  VPC Service Controls for data security
-  Cloud Logging, Cloud Monitoring, and Cloud Trace
-  Data Catalog for metadata management & lineage
-  DLP (Data Loss Prevention) for sensitive data
25. Real-World GCP Data Engineering Scenarios
-  Building a streaming pipeline (Pub/Sub → Dataflow → BigQuery → Looker)
-  Building a batch pipeline (Cloud Storage → Dataproc → BigQuery)
-  Data migration from on-prem to GCP
-  Designing a hybrid data lakehouse (BigQuery + Dataplex + GCS)
-  Project flow