GCP DATA ENGINEERING Course Details

Subscribe and Access: 5200+ FREE Videos and 21+ Subjects like CRT, Soft Skills, JAVA, Hadoop, Microsoft .NET, Testing Tools, etc.

Batch Date: Sept 13th & 14th @ 4:00 PM

Faculty: Mr. N. Vijay Sunder Sagar (20+ Yrs of Exp.)

Duration: 10 Weekends

Venue :
DURGA SOFTWARE SOLUTIONS,
Flat No : 202, 2nd Floor,
HUDA Maitrivanam,
Ameerpet, Hyderabad - 500038

Ph.No: +91 - 8885252627, 9246212143, 80 96 96 96 96

Syllabus:

GCP Data Engineering

Module 1: GCP Data Engineering Fundamentals

Module 2: Google Cloud Storage (GCS)

Module 3: Cloud SQL – Setting up a Database

Module 4: BigQuery – Building a Data Warehouse

Module 5: Dataproc – Big Data Processing

Module 6: Databricks – PySpark Processing

Module 7: Dataflow – Apache Beam Development

Module 8: Google Cloud Composer – Orchestration

Module 9: Data Fusion – Data Integration Service

Module 10: DBT, Airflow and Terraform

1. Introduction to Google Cloud Platform

  • Overview of cloud platforms
  • GCP services and regions
  • IAM (Identity & Access Management) basics
  • Resource hierarchy (organization, folders, projects)
  • Billing & cost management

2. GCP Data Engineering Fundamentals

  • Data engineering roles & responsibilities
  • Batch vs real-time data processing
  • Data lake vs data warehouse vs data marts
  • ETL vs ELT

3. Getting Started with GCP

  • Signing up for a GCP Account
  • Create a New Google Account using a Non-Gmail Id
  • Sign up for GCP using Google Account
  • Overview of GCP Credits
  • Overview of GCP Project and Billing
  • Overview of Google Cloud Shell
  • Install Google Cloud SDK on Windows
  • Initialize gcloud CLI using GCP Project
  • Reinitialize Google Cloud Shell with Project id
  • Overview of Analytics Services on GCP
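
Once the gcloud CLI is initialized, a quick way to confirm that credentials and the active project are wired up is a short Python check. A minimal sketch, assuming the google-auth library is installed and Application Default Credentials are configured (e.g. via "gcloud auth application-default login"):

    # Sanity check for GCP credentials and the active project.
    # Assumes: pip install google-auth, and ADC configured via
    # "gcloud auth application-default login".
    import google.auth

    credentials, project_id = google.auth.default()
    print(f"Authenticated against project: {project_id}")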

4. Storage & Databases in GCP

  • Cloud Storage – buckets, lifecycle, versioning
  • BigQuery – datasets, tables, partitioning, clustering, query optimization
  • Cloud SQL & Cloud Spanner – relational databases
  • Firestore & Bigtable – NoSQL databases
  • Data modeling best practices

5. Google Cloud Storage (GCS): Setting up a Data Lake using GCS

  • Getting Started with Google Cloud Storage or GCS
  • Overview of Google Cloud Storage or GCS Web UI
  • Create GCS Bucket using GCP Web UI
  • Upload Folders and Files into GCS Bucket using GCP Web UI
  • Review GCS Buckets and Objects using gsutil commands
  • Delete GCS Bucket using Web UI
  • Setup Data Repository in Google Cloud Shell
  • Overview of Data Sets
  • Managing Buckets in GCS using gsutil
  • Copy Data Sets into GCS using gsutil
  • Cleanup Buckets in GCS using gsutil
  • Exercise to Manage Buckets and Files in GCS using gsutil
  • Overview of Setting up Data Lake using GCS
  • Setup Google Cloud Libraries in Python Virtual Environment
  • Setup Bucket and Files in GCS using gsutil
  • Getting Started to manage files in GCS using Python
  • Setup Credentials for Python and GCS Integration
  • Review Methods in Google Cloud Storage Python library
  • Get GCS Bucket Details using Python
  • Manage Blobs or Files in GCS using Python
  • Project Problem Statement to Manage Files in GCS using Python
  • Design to Upload multiple files into GCS using Python
  • Get File Names to upload into GCS using Python glob and os
  • Upload all Files to GCS as blobs using Python
  • Validate Files or Blobs in GCS using Python
  • Overview of Processing Data in GCS using Pandas
  • Convert Data to Parquet and Write to GCS using Pandas
  • Design to Upload multiple files into GCS using Pandas
  • Get File Names to upload into GCS using Python glob and os
  • Overview of Parquet File Format and Schemas JSON File
  • Get Column Names for Dataset using Schemas JSON File
  • Upload all Files to GCS as Parquet using Pandas
  • Perform Validation of Files Copied using Pandas
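
The hands-on topics above center on the google-cloud-storage client and pandas. A minimal sketch of the upload-and-convert flow, assuming the google-cloud-storage, pandas, pyarrow, and gcsfs packages are installed; the bucket and file names are illustrative, not from the course material:

    # Upload local files to GCS as blobs, then write a Parquet copy via pandas.
    import glob
    import os

    import pandas as pd
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-data-lake-bucket")

    # Upload every file under the local orders folder as a blob
    for path in glob.glob("data/retail_db/orders/*"):
        blob = bucket.blob(f"retail_db/orders/{os.path.basename(path)}")
        blob.upload_from_filename(path)

    # Read CSV data with explicit column names, convert to Parquet in GCS
    columns = ["order_id", "order_date", "order_customer_id", "order_status"]
    df = pd.read_csv("data/retail_db/orders/part-00000", names=columns)
    df.to_parquet("gs://my-data-lake-bucket/retail_db_parquet/orders/part-00000.snappy.parquet")

    # Validate: list the blobs that landed in the bucket
    for blob in client.list_blobs("my-data-lake-bucket", prefix="retail_db"):
        print(blob.name, blob.size)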

6. Cloud SQL: Setting up a Postgres Database using Cloud SQL

  • Overview of GCP Cloud SQL
  • Setup Postgres Database Server using GCP Cloud SQL
  • Configure Network for Cloud SQL Postgres Database
  • Install Postgres 14 on Windows 11
  • Getting Started with pgAdmin on Windows
  • Getting Started with pgAdmin on Mac
  • Validate Client Tools for Postgres on Mac or PC
  • Setup Database in GCP Cloud SQL Postgres Database Server
  • Setup Tables in GCP Cloud SQL Postgres Database
  • Validate Data in GCP Cloud SQL Postgres Database Table
  • Integration of GCP Cloud SQL Postgres with Python
  • Overview of Integration of GCP Cloud SQL Postgres with Pandas
  • Read Data From Files to Pandas Data Frame
  • Process Data using Pandas Dataframe APIs
  • Write Pandas Dataframe into Postgres Database Table
  • Validate Data in Postgres Database Tables using Pandas
  • Getting Started with Secrets using GCP Secret Manager
  • Configure Access to GCP Secret Manager via IAM Roles
  • Install Google Cloud Secret Manager Python Library
  • Get Secret Details from GCP Secret Manager using Python
  • Connect to Database using Credentials from Secret Manager
  • Stop GCP Cloud SQL Postgres Database Server
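
A condensed sketch of the Secret Manager and Postgres integration covered above, assuming the google-cloud-secret-manager, pandas, and psycopg2 packages are installed; the project, secret, and table names are placeholders:

    # Fetch DB credentials from Secret Manager, then query Cloud SQL Postgres.
    import json

    import pandas as pd
    import psycopg2
    from google.cloud import secretmanager

    sm_client = secretmanager.SecretManagerServiceClient()
    secret_name = "projects/my-project/secrets/pg-credentials/versions/latest"
    response = sm_client.access_secret_version(request={"name": secret_name})
    creds = json.loads(response.payload.data.decode("utf-8"))  # assumed JSON secret

    conn = psycopg2.connect(
        host=creds["host"],
        database=creds["database"],
        user=creds["user"],
        password=creds["password"],
    )
    df = pd.read_sql("SELECT * FROM orders LIMIT 10", conn)
    print(df)
    conn.close()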

7. Data Ingestion & Integration

  • Pub/Sub – message queues, subscriptions, streaming data ingestion
  • Dataflow (Apache Beam) – batch & stream data pipelines
  • Dataproc (Hadoop/Spark on GCP) – ETL with Spark/Hive
  • Transfer Service & Storage Transfer – on-prem to cloud data movement
  • APIs, connectors, and partner ETL tools (Informatica, Fivetran, etc.)

8. Data Processing & Transformation

  • Designing pipelines with Dataflow
  • Streaming analytics with Pub/Sub + Dataflow
  • Data transformations using Dataproc (Spark/Presto/Hive)
  • BigQuery transformations (SQL-based ELT)
  • Using Dataprep (Trifacta) for no-code data wrangling

9. Data Warehousing & Analytics

  • BigQuery:
    • Schemas, partitioning, clustering
    • Optimization (slots, caching, materialized views)
    • Federated queries (Cloud Storage, Bigtable, Sheets)
    • BigQuery ML basics
  • Designing star/snowflake schemas
  • Analytics & BI integration (Looker, Data Studio)
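
Partitioning and clustering are the main BigQuery optimization levers in this module. An illustrative DDL statement run through the Python client; the dataset, table, and column names are made up for the example:

    # Create a date-partitioned, clustered BigQuery table via SQL DDL.
    from google.cloud import bigquery

    client = bigquery.Client()
    ddl = """
    CREATE TABLE IF NOT EXISTS retail.daily_product_revenue (
      order_date DATE,
      product_id INT64,
      order_status STRING,
      revenue NUMERIC
    )
    PARTITION BY order_date
    CLUSTER BY product_id, order_status
    """
    client.query(ddl).result()  # blocks until the DDL job finishes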

10. BigQuery: Building a Data Warehouse

  • Overview of Google BigQuery
  • Getting Started with Google BigQuery
  • Overview of CRUD Operations in Google BigQuery
  • Merge or Upsert into Google BigQuery Tables
  • Create Dataset and Tables in Google BigQuery using UI
  • Create Table in Google BigQuery using Command
  • Exercise to create tables in Google BigQuery
  • Overview of Loading Data from Files into BigQuery Tables
  • Getting Started with Integration between Google BigQuery and Python
  • Load Data from GCS Files into an Empty Table in Google BigQuery
  • Run Queries in Google BigQuery using Python Applications
  • Exercise to Load Data into BigQuery Tables
  • Drop Tables from Google BigQuery
  • Overview of External Tables in BigQuery
  • Create Google BigQuery External Table on GCS Files using Web UI
  • Create Google BigQuery External Table on GCS Files using Command
  • Google BigQuery External Tables using AWS S3, Azure Blob Storage, or Google Drive
  • Exercise to Create Google BigQuery External Tables
  • Overview of SQL Capabilities of Google BigQuery
  • Basic SQL Queries using Google BigQuery
  • Cumulative Aggregations using Google BigQuery
  • Compute Ranks using Google BigQuery
  • Filter based on Ranks using Google BigQuery
  • Overview of Key Integrations with Google BigQuery
  • Python Pandas Integration with Google BigQuery
  • Overview of Integration between BigQuery and RDBMS Databases
  • Validate Cloud SQL Postgres Database for BigQuery Integration
  • Create External Connections and Run External Queries from Google BigQuery
  • Running External Queries using External Connections in Google BigQuery
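
A compact sketch of the load-and-query workflow from this module, using the google-cloud-bigquery client; the URIs and table names are illustrative:

    # Load Parquet files from GCS into a BigQuery table, then filter on ranks.
    from google.cloud import bigquery

    client = bigquery.Client()

    load_job = client.load_table_from_uri(
        "gs://my-data-lake-bucket/retail_db_parquet/orders/*",
        "my-project.retail.orders",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            write_disposition="WRITE_TRUNCATE",
        ),
    )
    load_job.result()  # wait for the load to finish

    # Filter based on ranks using a window function, as covered above
    query = """
    SELECT * FROM (
      SELECT order_status,
             COUNT(*) AS order_count,
             RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk
      FROM retail.orders
      GROUP BY order_status
    ) WHERE rnk <= 3
    """
    for row in client.query(query).result():
        print(dict(row))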

11. Dataproc: Big Data Processing

  • Getting Started with GCP Dataproc
  • Setup Single Node Dataproc Cluster for Development
  • Validate SSH Connectivity to Master Node of Dataproc Cluster
  • Allocate Static IP to the Master Node VM of Dataproc Cluster
  • Setup VS Code Remote Window for Dataproc VM
  • Setup Workspace using VS Code on Dataproc
  • Getting Started with HDFS Commands on Dataproc
  • Recap of gsutil to manage files and folders in GCS
  • Review Data Sets setup on Dataproc Master Node VM
  • Copy Local Files into HDFS on Dataproc
  • Copy GCS Files into HDFS on Dataproc
  • Validate PySpark CLI in Dataproc Cluster
  • Validate Spark Scala CLI in Dataproc Cluster
  • Validate Spark SQL CLI in Dataproc Cluster
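
Validating the PySpark CLI on the Dataproc master node usually comes down to reading a small dataset and checking the result. A minimal sketch to run inside the pyspark shell, where the "spark" session is pre-created; the GCS path is illustrative:

    # Inside the pyspark shell on the Dataproc master node.
    # Dataproc reads gs:// URIs directly via the built-in GCS connector.
    df = spark.read.csv(
        "gs://my-data-lake-bucket/retail_db/orders",
        inferSchema=True,
    ).toDF("order_id", "order_date", "order_customer_id", "order_status")

    df.printSchema()
    print(df.count())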

12. ELT Data Pipelines using Dataproc

  • Overview of GCP Dataproc Jobs and Workflow
  • Setup JSON Dataset in GCS for Dataproc Jobs
  • Review Spark SQL Commands used for Dataproc Jobs
  • Run Dataproc Job using Spark SQL
  • Overview of Modularizing Spark SQL Applications for Dataproc
  • Review Spark SQL Scripts for Dataproc Jobs and Workflows
  • Validate Spark SQL Script for File Format Conversion
  • Exercise to convert file format using Spark SQL Script
  • Validate Spark SQL Script for Daily Product Revenue
  • Develop Spark SQL Script to Cleanup Database
  • Copy Spark SQL Scripts to GCS
  • Run and Validate Spark SQL Scripts in GCS
  • Limitations of Running Spark SQL Scripts using Dataproc Jobs
  • Manage Dataproc Clusters using gcloud Commands
  • Run Dataproc Jobs using Spark SQL Command or Query
  • Run Dataproc Jobs using Spark SQL Scripts
  • Exercises to Run Spark SQL Scripts as Dataproc Jobs using gcloud
  • Delete Dataproc Jobs using gcloud commands
  • Importance of using gcloud commands to manage dataproc jobs
  • Getting Started with Dataproc Workflow Templates using Web UI
  • Review Steps and Design to create Dataproc Workflow Template
  • Create Dataproc Workflow Template and Add Cluster using gcloud Commands
  • Review gcloud Commands to Add Jobs to Dataproc Workflow Templates
  • Add Jobs to Dataproc Workflow Template using Commands
  • Instantiate Dataproc Workflow Template to run the Data Pipeline
  • Overview of Dataproc Operations and Deleting Workflow Runs
  • Run and Validate ELT Data Pipeline using Dataproc
  • Stop Dataproc Cluster
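
The Spark SQL scripts in this module are plain .sql files submitted with gcloud dataproc jobs. As a sketch of what such a file-format-conversion step does, here is a PySpark rendering (not the course's exact script; the database name and GCS paths are illustrative):

    # PySpark equivalent of the file format conversion Spark SQL script.
    # Submit with e.g.: gcloud dataproc jobs submit pyspark convert.py --cluster=...
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("File Format Converter").getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS retail")
    spark.sql("DROP TABLE IF EXISTS retail.orders")
    spark.sql("""
        CREATE TABLE retail.orders USING PARQUET AS
        SELECT _c0 AS order_id, _c1 AS order_date,
               _c2 AS order_customer_id, _c3 AS order_status
        FROM csv.`gs://my-data-lake-bucket/retail_db/orders`
    """)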

13. Databricks: PySpark Processing in GCP

  • Overview of Databricks on GCP
  • Signing up for Databricks on GCP
  • Create Databricks Workspace on GCP
  • Getting Started with Databricks Clusters on GCP
  • Getting Started with Databricks Notebook
  • High level architecture of Databricks
  • Setup Databricks CLI on Mac or Windows
  • Overview of Databricks CLI and other clients
  • Configure Databricks CLI on Mac or Windows
  • Troubleshoot issues to configure Databricks CLI
  • Overview of Databricks CLI Commands
  • Setup Data Repository for Data Sets
  • Setup Data Sets in DBFS using Databricks CLI Commands
  • Process Data in DBFS using Databricks Spark SQL
  • Getting Started with Spark SQL Example using Databricks
  • Create Temporary Views using Spark SQL
  • Exercise to create temporary views using Spark SQL
  • Spark SQL Query to compute Daily Product Revenue
  • Save Query Result to DBFS using Spark SQL
  • Overview of PySpark Examples on Databricks
  • Process Schema Details in JSON using PySpark
  • Create Dataframe with Schema from JSON File using PySpark
  • Transform Data using Spark APIs
  • Get Schema Details for all Data Sets using PySpark
  • Convert CSV to Parquet with Schema using PySpark
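
A sketch of the CSV-to-Parquet conversion driven by a schemas JSON file, in the spirit of the topics above; the schemas.json layout shown here is an assumption, not a prescribed format:

    # Databricks notebook sketch ("spark" is provided by the runtime).
    # Assumed JSON layout:
    # {"orders": [{"column_name": "order_id", "column_position": 1}, ...]}
    import json

    with open("/dbfs/public/retail_db/schemas.json") as fp:
        schemas = json.load(fp)

    columns = [
        col["column_name"]
        for col in sorted(schemas["orders"], key=lambda col: col["column_position"])
    ]

    df = spark.read.csv("dbfs:/public/retail_db/orders", inferSchema=True).toDF(*columns)
    df.write.mode("overwrite").parquet("dbfs:/public/retail_db_parquet/orders")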

14. ELT Pipeline using Databricks

  • Overview of Databricks Workflows
  • Pass Arguments to Databricks Python Notebooks
  • Pass Arguments to Databricks SQL Notebooks
  • Create and Run First Databricks Job
  • Run Databricks Jobs and Tasks with Parameters
  • Create and Run Orchestrated Pipeline using Databricks Job
  • Import ELT Data Pipeline Applications into Databricks Environment
  • Spark SQL Application to Cleanup Database and Datasets
  • Review File Format Converter PySpark Code
  • Review Databricks SQL Notebooks for Tables and Final Results
  • Validate Applications for ELT Pipeline using Databricks
  • Build ELT Pipeline using Databricks Job in Workflows
  • Run and Review Execution details of ELT Data Pipeline using Databricks Job
  • Cleanup Databricks Environment on GCP
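
Passing arguments into Databricks notebooks, as covered above, is typically done with widgets. A minimal Python notebook cell; the parameter name and paths are illustrative:

    # Declare a widget with a default value, then read the value supplied
    # by the Databricks job/task parameters ("dbutils" is provided by Databricks).
    dbutils.widgets.text("env", "dev")

    env = dbutils.widgets.get("env")
    base_dir = f"dbfs:/public/retail_db_{env}"
    print(f"Running pipeline against: {base_dir}")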

15. Integration of Spark on Dataproc and BigQuery

  • Review Development Environment with VS Code using Dataproc Cluster
  • Validate Google BigQuery Integration with Python on Dataproc
  • Setup Native Tables in Google BigQuery
  • Review Spark Google BigQuery Connector
  • Integration of Spark on Dataproc and BigQuery using PySpark CLI
  • Integration of Spark on Dataproc and BigQuery using Notebook
  • Review Design of Data Pipeline using Spark and BigQuery
  • Review Spark Applications to compute daily product revenue
  • Create Table for Daily Product Revenue in Google BigQuery
  • Validate Parquet Files for Daily Product Revenue in GCS
  • Develop Logic to Save Daily Product Revenue to BigQuery Table
  • Reset Daily Product Revenue Table in Google BigQuery
  • Review Spark Application Code to Write to BigQuery Table
  • Submit Spark Application with BigQuery Integration using Client Mode
  • Submit Spark Application with BigQuery Integration using Cluster Mode
  • Deploy Spark Application with BigQuery Integration in GCS
  • Switching to Local Development Environment from Dataproc
  • Run Spark Application as Dataproc Job using Web UI
  • Run Spark Application as Dataproc Job using Command
  • Review Dataproc Jobs and Spark Application using Dataproc UI
  • Overview of Orchestration of Spark Applications using Dataproc Commands
  • Overview of ELT Pipeline using Dataproc Workflows
  • Create Workflow Template with Spark SQL Applications
  • Add PySpark Application to Dataproc Workflow Template
  • Run Dataproc Workflow Template using Dataproc Command
  • Update Job Properties in Dataproc Workflow Template
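
The Spark-BigQuery connector writes through a temporary GCS bucket. A minimal write sketch matching the daily product revenue flow above; the table and bucket names are placeholders:

    # Write a Spark DataFrame to BigQuery using the spark-bigquery connector
    # (pre-installed on Dataproc images).
    daily_product_revenue = spark.read.parquet(
        "gs://my-data-lake-bucket/retail_gold/daily_product_revenue"
    )

    daily_product_revenue.write \
        .format("bigquery") \
        .option("table", "my-project.retail.daily_product_revenue") \
        .option("temporaryGcsBucket", "my-temp-bucket") \
        .mode("overwrite") \
        .save()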

16. Dataflow (Apache Beam Development)

  • Introduction to Dataflow
  • Use cases for Dataflow in real-time analytics and ETL
  • Understanding the difference between Apache Spark and Apache Beam
  • How Dataflow differs from Dataproc
  • Building Data Pipelines with Apache Beam
    • Writing Apache Beam pipelines for batch and stream processing
    • Custom pipelines and pre-defined pipelines
    • Transformations and windowing concepts
  • Integration with Other GCP Services
    • Integrating Dataflow with BigQuery, Pub/Sub, and other GCP services
    • Real-time analytics and visualization using Dataflow and BigQuery
    • Workflow orchestration with Composer
  • End-to-end streaming pipeline using Apache Beam with Dataflow, a Python app, Pub/Sub, BigQuery, and GCS
  • Template method of creating pipelines
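
A skeletal Apache Beam pipeline of the kind this module builds: it runs locally on the DirectRunner and moves to Dataflow by switching the runner (plus project, region, and temp_location options). The bucket, table, and schema are illustrative:

    # Batch Beam pipeline: read CSV lines from GCS, reshape, write to BigQuery.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_order(line):
        order_id, _order_date, _customer_id, status = line.split(",")
        return {"order_id": int(order_id), "order_status": status}

    options = PipelineOptions(runner="DirectRunner")
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-data-lake-bucket/retail_db/orders/part-00000")
            | "Parse" >> beam.Map(parse_order)
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:retail.orders_slim",
                schema="order_id:INTEGER,order_status:STRING",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )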

17. Cloud Pub/Sub

  • Introduction to Pub/Sub
  • Understanding the role of Pub/Sub in event-driven architectures.
  • Key Pub/Sub concepts: topics, subscriptions, messages, and acknowledgments.
  • Creating and Managing Topics and Subscriptions
    • Using the GCP Console to create Pub/Sub topics and subscriptions.
    • Configuring message retention policies and acknowledgment settings.
  • Publishing and Consuming Messages
    • Writing and deploying code to publish messages to a topic.
    • Implementing subscribers to consume and process messages from subscriptions.
  • Integration with Other GCP Services
    • Connecting Pub/Sub with Cloud Functions for serverless event-driven computing.
    • Integrating Pub/Sub with Dataflow for real-time stream processing.
  • Streaming use-case using Dataflow
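
The publish/consume loop at the heart of this module, sketched with the google-cloud-pubsub client; the project, topic, and subscription ids are placeholders:

    # Publish a message to a topic, then pull and acknowledge it.
    from google.cloud import pubsub_v1

    project_id = "my-project"

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, "orders-topic")
    future = publisher.publish(topic_path, b'{"order_id": 1}', source="demo")
    print("Published message id:", future.result())

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(project_id, "orders-sub")
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
    for msg in response.received_messages:
        print(msg.message.data)
        subscriber.acknowledge(
            request={"subscription": sub_path, "ack_ids": [msg.ack_id]}
        )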

18. Google Cloud Composer: Data Pipeline Orchestration

  • Orchestration & Workflow Management and DAG Creation
  • Cloud Composer (Airflow on GCP) – DAGs, operators, scheduling pipelines
  • Integration with Dataflow, Dataproc, BigQuery
  • Workflow automation with Cloud Functions & Workflows
  • Create Airflow or Cloud Composer Environment
  • Review Google Cloud Composer Environment
  • Development Process of Airflow DAGs for Cloud Composer
  • Install Required Dependencies for Development of Airflow DAGs
  • Run Airflow Commands in Cloud Composer using gcloud
  • Overview of Airflow Architecture
  • Deploy and Run First Airflow DAG in Google Cloud Composer Environment
  • Understand Relationship between Python Scripts and Airflow DAGs
  • Code Review of Airflow DAGs and Tasks
  • Overview of Airflow Dataproc Operators
  • Review Airflow DAG with GCP Dataproc Workflow Template Operator
  • Deploy and Run GCP Dataproc Workflow using Airflow
  • Using Variables in Airflow DAGs
  • Deploy and Run Airflow DAGs with Variables
  • Overview of Data Pipeline using Cloud Composer and Dataproc Jobs
  • Review the Spark Applications related to the Data Pipeline
  • Review Airflow DAG for Orchestrated Pipeline using Dataproc Jobs
  • Deploy Data Pipeline or Airflow DAG using Dataproc Jobs
  • Review Source and Target before Deployment of Airflow DAG
  • Deploy and Run Airflow DAG with Dataproc Jobs
  • Differences Between Dataproc Workflows and Airflow DAGs
  • Cleanup Cloud Composer Environment and Dataproc Cluster
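
An Airflow DAG that instantiates a Dataproc workflow template, as in the orchestrated pipeline above. A minimal sketch using the google provider package; the project, region, and template ids are placeholders:

    # DAG sketch: trigger a Dataproc workflow template on a daily schedule.
    # Requires apache-airflow-providers-google.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocInstantiateWorkflowTemplateOperator,
    )

    with DAG(
        dag_id="daily_product_revenue",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_workflow = DataprocInstantiateWorkflowTemplateOperator(
            task_id="run_dataproc_workflow",
            template_id="wf-daily-product-revenue",
            project_id="my-project",
            region="us-central1",
        )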

19. Data Fusion

  • Introduction to Data Fusion
    • Overview of Data Fusion as a fully managed data integration service.
    • Use cases for Data Fusion in ETL and data migration.
  • Building Data Integration Pipelines
    • Creating ETL pipelines using the visual interface.
    • Configuring data sources, transformations, and sinks.
    • Using pre-built templates for common integration scenarios.
  • Integration with GCP and External Services
    • Integrating Data Fusion with BigQuery, Cloud Storage, and other GCP services.
  • End to End pipeline using Data fusion with Wrangler, GCS, BigQuery

20. Cloud Functions

  • Cloud Functions Introduction
  • Setting up Cloud Functions in GCP
  • Event-driven architecture and use cases
  • Writing and deploying Cloud Functions
  • Triggering Cloud Functions:
    • HTTP triggers
    • Pub/Sub triggers
    • Cloud Storage triggers
  • Monitoring and logging Cloud Functions
  • Use case 1: Load files from GCS into BigQuery as soon as they are uploaded
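
A sketch of Use case 1: a background Cloud Function on the Cloud Storage finalize event that loads each newly uploaded file into BigQuery. The dataset and table names are placeholders, and CSV input is assumed:

    # main.py: Cloud Function triggered on google.storage.object.finalize.
    from google.cloud import bigquery

    def gcs_to_bigquery(event, context):
        client = bigquery.Client()
        uri = f"gs://{event['bucket']}/{event['name']}"

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            autodetect=True,
            write_disposition="WRITE_APPEND",
        )
        load_job = client.load_table_from_uri(
            uri, "my_dataset.landing_orders", job_config=job_config
        )
        load_job.result()  # wait so failures surface in the function logs
        print(f"Loaded {uri} into my_dataset.landing_orders")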

21. Terraform

  • Terraform Introduction
  • Installing and configuring Terraform.
  • Infrastructure Provisioning
  • Terraform basic commands
    • init, plan, apply, destroy
  • Create Resources in Google Cloud Platform
    • GCS buckets
    • Dataproc cluster
    • BigQuery Datasets and tables
    • And more resources as needed

22. Data Pipelines using DBT, Airflow and BigQuery

  • Overview of Data Landscape of Large Enterprise
  • DBT High Level Architecture
  • Overview of DBT Cloud Features and DBT Adapters
  • Airflow and DBT Pipeline Patterns
  • Pre-requisites for Dev Environment using Airflow and DBT
  • Setup Astro CLI on Windows or Mac
  • Setup Workspace using VSCode
  • Setup Local Airflow Environment using Astro CLI
  • Setup Python Virtual Environment with Airflow
  • Overview of Airflow Providers
  • Manage Local Airflow Containers using Astro CLI
  • Connect to Airflow Containers and Review Logs using Astro CLI
  • Setup Datasets for Airflow Pipelines or DAGs
  • Setup GCS Bucket and Upload Data Set
  • Getting Started with Google BigQuery
  • Create External Table using Google BigQuery
  • Create GCP Service Account and Download Credentials
  • Getting Started with DBT Cloud
  • Setup DBT Cloud Project for Google BigQuery
  • Review and Run Example DBT Pipeline using DBT Cloud
  • Validate Google BigQuery Objects created by DBT Pipeline
  • Overview of ELT Pipeline using DBT and Google BigQuery
  • Change the DBT Project Structure from the Example
  • Create Models for Orders and Order Items
  • Define Denormalized Model for Order Details
  • Query to compute daily product revenue
  • Add Model for Daily Product Revenue
  • Create and Run DBT Cloud Job
  • Validate Airflow and Review DBT Cloud Provider
  • Install Airflow DBT Cloud Provider
  • Overview of End to End Orchestrated Data Pipeline using Airflow
  • Create DBT Cloud Connection in Airflow
  • Create DBT Job Variables in Airflow
  • Develop Airflow DAG to trigger DBT Cloud Job
  • Deploy Airflow DAG with DBT Cloud
  • Run Airflow DAG with DBT Cloud Job
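
The Airflow-to-DBT Cloud handoff above usually reduces to a single operator from the dbt Cloud provider. A minimal sketch; the connection id and job id are placeholders:

    # DAG sketch: trigger a DBT Cloud job and wait for it to finish.
    # Requires apache-airflow-providers-dbt-cloud.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

    with DAG(
        dag_id="dbt_daily_product_revenue",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        trigger_dbt_job = DbtCloudRunJobOperator(
            task_id="trigger_dbt_cloud_job",
            dbt_cloud_conn_id="dbt_cloud_default",
            job_id=12345,  # placeholder DBT Cloud job id
            wait_for_termination=True,
        )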

23. Machine Learning in GCP (for Data Engineers)

  • Overview of AI/ML on GCP
  • BigQuery ML – building ML models directly in SQL
  • Vertex AI basics – training & deploying ML models
  • Pipelines for ML (Vertex AI Pipelines, Kubeflow)
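
BigQuery ML models are created and evaluated with plain SQL, which is why this topic sits in a data engineering track. An illustrative linear regression run through the Python client; the dataset, table, and column names are made up:

    # Train and evaluate a BigQuery ML model from Python.
    from google.cloud import bigquery

    client = bigquery.Client()
    client.query("""
        CREATE OR REPLACE MODEL retail.revenue_model
        OPTIONS (model_type = 'linear_reg', input_label_cols = ['revenue']) AS
        SELECT product_id, day_of_week, revenue
        FROM retail.daily_product_revenue_features
    """).result()

    for row in client.query(
        "SELECT * FROM ML.EVALUATE(MODEL retail.revenue_model)"
    ).result():
        print(dict(row))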

24. Security, Monitoring & Governance

  • Data encryption (at rest, in transit, CMEK vs Google-managed keys)
  • IAM roles for data services
  • VPC Service Controls for data security
  • Cloud Logging, Cloud Monitoring, and Cloud Trace
  • Data Catalog for metadata management & lineage
  • DLP (Data Loss Prevention) for sensitive data

25. Real-World GCP Data Engineering Scenarios

  • Building a streaming pipeline (Pub/Sub → Dataflow → BigQuery → Looker)
  • Building a batch pipeline (Cloud Storage → Dataproc → BigQuery)
  • Data migration from on-prem to GCP
  • Designing a hybrid data lakehouse (BigQuery + Dataplex + GCS)
  • Project flow