Subscribe and Access: 5200+ FREE Videos and 21+ Subjects like CRT, Soft Skills, JAVA, Hadoop, Microsoft .NET, Testing Tools, etc.
Batch Date: Dec 21st & 22nd @ 5:00 PM
Faculty: Mr. N. Vijay Sunder Sagar (20+ Yrs of Exp.)
Duration: 12 Weekends
Venue: DURGA SOFTWARE SOLUTIONS,
Flat No : 202,
2nd Floor,
HUDA Maitrivanam,
Ameerpet, Hyderabad - 500038
Ph. No: +91 - 9246212143, 80 96 96 96 96
Syllabus:
BIG DATA HADOOP
I: INTRODUCTION
- What is Big Data?
- What is Hadoop?
- Need for Hadoop
- Sources and Types of Data
- Comparison with Other Technologies
- Challenges with Big Data
- i. Storage
- ii. Processing
- RDBMS vs Hadoop
- Advantages of Hadoop
- Hadoop Ecosystem Components
II: HDFS (Hadoop Distributed File System)
- Features of HDFS
- Name Node, Data Node, Blocks
- Configuring Block Size
- HDFS Architecture (5 Daemons)
- i. Name Node
- ii. Data Node
- iii. Secondary Name node
- iv. Job Tracker
- v. Task Tracker
- Metadata management
- Storage and processing
- Replication in Hadoop
- Configuring Custom Replication
- Fault Tolerance in Hadoop
- HDFS Commands
III: MAP REDUCE
- Map Reduce Architecture
- Processing Daemons of Hadoop
- Job Tracker (Roles and Responsibilities)
- Task Tracker (Roles and Responsibilities)
- Phases of Map Reduce
- i) Mapper phase
- ii) Reducer phase
- Input split
- Input split vs Block size
- Partitioner in Map Reduce
- Groupings and Aggregations
- Data Types in Map Reduce
- Map Reduce Programming Model
- Driver Code
- Mapper Code
- Reducer Code
- Programming Examples (see the word-count sketch below)
- File input formats
- File output formats
- Merging in Map Reduce
- Speculative Execution Model
- Speculative Job
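To give a feel for the programming-examples topic above, here is a minimal word-count sketch using Hadoop Streaming with Python (one illustrative option; classic Java MapReduce follows the same Driver/Mapper/Reducer structure). The file names mapper.py and reducer.py are placeholders:

# mapper.py - emits (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - sums the counts per word (Hadoop sorts mapper output by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Both scripts are submitted through the hadoop-streaming JAR with -mapper, -reducer, -input and -output options; exact paths depend on the installation.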
IV: SQOOP (SQL + HADOOP)
- Introduction to Sqoop
- SQOOP Import
- SQOOP Export
- Importing Data From RDBMS to HDFS
- Importing Data From RDBMS to HIVE
- Importing Data From RDBMS to HBASE
- Exporting From HBASE to RDBMS
- Exporting From HIVE to RDBMS
- Exporting From HDFS to RDBMS
- Transformations While Importing / Exporting
- Filtering data while importing
- Vertical and Horizontal Merging While Importing
- Working with delimiters while importing
- Groupings and Aggregations While Importing
- Incremental import
- Examples and operations
- Defining SQOOP Jobs
V: YARN
- Introduction
- Speculative Execution, Speculative Job and Speculative Task
- Comparison of Hadoop 1.x with Hadoop 2.x
- Comparison with Previous Versions
- YARN Architecture Components
- i. Resource Manager
- ii. Application Master
- iii. Node Manager
- iv. Application Manager
- v. Resource Scheduler
- vi. Job History Server
- vii. Container
VI: NOSQL
- What is "Not Only SQL"?
- NoSQL Advantages
- The Problem with RDBMS for Large-Scale Data Systems
- Types of NOSQL & Purposes
- Key Value Store
- Columnar Store
- Document Store
- Graph Store
- Introduction to Cassandra – NoSQL Database
- Introduction to MongoDB and CouchDB Databases
- Integration of NoSQL Databases with Hadoop
VII: HBASE
- Introduction to Bigtable
- What are NoSQL and Columnar Store Databases?
- HBase Introduction
- HBase Use Cases
- HBase Basics
- Column Families
- Scans
- HBase Architecture
- Map Reduce Over HBase
- HBase Data Modeling
- HBase Schema Design
- HBase CRUD Operations (see the sketch below)
- Hive & HBase Integration
- HBase Storage Handlers
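As a sketch of the CRUD operations listed above, the following uses the happybase Python client, assuming an HBase Thrift server on localhost; the table name "users" and column family "info" are hypothetical:

import happybase

# connect to the HBase Thrift gateway (host and port are placeholders)
connection = happybase.Connection("localhost", port=9090)
table = connection.table("users")    # hypothetical table with column family 'info'

# Create/Update: put a row with two columns
table.put(b"row1", {b"info:name": b"Ravi", b"info:city": b"Hyderabad"})

# Read: fetch a single row, then scan a key range
print(table.row(b"row1"))
for key, data in table.scan(row_prefix=b"row"):
    print(key, data)

# Delete: remove the row
table.delete(b"row1")
connection.close()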
VIII: HIVE
- Introduction
- Hive Architecture
- Hive Metastore
- Hive Query Language (HQL)
- Difference between HQL and SQL
- Hive Built in Functions
- Loading Data From Local Files To Hive Tables
- Loading Data From HDFS Files To Hive Tables
- Table Types
- Inner Tables
- External Tables
- Hive Working with unstructured data
- Hive Working With Xml Data
- Hive Working With Json Data
- Hive Working With Urls And Weblog Data
- Hive Unions
- Hive Joins
- Multi Table / File Inserts
- Inserting Into Local Files
- Inserting Into Hdfs Files
- Hive UDF (User Defined Functions)
- Hive UDAF (User Defined Aggregate Functions)
- Hive UDTF (User Defined Table-Generating Functions)
- Partitioned Tables
- Non-Partitioned Tables
- Multi-column Partitioning
- Dynamic Partitions In Hive
- Performance Tuning Mechanisms
- Bucketing in hive
- Indexing in Hive
- Hive Examples (see the sketch below)
- Hive & HBase Integration
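To make the Hive examples concrete, here is a minimal query sketch from Python using the PyHive client, assuming HiveServer2 is reachable on localhost:10000; the database and the "employees" table are hypothetical:

from pyhive import hive

# connect to HiveServer2 (host, port and database are placeholders)
conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HQL grouping and aggregation over a hypothetical employees table
cursor.execute("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept")
for dept, cnt in cursor.fetchall():
    print(dept, cnt)

cursor.close()
conn.close()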
PYSPARK
I ) PYSPARK INTRODUCTION
- What is Apache Spark?
- Why PySpark?
- Need for PySpark
- Spark: Python vs Scala
- PySpark Features
- Real-life usage of PySpark
- PySpark Web/Application
- PySpark - SparkSession
- PySpark – SparkContext
- PySpark – RDD
- PySpark – Parallelize
- PySpark – repartition() vs coalesce()
- PySpark – Broadcast Variables
- PySpark – Accumulator
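A minimal sketch tying together the entry points listed above (SparkSession, SparkContext, parallelize, broadcast, accumulator, repartition/coalesce); the application name and sample data are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IntroDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext                    # SparkContext behind the SparkSession

rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)   # RDD from a Python list
bonus = sc.broadcast(10)                   # read-only value shared with executors
counter = sc.accumulator(0)                # counter that executors can only add to

def add_bonus(x):
    counter.add(1)                         # track how many records were processed
    return x + bonus.value

print(rdd.map(add_bonus).collect())        # [11, 12, 13, 14, 15]
print("records processed:", counter.value)
print(rdd.repartition(4).getNumPartitions())  # repartition: full shuffle, up or down
print(rdd.coalesce(1).getNumPartitions())     # coalesce: narrow, only decreases
spark.stop()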
II) PYSPARK - RDD COMPUTATION
- Operations on a RDD
- Directed Acyclic Graph (DAG)
- RDD Actions and Transformations
- RDD computation
- Steps in RDD computation
- RDD persistence
- Persistence features
Persistence Options:
- 1) MEMORY_ONLY
- 2) MEMORY_ONLY_SER
- 3) DISK_ONLY
- 4) MEMORY_AND_DISK
- 5) MEMORY_AND_DISK_SER
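A minimal persistence sketch using one of the storage levels above, assuming a local SparkSession:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PersistDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

squares = sc.parallelize(range(100000)).map(lambda x: x * x)
squares.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk when memory is tight

print(squares.count())   # the first action computes and caches the RDD
print(squares.take(3))   # later actions reuse the cached partitions
squares.unpersist()
spark.stop()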
III) PYSPARK - CORE COMPUTING
- Fault Tolerance Model in Spark
- Different Ways of Creating an RDD
- Word Count Example (see the sketch below)
- Creating Spark objects (RDDs) from Python objects (lists)
- Increasing the Number of Partitions
- Aggregations Over Structured Data:
- reduceByKey()
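The word-count example referenced above, sketched with reduceByKey(); the input lines are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not to be", "to do or not to do"])
counts = (lines.flatMap(lambda line: line.split())   # one record per word
               .map(lambda word: (word, 1))          # pair RDD of (word, 1)
               .reduceByKey(lambda a, b: a + b))     # combine the counts per key
print(counts.collect())
spark.stop()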
IV) GROUPINGS AND AGGREGATIONS
- i) Single Grouping and Single Aggregation
- ii) Single Grouping and Multiple Aggregations
- iii) Multi Grouping and Single Aggregation
- iv) Multi Grouping and Multiple Aggregations
- Differences between reduceByKey() and groupByKey() (see the sketch below)
- Process of groupByKey
- Process of reduceByKey
- Reduce() function
- Various Transformations
- Various Built-in Functions
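A minimal sketch contrasting reduceByKey() and groupByKey(), assuming a local SparkSession; both produce the same totals, but reduceByKey() pre-aggregates on each partition before the shuffle:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GroupDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey combines values map-side first, so less data crosses the network
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # [('a', 4), ('b', 6)]

# groupByKey ships every value to the reducer, then aggregates
print(pairs.groupByKey().mapValues(sum).collect())       # [('a', 4), ('b', 6)]
spark.stop()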
V) Various Actions and Transformations:
- countByKey()
- countByValue()
- sortByKey()
- zip()
- union()
- distinct()
- Various Count Aggregations
- Joins
- Inner Join
- Outer Join
- cartesian()
- cogroup()
- Other actions and transformations
VI) PySpark SQL - DataFrame
- Introduction
- Making data Structured
- Case Classes
- Ways to extract case class objects:
- 1) using a function
- 2) using map with multiple expressions
- 3) using map with a single expression
- SQLContext
- DataFrame API
- Dataset API
- RDD vs DataFrame vs Dataset
- PySpark – Create a DataFrame
- PySpark – Create an empty DataFrame
- PySpark – Convert RDD to DataFrame
- PySpark – Convert DataFrame to Pandas
- PySpark – show()
- PySpark – StructType & StructField
- PySpark – Row Class
- PySpark – Column Class
- PySpark – select()
- PySpark – collect()
- PySpark – withColumn()
- PySpark – withColumnRenamed()
- PySpark – where() & filter()
- PySpark – drop() & dropDuplicates()
- PySpark – orderBy() and sort()
- PySpark – groupBy()
- PySpark – join()
- PySpark – union() & unionAll()
- PySpark – unionByName()
- PySpark – UDF (User Defined Function)
- PySpark – map()
- PySpark – flatMap()
- PySpark – foreach()
- PySpark – sample() vs sampleBy()
- PySpark – fillna() & fill()
- PySpark – pivot() (Row to Column)
- PySpark – partitionBy()
- PySpark – ArrayType Column (Array)
- PySpark – MapType (Map/Dict)
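A minimal DataFrame sketch covering several of the operations above (StructType/StructField, select, withColumn, where, orderBy); the schema and rows are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DFDemo").master("local[*]").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([("Ravi", 30), ("Sita", 25)], schema)

(df.withColumn("age_next_year", col("age") + 1)   # add a derived column
   .where(col("age") > 20)                        # filter rows
   .orderBy(col("age").desc())                    # sort descending
   .select("name", "age_next_year")
   .show())
spark.stop()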
VII) PySpark SQL Functions
- PySpark – Aggregate Functions
- PySpark – Window Functions
- PySpark – Date and Timestamp Functions
- PySpark – JSON Functions
- PySpark – Read & Write JSON file
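A minimal window-function sketch, assuming a local SparkSession; the department data is illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("WindowDemo").master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("sales", "Ravi", 3000), ("sales", "Sita", 4600), ("hr", "Amit", 3900)],
    ["dept", "name", "salary"],
)

w = Window.partitionBy("dept").orderBy(col("salary").desc())
(df.withColumn("rank_in_dept", row_number().over(w))                        # ranking window
   .withColumn("dept_avg", avg("salary").over(Window.partitionBy("dept")))  # aggregate window
   .show())
spark.stop()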
VIII) PySpark Built-In Functions
- PySpark – when()
- PySpark – expr()
- PySpark – lit()
- PySpark – split()
- PySpark – concat_ws()
- PySpark – substring()
- PySpark – translate()
- PySpark – regexp_replace()
- PySpark – overlay()
- PySpark – to_timestamp()
- PySpark – to_date()
- PySpark – date_format()
- PySpark – datediff()
- PySpark – months_between()
- PySpark – explode()
- PySpark – array_contains()
- PySpark – array()
- PySpark – collect_list()
- PySpark – collect_set()
- PySpark – create_map()
- PySpark – map_keys()
- PySpark – map_values()
- PySpark – struct()
- PySpark – countDistinct()
- PySpark – sum(), avg()
- PySpark – row_number()
- PySpark – rank()
- PySpark – dense_rank()
- PySpark – percent_rank()
- PySpark – typedLit()
- PySpark – from_json()
- PySpark – to_json()
- PySpark – json_tuple()
- PySpark – get_json_object()
- PySpark – schema_of_json()
- Working Examples
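A short working sketch of a few of the built-in functions above (when, expr, lit, split, concat_ws, to_date, datediff); the sample row is illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (when, expr, lit, split, concat_ws,
                                   to_date, datediff, current_date, col)

spark = SparkSession.builder.appName("FuncDemo").master("local[*]").getOrCreate()
df = spark.createDataFrame([("Ravi Kumar", "2024-01-15", 62)],
                           ["full_name", "joined", "marks"])

df.select(
    split(col("full_name"), " ").alias("name_parts"),                    # split()
    concat_ws("_", lit("emp"), col("full_name")).alias("tag"),           # concat_ws() + lit()
    when(col("marks") >= 60, "pass").otherwise("fail").alias("result"),  # when()
    datediff(current_date(), to_date(col("joined"))).alias("days_since_joining"),
    expr("marks + 5").alias("moderated"),                                # expr()
).show(truncate=False)
spark.stop()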
IX) Pyspark External Sources
- Working with SQL Statements
- Spark and Hive Integration
- Spark and MySQL Integration
- Working with CSV
- Working with JSON
- Transformations and Actions on DataFrames
- Narrow vs Wide Transformations
- Adding, Dropping, and Renaming Columns
- Adding and Dropping Rows
- Handling nulls
- Joins
- Window function
- Writing data back to External sources
- Creating Tables from DataFrames (Internal Tables, Temporary Tables)
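A minimal external-sources sketch, assuming a local SparkSession; all paths, the MySQL connection details and table names are placeholders, and the JDBC read needs the MySQL driver JAR on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IODemo").master("local[*]").getOrCreate()

# read CSV and JSON (paths are placeholders)
csv_df = spark.read.option("header", True).option("inferSchema", True).csv("/data/input.csv")
json_df = spark.read.json("/data/input.json")

# read from MySQL over JDBC (connection details are placeholders)
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://localhost:3306/testdb")
           .option("dbtable", "employees")
           .option("user", "root")
           .option("password", "secret")
           .load())

# write back out, then expose the data to SQL statements
csv_df.write.mode("overwrite").parquet("/data/output.parquet")
csv_df.createOrReplaceTempView("emp_tmp")                    # temporary table
csv_df.write.mode("overwrite").saveAsTable("emp_internal")   # internal (managed) table
spark.sql("SELECT count(*) FROM emp_tmp").show()
spark.stop()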
X) DEPLOYMENT MODES
- Local Mode
- Cluster Modes (Standalone, YARN)
XI) PYSPARK APPLICATION
- Stages and Tasks
- Driver and Executor
- Building Spark Applications/Pipelines
- Deploying Spark Applications to a Cluster
- Performance Tuning
XII) PYSPARK STREAMING CONCEPTS
- Integration with Kafka (see the sketch below)
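A minimal Structured Streaming sketch for the Kafka integration topic, assuming a broker on localhost:9092 and a topic named "events" (both placeholders); the job must be launched with the spark-sql-kafka package that matches your Spark version:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("KafkaStream").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "events")                        # placeholder topic
          .load())

# Kafka delivers key and value as binary; cast them before use
messages = stream.select(col("key").cast("string"), col("value").cast("string"))

query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()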
XIII) PYSPARK MLLIB