BIG DATA Analytics Course Details
 

Subscribe and access 5200+ free videos across 21+ subjects such as CRT, Soft Skills, Java, Hadoop, Microsoft .NET, and Testing Tools.

Batch Date: Dec 21st & 22nd @5:00PM

Faculty: Mr. N. Vijay Sunder Sagar (20+ Yrs Of Exp,..)

Duration: 12 Weekends Batch

Venue :
DURGA SOFTWARE SOLUTIONS,
Flat No : 202, 2nd Floor,
HUDA Maitrivanam,
Ameerpet, Hyderabad - 500038

Ph. No: +91 - 9246212143, 80 96 96 96 96



Syllabus:

BIG DATA HADOOP

I: INTRODUCTION

  • What is Big Data?
  • What is Hadoop?
  • Need of Hadoop
  • Sources and Types of Data
  • Comparison with Other Technologies
  • Challenges with Big Data
    • i. Storage
    • ii. Processing
  • RDBMS vs Hadoop
  • Advantages of Hadoop
  • Hadoop Ecosystem components

II: HDFS (Hadoop Distributed File System)

  • Features of HDFS
  • Name Node, Data Node, Blocks
  • Configuring Block Size
  • HDFS Architecture (5 Daemons)
    • i. Name Node
    • ii. Data Node
    • iii. Secondary Name Node
    • iv. Job Tracker
    • v. Task Tracker
  • Metadata management
  • Storage and processing
  • Replication in Hadoop
  • Configuring Custom Replication
  • Fault Tolerance in Hadoop
  • HDFS Commands
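
The block size and replication topics above come down to simple arithmetic. The sketch below is only an illustration: it assumes a 128 MB block size and a replication factor of 3 (common Hadoop 2.x defaults, both configurable), and the function name `hdfs_footprint` is hypothetical, not part of any Hadoop API.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a file is laid out in HDFS.

    Assumed defaults: 128 MB blocks and 3 replicas (both configurable).
    Returns (number of blocks, total block replicas, raw storage in MB).
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different Data Nodes.
    total_replicas = num_blocks * replication
    # A partial last block only occupies its actual size on disk.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, total_replicas, raw_storage_mb

blocks, replicas, storage = hdfs_footprint(500)
print(blocks, replicas, storage)
```

Under these assumptions a 500 MB file is split into 4 blocks, stored as 12 block replicas, and uses 1500 MB of raw cluster storage.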

III: MAP REDUCE

  • Map Reduce Architecture
  • Processing Daemons of Hadoop
    • Job Tracker (Roles and Responsibilities)
    • Task Tracker (Roles and Responsibilities)
  • Phases of Map Reduce
    • i) Mapper phase
    • ii) Reducer phase
  • Input split
  • Input split vs Block size
  • Partitioner in Map Reduce
  • Groupings and Aggregations
  • Data Types in Map Reduce
  • Map Reduce Programming Model
    • Driver Code
    • Mapper Code
    • Reducer Code
  • Programming examples
  • File input formats
  • File output formats
  • Merging in Map Reduce
  • Speculative Execution Model
  • Speculative Job
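
The mapper and reducer phases listed above can be sketched in plain Python with the classic word count. This is a single-process illustration of the programming model only, not Hadoop code: the function names are hypothetical, and the `shuffle` step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict

def mapper(line):
    """Mapper phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reducer phase: aggregate all values collected for one key."""
    return key, sum(values)

def word_count(lines):
    """Driver: wire the mapper, shuffle, and reducer steps together."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())

counts = word_count(["big data hadoop", "hadoop big data big"])
print(counts)
```

In a real job the driver, mapper, and reducer would be separate classes submitted to the cluster; here they are ordinary functions so the data flow is easy to follow.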

IV: SQOOP (SQL + HADOOP)

  • Introduction to Sqoop
  • SQOOP Import
  • SQOOP Export
  • Importing Data From RDBMS to HDFS
  • Importing Data From RDBMS to HIVE
  • Importing Data From RDBMS to HBASE
  • Exporting From HBASE to RDBMS
  • Exporting From HIVE to RDBMS
  • Exporting From HDFS to RDBMS
  • Transformations While Importing / Exporting
  • Filtering data while importing
  • Vertical and Horizontal merging while import
  • Working with delimiters while importing
  • Groupings and Aggregations while import
  • Incremental import
  • Examples and operations
  • Defining SQOOP Jobs

V: YARN

  • Introduction
  • Speculative Execution, Speculative Job and Speculative Task
  • Comparison of Hadoop 1.xx with Hadoop 2.xx
  • Comparison with previous versions
  • YARN Architecture Components
    • i. Resource Manager
    • ii. Application Master
    • iii. Node Manager
    • iv. Application Manager
    • v. Resource Scheduler
    • vi. Job History Server
    • vii. Container

VI: NOSQL

  • What is “Not only SQL”
  • NOSQL Advantages
  • What is the problem with RDBMS for Large Data Scaling Systems
  • Types of NOSQL & Purposes
  • Key Value Store
  • Columnar Store
  • Document Store
  • Graph Store
  • Introduction to Cassandra – NOSQL Database
  • Introduction to MongoDB and CouchDB Database
  • Integration of NOSQL Databases with Hadoop

VII: HBASE

  • Introduction to Bigtable
  • What is NOSQL and columnar store Database
  • HBase Introduction
  • HBase use cases
  • HBase basics
  • Column families
  • Scans
  • HBase Architecture
  • Map Reduce Over HBase
  • HBase data Modeling
  • HBase Schema design
  • HBase CRUD operations
  • Hive & HBase integration
  • HBase storage handlers

VIII: HIVE

  • Introduction
  • Hive Architecture
  • Hive Metastore
  • Hive Query Language
  • Difference between HQL and SQL
  • Hive Built in Functions
  • Loading Data From Local Files To Hive Tables
  • Loading Data From HDFS Files To Hive Tables
  • Table Types
  • Internal (Managed) Tables
  • External Tables
  • Hive Working with unstructured data
  • Hive Working With XML Data
  • Hive Working With JSON Data
  • Hive Working With URLs And Weblog Data
  • Hive Unions
  • Hive Joins
  • Multi Table / File Inserts
  • Inserting Into Local Files
  • Inserting Into HDFS Files
  • Hive UDF (user-defined functions)
  • Hive UDAF (user-defined aggregate functions)
  • Hive UDTF (user-defined table-generating functions)
  • Partitioned Tables
  • Non-Partitioned Tables
  • Multi-column Partitioning
  • Dynamic Partitions In Hive
  • Performance Tuning mechanism
  • Bucketing in hive
  • Indexing in Hive
  • Hive Examples
  • Hive & HBase Integration

 

PYSPARK

I ) PYSPARK INTRODUCTION

  • What is Apache Spark?
  • Why PySpark?
  • Need for PySpark
  • Spark: Python vs Scala
  • PySpark features
  • Real-life usage of PySpark
  • PySpark Web/Application
  • PySpark – SparkSession
  • PySpark – SparkContext
  • PySpark – RDD
  • PySpark – Parallelize
  • PySpark – repartition() vs coalesce()
  • PySpark – Broadcast Variables
  • PySpark – Accumulator

II) PYSPARK - RDD COMPUTATION

  • Operations on an RDD
  • Directed Acyclic Graph (DAG)
  • RDD Actions and Transformations
  • RDD computation
  • Steps in RDD computation
  • RDD persistence
  • Persistence features

PERSISTENCE OPTIONS:

  • 1) MEMORY_ONLY
  • 2) MEMORY_ONLY_SER
  • 3) DISK_ONLY
  • 4) MEMORY_AND_DISK_SER
  • 5) MEMORY_AND_DISK

III) PYSPARK - CORE COMPUTING

  • Fault Tolerance model in Spark
  • Different ways of creating an RDD
  • Word Count Example
  • Creating Spark objects (RDDs) from Python objects (lists)
  • Increasing the number of partitions
  • Aggregations Over Structured Data
  • reduceByKey()

IV) GROUPINGS AND AGGREGATIONS

  • i) Single Grouping and Single Aggregation
  • ii) Single Grouping and Multi Aggregation
  • iii) Multi Grouping and Single Aggregation
  • iv) Multi Grouping and Multi Aggregation
  • Differences between reduceByKey() and groupByKey()
  • Process of groupByKey
  • Process of reduceByKey
  • reduce() function
  • Various Transformations
  • Various Built-in Functions
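
The difference between reduceByKey() and groupByKey() listed above can be illustrated without a cluster. The sketch below is a plain-Python simulation, not PySpark code: each inner list plays the role of one partition, the function names are hypothetical, and the second return value counts how many records would cross the network in the shuffle.

```python
from collections import defaultdict

def group_by_key_sum(partitions):
    """Simulate groupByKey() followed by a sum: every record is shuffled."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)

def reduce_by_key_sum(partitions):
    """Simulate reduceByKey(): combine inside each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side (local) combine
        shuffled.extend(local.items())   # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value             # final merge after the shuffle
    return dict(totals), len(shuffled)

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
print(group_by_key_sum(partitions))
print(reduce_by_key_sum(partitions))
```

Both approaches produce the same totals, but the simulated reduceByKey() moves 4 records instead of 6 because values are combined locally first. On real data with many repeated keys the gap is usually much larger.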

V) Various Actions and Transformations:

  • countByKey()
  • countByValue()
  • sortByKey()
  • zip()
  • union()
  • distinct()
  • Various count aggregation
  • Joins
    • Inner join
    • Outer join
  • cartesian()
  • cogroup()
  • Other actions and transformations

VI) PySpark SQL - DataFrame

  • Introduction
  • Making data Structured
  • Case Classes
  • Ways to extract case class objects
    • 1) using function
    • 2) using map with multiple expressions
    • 3) using map with single expression
  • SQLContext
  • DataFrame API
  • Dataset API
  • RDD vs DataFrame vs DataSet
  • PySpark – Create a DataFrame
  • PySpark – Create an empty DataFrame
  • PySpark – Convert RDD to DataFrame
  • PySpark – Convert DataFrame to Pandas
  • PySpark – show()
  • PySpark – StructType & StructField
  • PySpark – Row Class
  • PySpark – Column Class
  • PySpark – select()
  • PySpark – collect()
  • PySpark – withColumn()
  • PySpark – withColumnRenamed()
  • PySpark – where() & filter()
  • PySpark – drop() & dropDuplicates()
  • PySpark – orderBy() and sort()
  • PySpark – groupBy()
  • PySpark – join()
  • PySpark – union() & unionAll()
  • PySpark – unionByName()
  • PySpark – UDF (User Defined Function)
  • PySpark – map()
  • PySpark – flatMap()
  • PySpark – foreach()
  • PySpark – sample() vs sampleBy()
  • PySpark – fillna() & fill()
  • PySpark – pivot() (Row to Column)
  • PySpark – partitionBy()
  • PySpark – ArrayType Column (Array)
  • PySpark – MapType (Map/Dict)

VII) PySpark SQL Functions

  • PySpark – Aggregate Functions
  • PySpark – Window Functions
  • PySpark – Date and Timestamp Functions
  • PySpark – JSON Functions
  • PySpark – Read & Write JSON file

VIII) PySpark Built-In Functions

  • PySpark – when()
  • PySpark – expr()
  • PySpark – lit()
  • PySpark – split()
  • PySpark – concat_ws()
  • PySpark – substring()
  • PySpark – translate()
  • PySpark – regexp_replace()
  • PySpark – overlay()
  • PySpark – to_timestamp()
  • PySpark – to_date()
  • PySpark – date_format()
  • PySpark – datediff()
  • PySpark – months_between()
  • PySpark – explode()
  • PySpark – array_contains()
  • PySpark – array()
  • PySpark – collect_list()
  • PySpark – collect_set()
  • PySpark – create_map()
  • PySpark – map_keys()
  • PySpark – map_values()
  • PySpark – struct()
  • PySpark – countDistinct()
  • PySpark – sum(), avg()
  • PySpark – row_number()
  • PySpark – rank()
  • PySpark – dense_rank()
  • PySpark – percent_rank()
  • PySpark – typedLit()
  • PySpark – from_json()
  • PySpark – to_json()
  • PySpark – json_tuple()
  • PySpark – get_json_object()
  • PySpark – schema_of_json()
  • Working Examples

IX) PySpark External Sources

  • Working with SQL statements
  • Spark and Hive Integration
  • Spark and MySQL Integration
  • Working with CSV
  • Working with JSON
  • Transformations and actions on dataframes
  • Narrow, wide transformations
  • Addition of new columns, dropping of columns, renaming columns
  • Addition of new rows, dropping rows
  • Handling nulls
  • Joins
  • Window function
  • Writing data back to External sources
  • Creation of tables from DataFrames (Internal tables, Temporary tables)

X) DEPLOYMENT MODES

  • Local Mode
  • Cluster Modes (Standalone, YARN)

XI) PYSPARK APPLICATION

  • Stages and Tasks
  • Driver and Executor
  • Building spark applications/pipelines
  • Deploying spark apps to cluster and tuning
  • Performance tuning

PySpark Streaming Concepts

Integration with Kafka

PySpark MLlib