Big Data Analytics with Hadoop and Spark

19th Jun 2023

The Big Data course gives learners the knowledge and skills needed to work with big data.

1. Big Data course overview

Big Data refers to data sets so large that they exceed the ability of traditional software tools to capture and process them within an acceptable time.

Big Data also refers to the set of techniques and technologies that require new forms of integration to uncover the large hidden value in data sets that are large, diverse, and complex. In 2012, Gartner defined it as follows: “Big data is information assets with three dimensions of growth (3V): growth in volume, growth in velocity, and growth in variety, which therefore require new forms of processing to enable enhanced decision making, insight discovery, and process optimization”.

2. Training objectives of the Big Data course

The course aims to give learners the necessary knowledge and skills in the following frameworks:

Hadoop: An open-source platform written in Java that supports storing and processing extremely large data sets in a distributed computing environment. The core of Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce).
Spark: An open-source engine for fast, easy-to-use data processing and analytics. It can process large volumes of data with low latency, which typical MapReduce programs cannot achieve.
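
To make the MapReduce processing model mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel on data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, standing in for the lines of a large file.
sample_lines = ["big data needs new tools", "spark and hadoop process big data"]
counts = word_count(sample_lines)
```

The point of the split is that each map call and each reduce call only sees its own line or key, which is what allows a framework to distribute them across machines.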

Learners begin by studying what distributed storage and processing of big data are, and why traditional tools cannot be used to store and process big data. Next, they study Spark (a successor to MapReduce that uses Scala). After completing the course, learners will be able to:

Install Hadoop version 2
Understand YARN and how it works
Understand the difference between real-time processing and batch processing
Use MapReduce for batch analysis and processing
Process data in different ways with Java, Pig Latin, and HQL
Practice with a wide variety of examples
Use Sqoop and Flume to load big data into a Hadoop cluster
Understand NoSQL and use HBase
Understand the concepts and features of RDDs in Spark
Transform and process data
Use Spark's structured query language (Spark SQL)

3. Training content & duration

The program runs for 5 days (40 hours)
The detailed content is listed below
Language of instruction: English / Vietnamese

4. Detailed content

Overview of Big Data
- What is Big Data?
- History of Big Data
- The Vs of Big Data (3Vs, 4Vs, 5Vs)
- Batch processing vs Stream processing
- Introduction to Apache Spark
- Apache Spark Components: Spark RDD API, Spark SQL, Spark MLlib, Spark GraphX, Spark Streaming
Overview of PySpark
- Introduction to PySpark: Spark with Python (Python API)
- Why PySpark?
- Installing and configuring PySpark
- SparkContext, SparkSession
PySpark RDDs
- Introduction to PySpark RDDs (Resilient Distributed Dataset)
- RDD operations
  1. Transformation
  2. Action
- Working with PySpark RDDs
  1. Create RDD: parallelize(), textFile()
  2. RDD Transformations: map(), filter(), flatMap(), RDD1.union(RDD2)
  3. RDD Actions: collect(), take(), count(), first(), reduce(), saveAsTextFile(),…
  4. Pair RDDs:
    1. Create Pair RDDs from key-value tuple/ regular RDD
    2. Transformations: reduceByKey(), groupByKey(), sortByKey(), join()
    3. Actions: countByKey(), collectAsMap()
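
The distinction between transformations and actions listed above can be illustrated with a plain-Python analogy. This is not PySpark: `LazyDataset` is a hypothetical stand-in for an RDD, written only to show that transformations such as `map()` and `filter()` merely record work, while actions such as `collect()`, `count()`, and `reduce()` trigger it.

```python
from functools import reduce


class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # a local iterable here; partitioned data in Spark
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        """Build the chained iterator for all recorded steps."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded steps and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return reduce(fn, self._run())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
work_before_action = len(calls)  # still 0: no action has been called
result = odd_squares.collect()   # the action runs the recorded steps
```

As in this sketch, each action re-runs the recorded steps from the source, which is the reason Spark offers caching for datasets that are reused.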
PySpark DataFrame
- Introduction to PySpark DataFrame
- Features and Advantages
- Working with PySpark DataFrame
  1. Create DataFrame: createDataFrame(), spark.read.csv(), spark.read.json()
  2. printSchema(), show()
  3. count()
  4. describe()
  5. crosstab()
  6. groupby()
  7. select(), select() with agg, count, max, mean, min, sum, ..., select().distinct()
  8. orderBy() with asc()/desc()
  9. withColumn(), withColumnRenamed()
  10. drop(), dropDuplicates(), dropna()
  11. filter(), where()
  12. Column string transformation
  13. Conditional clauses: .when(<if condition>, <then x>), .otherwise()
  14. User defined functions (UDF)
- Data Visualization in PySpark using DataFrames
  1. hist(), distplot()
  2. pandas_histogram()
PySpark SQL
- Introduction to PySpark SQL
- Running SQL Queries Programmatically
  1. select()
  2. when()
  3. like()
  4. startswith(), endswith()
  5. substr(), between()
- Manipulating data
  1. Group by
  2. Filtering
  3. Sorting
  4. Handling missing values and replacing values
  5. Joining Data
  6. Repartitioning
  7. Registering DataFrames as Views
Data Preprocessing & Analysis
- Wrangling with Spark Functions
  1. Dropping, Filtering, Joining
  2. Working with missing data
  3. Using lazy processing
  4. Parquet
  5. Removing, Splitting rows/columns
  6. Data validation
- Feature Engineering
  1. Feature Generation
  2. Differences, Ratios
  3. Deeper Features, Time Features
  4. Time Components, Joining On Time Components
  5. Date Math
  6. Extracting Features/ Text to New Features
  7. Splitting & Exploding
  8. Scaling data
  9. Pivoting & Joining
  10. Binarizing, Bucketing & Encoding
- Data Analysis
  1. Exploratory Data Analysis (EDA), Corr
  2. Visualization: distplot, lmplot, ...
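
Three of the feature-engineering steps listed above (scaling, binarizing, and bucketing) are simple enough to sketch in plain Python. The helper names and the sample `ages` data below are assumptions made for illustration, not Spark functions; the conventions chosen here are min-max scaling to [0, 1], a strict "greater than" threshold for binarizing, and half-open buckets for bucketing.

```python
from bisect import bisect_right


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold):
    """Map each value to 1.0 if it is strictly above the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def bucketize(values, splits):
    """Map each value to the index i of the bucket [splits[i], splits[i + 1])."""
    return [bisect_right(splits, v) - 1 for v in values]


# Hypothetical numeric column.
ages = [18, 25, 40, 65]
scaled = min_max_scale(ages)
over_30 = binarize(ages, 30)
age_bucket = bucketize(ages, [0, 20, 40, 100])
```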
Overview of PySpark MLlib
- Introduction to PySpark MLlib
- PySpark MLlib algorithms
- Building a Model
- Estimator and evaluator
- Cross-validation, Grid Search
- Interpreting Results
Machine Learning with PySpark MLlib
- Supervised Learning (Classification & Regression)
  1. Linear Regression (pyspark.ml.regression)
  2. Logistic Regression (pyspark.ml.classification)
  3. Decision Tree (pyspark.ml.classification)
  4. Random forest (pyspark.ml.classification)
  5. Gradient-Boosted Tree
- Pipeline
  1. Introduction to Pipeline
  2. Working with Pipeline (from pyspark.ml import Pipeline)
- Unsupervised Learning (Clustering & Recommender System)
  1. Clustering with KMeans
  2. Recommender System - ALS
  3. Association rules – FPGrowth (pyspark.ml.fpm.FPGrowth)
PySpark Streaming
- Introduction to PySpark Streaming
- Why PySpark Streaming?
- Features and Advantages
- Streaming Context
- DStream
- Streaming Transformation Operations
- Streaming Checkpoint
Natural Language Processing - NLP
- Tools for NLP
  1. Tokenizer
  2. StopWordsRemover
  3. NGram
  4. CountVectorizer
  5. TF-IDF
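
The NLP steps listed above fit together as a small pipeline: tokenize, drop stop words, optionally build n-grams, count terms, then weight the counts by TF-IDF. The plain-Python sketch below illustrates that flow. The function names, the tiny stop-word list, and the sample documents are assumptions made for illustration. The IDF term uses the smoothed form log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term; other TF-IDF variants exist.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative list (assumption)


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    # Term frequency: raw counts per document (the CountVectorizer step).
    term_counts = [Counter(remove_stop_words(tokenize(d))) for d in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(docs)
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in term_counts]


# Hypothetical corpus.
docs = ["Spark is fast", "Hadoop is reliable", "Spark and Hadoop process big data"]
weights = tf_idf(docs)
```

In this corpus "fast" occurs in one document and "spark" in two, so "fast" receives the higher weight in the first document: rarer terms are treated as more informative.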
Apache Spark standalone cluster
- Running Master Server
- Connecting from Slave computers to Master Server
- Deploying a project on a Master-Slave computer system
GraphX
- Introduction to GraphX
- Working with GraphX
  1. Creating graph
  2. Vertex and edge
  3. Graph visualization
  4. Filtering
  5. Connecting
  6. Motif finding
  7. Triangle count
  8. PageRank
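
Two of the graph algorithms in the list above, triangle count and PageRank, can be sketched on a small in-memory graph in plain Python. This is a textbook, normalized power-iteration formulation written for illustration, not the GraphX implementation (which may differ, for example in how ranks are normalized). The function names, the damping factor of 0.85, the iteration count, and the sample edges are all assumptions.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share of rank.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # dangling node: spread evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def triangle_count(edges):
    """Count triangles in the undirected version of the graph."""
    neighbours = {}
    for a, b in edges:
        if a != b:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    undirected = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    # A common neighbour of both endpoints closes a triangle over that edge.
    total = sum(len(neighbours[a] & neighbours[b]) for a, b in undirected)
    return total // 3  # each triangle is found once per each of its 3 edges


# Hypothetical graph: a cycle a -> b -> c -> a plus an extra edge d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = page_rank(edges)
triangles = triangle_count(edges)
```

Node `c` ends up with the highest rank because it receives links from both `b` and `d`, while `d`, which nothing links to, keeps only the baseline share.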

5. Training method

The course consists of 30% theory discussion and 70% hands-on practice.
Learners bring their own laptop (minimum RAM: 4 GB on Linux, 6 GB on Windows)

6. Materials, classrooms & teaching equipment

Each learner receives a course book compiled by NIIT and supporting learning materials free of charge.

7. Prerequisites

Learners need basic knowledge of the Linux operating system and the Java programming language

8. Certificate

Learners who attend at least 70% of the training hours receive a Certificate of Participation for the Big Data Analyst course, issued by NIIT.

9. Learning warranty policy:

- Learning warranty: RETRAINING is offered to all learners who have studied at the Academy but whose results did not meet the requirements, or who need further knowledge support for real-world work.

- Learners studying at the Academy receive knowledge and practice support outside class hours, both offline (in person at the Academy) and online (with an instructor), to ensure the best learning outcomes.


Comments

admin

June 19

- At least 2 years of experience as a Data/Big Data Engineer
- Bachelor's degree in IT/Computer Science or a relevant background
- Experience in the Hadoop ecosystem and its components: HDFS, Yarn, MapReduce, Apache Spark (Python/Scala), Apache Sqoop, Apache Impala, Apache Avro, Apache Flume, Apache Kafka
- Preferred: CCA175 (Spark and Hadoop Developer) certificate
- Experience designing and developing ETL processes
- Unix experience; scripting experience is preferred
- Strong knowledge of data warehousing models, data ingestion patterns, data quality and data governance
- Experience with Hadoop systems and a good understanding of Hadoop clusters
- Good English communication skills

admin

June 19

- Bachelor's degree in IT/Computer Science or a relevant background.
- At least 5 years of experience in the relevant technologies.
- Expertise in implementing Modern Data Warehouse and Lakehouse solutions, data quality and metadata management.
- Strong experience with Azure Synapse Analytics, Dedicated and Serverless SQL Pools, ADLS Gen2, Azure Data Factory, Databricks, Stream Analytics.
- Extensive ETL/ELT experience with Azure data movement and transformation capabilities (Azure Synapse Pipelines, Azure Data Flow).
- Excellent working knowledge of SQL/T-SQL.
- Deep knowledge of Azure Synapse data pipeline orchestration and of Azure Synapse Spark Pools as a computation framework.
- Strong experience in data modelling (dimensional, temporal, slowly changing dimensions) and in full/incremental/delta data loading processes.
- Familiarity with data visualization techniques using Power BI Cloud or Tableau is a plus.
- Microsoft Azure Data Engineer Associate (DP-203) preferred.
- Good English communication skills.

admin

June 19

1. Main responsibilities

- Design, create and maintain optimal data pipeline architecture
- Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and AWS ‘big data’ technologies or Google BigQuery
- Build processes that support data transformation, workload management, data structures, dependency and metadata
- Assemble large, complex data sets that meet functional / non-functional business requirements.
- Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
- Prepare data for data models (data cleansing, data aggregation)
- Design and develop reliable, stable, and effective data marts to support business.
- Create ETL jobs and data pipelines
- Monitor data quality to meet SLAs
- Work with stakeholders including the Executive, Product, Data and Design teams to assist with data-related technical issues and support their data infrastructure needs.
- Create data tools for analytics and data scientist team members that assist them in building and optimizing our product into an innovative industry leader.

2. Educational Qualifications

We are looking for a candidate with 5+ years of experience in a Data Engineer role who holds a degree in Computer Science, Statistics, Informatics, Information Systems or another quantitative field. Priority is given to graduates of technology universities, FPT University, or other tech universities.

3. Professional Knowledge

- Advanced working knowledge of SQL, including query authoring, and experience with relational databases, as well as working familiarity with a variety of SQL and NoSQL databases
- Hands-on experience with SQL database design
- Experience building and optimizing ‘big data’ data pipelines, architectures, and data sets.
- Experience with object-oriented/object function scripting languages: Python, Java, C++, etc.
- Experience with build processes supporting data transformation, data structures, metadata, dependency and workload management.

4. Relevant Experience

- A successful history of manipulating, processing and extracting value from large disconnected datasets.
- Working knowledge of message queuing, stream processing, and highly scalable ‘big data’ data stores.
- Strong project management and organizational skills.
- Experience supporting and working with cross-functional teams in a dynamic environment.
- Good logical thinking, hard-working, positive attitude, good communication skills.
