Big Data Architecture

What is Big Data?

Here’s a definition of big data:

Big data refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around for a long time.

Key problems in big data analysis are:

1). Volume: Processing large volumes of data

2). Variety: Dealing with structured and unstructured data, e.g. textual data, CSVs, databases

3). Veracity: Data quality issues, e.g. missing data, duplicate data, corrupt data

4). The huge computations involved in carrying out data analysis

5). Velocity: Dealing with data arriving from diverse sources at a fast pace

The main challenges of big data are often referred to as the 4 V’s of big data, i.e. volume, variety, velocity and veracity. Here’s a reference (http://www.ibmbigdatahub.com/infographic/four-vs-big-data) and a visual illustration:

Figure: The 4 V’s of big data

Some of the areas and business problems where big data is applied include:

1). Advertising

a). Affiliate Marketing

b). Analysis of reviews for hotels and locations

c). Clickstream Analysis

2). Finance

3). Weather

4). Health Sciences

Big data experts are trying to answer tough questions like:

1). Which products are being bought by which segments, and why?

2). What is going to be the trend next month?

3). Which locations will be visited by different customer segments?

4). What are the biggest improvements or needs of the customers that a business can meet?

5). Which candidates will be voted for, and which issues are closest to the hearts of potential voters?

6). How can loss, theft or movement of stock in stores be detected?

7). How can stock requirements be predicted across a store chain?

Companies will be looking for different kinds of people to work on big data problems:

1). ETL Experts or Data Engineers

2). Data Warehouse Experts

3). Database Administrators or other database experts

4). Cloud architects

5). Support people specializing in cloud technologies and infrastructure management

6). Software Architects

7). Data Scientists

8). Software Developers with a background in AI, Data Mining, Machine Learning, etc.

ETL Pipeline

An ETL pipeline is the set of processes used to move data from a source or multiple sources into a database such as a data warehouse. ETL stands for “extract, transform, load,” the three interdependent processes of data integration used to pull data from one database and move it to another. Once loaded, data can be used for reporting, analysis, and deriving actionable business insights.

Ultimately, ETL pipelines enable businesses to gain competitive advantage by empowering decision-makers. To do this effectively, ETL pipelines should:

  • Provide continuous data processing
  • Be elastic and agile
  • Use isolated, independent processing resources
  • Increase data access
  • Be easy to set up and maintain
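
To make the extract, transform, load steps concrete, here is a minimal sketch in Python. It is purely illustrative: the CSV source file, the column names (order_id, customer_id, amount) and the SQLite database standing in for a data warehouse are assumptions made for the example, not a reference to any particular product.

```python
# Minimal ETL sketch (illustrative): the source file, column names and the
# SQLite "warehouse" are assumed for demonstration purposes only.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: drop incomplete records and normalise types (a veracity fix)."""
    cleaned = []
    for row in rows:
        if not row.get("order_id") or not row.get("amount"):
            continue  # skip records with missing key fields
        cleaned.append((row["order_id"], row["customer_id"], float(row["amount"])))
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into a warehouse-style table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders "
                "(order_id TEXT, customer_id TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```

In a real pipeline each stage would normally run as a separate, scheduled and monitored job, but the division of responsibilities is the same.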

Modern Big Data Pipelines

Modern big data pipelines are capable of ingesting structured data from enterprise applications, machine-generated data from IoT systems, streaming data from social media feeds, JSON event data, and weblog data from Internet and mobile apps.

Figure: A generic big data pipeline based on the Snowflake platform (source: https://www.snowflake.com/guides/etl-pipeline)

Figure: Modern big data pipeline architecture
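
As a rough illustration of what ingesting semi-structured JSON event data (for example clickstream or IoT events) into a staging area might look like, here is a small Python sketch. The event fields and the SQLite staging store are assumptions chosen for the demonstration, not the API of any specific pipeline product.

```python
# Illustrative ingestion sketch: land raw JSON event payloads in a staging
# table as-is; field names and the SQLite store are assumed for the example.
import json
import sqlite3

incoming_events = [
    '{"user_id": "u1", "event": "page_view", "ts": "2024-01-01T10:00:00Z"}',
    '{"user_id": "u2", "event": "purchase", "ts": "2024-01-01T10:01:30Z"}',
]

con = sqlite3.connect("staging.db")
con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")

for line in incoming_events:
    record = json.loads(line)  # confirms the payload is well-formed JSON
    con.execute("INSERT INTO raw_events VALUES (?)", (json.dumps(record),))

con.commit()
con.close()
```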

ELT Pipelines

Modern big data platforms receive data from many data sources, so it is cumbersome and time-consuming to apply traditional ETL techniques in these cases.

So what options do you have if you have a lot of data from numerous sources and want to shape, clean, filter, join and transform that data? ELT is the next generation of data integration, and it is the approach applied in such cases.

‘ELT’ means you extract data from the source, load it unchanged into a target platform (which is often a cloud data warehouse), and then transform it afterwards, to make it ready for use.

The main advantage of using an ELT approach is that you can move all raw data from a multitude of sources into a single, unified repository (a single source of truth) and have unlimited access to all of your data at any time. You can work more flexibly and it makes it easy to store new, unstructured data. Data analysts and engineers save time when working with new information as they no longer have to develop complex ETL processes before the data is loaded into the warehouse.
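
Here is a minimal ELT-style sketch in Python, with an in-memory SQLite database standing in for a cloud data warehouse: raw JSON payloads are loaded unchanged into a staging table, and the transformation into a clean, queryable table happens afterwards with SQL inside the warehouse. The table and field names are illustrative assumptions, and the SQL part assumes SQLite’s built-in JSON functions are available (they are in recent Python/SQLite builds).

```python
# Minimal ELT sketch: load raw data first, transform later inside the
# "warehouse" (here an in-memory SQLite database used as a stand-in).
import sqlite3

con = sqlite3.connect(":memory:")

# Extract + Load: land the raw payloads exactly as received.
con.execute("CREATE TABLE raw_orders (payload TEXT)")
raw_payloads = [
    '{"order_id": "o1", "customer_id": "c1", "amount": "19.99"}',
    '{"order_id": "o2", "customer_id": "c2", "amount": "5.00"}',
]
con.executemany("INSERT INTO raw_orders VALUES (?)", [(p,) for p in raw_payloads])

# Transform: shape the data inside the warehouse, only when it is needed.
con.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.order_id')             AS order_id,
           json_extract(payload, '$.customer_id')          AS customer_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount
    FROM raw_orders
""")
print(con.execute("SELECT * FROM orders").fetchall())
```

The key difference from the ETL sketch earlier is the order of operations: nothing is cleaned or reshaped before loading, so new and even unstructured data can be landed immediately and transformed later, on demand.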

