Big Data Architecture

What is Big Data?

Here’s a definition of big data:

Big data refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around for a long time.

Key problems in big data analysis are:

1). Volume: Processing large volumes of data

2). Variety: Dealing with structured and unstructured data, e.g. textual data, CSVs, databases

3). Veracity: Data quality issues, e.g. missing data, duplicate data, corrupt data

4). The huge computations involved in carrying out data analysis

5). Velocity: Dealing with data arriving from diverse sources at a fast pace

The main challenges of big data are often referred to as the 4 V’s of big data, i.e. volume, variety, velocity and veracity. Here’s a reference (http://www.ibmbigdatahub.com/infographic/four-vs-big-data) and a visual illustration:

Figure: The 4 V’s of big data

Some of the areas and business problems where big data is applied include:

1). Advertising

a). Affiliate Marketing

b). Analysis of reviews for hotels and locations

c). Clickstream Analysis

2). Finance

3). Weather

4). Health Sciences

Big data experts are trying to answer tough questions like:

1). Which products are being bought by which segments, and why?

2). What is going to be the trend next month?

3). Which locations will be visited by different customer segments?

4). What are the biggest improvements or needs of the customers that a business can meet?

5). Which candidates will be voted for, and which issues are closest to the hearts of potential voters?

6). How can loss, theft or movement of stock in stores be detected?

7). How can stock requirements be predicted across a store chain?

Companies will be looking for different kinds of people to work on big data problems:

1). ETL Experts or Data Engineers

2). Data Warehouse Experts

3). Database Administrators or other database experts

4). Cloud architects

5). Support people specializing in cloud technologies and infrastructure management

6). Software Architects

7). Data Scientists

8). Software Developers with a background in AI, Data Mining, Machine Learning, etc.

ETL Pipeline

An ETL pipeline is the set of processes used to move data from a source or multiple sources into a database such as a data warehouse. ETL stands for “extract, transform, load,” the three interdependent processes of data integration used to pull data from one database and move it to another. Once loaded, data can be used for reporting, analysis, and deriving actionable business insights.

Ultimately, ETL pipelines enable businesses to gain competitive advantage by empowering decision-makers. To do this effectively, ETL pipelines should:

  • Provide continuous data processing
  • Be elastic and agile
  • Use isolated, independent processing resources
  • Increase data access
  • Be easy to set up and maintain
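
To make the extract, transform, load steps concrete, here is a minimal sketch in Python. It is purely illustrative: the CSV source file, the column names (order_id, customer_id, amount) and the SQLite database standing in for a data warehouse are assumptions made for the example, not a reference to any particular product.

```python
# Minimal ETL sketch (illustrative): the source file, column names and the
# SQLite "warehouse" are assumed for demonstration purposes only.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: drop incomplete records and normalise types (a veracity fix)."""
    cleaned = []
    for row in rows:
        if not row.get("order_id") or not row.get("amount"):
            continue  # skip records with missing key fields
        cleaned.append((row["order_id"], row["customer_id"], float(row["amount"])))
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into a warehouse-style table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders "
                "(order_id TEXT, customer_id TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```

In a real pipeline each stage would normally run as a separate, scheduled and monitored job, but the division of responsibilities is the same.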

Modern Big Data Pipelines

Modern big data pipelines are capable of ingesting structured data from enterprise applications, machine-generated data from IoT systems, streaming data from social media feeds, JSON event data, and weblog data from Internet and mobile apps.

Figure: A generic big data pipeline based on the Snowflake platform (source: https://www.snowflake.com/guides/etl-pipeline)

Figure: Modern big data pipeline architecture
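
As a rough illustration of what ingesting semi-structured JSON event data (for example clickstream or IoT events) into a staging area might look like, here is a small Python sketch. The event fields and the SQLite staging store are assumptions chosen for the demonstration, not the API of any specific pipeline product.

```python
# Illustrative ingestion sketch: land raw JSON event payloads in a staging
# table as-is; field names and the SQLite store are assumed for the example.
import json
import sqlite3

incoming_events = [
    '{"user_id": "u1", "event": "page_view", "ts": "2024-01-01T10:00:00Z"}',
    '{"user_id": "u2", "event": "purchase", "ts": "2024-01-01T10:01:30Z"}',
]

con = sqlite3.connect("staging.db")
con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")

for line in incoming_events:
    record = json.loads(line)  # confirms the payload is well-formed JSON
    con.execute("INSERT INTO raw_events VALUES (?)", (json.dumps(record),))

con.commit()
con.close()
```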

ELT Pipelines

Modern big data platforms receive data from many data sources, so it is cumbersome and time-consuming to apply traditional ETL techniques in these cases.

So what options do you have if you have a lot of data from numerous sources and want to shape, clean, filter, join and transform that data? ELT is the next generation of data integration, and it is the approach applied in such cases.

‘ELT’ means you extract data from the source, load it unchanged into a target platform (which is often a cloud data warehouse), and then transform it afterwards, to make it ready for use.

The main advantage of using an ELT approach is that you can move all raw data from a multitude of sources into a single, unified repository (a single source of truth) and have unlimited access to all of your data at any time. You can work more flexibly and it makes it easy to store new, unstructured data. Data analysts and engineers save time when working with new information as they no longer have to develop complex ETL processes before the data is loaded into the warehouse.
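
Here is a minimal ELT-style sketch in Python, with an in-memory SQLite database standing in for a cloud data warehouse: raw JSON payloads are loaded unchanged into a staging table, and the transformation into a clean, queryable table happens afterwards with SQL inside the warehouse. The table and field names are illustrative assumptions, and the SQL part assumes SQLite’s built-in JSON functions are available (they are in recent Python/SQLite builds).

```python
# Minimal ELT sketch: load raw data first, transform later inside the
# "warehouse" (here an in-memory SQLite database used as a stand-in).
import sqlite3

con = sqlite3.connect(":memory:")

# Extract + Load: land the raw payloads exactly as received.
con.execute("CREATE TABLE raw_orders (payload TEXT)")
raw_payloads = [
    '{"order_id": "o1", "customer_id": "c1", "amount": "19.99"}',
    '{"order_id": "o2", "customer_id": "c2", "amount": "5.00"}',
]
con.executemany("INSERT INTO raw_orders VALUES (?)", [(p,) for p in raw_payloads])

# Transform: shape the data inside the warehouse, only when it is needed.
con.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.order_id')             AS order_id,
           json_extract(payload, '$.customer_id')          AS customer_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount
    FROM raw_orders
""")
print(con.execute("SELECT * FROM orders").fetchall())
```

The key difference from the ETL sketch earlier is the order of operations: nothing is cleaned or reshaped before loading, so new and even unstructured data can be landed immediately and transformed later, on demand.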

