What is Big Data?
Here’s a definition of big data:
“Big data refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around for a long time.”
Key problems in big data analysis are:
1). Volume : Processing large volumes of data
2). Variety : Dealing with structured and unstructured data, e.g. textual data, CSVs, databases
3). Veracity : Data quality issues, e.g. missing data, duplicate data, corrupt data
4). The huge computations involved in accomplishing data analysis
5). Velocity : Dealing with diverse data sources at a fast pace
The main challenges of big data are often referred to as the 4 V’s of big data, i.e. volume, variety, velocity and veracity. Here’s a reference and visual illustration: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Some of the areas and business problems where big data is applied include:
1). Advertising
a). Affiliate Marketing
b). Analysis of reviews for hotels and locations
c). Clickstream Analysis
2). Finance
3). Weather
4). Health Sciences
Big data experts are trying to answer tough questions like:
1). Which products are being bought by which segments, and why?
2). What is going to be the trend next month?
3). Which locations will be visited by different customer segments?
4). What are the biggest improvements/needs of the customers that can be met by a business?
5). Which candidates will voters choose, and what are the issues closest to the hearts of potential voters?
6). How can loss/theft/movement of stock in stores be detected?
7). How can stock requirements be predicted across a store chain?
Companies will be looking for different kinds of people to work on big data problems:
1). ETL Experts or Data Engineers
2). Data Warehouse Experts
3). Database Administrators, or database experts
4). Cloud architects
5). Support people specializing in cloud technologies and infrastructure management
6). Software Architects
7). Data Scientists
8). Software Developers with a background in AI, Data Mining, Machine Learning, etc.
ETL Pipeline
An ETL pipeline is the set of processes used to move data from a source or multiple sources into a database such as a data warehouse. ETL stands for “extract, transform, load,” the three interdependent processes of data integration used to pull data from one database and move it to another. Once loaded, data can be used for reporting, analysis, and deriving actionable business insights.
Ultimately, ETL pipelines enable businesses to gain competitive advantage by empowering decision-makers. To do this effectively, ETL pipelines should:
- Provide continuous data processing
- Be elastic and agile
- Use isolated, independent processing resources
- Increase data access
- Be easy to set up and maintain
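To make the extract, transform, load sequence concrete, here is a minimal sketch in Python using pandas and sqlite3. The file name, table name and column names are hypothetical, and a real pipeline would typically target a proper data warehouse rather than SQLite:

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from a source, here a hypothetical CSV export.
raw = pd.read_csv("sales_export.csv")

# Transform: clean and reshape before loading (typical veracity fixes).
clean = raw.drop_duplicates()                        # duplicate data
clean = clean.dropna(subset=["order_id", "amount"])  # missing data
clean["amount"] = clean["amount"].astype(float)      # normalise types

# Load: write the cleaned data into the target table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="append", index=False)
```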
Modern Big Data Pipelines
Modern big data pipelines are capable of ingesting structured data from enterprise applications, machine-generated data from IoT systems, streaming data from social media feeds, JSON event data, and weblog data from Internet and mobile apps.
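As an illustration, here is a small PySpark sketch of such ingestion (assuming a working Spark installation; the paths are hypothetical). Structured CSV exports and semi-structured JSON events end up in the same DataFrame abstraction, so they can be joined and analysed together:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-data-pipeline").getOrCreate()

# Structured data exported from an enterprise application.
orders = spark.read.option("header", True).csv("exports/orders.csv")

# Semi-structured JSON event data, e.g. from IoT devices or mobile apps;
# Spark infers the schema from the documents themselves.
events = spark.read.json("landing/events/*.json")

orders.printSchema()
events.printSchema()
```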
ELT Pipelines
Modern big data platforms receive data from many data sources, so it is cumbersome and time-consuming to apply traditional ETL techniques in these cases.
So what options do you have if you have a lot of data from numerous sources and you want to shape, clean, filter, join and transform that data? ELT is the next generation of data integration that is applied in such cases.
‘ELT’ means you extract data from the source, load it unchanged into a target platform (which is often a cloud data warehouse), and then transform it afterwards to make it ready for use.
The main advantage of using an ELT approach is that you can move all raw data from a multitude of sources into a single, unified repository (a single source of truth) and have unlimited access to all of your data at any time. You can work more flexibly, and it becomes easy to store new, unstructured data. Data analysts and engineers save time when working with new information, as they no longer have to develop complex ETL processes before the data is loaded into the warehouse.
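Here is a minimal ELT sketch in Python, using SQLite as a stand-in for a cloud warehouse (it assumes a SQLite build with the built-in JSON functions; the file, table and field names are hypothetical). Raw events are landed unchanged, and the transformation runs later, inside the target, as SQL:

```python
import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    # Load: land the raw JSON events untouched in a staging table.
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    with open("events.json") as f:  # one JSON document per line
        rows = [(line.strip(),) for line in f if line.strip()]
    conn.executemany("INSERT INTO raw_events (payload) VALUES (?)", rows)

    # Transform: shape the raw data with SQL only when it is needed,
    # using json_extract to pull fields out of the stored payloads.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS page_views AS
        SELECT json_extract(payload, '$.user_id') AS user_id,
               json_extract(payload, '$.url')     AS url,
               json_extract(payload, '$.ts')      AS ts
        FROM raw_events
        WHERE json_extract(payload, '$.event') = 'page_view'
    """)
```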