Data Lake

Introduction

For most modern organizations, data is not limited to the structured data found in databases and data warehouses. Data lakes are the foundation for building the AI-driven solutions that many organizations need.

The key reason for adopting a data lake architecture is that audio, video, images, and social media chatter hold far more information that can drive industry, health, and society than traditional databases do.

 

A typical organization today may draw on many different sources and types of data, including:

1). JSON files

2). XML files

3). Various types of Images

4). Videos

5). Audio recordings

6). Data generated on social media including tweets, comments, customer feedback

7). Emails

8). IoT device data

A typical modern organization collects all the relevant data in real time and stores it somewhere; that somewhere is usually a data lake.

So, a data lake is a central repository that stores structured data, such as data from on-premises or cloud databases; semi-structured data, such as JSON, Avro, Parquet, XML, and other raw files; and unstructured data, such as audio, video, and binary files. This data is ingested from many (potentially millions of) batch or continuous data streams.
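As a rough illustration of the definition above, the sketch below sorts incoming files into the three categories and builds a landing path for each one. The extension-to-category mapping, the `raw/` prefix, and the `source=.../date=...` path layout are all assumptions made for this example, not a standard.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to data category (an assumption
# for illustration; the grouping follows the definition in the text, with
# CSV standing in for table exports from a database).
CATEGORY_BY_EXTENSION = {
    ".csv": "structured",
    ".json": "semi-structured",
    ".avro": "semi-structured",
    ".parquet": "semi-structured",
    ".xml": "semi-structured",
    ".mp3": "unstructured",
    ".wav": "unstructured",
    ".mp4": "unstructured",
    ".png": "unstructured",
    ".jpg": "unstructured",
}


def classify(filename: str) -> str:
    """Return the data category for a file, defaulting to unstructured."""
    suffix = PurePosixPath(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unstructured")


def landing_path(source: str, filename: str, ingested_on: date) -> str:
    """Build a raw-zone path; the file itself would be stored unchanged."""
    return str(
        PurePosixPath("raw")
        / classify(filename)
        / f"source={source}"
        / f"date={ingested_on.isoformat()}"
        / PurePosixPath(filename).name
    )


if __name__ == "__main__":
    day = date(2024, 1, 15)
    incoming = [("crm", "customers.csv"), ("web", "clicks.json"), ("clinic", "scan_001.dcm")]
    for source, name in incoming:
        print(landing_path(source, name, day))
```

Nothing is parsed or transformed at this stage; the only decision made on ingestion is where the raw file lands.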

Once data is stored in a data lake, a variety of jobs process it. The results may find their way into a database or a data warehouse, or may be used to build an ML model that drives the business.
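One such processing step might take raw JSON events from the lake and flatten them into rows that a database or warehouse table could accept. The sketch below shows the idea; the sample events, the column names, and the `to_row` and `process` helpers are hypothetical and exist only for this example.

```python
import json

# Raw, newline-delimited JSON as it might sit in the lake (made-up sample data).
RAW_EVENTS = """\
{"user": {"id": 1, "country": "PK"}, "event": "purchase", "amount": 20.5}
{"user": {"id": 2}, "event": "view"}
{"user": {"id": 1, "country": "PK"}, "event": "purchase", "amount": 4.5}
"""

# Columns of the hypothetical target table.
COLUMNS = ("user_id", "country", "event", "amount")


def to_row(line: str) -> tuple:
    """Flatten one raw event into a fixed-width row, filling gaps with None."""
    record = json.loads(line)
    user = record.get("user", {})
    return (user.get("id"), user.get("country"), record.get("event"), record.get("amount"))


def process(raw_text: str) -> list:
    """Apply the schema at read time; the raw text itself is left untouched."""
    return [to_row(line) for line in raw_text.splitlines() if line.strip()]


if __name__ == "__main__":
    for row in process(RAW_EVENTS):
        print(dict(zip(COLUMNS, row)))
```

The raw events stay in the lake as they arrived, so a different job can later read the same data with a different schema.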

Data Lake vs. Data Warehouse

The table below breaks down the differences between a data lake and a data warehouse by category.

| Category | Data Lake | Data Warehouse |
| --- | --- | --- |
| Type of data | Unstructured and structured data from various company data sources | Historical data that has been structured to fit a relational database schema |
| Purpose | Cost-effective big data storage | Analytics for business decisions |
| Users | Data scientists and engineers | Data analysts and business analysts |
| Tasks | Storing data and big data analytics, like deep learning and real-time analytics | Typically read-only queries for aggregating and summarizing data |
| Size | Stores all data that might be used; can take up petabytes | Only stores data relevant to analysis |

Why Do We Need a Data Lake in the Healthcare Sector

Data warehouses have been used for many years in the healthcare industry, but they are not designed to store CT scans, lab reports, doctors' notes, transcripts, X-ray images, and many other kinds of healthcare data.

Because much of the data in healthcare is unstructured (physicians' notes, clinical data, etc.) and real-time insights are needed, data warehouses are generally not an ideal model.

Data lakes allow for a combination of structured and unstructured data, which tends to be a better fit for healthcare companies.

Typical Data Lake Architecture

Data engineers use data lakes to store incoming data. Because a lake keeps data in its raw form, including unstructured data such as images, audio, and video, its storage is more flexible and scalable than a fixed-schema store, which is often better for big data analytics.

Big data analytics can be run on data lakes using frameworks such as Apache Spark and Hadoop. This matters especially for deep learning, which needs to scale with a growing amount of training data.
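Engines such as Spark and Hadoop scale by splitting data into partitions, processing each partition independently, and merging the partial results. The sketch below imitates that pattern with Python's standard library only. It is a single-machine stand-in, not Spark or Hadoop API code, and the partition contents and function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up partitions; in a lake each would be a separate file or block.
PARTITIONS = [
    ["purchase", "view", "view"],
    ["view", "purchase"],
    ["refund", "view"],
]


def count_partition(events):
    """Map step: count event types within a single partition."""
    return Counter(events)


def count_all(partitions, workers=3):
    """Run the map step per partition in parallel, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_partition, partitions)
        total = Counter()
        for partial in partials:
            total.update(partial)  # reduce step: add partial counts together
    return total


if __name__ == "__main__":
    print(count_all(PARTITIONS))
```

Because each partition is counted on its own, adding more data means adding more partitions and workers rather than rewriting the job.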

Data warehouses are typically set to read-only for analyst users, who primarily read and aggregate data for insights. Since the data is already clean and archival, there is usually no need to insert or update it.
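The read-only access pattern can be sketched with SQLite from Python's standard library, used here purely as a small stand-in for a real warehouse. The `sales` table, its contents, and the helper names are assumptions for this example; the `mode=ro` URI parameter is SQLite's way of opening a database read-only.

```python
import os
import sqlite3
import tempfile


def build_demo_warehouse(path):
    """Load a tiny, made-up sales table; stands in for a warehouse load job."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [("north", 100.0), ("north", 50.0), ("south", 70.0)],
        )
    conn.close()


def open_read_only(path):
    """Open the database in read-only mode using SQLite's URI syntax."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def sales_by_region(conn):
    """Typical analyst query: aggregate and summarize, never modify."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "warehouse.db")
        build_demo_warehouse(db_path)
        conn = open_read_only(db_path)
        print(sales_by_region(conn))
        try:
            conn.execute("INSERT INTO sales VALUES ('east', 1.0)")
        except sqlite3.OperationalError as exc:
            print("write rejected:", exc)
        conn.close()
```

The analyst connection can aggregate freely, while any attempt to insert or update is rejected by the database itself.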

Volume of Data 

It should be no surprise that data lakes are much bigger because they retain all data that might be relevant to a company. Data lakes are often petabytes in size, and one petabyte is 1,000 terabytes. Data warehouses are much more selective about what data is stored.

Conclusion

When you are deciding between a data warehouse and a data lake, look at your use cases. If you have different types of data that need to be processed, analyzed, and modeled for business needs, then you should go for a data lake. But if you are working only with structured data and want to do analytics and predictive modeling, then you should go for a traditional data warehouse architecture.

By Hassan Amin

Dr. Syed Hassan Amin holds a Ph.D. in Computer Science from Imperial College London, United Kingdom, and an MS in Computer System Engineering from GIKI, Pakistan. During his Ph.D., he worked on Image Processing, Computer Vision, and Machine Learning. He has done research and development in many areas, including Urdu and local language Optical Character Recognition, Retail Analysis, Affiliate Marketing, Fraud Prediction, 3D reconstruction of face images from 2D images, and Retinal Image analysis.