Introduction
For most modern organizations, data is not limited to the structured data found in databases and data warehouses. Data lakes are the foundation for building the AI-driven solutions that many organizations need.
The key reason for adopting a data lake architecture is that audio, video, images and social media chatter contain far more information that can drive industry, health and society than traditional databases do.
A typical organization today draws on many different sources and types of data, including:
1). JSON files
2). XML files
3). Various types of Images
4). Videos
5). Audios
6). Data generated on social media including tweets, comments, customer feedback
7). Emails
8). IoT
A typical, modern organization would collect all the relevant data in real-time and store it somewhere; that somewhere is usually a data lake.
So, a data lake is a central repository that stores structured data from on-premises or cloud databases; semi-structured data such as JSON, Avro, Parquet, XML and other raw files; and unstructured data such as audio, video and binary files. The data is ingested from many batch or continuous data streams, potentially millions of them.
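As a rough illustration of the three categories above, the sketch below sorts incoming files by extension and builds a storage key for each one. The extension table, the `classify` and `lake_key` helpers, and the `<category>/<source>/<filename>` layout are all assumptions made for this example, not a standard; treating a CSV export as structured data is likewise a simplification.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the three categories named above.
# The extension list is illustrative, not exhaustive.
CATEGORY_BY_EXTENSION = {
    ".csv": "structured",        # assumed to be a table exported from a database
    ".json": "semi-structured",
    ".xml": "semi-structured",
    ".avro": "semi-structured",
    ".parquet": "semi-structured",
    ".mp3": "unstructured",
    ".wav": "unstructured",
    ".mp4": "unstructured",
    ".jpg": "unstructured",
    ".png": "unstructured",
}


def classify(path: str) -> str:
    """Return the data category for a file path, defaulting to 'unstructured'."""
    suffix = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unstructured")


def lake_key(source: str, path: str) -> str:
    """Build a storage key of the form <category>/<source>/<filename>."""
    name = PurePosixPath(path).name
    return f"{classify(path)}/{source}/{name}"
```

For example, `lake_key("crm", "exports/customers.csv")` yields `structured/crm/customers.csv`, while a file with an unknown extension falls back to the unstructured area.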
Once data is stored in a data lake, it passes through a variety of processing steps and may find its way into a database or a data warehouse, or be used to build an ML model that drives the business.
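The flow just described, storing raw data first and processing it into a table later, can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: a local directory stands in for the lake, SQLite stands in for the downstream database, and the `raw/<source>/<day>/` layout, the `feedback` table and both function names are hypothetical.

```python
import json
import sqlite3
from pathlib import Path


def land_raw(lake_root: Path, source: str, day: str, records: list[dict]) -> Path:
    """Write records unchanged as JSON lines under raw/<source>/<day>/."""
    target_dir = lake_root / "raw" / source / day
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.jsonl"
    with target.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return target


def load_feedback(raw_file: Path, conn: sqlite3.Connection) -> int:
    """Load well-formed feedback rows into a table; return how many were loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feedback (customer_id TEXT, rating INTEGER)"
    )
    rows = []
    with raw_file.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # Skip records that lack the fields the downstream table needs.
            if "customer_id" in record and isinstance(record.get("rating"), int):
                rows.append((record["customer_id"], record["rating"]))
    conn.executemany("INSERT INTO feedback VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

The raw file keeps every record exactly as it arrived, including ones the table cannot use, so a later process with different needs can still read them.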
The table below breaks down the differences between a data lake and a data warehouse by category.
Category | Data Lake | Data Warehouse |
---|---|---|
Type of data | Unstructured and structured data from various company data sources | Historical data that has been structured to fit a relational database schema |
Purpose | Cost-effective big data storage | Analytics for business decisions |
Users | Data scientists and engineers | Data analysts and business analysts |
Tasks | Storing data and big data analytics, like deep learning and real-time analytics | Typically read-only queries for aggregating and summarizing data |
Size | Stores all data that might be used—can take up petabytes! | Only stores data relevant to analysis |
Why Do We Need a Data Lake in the Healthcare Sector
Data warehouses have been used for many years in the healthcare industry, but they cannot be used to store CT scans, lab reports, doctors' notes, transcripts, X-ray images and many other kinds of healthcare data.
Because much of the data in healthcare is unstructured (physicians' notes, clinical data, etc.) and real-time insights are needed, data warehouses are generally not an ideal model.
Data lakes allow for a combination of structured and unstructured data, which tends to be a better fit for healthcare companies.
Typical Data Lake Architecture
Data engineers use data lakes to store incoming data. Because they accept unstructured data such as images, audio and video alongside structured data, data lakes are more flexible and scalable, which is often better for big data analytics.
Big data analytics can be run on data lakes using frameworks such as Apache Spark and Apache Hadoop. This is especially useful for deep learning, which needs to scale with a growing amount of training data.
Data warehouses are typically set to read-only for analyst users, who primarily read and aggregate data for insights. Since the data is already clean and archival, there is usually no need to insert or update it.
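To make the read-only access pattern concrete, here is a small sketch that uses SQLite as a stand-in for a warehouse. It opens the database file through a `mode=ro` URI so that writes are rejected, then runs a typical aggregating query. The `sales` table, its `region` and `amount` columns, and both function names are assumptions for illustration; a real warehouse would enforce read-only access through its own permission system.

```python
import sqlite3


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that write statements are rejected."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def revenue_by_region(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Summarize the hypothetical sales table into one row per region."""
    query = (
        "SELECT region, SUM(amount) "
        "FROM sales GROUP BY region ORDER BY region"
    )
    return conn.execute(query).fetchall()
```

On a connection opened this way, an `INSERT` or `UPDATE` raises `sqlite3.OperationalError`, while `SELECT` queries such as the aggregation above run normally.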
Volume of Data
It should be no surprise that data lakes are much bigger, because they retain all data that might be relevant to a company. Data lakes are often petabytes in size, and a single petabyte is 1,000 terabytes. Data warehouses are much more selective about what data is stored.
Conclusion
When deciding between a data warehouse and a data lake, look at your use cases. If you have different types of data that need to be processed, analyzed and modeled for business needs, go for a data lake. But if you are working only with structured data and want to do analytics and predictive modeling, go for a traditional data warehouse architecture.