Canonical URL
https://www.dbta.com/Columns/DBA-Corner/Data-Lake-Versus-Data-Warehouse-Understanding-the-Differences-124312.aspx
Author(s)
Craig S. Mullins
<p>Data lake is a newer IT term created for a new category of data store. But just what is a data lake?</p>
<p>According to IBM, “a data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is accessed.”</p>
<p>That makes sense. I think the most important aspect of this definition is that data is stored in its “native format.” The data is not manipulated or transformed in any meaningful way; it is simply stored and cataloged for future use.</p>
<p>Any type of data can be stored in a data lake: structured, semi-structured, and unstructured. For example, organizations can use a data lake for customer information captured from multiple sources for future analysis and aggregation. This can consist of typical structured data (numbers, characters, dates, and times), as well as complex documents, text, multimedia, and more. In general, the data is ingested without transformation. Data scientists can run analytical models against it, business analysts can use it to augment business intelligence activities, and it can even serve as a long-term data archive.</p>
<p>Organizations are under intense pressure these days to capture any data that could be relevant to their business. And the number of sources and the amount of data continue to rise steadily. So the desire to grab the data when it is available is high, but the time to organize and fully understand that data at the time of capture is usually not available.</p>
<p>But a data lake should not be treated as a dumping ground for data. It is important to have a means of understanding and managing the data that is stored in the data lake. Without a mechanism for defining, populating, accessing, and managing the data in your data lakes, you will find them to be less than useful.</p>
<p>Populating a data lake requires knowledge of, and proper tools for, data integration. Because the data lake contains multiple types of data from multiple sources, it must include support for a wide array of different platforms, data types and structures, interfaces, and processing capabilities.</p>
<p>You will also need some form of metadata management for a data lake environment to remain useful and healthy. At a minimum, a data lake requires information about each type of data stored there, along with guidance on where the data originated (that is, its provenance), the data elements it contains, the meaning of each, and how to read them. Of course, the metadata can be minimal to begin with and then fleshed out as your data scientists and analytics teams explore the data.</p>
<p>Some pundits have surmised that data lakes will bring about the death of data marts and data warehouses. But if you think about it, this cannot be the case. A data warehouse, as defined by Bill Inmon, is “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.”</p>
<p>In contrast with a data lake, where data is captured and stored with no transformation or aggregation, a data warehouse contains data transformed from multiple sources and is designed for business users. A data lake cannot serve the same purpose unless the data is modified from its “native format,” and at that point it stops being a data lake by definition.</p>
<p>There are, certainly, many other differences. A data warehouse contains structured data whereas a data lake can contain structured, unstructured, and semi-structured data. Data in the data lake comes from multiple sources and will have varying schemata. As such, the data lake requires schema-on-read capability and a platform, such as Hadoop, that supports such a requirement. With data from multiple, disparate sources all being stored in its native format, data lakes cannot support schema-on-write as data warehouses do.</p>
<p>Of course, Hadoop is not the only technology that can be used for data lakes. Some organizations with a more cloud-focused mentality are using solutions from cloud providers like Amazon Web Services (AWS) and others.</p>
<p>The type of storage that can be used also separates data warehouses from data lakes. With a data warehouse, performance is important, and you do not want to store data that will be queried by business professionals on slower, less-costly storage devices. Conversely, storing a data lake on such devices makes a lot of sense!</p>
<p>So understand the differences between data lakes and data warehouses; use them both accordingly; and do not confuse the two.</p>
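<p>To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal Python sketch. It is an illustration only, not how Hadoop or any particular warehouse product works. The record layouts, the field names (customer_id, id, signup, joined), and the functions write_to_warehouse and read_from_lake are all hypothetical, invented for this example.</p>

```python
import json
from datetime import date

# Hypothetical raw records from two sources, stored exactly as they arrived.
# The field names differ by source; nothing was transformed on ingest.
RAW_LAKE = [
    '{"customer_id": "17", "signup": "2018-04-02", "plan": "gold"}',
    '{"id": 42, "joined": "2018-03-15", "notes": "called support twice"}',
]


def write_to_warehouse(table, record):
    """Schema-on-write: validate and convert BEFORE the row is stored.

    A record that does not match the expected layout is rejected
    (here, with a KeyError) and never reaches the table.
    """
    row = {
        "customer_id": int(record["customer_id"]),
        "signup": date.fromisoformat(record["signup"]),
    }
    table.append(row)


def read_from_lake(raw_lines):
    """Schema-on-read: the stored data stays raw; the reader applies
    a schema at query time and decides how to map each source's fields.
    """
    for line in raw_lines:
        doc = json.loads(line)
        customer_id = doc.get("customer_id", doc.get("id"))
        signup = doc.get("signup", doc.get("joined"))
        yield {
            "customer_id": int(customer_id),
            "signup": date.fromisoformat(signup),
        }


if __name__ == "__main__":
    warehouse = []
    write_to_warehouse(warehouse, json.loads(RAW_LAKE[0]))
    print(warehouse)
    print(list(read_from_lake(RAW_LAKE)))
```

<p>In this sketch the second raw record could not be loaded by write_to_warehouse without first being transformed, yet it sits in the lake untouched and becomes usable as soon as a reader supplies a mapping for it. That mapping is exactly the kind of knowledge the metadata for a data lake needs to capture.</p>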
Newsletter Name
DBTA E-Edition
Issue Name
April 2018
Article Type
Columns
Article SubType
DBA Corner