Saturday 14 November 2015

Data Warehouse vs Data Lake



In my last post we talked about the data lake, why it is needed, and its advantages. In this post we will compare the data lake with the data warehouse.

The emergence of the data lake has periodically led to a somewhat regrettable dialogue that compares data warehouses unfavorably with data lakes. These kinds of discussions are inevitable as a new, hot concept elbows the previous decade’s paradigm out of the way. But it’s a misleading thought process. While data lakes can accomplish a number of things that are very difficult for a data warehouse to do, the opposite is also true. The two technologies are inherently different and serve different enterprise needs. One is not necessarily better than the other. In some use cases, the data lake can be quite deficient when contrasted with a data warehouse.

A data warehouse is a carefully designed data store that contains data from pre-selected sources, such as enterprise resource planning (ERP) and logistics systems. Data warehouses tend to emphasize transactional, structured data over semi-structured and unstructured data. The data in a warehouse is usually organized using multi-dimensional, star, or snowflake schemas in order to streamline queries, reports, dashboards, and advanced analytical models. In a data warehouse (DW), the schema is defined before data is stored: the required data is identified and modeled in advance. This is called “schema on write”. Data in a traditional data warehouse is cleansed before it is loaded, whereas data in a data lake is typically raw.
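To make “schema on write” concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a real warehouse. The table names (`dim_product`, `fact_sales`), the columns, and the helper functions are all invented for illustration; the point is only that the star schema exists before any data arrives, and rows that do not fit it are rejected at load time.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory star schema: one dimension table, one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            category   TEXT NOT NULL
        );
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            quantity   INTEGER NOT NULL CHECK (quantity > 0),
            amount     REAL    NOT NULL CHECK (amount >= 0)
        );
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Manual", "Books")],
    )
    conn.commit()
    return conn


def load_sale(conn, product_id, quantity, amount):
    """Load one fact row; return False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO fact_sales (product_id, quantity, amount) "
                "VALUES (?, ?, ?)",
                (product_id, quantity, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def sales_by_category(conn):
    """A typical warehouse query: join the fact table to a dimension."""
    rows = conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
    """).fetchall()
    return dict(rows)


conn = build_warehouse()
load_sale(conn, 1, 2, 19.98)   # accepted
load_sale(conn, 99, 1, 5.0)    # rejected: unknown product
load_sale(conn, 2, -1, 5.0)    # rejected: quantity must be positive
print(sales_by_category(conn))
```

Because bad rows never get in, the reporting query can trust what it finds. The cost is that anything the schema did not anticipate has nowhere to go.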

A data lake (typically implemented using Hadoop) is a mix of structured, semi-structured, and unstructured data. For example, transactions, spreadsheets, documents, sensor data, images, and social media content may all be stored in the data lake. The data lake may be fed using traditional batch jobs or by connecting it to real-time data feeds. In a data lake, the schema is defined after the data is stored. This is called “schema on read”, so the structure of the data must be captured in code by each program that accesses it. A data lake provides cheaper storage for large volumes of data and has the potential to reduce processing cost by bringing analytics closer to the data. The data lake also gives business users immediate access to all data. They don’t have to wait for the data warehousing (DW) team to model the data or give them access; rather, they shape the data however they want to meet local requirements. This speeds delivery, which is required in a dynamic market economy. Data lakes offer unparalleled flexibility, since nothing stands between business users and the data.
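As a rough illustration of “schema on read”, the sketch below keeps raw records exactly as they arrived and lets each reader impose its own structure. The sample records, field names, and both reader functions are made up for this example, and a plain Python list stands in for files in the lake.

```python
import json

# Raw records stored as-is: mixed types, inconsistent fields, even junk.
RAW_EVENTS = [
    '{"type": "sale", "product": "Widget", "amount": "19.98"}',
    '{"type": "sensor", "device": "t-17", "temp_c": 21.5}',
    '{"type": "sale", "product": "Manual", "amount": 12.5, "coupon": "FALL"}',
    'not even JSON',
]


def parse_records(raw_lines, wanted_type):
    """Yield decoded records of one type, skipping anything unreadable."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("type") == wanted_type:
            yield record


def read_sales(raw_lines):
    """One program's schema: a sale has a product name and a numeric amount."""
    sales = []
    for record in parse_records(raw_lines, "sale"):
        try:
            sales.append({
                "product": str(record["product"]),
                "amount": float(record["amount"]),  # tolerate "19.98" or 12.5
            })
        except (KeyError, TypeError, ValueError):
            continue  # record does not fit this reader's schema
    return sales


def read_temperatures(raw_lines):
    """A different program's schema over the same raw data."""
    temps = {}
    for record in parse_records(raw_lines, "sensor"):
        try:
            temps[str(record["device"])] = float(record["temp_c"])
        except (KeyError, TypeError, ValueError):
            continue
    return temps


print(read_sales(RAW_EVENTS))
print(read_temperatures(RAW_EVENTS))
```

Nothing was rejected when the data landed, so both readers can work from the same store. The trade-off is that every reader has to carry its own parsing and validation logic.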

Pertinence in the Big Data world:

The traditional approach of manually curated data warehouses provides a limited window onto the data, and such warehouses are designed to answer only the specific questions identified at design time. This may not be adequate for data discovery in today’s big data world. Moreover, traditional data warehouses are largely limited to structured data, whereas a data lake can hold any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text.

The question should not be, “Which one is better?” Rather, you should ask, “Which one is better for my scenario, given my unique needs?” The suitability of one over the other will depend on many factors that are particular to an organization.
