Saturday, 14 November 2015

Data Warehouse vs Data Lake



In my last post we talked about the data lake, why it is needed and what advantages it offers. In this post we will compare the data lake with the data warehouse.

The emergence of the data lake has periodically led to a somewhat regrettable dialogue that compares data warehouses unfavorably with data lakes. These kinds of discussions are inevitable as a new, hot concept elbows the previous decade’s paradigm out of the way. But it’s a misleading thought process. While data lakes can accomplish a number of things that are very difficult for a data warehouse to do, the opposite is also true. The two technologies are inherently different and serve different enterprise needs. One is not necessarily better than the other. In some use cases, the data lake can be quite deficient when contrasted with a data warehouse.

A data warehouse is a carefully designed data store that contains data from pre-selected sources, such as enterprise resource planning (ERP) and logistics systems. Data warehouses tend to emphasize transactional, structured data over semi-structured and unstructured data. The data in a warehouse is usually organized using multi-dimensional, star or snowflake schemas in order to streamline the execution of queries, reports, dashboards or advanced analytical models. In a data warehouse (DW), the schema is defined before the data is stored. This is called “schema on write”: the required data is identified and modeled in advance. Data in a traditional data warehouse has been cleansed, whereas data in a data lake is typically raw.
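
To make “schema on write” concrete, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for a warehouse. The orders table, its columns and the load_orders helper are hypothetical names invented for this illustration. The point is only that the table definition exists before any row arrives, and rows that violate it are rejected at load time.

```python
import sqlite3


def load_orders(rows):
    """Load rows into a table whose schema is fixed before any data arrives."""
    conn = sqlite3.connect(":memory:")
    # Schema on write: the structure and constraints are declared up front.
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer TEXT NOT NULL,"
        " amount REAL NOT NULL CHECK (amount >= 0))"
    )
    accepted, rejected = 0, []
    for row in rows:
        try:
            conn.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            # The row breaks a declared constraint, so it never enters the store.
            rejected.append(row)
    return conn, accepted, rejected


# Hypothetical sample data: one clean row, one missing customer, one negative amount.
conn, accepted, rejected = load_orders(
    [(1, "acme", 10.0), (2, None, 5.0), (3, "beta", -1.0)]
)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Only the first row is stored here. The other two are turned away during loading, which is how a warehouse keeps its data clean at the cost of deciding the model in advance.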

A data lake (typically implemented using Hadoop) holds a mix of structured, semi-structured and unstructured data. For example, transactions, spreadsheets, documents, sensor data, images, social media, etc. may all be stored in the data lake. The data lake may be fed using traditional-style batch jobs or by connecting it to real-time data feeds. In a data lake the schema is defined after the data is stored. This is called “schema on read”. As a consequence, the structure of the data must be captured in code by each program that accesses it.

A data lake provides cheaper storage for large volumes of data and has the potential to reduce processing cost by bringing analytics close to the data. The data lake gives business users immediate access to all data. They don’t have to wait for the data warehousing (DW) team to model the data or give them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery, which is required in a dynamic market economy. It also offers unparalleled flexibility, since nothing and nobody stands between business users and the data.
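
As a rough illustration of “schema on read”, the Python sketch below keeps raw records exactly as they arrived and applies a structure only when a program reads them. The RAW_EVENTS sample, the field names and the read_orders helper are all hypothetical; a real lake would hold files in a distributed store rather than a list of strings.

```python
import json

# Raw records land in the lake untouched, whatever their shape.
RAW_EVENTS = [
    '{"type": "order", "customer": "acme", "amount": "10.5"}',
    '{"type": "click", "page": "/home"}',
    "not json at all",
    '{"type": "order", "customer": "beta"}',
]


def read_orders(raw_lines):
    """Apply an 'order' schema at read time; this reader owns the interpretation."""
    orders = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable data stays in the lake, this reader skips it
        if not isinstance(record, dict) or record.get("type") != "order":
            continue  # other record types are someone else's concern
        try:
            amount = float(record.get("amount", 0.0))
        except (TypeError, ValueError):
            continue  # amount present but not numeric
        orders.append(
            {"customer": str(record.get("customer", "unknown")), "amount": amount}
        )
    return orders


orders = read_orders(RAW_EVENTS)
```

Nothing was rejected when the data was stored. Every decision about types, defaults and bad records lives in the reader, so a second program with different needs would have to encode its own version of the schema.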

Pertinence in the Big Data world:

The traditional approach of manually curated data warehouses provides only a limited window onto the data, and such warehouses are designed to answer only the specific questions identified at design time. This may not be adequate for data discovery in today’s big data world. Moreover, a data lake can contain any type of data: clickstream, machine-generated, social media and external data, and even audio, video and text. Traditional data warehouses, by contrast, are largely limited to structured data.

The question should not be, “Which one is better?” Rather, you should ask, “Which one is better for my scenario, given my unique needs?” The suitability of one over the other will depend on many factors that are particular to an organization.

A Data Lake


Recently, I came across a new term that is a product of evolving data science: the “data lake”. With data multiplying on a daily basis, this term adds a different dimension to data science and has already created a lot of excitement. The data lake concept has quickly gained traction in the world of big data.

So, in this blog I wish to demystify the concept.

The term “data lake” uses a metaphor to simplify an approach to storing data for analysis that is abstract and cumbersome to describe in plain, factual terms. James Dixon, “Chief Geek” at Pentaho, is credited with coining the phrase. Dixon posted that each specialized data mart in a data warehouse could be likened to a bottle of water. The data was ready for use in a small, identifiable container. In contrast, a data “lake” is a massive, intermingled repository of all data in its raw form.






Need for a Data Lake:

The data lake concept came into the picture as a way to adapt to the freewheeling, deep-analysis, ask-anything ethos of big data. It’s harder to retool a data warehouse to do the kind of wide-ranging data correlation that big data solutions make relatively easy.

Moreover, it is estimated that a staggering 70% of the time spent on analytics projects goes into identifying, cleansing and integrating data. Data is often difficult to locate because it is scattered among many business applications and business systems. It frequently needs re-engineering and reformatting to make it easier to analyze.

The data must also be refreshed regularly to keep it up to date. Acquiring data for analytics in an ad hoc manner creates a huge burden on the teams that own the systems supplying the data. Often the same type of data is requested repeatedly, and the original information owner finds it hard to keep track of who has copies of which data.

As a result, implementing a data lake can be a solution. A data lake is a set of one or more data repositories that have been created to support data discovery, analytics, ad hoc investigations, and reporting. The data lake contains data from many different sources. People in the organization are free to add data to the data lake and access any updates as necessary.

In my next blog I will talk about Data Lake vs Data Warehouse.