Recently, I came across a new term that has grown out of the evolving field of data science: the "Data Lake". With data multiplying on a daily basis, this term adds a different dimension to data science and has already created a lot of excitement. The data lake concept has quickly gained traction in the world of big data.
So, in this blog I wish to demystify the concept.
The term “data lake” uses a metaphor to simplify an approach to storing data for analysis that is abstract and cumbersome to describe in plain, factual terms. James Dixon, “Chief Geek” at Pentaho, is credited with coining the phrase. Dixon posted that each specialized data mart in a data warehouse could be likened to a bottle of water. The data was ready for use in a small, identifiable container. In contrast, a data “lake” is a massive, intermingled repository of all data in its raw form.
Need for a Data Lake:
The data lake concept came into the picture to adapt to the freewheeling, deep-analysis, ask-anything ethos of big data. It is harder to retool a data warehouse to do the kind of wide-ranging data correlation that big data solutions make relatively easy.
Moreover, it is estimated that a staggering 70% of the time spent on analytics projects goes into identifying, cleansing, and integrating data. Data is often difficult to locate because it is scattered across many business applications and systems. It frequently needs re-engineering and reformatting before it is easy to analyze.
The data must also be refreshed regularly to keep it up to date. Acquiring data for analytics in an ad hoc manner creates a huge burden on the teams that own the systems supplying the data. Often the same type of data is requested repeatedly, and the original information owner finds it hard to keep track of who has copies of which data.
Implementing a data lake can be a solution to these problems. A data lake is a set of one or more data repositories that have been created to support data discovery, analytics, ad hoc investigations, and reporting. The data lake contains data from many different sources. People in the organization are free to add data to the data lake and access any updates as necessary.
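To make the idea a little more concrete, here is a minimal sketch in Python of what "a repository of raw data from many sources" could look like. It is only a toy under stated assumptions: the class name `TinyDataLake`, its methods `add`, `discover`, and `read`, and the `raw/` folder plus `catalog.jsonl` layout are all hypothetical names made up for this illustration, not part of any real data lake product. Real data lakes usually sit on distributed storage, but the two ideas are the same: keep the data in its raw form, and record where each piece came from so it can be found later.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class TinyDataLake:
    """Toy data lake (hypothetical): raw files on disk plus a JSON-lines catalog."""

    def __init__(self, root):
        self.root = Path(root)
        self.raw_dir = self.root / "raw"
        self.catalog_path = self.root / "catalog.jsonl"
        self.raw_dir.mkdir(parents=True, exist_ok=True)

    def add(self, source, name, payload):
        """Store the bytes unchanged and record which system they came from."""
        target = self.raw_dir / source / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        entry = {
            "source": source,
            "name": name,
            "path": str(target.relative_to(self.root)),
            "bytes": len(payload),
            "added_at": datetime.now(timezone.utc).isoformat(),
        }
        # Append one catalog line per data set so others can discover it.
        with self.catalog_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def discover(self, source=None):
        """List catalog entries, optionally filtered by source system."""
        if not self.catalog_path.exists():
            return []
        with self.catalog_path.open(encoding="utf-8") as f:
            entries = [json.loads(line) for line in f if line.strip()]
        return [e for e in entries if source is None or e["source"] == source]

    def read(self, entry):
        """Return the raw bytes; interpreting them is left to the reader."""
        return (self.root / entry["path"]).read_bytes()


# Example: two teams drop differently shaped data into the same lake.
lake = TinyDataLake(tempfile.mkdtemp())
lake.add("crm", "customers.csv", b"id,name\n1,Ada\n")
lake.add("web", "clicks.json", b'[{"user": 1, "page": "/home"}]')
crm_entries = lake.discover("crm")
```

Note that nothing is cleansed or reshaped on the way in. The CSV and the JSON sit side by side in their original form, and the catalog is what lets an analyst find them without going back to the team that owns the source system.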
In my next blog I will talk about Data Lake vs Data Warehouse.