Data Lake

by Supriya Deshmukh

Posted on September 18, 2018 at 04:00 PM

What is a Data Lake?

A data lake can be compared to an actual lake fed by rivers. Just as a lake has multiple tributaries flowing into it, a data lake has structured data, unstructured data, semi-structured data, and logs flowing in, often in real time.

A data lake can store every type of data in its native format, with no fixed limits on account size or file size. It holds large volumes of data, which helps improve analytic performance and native integration.
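To make "storing data in its native format" concrete, here is a minimal sketch in Python. It assumes a plain directory stands in for the lake, and the helper names (`land_raw`, `read_with_schema`) and the sample sources are hypothetical, chosen only for illustration. Files are written byte-for-byte, and a structure is applied only when the data is read.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Write the payload unchanged under <lake_root>/raw/<source>/."""
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)
    return target


def read_with_schema(path: Path) -> list:
    """Interpret a raw file only at read time, based on its extension."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        # One JSON object per line (semi-structured data).
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".csv":
        # Header row gives the column names (structured data).
        return list(csv.DictReader(io.StringIO(text)))
    # Anything else, such as logs, is returned as plain lines.
    return text.splitlines()


def demo() -> dict:
    """Land three differently shaped files, then read each one back."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        paths = [
            land_raw(lake, "crm", "customers.csv", b"id,name\n1,Asha\n2,Ravi\n"),
            land_raw(lake, "web", "clicks.json", b'{"user": 1, "page": "/home"}\n'),
            land_raw(lake, "app", "server.log", b"INFO started\nWARN slow query\n"),
        ]
        return {p.name: read_with_schema(p) for p in paths}


if __name__ == "__main__":
    for name, rows in demo().items():
        print(name, rows)
```

Nothing about the three files has to be agreed on before they are stored; the CSV, JSON, and log interpretations live entirely in the read path.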

Why Data Lake?

  • With the advent of storage engines like Hadoop, storing disparate information has become easy. With a Data Lake, there is no need to model data into an enterprise-wide schema.

  • As data volume, data quality, and metadata increase, the quality of analysis also increases.

  • A Data Lake offers business agility.

  • Machine Learning and Artificial Intelligence can be used to make profitable predictions.

  • There is no data silo structure. A Data Lake gives a 360-degree view of customers and makes analysis more robust.

Data Lake Architecture

  • Ingestion Tier: This tier represents the data sources. Data can be loaded into the data lake in batches or in real time.

  • Insights Tier: This tier represents the research side, where insights from the system are used. SQL queries, NoSQL queries, or even Excel can be used for data analysis.

  • HDFS: The Hadoop Distributed File System (HDFS) is a cost-effective storage solution for both structured and unstructured data.

  • Distillation Tier: Receives data from the storage tier and converts it to structured data to make analysis easier.

  • Processing Tier: Runs analytical algorithms and user queries in varying modes (real-time, interactive, batch) to generate structured data for easier analysis.

  • Unified Operations Tier: Governs system management and monitoring. It includes auditing and proficiency management, data management, and workflow management.
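The tiers above can be sketched as a tiny in-memory pipeline. This is only an illustration under stated assumptions: a Python dictionary stands in for the storage tier (HDFS in the list above), a list stands in for the audit trail of the unified operations tier, and every function name and sample record is hypothetical.

```python
import json
from collections import defaultdict

# Storage tier stand-in: raw payloads kept as-is, keyed by source name.
raw_store = defaultdict(list)

# Unified operations tier stand-in: a simple audit trail of what happened.
audit_log = []


def ingest(source, payloads):
    """Ingestion tier: accept a batch (or a single real-time event) of raw strings."""
    raw_store[source].extend(payloads)
    audit_log.append(("ingest", source, len(payloads)))


def distill(source):
    """Distillation tier: convert raw payloads into structured records."""
    records = []
    for payload in raw_store[source]:
        try:
            records.append(json.loads(payload))
        except json.JSONDecodeError:
            # Keep unparseable input visible instead of dropping it silently.
            records.append({"unparsed": payload})
    audit_log.append(("distill", source, len(records)))
    return records


def process(records):
    """Processing tier: a batch-style aggregation, total amount per customer."""
    totals = defaultdict(float)
    for rec in records:
        if "customer" in rec and "amount" in rec:
            totals[rec["customer"]] += rec["amount"]
    return dict(totals)


def insights(totals, top_n=1):
    """Insights tier: answer a question an analyst might ask (top customers)."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# A first batch, then a later load that includes one malformed payload.
ingest("orders", ['{"customer": "a", "amount": 10.0}', '{"customer": "b", "amount": 4.5}'])
ingest("orders", ['{"customer": "a", "amount": 2.5}', "not json"])

totals = process(distill("orders"))
top = insights(totals)
print(totals, top, audit_log)
```

Raw payloads are never modified after ingestion; structure is imposed only in the distillation step, so the same raw data can later be distilled in a different way.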

Key Data Lake Concepts

  • Data Ingestion: Data ingestion uses connectors to get data from different data sources and load it into the Data Lake.

  • Data Storage: Data storage should be accessible and cost-effective, and should allow data to be retrieved quickly and easily. It should support multiple data formats.

  • Data Governance: Data governance is the process of managing the availability, usability, security, and reliability of data used in an organization.

  • Security: Security needs to be implemented in every layer of the Data Lake. It starts with Storage, Detection, and Intake. The primary need is to block access by unauthorized users. Multiple tools with easy-to-navigate GUIs should be supported.

  • Data Quality: Data quality is an essential component of Data Lake architecture. Data is used to extract business value.
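Governance, security, and data quality can be illustrated with a small sketch. It assumes a hypothetical catalog entry (`DatasetEntry`) that carries an owner, the roles allowed to read the dataset, and the fields every record is expected to have. None of these names come from a real product; they only show the kind of checks the concepts above imply.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog entry: governance metadata for one dataset."""
    name: str
    owner: str
    allowed_roles: set
    required_fields: tuple
    records: list = field(default_factory=list)


class AccessDenied(Exception):
    """Raised when a role is not permitted to read a dataset."""


def read_dataset(entry, role):
    """Security: block any role that is not explicitly allowed."""
    if role not in entry.allowed_roles:
        raise AccessDenied(f"role {role!r} may not read {entry.name!r}")
    return entry.records


def quality_report(entry):
    """Data quality: count records with missing or empty required fields."""
    bad = [
        rec for rec in entry.records
        if any(rec.get(name) in (None, "") for name in entry.required_fields)
    ]
    total = len(entry.records)
    return {
        "total": total,
        "invalid": len(bad),
        "valid_ratio": (total - len(bad)) / total if total else 1.0,
    }


customers = DatasetEntry(
    name="customers",
    owner="sales",
    allowed_roles={"analyst"},
    required_fields=("id", "email"),
    records=[
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": ""},   # empty email
        {"id": 3},                # missing email
        {"id": 4, "email": "d@example.com"},
    ],
)

report = quality_report(customers)
print(report)
print(len(read_dataset(customers, "analyst")))
```

Calling `read_dataset(customers, "guest")` raises `AccessDenied`, which is the behavior the Security point asks for: access is denied unless it has been granted.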

Maturity Stages of Data Lake

  1. Handle and ingest data at scale

    This first stage of data maturity involves improving the ability to transform and analyze data. Here, business owners need to find tools that match their skill set for obtaining more data and building analytical applications.

  2. Building the analytical strength

    In this stage, we start gaining more data and building applications. Here, capabilities of the enterprise data warehouse and data lake are used together.

  3. EDW (Enterprise Data Warehouse) and Data Lake work in unison

    This step involves getting data and analytics into the hands of as many individuals as possible. In this stage, the data lake and the EDW start to work in unison.

  4. Enterprise capability in the lake

    In this maturity stage, enterprise capabilities are added to the Data Lake. These include the adoption of information governance, information lifecycle management capabilities, and metadata management.


Benefits of Data Lake

  • The core benefit of a Data Lake is the centralization of different content sources.

  • Users from various departments, who may be scattered around the globe, can have flexible access to the data.

  • Helps fully with productionizing and advanced analytics.

  • Offers cost-effective scalability and flexibility.


Risks of Data Lake

  • A Data Lake may lose its significance.

  • Designing a Data Lake may involve a large amount of risk.

  • It also increases storage and compute costs.

  • The major risk of data lakes is security and access control. Sometimes data is placed into a lake without any oversight, even though some of it may have privacy and monitoring requirements.