Data Lake

Buzzwords come and go, but if they manage to stick around a while, it means the concept is catching on. A buzzword you’re likely hearing more and more of is “Data Lake,” which evokes a body of water, and in many ways a data lake is like a lake: it’s a repository of data (or water) with multiple feeds in and perhaps a few out. It’s vast, still, and unmoving, unlike a river. Data (or water) accumulates until it is needed.

So the name, while not very high tech, is actually appropriately descriptive.

What is a Data Lake?

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for processing. Unlike the more structured data warehouse, which uses hierarchical data structures such as folders, rows, and columns, a data lake uses a flat file structure that preserves the original form of the data as it was ingested.

Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When someone performs a business query against a particular piece of metadata, all of the data tagged with it is analyzed to answer the query.
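To make the tagging-and-query idea concrete, here is a minimal sketch in Python of a hypothetical lake catalog; the function names, paths, and tags are invented for illustration and are not taken from any particular product.

```python
import uuid

# A minimal, illustrative catalog: each raw object dropped into the lake
# gets a unique identifier plus a dictionary of extended metadata tags.
# (Hypothetical sketch; not the API of any real data lake product.)
catalog = {}

def register(raw_object_path, tags):
    """Assign a unique ID to a stored object and record its metadata tags."""
    object_id = str(uuid.uuid4())
    catalog[object_id] = {"path": raw_object_path, "tags": tags}
    return object_id

def query(**wanted_tags):
    """Return every object whose metadata matches all of the requested tags."""
    return [
        entry for entry in catalog.values()
        if all(entry["tags"].get(key) == value for key, value in wanted_tags.items())
    ]

# Tag two raw files, then pull back everything tagged as IoT data.
register("/lake/raw/2024/05/sensor_dump.json", {"source": "iot", "domain": "manufacturing"})
register("/lake/raw/2024/05/weblogs.gz", {"source": "web", "domain": "marketing"})
print(query(source="iot"))
```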

“The reason data lakes exist is because everyone is collecting huge amounts of information from everywhere, especially from IoT, and they need to store it somewhere. And the historical storage medium was a relational database. But these technologies just don’t work well for all these data fragments we’re collecting from all over the place. They’re too structured, too expensive and they typically require an enormous amount of prior setup,” said Avi Perez, CTO of Pyramid Analytics, developer of BI and analytics software.

“A data lake is a lot more forgiving, cheaper and can accommodate unstructured data. Though the problem is that if you can put something in there, you will just stick everything in there. That is what is happening with data lakes today. And it’s causing the ‘data graveyard effect’ whereby data becomes inaccessible and unusable,” he cautioned.

[Image: A data lake must be scalable to meet the demands of rapidly expanding data storage.]

Benefits of Data Lakes

The data lake is a response to the challenge of massive data inflow. Internet data, sensor data, machine data, IoT data: it comes in many forms and from many sources, and as fast as servers are these days, not everything can be processed in real time. Key benefits include:

  1. Ability to look at original data. The volume, variety, and velocity of data could cause you to miss something at the time it comes in, but by storing it in the data lake you can go back and look at it later.
  2. Flexible analysis. Because the data is stored unstructured, you can apply any analytics or schema at the time you do your analysis (a schema-on-read approach; see the sketch after this list). With a data warehouse, the data is preprocessed, so if you want to run a search or type of query the data wasn’t prepared for, you might have to start the processing all over again, if you can at all.
  3. Availability. Another advantage is that the data is available to anyone in the organization. Something stored in a data warehouse might be accessible only to business analysts, but if the security team wants to check for potential compromises, it can go through the historical data itself to look for signs of a break-in.
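The second benefit, applying a schema only at read time, is easier to see in a small sketch. The Python below (the file and field names are made up for the example) leaves the raw events untouched and lets each consumer project out just the fields it cares about when it reads them.

```python
import json

# Schema-on-read: the raw events stay untouched in the lake; each analysis
# applies its own projection when it reads them. The file and field names
# here are made up for the example.
raw_path = "events.jsonl"
with open(raw_path, "w") as fh:
    fh.write(json.dumps({"user_id": 1, "page": "/home", "ip": "10.0.0.5", "ts": "2024-05-01T09:00:00"}) + "\n")
    fh.write(json.dumps({"user_id": 2, "page": "/login", "ip": "10.0.0.9", "ts": "2024-05-01T09:01:00"}) + "\n")

def read_events(path, fields):
    """Yield only the requested fields from newline-delimited JSON events."""
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            yield {f: event.get(f) for f in fields}

# A marketing analysis and a security review can read the same raw file
# with completely different projections, each defined only at read time.
clicks = list(read_events(raw_path, ["user_id", "page", "ts"]))
logins = list(read_events(raw_path, ["user_id", "ip", "ts"]))
print(clicks)
print(logins)
```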

Data Lake Architecture

The data lake has a deep end and a shallow end, says Mark Beyer, research vice president and distinguished analyst for data and analytics at Gartner. The deep end is for data scientists and engineers who know how to manipulate and massage the data, while the shallow end is for more general users doing less specialized searches.

“Those two groups of users always want to use the lake but the advanced users prove out the lake. They build models, come up with theoreticals, and challenge existing business process models,” he said.

No special hardware is needed to build a data lake, since its storage mechanism is a flat file system; you could use a mainframe if you wanted. The data is moved out to other servers for processing. Most users, though, are likely to go with the Hadoop Distributed File System (HDFS), a distributed, scale-out file system, because it supports faster processing of large data sets.
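As a rough illustration, landing a raw file in HDFS without transforming it can be as simple as calling the standard hdfs command-line client from a script; the /lake/raw directory layout below is an assumption for the example, not a convention HDFS imposes.

```python
import subprocess

# Land a raw file in HDFS as-is. Assumes a working Hadoop client on PATH;
# the /lake/raw layout and file name are examples only.
landing_dir = "/lake/raw/iot/2024-05-01"
local_file = "sensor_dump.json"

# Create the landing directory (if needed) and copy the file in, untransformed.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", landing_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, landing_dir], check=True)
```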

That said, there needs to be some kind of structure or order to make it work. The data needs to be timely, so that when users need immediate access, they can get at it. It must be flexible, so users can process and analyze the data with the tools of their choice, not just the ones IT provides.

There must be some integrity and quality to the data, because the old adage about garbage-in, garbage-out applies here. If the data is missing or inaccurate, then your users might not even use it, so what good is it? Finally, it must be easily searchable.

Pivotal, a cloud development firm, recommends multiple tiers for a data lake, starting with the source tier, i.e. the flat file repository. Above that sit the ingestion tier, where data is brought in based on the query; the unified operations tier, where it is processed; the insights tier, where the answers are found; and the action tier, where decisions and actions are made on the findings.
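A toy sketch of data flowing through those tiers might look like the following; the tier names come from Pivotal’s model as described above, but the record layout, cleaning rules, and alert threshold are invented purely for illustration.

```python
# Toy flow through the tiers Pivotal describes. Tier names follow the article;
# the record layout, cleaning rules, and alert threshold are invented.
def ingest(source_records):
    """Ingestion tier: pull raw records out of the source repository."""
    return [r for r in source_records if r is not None]

def operate(records):
    """Unified operations tier: standardize and clean the ingested records."""
    return [{"device": r["device"], "temp_c": float(r["temp"])} for r in records]

def insights(records):
    """Insights tier: derive an answer (here, which devices run hot)."""
    return [r for r in records if r["temp_c"] > 80.0]

def act(findings):
    """Action tier: turn findings into a decision or alert."""
    for r in findings:
        print(f"ALERT: {r['device']} running hot at {r['temp_c']}C")

raw = [{"device": "pump-7", "temp": "92.5"}, None, {"device": "pump-3", "temp": "61.0"}]
act(insights(operate(ingest(raw))))
```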

Building a Data Lake

Wei Zheng, vice president of products at Trifacta, a data manipulation and visualization developer, said that while data lakes are structurally more open than a data warehouse, one thing users should do is build zones that separate data by how clean it is.

“In a data lake model, if you don’t know how the data is consumed but want to catalog everything in the lake, you have to group and organize it on the cleanliness and how mature the data might be,” she said.

She recommended four zones. The first holds completely raw data: not cleaned, not filtered, not examined at all. Second is the ingestion zone, where you do early standardization around categories: does this data fit into finance, security, customer information, etc.? Third is data ready for exploration, where you might still need to pull from the raw data a few key ingredients you want to focus on.

The consumption zone is fourth. This is the closest match to a data warehouse, where you have a defined schema and clear attributes understood by everyone. Between each of these zones, some kind of ingestion and transformation is applied to the data.
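A minimal sketch of what those zones might look like on the lake’s file system, assuming a simple directory-per-zone convention (the layout and helper names are illustrative, not a prescription from Trifacta):

```python
import shutil
from pathlib import Path

# Zone names follow Zheng's four zones; the directory-per-zone layout and the
# helper functions are just one illustrative convention.
ZONES = ["raw", "ingestion", "exploration", "consumption"]

def init_zones(lake_root="lake"):
    """Create one directory per zone under the lake root."""
    for zone in ZONES:
        Path(lake_root, zone).mkdir(parents=True, exist_ok=True)

def promote(filename, from_zone, to_zone, lake_root="lake"):
    """Copy a dataset to the next zone once it has been cleaned/transformed."""
    src = Path(lake_root, from_zone, filename)
    dst = Path(lake_root, to_zone, filename)
    shutil.copy2(src, dst)  # keep the original; downstream zones hold derived copies
    return dst

init_zones()
# After early standardization, a dataset would move along, e.g.:
# promote("sensor_dump.json", "raw", "ingestion")
```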

While this allows for a more freewheeling method of data processing, it can also get expensive if you have to reprocess the data every time you use it. “Generally you will pay less dollars if you define it up front because a lot has to do with how you organize the info in your data lake. There is a cost with repartitioning the data,” said Zheng.
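As a small illustration of “defining it up front,” partitioning the consumption-zone copy by a query-relevant key means later reads touch only the partitions they need instead of re-scanning everything. The sketch below uses pandas with Parquet (it assumes pyarrow is installed); the paths and columns are hypothetical.

```python
import pandas as pd

# Writing the consumption-zone copy as Parquet partitioned by event date means
# later queries read only the partitions they need instead of re-scanning (or
# repartitioning) the whole dataset. Paths and columns are hypothetical;
# requires pyarrow for the Parquet engine.
events = pd.DataFrame({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "user_id": [1, 2, 1],
    "amount": [9.99, 4.50, 12.00],
})
events.to_parquet("lake/consumption/sales", partition_cols=["event_date"])

# A later analysis touches only one day's partition:
may_first = pd.read_parquet("lake/consumption/sales/event_date=2024-05-01")
print(may_first)
```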

Data Lake Tools

Tools for data lake preparation and processing come in several forms, and many are still immature, as the data lake concept is only about five years old. The old guard of BI and data warehousing vendors has not yet moved into the data lake space, so much of what is out there comes from start-ups and open-source projects. Still, there are notable vendors.

Amazon, Microsoft, Google, and IBM all offer a variety of data lake tools along with their basic cloud services. While most data lakes reside on premises, some are born in the cloud and stored with one of these big four providers, all of which offer tools for data ingestion, transformation, examination, and reporting. Beyond the big four, there are other notable vendors of data lake tools.

HVR: HVR offers software for moving data in and out of the lake in real time from multiple sources; it performs real-time comparisons to ensure data integrity and scales across multiple systems.

Apache NiFi: This is an Apache-licensed open-source tool, also available as a commercially supported product from Hortonworks under the name DataFlow. It is used for data routing and transformation. NiFi processors are file-oriented and schema-agnostic, with each individual processor handling a specific format of data rather than relying on a shared schema.

Podium Data: Podium offers an easy-to-use tool for building an enterprise-class managed data lake without requiring specialized Hadoop skills. The company claims a secure, managed enterprise data lake can be built and deployed in less than a week.

Snowflake Software: Snowflake has a custom SQL database for building repositories to store and process a wide variety of data, including corporate data, weblogs, clickstreams, event data, and email. It can also ingest semi-structured data from a variety of data sources without having to transform it first.

Zaloni: Zaloni offers a complete, enterprise data lake platform called Data Lake 360, which includes a management platform, data catalog and self-service data prep tools that cover end-to-end processing.
