Is a data lake the right storage option for you?

We take a look at the pros and cons to help you decide.

There was a time when enterprises built data warehouses to store processed data that resided within their own data centres, and their analytical insights were restricted to the data they gathered through traditional means. In recent times, as social media and other digital infomediaries have unleashed a tsunami of customer information, enterprises have had to find, store, and make sense of unstructured data. This has led to new approaches to acquiring, storing, and analysing data. The data lake is one such approach, and many technology providers have promoted it as the ideal storage technology for the digital era.

The data lake advantage
Unlike a data warehouse, where data is mostly processed before being stored in files and folders, a data lake stores data in all forms, whether structured, semi-structured or unstructured, in a flat architecture. Data lakes are expected to help data scientists view and analyse data in its native format and, ideally, to replace data warehouses. One of the key advantages of building a data lake is the convergence of all data sources, which could include logs, XML, sensor data, multimedia, and social data. With a combination of Hadoop and a variety of tools, it’s also possible to accommodate high-speed data and integrate it with historical data to get the best insights. A data lake can process large quantities of coherent data with deep learning algorithms to enable real-time decision-making. In an ideal situation, a data lake should provide democratised access to data in a single, unified view across the organisation.
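To make the idea of a flat store holding data in its native format more concrete, here is a minimal sketch in Python. It is not how any particular data lake product works: the `lake` dictionary, the object IDs, the payloads, and the `read_native` helper are all invented for illustration, and the data is only parsed at the moment it is read.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# A toy flat "lake": every object sits at the same level, keyed by an ID,
# and is kept in its native format. IDs and payloads are made up.
lake = {
    "obj-001": ("json", '{"customer": "c42", "event": "login"}'),
    "obj-002": ("xml", '<reading sensor="s7"><temp>21.5</temp></reading>'),
    "obj-003": ("csv", "customer,amount\nc42,19.99\n"),
}


def read_native(object_id):
    """Parse an object only when it is read, based on its recorded format."""
    fmt, payload = lake[object_id]
    if fmt == "json":
        return json.loads(payload)
    if fmt == "xml":
        root = ET.fromstring(payload)
        return {"sensor": root.get("sensor"), "temp": float(root.findtext("temp"))}
    if fmt == "csv":
        # DictReader yields one mapping per row; values stay as strings.
        return list(csv.DictReader(io.StringIO(payload)))
    raise ValueError(f"unsupported format: {fmt}")
```

Calling `read_native("obj-002")` parses the XML sensor reading on demand, while the JSON and CSV objects stay untouched in the store until someone asks for them.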

The pitfalls
On paper, data lakes sound like the best alternative to data warehouses, as big data analysis is an eventuality that every enterprise has to cope with in an increasingly hyper-connected and highly data-driven digital universe. But data lake projects are not easy to implement, as they are a radical departure from proven approaches to storage. Because a data lake is a single container to which one can keep adding all types of raw data, the sheer volume of data can become intimidating after a while. Going through the data with a fine-tooth comb, separating the relevant from the irrelevant, and preparing the data for analysis can take up much of your data scientists' valuable time, adding to your costs. Another issue is storing the data in such a way that you can extract any piece of information you need with a query. That can be achieved by attaching metadata tags to the data as it is stored. Without them, you run the risk of drowning in a data cesspool. A further argument against data lakes is that they tend to add latency, as data often has to travel from the source to a different location where it is analysed. Then there are the costs of the storage and processing power that a data lake requires.
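The role of metadata tags can be sketched in a few lines of Python. This is a toy in-memory catalogue under stated assumptions, not a real catalogue service: the `CatalogEntry` and `MetadataCatalog` classes, the paths, and the tag names are all hypothetical. Each object is registered with its source, format, and tags when it lands in the lake, so that it can later be found with a query instead of a manual trawl.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One raw object in the lake plus the metadata that describes it."""
    path: str    # hypothetical location of the raw file
    source: str  # e.g. "web-logs", "sensors", "social"
    fmt: str     # e.g. "json", "xml", "csv"
    tags: set[str] = field(default_factory=set)


class MetadataCatalog:
    """Toy in-memory catalogue; illustrative only."""

    def __init__(self) -> None:
        self._entries: list[CatalogEntry] = []

    def register(self, path, source, fmt, tags=()):
        """Record an object's metadata at the moment it is added to the lake."""
        entry = CatalogEntry(path, source, fmt, set(tags))
        self._entries.append(entry)
        return entry

    def find(self, *, source=None, fmt=None, tags=()):
        """Return entries matching every given filter; all tags must be present."""
        wanted = set(tags)
        return [
            e for e in self._entries
            if (source is None or e.source == source)
            and (fmt is None or e.fmt == fmt)
            and wanted <= e.tags
        ]


catalog = MetadataCatalog()
catalog.register("raw/logs/2017/app.log", "web-logs", "text", ["errors", "checkout"])
catalog.register("raw/social/2017/posts.json", "social", "json", ["customer", "sentiment"])
catalog.register("raw/sensors/plant-a.xml", "sensors", "xml", ["temperature"])

# Query by tag instead of combing through every file.
hits = catalog.find(fmt="json", tags=["customer"])
```

Here `hits` contains only the social media file, because it is the only entry that is both JSON and tagged `customer`. An object registered without tags would still sit in the lake but would never surface in a tag query, which is the "cesspool" risk described above.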

Points to consider before implementing a data lake
If your data is extremely complex and voluminous, and your traditional data warehouse and storage management solutions are unable to meet your needs, it might be worth your while to explore data lakes. However, to have a well-managed data lake, it’s important that you first have a plan for extracting quality data from it in a cost-effective manner. You should have a data governance framework that delivers your objectives in a non-hierarchical, flat approach. You also need to address information security concerns, as you are combining data from multiple external sources. Of course, there are storage engines and tools that facilitate data management in data lakes. But these come with their own limitations in ease of use, latency, security, and processing speed. You’ll need to work around these.
