How to Improve Quality, Richness and Analytics of Data Lake

Without an appropriate data strategy or advance planning, data lakes can quickly become unmanageable, particularly as businesses evolve and capture larger volumes of data. This guide covers how to build a higher-quality, well-structured data lake that’s fit for the future.

In this article, we’ll explore how to prevent your data lake from becoming a large collection of unstructured data from a plethora of sources.

  • Why you need to focus on data quality, not capture
  • How to think about your data blueprint before building out your data lake
  • How to future proof your data lake design

Content Summary

Introduction
The do’s and don’ts of data lakes
Focus first on data quality not capture
Have a data blueprint before building your data lake
Data lake design – being future proof
Data lakes are no still pond
Data quality is key
Think data process not data destination

Introduction

Today, companies are being urged to make more use of their data to demonstrate value and ultimately, succeed. At the same time, these organizations have more data coming into their operations from their websites, applications and mobile services, while Internet of Things (IoT) devices can add even more to the mix.

Industries of all kinds are being fundamentally changed by the availability of more data – for example, insurance companies see opportunities to improve the home insurance market through sharing access to data from smart home devices. Established banks face mounting competition from a new generation of fintech competitors – and both are seeking ways to monetize the data they hold. The legacy banks, however, are more likely to struggle.

Similarly, the marketing sector has been transformed through access to data on customer actions across websites and mobile applications, providing greater insight into how consumers make decisions about their purchases. However, the volume of data is growing so rapidly that it becomes hard to handle and use effectively.

For many businesses, the solution to the problem at first seems simple – pour all the data into a data lake and let your team swim in it, picking up fresh and actionable insights to deliver an always-on, personalized customer experience.

The problem with this approach is that it focuses solely on capturing all this data, rather than looking at data quality from the start. As a result, data lakes rapidly become large collections of unstructured data, containing everything from website visits, application metadata and mobile app usage to information from IoT sensors and devices.

Customer insight doesn’t come simply from capturing data in a data lake or data warehouse. Instead, you must think about a consistent data structure before any data hits your data lake.

In this article, we’ll explore the key decisions you should be making around data, how to manage your data lakes effectively, and how Snowplow can support you in achieving your data-related goals.

The do’s and don’ts of data lakes

Deciding on the right data strategy can be tricky, so here’s a quick guide to the top do’s and don’ts to help you as you’re building your data lake.

The Do’s:

  • Outline your data blueprint and your main objectives before deploying your data lake
  • Define which events you want to track, according to your business needs
  • Ensure only high quality data enters your data lake by validating your event data upfront

The Don’ts:

  • Implement your data lake without considering its place in your overall strategy
  • Ignore your data collection process
  • Track everything and then only address data use-cases afterwards

Focus first on data quality not capture

The market for data lakes is expected to grow at a compound annual growth rate of 27.4% over the period 2019-2024 and this trend is understandable. Let’s say you run IT for a rapidly growing company. You have more data coming in than you can currently store, and it’s difficult to get insight into what is taking place over time. Implementing a data lake should help you capture, store and use that data over time, bringing together multiple sets of data to inform decisions and offers to customers.

In theory, this approach should work well for growing companies, but the reality is a lot more complicated because the quantity and variety of data coming in will continue to increase, as will the demand for business value. Making the right choices about data structure and schemas may not seem important at the beginning, but without advance planning, data lakes can easily become data swamps, where organizing and using the stored data becomes difficult over time.

Gartner analysts Mike Rollings and Andrew White describe this as companies being “inhibited [by] their earlier behaviours.”

To avoid this problem, it’s worth spending more time on how you collect data in the first place. For many projects, a data lake or data warehouse will seem like a natural first step – after all, you need somewhere to put all the data that your applications and customers are now creating. However, deciding on a framework for data collection before you choose a destination is what keeps the lake from turning into a “data swamp”.

Looking at data collection first encourages you to think about where this information will come from and how it will be used. Much of it falls under the umbrella term “event data” – that is, data created by an activity such as a click on a website, a mobile app being used or an application event. Collecting this information helps you build a better picture of what customers want. More importantly, it works well when you want to build services that operate in real time.

At Snowplow, we believe that companies have the most success with data when they own their data pipeline end-to-end. Ownership puts you in control of your data, giving you visibility into each stage of the pipeline. Moreover, since Snowplow is built to ensure data quality, the data that reaches your data lake or warehouse arrives in a clean, structured and ready-to-use format.

Have a data blueprint before building your data lake

In the past, data would be funneled into a data warehouse for long-term usage. Data warehouses tended to be highly organized and structured. However, they also tended to be large, inflexible and expensive to implement.

For new and growing companies, data lakes have been the alternative. Data lakes offer a simpler way to capture incoming data and store it for future use, such as more in-depth analytics and reporting. Rather than the complexity and expense of traditional data warehouses, data lakes make it possible to store structured and unstructured data at scale at a much lower cost.

Over the past decade, data lakes based on the open source project Apache Hadoop, alongside Apache ZooKeeper and Apache Hive, have been used to provide this cheaper, more flexible approach to storing and analyzing data. More recently, public cloud services that can be used for data lake deployments have matured. Rather than using Hadoop, services like Amazon Web Services’ S3, Glue and Athena can be combined to provide a data lake that makes analysis easier over time.

Alongside this, data warehouse services have moved into the cloud too. The launch of Google BigQuery and Amazon Redshift made running a data warehouse in the cloud more affordable for many companies, while Snowflake Computing developed its own cloud data warehouse to fill that specific market need. What cloud data warehouses offer is more control and management over data while matching the cost efficiencies that data lakes provide; what they require is data that is more structured than data lakes typically hold.

In essence, the market for data lakes, data warehouses and the cloud has developed so there is a lot more cross-over. In order to meet company needs and IT team requirements, data lakes have started adding more structure and analytics support upfront while cloud data warehouses have added more speed and ease of use. Both data lakes and data warehouses have looked to reduce cost where they can, through the use of cloud.

While data lakes using Hadoop have been a great option for managing data at scale, many teams have found it challenging to get value out of that data once it has been stored. Data scientists have to spend a lot of time and effort on data preparation and cleaning.

Building a data blueprint involves taking the best of the data warehouse – structure, ease of analysis, clean data – and adding these to your data lake approach. By combining an understanding of how you want to use your data as well as how you will organize it over time, you can achieve the best of both approaches.

Data lake design – being future proof

It’s important to always look ahead when you deploy a data lake. Rather than treating a data lake as a data store that can be used for analysis at some point in the future, structure your data schemas first. Doing so ensures that your data collection meets your quality thresholds and makes the data easier to analyze. By taking an end-to-end approach to structure from the start, you can avoid data-quality issues over time.
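As a rough illustration of what structuring a schema first can look like, here is a minimal sketch in Python. The event name, the field names and the `validate_event` helper are all hypothetical, and a production pipeline would more likely rely on a schema standard such as JSON Schema than on a hand-rolled check.

```python
from typing import Any

# Hypothetical schema for a "page view" event: field name -> (expected type, required?)
PAGE_VIEW_SCHEMA: dict[str, tuple[type, bool]] = {
    "event_id": (str, True),
    "user_id": (str, True),
    "page_url": (str, True),
    "timestamp_ms": (int, True),
    "referrer": (str, False),
}


def validate_event(event: dict[str, Any], schema: dict[str, tuple[type, bool]]) -> list[str]:
    """Return a list of problems found; an empty list means the event is valid."""
    errors: list[str] = []
    for name, (expected_type, required) in schema.items():
        if name not in event:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(event[name], expected_type):
            errors.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(event[name]).__name__}"
            )
    # Flag fields the schema does not know about, so the structure cannot drift silently.
    for name in event:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors
```

Because the schema is agreed before any data is collected, every team tracking a page view produces the same shape of record, and anything that deviates is caught before it is stored.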

Planning ahead can make it easier to stop bad data entering your data lake in the first place. When data is incomplete or wrong, it should not be added to the data lake, but it should still be captured for further investigation. More importantly, given the volumes of data that data lakes store and the sheer number of events being processed, this process should be automated so that failed data can be recovered without manual intervention.
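One way to picture this is a pipeline that routes each event either to the lake or to a quarantine store, and can later repair and re-ingest what it quarantined. The sketch below is an assumption-laden toy: the `Pipeline` class, the in-memory lists standing in for storage, and the example validator and repair function are all hypothetical, not a description of any particular product.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Event = dict[str, Any]
Validator = Callable[[Event], list[str]]


@dataclass
class FailedEvent:
    """A rejected event kept with the reasons it failed, so it can be inspected or repaired."""
    event: Event
    errors: list[str]


@dataclass
class Pipeline:
    validate: Validator
    lake: list[Event] = field(default_factory=list)  # stands in for the data lake
    quarantine: list[FailedEvent] = field(default_factory=list)  # stands in for a bad-event store

    def ingest(self, event: Event) -> None:
        """Store valid events in the lake; keep invalid ones, with reasons, in quarantine."""
        errors = self.validate(event)
        if errors:
            self.quarantine.append(FailedEvent(event, errors))
        else:
            self.lake.append(event)

    def recover(self, repair: Callable[[Event], Event]) -> int:
        """Repair quarantined events and re-ingest them.

        Returns how many events reached the lake after repair.
        Events that still fail validation go back into quarantine.
        """
        pending, self.quarantine = self.quarantine, []
        before = len(self.lake)
        for failed in pending:
            self.ingest(repair(failed.event))
        return len(self.lake) - before


# Hypothetical validator and repair step, for illustration only.
def require_user_id(event: Event) -> list[str]:
    return [] if isinstance(event.get("user_id"), str) else ["missing or invalid user_id"]


def rename_legacy_user_id(event: Event) -> Event:
    fixed = dict(event)
    if "userId" in fixed:
        fixed["user_id"] = fixed.pop("userId")
    return fixed
```

With `Pipeline(validate=require_user_id)`, an event carrying a legacy `userId` key is quarantined on ingest and reaches the lake once `recover(rename_legacy_user_id)` runs, while an event with no user identifier at all stays in quarantine for investigation.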

Data lakes are no still pond

Data lake approaches have continued to evolve. Newer projects aim to improve the performance and data quality of data lake designs while maintaining the cost and efficiency gains that data lakes have provided in the past.

This year, Databricks launched a new open source project, Delta Lake. It adds more structure to data lake deployments by refining and ingesting data within the lake itself, alongside adding ACID transactions and data versioning. Delta Lake stores data in the Apache Parquet format and adds a transaction log on top, bringing more functionality to data lakes while remaining compatible with existing data lake implementations.
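To make the idea of transactions and versioning over a data lake more concrete, here is a toy model of the general technique: immutable data files plus an append-only log of commits. It is not Delta Lake’s API or on-disk format; the `VersionedTable` class and the file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(frozen=True)
class Commit:
    """One atomic change to the table: files added and files removed together."""
    added: frozenset
    removed: frozenset


@dataclass
class VersionedTable:
    log: list = field(default_factory=list)  # append-only list of Commit entries

    def commit(self, added: Iterable[str] = (), removed: Iterable[str] = ()) -> int:
        """Append one commit and return the new version number (0-based)."""
        added_set, removed_set = frozenset(added), frozenset(removed)
        missing = removed_set - self.files()
        if missing:
            # Reject the whole commit, so no partial change is ever recorded.
            raise ValueError(f"cannot remove files not in the table: {sorted(missing)}")
        self.log.append(Commit(added_set, removed_set))
        return len(self.log) - 1

    def files(self, version: Optional[int] = None) -> frozenset:
        """Replay the log to find the data files that make up a version (latest by default)."""
        upto = len(self.log) if version is None else version + 1
        current: set = set()
        for entry in self.log[:upto]:
            current -= entry.removed
            current |= entry.added
        return frozenset(current)
```

Because data files are never modified in place, reading an older version is just a matter of replaying fewer log entries, which is what makes auditing and rollback practical.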

Similarly, the Apache Iceberg incubator project is designed to improve on the standard table layout that is built into tools like Apache Hive, Presto and Apache Spark. Iceberg adds data version control alongside schema evolution, making it possible to manage data versions.

Snowplow can work with a number of data lakes and cloud data warehouses. For companies using AWS, Snowplow works with Redshift and AWS S3 as data repositories, and can integrate with the likes of AWS Glue and Athena for data lake deployments. Similarly, Snowplow integrates directly with Google Cloud Platform’s Cloud Storage and BigQuery services for companies using GCP. For those companies that prefer alternatives to the public cloud providers’ own tools, Snowplow also integrates with Snowflake and Databricks on AWS.

Snowplow has developed its own approach to ensuring data quality with Snowplow Insights. Before incoming data reaches a data lake or warehouse, it first gets validated against associated schemas. Unlike potential data-swamp scenarios, where the data lake has been used as a dumping ground for unstructured data, Snowplow schemas ensure that the data lake receives only validated data – a built-in way to identify quality issues before data ever enters the lake.

Event-validation failures are also captured in the Snowplow pipeline rather than simply being dropped. This ability to audit event-level data makes it easier to identify potential issues proactively, while also ensuring no data is lost or incorrectly transformed. Lastly, it is possible to recover and reprocess any bad data so that gaps don’t appear in the data lake. These examples show how data collection technology and data lake designs continue to evolve as developers and data teams deal with the huge volumes of data that their companies produce over time.
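As a small sketch of what auditing captured failures might involve, the function below counts failed events by source and reason so that the most common problems surface first. The record shape, with `source` and `reason` keys, is an assumption made for illustration and is not Snowplow’s failed-event format.

```python
from collections import Counter
from typing import Iterable


def summarize_failures(failures: Iterable[dict]) -> list:
    """Count failed events per source and reason, most frequent first."""
    counts = Counter(f"{f['source']}: {f['reason']}" for f in failures)
    return counts.most_common()
```

A sudden spike for one source and reason, for example a mobile release that renamed a field, is the kind of issue this view helps you catch before it leaves a gap in reporting.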

Previously we had a traditional business intelligence pipeline with staging and all our information was in a data warehouse. However, there was a very long discovery phase in the warehouse – any time the business teams needed data, they had to ask for what they needed and then we would have to prepare and provide the data back to them. It was very siloed and the parameters are very closed. When we started with data lakes, we looked at all the data you can leverage, the analytics there and what the output from this data could be. This gave us a huge amount more freedom. Snowplow helps us get all those sources of data and put them in the data lakes – which is raw, messy data – and then move the right information into data shores, where we gather relevant data and filter it through different use cases. Each analyst can then leverage what is in the data and build business models when we need to do reporting. – Romain Thomas, Director of Data Engineering, La Presse

Data quality is key

Data lakes can effectively store large volumes of data, support more analytics use cases and be more flexible than traditional data warehouses. However, to get more value from their implementation, you need to look at how you can bring users closer to your data in the first place.

Rather than being limited to the standard approaches built into common website analytics tools, you can look at what the business really requires and build to fit that need. This helps you prepare for your future data needs and demands.

At Snowplow, well-defined schemas validate incoming event data, reflecting the importance placed on upfront data quality. By making sure all of the raw data that reaches the data lake is high quality from the beginning, users surface valuable insights faster. More importantly, it becomes easier to expose that data to meet different use cases, speed up the data discovery phase and provide actionable, real-time data across an organization.

Snowplow provides all of our event data in a data model which we own and can shape to our organisational needs. Snowplow has really helped accelerate our analytics; we can quickly answer questions which would have required a tremendous amount of engineering effort with our previous solution. – Darren Haken, Head of Data Engineering, Auto Trader

Think data process not data destination

There is no hiding from it – for every business, data is only going to become more important over time. The complexity of setting up data pipelines, of integrating different tools and open source projects, and of meeting business goals around data will only increase.

In the future, businesses will need to think about event data from the beginning of their data projects. By looking at the process first – rather than solely at the data lake or data warehouse as a destination – you can put more structure in place to support those longer-term goals.

More importantly, this can help you organize your business around how data is actually used. Whether the goal is getting data out of the hands of individuals and into everyday use by business teams, or winning support for more operational use of data within departments, getting your data pipelines right makes it easier to operate with data at scale. It can also ease transition and collaboration problems between teams, as everyone is working from the same approach to data in the first place.

Source: Snowplow