How To Avoid Data Lake Crocodiles

Posted by Adrian Bridgwater, Senior Contributor


Data lakes are massive, by definition. They exist to house the morass of unstructured and semi-structured data that is generally unfiltered, often duplicated and typically unparsed and low-level (e.g. log files, system status readings, website clickstream data), increasingly machine-generated by sensors in the Internet of Things or by the AI agents that are now starting to pour their output into the data lake as well.

On balance, data lakes are often regarded as a good thing. They allow organizations to make sure they are capturing all the data they might channel through every operational pipe of their IT stack. Having access to as-yet-untapped data stores when needed is a comfortable position for the chief data scientist in any business. Viewed as a key move for firms to future-proof their data strategy (who knows how the company might use sensor data x, y and z tomorrow or next year?), a data lake also represents a democratization of data, i.e. it’s a really deep pool and, as long as you wear a life jacket (adhere to security and compliance guidelines), anyone including business users can potentially take a dip at any time.

Data lakes also store structured data such as information streams from customer relationship management systems or enterprise resource planning systems, but they are less frequently discussed in that role.

In our current climate of AI-everything, organizations are demanding end-to-end visibility of their businesses and the activities carried out by their customers. Data lakes help make that possible and they also ensure a business can centralize around one repository so that data silos don’t start to grow… and that’s a good thing too.

Danger: Deep Water

As in practically all aspects of technology, there’s a yin and yang factor to consider. If we think back to pre-millennial (or at least pre-cloud) times, when an organization had 42 databases (and many ran more), users needed to know 42 sets of access details and a corresponding number of security measures and procedures to get at the data. In a single data lake, however, it is theoretically possible for a person with the right credentials to access everything via one entry point. The fabled “single pane of glass” strategy that so many companies are chasing when it comes to data, apps and business actions becomes the same single pane an intruder needs to break to enter.

This reality has been highlighted by Steve Karam, head of product for AI and SaaS at Perforce, the DevOps platform company also known for its heritage in enterprise version control, application testing and lifecycle management. Speaking at a data analytics roundtable this week, Karam pointed to more danger in the water.

“It’s always important to remember that there’s Sam – and most organizations have a Sam. They’ve been with the company for decades and, during their tenure, they built a database into which no one else has insight. Maybe Sam has now left the organization, so Sam’s database is effectively a black box. Now put Sam’s database in the single data lake and the implications could be huge,” suggested Karam. “But what if Sam’s data store includes duplicated personally identifiable information and the columns with that PII are no longer tracked? This would be an ideal feeding ground for the crocodiles dwelling beneath the lake’s surface. An already broken process just expanded.”

Karam invites us to add AI into the mix. Compared to analysts who are expert data wranglers and write targeted queries to get exactly what they need, he says that AI has an “omnivorous, insatiable appetite” these days (he actually used the term “datavore”; well, someone had to coin it sometime) and that means it wants to eat all the data. He views it as something of a “blabbermouth” that spills more secrets than a chatty relative at a holiday dinner after too much wine. The risk landscape subsequently explodes.

Dipping Our Toes Back In

“So we have a quandary: teams across enterprises depend on fast access to data to build and test software, get to market faster and optimize strategy… yet data lakes are essentially useful things,” said Karam. “For an illustrative example, consider the fact that detailed data is increasingly essential to meet demand for customer experience customization. Yet the risks are very real: our own market study suggests that around half of organizations report that they have already experienced a data breach or theft involving sensitive data in non-production environments.”

So what’s the answer? Cataloguing and dividing data into different categories is a good starting point; Karam points to Microsoft’s Medallion architecture as a good example.

Microsoft actually talks about this technology as the Medallion data lakehouse architecture (an amalgam of data lakes and structured data warehouses, with the expansiveness of the lake but the data management and transactional capabilities of the warehouse) and it is essentially a data design pattern used to organize data logically.

“The medallion architecture describes a series of data layers that denote the quality of data stored in the lakehouse. Azure Databricks recommends taking a multi-layered approach to building a single source of truth for enterprise data products. This architecture guarantees atomicity, consistency, isolation and durability as data passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics,” details Microsoft on its Microsoft Learn web portal.
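To make those layers concrete, the sketch below shows a minimal bronze/silver/gold flow in PySpark, the environment Azure Databricks users would typically work in. The paths, table names and column names (clickstream events keyed on event_id) are illustrative assumptions, not taken from Microsoft’s documentation, and the code presumes a Spark session with Delta Lake support.

```python
# A minimal medallion-style sketch (paths, schemas and column names are assumed
# for illustration). Requires a Spark session with Delta Lake support.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land raw clickstream events exactly as they arrive, no filtering.
bronze = spark.read.json("/lake/raw/clickstream/")          # hypothetical landing path
bronze.write.format("delta").mode("append").save("/lake/bronze/clickstream")

# Silver: validate, deduplicate and properly type the raw events.
silver = (
    spark.read.format("delta").load("/lake/bronze/clickstream")
    .dropDuplicates(["event_id"])                           # assumed unique key
    .filter(F.col("user_id").isNotNull())
    .withColumn("event_ts", F.to_timestamp("event_ts"))
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/clickstream")

# Gold: business-level aggregates laid out for efficient analytics.
gold = (
    silver.groupBy("page")
    .agg(F.count("*").alias("views"),
         F.countDistinct("user_id").alias("visitors"))
)
gold.write.format("delta").mode("overwrite").save("/lake/gold/page_traffic")
```

Each hop narrows and cleans the data, which is the point of the pattern: quality and trust increase as data moves from bronze to gold.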

What happens next is synthetic, but at the same time, it is very tangible and real.

Data Masking & Synthetic Data

“The next step is to find ways in which to give non-production teams (by which I am talking about our friends in software application development) realistic data without risk; so this means stepping into techniques including data masking and the use of synthetic data. Synthetic data is particularly beneficial when there is a lack of real data that matches a new business case, or when compliance demands that access to production data in any form is forbidden. It’s also fast to create and useful for large-volume requirements like unit testing,” explained Perforce’s Karam.
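Synthetic data does not have to be elaborate to be useful for the large-volume testing Karam mentions. The following stand-alone Python sketch generates wholly artificial customer records; the field names, value ranges and output file are invented for illustration and nothing in it derives from real production data.

```python
# Minimal synthetic-data sketch: every value is generated, none is copied from
# production. Field names and ranges are assumptions for illustration only.
import csv
import random
import uuid
from datetime import date, timedelta

random.seed(42)  # reproducible test fixtures

def synthetic_customer() -> dict:
    """Build one wholly artificial customer record."""
    birth = date(1950, 1, 1) + timedelta(days=random.randint(0, 20000))
    return {
        "customer_id": str(uuid.uuid4()),
        "date_of_birth": birth.isoformat(),
        "account_number": str(random.randint(10**9, 10**10 - 1)),  # fake 10-digit account
        "balance": round(random.uniform(-5_000, 250_000), 2),
    }

# Large-volume generation is cheap, which suits unit and load testing.
with open("synthetic_customers.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=synthetic_customer().keys())
    writer.writeheader()
    writer.writerows(synthetic_customer() for _ in range(10_000))
```

Because the records are fabricated from scratch, this approach also covers the case Karam raises where compliance forbids access to production data in any form.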

Static data masking replaces sensitive data like personally identifiable information (remember Sam and the PII worries?) with synthetic but realistic values, which are deterministic and persistent, so that the referential integrity and demographics are maintained. This means (in theory and indeed in practice) that software developers have genuinely useful data without the risk of accidentally exposing sensitive customer data.

As a working example, development teams at a bank could see a customer’s balance to look for anomalies, spikes or other outliers, but they would have no idea which customer it belongs to. Date of birth, social security number, bank account number and other personal identifiers would all be masked. Many organizations are likely to have a place for both techniques, which are supported by highly automated tools to mitigate any additional workload on developers.
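A deterministic masking function is, at heart, a keyed and repeatable substitution: the same input always produces the same masked output, so joins between tables still line up, while the real identifier never leaves production. The sketch below illustrates that principle with HMAC-SHA-256 in plain Python; the field names and the (deliberately naive) key handling are assumptions for illustration, not Perforce’s implementation.

```python
# Deterministic masking sketch: same input + same key -> same masked value,
# which preserves referential integrity across tables. Key management is
# deliberately naive here; in practice the key would come from a secrets vault.
import hashlib
import hmac

MASKING_KEY = b"rotate-me-and-keep-me-out-of-source-control"  # illustrative only

def mask_digits(value: str, length: int, key: bytes = MASKING_KEY) -> str:
    """Replace a numeric identifier with a repeatable, realistic-looking fake."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return str(int(digest, 16))[:length].rjust(length, "0")

# The same account number always masks to the same value...
assert mask_digits("4929871234", 10) == mask_digits("4929871234", 10)

# ...but the masked row reveals nothing about the customer behind it.
masked_row = {
    "account_number": mask_digits("4929871234", 10),
    "ssn": mask_digits("123-45-6789", 9),
    "balance": 18432.77,  # non-identifying values such as balances stay usable
}
print(masked_row)
```

The determinism is what keeps demographics and relationships intact for developers while the underlying personally identifiable information stays in production.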

Risk-Averse, Clean & Compliant

“New use cases in AI can also help. Beyond synthetic data, AI is being used for automated testing with natural language processing, relieving testing teams from the burden of writing test scripts and maintaining data relationships with production,” said Karam. “Even if an organization is already ‘all in’ on data lakes, it should continue to treat software development and quality assurance data as separate data environments that are risk-averse, solid, clean, compliant and delivered fast so that teams can build without concern. The data lake should also have separate workspaces for non-production teams with guaranteed compliant data so they can jump right in safely. It’s like having a roped-off children’s pool in the shallow end of the lake for non-production, but the production part in the deep end is off-limits.”
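One way to read the “roped-off children’s pool” idea is as a hard routing rule in the data-access layer: non-production roles can only ever resolve to masked or synthetic datasets. This is a hypothetical sketch of such a rule in plain Python; the dataset names and roles are invented and it is not a feature of any particular lake platform.

```python
# Hypothetical workspace guard: non-production roles are only ever routed to
# masked/synthetic datasets; attempts to reach production raw data fail fast.
# Dataset names and roles are invented for illustration.
WORKSPACES = {
    "production": {"customers_raw", "transactions_raw"},
    "non_production": {"customers_masked", "transactions_synthetic"},
}

def resolve_dataset(role: str, dataset: str) -> str:
    """Return the dataset a role may read, or raise if it is off-limits."""
    if role == "non_production" and dataset in WORKSPACES["production"]:
        raise PermissionError(f"{dataset} is deep-end only; use the masked copy")
    if role == "production":
        allowed = WORKSPACES["production"] | WORKSPACES["non_production"]
    else:
        allowed = WORKSPACES["non_production"]
    if dataset not in allowed:
        raise LookupError(f"unknown or disallowed dataset: {dataset}")
    return dataset

print(resolve_dataset("non_production", "customers_masked"))  # allowed
# resolve_dataset("non_production", "customers_raw")          # raises PermissionError
```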

Key providers in the data lake arena include Amazon (whose S3 Simple Storage Service underpins a large number of data lakes); Microsoft, with Azure Data Lake and the company’s data lake analytics service; Google, with its BigLake (loved by those who want to build an Apache Iceberg lakehouse); AI data cloud company Snowflake; and Databricks, with its already-referenced relationship to Microsoft.

Although Perforce didn’t peddle its own agenda or message set in this discussion, the company competes in version control with Git, Atlassian Bitbucket Data Center, Apache Subversion and Mercurial, to name a handful. In software testing, Perforce shares its market with BrowserStack, Sauce Labs and LambdaTest, and in application lifecycle management the organization comes up against IBM (when is that company not somewhere on most lists?) and its Engineering Lifecycle Management, among others.

Taking the steps and approaches tabled here could help to pinpoint, ring-fence and mitigate the risks around data lake information and balance its value against the need for its protection. The crocodiles may still be circling, but there are safe ways to enter the water if we know what kind of protective clothing to wear. These processes might not kill off the lake crocodiles (malicious attackers and ne’er-do-wells), but they might mean a few of them are forced back to shore.


