
Blog by our CTO: Is it Safe to Fish in Your Data Lake?

By Destin Valine on May 26th, 2021

Read time: 4 minutes.

Is it Safe to Fish in Your Data Lake? Why Context-on-Demand is Essential

Introduction

Corporations worldwide have spent, and are still spending, big on data lakes, racking up an estimated $7.9 billion in costs in 2019, a figure projected to climb past $20.0 billion by the mid-2020s. Even as the lakes themselves shift from on-premise to cloud at an ever-increasing rate, these numbers will keep growing.

Data lakes were designed to bring multiple sources of data together into a single location so the business could have on-demand reporting over data that was acceptably aged. They have since been shoehorned into reconciling disparate data across live systems. With this questionable use, engineers have had to balance timeliness against cost: the more frequent the reconciliation, the less likely the data is stale, but that assurance comes at the cost of network, storage, and compute.

A data lake can ease the problem of creating intelligence from disparate data sources. But when it comes to creating action from that data, the mismatch between problem and solution comes more clearly into view.

The solutions have taken many forms, of which the following have been the most frequently used: 

  • Write back to the data lake and use triggers or other database processes to push data back to the system of record (SoR). Reconciliation problems then commonly lead to audit processes, log analysis, alert procedures, and so on.
  • Engage the SoR directly through its database or API. This creates synchronization issues between the data lake and the SoR, either increasing the cost of the data lake, decreasing its reliability, or requiring a hybrid-style solution that both writes back to the data lake AND updates the SoR (see the sketch after this list).
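
To make that duplicated effort concrete, here is a minimal Python sketch of the dual-write-plus-reconcile pattern. The DATA_LAKE, SYSTEM_OF_RECORD, apply_change, and reconcile names are hypothetical stand-ins for real integrations, not any particular product's API.

```python
# Hypothetical sketch of the hybrid workaround: every change is written twice,
# and a separate reconciliation pass still has to confirm the two copies agree.
# All names here are illustrative stand-ins, not a real product API.

DATA_LAKE: dict = {}         # stand-in for the lake's staging tables
SYSTEM_OF_RECORD: dict = {}  # stand-in for the live SoR

def apply_change(record: dict) -> None:
    # The double write is where the duplicated cost comes from: every
    # integrated system needs both paths built, monitored, and reconciled.
    DATA_LAKE[record["ticket_id"]] = record         # write-back to the lake
    SYSTEM_OF_RECORD[record["ticket_id"]] = record  # update the SoR via API or trigger

def reconcile() -> list:
    # The audit step the workaround forces on you: find rows that have drifted apart.
    return [k for k in SYSTEM_OF_RECORD if DATA_LAKE.get(k) != SYSTEM_OF_RECORD[k]]

apply_change({"ticket_id": 42, "status": "closed"})
print(reconcile())  # -> [] only when the two copies agree
```

Every additional external system means another apply_change path to build and another reconciliation report to watch.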

These approaches to disparate data sources have driven up IT costs, especially the labor required to architect, build, and maintain solutions engineered to work around the very pattern data lakes were designed to solve. Ironically, businesses were sold on, and believed, the promise that putting this data in a lake would pay financial dividends.

The answer is to follow best practices and restore your data lake to its original intent:

  • Serve the needs of stakeholders for whom static business intelligence is acceptable.
  • Reduce your acquisition and storage costs.
  • Challenge the sources you have stored, and remove the data you don’t need.
  • Set reasonable retention practices.
  • Treat the data you store as static between refreshes, at an interval that meets your return-on-investment goals.

But how do you solve the problem of delivering and acting on data on-demand or near-demand?

By reversing the data lake paradigm. Embrace, when possible and practical, the staleness your data lake was designed around, and combine it with a healthy on-demand data acquisition and action strategy. Stream the lake’s data and join it with real-time SoRs as needed. With this pattern, you keep your organization’s data close at hand while connecting it to the critical systems your business runs on. And while that data streams to your operator, augment it with other contextual data so the operator has everything they need to do their job most effectively.
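
As a rough illustration of that pattern, assuming a hypothetical SoR REST endpoint and made-up field names (this is a sketch of the approach, not edgeCore™ code), the operator-facing view might be assembled like this:

```python
# A minimal sketch of the context-on-demand pattern: treat the lake extract as an
# acceptably stale snapshot and augment each row with live detail from the system
# of record only when an operator actually needs it. The endpoint URL and field
# names below are hypothetical.

import requests

SOR_URL = "https://sor.example.com/api/tickets/{id}"  # hypothetical SoR endpoint

def lake_snapshot():
    # Stand-in for a query against the data lake, refreshed on its own schedule.
    return [
        {"ticket_id": 101, "customer": "Acme", "opened": "2021-05-01"},
        {"ticket_id": 102, "customer": "Globex", "opened": "2021-05-03"},
    ]

def live_context(ticket_id: int) -> dict:
    # On-demand call to the SoR for the handful of fields that must be current.
    resp = requests.get(SOR_URL.format(id=ticket_id), timeout=5)
    resp.raise_for_status()
    return resp.json()  # e.g. {"status": "escalated", "assignee": "jdoe"}

def operator_view():
    # Stream stale-but-cheap lake rows, enriched with just-in-time SoR context.
    for row in lake_snapshot():
        yield {**row, **live_context(row["ticket_id"])}
```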

The primary complaint about the hybrid model has been that each time an external system is integrated, the cost is duplicated.

Leave Data Where it Belongs

Software-savvy organizations place abstractions between systems and applications, while data-savvy organizations place data lakes in that same position. Low-code solutions like edgeCore™ offer a new option: leave the data where it belongs and create a powerful decisioning platform for your users. Combine streams of data into impactful visualizations that drive decisions. Bring data lakes together with on-demand data and decision support.

edgeCore™ enables companies to utilize their lakes, placing aggregations and other powerful, lake-performant capabilities within easy reach. Placing data within a pipeline is a matter of creating the node, setting the connection properties, and publishing it. Connecting to APIs, text files, shell results, and more follows that same pattern.

Additionally, with the data sources published, a user with a modest level of SQL experience can custom-craft the user experience, reducing, securing, and transforming captured data. 
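
As a generic example of the kind of reduce/secure/transform step a little SQL can express, here is a sketch using Python’s built-in sqlite3 module and invented column names; it illustrates the idea, not edgeCore™’s actual transform syntax:

```python
# Reduce, secure, and transform captured data with a small amount of SQL.
# Table and column names are hypothetical; sqlite3 stands in for the pipeline.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE captured (ticket_id INT, customer TEXT, ssn TEXT, opened TEXT, status TEXT)")
conn.execute("INSERT INTO captured VALUES (101, 'Acme', '123-45-6789', '2021-05-01', 'open')")

rows = conn.execute(
    """
    SELECT ticket_id,
           customer,
           '***-**-' || substr(ssn, -4) AS ssn_masked,  -- secure: mask the sensitive field
           upper(status) AS status                      -- transform: normalize for display
    FROM captured
    WHERE status != 'closed'                            -- reduce: keep only actionable rows
    """
).fetchall()
print(rows)
```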

Most importantly, when seeking to reduce dependence on the lake, the selected data, pared down to only what matters to the users acting on it, can be augmented, enabling dramatic improvements to corporate applications. Augmentation looks up infrequently changing reference data, connects it with frequently changing data, and efficiently ties specific data to business support systems and AI. Through this connection to support systems, businesses reduce the costs associated with processing unneeded information.
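
A rough sketch of what that augmentation step looks like in practice, again with hypothetical reference data and field names rather than the actual Augment Transform:

```python
# Generic illustration of augmentation: enrich the reduced rows with an
# infrequently refreshed reference table before handing them to downstream
# decision support, so the expensive systems never see the unneeded bulk.

REFERENCE = {  # refreshed rarely, e.g. nightly
    "Acme": {"region": "EMEA", "tier": "gold"},
    "Globex": {"region": "APAC", "tier": "silver"},
}

def augment(rows):
    # Join each reduced row against the cached reference data.
    for row in rows:
        yield {**row, **REFERENCE.get(row["customer"], {})}

reduced_rows = [{"ticket_id": 101, "customer": "Acme", "status": "OPEN"}]
for enriched in augment(reduced_rows):
    print(enriched)  # ready for the decision-support or AI step
```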

Learn How at Our CTO Masterclass

In this month’s CTO Masterclass, join Destin Valine as he explores, explains, and demonstrates the “Augment Transform” feature in the edgeCore™ platform. Destin will explain how reduction and targeting can help you identify the routes needed to act efficiently on data, how this process can reduce your dependence on frequent updates to otherwise low-value data, and how these changes, taken to their logical conclusion, will help organizations utilize their systems more efficiently and effectively.

 

About Destin Valine

Destin Valine has been building cutting-edge solutions centered around reducing cost and error for most of his 25 years in technology. He holds a computer science degree from the University of Virginia and a Juris Doctor from George Mason University School of Law. Destin has been with Edge Technologies, Inc. in Arlington, Virginia since 2018 and has been building software since 1994. He also consults with attorneys around the world about the practical implications of technology in business and the law.


To learn more about edgeCore, get a copy of our Data Sheet