As the sheer amount of data businesses generate has grown, along with the ways we put that data to work, data warehouses have played a pivotal role. They store our data, serve as a central source of truth, and more.
Where do data warehouses stand in today’s data landscape? And how will they evolve along with it?
Those are some of the questions Indicative CEO Jeremy Levy took on during a session of Coalesce 2020, along with Arjun Narayan, CEO at Materialize; Boris Jabes, CEO at Census; and moderator Jennifer Li of Andreessen Horowitz. The session focused on the future of data warehouses, including how we’ll use them, emerging trends and use cases, and more.
To catch the main points, you can read our written summary or watch the full recording here.
Today’s Data Warehouse
The New Generation of Data Warehouses and Transactional Workflows
“It’s a great time to be in data right now because of the opportunity.”
Noting the evolution of the data warehouse and how it’s employed across organizations, Jennifer wondered about the new generation. What’s fundamentally different about the most recent crop of data warehouses that enables them to power transactional workflows?
“Only in this generation of data warehouses have we managed to separate storage from compute and, more importantly, separate workloads within compute from each other,” Boris said.
“That’s the biggest change that allows people to start exploring more uses of the data warehouse than just a nightly or monthly analytics batch process. So I think that’s probably the dominant reason why this is evolving.”
“This generation of cloud data warehouses that have unique profiles in terms of how they scale and how they grow has created space in our ecosystem for new categories of products that were previously handled by monolithic data warehouses,” Jeremy added. “It’s allowed entrepreneurs like ourselves to create and innovate incredibly quickly, creating all of these solutions like dbt, for example, or Indicative.”
Risks and Dangers Associated With Using a Data Warehouse
It’s clear that data warehouses are, and will continue to be, vital in collecting, storing, and utilizing data. But that doesn’t mean using one comes without its risks.
To start, Jennifer posed that question: What are some of the risks and dangers of leveraging data warehouses that may not be obvious to our audience?
“What’s really incredible about where we are today is the fact that data warehouses can be used by virtually anybody, whether you’re a startup that has no funding or a Fortune 500 company,” Jeremy said. “You can actually use the same underlying technologies that will scale with you as you grow. But figuring out what the right technologies are, what the right approach is, is where there’s risk with choosing a data warehouse.”
“There are no guardrails,” Boris added, “there’s no ‘This is how you should structure your data.’ It’s just an open field to do whatever you want. So the onus is a lot more on your team to figure out how to structure that data.”
Arjun highlighted another risk: speed (or lack thereof).
“They’re tremendously powerful, tremendously scalable and these are all benefits,” Arjun said. “But it comes at a cost, which is really that the data refresh speed is the speed of your slowest source.”
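One way to picture Arjun’s point is the minimal Python sketch below. The source names and load times are hypothetical and were not part of the session; the sketch simply assumes that a report joining several sources is only as current as the source that was loaded longest ago.

```python
from datetime import datetime

# Hypothetical last-successful-load times for three sources feeding one report.
last_loaded = {
    "billing_export": datetime(2020, 12, 7, 2, 0),    # nightly batch
    "crm_sync": datetime(2020, 12, 7, 11, 30),        # hourly sync
    "product_events": datetime(2020, 12, 7, 11, 58),  # near real time
}


def report_freshness(loads):
    """Return the slowest source and the time the joined data is current as of."""
    slowest = min(loads, key=loads.get)
    return slowest, loads[slowest]


source, as_of = report_freshness(last_loaded)
print(f"Report is current as of {as_of:%Y-%m-%d %H:%M}, limited by {source}")
```

In this made-up example, two of the three sources are minutes old, yet the report as a whole is only as fresh as the nightly billing export.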
Tensions Data Teams Face in Trying to Manage Data in a Coherent Way
“The role of the data team is to distill the needs of multiple teams into a common set of data points.”
The evolution of the data landscape has democratized how data is consumed and used across organizations—it’s no longer just the realm of data teams. That shift made Jennifer wonder about the tensions created when data teams need to manage data in a coherent way, while working with other teams.
“Once you go deeper and actually integrate directly into the operational business workflows, you’re going to have some initial pain,” Boris explained. “A lot of times, there’s an upfront cost that you have to bear to say, ‘Why are we disagreeing on this data and how does this affect me?’”
“The main value of enabling the data team to serve the rest of the company in this way,” Boris added, “is that you move from working in silos to actually working together. And that does come with organizational change but it is much, much better than the alternative.”
“I think it’s important to remember that data teams and business teams are all aligned amongst the same goals,” Jeremy noted. “So to the extent that we’re setting expectations about the objectives we have with our data gives way to the ability to democratize that data across an organization.”
Boris chimed in, “I think the data teams should think about their peers in terms of, ‘Let me help you get seamless data, but let’s make it safe for you to operate on this.’”
The Future of the Data Warehouse
“The ultimate goal of all of our platforms, of all of our products, is to derive greater business value.”
Emerging Use Cases
As the conversation shifted to the future of the data warehouse, Jennifer asked the panelists to weigh in on new and emerging use cases they see coming to the fore in the next five or ten years.
Jeremy said, “One area that still is not really easy for business users to use is moving beyond the dashboard and reporting use case.”
“To take it to the next level is still well out of the reach of almost all business users. And that’s very much where we’re trying to move our product,” Jeremy continued. “How do we move to a place where we can tell our users, not what happened from the perspective of a report or dashboard, but help them predict and understand why these things are happening and why they’re driving outcomes?”
“I think about all the ways a data team can be in the critical path to drive action, and I think we’re very much at the infancy on that front,” Boris noted. “So many companies are data-informed versus data-driven.”
Boris added, “There’s also a broader secular trend here that the data warehouse is participating in and enabling to some degree—we’re clearly on a 10-year journey of people wanting to own their data.”
Arjun agreed, saying, “There’s this privacy angle of wanting to understand all your user data and knowing where it is and harnessing it into one thing, so you can reason about it and control it. And I think the warehouses are going to be a big part of that over the next decade. I think we’re also just at the infancy of that: how to build privacy controls into the warehouse.”
Expanding, Arjun said, “I expect a continuation of existing trends first and foremost. We’re going to see an even bigger explosion of data sources. People are onboarding cloud SaaS tools that store core customer data at an astonishing rate, and that only makes the need for a source of truth to unify all that data even greater.”
The Dominant Language for Data in the Future
SQL or Python? Jennifer asked the panelists which language they expected to dominate the future of data warehousing. They all agreed—SQL—but few were happy about it.
“I think it’s going to be SQL but I don’t really want it to be,” Jeremy said. “I think SQL has held us back in terms of limiting how we think about analyzing data. I think it’s part of the reason we’ve seen an explosion in the data space—it’s a common framework that has a huge amount of people who speak it, it’s an interchange format for how we interact with products. But because of its limitations, it also limits how and what type of analysis we do.”
“We’re all in on SQL because I’m fundamentally pessimistic,” Arjun explained. “A large part of the SQL value proposition is exactly the ecosystem: it’s the fact that there’s BI tools that speak SQL out of the box. It’s really hard to envision a world in which there’s another language that also gains that adoption.”
“If there is another language,” Arjun added, “it will certainly have to be spearheaded by a very large player that can get a lot of ecosystem partners on board very quickly.”
Stream Processing or Batch Processing?
The session explored processing next—will stream or batch processing dominate the landscape?
Arjun expressed optimism, but cautioned against expecting a tectonic shift in norms. “Just because there is a new paradigm that has some advantages does not necessarily displace the old paradigm because both are growing.”
“I think, by volume, batch processing will be greater than stream processing for the next 100 years,” Arjun explained. “But will you, as a net new user, build a batch pipeline ten years from now? I would bet not.”
“To Arjun’s point,” Jeremy added, “one doesn’t alleviate the need for the other and vice versa. We have cheap storage that’s relatively slow, and we have fast storage that’s really expensive. It’s about the right tool for the right job. There’s no right answer—it’s ‘what is the problem I’m trying to solve?’”
Emerging Trends to Watch For
There was consensus among the group that the future of the data warehouse will develop in line with the existing trends we see shaping the industry already. That said, the panelists also felt some of those trends have been largely overlooked to date.
Jennifer asked them to expand on those overlooked trends and how they’ll shape the data warehouse landscape into the future.
“I think we’ll see new types of data move into the data warehouse,” Jeremy predicted. “We’ll move back to a place where we get to have alternative and more unique data types that can be queried and manipulated within our data warehouses.”
Noting recent shifts toward user privacy and ownership of data, Arjun argued, “Something that we really have to get very good at—which we don’t really have the cleanest story for yet—is data deletion.”
“As we have this explosion of tools, as we have this explosion of copies of copies of data sitting around everywhere, when a customer shows up and says, ‘I want you to delete everything about me,’ that is an immense challenge,” Arjun added. “That’s something that I think we are overlooking right now.”
“It’s not just about where the data lives,” Arjun explained. “It’s also about having clear termination points for how long the data lives. We’re barely scratching the surface in solving that problem.”
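To make the idea of “termination points” a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration (the dataset names, the retention windows, and the `purge` helper); none of it was described by the panelists. It shows two rules applied together: rows expire once they outlive a per-dataset retention window, and rows tied to a customer’s deletion request are dropped regardless of age.

```python
from datetime import date, timedelta

# Hypothetical retention windows per dataset (assumed values, for illustration).
RETENTION = {
    "raw_events": timedelta(days=90),
    "support_tickets": timedelta(days=365),
}


def is_expired(dataset, created_on, today):
    """True once a record has outlived its dataset's retention window."""
    return today - created_on > RETENTION[dataset]


def purge(rows, today, customer_to_forget=None):
    """Keep rows that are within retention and not tied to a deletion request."""
    return [
        row
        for row in rows
        if not is_expired(row["dataset"], row["created_on"], today)
        and row["customer_id"] != customer_to_forget
    ]


rows = [
    {"dataset": "raw_events", "customer_id": "c1", "created_on": date(2020, 1, 5)},
    {"dataset": "raw_events", "customer_id": "c2", "created_on": date(2020, 11, 20)},
    {"dataset": "support_tickets", "customer_id": "c1", "created_on": date(2020, 6, 1)},
]

kept = purge(rows, today=date(2020, 12, 7), customer_to_forget="c1")
print(kept)
```

This handles a single in-memory list. The harder problem Arjun describes is applying the same rules to every copy of the data across every tool, which a sketch like this does not attempt.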
To hear more about the future of the data warehouse, watch the full recorded session at the top of this page or on dbt’s Coalesce 2020 site.