AI workloads, data platforms, and infrastructure notes, written from the engineering edge between benchmarks and production.

Why Modern Data Architectures Break, and What the VAST DataBase Fixes (part 1 of 4)

Over fifty years we stacked databases, warehouses, and data lakes on top of each other. Here is why modern data architectures break, why data movement is the real cost, and how the VAST DataBase fixes it.

Itzik — VP Mission Alignment, VAST Data

17 min read


Part 1 of 4: From punch cards to the lakehouse, and the structural cracks that started it all.

Every few years the data industry reinvents itself, promises that this time the architecture will finally be simple, and then quietly ships another layer of glue to hold the previous generation together. If you run a data warehouse, a data lake, a few streaming systems, and a fleet of nightly jobs, and you still cannot answer a simple business question in real time, you already understand the problem this series is about.

This first post sets the stage. Before we can talk about what the VAST DataBase does differently, it helps to be honest about how we got here: why analytics architectures evolved the way they did, what each generation solved, and what each one quietly broke. Understanding that arc is the difference between chasing a feature and fixing a root cause.

Executive summary

If you read nothing else, read this. Over the past fifty years, analytics infrastructure has moved from databases to warehouses to data lakes to lakehouses. At every step, organizations gained one new capability and inherited a fresh set of seams. The important part is what did not happen: almost nobody retired the old system. They stacked the new one on top and wired the two together with pipelines. The result is the architecture most companies actually run today, which is a sprawl of specialized systems connected by fragile, expensive, and slow glue.

The single idea that runs through this whole series is simple. The biggest cost in modern analytics is not storage and it is not compute. It is data movement. Every time you copy data from one system to another, you add delay, expense, operational risk, and one more place where the numbers can quietly disagree. The VAST DataBase matters because it goes after that root cause instead of adding yet another layer on top. This first post explains the problem. Parts 2 through 4 explain the engine that solves it.

First, what do we actually mean by analytics?

Analytics is the practice of turning raw data into decisions you can act on. Companies use it to spot trends, run operations more efficiently, cut costs, and build new products, all from the data they already generate across sales, HR, supply chain, customer activity, and machines. A retailer uses analytics to learn which products sell best, why one region is lagging, and what demand will look like next quarter. Done well, it is a lasting competitive advantage.

It also helps to remember that analytics is not one thing. It is a ladder of questions, and each rung is more valuable than the one below it:

  • Descriptive: what happened? Last quarter’s sales, this week’s website traffic.
  • Diagnostic: why did it happen? The root cause of a sales drop or a supply delay.
  • Predictive: what will happen next? Forecasting customer churn or future demand from history.
  • Prescriptive: what should we do about it? Recommending the best promotion or the right inventory level.

The analytics maturity ladder. Each rung is more valuable and asks more of the platform underneath it.

Each rung asks more of the platform underneath. Descriptive reporting can live with yesterday’s data and a rigid structure. Predictive and prescriptive analytics, and the AI workloads now stacked on top of them, need fresh data, every data type, and the freedom to ask questions nobody planned for in advance. In many ways, the history of data architecture is the history of platforms straining to climb this ladder.

The rungs are also cumulative. Prescriptive analytics does not replace descriptive reporting. It sits on top of it and depends on the same clean, current, trustworthy data. A company that cannot reliably answer what happened has no business promising a model that recommends what to do. Weaknesses at the bottom of the ladder, such as stale data, missing data types, or brittle pipelines, travel upward and quietly limit how far the business can climb. Even the most ambitious AI project still rests on whether the underlying data is fresh, complete, and well governed.

The fifty-year march: databases to the lakehouse

The shape of analytics infrastructure has changed roughly once a decade, and each shift was a reasonable response to a real business pressure.

How analytics architectures evolved, from transactional databases to the modern lakehouse.

  • Traditional databases (1970s to today): electronic systems to store, organize, and retrieve structured records. They became the backbone of digital operations, running transactions, payroll, banking, and applications.
  • Enterprise data warehouses (1980s to 2000s): central, structured stores built for business intelligence. They required data to be cleaned and reshaped before analysis, which is how the ETL industry was born.
  • Data lakes (2000s to 2010s): flexible, low-cost storage for raw data of any kind. They let organizations keep everything and decide later what to do with it, right as data volumes exploded.
  • Cloud data warehouses (2010s): scalable, on-demand analytics that separated storage from compute. Elastic and convenient, but often a one-way door into vendor lock-in.
  • Lakehouses (late 2010s to today): hybrid designs that put structured, governed analytics directly on top of flexible lake storage, aiming for the scale of a lake with the reliability of a warehouse.

Look at that list again and a pattern appears. Every transition was driven by one of two needs: more structure and trust (databases to warehouses), or more scale and flexibility (warehouses to lakes). The lakehouse is the industry admitting it wants both at once, and that it is tired of moving data between two systems to get them.

The hidden tax nobody put on the invoice

Here is what the tidy timeline leaves out. Almost no organization actually replaced the previous generation. They accumulated it. The warehouse did not retire the databases. The lake did not retire the warehouse. Each new system was bolted onto the last, and the connective tissue between them, meaning the pipelines, copies, and reconciliation jobs, quietly became the real architecture.

The real cost of stacked architectures: silos, duplicated data, fragile pipelines, and delay.

That accumulation shows up as a tax on every analytics project:

  • Complex pipelines. Building and maintaining extract, transform, and load jobs is slow, expensive, and needs specialized skills. Every pipeline is a small software project that can break.
  • Data quality problems. Inconsistent, incomplete, or duplicated data flows downstream into unreliable analytics, and unreliable analytics lead to bad decisions.
  • High operating cost. Each system needs its own licenses, its own experts, and its own capacity planning. Cost grows with the number of tools, not the value delivered.
  • Fragility when sources change. Adding a new source often breaks existing pipelines, which starts another round of fixes and troubleshooting.
  • Delay for the business. Teams wait days or weeks for answers. By the time the report arrives, the decision window has often closed.
▶  The core problem of modern data architecture is not storage and it is not compute. It is movement. Every copy between systems adds latency, cost, risk, and one more place for the truth to drift.

A closer look at the pipeline tax

It is worth slowing down on what a pipeline actually is, because the word hides a lot of work. ETL stands for extract, transform, and load. Extract means pulling data out of a source system. Transform means cleaning and reshaping it so it matches everything else, for example making every date look the same and every product code line up. Load means writing the finished result into the system where it will be analyzed. A helpful way to picture it is a conveyor belt that takes messy, scattered data and turns it into something consistent and trustworthy at the other end.

Every pipeline is extract, transform, and load: useful work, but each step adds delay, cost, and a chance for errors.
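
To make the three stages concrete, here is a minimal sketch of a pipeline in Python. Everything in it is an assumption for illustration: the sample rows, the two date formats, the product-code rule, and the in-memory SQLite table standing in for the analytical system. It is not how any particular ETL tool works, only the shape of the work.

```python
import sqlite3
from datetime import datetime
from decimal import Decimal

# Hypothetical raw rows with inconsistent product codes and date formats.
RAW_SALES = [
    {"sku": "ab-100", "sold_on": "2024-03-01", "amount": "19.99"},
    {"sku": "AB100", "sold_on": "03/02/2024", "amount": "5.00"},
]

# Assumed formats: ISO dates and US-style month/day/year.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def extract():
    """Extract: pull rows out of the source (here, an in-memory list)."""
    return list(RAW_SALES)


def normalize_date(text):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def transform(rows):
    """Transform: make every date look the same and every product code line up."""
    return [
        (
            row["sku"].replace("-", "").upper(),   # 'ab-100' -> 'AB100'
            normalize_date(row["sold_on"]),
            int(Decimal(row["amount"]) * 100),     # store money as whole cents
        )
        for row in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into the system where they are analyzed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(sku TEXT, sold_on TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    load(conn, transform(extract()))


conn = sqlite3.connect(":memory:")
run_pipeline(conn)
print(conn.execute("SELECT * FROM sales ORDER BY sold_on").fetchall())
```

Even at this size the conveyor belt has three places to fail: a date format nobody listed, a product code the rule does not cover, and a target table whose columns no longer match.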

The trouble is that none of those steps are free, and they rarely stay simple. A pipeline that worked last month breaks when a source adds a column. A transformation that looked correct hides a subtle rounding or time-zone error that nobody notices for weeks. And every pipeline has to be scheduled, monitored, and repaired by someone. Multiply that by the dozens of questions a business asks every week, each needing its own joins and its own refresh, and the glue between systems becomes the single largest thing a data team maintains. The goal is not a faster conveyor belt. It is needing fewer belts in the first place.
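
The time-zone case is easy to reproduce. The sketch below assumes source timestamps stored in UTC and a store that runs on a fixed UTC-5 offset; both the offset and the sample sales are hypothetical. The two functions differ by a single conversion, and that is enough to move a sale onto the wrong day.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical store time zone: a fixed UTC-5 offset, chosen only for illustration.
STORE_TZ = timezone(timedelta(hours=-5))

# Source timestamps are in UTC. The second sale happened at 21:30 store time
# on March 1, which is already 02:30 on March 2 in UTC.
SALES_UTC = [
    datetime(2024, 3, 1, 15, 0, tzinfo=timezone.utc),
    datetime(2024, 3, 2, 2, 30, tzinfo=timezone.utc),
]


def sales_per_day_buggy(timestamps):
    """Bucket by the UTC calendar date, ignoring the store's local time."""
    return dict(Counter(ts.date().isoformat() for ts in timestamps))


def sales_per_day_local(timestamps, tz):
    """Convert to the store's time zone first, then bucket by local date."""
    return dict(Counter(ts.astimezone(tz).date().isoformat() for ts in timestamps))


print(sales_per_day_buggy(SALES_UTC))
print(sales_per_day_local(SALES_UTC, STORE_TZ))
```

The first version reports one sale on each day. The second reports both sales on March 1, which is what the store actually saw. Neither version raises an error, which is why this kind of bug can sit in a transformation for weeks.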

It is also worth being honest about where the effort goes. Survey after survey of data teams finds that most of their time is spent not on analysis but on finding, cleaning, and moving data. That is the tax in human terms: highly skilled people spending their days as plumbers rather than analysts. Every pipeline you can remove is time handed back to the work that actually creates value, and one less thing that can quietly break at three in the morning.

Why the lakehouse was inevitable, and not quite enough

The lakehouse is the right idea. Run powerful analytics, and even AI workloads, on all of your data without shuttling it between systems. It promises the scale and low cost of a lake with the reliability and speed of a warehouse. Open table formats and catalogs, which we cover in detail in Part 4, made it believable by adding transactions, structure, and governance on top of cheap object storage.

But most lakehouse setups still carry an assumption from the cloud era: that storage and compute should be loosely connected across a slow network, and that you must still choose, for each workload, between a system tuned for transactions and one tuned for analytics. The result is better than the old two-system world, yet it still forces trade-offs that feel familiar. Small, frequent writes fight with large analytical scans. Fresh operational data is still one copy away from the analytics that need it.

This is exactly the gap the VAST DataBase fills. It is not another warehouse to copy data into, and it is not a thin layer of metadata pasted over a lake. It is a database engine built on a very different hardware design, one that is meant to remove the choice between transactions and analytics, between structured tables and raw files, and between freshness and scale.

Event streaming and the demand for now

One more force sped all of this up: the shift from batch to real time. More and more businesses run on event streams, meaning a constant flow of clicks, transactions, sensor readings, and application events instead of a nightly batch. An event-driven architecture captures and reacts to those events as they happen, and an event bus is the pipe that carries them between the systems that produce them and the systems that use them.

Batch waits for the next scheduled run. Streaming reacts as events happen, which raises the bar for how fresh analytics must be.

Streaming raised the bar in a way older stacks were never built for. It is no longer enough to analyze yesterday’s data. People expect to act on data that is seconds old. But if analytics still requires copying fresh operational data into a separate analytical system, “real time” quietly turns into “as soon as the next pipeline finishes.” Solving for real time, much like solving for AI, comes back to solving for data movement, which brings us right back to the architecture.
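
A rough way to put a number on that gap is sketched below. It rests on a simplifying assumption that is mine, not a measurement: each batch run snapshots the source when it starts, runs do not overlap, and results become visible only when the run finishes. The intervals and runtimes are hypothetical.

```python
from datetime import timedelta


def worst_case_staleness(batch_interval, pipeline_runtime):
    """How far behind the analytical copy can be, in the worst case.

    A record written just after a run starts waits one full interval for the
    next run, then waits for that run to finish before it becomes visible.
    """
    return batch_interval + pipeline_runtime


# Hypothetical schedules: a nightly load and a five-minute micro-batch.
nightly = worst_case_staleness(timedelta(hours=24), timedelta(hours=2))
micro_batch = worst_case_staleness(timedelta(minutes=5), timedelta(minutes=1))

print(nightly)
print(micro_batch)
```

Shrinking the interval helps, but the floor is always the copy itself: as long as data has to move before it can be queried, freshness is bounded by how often and how fast it moves.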

There is a human cost here too. When data is always a day behind, people slowly stop trusting dashboards and drift back to private spreadsheets and gut feel. Freshness is not only a technical property. It is what makes people believe the numbers enough to act on them. A platform that keeps data current by default rebuilds that trust, because the report on the screen and the reality on the floor finally agree.

The cloud-warehouse era and the lock-in it created

It is worth pausing on the cloud data warehouse, because it is where many organizations are stuck right now. The cloud warehouse was a real breakthrough. By separating storage from compute, it let teams scale query power on demand, pay for what they used, and stop planning capacity years ahead. For a while it felt like the destination rather than a stop along the way.

The cloud warehouse traded capacity planning for two new costs: a meter on every query, and data locked in a proprietary format.

The convenience came with a bill that grew in two directions. The first is consumption pricing. When every query costs money and compute is billed by the second, teams start rationing their own curiosity, which is the opposite of what a data-driven culture needs. The second is lock-in. Data loaded into a proprietary warehouse format is not easy to move, and the surrounding pipelines, transformations, and permissions quietly cement you in place. The open table formats in Part 4 are, in large part, the industry’s response to that lock-in: a way to keep data in open, portable formats so the engine reading it stays a choice rather than a cage.

A worked example: the simple question that is hard to answer

To make the tax concrete, follow a single question through a typical stacked architecture. A retail leader asks which products are trending up this week in regions where a promotion is also running, and whether there is enough inventory to support it. It sounds trivial. In most organizations it is not.

The sales come from a transactional database, the kind tuned for fast checkout writes (often called OLTP, which simply means online transaction processing). The promotion calendar lives in a marketing tool. Inventory lives in an ERP system. Web behavior sits in a data lake as raw clickstream files. To answer the question, an engineer has to pull data from each source, reconcile mismatched product codes and date formats, load the cleaned result into a data warehouse, and only then can an analyst build the report. By the time the dashboard refreshes, the promotion may be half over.
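
Here is a small sketch of the reconciliation and join that engineer ends up writing. All of the data, field names, and the product-code rule are hypothetical, the clickstream source is left out to keep it short, and "enough inventory" is reduced to a crude comparison of stock on hand against this week's units.

```python
# Hypothetical extracts from three of the systems; values are illustrative only.
SALES = [  # transactional database: (product code, region, week, units)
    ("AB100", "north", "last", 40), ("AB100", "north", "this", 65),
    ("CD200", "north", "last", 30), ("CD200", "north", "this", 25),
    ("AB100", "south", "last", 10), ("AB100", "south", "this", 30),
]
PROMOTIONS = [{"region": "North"}]           # marketing tool, different casing
INVENTORY = {"ab-100": 50, "cd-200": 500}    # ERP system, different code style


def canonical_sku(code):
    """Reconcile product codes: 'ab-100' and 'AB100' refer to the same item."""
    return code.replace("-", "").upper()


def trending_in_promo_regions(sales, promotions, inventory):
    """Products selling more this week than last, in regions with a promotion."""
    promo_regions = {p["region"].lower() for p in promotions}
    stock = {canonical_sku(code): units for code, units in inventory.items()}

    units = {}  # (sku, region) -> {"last": n, "this": n}
    for sku, region, week, count in sales:
        units.setdefault((canonical_sku(sku), region.lower()), {})[week] = count

    results = []
    for (sku, region), weeks in sorted(units.items()):
        this_week, last_week = weeks.get("this", 0), weeks.get("last", 0)
        if region in promo_regions and this_week > last_week:
            results.append({
                "sku": sku,
                "region": region,
                "units_this_week": this_week,
                # Crude proxy: is stock on hand at least this week's sales?
                "enough_inventory": stock.get(sku, 0) >= this_week,
            })
    return results


print(trending_in_promo_regions(SALES, PROMOTIONS, INVENTORY))
```

The query logic is a handful of lines. Most of the code exists only to make three systems agree on what a product code and a region name look like, and in a real stack each of those inputs is a scheduled copy rather than a constant.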

Nothing here is anyone’s fault. Each system is doing its job well. The failure is in the architecture: the answer requires joining data that has been deliberately scattered across systems that were never designed to be queried together. Every copy along the way is a place where the numbers can drift and time can be lost. This is the pattern the VAST DataBase is built to break.

Notice how the cost multiplies. That one question touched four source systems, plus the warehouse, and at least three copies. Multiply it by the many questions a business asks every week, each with its own joins, its own reconciliation, and its own scheduled refresh, and you can see why the pipelines are never a one-time setup cost. They are a permanent burden that grows as curiosity grows. Stacked architectures make asking questions expensive at exactly the moment you want to ask more of them.

What this means for you

The same problem looks different depending on where you sit, so here is the short version for four readers who will all recognize it:

  • If you run storage: you are the one buying more capacity for each new silo, sizing every system for its own peak, and keeping copies of the same data in three places. The tax shows up as stranded capacity and backups of duplicated data.
  • If you run databases: you are the one maintaining the nightly load that copies production data into the reporting system, fielding questions about why two dashboards disagree, and protecting transactions from heavy reporting queries.
  • If you build AI or machine learning: you are the one waiting on data that is already a day old, then copying it again into yet another store before you can train or retrieve against it.
  • If you own the business outcome: you are the one waiting days for an answer, paying for a stack of tools, and wondering whether the number on the slide is current. The cost is slower decisions and higher risk.

Different seats, same root cause. All four are paying, in their own currency, for data that has to move before it can be used.

Signs the data-movement tax is hurting you

You do not need a formal architecture review to know whether this problem is yours. A handful of everyday symptoms give it away, and most organizations recognize several of them at once:

  • Two reports built from the same source show different numbers, and reconciling them is a recurring task rather than a rare surprise.
  • The freshest figure a decision-maker can see is from last night or last week, and never from this morning.
  • Adding a new data source is treated as a project with its own timeline, because it means building and testing yet another pipeline.
  • A real share of your data team’s week goes to fixing broken loads instead of answering new questions.
  • Nobody can say quickly where a given number came from, or who is allowed to see it, without checking several systems by hand.

None of these are exotic failures. They are the ordinary background noise of a stacked architecture, and they are easy to stop noticing precisely because they have always been there. The reason to name them is to make an invisible cost visible, because the first step to removing the tax is admitting how much of it you are already paying. If three or more of those points sound like your week, the problem is not your team and it is not your tools. It is the shape of the architecture they are forced to work inside.
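
The first symptom on that list is also the easiest to check for. The sketch below compares per-key totals between a source and a downstream copy and reports where they disagree. The row shape, the field names, and the sample data are assumptions for illustration.

```python
def reconcile_totals(source_rows, copy_rows, key, value):
    """Compare per-key totals between a source and its downstream copy.

    Returns {key: (source_total, copy_total)} for every key where they differ.
    """
    def totals(rows):
        out = {}
        for row in rows:
            out[row[key]] = out.get(row[key], 0) + row[value]
        return out

    src, dst = totals(source_rows), totals(copy_rows)
    return {
        k: (src.get(k, 0), dst.get(k, 0))
        for k in sorted(src.keys() | dst.keys())
        if src.get(k, 0) != dst.get(k, 0)
    }


# Hypothetical data: the source of truth and what the nightly load produced.
source = [
    {"region": "north", "units": 65},
    {"region": "south", "units": 30},
]
# The copy loaded the north row twice and missed the south row entirely.
warehouse_copy = [
    {"region": "north", "units": 65},
    {"region": "north", "units": 65},
]

print(reconcile_totals(source, warehouse_copy, key="region", value="units"))
```

A check like this is useful, but notice what it is: one more job that exists only because the data was copied. It detects drift after the fact and does nothing to prevent it.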

What you are really paying for

It helps to split the bill into the part you can see and the part you cannot. The visible part is the obvious line items: a license for each system, storage for each copy, and the compute every tool burns through. Those are straightforward to add up, and once every silo is counted they are usually larger than anyone expected.

The hidden part is bigger and far harder to put on an invoice. It is the salary of skilled engineers spent on plumbing instead of insight. It is the decision made a day late, or made on a number that quietly turned out to be wrong. It is the question nobody bothered to ask because getting the answer was too slow or too expensive to be worth it. When the architecture makes curiosity costly, the real loss is not a line in a budget. It is the good ideas that never got tested because the data was too hard to reach.

This is why consolidation is worth taking seriously even when the current stack technically works. A system can function and still quietly tax everything around it. The goal of the rest of this series is not to chase a faster version of the same setup. It is to show what becomes possible when the copies, and the waiting they create, are removed from the picture.

Where this series is going

Over the next three posts we move from the problem to the engine that resolves it:

  • Part 2, inside the engine. The three shapes of data and the three database archetypes, and how VAST’s disaggregated shared-everything design (DASE) brings them into one system for tables, files, and streams.
  • Part 3, under the hood. Row versus columnar storage, transactions versus analytics, ACID guarantees, and how the VAST DataBase delivers both kinds of performance at very large scale.
  • Part 4, from lakehouse to AI. Open table formats, catalogs, querying with Spark and Trino, governance, and how the same platform becomes the foundation for AI.
▶  If your reporting is slow on a rigid warehouse, your team is drowning in pipeline maintenance, or you are trying to bolt AI onto a lake full of ungoverned files, the real need is not another tier. It is to stop moving the data.

Key takeaways

  • Analytics is a ladder, from descriptive to diagnostic to predictive to prescriptive, and every rung asks more of the platform underneath.
  • Data infrastructure evolved from databases to warehouses to lakes to lakehouses, each step trading structure for scale or scale for structure.
  • Organizations rarely replaced old systems. They stacked them, and the glue between them became the real and most expensive architecture.
  • The biggest cost in modern analytics is data movement: the copies, pipelines, latency, and drift they introduce.
  • The VAST DataBase goes after that root cause directly, which is what the rest of this series unpacks.

VAST Data · The VAST DataBase · Analytics series


Related reading

Next in this series: Part 2 — Inside the VAST Database Engine.

Next in this series: Part 3 — Under the Hood: Row, Column, and ACID at Scale.
