AI workloads, data platforms, and infrastructure notes, written from the engineering edge between benchmarks and production.

Inside the VAST DataBase Engine: One System for Tables, Files, and Streams (part 2 of 4)

Data sprawl is the result of a hardware limit, not bad planning. Here is how the VAST DataBase engine unifies tables, files, and streams on a DASE foundation.

Itzik — VP Mission Alignment, VAST Data

16 min read


Part 2 of 4: The three shapes of data, the three database archetypes, and the architecture that unifies them.

In Part 1 we argued that the defining cost of modern analytics is data movement, the copies and pipelines that pile up when you stack databases, warehouses, and lakes. This post goes inside the engine to explain why that fragmentation existed in the first place, and how the VAST DataBase is built to remove the need for it. To get there we need two foundations: how data is shaped, and how databases have traditionally been built to handle each shape.

Executive summary

Data sprawl is not an accident of poor planning. It is the logical result of a hardware limit. Data comes in three shapes: structured, semi-structured, and unstructured. Three kinds of database grew up to serve them: relational row stores for transactions, columnar stores for analytics, and NoSQL for flexible scale. Each is excellent at one job and weak at the others, so organizations bought all three and glued them together.

Underneath those choices sits the real cause. For decades, scalable systems were built shared-nothing, which tied each slice of data to a specific server, and the cloud later separated compute from storage across a slow network. The VAST DataBase starts from a different foundation called DASE, short for disaggregated, shared-everything, where stateless compute nodes all reach one shared pool of flash storage over a fast network. On that base, a single engine holds tables, files, and streams together, and the old question of which system this data should live in largely goes away.

The three shapes of data

Data arrives in three broad forms, and the form has historically decided how it gets stored and analyzed:

The three shapes of data: structured tables, semi-structured records, and unstructured files.
  • Structured data: organized into tables with a fixed layout, such as customer records with names, emails, and purchase history. This is the natural home of SQL, the standard language for querying tables.
  • Semi-structured data: has some organization but no rigid layout, such as the JSON a web app emits to store user preferences, or the event records that flow off a stream.
  • Unstructured data: no fixed format at all, such as emails, PDFs, images, video, and sensor captures. This is the fastest-growing category and the fuel for modern AI.
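
As a rough illustration of the three shapes, the sketch below models one example of each in Python. The class, field names, and sample values are hypothetical, chosen only to show how much layout each shape guarantees up front.

```python
import json
from dataclasses import dataclass, fields

# Structured: every record has the same named, typed fields.
@dataclass
class Customer:
    name: str
    email: str
    total_purchases: int

structured = Customer(name="Ada", email="ada@example.com", total_purchases=3)

# Semi-structured: self-describing JSON; keys can differ from record to record.
semi_structured = [
    json.loads('{"user": "ada", "theme": "dark"}'),
    json.loads('{"user": "bob", "language": "en", "beta": true}'),
]

# Unstructured: opaque bytes with no layout the store can rely on.
unstructured = b"%PDF-1.7 ... binary document contents ..."

def structured_columns() -> list[str]:
    """Column names are known before any row is read."""
    return [f.name for f in fields(Customer)]

def observed_keys(records: list[dict]) -> set[str]:
    """For semi-structured data, the layout is only discovered by reading records."""
    keys: set[str] = set()
    for record in records:
        keys.update(record)
    return keys
```

The structured case can answer "which columns exist?" from the definition alone, the semi-structured case has to inspect the records, and the unstructured case offers no fields to ask about at all.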

For decades, each shape pushed organizations toward a different kind of database, because no single engine handled all three well. That is the root of the sprawl.

The three database archetypes

Three categories of database grew up to serve those shapes, each tuned for a different job:

  • Relational (row-based) databases are great at structured data and transactions. They store records one full row at a time, which makes inserting, updating, and reading a single record very fast. This suits OLTP, online transaction processing, such as order entry and payments.
  • Columnar databases store data one column at a time, which makes scanning and adding up millions of rows fast and easy to compress. This suits OLAP, online analytical processing, such as sales-trend reporting.
  • NoSQL databases give up a strict layout in exchange for flexibility and easy horizontal scale, which fits semi-structured and unstructured data in fast-changing web, IoT, and big-data applications.

Each archetype is genuinely good at its job and genuinely poor at the others. A relational database struggles with large analytical scans. A columnar store is slow at single-record writes. NoSQL gains flexibility by giving up the consistency and rich querying that transactions rely on. The rough mental model the industry settled on is blunt: SQL says everything must fit the table, while NoSQL says store the data however you like.

Three database archetypes, three sweet spots, and three sets of trade-offs that push organizations to run all of them at once.

| Dimension | SQL / Relational | NoSQL |
| --- | --- | --- |
| Structure | Fixed tables | Flexible |
| Schema | Strict | Dynamic |
| Scaling | Vertical | Horizontal |
| Best for | Transactions | Scale and flexibility |
| Data type | Structured | Structured, semi-structured, unstructured |

The reason organizations run so many systems is not a lack of skill. It is that the classic archetypes force a choice: transactions or analytics, structure or flexibility, freshness or scale. Every “or” becomes another system, and every system becomes another copy.

The architectural root cause

Why could one system not do it all? The answer lives below the database, in the hardware design. For about thirty years, scalable systems were built shared-nothing. Data is split into shards across many servers, and each server owns its own slice of storage and compute. Shared-nothing scaled out well, but it baked in painful trade-offs. Joining or rebalancing data means shuffling it across the network. Mixing small writes with big scans makes the two fight over the same local resources. And because compute and capacity grow together, you end up over-buying one just to get enough of the other.
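
The rebalancing cost can be made concrete with a small sketch. It assumes the simplest placement rule, a key hash modulo the server count, and the key names are made up; this illustrates the shared-nothing pattern and does not describe any particular product. Real systems often use consistent hashing or range partitioning to move less data, but some movement remains.

```python
import hashlib

def shard_for(key: str, num_servers: int) -> int:
    """Place a key on a server using a stable hash modulo the server count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

def fraction_moved(keys: list[str], old_servers: int, new_servers: int) -> float:
    """Share of keys whose owning server changes when the cluster is resized."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_servers) != shard_for(key, new_servers)
    )
    return moved / len(keys)

# Hypothetical keys, for illustration only.
keys = [f"order-{i}" for i in range(100_000)]
moved = fraction_moved(keys, old_servers=4, new_servers=5)
print(f"{moved:.0%} of keys change owner when growing from 4 to 5 servers")
```

Under this rule, roughly four in five keys change owner when a fifth server joins, because a key stays put only when its hash has the same remainder modulo 4 and modulo 5. Every moved key is data that has to cross the network before the new server is useful.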

Shared-nothing ties each slice of data to a server. DASE lets every compute node reach all of the storage.

Cloud designs softened this by separating storage and compute, but they introduced a different bottleneck: a relatively slow network between the compute tier and an object store, which pushes designers back toward caching, tiering, and, you guessed it, more copies. The VAST DataBase starts from a different hardware premise entirely.

DASE: disaggregated, shared-everything

The VAST DataBase is built on DASE, which stands for disaggregated, shared-everything. The idea is easy to state and powerful in practice: separate compute from storage the way the cloud does, but connect every compute node to all of the storage over a fast, low-latency network, so there are no shards and no single owner of any piece of data. NVMe-over-Fabrics is the technology that makes this possible; in plain terms, it lets a server reach flash storage across the network almost as quickly as if it were local.

DASE: stateless compute nodes share all storage over a fast fabric, with no sharding and no data movement to scale.

A few results fall straight out of that design:

  • Every node sees all data. Because compute is stateless and storage is shared, any node can serve any query or write. There is no reshuffling to rebalance and no cross-shard coordination for a join.
  • Compute and capacity scale on their own. Need more query power? Add compute nodes. Need more space? Add storage. You stop over-buying one to get the other.
  • Steady performance under mixed work. Small transactional writes and huge analytical scans no longer fight over one server’s local disk, because storage is a shared all-flash pool reached over the network.
  • Resilience by design. With no node owning data, losing a node is an availability event, not a data-loss event.
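
The over-buying point comes down to simple arithmetic. The sketch below is a sizing comparison under stated assumptions: the unit sizes (64 cores and 100 TB per unit) and the workload figures are hypothetical and are not VAST specifications.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Demand:
    cores: int        # total CPU cores the workload needs
    capacity_tb: int  # total usable capacity the workload needs

# Hypothetical hardware unit sizes, for illustration only.
CORES_PER_SERVER = 64        # coupled server: compute part
TB_PER_SERVER = 100          # coupled server: storage part
CORES_PER_COMPUTE_NODE = 64  # disaggregated compute node
TB_PER_STORAGE_UNIT = 100    # disaggregated storage unit

def coupled_servers(demand: Demand) -> int:
    """Coupled design: one server count must satisfy both dimensions."""
    return max(
        math.ceil(demand.cores / CORES_PER_SERVER),
        math.ceil(demand.capacity_tb / TB_PER_SERVER),
    )

def disaggregated_units(demand: Demand) -> tuple[int, int]:
    """Disaggregated design: size compute and storage separately."""
    return (
        math.ceil(demand.cores / CORES_PER_COMPUTE_NODE),
        math.ceil(demand.capacity_tb / TB_PER_STORAGE_UNIT),
    )

# A compute-heavy workload: many cores, modest capacity.
demand = Demand(cores=1280, capacity_tb=400)
servers = coupled_servers(demand)                           # 20 servers
compute_nodes, storage_units = disaggregated_units(demand)  # 20 and 4

stranded_tb = servers * TB_PER_SERVER - demand.capacity_tb
print(f"coupled: {servers} servers, {stranded_tb} TB bought but not needed")
print(f"disaggregated: {compute_nodes} compute nodes, {storage_units} storage units")
```

With these made-up numbers, the coupled design buys 2,000 TB to satisfy a 400 TB requirement, simply because capacity rides along with every server added for compute. The disaggregated sizing buys the same compute but only four storage units.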

DASE is the foundation that makes the next idea possible. Once no single server owns the data and every node can reach all of it at flash speed, the artificial walls between transactional and analytical systems, and between structured and unstructured stores, stop being necessary.

How scaling changes day to day

Because architecture is abstract, it helps to compare DASE with the shared-nothing design it replaces through one practical lens: what happens when you grow?

Add capacity and compute independently. Growing the pool does not require re-sharding or moving existing data.
  • Adding capacity. In shared-nothing, adding servers means re-sharding and shuffling existing data so the new server gets its slice, which is an expensive and disruptive rebalance. In DASE, storage is one shared pool, so adding capacity does not move existing data.
  • Adding performance. In shared-nothing, compute and storage are welded together, so you often add whole servers, and storage you did not need, just to get more processing power. In DASE, you add stateless compute on its own.
  • Mixed workloads. In shared-nothing, a heavy scan and a burst of small writes compete for the same server’s local disk. In DASE, they hit a shared all-flash pool over a fast network, so they interfere far less.
  • Recovering from failure. In shared-nothing, a lost server means its shard is unavailable until it recovers. In DASE, no server owns data, so a lost node is a compute event, not a data event.

These are not abstract virtues. They are the difference between an architecture that fights you as it grows and one that gets out of the way, and they are why a single engine can finally serve transactions and analytics on the same data.

For anyone who has lived through a storage migration or a cluster rebalance, this is the part that sounds almost too good. Growth stops being a project with a maintenance window and a risk register, and becomes a routine action. You add what you are short of, whether that is space or processing power, and the system keeps serving traffic while it happens. The architecture does the hard part, instead of handing it to the team as a weekend of careful, nerve-wracking work.

The unified store

On top of DASE, the VAST DataBase exposes a single engine that natively holds tables, files, and objects in the same system, governed the same way, on the same media. You can write structured rows, semi-structured event streams, and unstructured files, and query across all of them, without standing up three platforms and three pipelines to keep them in sync.

One engine for structured tables, unstructured files and objects, and semi-structured streams.

This is the payoff of everything above. Instead of picking an archetype for each workload and then building pipelines to reconcile them, the same data is available at once to transactional access and analytical queries, to a SQL engine and a Python notebook, to a dashboard and an AI model. The classic workaround, where you run NoSQL for operational data, a lake for raw datasets, and a separate query engine to stitch access across them, collapses into one platform.

The point of the VAST DataBase is not that it is a faster warehouse or a smarter lake. It is that the question of which system this data should live in largely disappears. There is one system, and the data does not move to be used.

A closer look at the archetypes

It is worth slowing down on the three archetypes, because their trade-offs are exactly what a unified engine is trying to reconcile. Understanding them explains why so many environments ended up with sprawl in the first place.

Relational, row-based databases

A relational database stores data in rows and enforces a strict layout. Its strength is the transaction: it can insert or update a single record quickly and reliably, and it guarantees ACID behavior (explained in Part 3) so that many users working at once never corrupt each other’s changes. That makes it the backbone of operational systems such as banking, payroll, and order entry. Its weakness is analytics. Answering a question like the average order value across ten million rows means dragging entire rows off storage just to read one field, so large scans are slow and do not compress well.

Columnar databases

A columnar database flips the layout and stores each field together. That makes wide calculations much faster and far easier to compress, which is why it dominates reporting and business intelligence. The cost is the mirror image of the row store: writing or updating a single record means touching many separate column locations, so transactional work suffers. Columnar is built for analytical reads, not for running the business.
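
The layout difference is easy to see in a toy example. The sketch below stores the same three hypothetical orders both ways and counts how many field values each layout has to touch to average one field. It models access patterns only and is not a real storage engine.

```python
# The same three orders, with hypothetical fields.
orders = [
    {"order_id": 1, "customer": "ada", "country": "IL", "value": 120.0},
    {"order_id": 2, "customer": "bob", "country": "US", "value": 80.0},
    {"order_id": 3, "customer": "cy", "country": "DE", "value": 100.0},
]

# Row layout: each record's fields are stored together.
row_store = [tuple(order.values()) for order in orders]
field_index = {name: i for i, name in enumerate(orders[0])}

# Column layout: each field's values are stored together.
column_store = {name: [order[name] for order in orders] for name in orders[0]}

def average_from_rows(field: str) -> tuple[float, int]:
    """Read whole rows, then pick one field out of each."""
    values_touched = sum(len(row) for row in row_store)
    total = sum(row[field_index[field]] for row in row_store)
    return total / len(row_store), values_touched

def average_from_columns(field: str) -> tuple[float, int]:
    """Read only the one column the query needs."""
    column = column_store[field]
    return sum(column) / len(column), len(column)

def insert_locations(layout: str) -> int:
    """Separate places a single-record insert must write to."""
    return 1 if layout == "row" else len(column_store)

print(average_from_rows("value"))     # touches every field of every row
print(average_from_columns("value"))  # touches only the 'value' column
```

Both functions return the same average, but the row layout reads all four fields of every order while the column layout reads one. The insert count runs the other way: one location for the row layout, one per column for the columnar layout.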

NoSQL databases

NoSQL relaxes the layout entirely. Key-value, document, and similar models let teams store data however they like and scale across many cheap servers, which suits fast-changing web apps, IoT, and big-data work. The trade-off is the very thing relational databases guarantee: consistency, rich querying, and easy joins are weaker or missing. You gain flexibility and scale by giving up some of the rigor that analytics and transactions depend on.

Put the three side by side and the sprawl explains itself. A business that needs reliable transactions, fast reporting, and flexible storage of varied data has historically had to buy all three, and then build pipelines to keep them consistent. The unified VAST engine exists to make “and” the default instead of “or”.

Why unstructured data changes the stakes

For most of computing history, the valuable data was structured, and the unstructured data such as documents, images, and logs was mostly dead weight you kept for compliance. AI flipped that. The fastest-growing and now most valuable category is exactly the unstructured and semi-structured data that relational and columnar systems were never designed to hold. A modern platform that can only handle tables is solving yesterday’s problem.

This is why one engine for tables, files, and streams is not a tidiness argument. It is a readiness argument. The same organization that needs ACID transactions for orders also has a decade of PDFs, contracts, images, and sensor data that an AI model would love to learn from. If those live in a separate silo with separate governance, every AI project starts with a data-plumbing project. When they live in one governed engine next to the structured data, the plumbing is already done. Part 4 returns to this when we cover retrieval-augmented generation.

There is a second effect worth calling out. Because unstructured data is messy and large, copying it into a special-purpose store is even more painful than copying a tidy table. A single engine that can hold a petabyte of documents and images in place, and serve them to analytics and AI without a migration, turns the hardest data to move into data you never have to move at all. The bigger and messier the data, the more the unified approach pays off.

Why one engine was long considered impossible

It is fair to ask: if unifying these workloads is so clearly useful, why did nobody do it before? The honest answer is that for decades it genuinely was not practical. Spinning hard disks punished the fine-grained random access that transactions need, so analytical systems avoided it and loaded data in big batches instead. Networks were too slow to let every compute node treat all storage as if it were local, so data had to be sharded and owned. Given those limits, splitting work into specialized systems was the correct engineering decision at the time.
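
A back-of-the-envelope estimate shows the size of that random-access penalty. The rotational term follows directly from the spindle speed. The 7,200 RPM speed and 8 ms average seek time are illustrative assumptions for a typical hard disk, not figures from this article.

```python
def hdd_random_iops(rpm: int, avg_seek_ms: float) -> float:
    """Estimate random reads per second for one spinning disk.

    Each random read pays an average seek plus, on average, half a
    rotation before the target sector passes under the head.
    """
    half_rotation_ms = 0.5 * 60_000 / rpm
    service_time_ms = avg_seek_ms + half_rotation_ms
    return 1000 / service_time_ms

# Illustrative assumptions: 7,200 RPM drive, 8 ms average seek.
iops = hdd_random_iops(rpm=7200, avg_seek_ms=8.0)
print(f"about {iops:.0f} random reads per second per disk")
```

Under these assumptions a single disk manages on the order of eighty random reads per second, which is why disk-era systems favored large sequential batches. Flash has no moving parts, so neither the seek nor the rotational term applies.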

What changed is the hardware. Affordable, dense flash removes the random-access penalty, and fast fabrics such as NVMe-over-Fabrics let many compute nodes share one storage pool at close to local speed. DASE is essentially the design you would choose if you started today, on today’s hardware, without thirty years of disk-era assumptions baked in. One engine is possible now not because someone got cleverer about software, but because the physical limits that forced fragmentation have lifted.

What this changes in practice

When the engine is unified, the work changes shape. Instead of mapping each requirement to a different product and then paying to connect them, a single platform adapts to the workload:

  • Operational systems that need fast, consistent writes and analytics that need fast, wide scans share the same data, not copies of it.
  • New data sources, such as a new stream or a new bucket of files, land in the same place instead of spawning a new pipeline.
  • Governance, security, and access control are applied once, to one system, rather than rebuilt for every silo, which is the subject of Part 4.

None of this requires a rip-and-replace. The usual path is to land a new workload or a new data source on the unified engine first, prove it out, and let the old silos age out over time. The point is not a dramatic migration. It is that the next thing you build does not have to add another system to the pile, and the system after that can start to take work away from the silos you already have.

What this means for you

  • If you run storage: capacity becomes one pool you grow without re-sharding, failure domains get simpler, and you stop sizing three separate clusters for three separate peaks.
  • If you run databases: one engine serves both the transactional and the analytical side, so there is no shard map to babysit and no separate reporting copy to keep in sync.
  • If you build AI or machine learning: your training and retrieval data can sit next to the source records, in their original governed home, instead of in yet another store you have to populate.
  • If you own the business outcome: fewer systems means lower cost, fewer integration projects, and faster delivery when a new question or data source appears.

A day in the life of one dataset

To see what unification really means, follow a single dataset through a normal working day instead of following the architecture diagram. Imagine the stream of orders coming off an online store. In a stacked world, those orders are written to a transactional database, then copied overnight into a warehouse for reporting, then exported again into a lake so data scientists can use them, and perhaps copied a fourth time into a vector store so an assistant can answer questions about them. Four copies, four sets of permissions, four chances to fall out of sync, all describing the exact same orders.

On a unified engine the orders are written once and then simply used. The operational application reads and updates them at transactional speed. The analyst’s dashboard scans them for this morning’s trend. The data scientist trains a forecast on the same records without exporting anything. The assistant retrieves the relevant ones to ground an answer. Nothing was copied, nothing drifted, and there is exactly one version of each order that everyone agrees on. The dataset did not move to be useful. It stayed put while the work came to it.

That shift, from moving data to moving the work, is the whole idea in miniature. It sounds modest, but it is the difference between an architecture you spend your time maintaining and one that simply serves whatever is asked of it.

Common questions about a single engine

When people first hear about one engine for tables, files, and streams, a few honest questions tend to come up. They are worth answering plainly.

  • Does one system for everything mean one big risk? No. Because storage is shared and no single node owns the data, losing a node is an availability event rather than a data-loss event, and the platform is built to keep serving through failures. One platform does not mean one fragile point.
  • Will the transactional side slow down when analytics run? This is the classic worry, and it is exactly what the shared all-flash design and snapshot isolation in Part 3 are built to prevent. Big scans and small writes hit a shared pool rather than fighting over one server’s disk.
  • Do I have to move everything at once? No. As noted above, the practical path is to land one new workload or source on the engine, prove it, and let the old silos shed work over time. Nothing forces a single dramatic cutover.
  • Is my data locked into a new proprietary box? No, and Part 4 covers this directly. Keeping data in open formats means the engines that read it stay a choice you can revisit, not a cage.

Key takeaways

  • Data comes in three shapes, structured, semi-structured, and unstructured, and each historically pushed organizations toward a different database.
  • The three archetypes (relational, columnar, NoSQL) each excel at one job and struggle at the others, which forces the choices that create silos.
  • Those choices trace back to shared-nothing designs that tie data to specific servers, and to cloud designs that separate compute from storage across a slow network.
  • DASE separates compute from storage while sharing all storage with all compute over a fast fabric, removing sharding and per-server data ownership.
  • On that foundation, the VAST DataBase unifies tables, files, and streams in one engine, so the data stops moving and the archetype question fades.

The VAST DataBase Engine series

This article is part of a four-part series on the VAST DataBase.


Related reading

Next in this series: Part 3 — Under the Hood: Row, Column, and ACID at Scale.
