AI workloads, data platforms, and infrastructure notes, written from the engineering edge between benchmarks and production.


Why Enterprise AI Needs a New Kind of Vector Database (Part 1 of 2)

What vector databases do, why they power enterprise AI, and why legacy and sharded architectures crack under the scale, speed, and governance real AI workloads demand.


Itzik — VP Mission Alignment, VAST Data



Part 1 of 2: What vector databases actually do, why they have become the backbone of enterprise AI, and why the legacy and sharded architectures most teams reach for first start to crack under the scale, speed, and governance that real AI workloads demand.

Almost every enterprise AI project eventually runs into the same quiet dependency. Behind the chatbot that answers questions about your products, the assistant that searches your internal knowledge, and the recommendation engine that decides what a customer sees next, there is a database doing something most traditional databases were never built to do: finding information by meaning rather than by exact match. That capability lives in a vector database, and the choice of which one you build on turns out to matter far more than most teams expect.

This is the first of two posts. Here we cover the foundations: what vectors and embeddings are, how similarity search powers applications like retrieval augmented generation, and why the architectures most organizations try first, from a Postgres extension to a sharded vector engine, begin to struggle exactly when the workload gets serious. In Part 2 we look at how the VAST Vector Store in the VAST DataBase is built to get past those limits, and what the numbers look like at scale. If you are evaluating vector platforms, or just trying to understand why your current one keeps hitting walls, this is the context that makes the rest make sense.

Executive summary

Enterprise AI is transforming industries at a breakneck pace, and it is driving demand for data systems that are smarter, faster, and more scalable than anything that came before. Traditional data architectures, designed for exact-match lookups and overnight reporting, are struggling to keep up with the scale, speed, and complexity of modern AI workloads. Vector databases have emerged as the foundation that makes AI applications possible, because they let software search by meaning.

But not all vector database architectures are built for what the enterprise actually demands. Many organizations are running AI on systems that were never designed for this kind of scale, and the workarounds that once seemed reasonable, manual sharding, memory-resident indexes, and bolting vectors onto a general purpose database, are now creating real operational risk. This post explains the concepts from the ground up, then walks through where legacy and sharded vector architectures break down: cross-shard query overhead, memory bottlenecks, and the governance silos that form when vectors, metadata, and raw data live in separate systems.

Start with the shift: from matching text to understanding meaning

To understand why vector databases matter, it helps to be precise about what makes them different. A traditional database is built around exact matches. You ask for the customer whose identifier is 4471, or every order placed after a certain date, and the system returns the rows that match those conditions exactly. That model is perfect for running a business, and it is not going away. But it cannot answer a question like “find me the documents that are about roughly the same thing as this one,” because being about the same thing is not an exact match. It is a question about meaning.

A vector database is designed for exactly that question. It manages, stores, and retrieves high-dimensional numerical representations of data, called embeddings, in order to find items that are conceptually similar. Instead of asking whether two things are identical, it asks how close they are in meaning, and returns the nearest matches. This is called similarity search, and it is the capability that nearly every modern AI feature quietly depends on.
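To make "how close they are in meaning" concrete, here is a minimal sketch of similarity search using cosine similarity, one common closeness measure. The three-dimensional vectors and item names are hand-made and hypothetical; a real system would get its vectors from an embedding model and would not score every item one by one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, items, k=2):
    """Exact (brute-force) search: score every stored vector, keep the top k."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hand-made 3-dimensional "embeddings" (hypothetical values, illustration only).
ITEMS = {
    "refund policy": [0.9, 0.1, 0.0],
    "return an item": [0.8, 0.2, 0.1],
    "gpu cluster sizing": [0.0, 0.1, 0.9],
}

# Stands in for the embedding of a question like "how do I get my money back?"
QUERY = [0.85, 0.15, 0.05]

print(nearest(QUERY, ITEMS))
```

The query shares no keywords with the stored items, yet the two refund-related entries rank ahead of the unrelated one because their vectors point in nearly the same direction.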

How embeddings and similarity search work: raw inputs such as text, images, and audio are converted into vectors, then placed in a space where conceptually similar items sit close together, so a query can find its nearest neighbors by meaning rather than by keyword.

The vocabulary that unlocks the rest

A handful of terms come up constantly in this field, and they are worth pinning down in plain language before we go further, because the differences between vector platforms are easier to see once the words are clear.

  • Vector: an array of numerical values that expresses the location of a point along many dimensions. You can think of it as coordinates, except instead of two or three dimensions there are often hundreds or more than a thousand.
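  • Embedding: the vector that an embedding model produces when it converts data, whether text, an image, audio, or something else, into numerical form. The term also names the conversion process itself. In practice the words embedding and vector are used interchangeably, so “the embeddings are stored” and “the vectors are stored” mean the same thing.
  • Dimensions: the number of values in a vector, commonly somewhere between 128 and 1536, though some embedding models produce more. More dimensions can capture more nuance about meaning, at the cost of more storage and more computation per comparison.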
  • Similarity search: finding items that are conceptually related rather than exactly identical, by comparing how close their vectors are.
  • Semantic search: a search technique that finds results based on meaning and context rather than exact keywords, built on vector embeddings.
  • Approximate nearest neighbor (ANN): the family of algorithms used to find the closest matching vectors quickly, without comparing against every single vector one by one. They trade a small amount of accuracy for a large gain in speed.
  • Retrieval augmented generation (RAG): an AI approach that combines real-time data retrieval with a generative model, so the model can produce accurate, up-to-date answers grounded in your own data.
  • AI data pipeline: the end-to-end process of collecting, transforming, embedding, and retrieving data to power AI applications.

With those in hand, the role of a vector database snaps into focus. It is the system that stores embeddings and makes them searchable by meaning, at the speed and scale that an AI application needs. Everything else in this series is about how well different architectures actually do that job.
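Of the terms above, approximate nearest neighbor is the one that benefits most from seeing code. The sketch below uses one simple ANN idea, grouping vectors into buckets around centroids and scanning only the buckets nearest the query. It is an illustration under stated assumptions, not any particular product's index: the data is random, and the centroids are simply the first 16 vectors rather than trained ones.

```python
import math
import random

def build_index(vectors, centroids):
    """Assign each vector id to the bucket of its closest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        closest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        buckets[closest].append(vid)
    return buckets

def ann_search(query, vectors, centroids, buckets, k=3, nprobe=2):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in buckets[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k], len(candidates)

def exact_search(query, vectors, k=3):
    """Baseline: compare the query against every stored vector."""
    return sorted(range(len(vectors)),
                  key=lambda vid: math.dist(query, vectors[vid]))[:k]

rng = random.Random(0)
VECTORS = [[rng.random() for _ in range(8)] for _ in range(1000)]
CENTROIDS = VECTORS[:16]  # simplification: a real index would train its centroids
BUCKETS = build_index(VECTORS, CENTROIDS)
QUERY = [rng.random() for _ in range(8)]

approx_ids, scanned = ann_search(QUERY, VECTORS, CENTROIDS, BUCKETS)
print(f"scanned {scanned} of {len(VECTORS)} vectors")
print("approximate:", approx_ids, "exact:", exact_search(QUERY, VECTORS))
```

Raising `nprobe` scans more buckets and moves the answer closer to the exact result. Setting it to the number of centroids scans everything and matches the brute-force baseline, which is the accuracy-for-speed dial every ANN method exposes in some form.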

Why vector databases became the backbone of enterprise AI

Vector databases are not an academic curiosity. They are the foundation underneath a set of applications that enterprises are racing to deploy right now. Four use cases show up over and over, and each one depends on similarity search working reliably.

Retrieval augmented generation

RAG is the pattern behind most enterprise AI assistants that need to be accurate about your business. A general purpose large language model, such as the one behind ChatGPT, knows a great deal about the world in general, but it knows nothing about your internal policies, your product catalog, or last quarter’s results. RAG closes that gap. Ahead of time, your documents are converted into vectors and stored in a vector database. At question time, the user’s query is also turned into a vector, the database retrieves the most relevant chunks of your proprietary content, and those chunks are handed to the model along with the question. The model then generates an answer grounded in your data rather than guessing.

A retrieval augmented generation pipeline: the user’s question is embedded, the vector database retrieves the most relevant proprietary content by similarity, and the language model combines that context with the question to produce a grounded, accurate answer.

This is why vector search is the common retrieval method for RAG. It is ideal for natural language questions, fuzzy matching where the wording will never be exact, and the unstructured data that makes up most of an enterprise’s knowledge. Without a vector database underneath, a RAG system has nothing reliable to retrieve from.
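The retrieval half of that pipeline can be sketched in a few lines. Everything here is a stand-in: `embed` is a toy word-count function rather than a real embedding model, the chunks are invented, and the final call to a language model is left out, so the sketch stops at the prompt that would be sent.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical proprietary content, embedded ahead of time.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "The enterprise plan includes 24/7 support.",
    "Quarterly results are published on the investor portal.",
]
INDEX = [(embed(chunk), chunk) for chunk in CHUNKS]

def retrieve(question, k=1):
    """Embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda pair: similarity(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def build_prompt(question, k=1):
    """Combine retrieved context with the question for the language model."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How many days do refunds take?"))
```

The model never sees the whole corpus, only the chunks that retrieval selects, which is why the quality of the vector search sets the ceiling on the quality of the answer.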

Intelligent agents, recommendations, and semantic search

The same capability powers three more workloads. Intelligent agents use vector databases to access knowledge bases, find relevant historical context, make informed decisions, and learn from past interactions. Recommendation systems in e-commerce, streaming, and content platforms use vectors to find similar products or content, personalize what each user sees, power “customers also bought” features, and drive discovery. Semantic search goes beyond keyword matching to understand intent, so users can search by meaning rather than exact words, find related content even across languages, and ask questions in natural language, with dramatically better relevance than a keyword index can offer.

The throughline is simple: vector databases store meaning, not just data. Embeddings let AI compare, retrieve, and reason over information, which is why a purpose-built vector database has become essential infrastructure rather than a nice-to-have.

That last point deserves emphasis. Enterprise AI requires purpose-built infrastructure precisely because traditional databases were not designed for the speed, the scale, or the unstructured data types these workloads involve. A vector database fills that gap. The open question, and the one that decides whether an AI initiative scales gracefully or painfully, is which kind of vector database you build on.

Where the first choices start to crack

When teams first add vector search, they reach for whatever is closest. Often that means a familiar relational database with a vector extension added on, or a dedicated vector engine that keeps its index in memory and splits data across nodes as it grows. These choices are reasonable on day one. The trouble is that they inherit a set of architectural assumptions that do not hold up as data grows from millions of vectors into billions, and as the workload moves from a pilot to something the business depends on.

It is worth being specific about the mechanics, because the limitations are not bugs to be patched. They are consequences of the architecture itself. Three of them tend to surface together: the cost of querying across shards, the cost of keeping indexes in memory, and the governance silos that form when your data is scattered across systems.

The complexity of sharding and cross-shard search

Sharding means dividing a database into smaller, independent pieces, called shards, and distributing them across multiple nodes. Partitioning is the related idea of organizing data into segments based on a key such as time or user identifier. Many vector systems use a shared-nothing architecture, where each node owns and manages its own slice of the data with no overlap. On paper this is how you scale beyond one machine. In practice it introduces a steady stream of operational overhead.

Manually dividing data into shards requires careful planning and ongoing management. As data grows or usage patterns change, rebalancing shards becomes time-consuming and error-prone. The result is often uneven distribution, with some shards running hot while others sit idle, and an elevated risk of downtime during maintenance or scaling events. That is the cost before a single query runs.
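One way to see why rebalancing is expensive is to count how much data moves when a shard is added. The sketch below assumes a simple hash-modulo placement scheme, which is only one of several ways systems assign data to shards, and the key names are hypothetical.

```python
import hashlib

def shard_for(key, num_shards):
    """Hash-mod placement: stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def fraction_moved(keys, old_shards, new_shards):
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(1 for key in keys
                if shard_for(key, old_shards) != shard_for(key, new_shards))
    return moved / len(keys)

KEYS = [f"vector-{i}" for i in range(10_000)]  # hypothetical vector ids
print(f"{fraction_moved(KEYS, 4, 5):.0%} of vectors change shards going from 4 to 5")
```

Under this scheme a key stays put only when its hash leaves the same remainder modulo 4 and modulo 5, which happens about one time in five, so roughly 80% of the data has to be copied to add a single shard. Schemes such as consistent hashing reduce that fraction, but they do not remove the need to plan and run the migration.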

Querying makes it worse. When a search needs data from multiple shards, and a large-scale vector search almost always does, the query has to be broadcast to every relevant node. Each node searches its own portion, returns its best results, and a coordinator merges and re-ranks everything before replying. This fan-out and fan-in pattern adds network latency and coordination overhead to every request.

Cross-shard vector search in a sharded, shared-nothing system: every query fans out to all shards, each searches its own memory-resident index, and the coordinator must wait for all responses before merging and re-ranking, so a single slow shard stalls the entire search.

Walk through the steps and the fragility becomes obvious. A coordinator node receives the query and decides which shards are involved. It broadcasts the query to every shard, increasing network traffic and coordination complexity. Each shard searches its own, often memory-resident, index, limited by its local resources. The shards return their top results to the coordinator, which must wait for all of them before proceeding. The coordinator then merges, sorts, and re-ranks the combined results, adding processing time and the potential for inconsistency. Only then are the final results sent to the client. A delay or failure at any one of those steps degrades the experience, and if a single shard is overloaded or out of sync, the entire search slows down or fails.
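The steps above can be sketched as a small scatter-gather simulation. The shard contents and per-shard latencies are made-up numbers, and the shards are searched in a plain loop rather than over a network, so treat this as a model of the pattern rather than a benchmark.

```python
import heapq
import math

def search_shard(shard, query, k):
    """Each shard returns its own top-k as (distance, id), sorted ascending."""
    scored = [(math.dist(vec, query), vid) for vid, vec in shard["vectors"].items()]
    return heapq.nsmallest(k, scored)

def scatter_gather(shards, query, k=3):
    """Coordinator: broadcast the query, wait for every shard, merge and re-rank."""
    partials = [search_shard(shard, query, k) for shard in shards]  # fan-out
    merged = heapq.nsmallest(k, heapq.merge(*partials))             # fan-in
    # The coordinator cannot answer until the slowest shard has replied.
    latency_ms = max(shard["latency_ms"] for shard in shards)
    return [vid for _, vid in merged], latency_ms

# Hypothetical shards: two healthy, one overloaded.
SHARDS = [
    {"latency_ms": 4, "vectors": {"a": (0.1, 0.2), "b": (0.9, 0.8)}},
    {"latency_ms": 5, "vectors": {"c": (0.3, 0.1), "d": (0.5, 0.5)}},
    {"latency_ms": 120, "vectors": {"e": (0.15, 0.15), "f": (0.7, 0.9)}},
]

ids, latency = scatter_gather(SHARDS, query=(0.1, 0.1))
print(ids, f"{latency} ms")
```

Two shards answer in a few milliseconds, but the request takes as long as the overloaded one. The coordinator cannot skip that shard, because it holds the closest match to the query.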

The memory wall

The second limit is about memory. Many vector databases rely on keeping their index fully loaded into RAM, because memory-resident indexes, such as HNSW graphs or the in-memory indexes built with the FAISS library, deliver fast search, but only as long as the entire index fits in memory. That qualifier is the whole problem.

As datasets grow into the billions of vectors, keeping everything in memory becomes prohibitively expensive. Operators are forced into an uncomfortable choice: add ever more RAM, which drives up cost relentlessly, or split the data into still more shards, which compounds the coordination problem described above. And when an index can no longer fit and spills to disk, search performance drops sharply, which undermines the very responsiveness that real-time AI requires.
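A rough calculation shows how quickly the numbers get uncomfortable. This sketch assumes 1536-dimensional vectors stored as 32-bit floats and counts only the raw vector data. Index structures add their own overhead on top, which varies by index type and is not modeled here.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Size of the raw vectors alone, assuming float32 values (4 bytes each)."""
    return num_vectors * dimensions * bytes_per_value

for count in (10_000_000, 1_000_000_000, 50_000_000_000):
    terabytes = raw_vector_bytes(count, 1536) / 1e12
    print(f"{count:>14,} vectors x 1536 dims = {terabytes:,.2f} TB")
```

Ten million vectors come to about 0.06 TB, which fits in one server's memory. One billion come to about 6.1 TB and fifty billion to about 307 TB, all of which would have to sit in RAM across the cluster for a fully memory-resident index.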

The memory wall of in-memory vector indexes: they are fast only while the index fits in RAM. As data grows into the billions, cost climbs unsustainably, and when the index spills to disk, search latency jumps.

This is why scale exposes the design. Memory-resident indexes are genuinely fast at modest sizes, which is why they are popular for pilots and demos. But the cost and complexity of maintaining enough memory to hold billions of vectors becomes unsustainable for most enterprises. Sharded and distributed systems try to work around it by spreading data across nodes, but that only shifts the problem rather than solving it. Operators end up constantly balancing memory usage, managing index spills to disk, and handling the operational risk of node failures and uneven data growth. The outcome is fragile, expensive infrastructure that struggles to deliver the low-latency, high-throughput performance modern AI needs.

Governance silos

The third limit is the one that security and compliance teams feel most acutely, and it often gets overlooked until late in an evaluation. Fragmented architectures tend to separate vectors, metadata, and the raw source data across different systems. The embeddings live in a vector engine, the structured attributes live in a database, and the original files or objects live in a data lake or object store.

Fragmented vector architectures scatter raw data, metadata, and embeddings across separate systems, each with its own tools and policies, which creates governance silos, inconsistent controls, and compliance risk.

Each of those systems comes with its own tools and its own access policies. Keeping them consistent is hard, and the gaps between them are exactly where risk lives. Enterprises face an increased operational burden as they try to enforce policies and manage data lineage across disconnected platforms, and an attacker or a simple misconfiguration only needs one of those systems to be out of step. When the data that feeds an AI model is governed inconsistently, you cannot give regulators, security teams, or legal departments the clean answer they need before an initiative can scale.

Why teams pick these architectures anyway

It would be unfair to suggest that anyone chose a fragile design on purpose. Each of these starting points is a sensible local decision. A team that already runs PostgreSQL adds a vector extension because it is right there and the data is already in the database. A team that wants the fastest possible search at small scale picks an in-memory engine because, at the size of a proof of concept, nothing feels faster. A team with no appetite for managing infrastructure picks a managed service because it removes operational work on day one. None of those choices is wrong in the moment they are made.

The difficulty is that the moment they are made is exactly when the workload is smallest and the trade-offs are invisible. A few million vectors fit comfortably in memory, a single shard needs no coordination, and a single pod has plenty of headroom. The architecture only reveals its limits later, when the pilot succeeds and the dataset grows by an order of magnitude, when more applications start querying the same index, and when the security team finally asks how the embeddings are governed. By then the system is in production, the cost of changing it has gone up, and the workarounds have started to accumulate. This is why it pays to understand the limits before you commit, rather than after.

What a better architecture would have to do

If the problems are sharding overhead, the memory wall, and governance silos, then the requirements for something better follow directly. It is worth stating them plainly, because they are the yardstick against which any vector platform should be measured, and they are exactly what Part 2 examines in the VAST Vector Store.

  • Scale without sharding. Growth should mean adding resources, not manually splitting data, rebalancing shards, or provisioning more pods. The system should hold billions or trillions of vectors without the operator having to think about how the data is divided.
  • Stay fast without living in memory. Search latency should remain low even when the dataset is far larger than RAM, which means the index cannot depend on being fully memory-resident, and performance must not fall off a cliff when data spills to disk.
  • Keep one copy, governed once. Vectors, metadata, and source data should live in one system under one set of access controls, so there are no silos to reconcile and no gaps for risk to hide in. Permissions on the source data should extend to the vectors derived from it.
  • Let you query meaning and structure together. Vector similarity and ordinary filters (dates, categories, attributes) belong in the same query, so applications are not forced to stitch results together across systems.

Notice that these requirements are not a wish list of features. They are the direct inverse of the three limitations above. A platform that meets them does not just perform better on a benchmark; it removes the operational and governance burdens that make large-scale vector search painful in the first place.
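The last two requirements are easier to picture with a small example. The sketch below keeps each vector in the same record as its metadata and the permissions of its source document, so one query can apply access control, an ordinary filter, and similarity ranking together. The `Record` type, field names, and data are all hypothetical, and this is not the VAST Vector Store API.

```python
import math
from dataclasses import dataclass

@dataclass
class Record:
    doc_id: str
    category: str
    allowed_groups: frozenset  # permissions set on the source document
    vector: tuple              # embedding derived from that document

def search(records, query, user_groups, category=None, k=2):
    """One pass: permission check, structured filter, then similarity ranking."""
    visible = [
        r for r in records
        if r.allowed_groups & user_groups                  # inherited access control
        and (category is None or r.category == category)  # ordinary filter
    ]
    visible.sort(key=lambda r: math.dist(r.vector, query))
    return [r.doc_id for r in visible[:k]]

RECORDS = [
    Record("hr-policy", "hr", frozenset({"hr"}), (0.1, 0.9)),
    Record("pricing-2024", "sales", frozenset({"sales", "finance"}), (0.8, 0.2)),
    Record("pricing-draft", "sales", frozenset({"finance"}), (0.82, 0.18)),
]

print(search(RECORDS, (0.8, 0.2), {"sales"}, category="sales"))
print(search(RECORDS, (0.8, 0.2), {"finance"}, category="sales"))
```

A user in the sales group never sees the draft, even though its vector is almost identical to the published document, because the permission travels with the record instead of living in a separate system that has to be kept in sync.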

What this means depending on where you sit

The same architectural limits land differently on different teams, so here is the short version for four readers who will each recognize their own version of the problem.

  • If you run infrastructure or storage: the memory wall is your budget line. You are the one provisioning ever more RAM or GPU memory to keep indexes resident, and adding nodes to hold a dataset that keeps growing. An architecture that scales on shared storage instead of memory changes that conversation.
  • If you run databases or platforms: sharding and rebalancing are your operational tax. You are the one planning shard maps, handling hot shards, and coordinating maintenance windows. A system with no shards to manage takes that recurring work off your plate.
  • If you build AI or machine learning: staleness and plumbing are your bottleneck. You are the one waiting for a separate vector store to sync, and rebuilding pipelines when sources change. Vectors that live next to their source data remove a copy and a sync job.
  • If you own security or compliance: the silos are your risk. You are the one trying to prove that embeddings are governed the same way as the documents they came from, across disconnected systems. Unified, inherited permissions are what let you give a clean answer.

Different seats, same root cause. All four are paying, in their own currency, for an architecture that was adapted to vector search rather than built for it.

What this adds up to

Put the three together and a pattern emerges. As your data grows, legacy and sharded vector architectures do not simply get a little slower. They threaten reliability, they increase cost, and they create operational chaos, all at the same moment the business is trying to lean on AI more heavily. The systems were adapted from a different era, and adaptation has limits.

None of this means vector search is the problem. Vector search is the foundation, and it is not going away. The problem is the architecture underneath it. A system that treats vector search as something bolted onto a general purpose database, or as a memory-resident index that must be sharded to grow, carries the cost of those choices into every query and every audit.

The right question is not “which vector database is fastest in a small benchmark,” but “which architecture still delivers predictable performance, simple operations, and unified governance when the dataset reaches billions or trillions of vectors.” That is a question about design, not tuning.

That is exactly the question Part 2 takes up. We will look at how the VAST Vector Store in the VAST DataBase approaches vector search differently: storing vectors as a first-class data type alongside your structured data, replacing brute-force and memory-bound indexing with a hierarchical clustering approach that scales logarithmically, and unifying governance across every data type. We will also put real numbers on it, including how the architecture delivered substantially lower cost per search at fifty billion vectors while holding latency low. The foundations in this post are what make those results meaningful.

For now, the takeaway is straightforward. Vector databases are the backbone of enterprise AI, but the architecture you choose determines whether that backbone holds up under load. If your current platform forces you to shard manually, to keep buying memory, or to stitch governance across disconnected systems, those are not isolated annoyances. They are the predictable symptoms of a design that was never built for the scale AI now demands.

Related reading: the VAST Vector Store is built inside the VAST DataBase. See Inside the VAST DataBase Engine for the storage foundation, and From Lakehouse to AI on the VAST DataBase for how it powers RAG and analytics.
