Lots of Data

  • The History and Future of the KV Cache, and How VAST Data Is Transforming It from a Temporary, Disposable Byproduct of Inference into a Persistent, Manageable, and Valuable Data Asset

    In the rapidly evolving world of artificial intelligence, we often hear about massive parameter counts, powerful GPUs, and breakthrough model architectures. But there’s a silent workhorse behind the scenes, a critical component that has enabled the incredible growth in LLM capabilities and is now at the center of the next great leap in AI infrastructure. That component is the KV Cache.

    This is the story of KV Cache: what it is, how it became the biggest bottleneck in AI, and the revolutionary solution announced by NVIDIA and VAST Data at CES 2026 that promises to unlock the era of true “Agentic AI.”

    Part 1: What is KV Cache? A Simple Explanation

    Imagine you’re a chef in a busy kitchen. You get an order for a complex dish. As you prepare it, you need to keep track of all the ingredients you’ve already chopped, the spices you’ve added, and the cooking times for each component. If you had to re-chop every vegetable and re-measure every spice for every new step of the recipe, you’d never finish. Instead, you keep the prepared ingredients in bowls on your counter—a “cache” of past work—ready to be used instantly.

    In the world of Large Language Models (LLMs) like GPT-4, the process is similar. When an LLM generates text, it does so one word (or token) at a time. To generate the next token, it needs to understand the context of all the tokens that came before it.

    This is where the Key-Value (KV) Cache comes in. Inside the model’s “attention mechanism,” for every token processed, two mathematical vectors are created: a Key (which helps identify the token) and a Value (which holds its semantic meaning).

    Without a cache, the model would have to re-compute these Key and Value vectors for every single previous token for each new word it generates. This would be incredibly slow and inefficient. The KV cache stores these vectors in the GPU’s high-speed memory, so they only need to be computed once. When generating the next token, the model simply looks up the pre-computed Keys and Values from the cache, saving a massive amount of computational work.
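
    To make the mechanics concrete, below is a minimal, illustrative sketch (toy NumPy code with random weights, not a real model) of single-head attention during generation: each new token's Key and Value are appended to a cache, and attention simply reads the cached vectors instead of recomputing them.

    import numpy as np

    d = 64                                                   # toy embedding / head dimension
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))   # frozen stand-in "model" weights

    def attend(q, K, V):
        # Single-head scaled dot-product attention for one query vector.
        scores = K @ q / np.sqrt(d)                          # one score per cached token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                                   # weighted blend of cached Values

    def generate(num_tokens):
        K_cache = np.empty((0, d))                           # grows by one row per token
        V_cache = np.empty((0, d))
        x = np.random.randn(d)                               # embedding of the current token
        for _ in range(num_tokens):
            # Keys/Values are computed once for the new token; older ones come from the cache.
            K_cache = np.vstack([K_cache, Wk @ x])
            V_cache = np.vstack([V_cache, Wv @ x])
            x = attend(Wq @ x, K_cache, V_cache)             # stand-in for the rest of the model
        return K_cache.shape[0]

    print(generate(16))                                      # 16: each token's K/V was computed exactly once

    Without the cache, every generation step would have to redo the Key and Value computation for every earlier token, which is exactly the blow-up the cache avoids.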

    Part 2: The History of KV Cache: From Novelty to Bottleneck

    The concept of KV caching is as old as the Transformer architecture itself, which powers nearly all modern LLMs. In the early days, models were relatively small, and the “context window”—the amount of text the model could consider at once—was limited to a few hundred or a few thousand tokens. The KV cache for a single request could easily fit within the ample memory of a data center GPU. It was a neat optimization, a solved problem.

    Then, the AI race began. Models grew exponentially in size, and so did the demand for longer context windows. We went from processing paragraphs to entire books, from simple Q&A to complex, multi-step reasoning tasks.

    This created a massive problem. The size of the KV cache grows linearly with the sequence length and the batch size (the number of simultaneous requests). For a model with a 100k token context window, the KV cache for a single user can be tens of gigabytes. Multiply that by dozens of concurrent users, and you quickly exhaust the memory of even the most powerful GPUs.
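
    As a back-of-the-envelope check (treat the model parameters below as illustrative, not a reference to any particular product), the cache size is roughly 2 (Keys and Values) x layers x KV heads x head dimension x sequence length x bytes per element:

    def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
        # Rough KV cache size for a single request, assuming FP16 (2 bytes) by default.
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

    # Hypothetical 70B-class model with grouped-query attention:
    # 80 layers, 8 KV heads, head dimension 128, a 100k-token context, FP16.
    per_user = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=100_000)
    print(f"{per_user / 1e9:.1f} GB per user")                       # ~32.8 GB
    print(f"{32 * per_user / 1e9:.0f} GB for 32 concurrent users")   # ~1 TB, far more than a single GPU holds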

    The AI workload shifted from being compute-bound (limited by how fast the GPU could do math) to being memory-bound (limited by how much data the GPU could hold). The GPU’s precious high-bandwidth memory (HBM) was no longer just for model weights; it was being swallowed whole by the KV cache.

    The Era of Optimization

    Faced with this bottleneck, researchers and engineers developed a series of clever optimizations to squeeze more performance out of existing hardware:

    • Multi-Query Attention (MQA) & Grouped-Query Attention (GQA): These techniques modify the model architecture to use fewer Key and Value heads, significantly reducing the memory footprint of the cache at the cost of a small amount of model quality.
    • FlashAttention: A groundbreaking software technique that optimizes how the GPU reads and writes data to memory, reducing the time spent moving data back and forth and speeding up attention calculations.
    • Quantization: Instead of storing the Keys and Values in high-precision 16-bit formats (FP16), they can be compressed into 8-bit (INT8) or even 4-bit formats. This dramatically reduces memory usage but requires careful implementation to avoid losing accuracy (a short sketch of the idea follows this list).
    • PagedAttention (from vLLM): Inspired by operating system virtual memory, PagedAttention manages the KV cache in non-contiguous memory blocks. This eliminates memory fragmentation and allows for much more efficient use of the GPU’s available memory, enabling larger batch sizes.
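
    To illustrate the quantization idea mentioned above, here is a hedged NumPy sketch of per-token symmetric INT8 quantization of a cached Key or Value tensor; production serving stacks use more sophisticated schemes, so treat this purely as a conceptual example:

    import numpy as np

    def quantize_kv_int8(kv_fp16):
        # kv_fp16: [seq_len, kv_heads, head_dim] in float16 -> (int8 tensor, per-token scales).
        kv = kv_fp16.astype(np.float32)
        scales = np.abs(kv).max(axis=(1, 2), keepdims=True) / 127.0    # one scale per token
        scales = np.maximum(scales, 1e-8)                              # guard against all-zero tokens
        q = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)  # half the bytes of FP16
        return q, scales

    def dequantize_kv(q, scales):
        return (q.astype(np.float32) * scales).astype(np.float16)

    kv = np.random.randn(1024, 8, 128).astype(np.float16)              # a small slice of cache
    q, s = quantize_kv_int8(kv)
    err = np.abs(dequantize_kv(q, s).astype(np.float32) - kv.astype(np.float32)).mean()
    print(q.nbytes, "bytes vs", kv.nbytes, "bytes; mean abs error:", err)

    Halving the bytes per element halves the cache footprint, which is why 8-bit and even 4-bit KV caches are so attractive once memory, not compute, is the bottleneck.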

    These innovations were crucial for deploying models like Llama 2 and Mistral, but they were all fighting a losing battle against the insatiable demand for more context.

    Part 3: The “Agentic AI” Problem and the CES 2026 Revolution

    The next frontier of AI is Agentic AI: systems of autonomous agents that can plan, reason, use tools, and collaborate to solve complex, long-horizon problems. Think of an AI software engineer that doesn’t just write a function but architects an entire application, debugging and iterating over days or weeks.

    For these agents to be effective, they need persistent, long-term memory. They need to remember what they did yesterday, what their goals are, and the context of their collaboration with other agents. The KV cache is the perfect representation of this memory. But storing terabytes of KV cache in scarce, expensive GPU memory for days is simply not feasible.

    This was the problem that NVIDIA and VAST Data set out to solve, culminating in their game-changing announcements at CES 2026.

    NVIDIA’s Inference Context Memory Storage Platform

    At CES 2026, Jensen Huang took the stage to announce a new class of AI infrastructure: the NVIDIA Inference Context Memory Storage Platform. The core idea is simple but revolutionary: disaggregate the KV cache.

    Instead of trapping the KV cache inside the GPU, this new platform allows it to be stored in a specialized, high-performance external storage system. The GPU can then fetch only the parts of the cache it needs, when it needs them, over an ultra-fast network.

    The key enablers for this are:

    1. NVIDIA BlueField-4 DPU (Data Processing Unit): The “brains” of the operation. The BlueField-4 sits between the GPU and the storage, managing the data placement, handling security, and offloading the complex task of managing the KV cache from the GPU.
    2. NVIDIA Spectrum-X Ethernet: The high-speed network fabric. Using RDMA (Remote Direct Memory Access), Spectrum-X allows the GPUs to access the remote KV cache with incredibly low latency, almost as if it were in their own local memory.

    This architecture provides massive benefits:

    • Virtually Unlimited Context: The size of the KV cache is no longer limited by GPU memory but by the capacity of the storage system, which is far cheaper and more scalable.
    • Context Sharing: Multiple GPUs and even multiple different AI agents can share the same KV cache, enabling seamless collaboration (a toy sketch of this idea follows this list).
    • Increased Throughput: By freeing up GPU memory, more batches can be processed simultaneously, boosting the number of tokens generated per second.
    • Improved Power Efficiency: It’s much more power-efficient to store data in a dedicated storage system than in power-hungry GPU HBM.
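
    To give a feel for what context sharing means in practice, here is a toy, hedged sketch of a prefix cache (an illustration of the general idea, not NVIDIA's or VAST's actual implementation): requests or agents that begin with the same prompt look up its already-computed Key/Value blocks instead of recomputing them.

    import hashlib
    import numpy as np

    class SharedPrefixKVStore:
        # Toy prefix cache: requests that share a prompt prefix reuse its K/V blocks.
        def __init__(self):
            self.store = {}                                   # prefix hash -> cached K/V

        def key(self, tokens):
            return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

        def get_or_compute(self, tokens, compute_kv):
            k = self.key(tokens)
            if k not in self.store:                           # first request pays the prefill cost
                self.store[k] = compute_kv(tokens)
            return self.store[k]                              # later requests or agents just look it up

    def compute_kv(tokens):
        # Stand-in for a real prefill pass: one Key and one Value vector per token.
        return np.random.randn(len(tokens), 2, 64)

    cache = SharedPrefixKVStore()
    system_prompt = list(range(1000))                         # a long shared prefix
    kv_a = cache.get_or_compute(system_prompt, compute_kv)    # computed once
    kv_b = cache.get_or_compute(system_prompt, compute_kv)    # reused by a second request
    print(kv_a is kv_b)                                       # True: both point to the same cached blocks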

    VAST Data: The AI Operating System for the Agentic Era

    VAST Data, a leader in AI data platforms, announced its role as a key partner in this new ecosystem. The VAST Data Platform is the first to run its software, the VAST AI Operating System, directly on the NVIDIA BlueField-4 DPUs.

    By running on the DPU, VAST’s software sits right in the data path, managing the flow of KV cache between the GPUs and VAST’s scalable all-flash storage. This integration is what makes the entire system practical for enterprise use.

    VAST’s contribution goes beyond just raw storage. They are providing the data services needed for a world of persistent AI agents:

    • Data Management: Efficiently storing, retrieving, and managing the lifecycle of billions of KV cache objects.
    • Security & Isolation: Ensuring that one agent’s context is secure and cannot be accessed by unauthorized agents.
    • Auditability: Tracking who accessed what context and when, which is crucial for regulated industries.

    In essence, VAST Data is transforming the KV cache from a temporary, disposable byproduct of inference into a persistent, manageable, and valuable data asset.

    Conclusion: A New Foundation for AI

    The journey of the KV cache from a simple optimization to a central pillar of AI infrastructure is a testament to the incredible pace of innovation in this field. The announcements from NVIDIA and VAST Data at CES 2026 are not just about faster chips or bigger drives; they represent a fundamental rethinking of how we build AI systems.

    By disaggregating memory and enabling persistent, shared context, they have laid the foundation for the next generation of AI: agents that can think, plan, and collaborate over long periods to solve the world’s most complex problems. The silent workhorse has finally taken center stage.

  • The Great Storage Squeeze of the 2020s: Why Skyrocketing Component Costs Are Exposing the Fatal Economic Flaws of Legacy Architectures (And Why VAST DASE is the Only Viable Escape Route)

    We are currently living through a “perfect storm” in the data infrastructure world, a convergence of trends that is putting unprecedented pressure on IT budgets.

    On one front, the sheer gravitational pull of data demand is becoming exponential. We are past the era of simple file storage. We are now in the age of consolidating virtual machines and containers, generative AI training sets measured in hundreds of terabytes, high resolution volumetric video, relentless IoT sensor logging, and modernized enterprise backup strategies that require instant recovery capabilities. The world is generating petabytes of data that doesn’t just need to be stored; it needs to be instantly accessible, highly performant, and “always on.”

    On the other front, the underlying economics of the hardware required to store that data have turned hostile. For nearly a decade, IT leaders got used to a comfortable trend: flash prices went down, and density went up. We took it for granted.

    That trend has violently reversed.

    We have entered an era of the “Great Storage Squeeze.” The cost of NAND flash (SSDs) is rising sharply, driven by calculated production cuts by major manufacturers and supply chain constraints. Simultaneously, the cost of DRAM is skyrocketing, driven by the insatiable appetite of AI GPU servers gobbling up high performance memory (HBM) and the industry transition to more expensive DDR5 standards.

    If you are relying on traditional storage architectures designed fifteen or twenty years ago, this squeeze isn’t just uncomfortable; it’s financially debilitating.

    In this new economic reality, efficiency is no longer just a “nice to have” feature on a datasheet; it is the single most critical metric for Total Cost of Ownership (TCO). If your architecture wastes flash or squanders RAM, your budget is bleeding.

    This post will take a deep dive into why legacy dual-controller architectures are financially disastrous in the current market, and how VAST Data’s unique Disaggregated Shared Everything (DASE) architecture and revolutionary Similarity Engine offer the only viable economic escape route for petabyte-scale organizations.

    The Legacy Trap: Why Dual Controller Architectures Can’t Survive the New Economics

    For over two decades, the HA pair (a dual-controller, high-availability array) has been the standard Lego brick of enterprise storage. You buy a physical chassis containing two controllers for redundancy and a set of drives plugged into the back.

    Those two controllers “own” those drives. No other controller in your data center can touch them.

    This siloed approach worked perfectly fine when enterprise data sets were measured in tens of terabytes. But as organizations scale into multiple petabytes, it introduces massive, expensive inefficiencies that are magnified by rising component costs.

    The Silo Problem and the Curse of Stranded Capacity

    Traditional architectures scale by adding more “silos.” If you fill up Array A, you must buy Array B. If you fill up Array B, you buy Array C. And if you need more performance than the dual controllers can provide, you guessed it: you add another dual-controller pair.

    The fundamental economic flaw is that these arrays are islands. They do not share resources.

    Imagine you have ten separate legacy arrays. It is statistically improbable that all ten will be perfectly utilized at 85%.

    • Array A, hosting a mature database, might be 95% full, constantly triggering capacity alarms.
    • Array B, bought for a project that got delayed, might be sitting at 20% utilization.

    In a legacy world, Controller Pair A cannot utilize the stranded capacity trapped behind Controller Pair B. You have terabytes of expensive flash sitting idle in one rack, yet you are forced to issue a purchase order for more expensive flash for the adjacent rack simply because the empty space is trapped in the wrong silo.

    Industry analysts estimate that in large scale traditional environments, 25% to 35% of total purchased flash capacity is perpetually stranded due to imperfect silo balancing. When SSD prices are skyrocketing, paying for 30% more flash than you actually use is a massive, unacceptable tax on your organization.


    The “Dirty Little Secret” of Legacy Data Reduction

    This is perhaps the most critical financial deficiency of traditional architectures in a multi petabyte environment, yet it is rarely discussed openly by legacy vendors.

    Every vendor touts their deduplication and compression capabilities. They promise 3:1, 4:1, or sometimes 5:1 data reduction ratios. But there is a massive caveat hidden in the architectural fine print: Deduplication is bounded by the cluster boundary.

    Because traditional architectures are based on independent silos, they only “know” about the data within their specific domain. They have no global awareness.

    The Multi Cluster Duplication Disaster: A Real World Scenario

    Let’s visualize a very common enterprise workflow to understand the scale of this economic waste.

    1. Production: You have a primary, high performance legacy all-flash array (Cluster A) holding 1PB of critical production data.
    2. Disaster Recovery: You require a remote copy. You replicate that 1PB to a secondary cluster (Cluster B) at a different site.
    3. Dev/Test: Your development teams need realistic data to work with. You spin up a third cluster (Cluster C) and clone the production environment for them.
    4. Analytics: Your data science team needs to run heavy queries without impacting production. You extract that data to a fourth data lake cluster (Cluster D).

    In a traditional, shared nothing world, Cluster B, C, and D have absolutely no knowledge of the data sitting on Cluster A.

    Even though the 1PB of data across all four sites is 99% identical, every single cluster will re-ingest it, re-process it, re-hash it, and store it as unique physical blocks.

    The Economic Reality: You have 1PB of actual corporate information, but you have purchased and are powering 4PB of expensive flash to store it.

    In a world of rising SSD costs, this inability to deduplicate globally across your entire environment is a financial catastrophe. It forces you to buy the same expensive terabyte over and over again.


    The Paradigm Shift: VAST Data’s Disaggregated Shared Everything (DASE)

    To solve an efficiency crisis of this magnitude, you cannot just tweak the old model with faster CPUs. You have to break the architecture completely.

    VAST Data realized that the shared nothing, dual controller approach was a dead end for petabyte scale. Instead, VAST built the DASE (Disaggregated Shared Everything) architecture from the ground up to align with modern hardware realities.

    How DASE Breaks the Silos

    DASE fundamentally separates the “brains” of storage (compute logic) from the “media” (persistence).

    1. The Compute Layer (The Brains): These are stateless Docker containers running on standard servers. They handle all the complex logic—NFS/S3/SMB/Block protocols, erasure coding, encryption, and data reduction. Crucially, they hold no persistent state. If a node fails, another one instantly takes over without a long rebuild process. You can scale performance linearly just by adding more stateless containers.
    2. The Persistence Layer (The Media): This is a giant, shared pool of highly available NVMe JBOFs (Just A Bunch Of Flash). These enclosures contain no logic, only media. They hold a mix of expensive, ultra fast Storage Class Memory (SCM) for write buffering and metadata, and dense, affordable QLC flash for long term storage.
    3. The Interconnect (NVMe-oF): A high speed, low latency Ethernet fabric connects everything to everything.

    The crucial difference that changes the economics: Every single compute node can see and access every single SSD in the entire cluster directly over the network at NVMe speeds.

    Why DASE is an Economic Fortress

    Because everything is shared, there are no silos. There is absolutely no stranded capacity.

    The entire cluster is a single pool of storage. If you are at 70% capacity, you are at 70% utilization across every drive. You never have to over-provision one resource just to get more of another.

    Furthermore, DASE unlocks the economic potential of QLC Flash. QLC (Quad Level Cell) flash is significantly denser and cheaper than the TLC flash used by most legacy arrays. However, QLC has low endurance—it wears out quickly if you write to it randomly, the way legacy controllers do.

    VAST’s DASE architecture uses the ultra fast SCM layer to absorb all incoming writes, organizing them into massive, perfectly sequential stripes before laying them down gently onto the cheap QLC flash. This allows VAST systems to use low cost QLC for 98% of their capacity while offering a 10 year endurance guarantee, something legacy architectures simply cannot achieve.
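
    A heavily simplified sketch of that write-shaping idea (conceptual only, not VAST's actual code): small incoming writes are absorbed by a fast buffer and flushed to the QLC tier only as large, sequential stripes, which is the access pattern low-endurance flash tolerates best.

    class WriteShaper:
        # Toy model: absorb small random writes, emit only large sequential stripes.
        def __init__(self, stripe_bytes):
            self.stripe_bytes = stripe_bytes
            self.buffer = []                                 # stands in for the fast SCM write buffer
            self.buffered = 0
            self.stripes_written = 0

        def write(self, payload):
            self.buffer.append(payload)                      # acknowledged from the fast buffer
            self.buffered += len(payload)
            if self.buffered >= self.stripe_bytes:
                self.flush()

        def flush(self):
            # In the real system this would be one big, sequential write laid down on QLC.
            self.stripes_written += 1
            self.buffer.clear()
            self.buffered = 0

    shaper = WriteShaper(stripe_bytes=4096)                  # tiny stripe size so the demo runs instantly
    for _ in range(100):
        shaper.write(b"x" * 512)                             # 100 small "random" writes
    print(shaper.stripes_written, "large sequential stripes instead of 100 small writes")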


    The Secret Weapon: The VAST Similarity Engine

    We established that rising SSD costs make capacity efficiency paramount. But we also established that rising RAM (DDR5) costs are painful.

    Traditional deduplication is terrible at both.

    • It’s Fragile (Bad Capacity Efficiency): Old school dedupe breaks data into fixed blocks (e.g., 8KB). It creates a mathematical “hash” (a fingerprint) of that block. If a single bit changes in that 8KB block, the hash changes completely, and the dedupe fails. It only catches exact matches. It fails miserably on encrypted data, compressed logs, or genomic sequencing data where blocks are almost identical but not perfect matches.
    • It’s RAM Hungry (Bad Memory Efficiency): To know if a block is a duplicate, the storage controller must keep a massive table of every single hash it has ever seen. Where does that table live? In incredibly expensive, fast DRAM. As your data grows to petabytes, the required DRAM table grows linearly, becoming prohibitively expensive. (Both problems are illustrated in the sketch below.)
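
    Here is a toy, hedged illustration of both problems (generic fixed-block dedupe, not any particular vendor's engine): a single changed byte per block defeats the exact-match hashing, and the index that makes lookups fast has to live in DRAM and grows with every unique block.

    import hashlib, os

    BLOCK = 8 * 1024                                         # classic fixed 8KB dedupe blocks

    class FixedBlockDedupe:
        def __init__(self):
            self.index = {}                                  # hash -> block id; must live in fast DRAM
            self.stored = 0                                  # unique blocks physically written

        def ingest(self, data):
            new = 0
            for i in range(0, len(data), BLOCK):
                digest = hashlib.sha256(data[i:i + BLOCK]).digest()
                if digest not in self.index:                 # only *exact* matches dedupe
                    self.index[digest] = self.stored
                    self.stored += 1
                    new += 1
            return new

    engine = FixedBlockDedupe()
    original = os.urandom(1024 * 1024)                       # 1 MiB of unique data (128 blocks)
    print(engine.ingest(original), "new blocks on first write")        # 128
    print(engine.ingest(original), "new blocks for an identical copy") # 0: exact dedupe works
    mutated = bytearray(original)
    for i in range(0, len(mutated), BLOCK):
        mutated[i] ^= 0xFF                                   # flip one byte in every 8KB block
    print(engine.ingest(bytes(mutated)), "new blocks after a one-byte change per block")  # 128 again
    print(len(engine.index), "hash entries held in DRAM, growing linearly with the data")

    Against a few megabytes this is harmless; scaled to petabytes, that in-DRAM index is exactly the memory tax described above.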

    Enter Similarity Based Data Reduction

    VAST didn’t just rebuild the hardware architecture; they reinvented data reduction for the modern era.

    The VAST Similarity Engine doesn’t just look for exact block matches. It looks for similar blocks.

    Using advanced algorithms derived from hyperscale search engine technology, VAST breaks data down into very small chunks and compares them against “reference blocks” already stored in the system. If a new block is 99% similar to an existing block, VAST compresses it against that reference block, storing only the tiny delta of differences.
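
    For contrast, here is a deliberately simplified sketch of the similarity idea (a conceptual illustration, not VAST's actual algorithm): a coarse fingerprint finds a near-identical reference block that is already stored, and only the byte-level delta against it is kept.

    import hashlib, os

    def coarse_fingerprint(chunk, shingle=64):
        # A crude "similarity sketch": the minimum hash over small shingles.
        # Near-identical chunks almost always share it, even when a few bytes differ.
        return min(hashlib.blake2b(chunk[i:i + shingle], digest_size=8).digest()
                   for i in range(0, max(len(chunk) - shingle, 1), shingle))

    def delta_encode(chunk, reference):
        # Record only the byte positions where chunk differs from the reference block.
        return [(i, b) for i, (a, b) in enumerate(zip(reference, chunk)) if a != b]

    references = {}                                          # fingerprint -> reference block stored in full

    def store(chunk):
        fp = coarse_fingerprint(chunk)
        if fp in references:                                 # a similar reference already exists
            delta = delta_encode(chunk, references[fp])
            return 3 * len(delta)                            # rough bytes needed for the tiny delta
        references[fp] = chunk                               # first of its kind: keep it whole
        return len(chunk)

    base = os.urandom(64 * 1024)                             # a 64KB chunk of incompressible data
    edited = bytearray(base)
    edited[100] ^= 0x7F                                      # 99.99% similar, but not identical
    print(store(base), "bytes stored for the reference chunk")
    print(store(bytes(edited)), "bytes stored for the near-duplicate (almost always just the delta)")

    Because one coarse fingerprint can stand in for many near-identical chunks, the metadata needed to find matches stays far smaller than a per-8KB-block hash table, which is where the DRAM savings described below come from.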

    The Twin Economic Benefits of Similarity:

    1. Far Better Reduction (Saving SSDs): Similarity works amazingly well on data types that traditional dedupe gives up on—like voluminous log files, machine-generated IoT data, genomic data, and even pre-encrypted backup streams. VAST routinely achieves dramatic data reduction on datasets considered “incompressible” by legacy vendors.
    2. Massive RAM Savings (Saving DRAM): Because Similarity doesn’t rely on a rigid, massive hash table of every single 8KB block, it requires a fraction of the DRAM that legacy systems need to manage petabytes of data. The metadata footprint is radically smaller. In an era where filling a server with DDR5 memory can cost as much as the CPUs, this is a massive cost advantage.

    The Global Data Reduction Knockout Punch

    Remember the “Multi Cluster Disaster” scenario where you paid for 4PB of flash to store 1PB of data across Production, DR, Dev, and Analytics?

    Because VAST DASE is a single, scalable global namespace that can grow to exabytes without performance degradation, you never have that problem.

    Whether you have one PB or fifty PBs, it is all managed by one loosely coupled DASE cluster. The Similarity Engine sees everything.

    If you create a clone of your 1PB production database for testing, VAST recognizes it instantly. It doesn’t copy the data. It just creates pointers. Even as the test team modifies that data, the Similarity engine only stores the tiny unique changes.

    With VAST, you store 1PB of information once. Period. You don’t pay the “silo tax” ever again.


    Conclusion: The Economics Have Changed. Will You?

    The era of cheap flash and abundant, inexpensive RAM masking inefficient storage architectures is over. The storage market has entered a new phase of harsh economic reality defined by supply scarcity and exploding demand.

    Sticking with traditional dual controller architectures in this environment means voluntarily accepting stranded capacity, paying for duplicate data copies across multiple silos, and buying excessive amounts of overpriced RAM just to manage inefficient legacy dedupe tables.

    VAST Data’s DASE architecture and Similarity Engine were designed specifically for this petabyte scale reality. By breaking down physical silos through Disaggregated Shared Everything, and by reinventing data reduction to be both globally aware and radically RAM efficient, VAST doesn’t just offer better technology.

    It offers the only viable economic path forward for large scale data infrastructure in the 2020s. Stop paying the legacy tax.

  • VAST Data NVMe/TCP Block Support: Eliminating Silos and Unifying Workloads for VMware vSphere

    In modern enterprise IT, complexity is the enemy of agility. Organizations have long struggled with multiple storage silos: Fibre Channel SANs for block workloads, NFS for file, and object storage for analytics, backup, or cloud-native applications. Managing these separate systems increases cost, operational overhead, and risk, especially in VMware vSphere environments running critical workloads.

    VAST Data’s Element Store architecture changes that paradigm by delivering a unified, disaggregated, all-flash platform capable of handling block, file, and object workloads simultaneously. With the addition of NVMe/TCP block support, VMware customers can now consolidate mission critical virtual machines, databases, analytics, and unstructured workloads onto a single platform, eliminating silos while maintaining high performance and predictable latency.

    The Problem with Traditional Silos

    Enterprises typically maintain multiple arrays because each workload has historically required a different storage protocol:

    • Block storage for VMware VMFS datastores and transactional databases.
    • File storage for shared directories, home folders, or application file systems.
    • Object storage for analytics, backup, and cloud-native workloads.

    This siloed approach creates challenges:

    1. High management overhead – Separate monitoring, patching, replication, and capacity planning.
    2. Fragmented data – Moving data between silos is slow and inefficient.
    3. Inefficient hardware usage – Some arrays are underutilized, others are overprovisioned.
    4. Limited flexibility – Adapting to new workloads often requires deploying another silo.

    VAST Element Store: One Platform, All Protocols

    VAST’s Element Store is fundamentally different. It provides:

    • Unified architecture – Block, file, and object data coexist on the same underlying storage hardware, sharing the same pool of NVMe SSDs.
    • Disaggregated design – Compute and storage scale independently, so adding capacity or performance does not require forklift upgrades.
    • High-performance NVMe/TCP block support – Low latency for VMware vSphere workloads, with VMFS datastores.
    • File (NFS) and object (S3) access – Fully supported alongside block volumes, all running on the same Element Store.

    This architecture eliminates the need for separate SAN, NAS, or object systems: no more silos. Whether your organization runs transactional databases, virtual desktops, analytics pipelines, or backup workloads, VAST can handle them simultaneously on one platform.

    NVMe/TCP: Bringing Block into the Unified Store

    NVMe/TCP brings enterprise grade block storage to Ethernet networks, allowing VMware administrators to leverage standard infrastructure while achieving the performance traditionally associated with Fibre Channel:

    • Low-latency, high-throughput block volumes – Ideal for virtual machines.
    • Seamless integration with vSphere 8.x – Supports VMFS datastores.
    • Concurrent protocol access – Block volumes do not interfere with NFS shares or S3 object storage on the same Element Store.

    By supporting multiple protocols simultaneously, VAST ensures that all workloads are served efficiently without carving the system into silos, maximizing hardware utilization while simplifying operations.

    Benefits of a Unified Platform

    VAST’s single platform approach delivers several tangible benefits:

    1. Silo elimination – Consolidate block, file, and object workloads without deploying separate arrays.
    2. Operational simplicity – One management plane, one namespace, and consistent data services across protocols.
    3. Performance consistency – NVMe/TCP for block workloads, low-latency NFS for file, and high-throughput S3 for object, all coexisting without interference.
    4. Scalability – Disaggregated architecture allows linear scale of capacity and performance.
    5. Enterprise features – Instant snapshots, clones, replication, and automated tiering across all protocols from the same Element Store.

    Real-World Impact

    Imagine a hospital IT environment running VMware vSphere workloads for Epic databases, departmental file shares, and long-term imaging archives. Traditionally, each workload would live on a different silo: a SAN for Epic block volumes, a NAS for shared files, and object storage for images. VAST eliminates that complexity:

    • All workloads live on the same Element Store.
    • NVMe/TCP block volumes power VMware VMs and databases.
    • NFS handles file sharing and departmental apps.
    • S3 object storage supports long-term retention and analytics.
    • Snapshots, clones, and replication apply consistently across all protocols.

    The result: one platform, one namespace, one management model, and maximum hardware efficiency.

    Conclusion

    VAST Data’s NVMe/TCP block support is not just about performance; it’s about breaking down storage silos and enabling a truly unified data architecture. VMware administrators can now deploy mission critical block workloads alongside file and object workloads on the same Element Store, simplifying operations, improving utilization, and accelerating innovation.

    With VAST, enterprises can finally stop managing multiple arrays and start managing data as a single, universal resource, future-proofing their infrastructure while delivering the high performance and reliability modern workloads demand.


    Below, you can see a demo of how it all works:

    The official documentation: https://support.vastdata.com/s/document-item?bundleId=vast-cluster-administrator-s-guide5.3&topicId=managing-access-protocols%2Fblock-storage-protocol%2Fconfiguring-an-nvme-tcp-client-on-vmware-vsphere-for-vast-cluster-block-storage.html&_LANG=enus

  • Kubernetes COSI: Simplifying Object Storage with VAST Data

    When Kubernetes was first designed, it came with strong support for compute and block storage through the Container Storage Interface (CSI). CSI standardized how workloads could consume persistent volumes, enabling automation, portability, and ecosystem growth. But object storage—one of the most critical storage paradigms for modern applications—was left behind.

    That gap led to the creation of the Container Object Storage Interface (COSI), an emerging Kubernetes standard that allows applications to dynamically request, provision, and consume object storage buckets in the same way that CSI enables block storage volumes.

    In this post, we’ll explore why COSI exists, what problems it solves compared to “just provisioning object storage manually,” and how VAST Data integrates with and extends COSI to deliver enterprise-grade capabilities.


    Why COSI Exists

    At first glance, object storage might seem simpler than block or file storage. A bucket is just a logical container—you can create one with a single aws s3 mb command or a line of YAML through your object storage system’s API. Why do we need an entire Kubernetes API around this?

    The answer lies in scale, repeatability, and automation.

    • Manual provisioning doesn’t scale: In traditional environments, administrators pre-create buckets, hand out access keys, and manually wire credentials into workloads. This model quickly becomes unmanageable when dealing with hundreds or thousands of microservices.
    • Dynamic provisioning is critical: Developers want to define their object storage needs in their deployment YAML. Just like a PersistentVolumeClaim for block storage, a BucketClaim in COSI lets them request storage resources without needing to talk to a storage admin.
    • Consistent lifecycle management: Buckets, access policies, and credentials should follow the lifecycle of the Kubernetes resource. When the app is deleted, the bucket (or its credentials) can be reclaimed or cleaned up automatically.
    • Portability and standardization: COSI provides a common API that works across different object storage backends, eliminating the need for cluster operators to re-architect automation when switching vendors.

    In short, COSI brings self-service, automation, and policy-driven lifecycle management to object storage in Kubernetes.


    How COSI Works

    COSI introduces three key Kubernetes resources:

    1. BucketClass – Defines storage classes for object storage (e.g., standard vs archive tier, data protection settings).
    2. BucketClaim – A request for a bucket by an application.
    3. BucketAccess – Manages credentials and access permissions.

    Behind the scenes, a COSI driver runs in the cluster and interfaces with the object storage backend. It provisions buckets, generates access keys, and enforces policies defined in the BucketClass.

    For developers, this means a simple YAML declaration:

    apiVersion: objectstorage.k8s.io/v1alpha1
    kind: BucketClaim
    metadata:
      name: my-app-bucket
    spec:
      bucketClassName: standard

    The cluster handles the rest—provisioning the bucket on the backend, wiring access credentials into the pod, and ensuring lifecycle consistency.


    VAST Data and Kubernetes COSI

    VAST Data extends COSI with the unique benefits of its universal storage platform, enabling Kubernetes workloads to consume object storage with enterprise-grade guarantees:

    1. Unified Storage Engine: With VAST, COSI buckets live on the same platform that powers block, file, and database services. This simplifies operations and reduces silos while still supporting S3-compatible access.
    2. Performance at Scale: Unlike traditional object stores optimized for capacity but not speed, VAST delivers low-latency and high-throughput S3 performance—critical for modern data-intensive Kubernetes workloads like AI/ML pipelines, analytics, and media processing.
    3. Policy-Driven BucketClasses: VAST lets administrators expose bucket classes tied to real backend policies—data protection (erasure coding, snapshots), security (encryption, immutability), and tiering—so Kubernetes developers consume object storage aligned with enterprise governance.
    4. Lifecycle Automation: Buckets and credentials created via COSI on VAST can be tied to namespace or workload lifecycles, ensuring compliance and reducing orphaned resources.
    5. Deep Integration with VAST Ecosystem: By using VAST’s COSI driver, Kubernetes clusters benefit from the same consistency, scale, and global namespace that VAST provides across object, file, and block.

    Why It Matters

    The promise of Kubernetes has always been about self-service, automation, and portability. COSI brings that promise to object storage, enabling developers to request and consume buckets without manual intervention.

    For organizations standardizing on VAST Data, this means:

    • Faster developer velocity – No tickets, no manual bucket provisioning.
    • Reduced operational overhead – Policies and automation handle lifecycle.
    • Enterprise-grade data services – All powered by VAST’s universal storage engine.

    COSI is not just about buckets—it’s about unlocking cloud-native agility for data-driven applications. With VAST Data’s support, enterprises can extend these capabilities to mission-critical workloads at petabyte scale.


    Sample Kubernetes YAML Workflow with VAST’s COSI Driver

    1. Install the VAST COSI Driver

    First, you install the necessary Custom Resource Definitions (CRDs) and controller into your cluster, as outlined in VAST’s documentation.

    2. Create a Kubernetes Secret

    You’ll need a Kubernetes secret containing credentials or a VMS authentication token for the VAST COSI driver to authenticate with your VAST cluster:

    apiVersion: v1
    kind: Secret
    metadata:
      name: vast-cosi-auth
      namespace: vast-cosi-system
    type: Opaque
    stringData:
      # either:
      # - Vast Access Key and Secret Key
      # - or an authentication token from VMS (preferred for VAST Cluster 5.3+)
      access_key: "<YOUR_VAST_ACCESS_KEY>"
      secret_key: "<YOUR_VAST_SECRET_KEY>"
      # OR for token-based auth (VAST 5.3+):
      # token: "<YOUR_VAST_VMS_AUTH_TOKEN>"

    The documentation notes that VAST Cluster 5.3 and later support authentication using VMS authentication tokens. This is the preferred method over static credentials.

    3. Define a BucketClass

    Define how buckets should be provisioned—e.g., performance, data protection, or snapshot policies:

    apiVersion: objectstorage.k8s.io/v1alpha1
    kind: BucketClass
    metadata:
      name: vast-standard
    spec:
      # backend-specific class name that maps to a policy on the VAST platform
      provisioner: vast.cosi.vastdata.com
      parameters:
        # The “policy” should correspond to configured policies in VAST (e.g., erasure coding, snapshot-enabled)
        policy: standard-performance-s3

    This ties Kubernetes bucket requests to VAST backend policies for performance, protection, and lifecycle management.

    4. Create a BucketClaim

    Applications can request a bucket just like a PersistentVolumeClaim:

    apiVersion: objectstorage.k8s.io/v1alpha1
    kind: BucketClaim
    metadata:
      name: myapp-data-bucket
    spec:
      bucketClassName: vast-standard
      # optional: request a specific bucket name or let VAST assign one
      # bucketName: myapp-data-123

    The COSI controller will then provision an S3 bucket on the VAST platform according to the defined BucketClass.

    5. Access Credentials with BucketAccess

    To get credentials and endpoint information for your newly provisioned bucket:

    apiVersion: objectstorage.k8s.io/v1alpha1
    kind: BucketAccess
    metadata:
      name: myapp-data-access
    spec:
      bucketClaimName: myapp-data-bucket

    This resource enables Kubernetes (and your workloads) to retrieve credentials and other relevant metadata automatically.

    6. Use in a Pod or Deployment

    You can now mount or inject access credentials via Kubernetes secrets generated by the COSI controller. They may look like this in a pod:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-using-object-storage
    spec:
      containers:
        - name: app
          image: your-app-image:latest
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: myapp-data-access
                  key: accessKeyID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: myapp-data-access
                  key: secretAccessKey
            - name: S3_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: myapp-data-access
                  key: endpoint

    Your application now consumes S3-compatible object storage provisioned via VAST, fully automated via Kubernetes.


    How This Reflects VAST’s COSI Capabilities

    • Dynamic Provisioning: The spec-driven bucket creation (via BucketClaim) automates request handling, bucket creation, and credentials—all without manual intervention.
    • Modern Authentication: Support for VMS token-based authentication (available in VAST Cluster 5.3+) enhances security and reduces the need for static credentials.
    • Tight Integration with VAST Policies: The BucketClass maps directly to backend policy configurations in VAST—e.g., snapshot-enabled classes, performance tiers, etc.—so storage behavior is precisely controlled.

    Summary

    With this YAML workflow, developers can request object storage through Kubernetes, and the VAST COSI driver will handle provisioning, access control, and lifecycle—all aligned with enterprise-grade policies and authentication strategies.

    See a demo of how it all looks below.

    Please always refer to the official documentation, as things may change:

    https://support.vastdata.com/s/document-item?bundleId=vast-cosi-driver-2.6-administrator-s-guide&topicId=about-vast-cosi-driver.html&_LANG=enus

  • From Borg to Block: How Kubernetes Evolved to Power the AI Revolution with VAST Data AI OS

    The Artificial Intelligence revolution runs on a specific and powerful foundation: Kubernetes. Once an internal Google project, it has become the cloud’s de facto operating system, orchestrating the complex, data-intensive workloads that define modern AI. This post explores the journey of Kubernetes and how it forms the core of a unified AI stack, enabled by NVIDIA’s Inference Microservices (NIM) and the VAST Data Platform. We’ll focus on how this ecosystem provides a complete solution, from compute orchestration down to a pivotal new capability: high-performance block storage over NVMe/TCP.

    The Genesis of Kubernetes

    Kubernetes is the direct descendant of Google’s internal container management systems, Borg and Omega. Developed over a decade, these systems ran services like Gmail and Google Search at a massive scale. However, they were proprietary and tightly coupled to Google’s infrastructure. In 2014, Google engineers distilled the lessons from Borg and Omega into a new, open-source project: Kubernetes.

    The design of Kubernetes directly addressed the pain points of its predecessors. Concepts like the Pod (the atomic unit of scheduling for co-located containers), Labels (for flexible resource organization), and an IP-per-Pod networking model were not theoretical but practical solutions born from years of operational experience.

    Google’s most strategic move was donating Kubernetes to the newly formed Cloud Native Computing Foundation (CNCF) in 2015. This vendor-neutral governance model fostered a massive collaborative community, leading to its rapid adoption as the industry standard for container orchestration.

    Kubernetes: The Undisputed Engine for AI

    The features that made Kubernetes a powerful general-purpose orchestrator also made it the ideal platform for demanding AI/ML workloads.

    • Scalability and Portability: Kubernetes excels at automatically scaling resources up and down to meet the fluctuating demands of AI training and inference. Its container-based model ensures that ML environments are portable and reproducible, eliminating the “it works on my machine” problem.
    • Hardware Abstraction for GPUs: Modern AI relies on specialized hardware like GPUs. Kubernetes abstracts this complexity through its device plugin framework. Platform teams manage the complex hardware and driver installations, while data scientists simply request a GPU in their pod configuration (e.g., nvidia.com/gpu: 1). The scheduler handles the rest, making powerful hardware a simple, consumable resource.
    • An Extensible Platform: Kubernetes’s greatest strength is its extensibility. Through Custom Resource Definitions (CRDs) and Operators, an entire ecosystem of AI/ML platforms has been built on top of Kubernetes. Tools like Kubeflow provide a complete ML lifecycle toolkit, turning Kubernetes into the foundational “operating system” for a universe of AI innovation.

    NVIDIA NIM: Packaging AI for a Kubernetes World

    NVIDIA simplified AI deployment with NVIDIA Inference Microservices (NIM), a collection of pre-built, optimized containers that package AI models into enterprise-ready microservices. Each NIM is a self-contained Docker container that includes not just the AI model, but also a high-performance inference engine like Triton, all necessary CUDA libraries, and a standard API endpoint compatible with the OpenAI API specification. This packaging transforms complex AI models into standard, cloud-native applications.

    NIMs are fine-tuned for specific hardware, leveraging tools like NVIDIA TensorRT to optimize the model graph, fuse layers, and reduce precision to maximize inference throughput and efficiency. They also include an embedded Prometheus endpoint for monitoring GPU utilization, latency, and other critical telemetry. NVIDIA offers a vast catalog of NIMs for various domains, including LLMs like Llama 3.1, speech AI for translation and text-to-speech, digital biology models for molecular docking, and simulation models for generating 3D worlds with OpenUSD.

    Deployment is handled through familiar Kubernetes tools. While Helm charts provide an easy entry point for deploying a single NIM, the NIM Operator offers production-grade lifecycle management for complex AI pipelines. The operator, first released in 2024 and now at version 2.0, uses CRDs to automate complex tasks. Its NIMCache feature intelligently pre-caches large models to a persistent volume, dramatically reducing startup times for new pods. The NIMPipeline CRD allows operators to manage an entire graph of interdependent NIMs, such as a RAG system combining embedding, reranking, and LLM models, as a single, cohesive unit.

    With version 2.0, the NIM Operator expanded its capabilities to manage the lifecycle of NVIDIA NeMo microservices. This includes tools for building complete AI data flywheels, such as the NeMo Customizer for fine-tuning models, NeMo Evaluator for performance benchmarking, and NeMo Guardrails for adding content safety and topic controls to LLM chatbots. By managing both inference (NIM) and customization (NeMo) microservices, the NIM Operator provides a unified, declarative interface for deploying and managing the entire lifecycle of enterprise-grade, production AI applications on Kubernetes.

    VAST Data: A Unified Data Foundation for Kubernetes

    AI requires a high-performance, scalable data foundation, a role filled by the VAST Data Platform. VAST’s relationship with Kubernetes is symbiotic; it not only serves data to Kubernetes but is built on the same cloud-native principles.

    VAST’s Disaggregated, Shared-Everything (DASE) architecture decouples compute logic (CNodes) from physical storage (DNodes), allowing performance and capacity to scale independently—a philosophy that mirrors Kubernetes’s own design.

    VAST Data’s CNodes (Compute Nodes) play a critical role in the DASE architecture, handling system coordination, cluster metadata services, management APIs, and cloud integration. Within these CNodes, VAST does use containers extensively, especially for modularity, scalability, and cloud portability.

    VAST even uses containers internally for its serverless DataEngine, proving its deep alignment with the cloud-native world.

    For external Kubernetes clusters, VAST provides storage via its full-featured Container Storage Interface (CSI) driver. This allows developers to dynamically provision persistent volumes from VAST using standard Kubernetes objects like PersistentVolumeClaim and StorageClass, automating storage management and making VAST a frictionless data backend for any stateful application.

    The New Frontier: Customer Choice with Block Storage over NVMe/TCP

    Historically, data centers were split between Network Attached Storage (NAS) for files and Storage Area Networks (SAN) for block-based workloads like databases. VAST’s most significant recent innovation is the addition of native block storage, a strategic move that collapses this final silo and delivers on the promise of a truly unified data platform.

    This unification is centered on customer choice. Architects can now select the optimal protocol—File (NFS), Object (S3), or Block (NVMe/TCP)—for any given workload, all provisioned from the same underlying VAST storage pool. This dramatically simplifies infrastructure and reduces costs.

    The choice of NVMe over TCP (NVMe/TCP) for block storage is critical. It delivers SAN-like, microsecond-level latency over standard Ethernet networks, eliminating the need for expensive, specialized SAN fabrics. This is perfect for latency-sensitive AI components like vector databases and feature stores, which can become bottlenecks if they are waiting on slow storage. VAST provides a dedicated Block CSI driver (block.csi.vastdata.com), allowing Kubernetes pods to consume raw block devices with the same ease as file or object storage, ensuring GPUs are always fed with data at maximum speed.

    The Future is Agentic: The VAST DataEngine and AgentEngine

    VAST is positioning its platform not just for today’s AI workloads, but for the next frontier: agentic AI. This new paradigm involves intelligent, autonomous agents that can reason, plan, and act to achieve complex goals, often by orchestrating multiple smaller, specialized models and tools. To power this vision, VAST has developed a comprehensive AI Operating System that unifies storage, database, and compute.

    The foundation of this is the VAST DataEngine, a serverless, containerized function execution environment built directly into the platform. The DataEngine enables event-driven processing, where functions written in Python are automatically triggered by data events, such as a new file being written, to perform tasks like data transformation, indexing, or enrichment in real time.

    Building on this foundation is the upcoming VAST AgentEngine, an AI agent deployment and orchestration system scheduled for release in the second half of 2025. The AgentEngine is designed to be the application management layer for agentic AI, providing the runtime, tooling, and observability needed to deploy and manage agents at scale. Key features include:

    • A Dedicated Runtime for Agents: An operational framework that handles not just container startup, but also loading models into GPU memory and verifying the tools an agent needs to function.
    • AI-Native Resiliency: For long-running agents that may operate for hours or days, AgentEngine provides checkpointing of the agent’s memory and reasoning state. This allows for seamless recovery from failures without having to restart the entire process from scratch.
    • The AgentEngine Studio: An integrated environment where developers can define how agents interact with tools and data, configure access rules and security, and manage their lifecycle.
    • An Agent Tool Server: A core component that allows agents to securely invoke data, functions, web searches, or even other agents using the emerging Model Context Protocol (MCP) standard for agent-tool interaction.

    Together, the DataEngine and AgentEngine aim to provide the end-to-end infrastructure needed to transform experimental AI agents into scalable, recoverable, and observable production applications.

    KubeVirt: Bridging the Past and Future

    While the future is cloud-native, enterprises have vast investments in legacy applications running in virtual machines (VMs). KubeVirt provides the crucial bridge, extending Kubernetes to manage VMs alongside containers on the same cluster.

    KubeVirt transforms Kubernetes into a universal infrastructure control plane, capable of managing any workload. A legacy application in a VM can now run on the same cluster, use the same network policies, and mount storage from the same VAST CSI driver as a modern containerized microservice. This allows organizations to modernize their infrastructure management without a disruptive “all-or-nothing” rewrite of every application, consolidating operations onto a single, future-proof platform.

    Conclusion: The Unified AI Stack

    The modern AI infrastructure stack has converged on a powerful, cohesive, and unified architecture:

    • A Universal Control Plane: Kubernetes, augmented with KubeVirt, provides a single control plane for both modern containers and legacy VMs.
    • Cloud-Native AI Services: NVIDIA NIMs package AI models into standardized, easy-to-deploy microservices managed by the Kubernetes-native NIM Operator.
    • A Unified Data Platform: The VAST Data Platform, built on a cloud-native DASE architecture, provides the essential data foundation. Its comprehensive CSI driver offers a choice of high-performance file, object, and now ultra-low-latency block storage via NVMe/TCP. Looking forward, its DataEngine and AgentEngine are built to power the next generation of agentic AI.

    This convergence eliminates infrastructure silos and operational friction, creating a future-proof stack that allows organizations to accelerate their journey from raw data to transformative, AI-driven insight.