AI workloads, data platforms, and infrastructure notes, written from the engineering edge between benchmarks and production.

High-Performance I/O Without the Myths – Best practices from VAST Data on NFS, RDMA, page cache, load balancing, and GPU Direct Storage

Itzik — VP Mission Alignment, VAST Data

7 min read


Source: Best Practices for High Performance I/O | VAST Data

For years, infrastructure teams have repeated the same assumptions about storage: NFS cannot scale for high performance, object storage is always too slow for serious workloads, RDMA is mandatory, and users should simply rewrite their applications if the I/O path is inefficient.

The VAST Data session Best Practices for High Performance I/O makes a more useful argument. High-performance I/O is not primarily a protocol debate. It is a systems problem that spans architecture, client behavior, networking, caching, and workload realism.

That shift matters for AI and HPC teams because the number that actually matters is not the benchmark headline. It is whether the real job finishes faster, whether GPUs stay fed, whether latency-sensitive workflows stop stalling, and whether the system remains resilient when links, optics, or paths inevitably misbehave.

1. Optimize for Real I/O Patterns, Not Ideal Ones

One of the strongest ideas in the talk is that users should not have to spend their lives working around storage quirks. In HPC in particular, there has long been a tendency to label small I/O, mixed access, and metadata-heavy behavior as bad patterns. The more honest framing is that these are simply I/O patterns.

Physics still matters. Small I/O will always be harder than large I/O. Random access will always cost more than sequential access. But the platform should still behave smoothly and gracefully instead of assuming every research team or application developer will become an expert in storage internals.

This is the first best practice: build systems that remain effective when workloads are imperfect. The practical win is not just better throughput. It is lower operational friction and less time wasted trying to force applications into unrealistic benchmark-friendly behavior.

2. High-Performance NFS Depends on Architecture

The session focuses on NFS and makes the case that modern NFS performance should not be judged by old assumptions. In the VAST model, protocol servers can all access the full namespace rather than owning isolated slices of metadata or data. That is what lets the client land on different servers without losing access to the right files.

This matters because architecture decides whether NFS becomes a bottleneck or a scalable front door. If every request has to chase a unique owner, the protocol suffers. If any front-end node can reach the required data efficiently, NFS becomes far more practical for large environments.

The best practice here is straightforward: if you want NFS to scale, choose a storage architecture that supports wide distribution of traffic across protocol servers instead of forcing clients through narrow ownership paths.

3. Use Multiple Connections and Multipathing

A major theme in the talk is that load balancing matters as much as raw bandwidth. Traditional NFS clients often connect through a single mount target, which becomes limiting when line rates and client sizes grow. The VAST session explains how nconnect and multipathing help distribute load across more connections, more server IPs, and more client interfaces.

That matters in modern AI systems because a single client is often no longer a small host. It may be a dense node packed with GPUs and multiple high-speed NICs. In those environments, the goal is not just to make one connection faster. It is to avoid hotspots and let the client pull from enough paths to use the infrastructure well.

The session's benchmark slide on connection scaling is one of its clearest practical takeaways. It shows that a single connection leaves performance on the table, while multiple connections scale close enough to linearly to make a real difference in production.

Best practice: use nconnect and multipathing deliberately. The point is not to connect to everything. The point is to spread load enough to avoid hotspots and achieve stable aggregate throughput.
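
As a rough illustration of the client side, here is a minimal Python sketch that assembles a Linux NFS mount command with the nconnect option. The server name, export path, and mount point are hypothetical placeholders, and the default of 8 connections is an arbitrary starting point, not a recommendation from the talk. The sketch only builds the argument list. It does not cover multipathing across several server IPs, which may depend on client support beyond the stock options shown here.

```python
def nfs_mount_command(server, export, mount_point,
                      nconnect=8, vers="4.1", proto="tcp"):
    """Build (but do not run) a mount command that opens several
    connections to one NFS server via the nconnect option."""
    # The upstream Linux NFS client accepts nconnect values from 1 to 16.
    if not 1 <= nconnect <= 16:
        raise ValueError("nconnect must be between 1 and 16")
    options = f"vers={vers},proto={proto},nconnect={nconnect}"
    return ["mount", "-t", "nfs", "-o", options,
            f"{server}:{export}", mount_point]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = nfs_mount_command("storage.example.com", "/export", "/mnt/data")
    print(" ".join(cmd))
```

Returning a list instead of running the command keeps the sketch safe to try anywhere. On a real client the list could be handed to subprocess.run with root privileges, and the connection count should be raised or lowered based on measured aggregate throughput.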

4. Treat RDMA as a Tool, Not a Religion

The talk takes a pragmatic position on RDMA. Yes, RDMA can improve throughput per connection and reduce CPU overhead. Yes, it matters for some data paths, especially GPU Direct Storage. But it is not magic.

TCP can still deliver very high throughput when enough connections are used. RDMA also requires real coordination across compute, storage, and networking teams. That means it should be adopted where it clearly helps, not because it sounds impressive in a design review.

Best practice: be data-driven about RDMA. If your hardware and workload benefit from it, use it. If TCP already meets your job goals, do not assume RDMA is the missing ingredient by default.
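
One way to make that decision data-driven is a first-order estimate of how many TCP connections a client would need to reach its target. The Python sketch below assumes throughput scales roughly linearly with connection count, which is a simplification, and every number passed to it should come from your own measurements. The function name and the cap of 16 (the upstream Linux nconnect limit) are choices made for this illustration.

```python
import math


def connections_needed(target_throughput, per_connection_throughput,
                       max_connections=16):
    """Estimate the TCP connections needed to reach a target throughput.

    Both throughput values must use the same unit (for example GB/s).
    Returns None when the estimate exceeds max_connections, which is a
    hint that TCP alone may not reach the goal on this client.
    """
    if target_throughput <= 0 or per_connection_throughput <= 0:
        raise ValueError("throughput values must be positive")
    # Assumes near-linear scaling per connection; real systems flatten out.
    needed = math.ceil(target_throughput / per_connection_throughput)
    return needed if needed <= max_connections else None


if __name__ == "__main__":
    # Made-up numbers: a 10 GB/s goal with 3 GB/s measured per connection.
    print(connections_needed(10, 3))
```

If the estimate fits comfortably under the cap and the job meets its goal, RDMA may be optional for that workload. If it returns None, that is a reason to measure RDMA, not proof that it is required.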

5. Understand the Linux Page Cache Before You Trust a Benchmark

The Linux page cache section is one of the most important in the talk because it exposes the gap between synthetic tests and application reality. Writes often land in memory first and drain later. Reads can benefit from readahead. Sequential access can look dramatically faster because the kernel is already pulling data forward on behalf of the application.

That means many application teams are not really experiencing storage directly. They are experiencing the client stack, the page cache, and the kernel’s prediction logic. This is not a flaw. It is an important part of why real applications can perform much better than raw direct I/O measurements suggest.

The benchmark slide for this section is especially useful because it visualizes how misleading direct-I/O-only thinking can be. It shows a large jump between direct I/O and cold page-cache reads, reinforcing the session's point that application-visible performance is often shaped by client-side behavior.

Best practices from this section are clear:

  • Know whether your workload uses buffered I/O or direct I/O.
  • Tune readahead for sequential workloads instead of assuming Linux defaults are ideal.
  • Use direct I/O for storage diagnostics, but do not confuse it with application truth.
  • Remember that page cache behavior can be a major part of the real performance story.
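
Before tuning readahead, it helps to see what the client is using today. The Python sketch below looks up the read_ahead_kb value that Linux exposes per backing device under /sys/class/bdi, keyed by the device number of the mount. The helper names are made up for this example, and the sketch only reads the value. Changing it means writing to the same file as root, which is deliberately left out.

```python
import os
from pathlib import Path


def bdi_readahead_path(mount_point):
    """Return the sysfs path holding the readahead size for a mount."""
    dev = os.stat(mount_point).st_dev
    # Linux names each backing-device-info entry "major:minor".
    return Path(f"/sys/class/bdi/{os.major(dev)}:{os.minor(dev)}/read_ahead_kb")


def read_readahead_kb(mount_point):
    """Return the current readahead size in KiB, or None if unavailable."""
    try:
        return int(bdi_readahead_path(mount_point).read_text().strip())
    except (OSError, ValueError):
        # No sysfs entry (or not Linux): report "unknown", do not guess.
        return None


if __name__ == "__main__":
    # "/" is used so the sketch runs anywhere; point it at an NFS mount.
    print(bdi_readahead_path("/"), read_readahead_kb("/"))
```

Recording this value alongside every benchmark run makes it much easier to explain why two otherwise identical clients report different sequential read numbers.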

6. Use Delegations and Modern Client Features for Metadata-Heavy Workloads

Another practical point in the talk is NFS delegations. Delegations let the server grant the client more authority over a file when there are no conflicting accesses. That reduces unnecessary round trips and helps latency-sensitive workflows such as git clone, untar operations, and other small-file-heavy developer patterns.

These are not bandwidth problems. They are usually metadata and latency problems. That is why modern client behavior matters so much.

Best practice: use the latest client capabilities and current NFS versions where available. Small-file workflows often gain more from reduced round trips than from chasing another few gigabytes per second of bulk throughput.
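
Since delegations are an NFSv4 feature, a reasonable first check is which protocol version each client actually negotiated. The Python sketch below parses text in the /proc/mounts format and reports the vers and nconnect options for each NFS mount. The function name and the sample line are illustrative, and the sketch does not confirm that delegations are being granted, only that the mount uses a version that can support them.

```python
def parse_nfs_mounts(mounts_text):
    """Map each NFS mount point to its negotiated vers/nconnect options.

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            options[key] = value
        mounts[fields[1]] = {
            "vers": options.get("vers"),
            "nconnect": options.get("nconnect"),
        }
    return mounts


if __name__ == "__main__":
    # Hypothetical sample line; on a client, read /proc/mounts instead.
    sample = ("storage.example.com:/export /mnt/data nfs4 "
              "rw,relatime,vers=4.1,proto=tcp,nconnect=8 0 0")
    print(parse_nfs_mounts(sample))
```

A client that reports vers=3 here cannot benefit from delegations no matter how the server is configured, so this check is worth doing before chasing small-file latency elsewhere.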

7. GPU Direct Storage Is Real, but Workload Context Matters

The session also covers GPU Direct Storage in a refreshingly grounded way. In theory, GDS allows data to move from the NIC directly into GPU memory instead of traveling through system memory first. That can reduce overhead, and in the right conditions it clearly helps.

But many real AI stacks already use double-buffering and CPU-side data movement effectively. In those pipelines, CPUs prepare and stream the next batch while GPUs compute, which can reduce the incremental impact of GDS compared with the marketing narrative.

The value of the session's GDS slide is that it ties GDS to something measurable: stalls, throughput, and the cost of the classic staging path. The takeaway is not that every AI workload needs GDS. It is that teams should measure whether their current ingest path is actually the bottleneck.

Best practice: evaluate GDS against real pipeline behavior. If your CPUs already keep GPUs saturated, the gain may be small. If the data path is the bottleneck, GDS can be compelling.
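
To make "measure the stalls" concrete, here is a minimal Python sketch of a double-buffered pipeline that records how long the consumer waits for data. It is a stand-in for a real training loop, not the method shown in the talk: load_batch and compute are hypothetical callables you would supply, and a background thread plays the role of the CPU-side loader.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the batch stream


def run_pipeline(load_batch, compute, num_batches, depth=2):
    """Prefetch batches on a background thread while the caller computes.

    Returns (results, stall_seconds), where stall_seconds is the total
    time the consumer spent waiting for the next batch to arrive.
    """
    buffer = queue.Queue(maxsize=depth)  # depth=2 gives double-buffering

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks when the buffer is full
        buffer.put(_DONE)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results, stall_seconds = [], 0.0
    while True:
        start = time.perf_counter()
        batch = buffer.get()
        stall_seconds += time.perf_counter() - start
        if batch is _DONE:
            break
        results.append(compute(batch))
    worker.join()
    return results, stall_seconds


if __name__ == "__main__":
    # Simulated work: loading is faster than compute, so stalls stay small.
    results, stalled = run_pipeline(
        load_batch=lambda i: (time.sleep(0.01), i)[1],
        compute=lambda b: (time.sleep(0.02), b * 2)[1],
        num_batches=5,
    )
    print(results, f"stalled for {stalled:.3f}s")
```

If the stall time is a small fraction of the total runtime, the staging path is already keeping the consumer fed and GDS has little to remove. A large stall fraction is the signal that the ingest path deserves a closer look.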

8. NFS Versus S3 Is the Wrong Fight

The session closes by reframing the NFS versus S3 debate. NFS still has protocol efficiency advantages and remains a strong fit for many file-based applications and HPC workflows. S3, however, is increasingly attractive because it is cloud-native, portable, and operationally convenient.

The more useful choice is not which camp wins forever. It is whether the storage platform lets both coexist so teams can use the right interface for the right workflow without forcing a disruptive migration.

Best practice: treat NFS and S3 as workload choices, not ideology. A multi-protocol namespace is often more valuable than a single-protocol argument.

The Best Practices, Summarized

  • Design for real I/O behavior, not idealized patterns.
  • Choose architecture that distributes traffic broadly across protocol servers.
  • Use nconnect and multipathing to avoid hotspots and improve aggregate throughput.
  • Adopt RDMA where it provides measurable value, not just marketing value.
  • Understand Linux page-cache behavior before drawing conclusions from benchmarks.
  • Tune readahead for sequential read-heavy workloads.
  • Use delegations and modern NFS client behavior for metadata-heavy workflows.
  • Evaluate GPU Direct Storage in the context of the real AI pipeline.
  • Compare NFS and S3 based on workload fit and operational convenience.
  • Optimize for job outcomes, not just synthetic benchmark numbers.

Conclusion

The strongest message in this session is that high-performance I/O should be approached as end-to-end systems engineering. Architecture, load balancing, client features, caching, network behavior, and workload design all shape the result. Protocol labels alone do not.

That is why this talk is useful. It does not just repeat old claims about NFS, RDMA, or object storage. It replaces mythology with engineering and connects performance advice to the way applications really run.
