
64-GPU AI Computing Performance Comparison Test: H3C RoCE Network (S9827 Series Switches) vs. InfiniBand Network

Sponsor: New H3C Technologies Co., Ltd

Abstract

New H3C Technologies commissioned Tolly to evaluate the AI-computing performance of an H3C RoCE network built with S9827 series switches against an InfiniBand network using NVIDIA QM9700 switches. The main focus of the project was to determine whether an Ethernet-based RoCE fabric could deliver NCCL collective-communication and large-language-model training performance comparable to InfiniBand in a 64-GPU environment.  


The H3C S9827 family is positioned as a high-density Ethernet switching platform for ultra-large data centers and AI computing networks, supporting up to 64 x 800GE or 128 x 400GE ports, port splitting to 256 x 200GE, and compatibility with LPO and ZR optical modules. In Tolly’s test environment, both the H3C RoCE network and the InfiniBand network used a multi-track topology with eight servers, each equipped with eight NVIDIA H20 GPUs and eight 400G NICs, for a total of 64 GPUs. The software stack included Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3. The RoCE fabric used H3C S9827-128DH switches in both spine and leaf roles, while the comparison IB fabric used NVIDIA QM9700 switches.  
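The port and GPU counts quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the Tolly report: the variable names are illustrative, and the aggregate figures are simply computed from the numbers stated in the paragraph above.

```python
# Port configurations quoted for the S9827 family: (port count, speed in Gb/s).
PORT_MODES = {"800GE": (64, 800), "400GE": (128, 400), "200GE": (256, 200)}


def aggregate_tbps(count, gbps):
    """Aggregate one-direction port capacity in Tb/s."""
    return count * gbps / 1000


# Each quoted port mode should add up to the same total capacity.
capacities = {name: aggregate_tbps(*mode) for name, mode in PORT_MODES.items()}

# Test-bed sizing as described: 8 servers, each with 8 GPUs and 8 x 400G NICs.
SERVERS = 8
GPUS_PER_SERVER = 8
NICS_PER_SERVER = 8
NIC_GBPS = 400

total_gpus = SERVERS * GPUS_PER_SERVER
total_nics = SERVERS * NICS_PER_SERVER
host_facing_tbps = total_nics * NIC_GBPS / 1000

print(capacities)
print(total_gpus, total_nics, host_facing_tbps)
```

Under these figures, all three port modes correspond to the same aggregate capacity, and the 64 host NICs present 25.6 Tb/s of server-facing bandwidth to the fabric.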


Tolly first measured NCCL Ring-AllReduce performance across multiple message sizes. The H3C S9827 RoCE network achieved an average bus bandwidth of 231.877 GB/s, compared with 230.698 GB/s for the InfiniBand network, giving RoCE a slight average advantage of 0.51%. Results were closely matched throughout the size range, with H3C RoCE ahead at 8 MB, 64 MB, 128 MB, 512 MB, 2 GB, 8 GB, and 16 GB, and especially strong at 512 MB, where it reached 311.18 GB/s versus 288.36 GB/s for InfiniBand.
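For readers unfamiliar with the metric, the sketch below shows how bus bandwidth is conventionally derived for AllReduce and how the quoted percentage follows from the two averages. It assumes the nccl-tests convention (algorithm bandwidth scaled by 2(n-1)/n, with GB taken as 10^9 bytes). The abstract does not state which tool or formula Tolly used, so treat the function names and the formula choice as assumptions.

```python
def allreduce_bus_bandwidth(message_bytes, seconds, ranks):
    """Bus bandwidth in GB/s, assuming the nccl-tests AllReduce convention."""
    alg_bw = message_bytes / seconds / 1e9  # algorithm bandwidth, GB/s
    return alg_bw * 2 * (ranks - 1) / ranks


def relative_gain_percent(candidate, baseline):
    """How far candidate exceeds baseline, as a percentage of baseline."""
    return (candidate - baseline) / baseline * 100


# Figures quoted in the abstract (GB/s): H3C RoCE first, InfiniBand second.
avg_gain = relative_gain_percent(231.877, 230.698)
gain_512mb = relative_gain_percent(311.18, 288.36)

print(f"average: {avg_gain:.2f}%, 512MB: {gain_512mb:.2f}%")
# prints "average: 0.51%, 512MB: 7.91%"
```

The average advantage reproduces the 0.51% stated in the report. The 512 MB gap works out to roughly 7.9%, which is a derived figure rather than one quoted by Tolly.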


The evaluation also included Llama3 70B training to assess a real AI workload beyond synthetic collective tests. Here again, the two fabrics were nearly identical. The InfiniBand network delivered an average iteration time of 17,724 ms and throughput of 14.44 samples per second, while the H3C S9827 RoCE network achieved 17,695 ms and 14.47 samples per second. Overall, the report concludes that the H3C S9827-based RoCE network provides performance and user experience comparable to InfiniBand for 64-GPU AI training workloads, while also offering very high Ethernet port density and flexibility for large-scale AI fabric deployments.
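The iteration time and throughput figures are two views of the same measurement, which the sketch below makes explicit. It assumes throughput equals samples per iteration divided by iteration time. The abstract does not state the batch size, so the value recovered here is an inference from the rounded figures, not a reported parameter.

```python
# Figures quoted in the abstract for the Llama3 70B training run.
RESULTS = {
    "InfiniBand": {"iter_ms": 17724, "samples_per_s": 14.44},
    "H3C RoCE": {"iter_ms": 17695, "samples_per_s": 14.47},
}


def implied_batch(iter_ms, samples_per_s):
    """Samples processed per iteration, assuming throughput = batch / time."""
    return samples_per_s * iter_ms / 1000


# Back out the samples per iteration implied by each fabric's figures.
batches = {name: implied_batch(**r) for name, r in RESULTS.items()}

ib, roce = RESULTS["InfiniBand"], RESULTS["H3C RoCE"]
# Relative improvement of RoCE over InfiniBand, in percent.
iter_time_gain = (ib["iter_ms"] - roce["iter_ms"]) / ib["iter_ms"] * 100
throughput_gain = (
    (roce["samples_per_s"] - ib["samples_per_s"]) / ib["samples_per_s"] * 100
)

for name, batch in batches.items():
    print(f"{name}: about {batch:.1f} samples per iteration")
print(f"iteration time: {iter_time_gain:.2f}% shorter on RoCE")
print(f"throughput: {throughput_gain:.2f}% higher on RoCE")
```

Both fabrics imply roughly 256 samples per iteration, and the gap between them is about 0.2% on either metric, which is consistent with the report's description of the results as nearly identical.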


Switches used in this test: 


  • H3C S9827-128DH — H3C RoCE fabric switch used in both spine and leaf roles in the test. Part of the S9827 series, it supports very high-density 800GE/400GE/200GE ports and is aimed at large-scale AI and data center networks.  
  • NVIDIA QM9700 — InfiniBand switch used in both spine and leaf roles in the comparison network for the 64-GPU AI training tests.