
64-GPU AI Computing Performance Comparison Test: H3C RoCE Network (S9825-G & S9855-G Series Switches) vs. InfiniBand Network

Sponsor: New H3C Technologies Co., Ltd

Abstract

New H3C Technologies commissioned Tolly to evaluate AI-computing performance in a 64-GPU environment, comparing an H3C RoCE fabric built with S9825-G and S9855-G series switches against an InfiniBand network using NVIDIA QM9700 switches. The goal was to determine whether this Ethernet-based RoCE architecture could deliver NCCL collective-communication and Llama3 training performance comparable to InfiniBand under the same large-scale AI workload conditions.


The H3C S9825-G and S9855-G families are positioned as high-performance, high-density 400GE and 100GE Ethernet switches for high-end data centers and AIGC computing environments, with redundant hot-swappable power supplies and fans. In the tested RoCE fabric, H3C S9825-8C-G switches served as spine devices and H3C S9855-32DH-G switches served as leaf devices connecting to servers. The comparison InfiniBand fabric used NVIDIA QM9700 switches in both spine and leaf roles. Both networks used a multi-track topology with eight servers, each equipped with eight NVIDIA H20 GPUs and eight 400G NICs, running Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3.  
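The scale of the test bed follows directly from the per-server figures above. The short Python sketch below works through that arithmetic. The rail mapping at the end is an assumption: it reads "multi-track" as the common rail-style layout in which NIC k of every server attaches to the same rail, which the abstract does not spell out, and the variable names are illustrative only.

```python
# Test-bed sizing from the figures stated in the abstract.
SERVERS = 8
GPUS_PER_SERVER = 8
NICS_PER_SERVER = 8
NIC_SPEED_GBPS = 400  # gigabits per second per NIC

total_gpus = SERVERS * GPUS_PER_SERVER           # GPUs in the cluster
total_nics = SERVERS * NICS_PER_SERVER           # 400G NICs in the cluster
aggregate_tbps = total_nics * NIC_SPEED_GBPS / 1000  # total NIC line rate, Tb/s
per_nic_gbytes_per_s = NIC_SPEED_GBPS / 8        # line rate of one NIC, GB/s

# ASSUMPTION (not stated in the source): rail k groups NIC k of every server.
# Each entry is a (server_index, nic_index) pair.
rails = {
    k: [(server, k) for server in range(SERVERS)]
    for k in range(NICS_PER_SERVER)
}

print(total_gpus, total_nics)  # 64 64
print(f"aggregate NIC line rate: {aggregate_tbps} Tb/s")
print(f"per-NIC line rate: {per_nic_gbytes_per_s} GB/s")
```

Under that assumed layout there are eight rails of eight NICs each, one NIC per server per rail.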


Tolly first measured NCCL Ring-AllReduce performance across a range of message sizes. The H3C RoCE network achieved an average bus bandwidth of 231.947 GB/s versus 230.698 GB/s for the InfiniBand (IB) network, a slight average advantage of 0.54% for the H3C fabric. Results were closely matched across the size range: the H3C network outperformed IB at 128 MB, 512 MB, 2 GB, 4 GB, 8 GB, and 16 GB, and trailed slightly at some smaller sizes. Tolly characterizes the overall result as essentially equivalent collective-communication performance in this 64-GPU configuration.
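The 0.54% figure can be reproduced from the two reported averages. The Python sketch below does that, and also shows how an AllReduce bus bandwidth is conventionally derived from a measured time, using the definition from the nccl-tests documentation (algorithm bandwidth scaled by 2(n-1)/n). It is an assumption that the report's "bus bandwidth" is this nccl-tests metric; the function names are illustrative.

```python
def relative_gain_pct(value: float, baseline: float) -> float:
    """Percentage by which `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100


def allreduce_bus_bw(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in bytes/s for AllReduce, per the nccl-tests convention.

    Algorithm bandwidth is size / time; AllReduce scales it by 2*(n-1)/n so
    the result can be compared against hardware link speed.
    """
    alg_bw = size_bytes / seconds
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


# Average bus bandwidths reported in the abstract, in GB/s.
roce_avg = 231.947
ib_avg = 230.698

gain = relative_gain_pct(roce_avg, ib_avg)
print(f"H3C RoCE advantage: {gain:.2f}%")  # H3C RoCE advantage: 0.54%

# With 64 ranks the AllReduce scaling factor is 2 * 63 / 64.
print(allreduce_bus_bw(size_bytes=64, seconds=1, n_ranks=64))  # 126.0
```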


The evaluation also included Llama3 70B training to validate a real AI workload beyond synthetic collectives. Here again, the two networks were nearly identical. The InfiniBand fabric delivered an average iteration time of 17,724 ms and throughput of 14.44 samples per second, while the H3C S9825-8C-G and S9855-32DH-G RoCE network achieved 17,680 ms and 14.48 samples per second. Overall, the report concludes that the H3C RoCE fabric provides performance and user experience comparable to InfiniBand for 64-GPU AI training. It positions the H3C Ethernet design as a viable alternative for large-scale AI deployments that want RoCE-based networking with near-InfiniBand results.
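To put the training numbers in perspective, the Python sketch below computes the relative gaps between the two fabrics from the reported figures. It also multiplies throughput by iteration time as a consistency check. The resulting samples-per-iteration value is an inference from the published numbers, not a batch size stated in the abstract.

```python
# Llama3 70B training results reported in the abstract.
ib_iter_ms, ib_samples_per_s = 17724, 14.44
roce_iter_ms, roce_samples_per_s = 17680, 14.48

# How much shorter the RoCE iteration is, relative to InfiniBand.
iter_delta_pct = (ib_iter_ms - roce_iter_ms) / ib_iter_ms * 100

# How much higher the RoCE throughput is, relative to InfiniBand.
tput_delta_pct = (roce_samples_per_s - ib_samples_per_s) / ib_samples_per_s * 100

# Consistency check: throughput * iteration time = samples per iteration.
# This is inferred from the reported figures, not stated in the source.
ib_samples_per_iter = ib_samples_per_s * ib_iter_ms / 1000
roce_samples_per_iter = roce_samples_per_s * roce_iter_ms / 1000

print(f"iteration time: RoCE {iter_delta_pct:.2f}% shorter")  # 0.25% shorter
print(f"throughput: RoCE {tput_delta_pct:.2f}% higher")       # 0.28% higher
print(round(ib_samples_per_iter), round(roce_samples_per_iter))  # 256 256
```

Both gaps are well under one percent, and the two fabrics imply the same number of samples per iteration, which is consistent with the report's description of the results as nearly identical.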


Switches used in this test:


  • H3C S9825-8C-G — Spine switch used in the H3C RoCE fabric for the 64-GPU test environment. It is part of H3C’s S9825-G series of high-performance, high-density 400GE/100GE Ethernet switches for high-end data centers and AIGC computing scenarios.  
  • H3C S9855-32DH-G — Leaf switch used in the H3C RoCE fabric, connecting the servers in the test topology. It is part of H3C’s S9855-G series, designed for high-density Ethernet switching in AI and large-scale data center environments.  
  • NVIDIA QM9700 — InfiniBand switch used in both spine and leaf roles in the comparison IB network for the 64-GPU performance tests.