64-GPU AI Computing Performance Comparison Test: H3C RoCE Network (S9825 Series Switch) vs. InfiniBand Network

Sponsor: New H3C Technologies Co., Ltd

Abstract

New H3C Technologies commissioned Tolly to compare the AI-computing performance of an H3C RoCE (RDMA over Converged Ethernet) network built with S9825 series switches against an InfiniBand (IB) network built with NVIDIA QM9700 switches. The project's main goal was to determine whether an Ethernet-based RoCE fabric could deliver NCCL collective-communication and large-language-model training performance comparable to InfiniBand in a 64-GPU environment.


The test environment used eight servers, each equipped with eight NVIDIA H20 GPUs and eight 400G NICs, for a total of 64 GPUs. Both the H3C RoCE fabric and the InfiniBand fabric used a multi-track topology with 400G x 8 links between spine and leaf layers and 400G x 2 links from leaf switches to each server. The RoCE network used H3C S9825-64D switches in both spine and leaf roles, while the IB network used NVIDIA QM9700 switches. Software components included Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3.  
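
The scale figures in the paragraph above can be checked with simple arithmetic. The sketch below is illustrative only: the implied leaf-switch count is an inference that assumes every server NIC connects to exactly one leaf switch and that "400G x 2" means two links per server per leaf. The abstract does not state the number of leaf switches, so treat that value as an assumption rather than a reported fact.

```python
# Figures taken from the test-environment description.
SERVERS = 8
GPUS_PER_SERVER = 8
NICS_PER_SERVER = 8
NIC_SPEED_GBIT = 400          # each NIC is 400G (gigabits per second)
LINKS_PER_SERVER_PER_LEAF = 2  # "400G x 2 links from leaf switches to each server"

total_gpus = SERVERS * GPUS_PER_SERVER
per_server_gbit = NICS_PER_SERVER * NIC_SPEED_GBIT

# ASSUMPTION (not stated in the abstract): each NIC attaches to exactly one
# leaf switch, so the leaf count is NICs per server / links per server per leaf.
implied_leaf_switches = NICS_PER_SERVER // LINKS_PER_SERVER_PER_LEAF

print(f"Total GPUs: {total_gpus}")
print(f"NIC bandwidth per server: {per_server_gbit} Gbit/s")
print(f"Implied leaf switches (assumption): {implied_leaf_switches}")
```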


Tolly first tested NCCL Ring-AllReduce performance across multiple message sizes. The H3C S9825 RoCE network achieved an average bus bandwidth of 232.384 GB/s, compared with 230.698 GB/s for the InfiniBand fabric, an average advantage of 0.73% for RoCE. Results were closely matched across nearly all message sizes: H3C RoCE was ahead at 8 MB, 16 MB, 64 MB, 128 MB, 512 MB, 2 GB, 4 GB, 8 GB, and 16 GB, while InfiniBand led slightly at a smaller set of sizes. Tolly characterizes this as essentially equivalent collective-communication performance in the tested 64-GPU scenario.
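
The 0.73% figure can be reproduced from the two reported averages. The sketch below also shows the AllReduce bus-bandwidth convention used by the public nccl-tests suite, where bus bandwidth equals algorithm bandwidth multiplied by 2(n-1)/n for n ranks. The abstract does not say which tool or convention produced its numbers, so the conversion back to algorithm bandwidth is an assumption for illustration, and the function names are hypothetical.

```python
def relative_advantage_pct(candidate: float, baseline: float) -> float:
    """Percentage by which `candidate` exceeds `baseline`."""
    return (candidate - baseline) / baseline * 100.0


def allreduce_bus_factor(n_ranks: int) -> float:
    """nccl-tests convention: busbw = algbw * 2 * (n - 1) / n for AllReduce."""
    return 2.0 * (n_ranks - 1) / n_ranks


# Average bus bandwidth reported in the abstract (GB/s).
ROCE_BUSBW_GB_PER_S = 232.384
IB_BUSBW_GB_PER_S = 230.698
N_GPUS = 64

advantage = relative_advantage_pct(ROCE_BUSBW_GB_PER_S, IB_BUSBW_GB_PER_S)
factor = allreduce_bus_factor(N_GPUS)

# ASSUMPTION: the reported values follow the nccl-tests busbw convention.
roce_algbw = ROCE_BUSBW_GB_PER_S / factor

print(f"RoCE average advantage: {advantage:.2f}%")
print(f"AllReduce bus factor for {N_GPUS} ranks: {factor}")
print(f"Implied RoCE algorithm bandwidth: {roce_algbw:.3f} GB/s")
```

Under that convention the factor approaches 2 as the rank count grows, which is why bus bandwidth is the usual figure for comparing fabrics independently of GPU count.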


The evaluation also included Llama3 70B training to validate real AI workload behavior beyond synthetic collectives. Here again, the two fabrics were extremely close. The InfiniBand network delivered an average iteration time of 17,724 ms and throughput of 14.44 samples per second, while the H3C S9825 RoCE network achieved 17,651 ms and 14.50 samples per second. Lower iteration time and higher samples per second indicate better performance, so the H3C RoCE network was marginally ahead in this run, by roughly 0.4% on both metrics.
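
The two training metrics are linked: throughput multiplied by iteration time gives the number of samples processed per iteration. The sketch below uses only the four reported values to compute the relative differences and that implied per-iteration sample count. The abstract does not state the global batch size, so the implied value is an inference, and the helper name is hypothetical.

```python
def implied_samples_per_iteration(iter_ms: float, samples_per_s: float) -> float:
    """Samples processed in one iteration = iteration time (s) * throughput."""
    return iter_ms / 1000.0 * samples_per_s


# Values reported in the abstract.
IB_ITER_MS, IB_SAMPLES_PER_S = 17_724.0, 14.44
ROCE_ITER_MS, ROCE_SAMPLES_PER_S = 17_651.0, 14.50

# Lower iteration time is better, so express it as a reduction versus IB.
iter_time_reduction_pct = (IB_ITER_MS - ROCE_ITER_MS) / IB_ITER_MS * 100.0
# Higher throughput is better, so express it as a gain versus IB.
throughput_gain_pct = (ROCE_SAMPLES_PER_S - IB_SAMPLES_PER_S) / IB_SAMPLES_PER_S * 100.0

# INFERENCE (not stated in the abstract): samples per iteration on each fabric.
ib_batch = implied_samples_per_iteration(IB_ITER_MS, IB_SAMPLES_PER_S)
roce_batch = implied_samples_per_iteration(ROCE_ITER_MS, ROCE_SAMPLES_PER_S)

print(f"Iteration time reduction (RoCE vs. IB): {iter_time_reduction_pct:.2f}%")
print(f"Throughput gain (RoCE vs. IB): {throughput_gain_pct:.2f}%")
print(f"Implied samples per iteration: IB {ib_batch:.1f}, RoCE {roce_batch:.1f}")
```

Both fabrics imply nearly the same number of samples per iteration, which is consistent with the two runs using the same workload configuration.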


Overall, the report concludes that the H3C S9825-based RoCE network provides AI collective-communication and Llama3 training performance comparable to InfiniBand under the same workload conditions. The results position H3C's high-density 400G Ethernet switch fabric as a viable alternative for large-scale AI deployments that want RoCE-based networking with performance close to that of InfiniBand.