64 GPU AI Computing Performance Comparison Test: H3C RoCE Network (S12500CR Series Switches) vs. InfiniBand Network
Abstract
New H3C Technologies commissioned Tolly to evaluate AI-computing performance in a 64-GPU environment, comparing an H3C RoCE network built with S12500CR series switches against an InfiniBand network using NVIDIA QM9700 switches. The project set out to determine whether this Ethernet-based RoCE design could match InfiniBand in NCCL collective-communication and Llama3 training performance while using a simpler topology in which a single H3C S12508CR switch directly connected all servers.
The H3C S12500CR is presented as a flagship switch family for intelligent computing, large-scale models, and HPC data center scenarios. According to the report, the platform uses a CLOS+ orthogonal architecture intended to provide a 100% lossless data channel for networks and AI computing, along with high-density, high-speed interfaces for non-blocking server access. In the tested RoCE topology, one H3C S12508CR switch connected eight servers directly, with each server linked by 8 x 400G Ethernet connections. The comparison IB fabric used NVIDIA QM9700 switches in a multi-track spine-leaf topology with 8 x 400G spine-to-leaf links and 2 x 400G leaf-to-server links. Each server was equipped with eight NVIDIA H20 GPUs and eight 400G NICs, and the software stack included Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3.
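The scale of the single-switch RoCE topology follows directly from the figures above. The short Python sketch below works through that arithmetic. The variable names are illustrative, and the aggregate figure is a nominal one-direction line-rate sum, not a measured value from the report.

```python
# Topology figures stated in the abstract for the RoCE test bed.
SERVERS = 8            # servers attached to the single S12508CR
GPUS_PER_SERVER = 8    # NVIDIA H20 GPUs per server
NICS_PER_SERVER = 8    # 400G NICs per server, one link each
LINK_GBPS = 400        # nominal line rate per link

total_gpus = SERVERS * GPUS_PER_SERVER
switch_ports = SERVERS * NICS_PER_SERVER
# Nominal server-facing capacity in one direction (assumption: simple sum of line rates).
aggregate_tbps = switch_ports * LINK_GBPS / 1000

print(f"GPUs: {total_gpus}")
print(f"400G switch ports in use: {switch_ports}")
print(f"Nominal server-facing capacity: {aggregate_tbps:.1f} Tbps")
```

With one NIC per GPU, the 64 GPUs map onto 64 switch ports, which is why a single chassis can replace the spine-leaf layers in this configuration.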
In NCCL Ring-AllReduce testing, the H3C S12500CR RoCE network achieved an average bus bandwidth of 232.60 GB/s, compared with 230.698 GB/s for the InfiniBand network, giving the H3C fabric a slight average advantage of 0.82%. Performance was closely matched across message sizes, with the RoCE fabric ahead at 8 MB, 16 MB, 64 MB, 128 MB, 256 MB, 512 MB, 2 GB, 8 GB, and 16 GB. The report characterizes this as essentially equivalent collective-communication performance in the tested 64-GPU scenario.
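The 0.82% figure can be reproduced from the two averages quoted above. The Python sketch below does so. It also shows how a bus-bandwidth number is typically derived for AllReduce, assuming the report follows the convention used by the nccl-tests benchmarks (algorithm bandwidth scaled by 2(n-1)/n). The function names are illustrative, and the abstract does not state which tool produced the figures.

```python
def relative_gain_pct(candidate: float, baseline: float) -> float:
    """Percentage by which candidate exceeds baseline."""
    return (candidate - baseline) / baseline * 100.0


def allreduce_bus_bandwidth(size_bytes: float, seconds: float, ranks: int) -> float:
    """Bus bandwidth in bytes/s under the nccl-tests AllReduce convention (assumed here)."""
    alg_bw = size_bytes / seconds           # algorithm bandwidth: payload / elapsed time
    return alg_bw * 2 * (ranks - 1) / ranks  # correction factor for AllReduce


# Average bus bandwidth values quoted in the abstract, in GB/s.
roce_gbps = 232.60
ib_gbps = 230.698

gain = relative_gain_pct(roce_gbps, ib_gbps)
print(f"RoCE advantage over InfiniBand: {gain:.2f}%")
```

The scaling factor makes bus bandwidth comparable across rank counts, so the two fabrics can be compared on hardware utilization instead of raw completion time.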
Tolly also evaluated Llama3 70B training to measure a real AI workload beyond synthetic collectives. The results were again nearly identical: the InfiniBand network delivered an average iteration time of 17,724 ms and throughput of 14.44 samples per second, while the H3C S12500CR RoCE network achieved 17,665 ms and 14.49 samples per second. Overall, the report concludes that the H3C S12508CR-based RoCE network provides AI performance and user experience comparable to InfiniBand for 64-GPU training workloads.
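To put "nearly identical" in numbers, the Python sketch below computes the relative differences between the two training results. As a consistency check, it also multiplies throughput by iteration time to estimate the samples processed per iteration. The batch-size estimate is an inference from the published figures, not a value stated in the abstract, and the names used are illustrative.

```python
# Llama3 70B training results quoted in the abstract.
results = {
    "InfiniBand": {"iter_ms": 17724, "samples_per_s": 14.44},
    "RoCE": {"iter_ms": 17665, "samples_per_s": 14.49},
}


def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


ib, roce = results["InfiniBand"], results["RoCE"]
iter_delta = pct_change(roce["iter_ms"], ib["iter_ms"])
tput_delta = pct_change(roce["samples_per_s"], ib["samples_per_s"])

# Samples per iteration implied by throughput x iteration time (inferred, not reported).
implied_batch = {
    name: r["samples_per_s"] * r["iter_ms"] / 1000 for name, r in results.items()
}

print(f"Iteration time change (RoCE vs. IB): {iter_delta:+.2f}%")
print(f"Throughput change (RoCE vs. IB): {tput_delta:+.2f}%")
for name, batch in implied_batch.items():
    print(f"{name}: about {batch:.1f} samples per iteration")
```

Both fabrics imply roughly the same number of samples per iteration, which suggests the two runs used the same batch configuration and that the sub-1% gap reflects the network alone.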
Switches used in this test:
- H3C S12508CR — RoCE switch used in the H3C test network, directly connecting all eight servers in the 64-GPU environment over 400G Ethernet links. It is part of H3C’s S12500CR flagship switch family for intelligent computing, large-scale models, and high-performance computing data centers.
- NVIDIA QM9700 — InfiniBand switch used in the comparison IB network in both spine and leaf roles within the multi-track topology.