64-GPU AI Performance Comparison Test and Automated O&M Test: H3C RoCE Network AD-DC Path Navigation Solution vs. Traditional ECMP Solution
Abstract
New H3C Technologies commissioned Tolly to evaluate the performance and operational benefits of its AD-DC path navigation solution in a 64-GPU AI environment. The project compared AD-DC against traditional ECMP load balancing under the same RoCE-based training scenario. It measured collective-communication bandwidth and validated automated operations and maintenance (O&M) capabilities for fault detection and root-cause analysis.
The test bed used eight servers, each equipped with eight NVIDIA H20 GPUs and eight 400G NICs, for a total of 64 GPUs. The network fabric used H3C S9827-128DH switches in both spine and leaf roles, with two spine switches and four leaf switches, all running a RoCE architecture. The software stack included Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3. Tolly compared two traffic-engineering methods on the same hardware: H3C’s AD-DC path navigation, which actively plans service traffic paths based on topology and traffic characteristics, and a traditional ECMP hash-based approach.
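To make the contrast between the two traffic-engineering methods concrete, here is a minimal sketch. It is not H3C's algorithm. It assumes a hypothetical leaf switch with 8 equal-cost uplinks carrying 16 equal-sized flows, and compares per-flow hash placement with a simple planner that always picks the least-loaded uplink. The function names `ecmp_place` and `planned_place`, the uplink count, and the flow tuples are illustrative assumptions, not details from the report.

```python
import zlib
from collections import Counter

NUM_UPLINKS = 8  # assumption: this summary does not state uplink counts


def ecmp_place(flows, num_uplinks=NUM_UPLINKS):
    """Hash each flow's identifying tuple onto an uplink, as static ECMP does."""
    load = Counter({u: 0 for u in range(num_uplinks)})
    for flow in flows:
        key = "|".join(map(str, flow)).encode()
        load[zlib.crc32(key) % num_uplinks] += 1
    return load


def planned_place(flows, num_uplinks=NUM_UPLINKS):
    """Assign each flow to the least-loaded uplink (a stand-in for global planning)."""
    load = Counter({u: 0 for u in range(num_uplinks)})
    for _ in flows:
        target = min(load, key=load.get)
        load[target] += 1
    return load


# Hypothetical flows: (src_ip, dst_ip, src_port, dst_port); 4791 is the RoCEv2 UDP port.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791) for i in range(16)]

ecmp_load = ecmp_place(flows)
planned_load = planned_place(flows)
print("ECMP max flows per uplink:   ", max(ecmp_load.values()))
print("Planned max flows per uplink:", max(planned_load.values()))
```

With few, large, long-lived flows, as in collective communication, hash collisions can leave some uplinks carrying several flows while others sit idle. The planner in this sketch keeps the per-uplink spread within one flow by construction.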
In NCCL Ring-AllReduce testing, AD-DC showed clear performance gains over ECMP at larger message sizes. Average bus bandwidth increased from 212.644 GB/s with ECMP to 233.262 GB/s with AD-DC, a 9.70% improvement overall. While the two methods were close at smaller message sizes, AD-DC's advantage grew as transfer sizes increased, reaching 10.43% at 512 MB, 11.29% at 1 GB, 17.32% at 8 GB, and 22.76% at 16 GB. Tolly presents this as evidence that globally coordinated path planning can improve traffic load sharing and reduce congestion in large-scale AI communication workloads.
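The headline figure can be checked directly from the two averages quoted above. The sketch below also shows how the nccl-tests suite relates bus bandwidth to algorithm bandwidth for AllReduce, using the factor 2(n-1)/n from that project's documentation. Applying it with n = 64 assumes one rank per GPU, which this summary does not state, and the helper names are illustrative.

```python
ECMP_AVG_BUSBW = 212.644  # GB/s, average bus bandwidth reported for ECMP
ADDC_AVG_BUSBW = 233.262  # GB/s, average bus bandwidth reported for AD-DC


def improvement_pct(baseline, candidate):
    """Relative gain of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def allreduce_busbw_factor(n_ranks):
    """nccl-tests convention for AllReduce: busbw = algbw * 2 * (n - 1) / n."""
    return 2.0 * (n_ranks - 1) / n_ranks


gain = improvement_pct(ECMP_AVG_BUSBW, ADDC_AVG_BUSBW)
factor = allreduce_busbw_factor(64)  # assumption: one rank per GPU, 64 ranks

print(f"Average busbw gain: {gain:.2f}%")  # prints 9.70%
print(f"AllReduce busbw/algbw factor at 64 ranks: {factor}")  # prints 1.96875
```

Because the factor depends only on the rank count, it is identical for both methods here, so the percentage gain in bus bandwidth equals the gain in algorithm bandwidth.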
The report also emphasizes operations and maintenance capabilities. Tolly verified that AD-DC can monitor key lossless-network metrics including optical module status, switch interface bandwidth utilization, interface byte rate, PFC pause-frame counts, and ECN metrics. According to the screenshots and table on pages 3 through 6, the system can also perform comparative analysis across time periods at both the fabric and switch-interface-queue levels, correlating changes in NIC throughput, PFC frame rate, and ECN packet activity to help identify faults and root causes during model training.
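The kind of cross-period comparison described here can be sketched as follows. This is not AD-DC's interface. The metric names, the sample values, and the 50% change threshold are all hypothetical, chosen only to illustrate correlating a throughput drop with a rise in PFC and ECN activity.

```python
from statistics import mean

# Hypothetical per-interval samples for one switch interface queue (not measured data).
baseline = {
    "nic_gbps": [380, 385, 378],
    "pfc_frames_per_s": [2, 1, 3],
    "ecn_pkts_per_s": [10, 12, 9],
}
suspect = {
    "nic_gbps": [180, 170, 175],
    "pfc_frames_per_s": [40, 55, 48],
    "ecn_pkts_per_s": [300, 280, 310],
}


def relative_change(before, after):
    """Relative change of the window mean, e.g. 0.5 means +50%."""
    b, a = mean(before), mean(after)
    return (a - b) / b if b else float("inf")


def compare_windows(before, after, threshold=0.5):
    """Return metrics whose mean moved by more than `threshold` between windows."""
    changes = {m: relative_change(before[m], after[m]) for m in before}
    return {m: c for m, c in changes.items() if abs(c) > threshold}


flagged = compare_windows(baseline, suspect)
# Heuristic (assumption): throughput down while PFC and ECN both rise suggests congestion.
congestion_suspected = (
    flagged.get("nic_gbps", 0) < 0
    and flagged.get("pfc_frames_per_s", 0) > 0
    and flagged.get("ecn_pkts_per_s", 0) > 0
)
for metric, change in flagged.items():
    print(f"{metric}: {change:+.0%}")
print("Congestion suspected:", congestion_suspected)
```

A production system would draw these series from switch and NIC telemetry and would need thresholds tuned per metric. The single threshold here is only for brevity.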
Overall, the report positions H3C’s AD-DC path navigation as both a performance optimization and an operational intelligence layer for RoCE-based AI fabrics. In this evaluation, it improved NCCL collective bandwidth over traditional ECMP while also providing richer observability for maintaining stability in large-scale model training environments.