University of East Anglia - Shop


AI Data Center Network Design and Technologies

Paperback by Subramaniam, Mahesh; Styszynski, Michal; Tambakuwala, Himanshu

WAS £51.99   SAVE £7.80

£44.19

ISBN: 9780135436288
Publication Date: 4 Apr 2026
Language: English
Publisher: Pearson Education (US)
Imprint: Addison Wesley
Pages: 384 pages
Format: Paperback
For delivery: Estimated despatch 6 - 7 May 2026

Description

Artificial intelligence is redefining the scale, architecture, and performance expectations of modern data centers. Training large ML models demands infrastructure capable of moving massive data sets through highly parallel, compute-intensive environments, where traditional data center designs simply can't keep up. AI Data Center Network Design and Technologies is the first comprehensive, vendor-agnostic guide to the design principles, architectures, and technologies that power AI training and inference clusters. Written by leading experts in AI data center design, this book helps engineers, architects, and technology leaders understand how to design and scale networks purpose-built for the AI era.

Inside, you'll learn how to:

- Architect scalable, high-radix network fabrics to support xPU (GPU, TPU)-based AI clusters
- Integrate lossless Ethernet/IP fabrics for high-throughput, low-latency data movement
- Align network design with AI/ML workload characteristics and server architectures
- Address challenges in cooling, power, and interconnect design for AI-scale computing
- Evaluate emerging technologies from the Ultra Ethernet Consortium (UEC) and their effect on future AI data centers
- Apply best practices for deployment, validation, and performance measurement in AI/ML environments

With broad coverage of both foundational concepts and emerging innovations, this book bridges the gap between network engineering and AI infrastructure design. It empowers readers to understand not only how AI data centers work but also why they must evolve.

Contents

Foreword
Preface
Acknowledgments
About the Authors

1 Wonders in the Workload
  What's New in AI Data Center Workloads
  The Life Cycle of an AI Model
  Training an AI Model
  Parallelism
  Job Completion Time (JCT)
  Tail Latency
  Summary
  Test Your Knowledge

2 "The Common-Man View" of AI Data Center Fabrics
  Training vs. Inference AI Data Centers
  InfiniBand vs. Ethernet for AI Training Data Centers
  Ethernet Hardware Switches and Advanced Software Features
  Handling Elephant Flows
  Load-Balancing Techniques
  Congestion Management and Mitigation Techniques
  Summary
  Test Your Knowledge

3 Network Design Considerations
  Background Introduction
  Training Data Center Architecture
  Rail-Optimized Design (ROD)
  Rail-Unified Design (RUD)
  Rack Design
  Scheduled Fabric
  Topologies
  Inference Data Center Architecture
  Multi-Planar Scale-Out Architectures
  Summary
  Test Your Knowledge
  References

4 Optics and Cable Management
  Scaling Optics for AI Clusters
  Challenges in Optical Innovation
  Packet Flow
  Transmission Modes
  Transceiver Types
  Cable and Connector Types
  Standards
  Further Innovations in Optics
  Summary
  Test Your Knowledge
  References

5 Thermal and Power Efficiency Considerations
  Thermal Footprints in AI Data Centers
  Airflow Options
  Liquid Cooling
  Summary
  Test Your Knowledge
  References

6 Efficient Load Balancing
  Per-Flow Load Balancing
  Per-Packet Load Balancing
  Load-Balancing Mechanism Comparison
  Summary
  Test Your Knowledge

7 RoCEv2 Transport and Congestion Management
  Congestion Points
  Explicit Congestion Notification (ECN)
  Data Center Quantized Congestion Notification (DCQCN)
  Source Flow Control (SFC)
  Congestion Signaling
  Summary
  Test Your Knowledge

8 IP Routing for AI/ML Fabrics
  Dynamic IP Routing Options
  eBGP Underlay for Three-Stage/Five-Stage Fabric for an AI Data Center
  Multi-tenancy for an AI/ML Cluster Data Center Network
  Microsegmentation and Multi-tenancy for an AI/ML Data Center
  Extending IP Routing to the Server
  Traffic Engineering in the AI Data Center Fabric
  Segment Routing and SRv6 for AI/ML Fabrics
  Summary
  Test Your Knowledge
  References

9 Storage Network Design and Technologies
  The AI Data Center Life Cycle and Storage Networks
  Storage Network Design Types
  Block, Object, and File Storage Systems
  NVMe-oF for Block-Level Access
  NVMe-o-RDMA/RoCEv2 State Machine
  High-Performance File Systems
  GPUDirect Storage
  Summary
  Test Your Knowledge
  References

10 AI Network Performance KPIs
  Significance of Performance Benchmarking
  MLCommons for AI Data Centers
  MLCommons Initiatives
  MLCommons Benchmarking Suites
  Benchmarking a Data Center for Machine Learning
  Summary
  Test Your Knowledge
  References

11 Monitoring and Telemetry
  Exploring Monitoring Options
  Network Monitoring in an AI/ML Data Center Network
  In-Band Flow Analyzer (IFA)
  Corrective Actions
  Summary
  Reference

12 Ultra Ethernet Consortium (UEC)
  UEC Developments and Working Groups
  UEC Key Terminology
  The UEC and Network Architectures
  A New Protocol Stack
  Data Plane: Packet Forwarding Options
  Packet Delivery Modes
  Congestion Management (CM) in the UEC Specification
  Packet Trimming and Fast Retransmissions
  Link Layer Reliability (LLR) Mechanism
  In-Network Collectives (INC) and xCCL
  Management and Orchestration
  Interoperability and Backward Compatibility
  Compliance and Certification
  UEC Challenges and Future Directions
  Comparing UEC to InfiniBand and RoCEv2
  Summary
  Test Your Knowledge
  References

13 Scale-Up Systems
  Key Building Blocks of Scale-Up Systems
  Scale-Up Ethernet Transport (SUE-T)
  Ultra Accelerator Link (UALink)
  Memory Coherence in Scale-Up Systems
  Scale-Up Systems: Key Differences and Similarities
  Summary
  Test Your Knowledge
  References

14 Conclusion
  DC Network Role for AI
  Caveats and Challenges
  Future Developments
  Final Remarks
  References

Appendix A Questions and Answers
Appendix B Acronyms
