Session: 02-05 Hardware Cooling I
Paper Number: 96972
96972 - Liquid Cooling Practice on Meta's AI Training Platform
Due to the continuous growth of AI accelerator chip power and heat flux, adopting advanced cooling technologies for AI platforms appears inevitable for hyperscale users. Liquid cooling is one of the more mature categories of advanced cooling technology and has been adopted in a variety of forms across the industry. However, not all liquid cooling solutions deliver high performance at reasonable cost and efficiency. In addition, it is not straightforward to strike the right balance of performance, reliability, serviceability, and scalability for a product, or to prepare the facility in line with a long-term strategy.
In this presentation, we introduce three liquid cooling solutions (Tide 1.0, Torrent 1.5, and Tide 1.5) based on Meta’s AI training platform (Zion) with eight Open Accelerator Modules (OAMs). Both closed-loop and open-loop liquid cooling options are studied. They reflect different design philosophies and vary in performance and complexity. Thermal simulation and optimization studies will be presented. The solutions are tested on dummy thermal test vehicles (TTVs) and on a real functional system, and a cooling capability forecast is provided. Considerations on reliability, leakage detection, and quality control will also be presented. Results show a good match among simulation, TTV tests, and real system tests. The resulting performance demonstrates a strong use case for liquid cooling solutions on upcoming AI platforms.
Presenting Author: Cheng Chen, Facebook Inc.