NVIDIA DGX POD

The DGX POD is an optimized data center rack containing up to nine DGX-1 servers, twelve storage servers, and three networking switches to support single and multi-node AI model training and inference using NVIDIA AI software.

DGX POD discount until 27.01.2019:

  • 2-3 units: 10%
  • 4-8 units: 15%
  • 9+ units: 20%

Discount applies to DGX-1 hardware only.
Please ask us for a quote!

There are several factors to consider when planning a DGX POD deployment to determine whether more than one rack is needed per DGX POD. This reference architecture is based on a single 35 kW high-density rack to make the most efficient use of costly data center floor space and to simplify network cabling. As GPU usage grows, the average power per server and per rack continues to increase. However, older data centers may not yet be able to support the power and cooling densities required; hence the three-zone design, which allows the DGX POD components to be installed in up to three lower-power racks.
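To illustrate why a single 35 kW rack can house the full DGX POD, and how the three-zone fallback works, here is a minimal power-budget sketch in Python; the per-component wattages are assumptions for illustration, not figures from the reference architecture.

```python
# Rough DGX POD rack power budget -- an illustrative sketch only.
# Per-component wattages are assumed values, not NVIDIA figures.
RACK_ENVELOPE_KW = 35.0

components = {
    "DGX-1 server":       {"count": 9,  "watts": 3200},  # assumed peak draw
    "storage server":     {"count": 12, "watts": 400},   # assumed
    "10 GbE switch":      {"count": 1,  "watts": 150},   # assumed
    "100 Gbps IB switch": {"count": 1,  "watts": 200},   # assumed
}

total_kw = sum(c["count"] * c["watts"] for c in components.values()) / 1000
print(f"Estimated rack load: {total_kw:.1f} kW of a {RACK_ENVELOPE_KW:.0f} kW envelope")

# If the facility cannot feed 35 kW to a single rack, the three-zone
# design spreads the nine DGX-1 servers over three lower-power racks:
zone_kw = 3 * components["DGX-1 server"]["watts"] / 1000
print(f"Per-zone load (3x DGX-1): {zone_kw:.1f} kW")
```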

The DGX POD is designed to fit within a standard-height 42 RU data center rack. A taller rack can be used to include redundant networking switches, a management switch, and login servers. This reference architecture uses an additional utility rack for login and management servers, and has been sized and tested with up to six DGX PODs. Larger configurations of DGX PODs can be defined by an NVIDIA solution architect.

A primary 10 GbE (minimum) network switch is used to connect all servers in the DGX POD and to provide access to a data center network. The DGX POD has been tested with an Arista switch with 48 x 10 GbE ports and 4 x 40 GbE uplinks. VLAN capabilities of the networking hardware are used to allow the out-of-band management network to run independently from the data network, while sharing the same physical hardware. Alternatively, a separate 1 GbE management switch may be used. While not included in the reference architecture, a second 10 GbE network switch can be used for redundancy and high availability. In addition to Arista, NVIDIA is working with other networking vendors who plan to release switch reference designs compatible with the DGX POD.

A 36-port Mellanox 100 Gbps switch is configured to provide four 100 Gbps InfiniBand connections to each of the nine DGX-1 servers in the rack. This provides the best possible scalability for multi-node jobs. In the event of a switch failure, multi-node jobs can fall back to the 10 GbE switch for communications. The Mellanox switch can also be configured in 100 GbE mode for organizations that prefer Ethernet networking. Alternatively, by configuring two 100 Gbps ports per DGX-1 server, the Mellanox switch can also be used by the storage servers.
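The port arithmetic behind these two cabling modes is simple; the sketch below (our illustration, with the layout assumed from the description above) shows how the 36 switch ports are consumed in each mode.

```python
# Port budget for the 36-port Mellanox 100 Gbps switch -- a minimal
# sketch of the two cabling modes described above.
SWITCH_PORTS = 36
DGX1_SERVERS = 9

# Mode 1: four InfiniBand links per DGX-1 for maximum multi-node bandwidth.
used = DGX1_SERVERS * 4
print(f"4 links/server: {used}/{SWITCH_PORTS} ports used, {SWITCH_PORTS - used} free")

# Mode 2: two links per DGX-1, leaving ports free for the storage servers.
used = DGX1_SERVERS * 2
print(f"2 links/server: {used}/{SWITCH_PORTS} ports used, "
      f"{SWITCH_PORTS - used} free for storage servers")
```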

With the DGX family of servers, AI and HPC workloads are fusing into a unified architecture. For organizations that want to use multiple DGX PODs to run cluster-wide jobs, a core InfiniBand switch is configured in the utility rack in conjunction with a second 36-port Mellanox switch in each DGX POD.

Technical Structure

  • Nine DGX-1 servers (9x 3 RU = 27 RU)

  • Twelve storage servers (12x 1 RU = 12 RU)

  • 10 GbE (min) storage and management switch (1 RU)

  • Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)

Storage architecture is important for optimized DL training performance. The DGX POD uses a hierarchical design with multiple levels of cache storage, using the DGX-1 SSD and additional cache storage servers in the DGX POD.

Long-term storage of raw data can be located on a wide variety of storage devices outside of the DGX POD, either on-premises or in public clouds. The DGX POD baseline storage architecture consists of standard NFS on the storage servers in conjunction with the local DGX SSD cache. Additional storage performance may be obtained by using the Ceph object-based file system or other caching file system on the storage servers.
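As an illustration of this cache hierarchy (a sketch only; the mount points and copy-through policy below are our assumptions, not NVIDIA's implementation), a training job could stage files from the NFS-backed storage servers onto the local DGX-1 SSD on first access:

```python
import shutil
from pathlib import Path

# Illustrative two-level read cache: local DGX-1 SSD in front of NFS.
# Both mount points are assumed paths, not taken from the reference design.
NFS_ROOT = Path("/mnt/nfs/datasets")   # storage servers (assumed mount)
SSD_CACHE = Path("/raid/cache")        # local DGX-1 SSD RAID (assumed mount)

def cached_path(relative: str) -> Path:
    """Return a local SSD path for a dataset file, staging it from NFS
    on first access so later epochs read at local SSD speed."""
    local = SSD_CACHE / relative
    if not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(NFS_ROOT / relative, local)  # warm the cache
    return local

# Example: the first epoch pulls from NFS; subsequent epochs hit the SSD.
# data_file = cached_path("imagenet/train/shard-0000.tar")
```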


The utility rack contains:

  • Login server that allows users to log in to the cluster and launch Slurm batch jobs (1 RU)

  • Three management servers running Kubernetes server components and other DGX POD management software (3x 1 RU = 3 RU)

  • Optional multi-POD 10 GbE storage and management network switches (2 RU)

  • Optional multi-POD clustering using a Mellanox 216-port EDR InfiniBand switch (12 RU)


The DGX POD is also designed to be compatible with a number of third-party storage solutions; see the reference architectures from DDN, NetApp, and Pure Storage for additional information.

NVIDIA is also working with other storage vendors who plan to release DGX POD compatible reference architectures. While based on the DGX-1 server, the DGX POD has been designed in a modular fashion to support the NVIDIA DGX-2™ server, which starts shipping in Q3 of 2018. Each of the three compute zones in a DGX POD is designed such that its three DGX-1 servers can be replaced with a single DGX-2 server.

DGX POD Installation and Management

Deploying a DGX POD is similar to deploying traditional servers and networking in a rack. However, given the high power consumption and corresponding cooling needs, server weight, and multiple networking cables per server, additional care and preparation are needed for a successful deployment. As with all IT equipment installation, it is important to work with the data center facilities team to ensure that the DGX POD environmental requirements can be met.

Design Guidelines by Area
Rack

• Supports 3000 lbs of static load
• Dimensions of 1200 mm depth x 700 mm width
• Structured cabling pathways per TIA 942 standard

Cooling

• Removal of 119,420 BTU/hr
• ASHRAE TC 9.9 2015 Thermal Guidelines “Allowable Range”

Power

• North America: A/B power feeds, each three-phase 400 V/60 A/33.2 kW (or three-phase 208 V/60 A/17.3 kW, with additional considerations for redundancy as required)
• International: A/B power feeds, each three-phase 380/400/415 V, 32 A (21-23 kW each); these figures are cross-checked in the sketch below
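As a sanity check on the figures above, here is a minimal sketch using the standard three-phase power formula P = √3 × V × I; the 80% derating applied to the North American feeds is our assumption (typical NEC continuous-load practice), not something stated in the source.

```python
import math

# Cross-check the feed capacities and cooling load quoted above.
def three_phase_kw(volts: float, amps: float, derating: float = 1.0) -> float:
    """Usable power of a three-phase feed, in kW."""
    return math.sqrt(3) * volts * amps * derating / 1000

# North American feeds, derated to 80% per NEC continuous-load practice
# (our assumption):
print(f"400 V/60 A feed: {three_phase_kw(400, 60, 0.8):.1f} kW")  # source quotes 33.2 kW
print(f"208 V/60 A feed: {three_phase_kw(208, 60, 0.8):.1f} kW")  # source quotes 17.3 kW

# International feeds at full rated current:
print(f"380-415 V/32 A feed: {three_phase_kw(380, 32):.1f}-"
      f"{three_phase_kw(415, 32):.1f} kW")                        # source quotes 21-23 kW

# Cooling load for the fully populated 35 kW rack (1 kW = ~3412 BTU/hr):
print(f"Cooling required: {35 * 3412:,} BTU/hr")                  # 119,420 BTU/hr
```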

NVIDIA AI Software

NVIDIA AI software running on the DGX POD provides a high-performance DL training environment for large-scale, multi-user AI software development teams. NVIDIA AI software includes the DGX operating system (DGX OS), cluster management and orchestration tools, NVIDIA libraries and frameworks, workload schedulers, and optimized containers from the NGC container registry. To provide additional functionality, the DGX POD management software includes third-party open-source tools recommended by NVIDIA that have been tested to work on DGX PODs with the NVIDIA AI software stack. Support for these tools can be obtained directly through third-party support structures.

NGC (NVIDIA GPU Cloud)

This content is taken from NVIDIA's "DGX POD Reference Design Whitepaper".

If you have any questions, please do not hesitate to contact us.

We are available for you at any time via e-mail or by phone at +49 (0) 40-300-672-20.