Hardware
Let’s take a look at NVIDIA DGX systems in detail, first from a hardware point of view.
| | DGX Station A100 | DGX A100 | DGX H100 | DGX B200 | DGX B100 |
|---|---|---|---|---|---|
| GPU | 4× NVIDIA A100 80 GB | 8× NVIDIA A100 40 GB/80 GB | 8× NVIDIA H100 80 GB | 8× NVIDIA B200 | 8× NVIDIA B100 |
| Performance (tensor operations) | 2.5 petaFLOPS | 5 petaFLOPS | 32 petaFLOPS (FP8) | 72 petaFLOPS | 56 petaFLOPS |
| Total GPU memory | 160/320 GB HBM2 | 320/640 GB HBM2 | 640 GB HBM3 | up to 1,536 GB HBM3e | up to 1,536 GB HBM3e |
| CPU | AMD Rome 7742, 2.25 GHz (64 cores) | 2× AMD Rome 7742, 2.25 GHz (64 cores) | 2× 56-core 4th Gen Intel Xeon Scalable | up to 2× Intel Xeon Platinum 8570, 2.1 GHz (56 cores) | up to 2× Intel Xeon Platinum 8570, 2.1 GHz (56 cores) |
| NVIDIA CUDA cores | 27,648 | 55,296 | 135,168 | TBA | TBA |
| NVIDIA Tensor cores | 1,728 (3rd gen) | 3,456 (3rd gen) | 4,224 (4th gen) | TBA | TBA |
| Multi-Instance GPU | up to 28 instances | up to 56 instances | up to 56 instances | up to 56 instances | up to 56 instances |
| System RAM | 512 GB | 1 TB | 2 TB | up to 4 TB | up to 4 TB |
| Storage (NVMe SSD) | OS: 1× 1.92 TB M.2; data: 1× 7.68 TB U.2 | OS: 2× 1.92 TB M.2; data: 4× 3.84 TB (15 TB) U.2 | OS: 2× 1.9 TB M.2; data: 8× 3.84 TB U.2 | OS: 2× 1.9 TB M.2; data: 8× 3.84 TB U.2 | OS: 2× 1.9 TB M.2; data: 8× 3.84 TB U.2 |
| GPU interconnect | NVLink | 6× NVIDIA NVSwitch (4.8 TB/s), 12× NVLink/GPU (600 GB/s) | 4× NVIDIA NVSwitch (7.2 TB/s), 18× NVLink/GPU (900 GB/s) | 36× NVLink/GPU (1.8 TB/s) | 36× NVLink/GPU (1.8 TB/s) |
| Network | 2× 10 GbE | 8× single-port ConnectX-6 (200 Gb/s HDR InfiniBand), 1× dual-port 200 Gb/s Ethernet | 8× single-port ConnectX-7 VPI (400 Gb/s InfiniBand / 200 Gb/s Ethernet), 2× dual-port ConnectX-7 VPI (400 Gb/s InfiniBand / 200 Gb/s Ethernet) | 4× OSFP for 8× single-port NVIDIA ConnectX-7 VPI (400 Gb/s InfiniBand/Ethernet), 2× dual-port QSFP112 NVIDIA BlueField-3 DPUs (400 Gb/s InfiniBand/Ethernet) | 4× OSFP for 8× single-port NVIDIA ConnectX-7 VPI (400 Gb/s InfiniBand/Ethernet), 2× dual-port QSFP112 NVIDIA BlueField-3 DPUs (400 Gb/s InfiniBand/Ethernet) |
| Power consumption | 1,500 W | 6.6 kW | ~10.2 kW max | ~14.3 kW max | ~12.2 kW max |
| Form factor | tower, water-cooled CPU and GPU | rack, 6U | rack, 8U | rack, 10U | rack, 10U |
All NVIDIA DGX systems are equipped with the latest and fastest accelerators. The standard configuration includes eight GPUs, allowing NVLink to get the most out of the architecture with speeds of up to 8 TB/s. All NVIDIA GPUs include dedicated Tensor Cores that efficiently accelerate the computations used in training machine learning and artificial intelligence models. Fast and efficient training is also aided by the large, fast graphics memory, which in the case of the DGX B200 can reach up to 1,536 GB of HBM3e.
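To make this concrete, here is a minimal PyTorch sketch of how Tensor Cores are typically engaged through automatic mixed precision. The toy model and sizes are illustrative assumptions, not part of the DGX stack itself, and a CUDA-capable GPU is assumed:

```python
# Minimal sketch: automatic mixed precision in PyTorch runs eligible
# matrix multiplications in FP16/BF16 on the Tensor Cores.
# The model and tensor sizes below are toy examples.
import torch

model = torch.nn.Linear(4096, 4096).cuda()       # toy model on the GPU
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()             # rescales loss to avoid FP16 underflow

x = torch.randn(64, 4096, device="cuda")
with torch.cuda.amp.autocast():                  # matmuls dispatch to Tensor Cores
    loss = model(x).square().mean()

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```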
How to effectively leverage Multi-GPU systems?
This is one of the questions we hear most often from our customers. There are several techniques you can use, as described in our webinar on the multi-GPU topic. You can also attend the Fundamentals of Deep Learning for Multi-GPUs workshop that we organize together with the NVIDIA Deep Learning Institute (DLI); a minimal sketch of one common technique follows below.
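As a taste of one such technique, below is a minimal data-parallel training sketch using PyTorch’s DistributedDataParallel. The toy model, sizes, and the `torchrun` launch line are illustrative assumptions, not the workshop’s exact material:

```python
# Minimal data-parallel training sketch. Launch with, for example:
#   torchrun --nproc_per_node=8 train.py
# so that each DGX GPU runs one process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")             # NCCL backend uses NVLink/NVSwitch
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # toy model for illustration
    model = DDP(model, device_ids=[local_rank]) # gradients sync across GPUs
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                         # triggers the all-reduce
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```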
Software
More interesting, however, is the already mentioned software package offered with NVIDIA DGX machines. All of them provide pre-installed, performance-tuned environments for machine learning (e.g. Caffe/Caffe2, Theano, TensorFlow, PyTorch, and MXNet) as well as an intuitive environment for data analysts (NVIDIA DIGITS). All of this is elegantly packaged in Docker containers. Such a tuned environment delivers up to 30% more performance for machine learning applications compared with the same applications deployed on plain, untuned NVIDIA hardware. The main advantage of the pre-installed environment is deployment speed, which is a matter of hours. The base DGX system image contains the Ubuntu operating system, NVIDIA GPU drivers, and a Docker environment for application containers downloadable from NVIDIA GPU Cloud (NGC). NVIDIA also supports running these Docker images in a Singularity environment.
NVIDIA GPU Cloud
NVIDIA GPU Cloud (NGC) is a repository of the most widely used frameworks for machine learning and deep learning, HPC applications, and visualization accelerated by NVIDIA GPU cards. Deploying these applications is a matter of minutes: copy the link to the appropriate Docker image from the NGC repository, transfer it to the DGX system, and download and run the Docker container. The individual development environments (the versions of all included libraries and frameworks, and the environment parameter settings) are updated and optimized by NVIDIA for deployment on DGX systems. https://ngc.nvidia.com/
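For illustration, here is a minimal sketch of that workflow, driven from Python for convenience. The image tag is only an example; current tags should be copied from the NGC catalog, and the NVIDIA container toolkit is assumed to be installed (it is part of the DGX software stack):

```python
# Minimal sketch of the NGC deployment workflow: pull an image from the
# NGC registry and run it with all GPUs visible. The image tag is an
# example; take the current one from https://ngc.nvidia.com/.
import subprocess

IMAGE = "nvcr.io/nvidia/pytorch:24.01-py3"  # example NGC image tag

# Pull the container image from the NGC registry.
subprocess.run(["docker", "pull", IMAGE], check=True)

# Run it with GPU access; --rm removes the container on exit.
subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all", IMAGE,
     "python", "-c", "import torch; print(torch.cuda.device_count())"],
    check=True,
)
```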
Support
A key strength of the NVIDIA solution is support for the entire system. Hardware support (in case any component fails) is a matter of course. Software support for the entire environment is critical when something does not work as intended; hundreds of NVIDIA developers are ready to help the customer. Support is part of an NVIDIA DGX purchase. It is available for 1, 3, or 5 years and can be extended after this period.
NVIDIA support includes:
- Hardware SLA (replacement parts): 1 day
- Software support: the DGX OS image and the full AI software stack, including the ML frameworks available on NGC
- Access to the Enterprise Support Portal
- 24×7 hotline
- Access to the NVIDIA Knowledge Base
- Access to the NVIDIA GPU Cloud (NGC) portal
- NVIDIA Cloud Management
- DGX software upgrades
- DGX software updates
- DGX firmware updates
Primary telephone support is provided by M Computers in the Czech language.
NVIDIA DGX systems deliver significantly better performance for data analytics and the training of AI algorithms thanks to the combination of tuned hardware, a rich software stack, and high-quality NVIDIA support covering both DGX hardware and software.
The difference between a tuned DGX system solution for fast and powerful machine learning and a DIY (Do It Yourself) variant is evident from the following video:
We delivered our first NVIDIA DGX-2 system to the IT4Innovations National Supercomputing Center at VSB in Ostrava, Czech Republic. You can see the details of the installation in a short reference video.
Reference Architectures
NVIDIA DGX systems represent enormous computing power. When designing an architecture, it is necessary to consider how they fit into the overall IT infrastructure and to tune that infrastructure for maximum performance. NVIDIA has introduced the NVIDIA DGX POD Reference Architecture, which includes networking and storage disk arrays. There you will find individual designs from the key storage vendors that describe the overall infrastructure solution for running ML and AI applications. An interesting technology for speeding up data transfer between the GPU and data storage is GPUDirect Storage.
NetApp ONTAP AI reference architecture
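For illustration, here is a minimal sketch of GPUDirect Storage from Python. It assumes the kvikio (cuFile) and CuPy packages are installed and that the underlying filesystem supports GDS; the file path is an example:

```python
# Minimal sketch: move data between NVMe storage and GPU memory via
# GPUDirect Storage, bypassing a CPU bounce buffer where GDS is supported.
# Assumes kvikio + CuPy are installed; the path is an example.
import cupy
import kvikio

a = cupy.arange(1_000_000, dtype=cupy.float32)   # data in GPU memory

# Write the GPU buffer straight to storage, then read it back.
with kvikio.CuFile("/data/sample.bin", "w") as f:
    f.write(a)

b = cupy.empty_like(a)
with kvikio.CuFile("/data/sample.bin", "r") as f:
    n = f.read(b)                                # returns bytes read
print(f"read {n} bytes directly into GPU memory")
```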
NVIDIA offers special programs on DGX systems and Tesla accelerators for EDU organizations and start-up companies. Thanks to the international collaboration between NVIDIA and IBM Global Financing, preferential financing in the form of an operating lease is available for DGX models.
The NVIDIA Deep Learning Institute (DLI) offers both online and hands-on training for developers, data scientists, and researchers looking to solve challenging problems with deep learning and accelerated computing.
Testing
To test the performance, and especially the speed of deployment, of ML and AI applications, we have an NVIDIA DGX Station available through the NVIDIA Tesla Test Drive program, as well as 2× NVIDIA Tesla V100 and an NVIDIA Tesla T4. If you are interested in our testing offer, please fill out this form.