Let’s take a look at the NVIDIA DGX in detail, first from a hardware point of view.
| | DGX Station A100 | DGX A100 | DGX H100 |
|---|---|---|---|
| GPU | 4× NVIDIA A100 80 GB | 8× NVIDIA A100 40 GB/80 GB | 8× NVIDIA H100 80 GB |
| Performance (tensor operations) | 2.5 petaFLOPS | 5 petaFLOPS | 32 petaFLOPS (FP8) |
| Total GPU memory | 160/320 GB HBM2 | 320/640 GB HBM2 | 640 GB HBM3 |
| CPU | AMD Rome 7742, 2.25 GHz (64 cores) | 2× AMD Rome 7742, 2.25 GHz (64 cores) | 2× 56-core 4th Gen Intel Xeon |
| NVIDIA CUDA cores | 27,648 | 55,296 | 135,168 |
| NVIDIA Tensor cores | 1,728 (3rd gen) | 3,456 (3rd gen) | 4,224 (4th gen) |
| Multi-Instance GPU | 28 instances | 56 instances | 56 instances |
| RAM | 512 GB | 1 TB | 2 TB |
| Storage | OS: 1× 1.92 TB M.2 NVMe SSD; data: 1× 7.68 TB U.2 NVMe SSD | OS: 2× 1.92 TB M.2 NVMe SSD; data: 4× 3.84 TB (15 TB) U.2 NVMe SSD | OS: 2× 1.9 TB M.2 NVMe; data: 8× 3.84 TB U.2 NVMe |
| GPU interconnect | NVLink | 6× NVIDIA NVSwitch (4.8 TB/s); 12× NVLink per GPU (600 GB/s) | 4× NVIDIA NVSwitch (7.2 TB/s); 18× NVLink per GPU (900 GB/s) |
| Network | 2× 10 GbE | 8× single-port ConnectX-6 (200 Gb/s HDR InfiniBand); 1× dual-port 200 Gb/s Ethernet | 8× single-port ConnectX-7 VPI (400 Gb/s InfiniBand / 200 Gb/s Ethernet); 2× dual-port ConnectX-7 VPI (400 Gb/s InfiniBand / 200 Gb/s Ethernet) |
| Power consumption | 1,500 W | 6.6 kW | ~10.2 kW max |
| Form factor | tower, water-cooled CPU and GPU | rack, 6U | rack, 8U |
All NVIDIA DGX systems feature the latest and fastest accelerators available today: the DGX Station A100 carries four NVIDIA A100 80 GB cards, and the DGX A100 as many as eight NVIDIA A100 40 GB or 80 GB accelerators. The main advantages of NVIDIA cards are specialized Tensor cores for AI workloads and large memory (up to 80 GB per accelerator) protected by ECC. The cards are also equipped with NVLink, an interface for high-bandwidth GPU-to-GPU communication that reaches transfer rates of up to 600 GB/s per GPU. In addition, the NVIDIA DGX A100 offers a super-powerful NVSwitch fabric, which provides a total throughput of up to 4.8 TB/s between the eight NVIDIA Ampere A100 cards.
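The system-level figures in the table above are simply the per-card numbers multiplied by the GPU count, which makes for a quick sanity check (the per-card values below are the published A100 and H100 SXM specifications, and NVLink bandwidth is links × 50 GB/s per link):

```python
# Per-card core counts (published NVIDIA specs for A100 and H100 SXM)
a100_cuda, a100_tensor = 6912, 432
h100_cuda, h100_tensor = 16896, 528

# DGX Station A100: 4× A100
print(4 * a100_cuda, 4 * a100_tensor)    # 27648 1728
# DGX A100: 8× A100
print(8 * a100_cuda, 8 * a100_tensor)    # 55296 3456
# DGX H100: 8× H100
print(8 * h100_cuda, 8 * h100_tensor)    # 135168 4224

# Per-GPU NVLink bandwidth: 12 links (A100) and 18 links (H100) at 50 GB/s each
print(12 * 50, 18 * 50)                  # 600 900
```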
How to effectively leverage Multi-GPU systems?
This is one of the questions we hear most often from our customers. There are several techniques you can use, as described in our webinar on the multi-GPU topic. You can also attend the Fundamentals of Deep Learning for Multi-GPUs workshop that we organize together with the NVIDIA Deep Learning Institute (DLI).
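The core technique behind most multi-GPU training setups is synchronous data parallelism: each GPU holds a replica of the model, computes gradients on its own shard of the batch, and the gradients are all-reduced (averaged) across GPUs over NVLink before every weight update. A minimal CPU-only sketch of that logic, with an illustrative linear model and made-up data (frameworks such as PyTorch DDP or Horovod do the same thing with real GPUs and NCCL):

```python
import numpy as np

def grad(w, X, y):
    # Gradient of mean squared error for a linear model y ≈ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # noiseless targets, for illustration

w = np.zeros(3)                      # shared model weights
n_gpus = 4                           # simulated GPU count
shards = list(zip(np.array_split(X, n_gpus), np.array_split(y, n_gpus)))

for step in range(200):
    # Each replica computes a local gradient on its own data shard...
    local_grads = [grad(w, Xs, ys) for Xs, ys in shards]
    # ...then an all-reduce averages them (this is the step NVLink/NVSwitch
    # accelerates on real hardware), and every replica applies the same update.
    g = np.mean(local_grads, axis=0)
    w -= 0.1 * g

print(np.round(w, 2))  # converges to the true weights [ 1.  -2.   0.5]
```

Because the shards are equal-sized, the averaged gradient is exactly the full-batch gradient, so the parallel run matches single-device training step for step.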
More interesting still, however, is the software stack shipped with NVIDIA DGX machines. All of them offer pre-installed, performance-tuned environments for machine learning (e.g. Caffe and Caffe2, Theano, TensorFlow, PyTorch, or MXNet) as well as an intuitive environment for data analysts (NVIDIA DIGITS), all elegantly packaged as Docker containers. Such a tuned environment provides up to 30% more performance for machine-learning applications than the same applications deployed on plain NVIDIA hardware. The main advantage of the pre-installed environment is deployment speed: a matter of hours. The base DGX system image contains the Ubuntu operating system, NVIDIA GPU drivers, and a Docker environment for application containers downloadable from NVIDIA GPU Cloud (NGC). NVIDIA also supports running these Docker images in a Singularity environment.
NVIDIA GPU Cloud
NVIDIA GPU Cloud (NGC) is a repository of the most widely used frameworks for machine learning and deep learning, HPC applications, and visualization accelerated by NVIDIA GPU cards. Deploying these applications is a matter of minutes: copy the link of the appropriate Docker image from the NGC repository to the DGX system, then download and run the Docker container. The individual development environments (the versions of all included libraries and frameworks, and the environment parameter settings) are updated and optimized by NVIDIA for deployment on DGX systems. https://ngc.nvidia.com/
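The pull-and-run workflow looks roughly like this (the image tag below is an illustrative example; current tags are listed in the NGC catalogue):

```shell
# Image reference copied from the NGC catalogue (the tag is an illustrative
# example; browse https://ngc.nvidia.com/ for current versions)
IMG="nvcr.io/nvidia/pytorch:24.01-py3"

# On the DGX system: download the container image and run it with all GPUs
# exposed. The guard lets the script no-op on machines without Docker.
if command -v docker >/dev/null 2>&1; then
    docker pull "$IMG"
    docker run --rm --gpus all "$IMG" \
        python -c "import torch; print(torch.cuda.device_count())"
fi
```

The `--gpus all` flag (NVIDIA Container Toolkit) makes every accelerator in the box visible inside the container, so the pre-tuned framework runs with no further setup.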
The strength of the NVIDIA solution is support for the entire system. Hardware support (in case any of the components fails) is a matter of course, and software support for the whole environment is critical when something does not work as intended; hundreds of NVIDIA developers are ready to help. Support is part of every NVIDIA DGX purchase: it is available for 1, 3, or 5 years and can be further extended after this time.
NVIDIA support includes:
- 1-day hardware SLA (replacement parts)
- software support for the DGX OS image and the full AI software stack, including the ML frameworks available on NGC
- access to the Enterprise Support Portal
- 24×7 hotline
- access to the NVIDIA Knowledgebase
- access to the NVIDIA GPU Cloud (NGC) portal
- NVIDIA Cloud Management
- DGX software upgrades
- DGX software updates
- DGX firmware updates
Primary telephone support is provided by M Computers in Czech.
NVIDIA DGX systems deliver much better performance for data analytics and the training of AI algorithms thanks to the combination of tuned hardware, a rich software stack, and the high quality of NVIDIA support, which covers both DGX hardware and software.
The difference between the tuned DGX system solution for fast and powerful machine learning and a DIY (Do It Yourself) variant is evident from the following video:
We delivered our first NVIDIA DGX-2 system to the IT4Innovations Supercomputing Center at VSB in Ostrava, Czech Republic. You can see the details of the installation in a short reference video.
NVIDIA DGX systems represent enormous computing power. When designing an architecture, it is necessary to consider how they fit into the overall IT infrastructure and how to tune it to achieve maximum performance. NVIDIA has introduced the NVIDIA DGX POD Reference Architecture, which includes networking and storage disk arrays. There you will find individual design proposals from the key storage vendors that describe the overall infrastructure solution for running ML and AI applications. GPUDirect Storage is an interesting technology for speeding up data transfer between the GPUs and the data storage.
NetApp ONTAP AI reference architecture
NVIDIA offers special programs for DGX systems and Tesla accelerators to EDU organizations and start-up companies. Thanks to the international collaboration between NVIDIA and IBM Global Financing, preferential financing in the form of operating leases is available for DGX models.
The NVIDIA Deep Learning Institute (DLI) offers both online and hands-on training for developers, data scientists, and researchers looking to solve challenging problems with deep learning and accelerated computing.
To let you test the performance, and especially the deployment speed, of ML and AI applications, we have an NVIDIA DGX Station available through the NVIDIA Tesla Test Drive program, as well as 2× NVIDIA Tesla V100 and an NVIDIA Tesla T4. If you are interested in our testing offer, please fill out this form.