Let’s take a look at the NVIDIA DGX in detail, first from a hardware point of view.
| Parameter | DGX A100 | DGX Station A100 |
| --- | --- | --- |
| GPUs | 8× NVIDIA A100 40 GB/80 GB | 4× NVIDIA A100 40 GB/80 GB |
| Performance (tensor operations) | 5 PetaFLOPS | 2.5 PetaFLOPS |
| GPU memory | 320/640 GB in total | 160/320 GB in total |
| CPU | 2× AMD Rome 7742, 2.25 GHz (64 cores) | 1× AMD Rome 7742, 2.25 GHz (64 cores) |
| NVIDIA CUDA cores | 55,296 | 27,648 |
| NVIDIA Tensor cores | 3,456 | 1,728 |
| Multi-Instance GPU | up to 56 instances | up to 28 instances |
| Storage (NVMe SSD) | 2× 1.92 TB M.2, 4× 3.84 TB (15 TB) U.2 | 1× 1.92 TB M.2, 7.68 TB U.2 |
| RAM | 1 TB | 512 GB |
| GPU interconnect | 6× NVIDIA NVSwitch, non-blocking, 4.8 TB/s | NVLink |
| Network | 8× single-port 200 Gb/s HDR InfiniBand, 1× dual-port 200 Gb/s Ethernet | — |
| Power consumption | 6.6 kW | 1,500 W |
| Form factor | rack, 6U | tower, water-cooled CPU and GPUs |
All NVIDIA DGX systems feature today's fastest accelerators: the DGX Station A100 contains four NVIDIA A100 GPUs with 40 GB or 80 GB of memory each, and the DGX A100 contains eight. The main benefits of the NVIDIA A100 are its specialized Tensor cores, which accelerate machine-learning workloads, and its large per-card memory (40 GB or 80 GB) protected by ECC. The cards are also equipped with NVLink, a high-bandwidth interface for card-to-card communication that reaches speeds of up to 600 GB/s. The DGX A100 additionally offers the powerful NVSwitch, which connects the eight A100 GPUs with 4.8 TB/s of bisection bandwidth in a non-blocking architecture.
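The aggregate figures in the table follow directly from the per-GPU specifications of the A100 (6,912 CUDA cores, 432 Tensor cores, and up to 7 MIG slices per card, per NVIDIA's published specs). A short sketch to verify the totals:

```python
# Per-GPU specifications of the NVIDIA A100 (public NVIDIA figures).
CUDA_CORES_PER_A100 = 6912
TENSOR_CORES_PER_A100 = 432
MIG_INSTANCES_PER_A100 = 7   # Multi-Instance GPU: up to 7 slices per card

def aggregate(num_gpus: int) -> dict:
    """Aggregate counts for a DGX system with `num_gpus` A100 cards."""
    return {
        "cuda_cores": num_gpus * CUDA_CORES_PER_A100,
        "tensor_cores": num_gpus * TENSOR_CORES_PER_A100,
        "mig_instances": num_gpus * MIG_INSTANCES_PER_A100,
    }

dgx_a100 = aggregate(8)       # DGX A100: 8 GPUs
dgx_station = aggregate(4)    # DGX Station A100: 4 GPUs
print(dgx_a100)     # {'cuda_cores': 55296, 'tensor_cores': 3456, 'mig_instances': 56}
print(dgx_station)  # {'cuda_cores': 27648, 'tensor_cores': 1728, 'mig_instances': 28}
```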
How to effectively leverage Multi-GPU systems?
This is one of the questions we hear most often from our customers. There are several techniques you can use, as described in our webinar on the multi-GPU topic, or you can attend the Fundamentals of Deep Learning for Multi-GPUs workshop that we organize together with the NVIDIA Deep Learning Institute (DLI).
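The most common of these techniques is data parallelism: each GPU processes a different shard of the batch, and the per-device gradients are averaged (an all-reduce, the operation NVLink and NVSwitch accelerate) before every replica applies the same weight update. A framework-free sketch of the idea, with plain Python standing in for the per-GPU work (in practice, tools such as PyTorch DistributedDataParallel or Horovod handle this):

```python
# Conceptual sketch of data-parallel training: each "device" computes a
# gradient on its own shard of the batch; an all-reduce averages the
# gradients so every replica applies the identical weight update.

def local_gradient(shard):
    """Toy per-device gradient: here simply the mean of the shard."""
    return sum(shard) / len(shard)

def all_reduce_mean(grads):
    """Average gradients across devices (the step NVLink/NVSwitch speed up)."""
    return sum(grads) / len(grads)

batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
num_devices = 4
shards = [batch[i::num_devices] for i in range(num_devices)]  # equal-sized shards

grads = [local_gradient(s) for s in shards]
global_grad = all_reduce_mean(grads)
print(global_grad)  # 4.5 — identical to the single-device gradient over the full batch
```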
What is even more interesting, however, is the software stack that NVIDIA DGX machines ship with. All of them offer pre-installed, performance-tuned environments for machine learning (e.g. Caffe and Caffe2, Theano, TensorFlow, PyTorch, and MXNet) as well as an intuitive environment for data analysts (NVIDIA DIGITS), all elegantly packaged as Docker containers. Such a tuned environment delivers up to 30% more machine-learning performance than the same applications deployed directly on plain NVIDIA hardware. The main advantage of the pre-installed environment is deployment speed: a matter of hours. The base DGX system image contains the Ubuntu operating system, NVIDIA GPU drivers, and a Docker environment for application containers downloadable from NVIDIA GPU Cloud (NGC). NVIDIA also supports running these Docker images under Singularity.
NVIDIA GPU Cloud
NVIDIA GPU Cloud (NGC) is a catalog of Docker images of the most widely used environments for developing machine-learning and deep-learning applications, HPC applications, and visualization accelerated by NVIDIA GPUs. Deploying one of these applications is then just a matter of copying the link to the appropriate Docker image, pulling it on the DGX system, and running it in a Docker container. The individual development environments (the versions of all included libraries and frameworks, and the environment parameter settings) are kept up to date and optimized by NVIDIA for deployment on DGX systems. https://ngc.nvidia.com/
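Pulling and running an NGC image boils down to two Docker commands. A sketch that assembles them; note the image tag below is only an illustrative placeholder, so browse the NGC catalog for a current one:

```python
# Build the `docker pull` / `docker run` command lines used to deploy an
# NGC container on a DGX system. The image tag is a hypothetical example;
# see https://ngc.nvidia.com/ for current tags.
IMAGE = "nvcr.io/nvidia/tensorflow:21.07-tf2-py3"  # illustrative tag only

def ngc_commands(image: str, workdir: str = "/workspace"):
    """Return (pull, run) command strings for the given NGC image."""
    pull = ["docker", "pull", image]
    run = [
        "docker", "run",
        "--gpus", "all",       # expose all DGX GPUs to the container
        "--ipc=host",          # shared memory for multi-process data loaders
        "-it", "--rm",
        "-v", f"{workdir}:{workdir}",
        image,
    ]
    return " ".join(pull), " ".join(run)

pull_cmd, run_cmd = ngc_commands(IMAGE)
print(pull_cmd)
print(run_cmd)
```

The `--gpus all` flag (provided by the NVIDIA Container Toolkit) is what makes the DGX accelerators visible inside the container.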
The strength of the NVIDIA solution is support for the entire system. Hardware support (in case any component fails) is a matter of course; software support for the whole environment is critical whenever something does not work as intended, and NVIDIA has hundreds of developers ready to help. Support is part of every NVIDIA DGX purchase; it is available for 1, 3, or 5 years and can be further extended after this time.
NVIDIA support includes:
- Hardware SLA: replacement parts within 1 day
- Software support for the DGX OS image and the full AI software stack, including the ML frameworks available on NGC
- Access to the Enterprise Support Portal
- 24×7 hotline
- Access to the NVIDIA Knowledgebase
- Access to the NVIDIA GPU Cloud (NGC) portal
- NVIDIA Cloud Management
- DGX software upgrades
- DGX software updates
- DGX firmware updates
Primary telephone support is provided by M Computers in Czech.
Applications performance comparison
GPU vs. CPU: Data analysis
A100 vs. V100: Model training
NVIDIA DGX systems deliver much better performance for data analytics and AI training thanks to the combination of tuned hardware, a rich software stack, and the high quality of NVIDIA support, which covers both the DGX hardware and software.
The difference between a tuned DGX system built for fast, powerful machine learning and a do-it-yourself (DIY) variant is evident from the following video:
We delivered our first NVIDIA DGX-2 system to the IT4Innovations Supercomputing Center at VSB in Ostrava, Czech Republic. You can see the details of the installation in the short reference video.
NVIDIA DGX systems represent huge computing power. When designing an architecture, it is necessary to consider how they fit into the overall IT infrastructure and how that infrastructure must be tuned to achieve maximum performance. NVIDIA has introduced the NVIDIA DGX POD Reference Architecture, which covers networking and storage arrays. There you will find individual designs from the key storage vendors describing a complete infrastructure solution for running ML and AI applications. GPUDirect Storage is an interesting technology for speeding up data transfers between the GPUs and the storage.
NetApp ONTAP AI reference architecture
NVIDIA offers special programs for DGX systems and Tesla accelerators for educational organizations and start-up companies. Thanks to the international collaboration between NVIDIA and IBM Global Financing, preferential financing in the form of operating leases is available for DGX models.
GPU Technology Conference (GTC)
The GTC (GPU Technology Conference) is held several times a year. It provides an opportunity to get acquainted with NVIDIA DGX systems and examples of their deployment, and to listen to visionary lectures by the most influential people in artificial intelligence and machine learning, including NVIDIA's charismatic CEO, Jensen Huang.
The NVIDIA Deep Learning Institute (DLI) offers both online and hands-on trainings for developers, data scientists, and researchers looking to solve challenging problems with deep learning and accelerated computing.
To test the performance and, above all, the speed of deploying ML and AI applications, we have an NVIDIA DGX Station available through the NVIDIA Tesla Test Drive program, as well as 2× NVIDIA Tesla V100 and an NVIDIA Tesla T4. If you are interested in our testing offer, please fill out this form.
M: 734 161 516