CEITEC BUT introduces new multi-GPU system for advanced machine learning applications
28. 5. 2024
The CEITEC research facility at Brno University of Technology (BUT) has introduced its newly installed NVIDIA DGX A100 and NVIDIA DGX H100 computing systems, which expand its capabilities in artificial intelligence research and applications. The systems combine two generations of NVIDIA DGX technology, providing exceptional computing capacity and flexibility for a variety of research and industrial applications.
Prof. Ing. Pavel Václavek, Ph.D., head of the Cybernetics and Robotics research group and coordinator of the Industrial Cybernetics, Instrumentation and Systems Integration research program at CEITEC BUT, highlights the use of the new systems in Digital Europe program projects such as EDIH-DIGIMAT, aimed at the digitalization and robotization of manufacturing companies, and AI TEF AI-MATTERS, a network of test environments for AI verification in the industrial sector.
“As part of our EDIH and TEF services, we give companies the opportunity to experiment with AI and to learn and test AI applications on cutting-edge systems that are part of a newly installed supercomputer,” explains Prof. Václavek. “This enables small and medium-sized businesses with up to 499 employees to use advanced technology at a 100% subsidised price. Our goal is also to integrate the DGX system with other technologies of our RICAIP Testbed Brno so that we can process data from production machines and robots in real time.”
The new NVIDIA DGX A100 and NVIDIA DGX H100 systems, each with eight interconnected GPU accelerators and a total of 640 GB of GPU memory, provide powerful tools for massively parallel computing, which is critical for processing large datasets derived from production data. The two compute nodes are interconnected by an InfiniBand network with transfer rates of up to 200 Gbps, ensuring fast and efficient communication between the systems. In addition to high performance, the systems provide a robust software layer, including a pre-installed and tuned machine learning environment that enables easy and rapid deployment. Another advantage is the direct connection to an online repository of the most widely used AI frameworks and libraries, which lets users download and run various software tools in container form and so accelerates the development and implementation of AI applications.
“Thanks to these systems, we can offer companies and our scientists access to state-of-the-art technologies, enabling faster and more efficient research,” adds Prof. Václavek. Following the installation of the campus 5G network, this is the next addition to the RICAIP Testbed Brno infrastructure in this area. CEITEC BUT thus confirms its position as a leading scientific institution in the research and use of state-of-the-art technologies to support science and industry.
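To put the quoted figures in perspective, the following minimal sketch works through the arithmetic: eight 80 GB GPUs give 640 GB per system, and a 200 Gbps link sets a lower bound on how long it takes to move that much data between the two nodes. The function name is illustrative, decimal units are assumed, and protocol overhead and real-world link efficiency are ignored, so actual transfers will take longer.

```python
def transfer_time_seconds(data_gigabytes: float, link_gigabits_per_s: float) -> float:
    """Ideal time to move data over a link, ignoring all protocol overhead."""
    data_gigabits = data_gigabytes * 8  # 1 byte = 8 bits
    return data_gigabits / link_gigabits_per_s


GPUS_PER_SYSTEM = 8
MEMORY_PER_GPU_GB = 80   # 80 GB per GPU, as listed in the specification table
LINK_GBPS = 200          # InfiniBand rate between the two nodes (upper bound)

total_gpu_memory_gb = GPUS_PER_SYSTEM * MEMORY_PER_GPU_GB

if __name__ == "__main__":
    seconds = transfer_time_seconds(total_gpu_memory_gb, LINK_GBPS)
    print(f"GPU memory per system: {total_gpu_memory_gb} GB")
    print(f"Ideal time to move it over one {LINK_GBPS} Gbps link: {seconds:.1f} s")
```

Under these assumptions, the full GPU memory of one system could be copied to the other in roughly half a minute at best.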
NVIDIA DGX systems are not just cutting-edge hardware; they also come with enhancements for easier infrastructure management and AI implementation. They feature a fine-tuned Docker environment and DGX OS, as well as the new NVIDIA Base Command tool, which enables efficient management of the entire infrastructure. This simplifies the deployment and implementation of AI applications for research and development teams.
The system also includes the NVIDIA AI Enterprise (NVAIE) software stack, which provides a complete set of tools for developing and optimizing AI applications. This combination of technologies facilitates and accelerates the process of developing and deploying AI solutions across the entire infrastructure.
Parameter | NVIDIA DGX H100 640 GB | NVIDIA DGX A100 640 GB |
---|---|---|
GPUs | 8× NVIDIA H100 SXM5 80 GB | 8× NVIDIA A100 SXM4 80 GB |
GPU memory | 640 GB total | 640 GB total |
CPU | 2× Intel Xeon Platinum 8480C (112 cores, 2.00 GHz) | 2× AMD EPYC 7742 (128 cores, 2.25 GHz) |
Performance (tensor operations) | 32 PetaFLOPS (FP8) | 5 PetaFLOPS (FP16) |
CUDA cores | 135 168 | 55 296 |
Tensor cores | 4 224 | 3 456 |
Multi-Instance GPU | 56 instances | 56 instances |
RAM | 2 TB | 2 TB |
Storage | OS: 2× 1.92 TB NVMe; data: 30 TB (8× 3.84 TB) NVMe | OS: 2× 1.92 TB NVMe; data: 30 TB (8× 3.84 TB) NVMe |
Network | 8× ConnectX-7 400 Gb/s InfiniBand; 4× ConnectX-7 200 Gb/s Ethernet | 8× ConnectX-7 200 Gb/s InfiniBand; 4× ConnectX-7 200 Gb/s Ethernet |
Max. power consumption | 10.2 kW | 6.5 kW |
Form factor | rack, 8U | rack, 6U |
Technical specification | Download datasheet | Download datasheet |
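The system-level totals in the table follow from per-GPU figures multiplied by the eight GPUs in each system. The sketch below reproduces that arithmetic. The per-GPU values (CUDA cores, Tensor cores, and up to seven Multi-Instance GPU partitions per GPU) are assumptions taken from NVIDIA's published GPU specifications rather than from this article, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    name: str
    cuda_cores: int         # per GPU (assumed from NVIDIA's published specs)
    tensor_cores: int       # per GPU (assumed from NVIDIA's published specs)
    memory_gb: int          # per GPU, as listed in the table
    max_mig_instances: int  # per GPU (assumed from NVIDIA's published specs)


H100_SXM5 = GpuSpec("H100 SXM5", 16_896, 528, 80, 7)
A100_SXM4 = GpuSpec("A100 SXM4", 6_912, 432, 80, 7)


def system_totals(gpu: GpuSpec, gpu_count: int = 8) -> dict:
    """Multiply per-GPU figures by the number of GPUs in one DGX system."""
    return {
        "cuda_cores": gpu.cuda_cores * gpu_count,
        "tensor_cores": gpu.tensor_cores * gpu_count,
        "memory_gb": gpu.memory_gb * gpu_count,
        "mig_instances": gpu.max_mig_instances * gpu_count,
    }


if __name__ == "__main__":
    for spec in (H100_SXM5, A100_SXM4):
        print(spec.name, system_totals(spec))
```

With these inputs the totals match the table rows for CUDA cores, Tensor cores, GPU memory, and Multi-Instance GPU instances.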
NVIDIA GPU Cloud (NGC) is a repository of the most widely used frameworks for machine learning and deep learning, HPC applications, and visualization accelerated by NVIDIA GPUs. Deploying these applications is a matter of minutes: copy the link to the appropriate Docker image from the NGC repository, transfer it to the DGX system, then download and run the Docker container.
The individual development environments – versions of all included libraries and frameworks, settings of environment parameters – are updated and optimized by NVIDIA for deployment on DGX systems. https://ngc.nvidia.com/
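As a rough illustration of the deployment steps described above, the sketch below assembles the two Docker commands involved: pulling an image from the NGC registry and starting a container with access to all GPUs. The helper names are hypothetical, and the repository and tag (`nvidia/pytorch`, `24.05-py3`) are only example values; the actual image link should be copied from the NGC catalog. The script prints the commands rather than executing them.

```python
import shlex

NGC_REGISTRY = "nvcr.io"  # host name of the NGC container registry


def ngc_image(repository: str, tag: str) -> str:
    """Build a full image reference such as nvcr.io/nvidia/pytorch:<tag>."""
    return f"{NGC_REGISTRY}/{repository}:{tag}"


def pull_command(image: str) -> list[str]:
    """Command that downloads the image to the local system."""
    return ["docker", "pull", image]


def run_command(image: str) -> list[str]:
    """Command that starts an interactive container with all GPUs visible."""
    return ["docker", "run", "--gpus", "all", "--rm", "-it", image]


# Example values only; copy the real image link from the NGC catalog.
image = ngc_image("nvidia/pytorch", "24.05-py3")

if __name__ == "__main__":
    print(shlex.join(pull_command(image)))
    print(shlex.join(run_command(image)))
```

Passing the resulting lists to `subprocess.run` would execute them on a machine with Docker and the NVIDIA container runtime installed.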
What most distinguishes DGX systems from bare-metal solutions is the software. All of them offer pre-installed and, above all, performance-tuned environments for machine learning (e.g. Caffe or Caffe2, Theano, TensorFlow, Torch or MXNet) and an intuitive environment for data analytics (NVIDIA DIGITS). All of this is packaged in Docker containers. These continuously updated containers can be downloaded from the NVIDIA GPU Cloud (NGC) website.
According to NVIDIA, such a tuned environment delivers 30% higher performance for machine learning applications than the same applications deployed on NVIDIA hardware alone. The main advantage of a pre-installed environment, however, is speed of deployment: a system can be fully operational within hours.
Another strength of the NVIDIA solution is support for the whole system. Fast hardware support in case any component fails is a matter of course.
Software support for the entire environment is critical if something does not work as intended. The customer has hundreds of developers ready to help. Support is included with the purchase of all NVIDIA DGX systems. It is available for 3-5 years and can be extended beyond this period.
To test the performance and especially the speed of deploying ML and AI applications, we have not only NVIDIA H100, A30, A16 and other accelerators, but also demo licenses for GPU virtualization (vGPU) and a software environment for easy deployment of AI applications, NVIDIA AI Enterprise (NVAIE). If you are interested in our testing offer, please fill out this form.