In this hardware section, we’ll explore the components that go into building a high-performance computing (HPC) system for machine learning. We’ll refer to this process as DIY (do-it-yourself) or BIY (build-it-yourself). It should come as no surprise that building your own compute infrastructure is cheaper than renting it in the cloud and delivers better performance. However, we’d add the caveat that this applies only to training ML models, not inference. If ML inference needs to capture data from users around the world, a combination of DIY and cloud works best.
- Training: DIY
- Inference: DIY + Cloud
When it comes to training machine learning models, performance is paramount. Aside from optimizations that can be done in software, performance is also highly dependent on using the right combination of hardware components. For some, selecting that combination can feel like rocket science, since there are hundreds of motherboards, CPUs, enclosures, NICs, and other parts to choose from.
One of the most important components in the HPC system is the GPU. It enables parallel processing on a massive scale, where calculations such as matrix multiplication are performed in the tensor cores. Popular frameworks such as TensorFlow and PyTorch support hardware from the major manufacturers, including Nvidia, AMD, and Intel, as well as up-and-comers like Graphcore. The other hardware components are just as important, because together they work as a unified system.
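To make the GPU’s role concrete, here is a minimal sketch in plain Python of the matrix multiplication at the heart of most ML workloads. Each output cell is an independent dot product, which is exactly why a GPU, with thousands of cores, can compute them in parallel instead of one at a time as this loop does:

```python
def matmul(a, b):
    """Naive matrix multiplication: every output cell is an
    independent multiply-accumulate, so a GPU can compute all
    cells of the result in parallel."""
    inner, cols = len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

In practice a framework dispatches this work to the accelerator for you; for example, `torch.matmul` runs on the GPU when its inputs live on a CUDA device.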
Different ML use cases require slightly different CPU, memory, and storage configurations. For example, reinforcement learning typically demands a more capable CPU than deep learning, since much of the environment simulation runs on the CPU rather than the GPU. Here is a list of the components we’ll explore in this section.
- Rack Enclosures
- Storage Hardware
- Storage Software Stack