According to IDC, the global Ethernet switch market generated $6.7B in Q1 2021, with 100GbE switches making up 24.5% of the market. The industry has been highly competitive for more than a decade, which is good news for buyers of enterprise switching products, since there are plenty of options at various price points. That abundance, in turn, makes selecting the ideal switch for a given ML use case complicated. Not only are there dozens of switch manufacturers (name-brand and white-box) to choose from, but identifying the right combination of price, performance, and features can be overwhelming for anyone who is not a network engineer.
There are literally hundreds, if not thousands, of price/performance/feature combinations available. Therefore, we’re going to narrow the field to one category of switches in order to find the ideal switch for the ML use case. Our first requirement: we are interested in data center switches, not consumer products. Consumer networking products are widely available at rock-bottom prices. Simply plug them in and they’ll be up and running without any configuration. That’s not our use case. We’re also not interested in campus and branch office switches; that’s a whole other ball game.
The data center switch market has several established players and some startups that are pushing the envelope. Products in this category share similar features: every port is configurable, the switch is fully manageable, VLANs are supported, specialized ASICs let each port run at line rate, and much more. Some of the popular vendors are listed below, although the list does not include every switch vendor. We’re also big fans of white-box, no-name switches.
- Arista Networks
- Bare-Metal: Supermicro, Cumulus, Delta Networks, Quanta, and Edgecore Networks
The famous 3-tier architecture, composed of the Access, Distribution, and Core layers, ruled the networking landscape for decades, and Cisco was the 800 lb. ruler of it all. Switches were developed for each layer. Devices like computers and workstations connected to the access layer, access switches connected to distribution switches, and so on.
Then came the hyperscalers like Facebook, Apple, AWS, and GCP. The 3-tier network architecture, along with its specialized switches, was not a good fit for their requirements. First of all, the proprietary switches did not perform well under high traffic loads. Next, troubleshooting proprietary software was not an option. Ultimately, they had no control over these devices, so they developed their own software stack and, in some cases, their own hardware, which manufacturers built to their specifications. In the end, they forced the network industry to change for the better by promoting open source software and bare-metal switching hardware.
During this process, the hyperscalers embraced the 2-tier network architecture called spine and leaf, and the rest of the data center industry followed suit. The concept behind spine and leaf was actually invented by Charles Clos in 1952, although that’s not what he called it. Facebook has published volumes on its network fabric architecture. The two diagrams below display the simplicity and sophistication of the 2-tier architecture. One of the key features of the spine and leaf network, unlike the 3-tier network, is its strong support for east-west traffic.
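To make the east-west point concrete, here is a minimal Python sketch of a spine and leaf fabric. It assumes the textbook layout in which every leaf switch connects to every spine switch; the switch names and the counts of 4 spines and 8 leaves are hypothetical and chosen only for illustration.

```python
from itertools import product


def build_leaf_spine(num_spines, num_leaves):
    """Return the set of (leaf, spine) links; every leaf connects to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    return set(product(leaves, spines))


def paths_between(links, src_leaf, dst_leaf):
    """List the leaf -> spine -> leaf paths between two different leaves."""
    spines_src = {spine for (leaf, spine) in links if leaf == src_leaf}
    spines_dst = {spine for (leaf, spine) in links if leaf == dst_leaf}
    return [(src_leaf, spine, dst_leaf) for spine in sorted(spines_src & spines_dst)]


# Hypothetical fabric: 4 spines, 8 leaves.
links = build_leaf_spine(num_spines=4, num_leaves=8)
paths = paths_between(links, "leaf0", "leaf5")

print(len(links))  # total leaf-to-spine links in the fabric
print(len(paths))  # equal-cost paths between the two leaves
```

In this model, traffic between servers on any two leaves always crosses exactly one spine, and there is one such path per spine. That uniform distance and path diversity is what makes server-to-server (east-west) traffic predictable in this design.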
Building ML Infrastructure
Building the right machine learning infrastructure inside one or more racks will depend on several factors. Let’s say a company has set aside a fixed budget to start an AI initiative to improve product demand forecasting, because consistent shortages are hurting the business and helping the competition. The first question facing an engineer is build vs. buy. Buying ML services from a cloud provider like AWS makes sense in certain use cases, such as when the budget is effectively unlimited. In this scenario, the company decides that building is the best option, because costs are capped and running thousands of additional experiments beyond what was originally planned costs no additional capital aside from labor.
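The build vs. buy reasoning can be sketched as a simple break-even calculation. The function below is a rough illustration only: the dollar figures are placeholders rather than real quotes, and the model deliberately ignores labor, power, and colocation fees.

```python
import math


def break_even_experiments(build_capex, cloud_cost_per_experiment):
    """Smallest number of experiments at which cumulative cloud spend
    meets or exceeds the fixed up-front cost of building."""
    if cloud_cost_per_experiment <= 0:
        raise ValueError("cloud_cost_per_experiment must be positive")
    return math.ceil(build_capex / cloud_cost_per_experiment)


# Placeholder numbers, not vendor pricing.
n = break_even_experiments(build_capex=250_000, cloud_cost_per_experiment=50)
print(n)
```

Past that number of experiments, every additional run on owned hardware adds nothing to the capital cost, while the cloud bill keeps growing with each run.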
When designing a machine learning infrastructure that will sit in a third-party data center, the devices below are the bare minimum needed for training purposes.
- Firewall (like Palo Alto)
- Switch (Top of Rack)
- Storage Array
- Machine Learning Servers
- Router (get a port from the data center)
- SD-WAN Provider (like Aryaka, Cato Networks, etc.) for a point-to-point connection
Since the setup will be for training and not inferencing, the infrastructure can be locked down completely from the public. All ports should be closed in the firewall except the one for the VPN. For Palo Alto Networks, that is TCP port 443. However, it is highly recommended to use an SD-WAN provider like Aryaka or Cato so that they lock down the ML infrastructure and only allow access to authorized users. The setup above is only for training. Inferencing is a whole other ball game.
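The "everything closed except the VPN port" policy is a default-deny rule set. The Python sketch below models that idea in a vendor-neutral way. It is not Palo Alto configuration syntax, and the Rule class, INBOUND_RULES list, and is_allowed function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str  # "tcp" or "udp"
    port: int
    action: str    # "allow" or "deny"


# Hypothetical inbound policy: only the VPN port named in the text is open.
INBOUND_RULES = [Rule("tcp", 443, "allow")]


def is_allowed(protocol, port, rules=INBOUND_RULES):
    """Return True only if an explicit rule allows the traffic (first match wins)."""
    for rule in rules:
        if rule.protocol == protocol.lower() and rule.port == port:
            return rule.action == "allow"
    return False  # default deny: anything not listed is dropped


print(is_allowed("tcp", 443))  # VPN traffic
print(is_allowed("tcp", 22))   # SSH from the public internet
```

The key design choice is the final `return False`: traffic is rejected unless a rule explicitly permits it, so forgetting to list a port fails closed rather than open.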
Based on the above example, a network switch with 24 to 32 ports should suffice. In terms of performance, if the budget allows, 100 Gb/s per port is the dream. In this configuration, the ML servers will be able to communicate with one another and with the storage array at 100 Gb/s. If the budget doesn’t allow it, the next step down is 50 Gb/s per port, then 25 Gb/s per port. Here are some prices: