Meta gets into the supercomputer game with its AI Research SuperCluster – TechCrunch
There’s a global competition to build the biggest, most powerful computers on the planet, and Meta (AKA Facebook) is about to jump into the fray with the “AI Research SuperCluster” or RSC. Once fully operational, it could well be among the ten fastest supercomputers in the world, which it will use for the massive calculations needed for modeling language and computer vision.
Large AI models, of which OpenAI’s GPT-3 is probably the best known, are not assembled on laptops and desktops; they are the end product of weeks and months of calculations backed by high-performance computing systems that eclipse even the most advanced gaming rigs. And the sooner you can complete the training process of a model, the sooner you can test it and produce a new and better one. When training times are measured in months, it really matters.
RSC is up and running, and the company’s researchers are already putting it to work — with user-generated data, it must be said, though Meta was careful to note that the data is encrypted until training time and that the entire installation is isolated from the wider Internet.
The team that built RSC is justifiably proud to have succeeded almost entirely remotely — supercomputers are surprisingly physical constructs, with basic considerations like heat, cabling and interconnect affecting performance and design. Exabytes of storage may sound abstract, but they have to physically exist somewhere on-site, accessible within microseconds. (Pure Storage is also proud of the setup it has in place for this.)
RSC currently comprises 760 Nvidia DGX A100 systems with a total of 6,080 GPUs, which Meta says should put it roughly on par with Perlmutter at Lawrence Berkeley National Lab, currently the fifth most powerful supercomputer in operation according to the long-running Top500 ranking. (Fugaku in Japan holds the #1 spot by a wide margin, in case you were wondering.)
That may change as the company continues to build out the system. Meta predicts RSC will ultimately be about three times as powerful, which would theoretically put it in contention for third place.
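The GPU count checks out against the hardware: each Nvidia DGX A100 chassis houses eight A100 GPUs, so the arithmetic is simply:

```python
# Back-of-the-envelope check of the GPU count cited above.
# Each Nvidia DGX A100 system houses 8 A100 GPUs.
systems = 760
gpus_per_system = 8
total_gpus = systems * gpus_per_system
print(total_gpus)  # 6080, the total Meta reports
```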
There is a caveat there, though. Systems like the second-place Summit at Oak Ridge National Lab are used for research where precision is paramount. If you are simulating the molecules in a region of Earth’s atmosphere at unprecedented levels of detail, you need to carry every calculation out to a large number of decimal places, and that makes those calculations far more computationally expensive.
Meta explained that AI applications don’t require the same degree of precision, because the results don’t hinge on that thousandth of a percent: inference operations end up producing things like “90% certainty this is a cat,” and whether that number is 89% or 91% makes little difference. The hard part is reaching 90% certainty across a million objects or sentences rather than a hundred.
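The precision point can be made concrete. TensorFloat-32 keeps standard float32’s range but only 10 of its 23 mantissa bits; the sketch below (my illustration, not Meta’s code, with made-up classifier confidences) emulates that truncation and shows the top prediction comes out the same either way:

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 by keeping only the top 10 of float32's 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # zero out the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Hypothetical classifier confidences for three labels.
labels = ["cat", "dog", "fox"]
probs = [0.90, 0.06, 0.04]
coarse = [to_tf32(p) for p in probs]

# The truncated top score differs only around the fourth decimal place...
print(coarse[0])  # ~0.8999
# ...so the winning label is the same at either precision.
print(labels[probs.index(max(probs))], labels[coarse.index(max(coarse))])
```

The trade-off is exactly the one described above: you give up decimal places you didn’t need in exchange for much cheaper arithmetic.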
This is an oversimplification, but the upshot is that RSC, running in TensorFloat-32 math mode, can squeeze more FLOP/s (floating-point operations per second) out of each core than more precision-oriented systems. In this case it’s up to 1,895,000 teraFLOP/s, or 1.9 exaFLOP/s: more than 4x Fugaku. Does that matter, and if so, to whom? If anyone, it might matter to the Top500 maintainers, so I’ve asked them whether they have any comment. Either way, RSC will be among the fastest computers in the world, and perhaps the fastest operated by a private company for its own purposes.
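That headline number also roughly survives a sanity check. This is my arithmetic, not Meta’s: assuming Nvidia’s published A100 peak of about 312 teraFLOP/s in TF32 mode (with sparsity enabled), the cluster’s aggregate peak lands very close to the cited figure:

```python
# Rough cross-check of RSC's quoted 1,895,000 teraFLOP/s.
# Assumes Nvidia's published A100 peak of ~312 teraFLOP/s in TF32
# with sparsity; an estimate, not Meta's own accounting.
gpus = 6080
tf32_teraflops_per_gpu = 312
print(gpus * tf32_teraflops_per_gpu)  # 1896960, close to the cited figure
```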