GTC, NVIDIA’s flagship event, is always a source of announcements around all things AI, and the fall 2021 edition is no exception. NVIDIA CEO Jensen Huang’s keynote emphasized what NVIDIA calls the Omniverse: its virtual world simulation and collaboration platform for 3D workflows, which brings the company’s technologies together.
Based on what we’ve seen, we would describe the Omniverse as NVIDIA’s take on the metaverse. You will be able to read more about the Omniverse in Stephanie Condon and Larry Dignan’s coverage here on ZDNet. What we can say is that for something like this to work, a confluence of technologies is indeed needed.
So let’s go through some of the updates in NVIDIA’s technology stack, focusing on components such as large language models (LLMs) and inference.
NeMo Megatron, NVIDIA’s open source large language model platform
NVIDIA unveiled what it calls the NVIDIA NeMo Megatron framework for training language models. In addition, NVIDIA is making available the Megatron LLM, a model with 530 billion parameters that can be trained for new domains and languages.
Bryan Catanzaro, Vice President of Applied Deep Learning Research at NVIDIA, said that “building large language models for new languages and domains is likely the largest supercomputing application yet, and now these capabilities are within reach for the world’s enterprises”.
While LLMs are certainly seeing lots of traction, and a growing number of applications, this particular offering’s utility warrants some scrutiny. First off, training LLMs is not for the faint of heart, and requires deep pockets. It has been estimated that training a model such as OpenAI’s GPT-3 costs around $12 million.
OpenAI has partnered with Microsoft and made an API around GPT-3 available in order to commercialize it. And there are a number of questions to ask around the feasibility of training one’s own LLM. The obvious one is whether you can afford it, so let’s just say that Megatron is not aimed at the enterprise in general, but a specific subset of enterprises at this point.
The second question would be – what for? Do you really need your own LLM? Catanzaro notes that LLMs “have proven to be flexible and capable, able to answer deep domain questions, translate languages, comprehend and summarize documents, write stories and compute programs”.
We would not go as far as to say that LLMs “comprehend” documents, for example, but let’s acknowledge that LLMs are sufficiently useful, and will keep getting better. Huang claimed that LLMs “will be the biggest mainstream HPC application ever”.
The real question is – why build your own LLM? Why not use GPT-3’s API, for example? Competitive differentiation may be a legitimate answer to this question. The cost to value function may be another one, in another incarnation of the age-old “buy versus build” question.
In other words, if you are convinced you need an LLM to power your applications, and you plan to use GPT-3 or any other LLM with similar usage terms often enough, it may be more economical to train your own. NVIDIA mentions use cases such as building domain-specific chatbots, personal assistants and other AI applications.
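The “buy versus build” trade-off can be made concrete with a back-of-the-envelope break-even calculation. The sketch below is illustrative only: the $12 million training figure is the GPT-3 estimate cited above, while the per-token API price and self-hosting cost are hypothetical placeholders, not published prices.

```python
def break_even_tokens(training_cost_usd: float,
                      api_price_per_1k_tokens: float,
                      own_cost_per_1k_tokens: float) -> float:
    """Return the number of tokens at which cumulative API spend equals
    the one-off training cost plus the cost of serving your own model."""
    saving_per_1k = api_price_per_1k_tokens - own_cost_per_1k_tokens
    if saving_per_1k <= 0:
        # Self-hosting never pays off if it costs as much per token as the API.
        return float("inf")
    return training_cost_usd / saving_per_1k * 1000


TRAINING_COST_USD = 12_000_000    # estimate for a GPT-3-class model (from the article)
API_PRICE_PER_1K_TOKENS = 0.06    # hypothetical API price, in USD
OWN_COST_PER_1K_TOKENS = 0.01     # hypothetical self-hosted inference cost, in USD

tokens = break_even_tokens(TRAINING_COST_USD,
                           API_PRICE_PER_1K_TOKENS,
                           OWN_COST_PER_1K_TOKENS)
print(f"Break-even at roughly {tokens:.2e} tokens")
```

With these made-up prices the break-even point sits at about 240 billion tokens, which illustrates why “often enough” is the operative phrase: only very heavy users would recoup a from-scratch training run.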
To build such applications, it would make more sense to start from a pre-trained LLM and tailor it to your needs via transfer learning, rather than train one from scratch. NVIDIA notes that NeMo Megatron builds on advancements from Megatron, an open-source project led by NVIDIA researchers studying efficient training of large transformer language models at scale.
The company adds that the NeMo Megatron framework enables enterprises to overcome the challenges of training sophisticated natural language processing models. So, the value proposition seems to be — if you decide to invest in LLMs, why not use Megatron? Although that sounds like a reasonable proposition, we should note that Megatron is not the only game in town.
Recently, EleutherAI, a collective of independent AI researchers, open-sourced its 6 billion parameter GPT-J model. In addition, if you are interested in languages beyond English, we now have a large European language model fluent in English, German, French, Spanish, and Italian by Aleph Alpha. Wudao is a Chinese LLM which, with 1.75 trillion parameters, is also the largest LLM, and HyperCLOVA is a Korean LLM with 204 billion parameters. Plus, there are always other, slightly older and smaller open source LLMs such as GPT-2 or BERT and its many variations.
Aiming at AI model inference addresses total cost of ownership and operation
One caveat is that when it comes to LLMs, bigger (as in having more parameters) does not necessarily mean better. Another is that even with a basis such as Megatron to build on, LLMs are expensive beasts both to train and to operate. NVIDIA’s offering is set to address both of these aspects by specifically targeting inference, too.
Megatron, NVIDIA notes, is optimized to scale out across the large-scale accelerated computing infrastructure of NVIDIA DGX SuperPOD™. NeMo Megatron automates the complexity of LLM training with data processing libraries that ingest, curate, organize and clean data. Using advanced technologies for data, tensor and pipeline parallelization, it enables the training of large language models to be distributed efficiently across thousands of GPUs.
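To make the three parallelization strategies less abstract, here is a toy NumPy sketch. It is not how NeMo Megatron is implemented; it only shows, on a single machine, how the same two-layer computation can be split by batch (data parallelism), by weight columns (tensor parallelism) or by layer (pipeline parallelism) and still produce the same result. The device count, shapes and variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # a batch of 4 inputs with hidden size 8
w1 = rng.standard_normal((8, 16))   # weights of layer 1
w2 = rng.standard_normal((16, 8))   # weights of layer 2
NUM_DEVICES = 4                     # hypothetical number of GPUs

reference = (x @ w1) @ w2           # what a single device would compute

# Data parallelism: every device holds all weights and sees a slice of the batch.
batch_shards = np.split(x, NUM_DEVICES, axis=0)
data_parallel = np.concatenate([(xs @ w1) @ w2 for xs in batch_shards], axis=0)

# Tensor parallelism: layer 1's weight matrix is split column-wise, so each
# device computes a slice of the layer output, which is then gathered.
w1_shards = np.split(w1, NUM_DEVICES, axis=1)
hidden = np.concatenate([x @ shard for shard in w1_shards], axis=1)
tensor_parallel = hidden @ w2

# Pipeline parallelism: stage 0 owns layer 1 and stage 1 owns layer 2.
# Micro-batches flow through the stages; on real hardware the stages would
# work on different micro-batches at the same time.
micro_batches = np.split(x, 2, axis=0)
stage0_out = [mb @ w1 for mb in micro_batches]   # runs on device A
stage1_out = [h @ w2 for h in stage0_out]        # runs on device B
pipeline_parallel = np.concatenate(stage1_out, axis=0)

for name, result in [("data", data_parallel),
                     ("tensor", tensor_parallel),
                     ("pipeline", pipeline_parallel)]:
    print(name, np.allclose(result, reference))
```

In a real training system the three strategies are combined, and the communication between devices that this sketch glosses over becomes the hard part.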
But what about inference? After all, in theory at least, you only train an LLM once, but the model is then used many, many times to infer (that is, to produce results). The inference phase is estimated to account for about 90% of the total energy cost of operation for AI models. So having inference that is both fast and economical is of paramount importance, and that applies beyond LLMs.
NVIDIA is addressing this by announcing major updates to its Triton Inference Server, as 25,000+ companies worldwide deploy NVIDIA AI inference. The updates include new capabilities in the open source NVIDIA Triton Inference Server™ software, which provides cross-platform inference on all AI models and frameworks, and NVIDIA TensorRT™, which optimizes AI models and provides a runtime for high-performance inference on NVIDIA GPUs.
NVIDIA introduced a number of improvements for the Triton Inference Server. The most obvious tie to LLMs is that Triton now has multi-GPU, multi-node functionality. This means Transformer-based LLMs that no longer fit in a single GPU can be served across multiple GPUs and server nodes, which NVIDIA says provides real-time inference performance.
The Triton Model Analyzer is a tool that automates a key optimization task by helping select the best configurations for AI models from hundreds of possibilities. According to NVIDIA, it achieves optimal performance while ensuring the quality of service required for applications.
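The kind of selection such a tool automates can be pictured as a constrained search: among all measured configurations, keep those that meet the quality-of-service target and pick the one with the best throughput. The sketch below is a simplified illustration of that idea, not Model Analyzer’s actual algorithm or API; the `Measurement` fields and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Measurement:
    """One benchmarked serving configuration (all fields are illustrative)."""
    max_batch_size: int
    instance_count: int
    throughput_infer_per_sec: float
    p99_latency_ms: float


def pick_best(measurements: List[Measurement],
              latency_budget_ms: float) -> Optional[Measurement]:
    """Return the highest-throughput configuration within the latency budget,
    or None if no configuration meets it."""
    eligible = [m for m in measurements if m.p99_latency_ms <= latency_budget_ms]
    return max(eligible, key=lambda m: m.throughput_infer_per_sec, default=None)


# Made-up benchmark results for four configurations of the same model.
measurements = [
    Measurement(max_batch_size=1, instance_count=1,
                throughput_infer_per_sec=120.0, p99_latency_ms=9.0),
    Measurement(max_batch_size=8, instance_count=1,
                throughput_infer_per_sec=610.0, p99_latency_ms=31.0),
    Measurement(max_batch_size=8, instance_count=2,
                throughput_infer_per_sec=980.0, p99_latency_ms=44.0),
    Measurement(max_batch_size=32, instance_count=2,
                throughput_infer_per_sec=1500.0, p99_latency_ms=95.0),
]

best = pick_best(measurements, latency_budget_ms=50.0)
print(best)
```

Here the largest batch size delivers the most throughput but misses the 50 ms budget, so the second-fastest configuration wins. That trade-off between throughput and latency is what makes manual tuning tedious.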
RAPIDS FIL is a new back-end for GPU or CPU inference of random forest and gradient-boosted decision tree models, which provides developers with a unified deployment engine for both deep learning and traditional machine learning with Triton.
Last but not least on the software front, Triton now comes with Amazon SageMaker Integration, enabling users to easily deploy multi-framework models using Triton within SageMaker, AWS’s fully managed AI service.
On the hardware front, Triton now also supports Arm CPUs, in addition to NVIDIA GPUs and x86 CPUs. The company also introduced the NVIDIA A2 Tensor Core GPU, a low-power, small-footprint accelerator for AI inference at the edge that NVIDIA claims offers up to 20X more inference performance than CPUs.
Triton provides AI inference on GPUs and CPUs in the cloud, in the data center, at the enterprise edge and on embedded devices. It is integrated into AWS, Google Cloud, Microsoft Azure and Alibaba Cloud, and is included in NVIDIA AI Enterprise. To help deliver services based on NVIDIA’s AI technologies to the edge, Huang announced NVIDIA Launchpad.
NVIDIA moving proactively to maintain its lead with its hardware and software ecosystem
And that is far from everything NVIDIA unveiled today. NVIDIA Modulus builds and trains physics-informed machine learning models that can learn and obey the laws of physics. Graphs, a key data structure in modern data science, can now be projected into deep neural network frameworks with Deep Graph Library, or DGL, a new Python package.
Huang also introduced three new libraries: ReOpt, for the $10 trillion logistics industry; cuQuantum, to accelerate quantum computing research; and cuNumeric, to accelerate NumPy for scientists, data scientists, and machine learning and AI researchers in the Python community. In addition, NVIDIA is introducing 65 new and updated SDKs at GTC.
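Of the three libraries, cuNumeric is the easiest to picture in code, since its pitch is to act as a drop-in replacement for NumPy. The sketch below assumes the package is importable as `cunumeric` and that the handful of NumPy functions used here are covered; it falls back to plain NumPy when the package is not installed, so it illustrates the idea rather than serving as a benchmark.

```python
try:
    # Assumption: cuNumeric is installed and exposes a NumPy-compatible API
    # under the module name "cunumeric".
    import cunumeric as np
    backend = "cunumeric"
except ImportError:
    # Fallback so the example still runs on a machine without cuNumeric.
    import numpy as np
    backend = "numpy"

# Ordinary NumPy-style array code; nothing below is specific to either backend.
a = np.arange(1_000_000, dtype=np.float64)
b = np.sqrt(a) + 1.0
total = float(b.sum())
mean = float(b.mean())

print(f"backend={backend} mean={mean:.3f}")
```

The point of the design is that only the import line changes: existing array code stays as it is while the library decides how to distribute the work across the available hardware.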
So, what to make of all that? Although we cherry-picked, each of these items would probably warrant its own analysis. The big picture is that, once again, NVIDIA is moving proactively to maintain its lead in a concerted effort to tie in its hardware to its software.
LLMs may seem exotic for most organizations at this point, but NVIDIA is betting that they will see more interest and practical applications, and positioning itself as an LLM platform for others to build on. Although alternatives do exist, having something that is curated, supported, and bundled with NVIDIA’s software and hardware ecosystem and brand will probably seem like an attractive proposition to many organizations.
The same goes for the focus on inference. In the face of increasing competition from an array of hardware vendors building on architectures designed specifically for AI workloads, NVIDIA is doubling down on inference. This is the part of AI model operation that plays the biggest part in total cost of ownership and operation. And NVIDIA is, once again, doing it in its signature style – combining hardware and software into an ecosystem.