Will NVIDIA maintain its market supremacy forever?
It's a gold rush, and NVIDIA is the one selling shovels, capitalizing on fertile land: exabytes of data collected by trillion-dollar giants, lying dormant in cold storage. NVIDIA quickly ascended to become one of the most valuable companies, achieving a $2T valuation. I'm here to share my opinion, drawing on my knowledge of ML, parallel computing, and business. These are my personal opinions and are not to be taken as investment advice.
The State of ML Compute in 2024
Early Adopters of Large ML and Big Tech
Beyond big tech, the adoption of large-scale ML models in production by the broader corporate sector is very low. While experimenting with AI and deploying internal tools is not uncommon, the leap to productionizing these tools for user interaction introduces significant risks that many companies are hesitant to embrace. This reluctance can be attributed to a lack of sufficient data for training deep neural networks and of the engineering expertise required for effective productionization. Without access to data from millions - if not billions - of users, it becomes challenging to train these large models with reasonable accuracy. Additionally, the cost of acquiring the necessary hardware, such as GPUs, often weighs heavily in the decision against implementing large neural networks. Nonetheless, many companies have successfully implemented simple, classical ML models trained on modest yet robust datasets. Typically, the outputs of these models serve as recommendations rather than directives, and are likely not client-facing. Stripe's Radar is a great example of a deep neural network in production with the power to block transactions outright.
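To make that contrast concrete, here is a minimal sketch of the kind of classical, recommendation-style model described above. It uses scikit-learn's LogisticRegression on synthetic tabular data; the dataset, features, and threshold are all hypothetical, and this is not Stripe's system.

```python
# A minimal sketch (not any real company's system) of a classical model whose
# output is surfaced as a recommendation rather than an automatic decision.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical tabular data: a few thousand rows of hand-engineered features.
X = rng.normal(size=(5_000, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=5_000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# The model emits a probability; a human-chosen threshold turns it into a
# recommendation ("flag for review"), not a hard block.
risk_scores = model.predict_proba(X_test)[:, 1]
flag_for_review = risk_scores > 0.8
print(f"Flagged {flag_for_review.sum()} of {len(X_test)} transactions for review")
```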
NVIDIA's Competitive Edge (Spoiler - It's not the hardware)
GPUs offer massive parallel performance, a level that is simply impossible for a CPU to rival as of this writing. This massive parallelism makes GPUs far superior for model training. For organizations training their own models on GPUs, choosing NVIDIA over competitors like AMD or Google's TPU (which remains commercially unavailable as standalone hardware) is a pragmatic decision. NVIDIA's CUDA is a GPU programming platform that has matured significantly, pre-dating the current ML boom and securing a dominant position in GPU computing. Although AMD, Groq, and other emerging chip manufacturers may offer comparable hardware performance, their ecosystems lack the extensive support from pivotal deep learning frameworks such as PyTorch or TensorFlow. This comprehensive ecosystem and support network justify NVIDIA's premium pricing, mitigating potential obstacles down the line. NVIDIA's success is largely owed to the unparalleled maturity of the CUDA software ecosystem, a foundation laid well before the recent boom caused by generative AI.
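To illustrate how much of that ecosystem advantage is absorbed by the frameworks, here is a minimal PyTorch sketch. Assuming a CUDA-enabled PyTorch build and an NVIDIA GPU, moving a large matrix multiply onto the GPU is a one-line device choice, with CUDA and cuBLAS handled entirely under the hood; on other hardware the same code falls back to the CPU.

```python
# Minimal sketch: the framework, not the user, handles the CUDA plumbing.
# Assumes PyTorch built with CUDA support; otherwise it runs on the CPU.
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

if device.type == "cuda":
    torch.cuda.synchronize()   # make sure setup has finished before timing
start = time.perf_counter()
c = a @ b                      # dispatched to cuBLAS when device is "cuda"
if device.type == "cuda":
    torch.cuda.synchronize()   # GPU kernels are asynchronous; wait before timing
elapsed = time.perf_counter() - start

print(f"{device}: 4096x4096 matmul took {elapsed * 1e3:.1f} ms")
```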
The Future of ML Computing
Inference as the Primary Bottleneck
According to Pete Warden (2023), training costs scale with the number of researchers, while inference costs scale with the number of users. Presently, significant ML costs outside big tech revolve around data collection and model training. As models transition to production and begin to demonstrate usefulness, the scale of inference grows very quickly. Training, albeit costly, is generally a one-off expense, whereas inference costs are perpetual. Training costs should also rise, albeit at a slower rate, as researchers explore new model architectures to increase accuracy. Even in 2018, when OpenAI was relatively unknown outside the ML research community, before the era of every student having ChatGPT as a pinned tab, their inference costs were significantly higher than their training costs.
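A back-of-the-envelope sketch makes the asymmetry obvious. All figures below are entirely hypothetical; the point is only the structure of the calculation: a one-off training bill versus an inference bill that recurs every day and scales with users.

```python
# Back-of-the-envelope sketch with hypothetical numbers: a one-off training
# run versus inference costs that scale with users and keep recurring.
training_cost = 5_000_000            # dollars, one-off (hypothetical)
cost_per_1k_queries = 1.00           # dollars (hypothetical)
queries_per_user_per_day = 10
users = 10_000_000

daily_inference_cost = users * queries_per_user_per_day / 1_000 * cost_per_1k_queries
days_to_match_training = training_cost / daily_inference_cost

print(f"Daily inference spend: ${daily_inference_cost:,.0f}")
print(f"Inference matches the one-off training bill after ~{days_to_match_training:.0f} days")
```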
The Rise of Tiny ML
GPT-3.5 required 300 GB of VRAM, making it impossible to fit the entire model's weights onto a single GPU. Inference on models of this scale is only feasible for tech companies with the resources to maintain massive GPU-accelerated clusters. Significant research has been conducted into methods for reducing the size and inference time of large neural networks. Techniques like quantization and pruning have reduced the size of many large neural networks by up to 49x (Han et al., 2015). This field is rapidly evolving, with advancements occurring almost daily. I predict that we will start performing inference locally on devices like our phones, using open-source models that have been heavily optimized for edge performance. We don't always need super powerful models, and with CPUs and mobile GPUs shifting to increase throughput on inference workloads, I believe this will be a significant trend in the near future. The closer your AI runs to your user, the lower the latency. Users are impatient, and even the slightest increase in latency can greatly affect user satisfaction metrics; an external GPU might not be feasible when the data has to travel a whole bus away.
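As a simplified illustration of the two techniques named above, the sketch below applies stock PyTorch magnitude pruning and dynamic int8 quantization to a toy model. This is not the exact pipeline from Han et al. (2015), just the readily available utilities, and the toy architecture is hypothetical.

```python
# Sketch of the two shrinking techniques named above, using stock PyTorch
# utilities on a toy model (not the exact pipeline from Han et al., 2015).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Pruning: zero out the 50% of weights with the smallest magnitude per layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make the sparsity permanent

# Dynamic quantization: store Linear weights as int8, compute in reduced precision.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # inference still works: torch.Size([1, 10])
```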
What Does This Mean for the Future?
The landscape of ML computing is poised for significant change. I predict AMD will close the gap with NVIDIA as support from frameworks like TensorFlow and PyTorch expands, leading to a more competitive market that may offer lower prices than NVIDIA's current monopoly allows. The software AMD needs won't be built overnight, and the shift away from NVIDIA's dominance will be gradual. The fact that inference costs outweigh training expenses points towards edge computing, with models optimized through methods like quantization, pruning, and neural architecture search to enable efficient inference on edge devices powered by CPUs or mobile GPUs. Companies such as ARM, Intel, and Qualcomm are likely to benefit from this shift towards inference, capitalizing on the gradual transition as AI adoption extends beyond the confines of big tech.