In June, Israeli start-up Habana Labs announced Gaudi, a 16nm training chip for neural networks. Gaudi represents Habana’s second attempt to break into the AI market following the commercial launch of its Goya inference chips in Q4 2018. Habana claims it has already shipped Goya to 20 select clients.
Gaudi builds on the same basic architecture as the Goya inference accelerator: eight Tensor Processor Cores (TPCs), each with dedicated on-die memory, plus a GEMM math engine and a Gen 4 PCIe interface (Exhibit 1). Whereas Goya focuses on integer computation, Gaudi supports the floating-point formats required for training and integrates 32 GB of High Bandwidth Memory (HBM2). It also features the industry’s first on-die implementation of RDMA over Converged Ethernet (RoCE) on an AI chip, providing ten 100 Gb or twenty 50 Gb Ethernet links to enable scaling up to thousands of accelerators.
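The two port configurations quoted above amount to the same aggregate bandwidth, as a quick back-of-envelope check shows (the bytes-per-second conversion is our arithmetic, not a Habana figure):

```python
# Sanity check on the RoCE link figures cited above: both port
# configurations expose the same aggregate Ethernet bandwidth.
config_a = 10 * 100  # ten 100 Gb/s links
config_b = 20 * 50   # twenty 50 Gb/s links
assert config_a == config_b == 1000  # 1 Tb/s either way

# Convert bits to bytes for rough comparison with memory bandwidths
aggregate_gbps = config_a
aggregate_gbytes = aggregate_gbps / 8
print(f"{aggregate_gbps} Gb/s aggregate (~{aggregate_gbytes:.0f} GB/s)")
```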
On the software side, Gaudi is supported by Habana’s AI stack, SynapseAI, which comprises a graph compiler, runtime, debugger, deep-learning library and drivers. At present, Habana supports TensorFlow for building models, with support for PyTorch and other machine-learning frameworks planned.
Exhibit 1: High-Level Architecture of Habana Labs’ Gaudi Processor
Products:
Although Habana currently offers only a single Goya-based product, a PCIe accelerator card, it plans to offer three Gaudi form factors.
The company is testing first silicon and expects all three Gaudi products to sample by the end of 2019, with volume production expected to start in mid-2020.
Exhibit 2: Habana Labs’ HLS-1 system, which combines eight Gaudi accelerator cards
Assessment:
NVIDIA’s GPUs have dominated the cloud data center AI training market for several years, to the point where many customers now regard themselves as locked into a single vendor. Habana Labs is one of a small band of start-ups seeking to disrupt this market, and it claims that its Gaudi chip already outperforms NVIDIA’s Tesla V100.
For example, in the popular ResNet-50 CNN image-recognition test, Habana claims that Gaudi exceeds 1,650 images per second (IPS) at a batch size of 64, versus 1,360 IPS at an unspecified batch size for NVIDIA’s Tesla V100. The company also claims that Gaudi draws only 140 watts when running the benchmark, around half the power of the V100.
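Taking the claimed numbers at face value, the efficiency gap can be worked out directly; note that the V100 power figure below is our assumption derived from Habana’s “around half” statement, not a measured value:

```python
# Back-of-envelope ResNet-50 efficiency comparison using the figures
# cited in the text. The 280 W for the V100 is assumed (twice Gaudi's
# claimed 140 W), not a measured number.
gaudi_ips, gaudi_watts = 1650, 140
v100_ips, v100_watts = 1360, 280  # assumed power per the "around half" claim

gaudi_eff = gaudi_ips / gaudi_watts  # images per second per watt
v100_eff = v100_ips / v100_watts

print(f"Gaudi: {gaudi_eff:.1f} IPS/W vs V100: {v100_eff:.1f} IPS/W "
      f"({gaudi_eff / v100_eff:.1f}x)")
```

Under these assumptions Gaudi’s claimed advantage is roughly 2.4x in images per second per watt, a metric cloud operators watch closely given data-center power budgets.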
Aside from raw performance, scalability is a key characteristic of AI training processors: accelerators are deployed in large numbers in training farms, with many devices collaborating on training the same neural network. Habana claims its integrated, standards-based Ethernet connectivity enables effectively unlimited scaling while freeing customers from NVIDIA’s proprietary software and interfaces. Habana is also the first vendor to announce hardware supporting Facebook’s OCP accelerator form factor and Glow compiler software.
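To illustrate how commodity Ethernet switching could support the scaling claim, here is a rough sketch assuming a hypothetical non-blocking two-tier leaf-spine fabric of 64-port switches; the topology and port counts are our illustration, not Habana’s published design:

```python
# Rough cluster-size estimate for a hypothetical non-blocking two-tier
# leaf-spine Ethernet fabric. With k-port switches, each leaf dedicates
# k/2 ports to endpoints and k/2 to spines, and k leaves fit, giving
# k*k/2 endpoint-facing ports in total.
switch_ports = 64                        # assumed switch radix
endpoint_ports = switch_ports ** 2 // 2  # 2048 endpoint-facing ports

full_bw = endpoint_ports // 10    # each Gaudi uses all 10 ports
reduced_bw = endpoint_ports // 1  # each Gaudi uses a single port
print(full_bw, reduced_bw)
```

Even this modest fabric reaches thousands of accelerators once each device uses fewer of its ten links; larger switch radixes or a third switching tier extend the reach further, which is the substance of the “unlimited scaling” argument.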
The demand for more powerful AI capabilities is creating a highly competitive market in which nimble execution matters nearly as much as architectural design. NVIDIA has proved itself an agile innovator and a formidable competitor, and with its well-established CUDA software ecosystem it is unlikely to cede its dominant market position any time soon. Volta launched around two years ago, and its successor will likely be announced later this year, so Habana’s claimed performance advantage may be short-lived. And with Facebook working with several other accelerator chip start-ups, there is no guarantee that Habana will win major orders from the social media giant.
Nevertheless, if its technology delivers as promised, Intel-backed Habana could emerge as one of the leading challengers to NVIDIA in the AI training market. With its freedom from proprietary software and interfaces – and probably a much lower price – it should appeal to cloud data center customers who currently buy expensive NVIDIA GPUs and are eager to see alternative suppliers.