Mastering Deep Learning VMs: Expert Answers to Your Key Questions

My last post was the ultimate introduction to VMware Private AI. If you have not read that post, I highly recommend you do. As a natural extension, in this post I have put together FAQs around one of the key components of VMware Private AI Foundation (vPAIF): Deep Learning VMs.

Note: Although I am starting with 4-5 key FAQs, I plan to keep this blog post updated over the coming days with new questions until we have covered all the key aspects of Deep Learning VMs.

What is preconfigured in a DL VM?

  1. Deep Learning VMs (DL VMs): These are specialized VMs (Ubuntu guest OS) preconfigured with AI/ML libraries, frameworks, tools, and drivers, all validated and optimized by NVIDIA and VMware for deployment within the VCF environment. Since these VMs are preconfigured, data scientists or AI developers can immediately focus on their AI app development, including LLM fine-tuning and inference, without the need to spend significant time deploying and installing compatible tools and frameworks, thus saving considerable time.
Source: VMware official blog

How are they delivered/released?

These validated Deep Learning VM images are available for consumption from the content delivery network (CDN) URL https://packages.vmware.com/dl-vm/lib.json (similar to how TKG/guest cluster images are delivered). As a vSphere admin, you need to create a subscribed content library using the above URL to make the images available for consumption. In an air-gapped environment, the admin needs to manually download the images from https://packages.vmware.com/dl-vm/ (latest version recommended) and create a local content library with the downloaded Deep Learning VM images. While each VCF release will have an associated Deep Learning VM image, async releases are also expected periodically. Refer to the Deep Learning VM release notes for more details.
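For air-gapped workflows, it can be handy to inspect the image index before mirroring it into a local content library. Below is a minimal Python sketch; note that the JSON structure shown (an `items` list with `name`/`version` fields) is a hypothetical assumption for illustration only, and the real lib.json schema may differ.

```python
import json

# Hypothetical sample mimicking a content-library index like lib.json.
# The actual schema published at packages.vmware.com may differ; this
# structure is an assumption for illustration only.
sample_index = json.loads("""
{
  "items": [
    {"name": "dl-vm-ubuntu-2004-v1", "version": "1.0"},
    {"name": "dl-vm-ubuntu-2204-v2", "version": "2.0"}
  ]
}
""")

def list_image_names(index: dict) -> list[str]:
    """Return the names of all VM images listed in the index."""
    return [item["name"] for item in index.get("items", [])]

print(list_image_names(sample_index))
```

In a real environment, you would fetch the index from the CDN URL and mirror only the latest image into your local content library.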

What are the different ways to deploy a Deep Learning VM?

These Deep Learning VMs can be deployed in one of the ways mentioned below. It is assumed that the content library is configured as described in the previous question.

  1. Directly from the vSphere client UI using the standard “Deploy from content library template” workflow. This does not require you to have the vSphere IaaS control plane (aka WCP or supervisor cluster) enabled on the cluster.
  2. If you have the vSphere IaaS control plane enabled, you can use the standard kubectl Kubernetes interface to deploy a VM Service-based DL VM.
  3. One of the compelling values of the vPAIF VCF add-on is the ability for users to deploy DL VMs as AI workstations using the Aria Automation (formerly known as vRealize Automation) self-service catalog. This drastically simplifies the consumption of AI workstations to just a few clicks.
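To illustrate option 2, a VM Service manifest applied with kubectl might look like the sketch below. This is a hedged example, not a definitive spec: the image, class, namespace, and storage class names are placeholders, and the `vmoperator.vmware.com` API version can vary with your vSphere IaaS control plane release.

```yaml
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachine
metadata:
  name: dl-vm-demo            # placeholder VM name
  namespace: ai-namespace     # placeholder vSphere namespace
spec:
  imageName: dl-vm-image      # DL VM image from the content library (placeholder)
  className: best-effort-gpu  # VM class with GPU resources (placeholder)
  storageClass: vsan-default  # placeholder storage policy
  powerState: poweredOn
```

You would apply this with `kubectl apply -f` against the supervisor cluster, in a namespace where the DL VM content library and a GPU-enabled VM class have been assigned.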

Can we customize DL VMs?

Yes. As shown in the diagram above, while the DL VM comes preconfigured, users can customize it as per their requirements using the industry-standard cloud-init mechanism.
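As a hedged sketch, a cloud-init user-data snippet supplied at deployment time might install an extra package and run a command on first boot. The package name and log path here are purely illustrative assumptions, not part of the DL VM image:

```yaml
#cloud-config
# Illustrative customization; adjust to your own requirements.
packages:
  - jq   # example extra package (placeholder)
runcmd:
  - [ sh, -c, "echo 'DL VM customized via cloud-init' >> /var/log/dlvm-custom.log" ]
```

The `packages` and `runcmd` modules shown are standard cloud-init features; consult the DL VM release notes for the exact mechanism (e.g., OVF properties) your deployment method uses to pass user data.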

I hope you got quick value out of this post. Please stay tuned for new questions that will deep dive into the various key aspects of Deep Learning VMs.