The Ultimate Introduction to VMware Private AI Foundation with NVIDIA

Earlier, I discussed my involvement with private AI at VMware Explore Singapore 2023 here, and later, when the first release of ‘VMware Private AI Foundation with NVIDIA’ became generally available, here. Today, I am starting a series of blog posts on various aspects of VMware Private AI Foundation with NVIDIA. I will begin with an easy-to-understand introduction to the platform.

Basics

  1. VMware Private AI Foundation with NVIDIA is also known as vPAIF-N, PAIF-N, PAIF, or simply VMware Private AI. I will refer to it as PAIF-N going forward.
  2. As the name indicates, it is a solution jointly engineered by VMware by Broadcom (a private cloud leader) and NVIDIA (a leader in GPU-accelerated computing).
  3. PAIF-N primarily focuses on Generative AI (GenAI) use cases. It enables enterprises to quickly start building Generative AI capabilities around their business. It has been generally available since the VCF 5.1.1 release.
  4. It is sold as an add-on (one of the advanced services) on top of the flagship VMware Cloud Foundation (VCF).
    • In fact, customers need to have a PAIF-N add-on license from VMware by Broadcom, as well as a separate license from NVIDIA for the NVIDIA AI Enterprise (NVAIE) suite.
    • Most importantly, customers must separately purchase supported GPU cards (such as H100 or A100) for certified servers/hosts.
  5. PAIF-N caters to key user personas such as cloud administrators, data scientists, and DevOps/developers.
  6. While we are focusing on PAIF with NVIDIA, there are similar partnerships with Intel and AMD as well (and maybe more in the future).

High level architecture

Source/Credit: VMware Explore 2024 session

What are the key components from VMware by Broadcom?

  1. Since PAIF-N is an add-on to VCF, customers benefit from all the strengths of the underlying platform (compute, storage, networking, and management—each of which has been a leader in its respective category for many years).
  2. VCF can be deployed as a private cloud in your own data centers, through VMware cloud providers, or via certified hyperscalers with solutions such as Google Cloud VMware Engine. This flexibility allows PAIF-N to be deployed based on the customer’s choice, closer to their domain-specific data.
  3. With each release, new capabilities are continuously being developed within VCF to maximize the value of PAIF-N. Examples include:
    • Virtualizing GPUs in much the same way that compute, storage, and networking are virtualized.
    • Bringing all the goodness of DRS (Distributed Resource Scheduler) to effectively manage GPU resources.
    • Fine-tuning the amazing VMware vMotion capability specifically for GPU workloads.
    • Extending the goodness of the vSphere IaaS platform to GPU workloads using constructs such as
      • VM-class with GPU,
      • VM-service with GPU, TKG/Guest clusters with GPU,
      • Harbor as an LLM store (Harbor capability announced at VMware Explore 2024, session link at the end) and so on…
    • Unified vCenter UI for standing up the entire GenAI infrastructure on VCF.
    • VMware Data Services Manager (DSM), which supports vector databases for Retrieval-Augmented Generation (RAG), a popular GenAI use case. DSM, available as an advanced service within VCF, also caters to non-PAIF-N use cases.
    • and so on. Hopefully you get the idea.
  4. Deep Learning VMs (DL VMs): These are specialized VMs (Ubuntu guest OS) preconfigured with AI/ML libraries, frameworks, tools, and drivers, all validated and optimized by NVIDIA and VMware for deployment within the VCF environment. These validated VM images are released by VMware by Broadcom through a content delivery network (CDN), similar to TKG/Guest cluster images. This allows customers to subscribe to the CDN via the well-known content library construct in vSphere.
  5. These Deep Learning VMs can be deployed either directly from the vSphere UI using the standard “Deploy from content library template” workflow or through the kubectl interface as part of the VM service construct on top of the vSphere IaaS platform (also known as WCP or Supervisor cluster). Since these VMs are preconfigured, data scientists or AI developers can immediately focus on their AI app development, including LLM fine-tuning and inference, without spending significant time deploying and installing compatible tools and frameworks.
  6. If a user prefers a Kubernetes cluster (instead of DL VMs) with preconfigured tools and libraries, it is easily achievable by deploying TKG clusters (also known as Guest clusters) and installing NVIDIA-specific operators such as the GPU-operator and RAG operator. Of course, these Kubernetes clusters can be customized with a choice of AI tools and libraries.
  7. Importantly, data scientists or DevOps users can deploy DL VMs or Kubernetes clusters (TKG/Guest clusters) as AI workstations using the Aria Automation (formerly known as vRealize Automation) self-service catalog. This drastically simplifies the consumption of AI workstations with just a few clicks. Even though these catalog items (including RAG-based AI workstations) are preconfigured, users can still customize them to meet their specific needs.
  8. From a GPU monitoring perspective, vCenter Server’s H5C client offers basic performance charts, while Aria Operations provides advanced dashboards specifically designed for GPU monitoring.
  9. Although obvious, it’s important to note that all NVIDIA components (more on them below) are seamlessly integrated with PAIF-N.
  10. As per the VMware Explore 2024 announcements, vRA/Aria Automation will be called VCF Automation and vROps/Aria Operations will be called VCF Operations. I am hoping this is the last naming ceremony (lol).
  11. Disclaimer: VMware Explore 2024 also announced new capabilities in VCF 5.2.1 and VCF 9.0. The list above is not exhaustive, as new capabilities are being rapidly added with each release to maximize the value of this jointly engineered solution.
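
To make the vector database point above a little more concrete, here is a minimal sketch of the retrieval step in RAG, in plain Python. Everything in it is an assumption for illustration: the three-dimensional vectors, the document texts, and the helper names `cosine_similarity` and `retrieve` are made up for this post. A real deployment would use an embedding model and a vector database rather than an in-memory list.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, top_k=2):
    """Return the top_k document texts most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]


# Hypothetical, hand-written embeddings; a real system gets these from an embedding model.
STORE = [
    {"text": "Home loan interest rates", "vector": [0.9, 0.1, 0.0]},
    {"text": "Credit card reward points", "vector": [0.1, 0.9, 0.0]},
    {"text": "Branch opening hours", "vector": [0.0, 0.1, 0.9]},
]

if __name__ == "__main__":
    question_vector = [0.8, 0.2, 0.0]  # pretend embedding of a mortgage question
    context = retrieve(question_vector, STORE, top_k=1)
    # The retrieved text would be added to the LLM prompt as context.
    print(context)
```

The point of the sketch is only the shape of the flow: embed the question, find the closest stored vectors, and hand the matching text to the LLM as context.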

Why the term “Private-AI”?

An architectural approach that balances the business gains from AI with the privacy and compliance needs of the organisation.

  1. There are multiple challenges in the GenAI space today, but the most critical ones are privacy, compliance, and security.
  2. Privacy: Customers are concerned about the privacy of their domain-specific data and IP assets, how proprietary LLMs process or handle them, who gets access to them, etc. There is fear about how their data is being used by LLMs for inferencing or training.
  3. Compliance: Many enterprise customers operate in highly regulated industries, so they must be 100% compliant with GDPR, HIPAA, and other regional rules and laws.
  4. Security: With the advent of GenAI, new security threats are emerging. Data leaks and unauthorized access can lead to significant breaches unless proper guardrails, secure APIs, encrypted data sources, and secure AI infrastructure are in place. Security is a major concern.
  5. PAIF-N is designed to address all these challenges. Beyond these critical issues, it also tackles other challenges such as the choice of open LLMs, cost, and even performance (equal to or better than bare metal).

What are the key components from NVIDIA?

Credit/source: NVIDIA’s official website
  1. In short, NVIDIA describes NVIDIA AI Enterprise (NVAIE) as the “Operating System” for enterprise AI.
  2. It brings a lot of goodness from a data scientist’s or AI developer’s perspective, with various tools, frameworks, libraries, and drivers for GenAI application development, model inferencing, model fine-tuning, pretrained models, etc.
  3. Examples:
    • NVIDIA vGPU technology (joint engineering between VMware and NVIDIA to virtualize GPUs)
    • vGPU drivers for guest OS (DL VM or K8s worker node) compatible with VMware ESXi GPU host drivers
    • NIM microservices with AI frameworks like PyTorch, TensorFlow, CUDA, sample chatbots, and so on. NVIDIA NIM microservices are the fastest way to AI inference.
    • Validated and optimized pre-trained LLMs (community, i.e., open-source LLMs, NVIDIA’s custom models, etc.)
    • The recently announced NVIDIA NIM Agent Blueprints for various enterprise GenAI use cases
    • GPU Operator and Network Operator for K8s clusters, making the GPU a first-class citizen in the K8s world.
    • RAG operator for RAG use case
    • and so on…I hope you get the idea
  4. Note: Reading the names above might feel overwhelming, but don’t worry! You don’t need to set up or deploy any of this individually. The VMware capabilities I mentioned earlier make everything seamlessly integrated out of the box—how cool is that?
  5. Disclaimer: Just like VMware, NVIDIA also keeps improving and adding capabilities as we speak, so the above list goes on.

How it fits together: an example

  1. Imagine you are a multinational bank named ABC Inc. You are already a VMware customer (or have decided to become one), running at scale on an industry-leading VCF stack in your own data center (though you could also have consumed this VCF stack through hyperscalers or cloud providers).
  2. As a bank, you offer a wide range of products and services to millions of customers. All these services are developed, deployed, and managed on top of your VCF infrastructure.
  3. With the rise of amazing GenAI technology, and to stay ahead in your industry, you have developed a GenAI strategy to integrate AI into every aspect of your operations. A couple of simple examples might include improving customer service or deploying a code assistant for your internal developer teams.
  4. As a multinational bank operating in a highly regulated industry, you are concerned about critical challenges like privacy, compliance, and security. You want to move quickly but still retain full control over your data, intellectual property (IP), and costs.
  5. Given your focus on the banking sector, building your own proprietary LLM (like OpenAI’s GPT) is not feasible due to the high cost and complexity. Instead, your strategy is to embrace open LLMs that are pre-trained for specific use cases (such as text, code, or video) to accelerate your AI journey and solve your business-specific AI challenges.
  6. You want to leverage your existing VMware investments and skill set, and reduce the learning curve around GenAI for this new set of workloads.
  7. With the above constraints and concerns, you are wondering how to enable your internal teams to quickly integrate GenAI into all facets of your business.
  8. By now, it should be clear that PAIF-N is the perfect solution for you! 🙂

Further learning

  1. If you want to see what it looks like in action with cool demos, this VMware Explore 2024 session is a must-watch.
  2. For a glimpse into the VMware Private AI future state and PAIF-N with VCF 9.0, see another session from VMware Explore 2024.
  3. Official VMware Private-AI documentation
  4. Please stay tuned for further deep dives.

If you think this blog post was worth your time, please share it with others as appropriate. Please feel free to connect with me on LinkedIn or Twitter for all such posts.

Cricket and Life: 5 Insights from 100 Matches

People close to me know that I love playing, talking, watching, and writing about cricket. Cricket as a sport has had a lot of influence on my personal and professional life. Whether it’s having fun with cricket friends, building cricket communities, experiencing the joy of winning, overcoming tough situations, losing close matches, or captaining the team, cricket is far more than just a sport to me.

I love the number 100, be it my first 100 km cycle ride, my first and only century (100 runs off 44 balls), 100 workouts over time, completing 100 km of running over time, or 100 days of meditation in a row. Recently, I noticed that I had reached 100 recorded cricket matches (6-, 10-, and 20-over matches) on the CricHeroes app. I thought I would take this opportunity to share the top 5 learnings from my journey to 100 recorded matches and how they have helped me in my career and life.

Let us look at my batting and bowling stats after 100 matches.

Disclaimer: These are just raw thoughts off the top of my head and offer a limited perspective on each point. Also, most of these matches are community/apartment-level, with a few corporate ones, and one of the major objectives has been to rejuvenate and get ready for the busy week. There could be at least 200 unrecorded matches as well. Finally, many thanks to the CricHeroes app team for creating such an amazing app for recording local cricketers’ stats.

Here are the top 5 things on my mind

1. Teamwork Matters:

Cricket is a team sport. No matter how many superstars are in your team (a classic example is RCB in the IPL), unless the team works together towards a common goal, you are going to lose more often than not. In my experience, the mindset of working as a team, leading by example, and not thinking about individual milestones is key to a team’s success. The same mindset has had a positive impact on my career and on what my corporate team has achieved over time. I would like to take this opportunity to call out a couple of cricket teams that have been a large part of my journey: “South City” in Bangalore and “Master Blasters” in Pune. I am grateful for all the support, banter, and fun we had together. You will always be close to my heart.

The strength of the team is each individual member. The strength of each member is the team – Phil Jackson

2. Learning from Losses:

If you look at the batting stats above, our overall winning percentage is just above 50% (58 won, 53 lost). This shows that while our team won many matches/tournaments, we lost many as well. Every time we lost, I would reflect on my contribution and try to improve for the next time, be it improving fitness, assessing my decisions, listening to feedback from the team, or building a bond with individual players to make sure they had all the support they needed to contribute as much as they could. Losing taught me resilience and the mindset of turning up again for the next match.

In addition, when I lost matches/tournaments as a captain, I of course had to face some criticism. When even Rohit and Virat have had to face the heat over their captaincy, who am I? At times it affected me, but over time I learned to take criticism positively, listen to individual players’ concerns and feedback, and keep trying to get better.

Every adversity, every failure, every heartache carries with it the seed of an equal or greater benefit – Napoleon Hill

3. Appreciating Individual Differences:

Often, just because I put my body on the line, I expected the same from every team member. Over time, after playing for or leading at least 5-6 different teams, I realized that every individual is different: everyone brings different skills to the table, everyone is inspired by different things, and everyone tries their best in their own way. We need to accept this, get the most out of each person’s abilities, and most importantly enjoy the togetherness, the banter, and each other’s success.

4. Working on Physical & Mental Fitness:

Cricket has been my number 1 hobby. I knew that I was not going to play competitive cricket, but I used to think about how cricket could help me in my day-to-day life. Wanting to lead by example and contribute to the team’s success inspired me to keep working on my fitness 5 days a week, just to make sure I performed better during weekend matches, and it helped me in my day-to-day life as well. Whenever I did not do well, it also reminded me that my fitness (both physical and mental) had come down and that I needed to get going again.

In addition, I always made sure that whatever difficult situations were going on in life or career, I never stopped playing cricket. Banter with friends and being on the cricket field always helped me tune out any outside noise/distractions from day-to-day life. In fact, I have set the expectation with my wife that no matter what, I will be playing cricket for 3-4 hrs during the weekend. Usually, all my travel and family function plans are made around cricket. At times, I also take my kids with me to avoid any possible clashes (“Galia”) with my wife (lol). My son is Virat (yes, inspired by none other than Virat Kohli, but he is into football; probably India needs a Virat in football too) and my daughter is Virakshi (yes, there is a Virat connection again).

5. Managing Emotions:

Naturally, I have been an aggressive batsman, and many times the team situation required me to play a certain way, but my instincts and emotions took over, leading to my dismissal and sometimes the team’s loss. Multiple times, not staying calm led to poor decisions as a player and captain. Over time, I have learned that being calmer during the match helps both myself and the team.

Lately, I have been on a journey to keep my emotions like anxiety, fear, and self-doubt in check. While this journey is in its early stages, this effort, combined with my experience, has started to work in most of the matches I played in 2024. My batting strike rate is improving and is mostly above 200 (screenshot below). In the past, I used to get bogged down by the match situation, such as when the team is 30 for 3 or chasing 200 runs in 20 overs, and this affected my performance. Nowadays, I try to focus on each ball coming up and act on it as best as possible without thinking about the past or future match situations. Isn’t that applicable to day-to-day life as well? Still, there is a lot of scope to improve on this as I keep faltering, but hopefully, I get better as I keep playing.

Small note: A couple of recent back-to-back fifties are not part of the above stats.

One memorable moment/tournament

There have been many moments, but if I need to pick one, it has to be when our team won the VMware Cricket Bash 2018 (a tournament with around 40 teams). As captain, I could see so much positive energy within the team to win this tournament. This positive energy and the support from each team member made my captaincy easy, and I could focus on my other key strength, i.e., batting. As a batsman, it was challenging, as the tournament was played on 3 different grounds and the conditions were different for every match (morning, evening, rain interruptions, pitch, ground size). The final was a day-night match with visibility around 60%, and the national anthem made it feel like a World Cup final. I was delighted to receive the best batsman of the tournament award; below are some of the moments captured. I was fortunate that my first celebration was captured on camera, a once-in-a-lifetime kind of moment. We were chasing in the final, and when I went in to bat we needed 39 runs in 4 overs, so I was a little under pressure, as visibility was also really poor, as you can see in the video, but I was determined to finish.

Winning final moment and trademark celebration
Winning team

That is all I have for now. Let me know which learnings resonated most with you. I also want to reiterate that these are just off the top of my head and offer a limited perspective. There are many other things I learned over time that helped me implicitly in my career, and I continue to learn even now. I am looking forward to continuing my journey to 200 matches and applying these lessons to my professional and personal journey as well, and who knows, I may even write a small e-book on “Learning from cricket for professionals”. Finally, I would like to leave you with a thought I came across on social media. I think it applies to any sport you play.

You do not stop playing cricket when you get old, you get old when you stop playing cricket – Unknown

Further reading: I would like to share my last post on cricket, written after the ODI World Cup 2023; you may also resonate with the quick learnings from the ODI WC 2023 campaign.

My story: How can a VMware vExpert program fuel your professional growth?

Recently, I was awarded the “VMware vExpert” title for the 10th year in a row.


vExpert Badge

Over the years, in recognition of my contributions to the community, I have been awarded the badges below.

Based on my 10+ years of professional experience, I think being part of the vExpert community has had a profound impact on my career, and I am sure it will continue to in the future as well.

A few weeks back, I created a video on how being part of the vExpert community helped me compound the top 5-6 elements that shaped my career. I think these key elements are necessary for every professional to achieve well-rounded, organic professional growth. I have always believed that anything good, done consistently over a considerable period of time, is bound to compound and start opening up multiple opportunities. I thought I would dedicate this post to sharing this video (just under 5 min) with you. Let me know if the 5-6 elements I talked about resonate with you.

I would also like to let you know that for the last 2 years, I have been part of one of the vExpert sub-programs, i.e., VMware vExpert PRO.

As a vExpert PRO, my job is to help and mentor aspiring vExperts in my region (India) and beyond. If you work as a professional on VMware technologies and aspire to be a vExpert, please feel free to connect with me on LinkedIn or Twitter. I will be happy to share my experience and be a small part of your vExpert journey.

Further learning

  1. My VMware vExpert YouTube playlist for aspiring vExperts (and yes, please subscribe for valuable content in the future)
  2. VMware vExpert official portal

Mastering Deep Learning VMs: Expert Answers to Your Key Questions

My last post was an ultimate introduction to VMware Private AI. If you haven’t read that post, I highly recommend you do. As a natural extension of it, I thought I would write FAQs in this post about one of the key components of vPAIF (VMware Private AI Foundation), i.e., Deep Learning VMs.

Note: Though I am starting with basic FAQs, I plan to keep this blog post updated with deeper questions over time until all key aspects (including advanced ones) of Deep Learning VMs are covered.

1. What is preconfigured in a DL VM?

  1. Deep Learning VMs (DL VMs): These are specialized VMs (Ubuntu guest OS) preconfigured with AI/ML libraries, frameworks, tools, and drivers, all validated and optimized by NVIDIA and VMware for deployment within the VCF environment. Since these VMs are preconfigured, data scientists or AI developers can immediately focus on their AI app development, including LLM fine-tuning and inference, without spending significant time deploying and installing compatible tools and frameworks.
Source: VMware official blog

As you can see in the diagram above, the VMware-delivered base image has a few components preconfigured, and a few things get deployed at first boot. The first is the NVIDIA GPU driver: this is the guest vGPU driver that pairs with the host vGPU driver already installed on ESXi as part of setting up the Private AI Foundation workload domain. Another key piece is NGC, i.e., NVIDIA GPU Cloud, which is where all the validated, GPU-optimized software bundles (these are microservices-based) are available, such as CUDA samples, DCGM Exporter, etc. Based on the cloud-init script passed as part of the DL VM deployment, these are fetched at first boot.

Note: With Miniforge for conda, the DL VM now has PyTorch and TensorFlow environments embedded in the base image itself. In addition, new features are added with every DL VM release. The diagram above is based on DL VM version 1.1, associated with VCF 5.2. Another super cool feature comes with DL VM version 1.2, associated with VCF 5.2.1: the VMware PAIS (Private AI Services) CLI is integrated directly into the DL VM, which allows MLOps engineers/data scientists to pull and push LLM models from and to the Harbor-registry-backed model store. Learn more here. I will also write a separate blog post on this.

2. How are they delivered/released?

These validated Deep Learning VM images are available for consumption from this content delivery network (CDN) URL: https://packages.vmware.com/dl-vm/lib.json (this is similar to the TKG/Guest cluster images). As a vSphere admin, you need to create a subscribed content library using the above URL to make the images available for consumption. In an air-gapped environment, the admin needs to manually download the images from https://packages.vmware.com/dl-vm/ (latest version recommended) and create a local content library with the downloaded Deep Learning VM images. While each VCF release has an associated Deep Learning VM version, async releases are also expected periodically. Refer to the Deep Learning VM release notes for more details.
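
For the air-gapped case, picking the “latest version” from a set of downloaded image folders can be sketched as below. This is only an illustration: the version labels in `available` are hypothetical (I am not asserting how the CDN actually names its directories), and `parse_version` and `latest_version` are helpers I made up for this post.

```python
def parse_version(name):
    """Turn a dotted version string such as '1.10.0' into a tuple of ints."""
    return tuple(int(part) for part in name.split("."))


def latest_version(names):
    """Return the highest version, comparing numerically rather than as text."""
    if not names:
        raise ValueError("no versions to choose from")
    return max(names, key=parse_version)


if __name__ == "__main__":
    # Hypothetical version labels, for illustration only.
    available = ["1.0.1", "1.1.0", "1.2.0", "1.10.0"]
    print(latest_version(available))
```

The numeric comparison matters because a plain text sort would rank “1.2.0” above “1.10.0”.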

3. What are the different ways to deploy a Deep Learning VM?

These Deep Learning VMs can be deployed in one of the ways below. It is assumed that the content library is configured as described in the previous question.

  1. Directly from the vSphere Client UI using the standard “Deploy from content library template” workflow. This does not require the vSphere IaaS control plane (aka WCP or Supervisor cluster) to be enabled on the cluster. Refer to the official documentation here.
  2. If you have the vSphere IaaS control plane enabled, you can use the regular kubectl (K8s) interface to deploy a VM-service-based DL VM. Refer to the official documentation here.
  3. One of the compelling values of the vPAIF VCF add-on is the ability for users to deploy DL VMs as AI workstations using the VCF Automation (aka Aria Automation) self-service catalog. This drastically simplifies the consumption of AI workstations with just a few clicks. Note that VCF Automation eventually interfaces with the underlying Supervisor cluster in the vSphere IaaS control plane and deploys the DL VM using the VM service construct. Refer to the official documentation here.

If you look at the official documentation for #1 and #2, the user has to pass several properties and inputs to deploy a DL VM. This is where option #3 is super cool: everything is self-service with a few clicks.
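
To give a feel for the kubectl route in option #2, the sketch below builds a VM Service manifest in Python and prints it as JSON, which kubectl accepts just like YAML. The field names follow the `vmoperator.vmware.com/v1alpha1` `VirtualMachine` API as I understand it. Every value (VM name, namespace, VM class, image name, storage class, ConfigMap name) is a hypothetical placeholder, and `build_dl_vm_manifest` is a helper made up for this post, so please check the official documentation for the exact fields your release expects.

```python
import json


def build_dl_vm_manifest(name, namespace, vm_class, image_name, storage_class, config_map):
    """Build a VirtualMachine manifest (as a dict) for a VM Service based DL VM."""
    return {
        "apiVersion": "vmoperator.vmware.com/v1alpha1",
        "kind": "VirtualMachine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "className": vm_class,      # VM class with a GPU profile attached
            "imageName": image_name,    # DL VM image synced from the content library
            "storageClass": storage_class,
            "powerState": "poweredOn",
            # OVF properties (for example cloud-init user-data) come from a ConfigMap.
            "vmMetadata": {"configMapName": config_map, "transport": "OvfEnv"},
        },
    }


if __name__ == "__main__":
    # All values below are placeholders, not real object names.
    manifest = build_dl_vm_manifest(
        name="dl-vm-demo",
        namespace="ai-team-namespace",
        vm_class="gpu-vm-class",
        image_name="vmi-xxxxxxxxxxxxx",
        storage_class="example-storage-policy",
        config_map="dl-vm-config",
    )
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and running `kubectl apply -f <file>` against the Supervisor namespace is the general idea; the ConfigMap it refers to still has to be created separately.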

4. Can we customize DL VMs?

Yes. As shown in the diagram above, while a DL VM comes preconfigured with a few things, users can customize it to their requirements using the industry-standard cloud-init mechanism. Refer here for more details on how to customize DL VMs.
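
As a rough sketch of the cloud-init route, the snippet below writes a tiny cloud-config document and base64-encodes it, since the DL VM deployment inputs expect the script in encoded form as far as I understand from the documentation. The package name and the file path in the cloud-config are hypothetical examples, and `encode_user_data`/`decode_user_data` are helpers made up for this post; `#cloud-config`, `packages`, and `write_files` are standard cloud-init keys.

```python
import base64

# Hypothetical customization: install one extra package and drop a marker file.
CLOUD_CONFIG = """#cloud-config
packages:
  - htop
write_files:
  - path: /opt/dlvm/README.txt
    content: |
      Customized at first boot via cloud-init.
"""


def encode_user_data(cloud_config):
    """Base64-encode a cloud-init document so it can be passed as a single string."""
    return base64.b64encode(cloud_config.encode("utf-8")).decode("ascii")


def decode_user_data(encoded):
    """Reverse of encode_user_data, handy for checking what was actually passed."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    encoded = encode_user_data(CLOUD_CONFIG)
    print(encoded)
    # Round-trip check: decoding must give back the original document.
    assert decode_user_data(encoded) == CLOUD_CONFIG
```

Decoding the string before deployment is a cheap way to catch indentation mistakes in the YAML, which cloud-init is quite sensitive to.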

5. How does the life cycle of the DL VM work?

As mentioned above, Deep Learning VMs are delivered by VMware by Broadcom through a content library. Each DL VM release is versioned; as of VCF 5.2.1, DL VM 1.2 has been released. Users are advised to deploy the latest version. Since these VMs are meant as development environments for MLOps engineers or AI developers to kick-start their work, upgrading from one version to a newer one is not supported. While an older version of the DL VM should continue to work, users must redeploy a new DL VM with the latest image from the content library to move to a newer version. I will update this post on DL VM compatibility with the underlying VCF version soon.

6. Who are the consumers of the DL VM?

A DL VM can be deployed by various user personas, such as a cloud admin or DevOps engineer on behalf of MLOps engineers or data scientists. MLOps engineers/data scientists can even deploy it themselves using VCF Automation self-service. The DL VM is meant for data scientists or MLOps engineers to do their AI app development, including LLM fine-tuning and inference, without spending significant time deploying and installing compatible tools and frameworks.

I hope you got some quick value out of this post. Please stay tuned for new questions that will dive deep into various key aspects of Deep Learning VMs.

Further learning

  1. My first post: an ultimate introduction to VMware Private AI Foundation
  2. Official documentation here