Editor’s Note: This post is part of the AI Decoded series, which helps you understand AI by making it more accessible and introduces new hardware, software, tools, and acceleration technologies for GeForce RTX PC and NVIDIA RTX workstation users.
Large language models (LLMs) are reshaping productivity. Trained on massive amounts of data, they can draft documents, summarize web pages, and accurately answer questions on almost any topic. LLMs are at the heart of many generative AI use cases, including digital assistants, conversational avatars, and customer service agents.
Many modern LLMs can run locally on a PC or workstation, letting users keep their conversations and content private on the device, use AI without an internet connection, or simply take advantage of a powerful NVIDIA GeForce RTX GPU. Other models are too large and complex to fit into a local GPU's video memory (VRAM) and traditionally require hardware in large data centers.
However, with a technique called GPU offloading, part of a data-center-class model can be accelerated locally on an RTX-powered PC. This lets users benefit from GPU acceleration without being limited by GPU memory constraints.
Balancing model size, quality, and performance
There is a trade-off between a model's size and the quality and speed of its responses. In general, larger models deliver higher-quality responses but run more slowly, while smaller models run faster at the cost of some quality.
This trade-off isn't always straightforward. For some use cases, quality matters more than speed: users may prioritize accuracy for tasks like content creation, since those jobs can run in the background. Conversational assistants, on the other hand, need to respond both accurately and quickly.
The most accurate LLMs, designed to run in data centers, are tens of gigabytes in size and may not fit in a GPU's memory. This has traditionally prevented applications from taking advantage of GPU acceleration. With GPU offloading, however, parts of the LLM run on the GPU and parts on the CPU, so users can benefit from GPU acceleration regardless of model size.
Optimizing AI acceleration with GPU offloading and LM Studio
LM Studio is an application that lets users download and host LLMs on their desktop or laptop computer, with an easy-to-use interface that allows extensive customization of how the models operate. LM Studio is built on top of llama.cpp, so it is fully optimized for use with GeForce RTX and NVIDIA RTX GPUs.
With LM Studio and GPU offloading, users can leverage GPU acceleration to boost the performance of locally hosted LLMs.
GPU offloading allows LM Studio to divide the model into small chunks, or “subgraphs,” that represent the layers of the model architecture. Subgraphs are not permanently pinned to the GPU, but are loaded and unloaded as needed. LM Studio’s GPU Offload slider allows users to determine how many layers are processed by the GPU.
LM Studio’s interface makes it easy to determine how much of an LLM should be loaded onto the GPU.
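LM Studio exposes this choice as a slider in its interface, but the same idea can be sketched programmatically with llama.cpp's Python bindings (llama-cpp-python), where the number of layers to offload is a single parameter. The model file name below is a placeholder, and this is an illustrative sketch of the concept rather than LM Studio's own API.

```python
# Sketch: partial GPU offload with llama-cpp-python, the Python bindings for
# llama.cpp (the engine LM Studio builds on). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-27b-it-Q4_K_M.gguf",  # any local GGUF model file
    n_gpu_layers=24,   # how many of the model's layers run on the GPU;
                       # 0 = CPU only, -1 = offload every layer that fits
    n_ctx=4096,        # context window size
)

out = llm("Explain GPU offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Raising or lowering n_gpu_layers plays the same role as moving LM Studio's GPU offload slider: more layers on the GPU means more of each inference step is accelerated, at the cost of more VRAM.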
For example, consider using GPU offloading with a large model like Gemma 2 27B. The "27B" refers to the number of parameters in the model, which gives an estimate of how much memory is needed to run it. With 4-bit quantization, a technique that reduces an LLM's size without significantly reducing accuracy, each parameter takes up half a byte of memory. That works out to roughly 13.5 billion bytes, or 13.5GB, plus overhead that typically ranges from 1 to 5GB.
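To make that arithmetic concrete, here is a minimal back-of-the-envelope estimate in Python. The function name and the default overhead value are illustrative assumptions, not figures from LM Studio.

```python
def estimate_model_memory_gb(num_params_billion: float,
                             bits_per_param: int = 4,
                             overhead_gb: float = 3.0) -> float:
    """Rough memory estimate for a quantized LLM.

    num_params_billion: model size in billions of parameters (27 for Gemma 2 27B)
    bits_per_param: quantization level (4-bit means half a byte per parameter)
    overhead_gb: extra memory for the KV cache, activations, etc. (assumed value)
    """
    weight_gb = num_params_billion * (bits_per_param / 8)  # 27 * 0.5 = 13.5 GB
    return weight_gb + overhead_gb

# Gemma 2 27B at 4-bit quantization: ~13.5 GB of weights plus overhead
print(f"{estimate_model_memory_gb(27):.1f} GB")
```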
Fully accelerating this model on a GPU requires 19GB of VRAM, which is available on a GeForce RTX 4090 desktop GPU. With GPU offloading, the model can run on a system with a lower-end GPU and still benefit from acceleration.
The table above shows how several popular models of increasing size run across a range of GeForce and NVIDIA RTX GPUs, with the maximum GPU offload level shown for each combination. Note that even with GPU offloading, users still need enough system RAM to hold the entire model.
With LM Studio, you can evaluate the performance impact of different GPU offloading levels compared with running on the CPU alone. The table below shows the results of running the same query at various offload levels on a GeForce RTX 4090 desktop GPU.
Depending on the proportion of the model offloaded to the GPU, users see higher throughput than when running on the CPU alone. For the Gemma 2 27B model, throughput climbs from a meager 2.1 tokens per second to increasingly usable speeds as more of the model runs on the GPU. This lets users take advantage of larger models that they otherwise wouldn't be able to run.
On this particular model, even users with an 8GB GPU will enjoy meaningful speedups compared to running on CPU alone. Of course, an 8GB GPU can always run smaller models that fit entirely into GPU memory and get full GPU acceleration.
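As a rough illustration of how such a comparison could be scripted outside LM Studio, the sketch below times the same prompt at several offload levels using llama-cpp-python. The model path and prompt are placeholders, and the reported numbers will vary with hardware.

```python
# Sketch: measure generation throughput (tokens/s) at different GPU offload
# levels with llama-cpp-python. Model path and prompt are placeholders.
import time
from llama_cpp import Llama

MODEL = "gemma-2-27b-it-Q4_K_M.gguf"
PROMPT = "Summarize the benefits of running LLMs locally."

for n_layers in (0, 16, 32, -1):  # 0 = CPU only, -1 = offload all layers
    llm = Llama(model_path=MODEL, n_gpu_layers=n_layers, n_ctx=2048, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.perf_counter() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"n_gpu_layers={n_layers:>3}: {tokens / elapsed:.1f} tokens/s")
    del llm  # free the model before loading the next configuration
```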
Achieve optimal balance
LM Studio's GPU offload capability is a powerful tool for unlocking the full potential of data-center-class LLMs, like Gemma 2 27B, locally on RTX AI PCs. It makes larger, more complex models accessible across the entire lineup of PCs powered by GeForce RTX and NVIDIA RTX GPUs.
Download LM Studio to try GPU offloading on larger models, or experiment with a variety of RTX-accelerated LLMs running locally on RTX AI PCs and workstations.
Generative AI is transforming gaming, video conferencing, and interactive experiences of all kinds. Subscribe to the AI Decoded newsletter to keep up with how AI is shaping the present and the future.