

There have been tremendous advancements in large language models (LLMs) in the open-source community, and it is now possible to experiment with some of these models on a consumer-grade GPU workstation. Here, I explain the various GPU choices that are available.

When choosing a GPU for a data science workstation, there are several factors to consider, such as the size of your data sets, the complexity of your models, and your budget. According to some sources, having 2, 3, or even 4 GPUs in a workstation can provide a surprising amount of compute capability and may be sufficient even for many large problems. It is also recommended to have at least two GPUs when doing development work, to enable local testing of multi-GPU functionality and scaling – even if the “production” jobs will be offloaded to a larger cluster.

NVIDIA versus AMD GPUs for Deep Learning and other Data Science Applications

NVIDIA dominates GPU compute acceleration and is unquestionably the standard. Its GPUs are the best supported and the easiest to work with. NVIDIA also provides an excellent data handling application suite called RAPIDS, which can significantly improve workflow throughput.

AMD GPUs are also used for deep learning projects. However, one of the main reasons AMD Radeon graphics cards are less common for deep learning is that the deep learning software and drivers for Radeon GPUs are not as actively developed. NVIDIA ships mature drivers and software stacks for deep learning, such as CUDA, cuDNN, and more.

Selecting Best GPU for Deep Learning Workstation

According to some sources, NVIDIA RTX 6000 Ada Generation GPUs bring unprecedented performance, features, and efficiency to creative and technical professionals.

The NVIDIA GeForce RTX 3060 is a good choice for data science use cases because it has a high number of CUDA cores and Tensor Cores that can accelerate machine learning training up to 215X faster, letting you perform more iterations, increase experimentation, and carry out deeper exploration. It also has 12 GB of VRAM, which is one of the sweet spots for training deep learning models. The GPU is built on NVIDIA’s Ampere architecture, the company’s second-generation RTX framework, and offers Ray Tracing Cores, Tensor Cores, new streaming multiprocessors, and high-speed GDDR6 memory.

The NVIDIA Quadro range of GPUs is designed for professional use cases such as CAD, 3D modeling, and video editing. The RTX 3060, by contrast, is a consumer-grade GPU that is nonetheless well suited to data science workloads, as described above.

Here are some of the best consumer-grade GPUs for data science use cases:

  • NVIDIA GeForce RTX 3090 – Best GPU for Deep Learning Overall
  • NVIDIA GeForce RTX 3080 (12GB) – The Best Value GPU for Deep Learning
  • NVIDIA GeForce RTX 3060 – Best Affordable Entry Level GPU for Deep Learning
  • NVIDIA GeForce RTX 3070 – Best Mid-Range GPU If You Can Use Memory Saving Techniques

Here’s a comparison of the NVIDIA GeForce RTX 3090, RTX 3080, RTX 3070, and RTX 3060 based on price and performance:

GPU Comparison

  GPU               CUDA cores   Tensor Cores   VRAM           Memory bandwidth
  RTX 3090          10,496       328            24 GB GDDR6X   936 GB/s
  RTX 3080 (12 GB)  8,960        280            12 GB GDDR6X   912 GB/s
  RTX 3070          5,888        184            8 GB GDDR6     448 GB/s
  RTX 3060          3,584        112            12 GB GDDR6    360 GB/s

The table above shows that the RTX 3090 has the highest number of CUDA cores and Tensor Cores and the highest memory bandwidth but it is also the most expensive. The RTX 3080 has a good balance between price and performance. The RTX 3070 is a good mid-range GPU if you can use memory-saving techniques. The RTX 3060 is an affordable entry-level GPU for deep learning.


According to gpu.userbenchmark.com, the RTX 4090 is based on NVIDIA’s Ada Lovelace architecture. It features 16,384 CUDA cores with base/boost clocks of 2.2/2.5 GHz, 24 GB of memory on a 384-bit memory bus, 128 3rd-gen RT cores, 512 4th-gen Tensor cores, DLSS 3 support, and a TDP of 450 W. The RTX 3070, by comparison, uses a GA104 GPU with a base clock of 1.5 GHz and a boost clock of 1.73 GHz, has a TDP of 220 W, and carries only 8 GB of GDDR6 memory. The performance gains will vary depending on the specific workload.
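The headline core counts and clocks above can be turned into a rough theoretical FP32 throughput figure, since each CUDA core can retire one fused multiply-add (2 FLOPs) per clock. This is only a back-of-envelope sketch – real throughput also depends on Tensor cores, memory bandwidth, and the workload:

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    Each CUDA core can retire one fused multiply-add (2 FLOPs) per clock,
    so peak = cores * boost clock * 2.
    """
    return cuda_cores * boost_clock_ghz * 2 / 1000.0

# Specs quoted above:
rtx_4090 = peak_fp32_tflops(16_384, 2.5)   # ~81.9 TFLOPS
rtx_3070 = peak_fp32_tflops(5_888, 1.73)   # ~20.4 TFLOPS
print(f"RTX 4090: {rtx_4090:.1f} TFLOPS, RTX 3070: {rtx_3070:.1f} TFLOPS")
print(f"Theoretical ratio: {rtx_4090 / rtx_3070:.1f}x")
```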

Why Do We Need CUDA Cores and Tensor Cores?

CUDA cores and Tensor cores are both types of processing units that are used in NVIDIA GPUs for accelerating deep learning tasks.

CUDA cores are the basic processing units in NVIDIA GPUs, and they are responsible for executing the general-purpose computations required for deep learning tasks. CUDA cores perform scalar and vector operations and are optimized for single-precision floating-point arithmetic.

Tensor cores, on the other hand, are specialized processing units that perform matrix operations required for deep learning tasks. Tensor cores are designed to accelerate the operations involved in training and inference of deep neural networks by performing mixed-precision matrix multiply and accumulate operations. Tensor cores can handle both single- and half-precision floating-point arithmetic, which makes them more efficient than CUDA cores for deep learning tasks.

In summary, CUDA cores are general-purpose processing units, while Tensor cores are specialized units that are optimized for deep learning tasks, particularly matrix operations.
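Why matrix operations dominate is easy to see by counting operations: a dense layer multiplying an (m × k) activation matrix by a (k × n) weight matrix costs roughly 2·m·k·n FLOPs, which is exactly the workload Tensor cores accelerate. A minimal sketch (the layer shapes are illustrative, not taken from any specific model):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for multiplying an (m x k) matrix by a (k x n) matrix:
    each of the m*n outputs needs k multiplies and k adds."""
    return 2 * m * k * n

# One transformer-style dense layer applied to 512 tokens
# (hypothetical shapes, for illustration):
flops = matmul_flops(m=512, k=4096, n=4096)
print(f"{flops / 1e9:.1f} GFLOPs for a single forward pass of this layer")
```

A single layer already costs tens of GFLOPs per pass, so shifting these multiplies from CUDA cores to mixed-precision Tensor cores is where the large training speedups come from.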

Combining Multiple GPUs for Data Science Solutions

You can combine multiple GPUs to create a more powerful hardware solution for data science applications. Computationally intensive CUDA C++ applications in high performance computing, data science, bioinformatics, and deep learning can be accelerated by using multiple GPUs, which can increase throughput and/or decrease your total runtime. You need to build parallelism into your deep learning workflows by using model parallelism or data parallelism.
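As a rough illustration of data parallelism (a pure-Python stand-in, not a real multi-GPU launcher), each worker computes gradients on its own shard of the batch, and the gradients are then averaged – the "all-reduce" step – so every worker applies the same weight update:

```python
def data_parallel_step(batch, n_gpus, grad_fn):
    """Simulate one data-parallel training step.

    batch:   list of samples
    n_gpus:  number of workers to shard the batch across
    grad_fn: maps a shard to a gradient value (a single float here,
             for simplicity)
    """
    # 1. Shard the batch across workers.
    shards = [batch[i::n_gpus] for i in range(n_gpus)]
    # 2. Each worker computes its local gradient
    #    (these run in parallel on real hardware).
    local_grads = [grad_fn(shard) for shard in shards]
    # 3. All-reduce: average the gradients so every worker
    #    sees the same update.
    return sum(local_grads) / len(local_grads)

# Toy gradient function: the mean of the shard's values.
mean_grad = data_parallel_step([1.0, 2.0, 3.0, 4.0], n_gpus=2,
                               grad_fn=lambda s: sum(s) / len(s))
print(mean_grad)  # averaged gradient across both workers
```

In practice, frameworks such as PyTorch provide this pattern ready-made (e.g. distributed data parallel training), but the structure – shard, compute locally, average – is the same.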


GPU Requirements for LLMs such as LLaMA


According to an article on arstechnica.com, typically running GPT-3 requires several datacenter-class A100 GPUs (also, the weights for GPT-3 are not public), but LLaMA made waves because it could run on a single beefy consumer GPU.


The minimum requirements for running a functional LLaMA model are quite low: roughly 32 GB of disk space and an 8 GB NVIDIA card are all that is required to run a LLaMA-7B model.







  • LLaMA-7B: full model takes 31.17 GB, fully quantized/compressed model takes 4.21 GB
  • LLaMA-13B: full model takes 60.21 GB, compressed model takes 8.14 GB
  • LLaMA-30B: full model takes 150.48 GB, compressed model takes 20.36 GB
  • LLaMA-65B: full model takes 432.64 GB, compressed model takes 40.88 GB
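The sizes above follow roughly from parameter count × bytes per weight (FP32 is 4 bytes, FP16 is 2, 4-bit quantization is half a byte), plus some overhead for tokenizer files and metadata. A back-of-envelope estimator, assuming nothing beyond that rule of thumb:

```python
def model_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of a model's weights in GB:
    parameters * bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# LLaMA-7B: ~7 billion parameters
print(f"FP32:  {model_size_gb(7e9, 32):.1f} GB")  # ~28 GB, vs 31.17 GB with overhead
print(f"4-bit: {model_size_gb(7e9, 4):.1f} GB")   # ~3.5 GB, vs 4.21 GB quoted
```

This also explains why an 8 GB consumer card is enough for a quantized 7B model but nowhere near enough for the full-precision weights.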


The article discusses designing and building a custom GPU workstation for basic deep learning tasks such as training and fine-tuning LLMs. It highlights the importance of CUDA cores and Tensor cores for deep learning workloads, which AMD GPUs lack. The article emphasizes that data scientists need a good GPU workstation for small-scale local experimentation, even though cloud-based or data-center-based GPU setups with A100s are necessary for heavy-duty tasks. It concludes that although AMD GPUs have excellent utility for gaming, they are of limited use to data scientists because of the lack of software support for CUDA libraries.

By Hassan Amin

Dr. Syed Hassan Amin holds a Ph.D. in Computer Science from Imperial College London, United Kingdom, and an MS in Computer System Engineering from GIKI, Pakistan. During his Ph.D., he worked on image processing, computer vision, and machine learning. He has done research and development in many areas, including Urdu and local-language optical character recognition, retail analysis, affiliate marketing, fraud prediction, 3D reconstruction of faces from 2D images, and retinal image analysis.