C++ - NVIDIA CUDA - Using the compute_xy and sm_xy compile options to generate an executable

To use an NVIDIA GPU (the graphics card processor) you need the NVIDIA CUDA driver.

And if you want to compile a program for that GPU, you'll have to use the CUDA toolkit.

In this tutorial we're going to see how to compile your project in order to generate an executable able to run on all available NVIDIA GPUs (past, present and future).

By "all" we mean, of course, all GPUs not yet deprecated.

First of all

For the example we are going to use GPU capabilities from version 3.5 to 8.7 (and beyond, thanks to the PTX).

To do so we'll use the CUDA toolkit 11.6.

Take the latest version available in order to compile our project with nvcc.

With version 11 of the CUDA toolkit, capabilities below 3.5 are deprecated, so we won't deal with them.
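You can check which toolkit version your nvcc comes from like this:

nvcc --version

(The "release" line of the output shows the toolkit version, 11.6 in our case.)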

All CUDA capabilities are organized by architecture as follows:

Tesla:

  • 1.0;
  • 1.1;
  • 1.2;
  • 1.3.

Fermi:

  • 2.0;
  • 2.1.

Kepler:

  • 3.0;
  • 3.2;
  • 3.5;
  • 3.7.

Maxwell:

  • 5.0;
  • 5.2;
  • 5.3.

Pascal:

  • 6.0;
  • 6.1;
  • 6.2.

Volta:

  • 7.0;
  • 7.2.

Turing:

  • 7.5.

Ampere:

  • 8.0;
  • 8.6;
  • 8.7.

Ada Lovelace:

  • 8.9.

Hopper:

  • 9.0.

So our playground will span from Kepler (at least 3.5) to Ampere (up to 8.7).

But we'll also generate a program able to work on GPUs with capabilities not yet released.

To achieve this we need to embed in the executable an intermediate code called PTX (Parallel Thread eXecution).

This PTX allows any future CUDA driver to compile, at run time, code able to work on future GPUs.

It's the JIT (Just In Time) compilation technique.

In this case the driver extracts the PTX part from the executable file at run time in order to generate the code necessary to run the program on the GPU actually present.

And of course this executable needs to be compiled first.

nvcc needs some options during the compilation.

They are represented by "compute_xy" and "sm_xy", where x is the major capability and y the minor one.

"compute_xy" targets a virtual architecture and "sm_xy" a real one, but the CUDA documentation may be a bit confusing about that.

To have something clearer (even if not perfectly accurate), let's say that the virtual part ("compute_xy") is the "software" and the real one ("sm_xy") the "electronic" side of your GPU.
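Throughout the examples we'll assume a minimal CUDA source file like the one below (the file name main.cu and the kernel are just placeholders for your own project):

// main.cu - a tiny kernel, just so the compilation examples have something to build
#include <cstdio>

__global__ void hello()
{
    printf("Hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello<<<1, 4>>>();       // launch the kernel with 4 threads
    cudaDeviceSynchronize(); // wait for the GPU to finish
    return 0;
}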

So to compile this executable we have different situations.

Example 1

First, we know explicitly the GPU capabilities and want to compile for these capabilities only.

For example we have a GPU with capabilities 7.2 (Volta architecture) and we'll be sure to use only this GPU for our research project.

So we have to compile the executable like this:

CUDA_GEN = \
-gencode arch=compute_72,code=sm_72

(Note that CUDA_GEN is a variable in your Makefile.)

It's really important not to have any space before or after the comma (,), otherwise nvcc will produce errors like the following:

nvcc fatal   : Unsupported gpu architecture ''
nvcc fatal   : 'arch' is not in 'keyword=value' format
nvcc fatal   : Option '--generate-code arch=compute_72', missing code
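To put CUDA_GEN in context, here is a minimal Makefile sketch (the target prog and the file main.cu are assumptions to adapt to your own project):

NVCC = nvcc

CUDA_GEN = \
-gencode arch=compute_72,code=sm_72

# The recipe line below must start with a tab.
prog: main.cu
	$(NVCC) $(CUDA_GEN) -o prog main.cu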


Example 2

Another example: the program will run on 2 GPUs only, let's say with capabilities 5.3 (Maxwell) and 8.0 (Ampere).

This is for example the case when you use 2 computers with different GPUs.

We'll use these compilation lines:

CUDA_GEN = \
-gencode arch=compute_53,code=sm_53 \
-gencode arch=compute_80,code=sm_80

So as you can see, each time we need to support a GPU with specific capabilities, we add a new line corresponding to those capabilities.

Really straightforward.
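If you're not sure which capabilities your GPUs actually have, you can query them with the CUDA runtime API. Here is a small sketch (the file name list_gpus.cu is an assumption):

// list_gpus.cu - print the compute capability of every visible GPU
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}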

Example 3

Let's now see a third example: this time we have created a piece of software but we don't know which kind of GPU will be used to run it.

This is for example the case of software for unknown clients, like a video game.

In this case we have to be as generic as possible.

So we start from the oldest GPU not yet deprecated in our CUDA 11.6 toolkit and do the following:

CUDA_GEN = \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_37,code=sm_37 \
-gencode arch=compute_50,code=sm_50 \
-gencode arch=compute_52,code=sm_52 \
-gencode arch=compute_53,code=sm_53 \
-gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_62,code=sm_62 \
-gencode arch=compute_70,code=sm_70 \
-gencode arch=compute_72,code=sm_72 \
-gencode arch=compute_75,code=sm_75 \
-gencode arch=compute_80,code=sm_80 \
-gencode arch=compute_86,code=sm_86 \
-gencode arch=compute_87,code=sm_87 \
-gencode arch=compute_87,code=compute_87

In this example we have added all the GPU capabilities our CUDA 11.6 toolkit is able to compile for.

But we also want our program to be able to run on future capabilities we don't know yet.

And yes, it's possible.

For that we add the last line, which tells the compiler to embed in the executable the PTX for the latest capabilities known when the program was built (8.7 here).

In this case, if in the future someone uses the program with GPU capabilities 9.9 or 12.3, the driver will JIT-compile the embedded 8.7 PTX for that GPU.

Of course the latest features from 9.9 or 12.3 won't be available in this program, but we don't mind because at the time the program was compiled those features didn't exist yet.
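If you want to verify what actually ended up in the binary, the cuobjdump tool shipped with the toolkit can list the embedded parts (prog being our assumed executable name):

cuobjdump --list-elf prog   # one cubin per real architecture (sm_35 ... sm_87)
cuobjdump --list-ptx prog   # the PTX embedded by the last line (compute_87)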

Conclusion

Those compilation options aren't so obvious, but it's really necessary to understand them.

It's also important to note that each -gencode line increases the size of your executable.

So if size matters, reduce the number of lines.

If size doesn't matter, then go for the maximum number of lines: this way you'll be sure to have a program able to run on every GPU listed, with maximum performance for each of them.

Anyway, if you got it, good job, you did it.
