PDF # 1 | x x l, | File type: PDF | pages serial communications c programmers guide to serial communications how a simple idea by reading. Second used book I got through Amazon. Looks brand new, except that it's been out of print for more then a decade. Great reference book. The price was. The Simple Serial Interface (SSI) Programmer's Guide provides system communications link between Zebra Technologies decoders and a serial host. E+C denotes 2 AIM IDs are transmitted: one for the UPC/EAN block; the second prefixes . PDF 0x X. L. 0 - Conforms with PDF spec. 1 - Backslash.
|Language:||English, Spanish, Arabic|
|ePub File Size:||25.59 MB|
|PDF File Size:||20.21 MB|
|Distribution:||Free* [*Regsitration Required]|
C programmer's guide to serial communications. Material. Type. Book. Language English. Title. C programmer's guide to serial communications. Author(S) Joe. Serial Programming Guide for POSIX Operating Systems. 5th Edition, 2nd Revision .. as well as how to access a serial port from a C program. What Are Serial. This document describes how to program communications with devices over a serial . This document is copyrighted (c) Peter Baumann, (c) Gary (ftp://caite.info−guide/lpg−caite.info in.
The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed. Streams are released by calling cudaStreamDestroy. Kernel launches are asynchronous, so to check for asynchronous errors, the application must synchronize in-between the kernel launch and the call to cudaPeekAtLastError or cudaGetLastError. Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess as illustrated in the following code sample.
Texture gather is only supported for CUDA arrays created with the cudaArrayTextureGather flag and of width and height less than the maximum specified in Table 14 for texture gather, which is smaller than for regular texture fetch. Texture gather is only supported on devices of compute capability 2.
For devices of compute capability 2. Table 14 lists the maximum surface width, height, and depth depending on the compute capability of the device. A surface object is created using cudaCreateSurfaceObject from a resource description of type struct cudaResourceDesc.
A surface reference is declared at file scope as a variable of type surface:. A surface reference can only be declared as a static global variable and cannot be passed as an argument to a function. A CUDA array must be read and written using surface functions of matching dimensionality and type and via a surface reference of matching dimensionality; otherwise, the results of reading and writing the CUDA array are undefined. Unlike texture memory, surface memory uses byte addressing.
This means that the x-coordinate used to access a texture element via texture functions needs to be multiplied by the byte size of the element to access the same element via a surface function. Cubemap surfaces are accessed using surfCubemapread and surfCubemapwrite surfCubemapread and surfCubemapwrite as a two-dimensional layered surface, i. Faces are ordered as indicated in Table 1. Cubemap layered surfaces are accessed using surfCubemapLayeredread and surfCubemapLayeredwrite surfCubemapLayeredread and surfCubemapLayeredwrite as a two-dimensional layered surface, i.
They are one dimensional, two dimensional, or three-dimensional and composed of elements, each of which has 1, 2 or 4 components that may be signed or unsigned 8-, , or bit integers, bit floats, or bit floats. CUDA arrays are only accessible by kernels through texture fetching as described in Texture Memory or surface reading and writing as described in Surface Memory. The texture and surface memory is cached see Device Memory Accesses and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data.
In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.
Registering a resource is potentially high-overhead and therefore typically called only once per resource. Each CUDA context which intends to use the resource is required to register it separately. In CUDA, it appears as a device pointer and can therefore be read and written by kernels or via cudaMemcpy calls.
Kernels can read from the array by binding it to a texture or surface reference. They can also write to it via the surface write functions if the resource has been registered with the cudaGraphicsRegisterFlagsSurfaceLoadStore flag. The array can also be read and written via cudaMemcpy2D calls.
Please note: The application needs to register the texture for interop before requesting an image or texture handle.
The following code sample uses a kernel to dynamically modify a 2D width x height grid of vertices stored in a vertex buffer object:. The following code sample uses a kernel to dynamically modify a 2D width x height grid of vertices stored in a vertex buffer object. There are however special considerations as described below when the system is in SLI mode. Because of this, allocations may fail earlier than otherwise expected.
While this is not a strict requirement, it avoids unnecessary data transfers between devices. Therefore on SLI configurations when data for different frames is computed on different CUDA devices it is necessary to register the resources for each separatly. There are two version numbers that developers should care about when developing a CUDA application: The compute capability that describes the general specifications and features of the compute device see Compute Capability and the version of the CUDA driver API that describes the features supported by the driver API and runtime.
It allows developers to check whether their application requires a newer device driver than the one currently installed.
This is important, because the driver API is backward compatible , meaning that applications, plug-ins, and libraries including the C runtime compiled against a particular version of the driver API will continue to work on subsequent device driver releases as illustrated in Figure The driver API is not forward compatible , which means that applications, plug-ins, and libraries including the C runtime compiled against a particular version of the driver API will not work on previous versions of the device driver.
It is important to note that there are limitations on the mixing and matching of versions that is supported:. On Tesla solutions running Windows Server and later or Linux, one can set any device in a system in one of the three following modes using NVIDIA's System Management Interface nvidia-smi , which is a tool distributed as part of the driver:.
This means, in particular, that a host thread using the runtime API without explicitly calling cudaSetDevice might be associated with a device other than device 0 if device 0 turns out to be in prohibited mode or in exclusive-process mode and used by another process.
Note also that, for devices featuring the Pascal architecture onwards compute capability with major revision number 6 and higher , there exists support for Compute Preemption. This allows compute tasks to be preempted at instruction-level granularity, rather than thread block granularity as in prior Maxwell and Kepler GPU architecture, with the benefit that applications with long-running kernels can be prevented from either monopolizing the system or timing out.
However, there will be context switch overheads associated with Compute Preemption, which is automatically enabled on those devices for which support exists. The individual attribute query function cudaDeviceGetAttribute with the attribute cudaDevAttrComputePreemptionSupported can be used to determine if the device in use supports Compute Preemption.
Users wishing to avoid context switch overheads associated with different processes can ensure that only one process is active on the GPU by selecting exclusive-process mode. Applications may query the compute mode of a device by checking the computeMode device property see Device Enumeration. GPUs that have a display output dedicate some DRAM memory to the so-called primary surface , which is used to refresh the display device whose output is viewed by the user.
When users initiate a mode switch of the display by changing the resolution or bit depth of the display using NVIDIA control panel or the Display control panel on Windows , the amount of memory needed for the primary surface changes.
For example, if the user changes the display resolution from xxbit to xxbit, the system must dedicate 7. Full-screen graphics applications running with anti-aliasing enabled may require much more display memory for the primary surface.
If a mode switch increases the amount of memory needed for the primary surface, the system may have to cannibalize memory allocations dedicated to CUDA applications.
Therefore, a mode switch results in any call to the CUDA runtime to fail and return an invalid context error. However, the TCC mode removes support for any graphics functionality. When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity.
The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
A multiprocessor is designed to execute hundreds of threads concurrently. The instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level parallelism extensively through simultaneous hardware multithreading as detailed in Hardware Multithreading.
Unlike CPU cores they are issued in order however and there is no branch prediction and no speculative execution. SIMT Architecture and Hardware Multithreading describe the architecture features of the streaming multiprocessor that are common to all devices. Compute Capability 3. The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.
Thread Hierarchy describes how thread IDs relate to thread indices in the block. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.
If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads.
For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.
In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually. Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp.
As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from. Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp.
With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
A schedule optimizer determines how to group active threads from the same warp together into SIMT units. Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity 1 of previous hardware architectures. In particular, any warp-synchronous code such as synchronization-free, intra-warp reductions should be revisited to ensure compatibility with Volta and beyond.
See Compute Capability 7. The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive disabled. Threads can be inactive for a variety of reasons including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.
If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device see Compute Capability 3.
The execution context program counters, registers, etc. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction the active threads of the warp and issues the instruction to those threads. In particular, each multiprocessor has a set of bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks.
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor.
There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor.
These limits as well the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix Compute Capabilities. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Which strategies will yield the best performance gain for a particular portion of an application depends on the performance limiters for that portion; optimizing instruction usage of a kernel that is mostly limited by memory accesses will not yield any significant performance gain, for example.
Optimization efforts should therefore be constantly directed by measuring and monitoring the performance limiters, for example using the CUDA profiler. Also, comparing the floating-point operation throughput or memory throughput - whichever makes more sense - of a particular kernel to the corresponding peak theoretical throughput of the device indicates how much room for improvement there is for the kernel.
To maximize utilization the application should be structured in a way that it exposes as much parallelism as possible and efficiently maps this parallelism to the various components of the system to keep them busy most of the time. At a high level, the application should maximize parallel execution between the host, the devices, and the bus connecting the host to the devices, by using asynchronous functions calls and streams as described in Asynchronous Concurrent Execution.
It should assign to each processor the type of work it does best: For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data with each other, there are two cases: The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic.
Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.
At a lower level, the application should maximize parallel execution between the multiprocessors of a device.
Multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently as described in Asynchronous Concurrent Execution.
At an even lower level, the application should maximize parallel execution between the various functional units within a multiprocessor. As described in Hardware Multithreading , a GPU multiprocessor relies on thread-level parallelism to maximize utilization of its functional units. Utilization is therefore directly linked to the number of resident warps. At every instruction issue time, a warp scheduler selects a warp that is ready to execute its next instruction, if any, and issues the instruction to the active threads of the warp.
The number of clock cycles it takes for a warp to be ready to execute its next instruction is called the latency , and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock cycle during that latency period, or in other words, when latency is completely "hidden".
The number of instructions required to hide a latency of L clock cycles depends on the respective throughputs of these instructions see Arithmetic Instructions for the throughputs of various arithmetic instructions. Assuming maximum throughput for all instructions, it is: For devices of compute capability 3. The most common reason a warp is not ready to execute its next instruction is that the instruction's input operands are not available yet.
If all input operands are registers, latency is caused by register dependencies, i. In the case of a back-to-back register dependency i. Execution time varies depending on the instruction, but it is typically about 11 clock cycles for devices of compute capability 3.
This is also assuming enough instruction-level parallelism so that schedulers are always able to issue pairs of instructions for each warp. If some input operand resides in off-chip memory, the latency is much higher: The number of warps required to keep the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism. In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands i.
For example, assume this ratio is 30, also assume the latencies are cycles on devices of compute capability 3. Then about 40 warps are required for devices of compute capability 3.
Another reason a warp is not ready to execute its next instruction is that it is waiting at some memory fence Memory Fence Functions or synchronization point Memory Fence Functions. A synchronization point can force the multiprocessor to idle as more and more warps wait for other warps in the same block to complete execution of instructions prior to the synchronization point.
Having multiple resident blocks per multiprocessor can help reduce idling in this case, as warps from different blocks do not need to wait for each other at synchronization points.
The number of blocks and warps residing on each multiprocessor for a given kernel call depends on the execution configuration of the call Execution Configuration , the memory resources of the multiprocessor, and the resource requirements of the kernel as described in Hardware Multithreading.
The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory and the amount of dynamically allocated shared memory. The number of registers used by a kernel can have a significant impact on the number of resident warps. For example, for devices of compute capability 6. But as soon as the kernel uses one more register, only one block i.
Therefore, the compiler attempts to minimize register usage while keeping register spilling see Device Memory Accesses and the number of instructions to a minimum. Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds. Each double variable and each long long variable uses two registers.
The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation is therefore recommended. Applications can also parameterize execution configurations based on register file size and shared memory size, which depends on the compute capability of the device, as well as on the number of multiprocessors and memory bandwidth of the device, all of which can be queried using the runtime see reference manual.
The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps as much as possible. Several API functions exist to assist programmers in choosing thread block size based on register and shared memory requirements. The following code sample calculates the occupancy of MyKernel. It then reports the occupancy level with the ratio between concurrent warps versus maximum warps per multiprocessor.
The following code sample configures an occupancy-based kernel launch of MyKernel according to the user input. A spreadsheet version of the occupancy calculator is also provided. The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy block size, registers per thread, and shared memory per thread.
The first step in maximizing overall memory throughput for the application is to minimize data transfers with low bandwidth. That means minimizing data transfers between the host and the device, as detailed in Data Transfer between Host and Device , since these have much lower bandwidth than data transfers between global memory and the device.
That also means minimizing data transfers between global memory and the device by maximizing use of on-chip memory: Shared memory is equivalent to a user-managed cache: The application explicitly allocates and accesses it.
As illustrated in CUDA C Runtime , a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block:. For some applications e. As mentioned in Compute Capability 3. The throughput of memory accesses by a kernel can vary by an order of magnitude depending on access pattern for each type of memory.
The next step in maximizing memory throughput is therefore to organize memory accesses as optimally as possible based on the optimal memory access patterns described in Device Memory Accesses. This optimization is especially important for global memory accesses as global memory bandwidth is low, so non-optimal global memory accesses have a higher impact on performance. Applications should strive to minimize data transfer between the host and the device. One way to accomplish this is to move more code from the host to the device, even if that means running kernels with low parallelism computations.
Intermediate data structures may be created in device memory, operated on by the device, and destroyed without ever being mapped by the host or copied to host memory. Also, because of the overhead associated with each transfer, batching many small transfers into a single large transfer always performs better than making each transfer separately.
On systems with a front-side bus, higher performance for data transfers between host and device is achieved by using page-locked host memory as described in Page-Locked Host Memory.
In addition, when using mapped page-locked memory Mapped Memory , there is no need to allocate any device memory and explicitly copy data between device and host memory. Data transfers are implicitly performed each time the kernel accesses the mapped memory. For maximum performance, these memory accesses must be coalesced as with accesses to global memory see Device Memory Accesses.
Assuming that they are and that the mapped memory is read or written only once, using mapped page-locked memory instead of explicit copies between device and host memory can be a win for performance. On integrated systems where device memory and host memory are physically the same, any copy between host and device memory is superfluous and mapped page-locked memory should be used instead. Applications may query a device is integrated by checking that the integrated device property see Device Enumeration is equal to 1.
An instruction that accesses addressable memory i. How the distribution affects the instruction throughput this way is specific to each type of memory and described in the following sections. For example, for global memory, as a general rule, the more scattered the addresses are, the more reduced the throughput is.
Global memory resides in device memory and device memory is accessed via , , or byte memory transactions. These memory transactions must be naturally aligned: Only the , , or byte segments of device memory that are aligned to their size i. When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. For example, if a byte memory transaction is generated for each thread's 4-byte access, throughput is divided by 8. How many transactions are necessary and how much throughput is ultimately affected varies with the compute capability of the device.
To maximize global memory throughput, it is therefore important to maximize coalescing by:. Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes.
Any access via a variable or a pointer to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned i. If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing.
It is therefore recommended to use types that meet this requirement for data that resides in global memory. The alignment requirement is automatically fulfilled for the built-in types of char, short, int, long, longlong, float, double like float2 or float4. Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least bytes. Reading non-naturally aligned 8-byte or byte words produces incorrect results off by a few words , so special care must be taken to maintain alignment of the starting address of any value or array of values of these types.
A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays with multiple calls to cudaMalloc or cuMemAlloc is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block's starting address.
For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size. In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The cudaMallocPitch and cuMemAllocPitch functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are:. Inspection of the PTX assembly code obtained by compiling with the -ptx or -keep option will tell if a variable has been placed in local memory during the first compilation phases as it will be declared using the.
Even if it has not, subsequent compilation phases might still decide otherwise though if they find it consumes too much register space for the targeted architecture: Inspection of the cubin object using cuobjdump will tell if this is the case. Note that some mathematical functions have implementation paths that might access local memory.
The local memory space resides in device memory, so local memory accesses have same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses.
Local memory is however organized such that consecutive bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address e. On some devices of compute capability 3. On devices of compute capability 5. Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.
Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized.
The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests.
If the number of separate memory requests is n , the initial memory request is said to cause n -way bank conflicts.
To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
This is described in Compute Capability 3. The constant memory space resides in device memory and is cached in the constant cache. A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise. The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache.
The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency. Reading device memory through texture or surface fetching present some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:.
In this section, throughputs are given in number of operations per clock cycle per multiprocessor. All throughputs are for one multiprocessor. They must be multiplied by the number of multiprocessors in the device to get throughput for the whole device. Table 2 gives the throughputs of the arithmetic instructions that are natively supported in hardware for devices of various compute capabilities. Other instructions and functions are implemented on top of the native instructions.
The implementation may be different for devices of different compute capabilities, and the number of native instructions after compilation may fluctuate with every compiler version. For complicated functions, there can be multiple code paths depending on input.
The nvcc user manual describes these compilation flags in more details. To preserve IEEE semantics the compiler can optimize 1. It is therefore recommended to invoke rsqrtf directly where desired. Single-precision floating-point square root is implemented as a reciprocal square root followed by a reciprocal instead of a reciprocal square root followed by a multiplication so that it gives correct results for 0 and infinity.
More precisely, the argument reduction code see Mathematical Functions for implementation comprises two code paths referred to as the fast path and the slow path, respectively. The fast path is used for arguments sufficiently small in magnitude and essentially consists of a few multiply-add operations.
The slow path is used for arguments large in magnitude and consists of lengthy computations required to achieve correct results over the entire argument range. At present, the argument reduction code for the trigonometric functions selects the fast path for arguments whose magnitude is less than As the slow path requires more registers than the fast path, an attempt has been made to reduce register pressure in the slow path by storing some intermediate variables in local memory, which may affect performance because of local memory high latency and bandwidth see Device Memory Accesses.
At present, 28 bytes of local memory are used by single-precision functions, and 44 bytes are used by double-precision functions. However, the exact amount is subject to change. Due to the lengthy computations and use of local memory in the slow path, the throughput of these trigonometric functions is lower by one order of magnitude when the slow path reduction is required as opposed to the fast path reduction.
Integer division and modulo operation are costly as they compile to up to 20 instructions. They can be replaced with bitwise operations in some cases: In order to achieve good half precision floating-point add, multiply or multiply-add throughput it is recommended that the half2 datatype is used.
Vector intrinsics eg. Using half2 in place of two calls using half may also help performance of other intrinsics, such as warp shuffles.
Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for:. This last case can be avoided by using single-precision floating-point constants, defined with an f suffix such as 3. Any flow control instruction if , switch , do , for , while can significantly impact the effective instruction throughput by causing threads of the same warp to diverge i. If this happens, the different executions paths have to be serialized, increasing the total number of instructions executed for this warp.
To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic as mentioned in SIMT Architecture. In this case, no warp diverges since the controlling condition is perfectly aligned with the warps. Sometimes, the compiler may unroll loops or it may optimize out short if or switch blocks by using branch predication instead, as detailed below.
In these cases, no warp can ever diverge. The programmer can also control loop unrolling using the pragma unroll directive see pragma unroll. When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed.
Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands. The compute capability, number of multiprocessors, clock frequency, total amount of device memory, and other properties can be queried using the runtime see reference manual. Function execution space specifiers denote whether a function executes on the host or on the device and whether it is callable from the host or from the device.
Such a function is:. Variable memory space specifiers denote the memory location on the device of a variable. However in some cases the compiler might choose to place it in local memory, which can have adverse performance consequences as detailed in Device Memory Accesses. If none of them is present, the variable:. All variables declared in this fashion, start at the same address in memory, so that the layout of the variables in the array must be explicitly managed through offsets.
For example, if one wants the equivalent of. Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work since array1 is not aligned to 4 bytes. Alignment requirements for the built-in vector types are listed in Table 3. Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, and which inhibits all kind of optimization from code re-ordering to common sub-expression elimination.
Here is an example subject to the aliasing issue, where use of restricted pointer can help the compiler to reduce the number of instructions:. In C-type languages, the pointers a , b , and c may be aliased, so any write through c could modify elements of a or b.
This means that to guarantee functional correctness, the compiler cannot load a and b into registers, multiply them, and store the result to both c and c , because the results would differ from the abstract execution model if, say, a is really the same location as c. So the compiler cannot take advantage of the common sub-expression.
Likewise, the compiler cannot just reorder the computation of c into the proximity of the computation of c and c because the preceding write to c could change the inputs to the computation of c.
By making a , b , and c restricted pointers, the programmer asserts to the compiler that the pointers are in fact not aliased, which in this case means writes through c would never overwrite elements of a or b. This changes the function prototype as follows:.
Note that all pointer arguments need to be made restricted for the compiler optimizer to derive any benefit. The effects here are a reduced number of memory accesses and reduced number of computations. This is balanced by an increase in register pressure due to "cached" loads and common sub-expressions. Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can have negative performance impact on CUDA code, due to reduced occupancy.
The alignment requirements of the vector types are detailed in Table 3. This type is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3 , any component left unspecified is initialized to 1. Built-in variables specify the grid and block dimensions and the block and thread indices. They are only valid within functions that are executed on the device.
This variable is of type dim3 see dim3 and contains the dimensions of the grid. This variable is of type uint3 see char, short, int, long, longlong, float, double and contains the block index within the grid. This variable is of type dim3 see dim3 and contains the dimensions of the block.
This variable is of type uint3 see char, short, int, long, longlong, float, double and contains the thread index within the block.
This variable is of type int and contains the warp size in threads see SIMT Architecture for the definition of a warp. The CUDA programming model assumes a device with a weakly-ordered memory model, that is the order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread.
Memory fence functions can be used to enforce some ordering on memory accesses. The memory fence functions differ in the scope in which the orderings are enforced but they are independent of the accessed memory space shared memory, global memory, page-locked host memory, and the memory of a peer device.
Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this is ensured by using the volatile keyword as detailed in Volatile Qualifier. A common use case is when threads consume some data produced by other threads as illustrated by the following code sample of a kernel that computes the sum of an array of N numbers in one call. Each block first sums a subset of the array and stores the result in global memory. When all blocks are done, the last block done reads each of these partial sums from global memory and sums them to obtain the final result.
In order to determine which block is finished last, each block atomically increments a counter to signal that it is done with computing and storing its partial sum see Atomic Functions about atomic functions. The last block is the one that receives the counter value equal to gridDim.
If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and therefore, might reach gridDim. In the code sample below, the visibility of memory operations on the result variable is ensured by declaring it as volatile see Volatile Qualifier. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses.
These data hazards can be avoided by synchronizing threads in-between these accesses. Devices of compute capability 2. Mathematical Functions provides accuracy information for some of these functions when relevant. Texture fetching is described in Texture Fetching. It does not perform any texture filtering. For integer types, it may optionally promote the integer to single-precision floating point. The level-of-detail is derived from the X-gradient dx and Y-gradient dy. The level-of-detail is derived from the dx and dy gradients.
The level-of-detail used is given by level. The level-of-detail is given by level. The level-of-detail is derived from the dx and dy X- and Y-gradients.
Type is a 4-component vector type. Surface functions are only supported by devices of compute capability 2. In the sections below, boundaryMode specifies the boundary mode, that is how out-of-range surface coordinates are handled; it is equal to either cudaBoundaryModeClamp , in which case out-of-range coordinates are clamped to the valid range, or cudaBoundaryModeZero , in which case out-of-range reads return zero and out-of-range writes are ignored, or cudaBoundaryModeTrap , in which case out-of-range accesses cause the kernel execution to fail.
The read-only data cache load function is only supported by devices of compute capability 3. An atomic function performs a read-modify-write atomic operation on one bit or bit word residing in global or shared memory.
For example, atomicAdd reads a word at some address in global or shared memory, adds a number to it, and writes the result back to the same address. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads. In other words, no other thread can access this address until the operation is complete. Atomic functions can only be used in device functions. On GPU architectures with compute capability lower than 6.
The new scoped versions of atomics are available for all atomics listed below only for compute capabilities 6. For example, atomicAdd for double-precision floating-point numbers is not available on devices with compute capability lower than 6. These three operations are performed in one atomic transaction. The function returns old.
The bit floating-point version of atomicAdd is only supported by devices of compute capability 2. The bit floating-point version of atomicAdd is only supported by devices of compute capability 6. These two operations are performed in one atomic transaction. The bit version of atomicMin is only supported by devices of compute capability 3.
The bit version of atomicMax is only supported by devices of compute capability 3. The function returns old Compare And Swap. The bit version of atomicAnd is only supported by devices of compute capability 3. The bit version of atomicOr is only supported by devices of compute capability 3. The bit version of atomicXor is only supported by devices of compute capability 3. Returns 1 if ptr contains the generic address of an object in global memory space, otherwise returns 0.
Returns 1 if ptr contains the generic address of an object in shared memory space, otherwise returns 0. Returns 1 if ptr contains the generic address of an object in constant memory space, otherwise returns 0. Returns 1 if ptr contains the generic address of an object in local memory space, otherwise returns 0.
Deprecation notice: Removal notice: When targeting devices with compute capability 7. The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input an integer predicate from each thread in the warp and compare those values with zero.
The results of the comparisons are combined reduced across the active threads of the warp in one of the following ways, broadcasting a single return value to each participating thread:. A bit, representing the thread's lane ID, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware.
All active threads named in mask must execute the same intrinsic with the same mask, or the result is undefined. T can be int , unsigned int , long , unsigned long , long long , unsigned long long , float or double.
A bit, representing the thread's lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined. Deprecation Notice: Removal Notice: The exchange occurs simultaneously for all active threads within the warp and named in mask , moving 4 or 8 bytes of data per thread depending on the type.
Threads within a warp are referred to as lanes , and may have an index between 0 and warpSize-1 inclusive.
Four source-lane addressing modes are supported:. If the target thread is inactive , the retrieved value is undefined. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. If srcLane is outside the range [0: The value of var held by the resulting lane ID is returned: The source lane index will not wrap around the value of width , so effectively the lower delta lanes will be unchanged. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned.
This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast. If the source lane ID is out of range or the source thread has exited, the calling thread's own var is returned.
If the target thread is inactive, the retrieved value is undefined. Results are unspecified for other values. These operations are supported on mixed-precision floating point data for devices of compute capability 7. This requires co-operation from all threads in a warp. In addition, these operations are allowed in conditional code only if the condition evaluates identically across the entire warp , otherwise the code execution is likely to hang.
All following functions and types are defined in the namespace nvcuda:: Sub-byte operations are considered preview, i. This extra functionality is defined in the nvcuda:: An overloaded class containing a section of a matrix distributed across all threads in the warp. The mapping of matrix elements into fragment internal storage is unspecified and subject to change in future architectures. The m , n and k sizes describe the shape of the warp-wide matrix tiles participating in the multiply-accumulate operation.
The dimension of each tile depends on its role. The Layout parameter for an accumulator matrix should retain the default value of void. A row or column layout is specified only when the accumulator is loaded or stored as described below. The values of mptr , ldm , layout and all template parameters for a must be the same for all threads in the warp. This function must be called by all threads in the warp, or the result is undefined.
Fill a matrix fragment with a constant value v. Programming Guides. Software Examples. Related Links. Sales Network. Web Shop. Contact Us. The table below lists all of the currently available programming guides and indicates. Document Title. BT81x Programming guide. FT User Guide.
User Guide for FT LibFT User Guide. User Guide for LibFT FT51A Programming Guide. Programming Guide for using FT51A firmware libraries. D3XX Programmer's Guide. FT81x Series Programmers Guide.
EVE2 programming guide. FT Series Programmers Guide. D2XX Programmer's Guide.