Where CUDA leads: practical applications of GPGPU technology. Applications that run on CUDA

For decades, Moore's Law held: every two years the number of transistors on a chip doubles. That observation dates back to 1965, but over the past five years the idea of physical multi-core in consumer-class processors has developed rapidly: in 2005 Intel introduced the Pentium D and AMD the Athlon X2. Back then, applications that could use two cores could be counted on the fingers of one hand. Yet the next, genuinely revolutionary generation of Intel processors had exactly two physical cores, and the Quad series followed in January 2007 - around the same time Moore himself admitted that his law would soon cease to apply.

What now? Dual-core processors sit even in budget office systems, and four physical cores have become the norm - all in just two or three years. Clock frequencies are no longer being raised; instead, the architecture is improved and the number of physical and virtual cores grows. Meanwhile, the idea of using video adapters equipped with tens or even hundreds of computing units has been around for a long time.

And although the prospects for GPU computing are enormous, and the most popular solution - Nvidia CUDA - is free, well documented and generally quite easy to adopt, there are still not many applications that use this technology. Most of them are specialized calculations of no interest to the average user. But there are also programs aimed at the mass user, and those are what this article is about.

First, a little about the technology itself and what it is used for. Since this article is aimed at a wide range of readers, I will try to explain things in accessible language, without complex terms and somewhat briefly.

CUDA (Compute Unified Device Architecture) is a software and hardware architecture that allows computations to be performed on NVIDIA graphics processors supporting GPGPU technology (arbitrary computation on video cards). The CUDA architecture first appeared on the market with the release of the eighth-generation NVIDIA chip, the G80, and is present in all subsequent series of graphics chips used in the GeForce, Quadro and Tesla accelerator families. (c) Wikipedia.org

Incoming threads are processed independently of each other, i.e. in parallel.

The work is divided into three levels:

Grid - corresponds to one kernel launch. Contains a one-, two-, or three-dimensional array of blocks.

Block - contains many threads. Threads of different blocks cannot interact with each other. Why introduce blocks at all? Each block is essentially responsible for its own subtask. For example, a large image (which is a matrix) can be divided into several smaller parts (submatrices), with each part of the image processed in parallel.

Thread - a single thread of execution. Threads within one block can interact either through shared memory (which, by the way, is much faster than global memory) or through thread synchronization primitives.

Warp - a group of threads that the hardware executes together; on all modern GPUs the warp size is 32. There is also the half-warp, half of a warp, because memory accesses are usually serviced separately for the first and second halves of a warp.
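To make the hierarchy more tangible, here is a minimal sketch (my illustration, not code from the article; d_data is assumed to be a device pointer allocated elsewhere) of how a thread finds its array element from its block and thread coordinates:

// Each thread handles one element of the array.
// blockIdx  - coordinates of the block within the grid
// blockDim  - size of a block, in threads
// threadIdx - coordinates of the thread within its block
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)              // the last block may be only partially used
        data[i] *= factor;
}

// Host side: a one-dimensional grid of one-dimensional blocks.
// 256 threads per block = 8 warps of 32 threads each.
int n = 4096;
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);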

As you can see, this architecture lends itself well to parallelizing tasks. Still, although programming is done in the C language with some restrictions, in reality not everything is so simple: not everything can be parallelized. There are also no standard functions for generating random numbers (or for initialization); all of this has to be implemented separately. And although there are plenty of ready-made options, none of it is much fun. The ability to use recursion appeared only relatively recently.

For clarity, a small console program was written (to keep the code minimal) that performs operations on two arrays of type float, i.e. with non-integer values. For the reasons stated above, initialization (filling the arrays with arbitrary values) was done by the CPU. Then 25 different operations were performed on the corresponding elements of each array, with the intermediate results written to a third array. A sketch of what such a program might look like is shown below. The size of the arrays was varied; the results are as follows:
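The original listing is not shown in the article, so this is a minimal reconstruction by analogy (the array size, the d_* names and the chain of 25 operations are placeholders, not the author's exact code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// 25 arithmetic operations on each pair of elements; the result goes to c.
__global__ void compute(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = a[i], y = b[i], r = 0.0f;
    for (int k = 0; k < 25; ++k)          // the "25 different operations"
        r += x * y + x / (y + 1.0f) - k;  // stand-in arithmetic
    c[i] = r;
}

int main()
{
    const int n = 4096;
    size_t bytes = n * sizeof(float);
    float *a = (float*)malloc(bytes), *b = (float*)malloc(bytes), *c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) {         // initialization is done by the CPU
        a[i] = rand() / (float)RAND_MAX;
        b[i] = rand() / (float)RAND_MAX;
    }
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    compute<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}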

A total of 4 tests were carried out:

1024 elements in each array:

It is clearly seen that with such a small number of elements parallel computing is of little use: the calculations themselves take far less time than their preparation.

4096 elements in each array:

Now you can see that the video card performs operations on the arrays 3 times faster than the processor. Moreover, the execution time of this test on the video card did not increase (the slight decrease can be attributed to measurement error).

There are now 12288 elements in each array:

The video card's lead has doubled. Again, note that the execution time on the video card increased only insignificantly, while on the processor it grew more than 3 times, i.e. proportionally to the complexity of the task.

And the last test is 36864 elements in each array:

In this case the speedup reaches impressive values: almost 22 times faster on the video card. And again, the execution time on the video card grew only slightly, while on the processor it grew the expected 3 times, again proportionally to the complexity of the task.

If you keep making the calculations more complex, the video card wins by more and more. Although the example is somewhat exaggerated, it shows the overall situation clearly. But, as mentioned above, not everything can be parallelized. Take calculating Pi: the only ready examples use the Monte Carlo method, and their accuracy is 7 decimal places, i.e. ordinary float. To increase the accuracy, long arithmetic is required, and that is where the problems begin, because implementing it efficiently on a GPU is very, very difficult. I could not find examples on the Internet that use CUDA to calculate Pi to 1 million decimal places. Attempts have been made to write such an application, but the simplest and most efficient methods for calculating Pi are the Brent-Salamin algorithm and the Gauss formula. The well-known SuperPI most likely uses the Gauss formula (judging by its speed and the number of iterations). And, judging by the fact that SuperPI is single-threaded, the lack of CUDA examples and the failure of my own attempts, it is not feasible to parallelize Pi calculation effectively.
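For reference, the Monte Carlo approach mentioned above looks roughly like this in CUDA. This is a sketch under the constraint the article describes (no library RNG), so each thread runs its own xorshift generator; all names and constants are illustrative:

// Each thread throws `trials` random points into the unit square and
// counts how many land inside the quarter circle; pi = 4 * hits / total.
__global__ void pi_mc(unsigned int* hits, int trials)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int s = 1234567u + tid * 2654435761u;   // per-thread seed
    unsigned int count = 0;
    for (int i = 0; i < trials; ++i) {
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;     // xorshift32 step
        float x = (s & 0xFFFF) / 65535.0f;
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;
        float y = (s & 0xFFFF) / 65535.0f;
        if (x * x + y * y <= 1.0f) ++count;
    }
    hits[tid] = count;   // the host sums the per-thread counters
}

With float arithmetic the result is only good to about 7 decimal places, exactly the limitation the article describes.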

By the way, you can see how the load on the GPU rises during the calculations, and how memory is allocated as well.

Now let's move on to the more practical benefits of CUDA: the programs that currently use this technology. For the most part, these are all kinds of audio/video converters and editors.

3 different video files were used in testing:

      *The history of the making of the film Avatar - 1920x1080, MPEG4, h.264.
      *Series "Lie to me" - 1280x720, MPEG4, h.264.
      *Series “It’s Always Sunny in Philadelphia” - 624x464, xvid.

The first two files used the .mkv container and weighed 1.55 GB, while the last was an .avi of 272 MB.

Let's start with a much-hyped and popular product: Badaboom. The version used was 1.2.1.74; the program costs $29.90.

The program's interface is simple and intuitive: on the left we select the source file or disc, and on the right the target device for which we will encode. There is also a user mode where parameters are set manually, and that is what we used.

First, let's look at how quickly and efficiently the video is encoded "into itself", i.e. at the same resolution and roughly the same size. We will measure speed in fps rather than elapsed time: this makes it easier to compare and to estimate how long a video of arbitrary length will take. And since we are looking at a "green" technology today, the graphs will be colored to match. :)

Encoding speed directly depends on quality - that much is obvious. It is worth noting that light resolutions (let's traditionally call them SD) are no problem for Badaboom: encoding ran at 5.5 times the framerate of the source (24 fps) video. Even heavy 1080p video is converted in real time. The quality of the result is very close to the original material, i.e. Badaboom encodes very, very efficiently.

More often, though, video is transcoded to a lower resolution, so let's see how things stand in this mode. As the resolution dropped, so did the bitrate: 9500 kbps for the 1080p output file, 4100 kbps for 720p and 2400 kbps for 720x404. The values were chosen for a reasonable size/quality ratio.

No comment needed. If you rip from 720p to ordinary SD quality, transcoding a two-hour film takes about 30 minutes. And the processor load stays insignificant the whole time; you can go about your business without noticing any slowdown.

What if you convert the video into a format for a mobile device? To do this, select the iPhone profile (bitrate 1 Mbit/s, 480x320) and look at the encoding speed:

Need I say anything? A two-hour movie at normal iPhone quality is transcoded in under 15 minutes. HD quality is harder, but still very fast. The main thing is that the quality of the output video remains quite high when viewed on a phone display.

Overall, impressions of Badaboom are positive: it is fast, and the interface is simple and clear. All sorts of bugs from earlier versions (I used the beta back in 2008) have been fixed - except one: the paths to the source file and to the output folder must not contain Russian letters. Compared to the program's advantages, though, this drawback is minor.

Next in line is Super LoiLoScope. The regular version costs 3,280 rubles, while the touch version, which supports touch control in Windows 7, goes for as much as 4,440 rubles. Let's try to figure out why the developer wants that kind of money and why a video editor needs multitouch support. The latest version, 1.8.3.3, was used.

It is quite difficult to describe the program's interface in words, so I made a short video instead. I will say right away that, as with all video converters for CUDA, GPU acceleration is supported only for output to MPEG4 with the h.264 codec.

During encoding the processor load is 100%, but this causes no discomfort: the browser and other lightweight applications do not slow down.

Now on to performance. To begin with, the same test as with Badaboom: transcoding the video to similar quality.

The results are much better than Badaboom's. The quality is also excellent; the difference from the original can only be spotted by comparing frames in pairs under a magnifying glass.

Wow - here LoiLoScope outperforms Badaboom by 2.5 times. Meanwhile you can easily cut and encode another video in parallel, read the news and even watch movies; even FullHD plays without problems, despite the maximal processor load.

Now let's try to make a video for a mobile device, naming the profile as it was called in Badaboom - iPhone (480x320, 1 Mbit/s):

This is not a mistake. Everything was rechecked several times, and each time the result was the same. Most likely the reason is simply that the SD file was recorded with a different codec and in a different container. During transcoding, video is first decoded, divided into matrices of a certain size, and then compressed. The ASP decoder used for xvid is slower than the AVC decoder (for h.264) at parallel decoding. Still, 192 fps is 8 times faster than the source video's framerate; a 23-minute episode is compressed in under 4 minutes. The situation repeated itself with other files compressed with xvid/DivX.

LoiLoScope left only pleasant impressions: the interface, despite being unusual, is convenient and functional, and the speed of operation is beyond praise. The relatively poor feature set is somewhat disappointing, but for simple editing you often only need to adjust the colors slightly, add smooth transitions and overlay text, and LoiLoScope does an excellent job of that. The price is also a little frightening: more than $100 for the regular version is normal abroad, but such figures still seem a bit wild to us. Although, I admit, if I often shot and edited home videos I might think about buying it. Along the way, by the way, I checked editing HD (or rather AVCHD) content straight from a video camera without prior conversion to another format: LoiLoScope showed no problems with .mts files.

A new technology is like a newly emerged evolutionary species. A strange creature, unlike the many old-timers. Sometimes awkward, sometimes funny. And at first its new qualities seem in no way suited to this settled and stable world.

However, a little time passes, and it turns out that the newcomer runs faster, jumps higher and is generally stronger. And it eats more flies than its retrograde neighbors. And then those same neighbors begin to understand that there is no point in quarreling with this clumsy newcomer. Better to be friends with it - and better still to organize a symbiosis. You'll see, there will be more flies for everyone.

GPGPU technology (General-Purpose Graphics Processing Units) existed for a long time only in the theoretical calculations of brainy academics. How else? To propose radically changing a computing process that had settled over decades, by entrusting the calculation of its parallel branches to a video card - only theorists are capable of that.

The CUDA logo reminds us that the technology grew up in the depths of 3D graphics.

But GPGPU technology was not going to gather dust on the pages of university journals for long. Having fluffed up the feathers of its best qualities, it attracted the attention of manufacturers. This is how CUDA was born - an implementation of GPGPU on the GeForce graphics processors manufactured by nVidia.

Thanks to CUDA, GPGPU technology has become mainstream. And now only the most short-sighted developer of programming systems, covered with a thick layer of laziness, fails to declare CUDA support for their product. IT publications considered it an honor to present the details of the technology in numerous plump popular-science articles, and competitors immediately sat down with templates and cross-compilers to develop something similar.

Public recognition is a dream not only for aspiring starlets but also for newly born technologies. And CUDA was lucky. It is well known; people talk and write about it.

They just write as if they were still discussing GPGPU in thick scientific journals. They bombard the reader with terms like "grid", "SIMD", "warp", "host", "texture and constant memory". They immerse him up to the crown in diagrams of nVidia GPU organization, lead him along the winding paths of parallel algorithms and (the strongest move) show long code listings in the C language. As a result, at the input of the article we have a fresh reader with a burning desire to understand CUDA, and at the output we have the same reader, but with a swollen head filled with a mess of facts, diagrams, code, algorithms and terms.

Meanwhile, the goal of any technology is to make our lives easier. And CUDA copes with this superbly. The results of its work will convince any skeptic better than hundreds of schemes and algorithms.

Not everywhere

CUDA is supported by high-performance nVidia Tesla supercomputers.

And yet, before looking at how CUDA makes the average user's life easier, it is worth understanding all of its limitations. As with a genie: any wish, but only one. CUDA has its Achilles heels too. One of them is the limited set of platforms it can run on.

The list of nVidia video cards that support CUDA is presented in a special list called CUDA Enabled Products. The list is quite impressive, but easy to classify. CUDA support is granted to:

    nVidia GeForce 8th, 9th, 100th, 200th and 400th series models with a minimum of 256 megabytes of video memory on board. Support extends to both desktop and mobile cards.

    The vast majority of desktop and mobile video cards are nVidia Quadro.

    All solutions from the nvidia ION netbook series.

    High-performance HPC (High Performance Computing) and nVidia Tesla supercomputer solutions used both for personal computing and for organizing scalable cluster systems.

Therefore, before using CUDA-based software products, it is worth checking this list of favorites.

In addition to the video card itself, CUDA support requires an appropriate driver. It is the link between the central and graphics processors, acting as a kind of software interface through which program code and data reach the multi-core treasure trove of the GPU. To be sure of making no mistake, nVidia recommends visiting the drivers page and getting the latest version.

...but the process itself

How does CUDA work? How to explain the complex process of parallel computing on a special GPU hardware architecture without plunging the reader into the abyss of specific terms?

You can try to do this by imagining how the central processor executes the program in symbiosis with the graphics processor.

Architecturally, the central processor (CPU) and its graphics counterpart (GPU) are built differently. If we draw an analogy with the automotive world, the CPU is a station wagon, the kind people call a "barn". It looks like a passenger car, but at the same time (from the developers' point of view) it is a jack of all trades: it plays the role of a small truck, a bus and a hypertrophied hatchback all at once. A station wagon, in short. It has few cylinder-cores, but they handle almost any task, and its impressive cache memory can store a heap of data.

The GPU, by contrast, is a sports car. It has only one function: deliver the pilot to the finish line as fast as possible. Hence no roomy trunk-memory and no extra seats. But it has hundreds of times more cylinder-cores than the CPU.

Thanks to CUDA, GPGPU program developers do not need to delve into the complexities of development for graphics engines such as DirectX and OpenGL.

Unlike the central processor, which can solve any task, including graphics, but with average performance, the graphics processor is adapted to the high-speed solution of one task: turning a bunch of polygons at the input into a bunch of pixels at the output. Moreover, this task can be solved in parallel on hundreds of relatively simple computing cores in the GPU.

So what kind of tandem can a station wagon and a sports car form? CUDA works something like this: the program runs on the CPU until it reaches a section of code that can be executed in parallel. Then, instead of being slowly chewed through on two (or even eight) cores of the coolest CPU, that section is handed over to hundreds of GPU cores. Its execution time shrinks dramatically, which means the execution time of the whole program shrinks too.

Technologically, almost nothing changes for the programmer. CUDA programs are written in the C language - more precisely, in a special dialect with stream extensions, whose ancestor, Brook, was developed at Stanford. The interface that transfers this code to the GPU is the driver of a CUDA-capable video card. It organizes the handling of that section of the program so that to the programmer the GPU looks like a CPU coprocessor - very much like the math coprocessor in the early days of personal computing. With the advent of Brook, CUDA-capable video cards and drivers for them, any programmer became able to reach the GPU from their programs. Previously this shamanism belonged to a narrow circle of the select, who had spent years honing programming techniques for the DirectX or OpenGL graphics engines.

Into this barrel of pretentious honey - the praises of CUDA - it is worth dropping a fly in the ointment, that is, its restrictions. Not every task can be solved with CUDA. Speeding up routine office work will not happen, but calculating the behavior of thousands of identical fighters in World of Warcraft is something you can entrust to CUDA. That is a made-up task, though; let's look at examples of what CUDA already solves very effectively.

Righteous works

CUDA is a very pragmatic technology. Having implemented CUDA support in its video cards, nVidia quite rightly expected the CUDA banner to be picked up by many enthusiasts, both in universities and in commerce. And so it happened. CUDA-based projects live and bring benefits.

NVIDIA PhysX

When advertising their next gaming masterpiece, manufacturers often emphasize its 3D realism. But no matter how real the 3D game world may be, if the elementary laws of physics, such as gravity, friction, and hydrodynamics, are implemented incorrectly, the falsehood will be felt instantly.

One of the capabilities of the NVIDIA PhysX physics engine is realistic cloth simulation.

Implementing algorithms for computer simulation of basic physical laws is very labor-intensive. The best-known companies in this field are the Irish company Havok, with its cross-platform Havok Physics engine, and the Californian Ageia, progenitor of the world's first physics processor (PPU - Physics Processing Unit) and the corresponding PhysX engine. The first, although acquired by Intel, is now actively working on optimizing the Havok engine for ATI video cards and AMD processors. Ageia, with its PhysX engine, became part of nVidia - which then solved the rather difficult problem of adapting PhysX to CUDA technology.

This became possible thanks to statistics. It had been statistically shown that no matter how complex a rendering a GPU performs, some of its cores still sit idle. It is on these cores that the PhysX engine runs.

Thanks to CUDA, the lion's share of calculations related to the physics of the game world came to be performed on the video card. The freed-up power of the central processor went to other gameplay tasks. The result was not long in coming: according to experts, gameplay performance with PhysX running on CUDA has increased by at least an order of magnitude. The plausibility of the simulated physics grew as well. CUDA takes care of the routine calculation of friction, gravity and the other things familiar to us for multidimensional objects. Now not only the heroes and their equipment fit perfectly into the laws of the physical world we know, but also dust, fog, blast waves, flame and water.

The CUDA version of the NVIDIA Texture Tools 2 texture compression package

Do you like the realistic objects in modern games? Then thank the texture developers. But the more reality in a texture, the larger its volume - and the more precious memory it occupies. To avoid this, textures are pre-compressed and dynamically decompressed as needed. And compression and decompression are pure computation. For working with textures, nVidia released the NVIDIA Texture Tools package, which supports efficient compression and decompression of DirectX textures (the DXT formats). The second version of the package boasts support for the BC4 and BC5 compression algorithms implemented in DirectX 11. But the main thing is that NVIDIA Texture Tools 2 includes CUDA support. According to nVidia, this gives a 12-fold performance increase in texture compression and decompression tasks. Which means gameplay frames will load faster and delight the player with their realism.

The NVIDIA Texture Tools 2 package is designed to work with CUDA. The performance gain when compressing and decompressing textures is obvious.

Using CUDA can significantly improve the efficiency of video surveillance.

Real-time video stream processing

Whatever one may say, from the point of view of surveillance the modern world is much closer to Orwell's Big Brother world than it seems. Car drivers and visitors to public places alike feel the gaze of video cameras.

Full-flowing rivers of video information stream into processing centers and... run into a narrow link: a person. In most cases he is the final authority monitoring the video world - and not the most effective one. He blinks, gets distracted and tries to fall asleep.

Thanks to CUDA, it became possible to implement algorithms for simultaneous tracking of multiple objects in a video stream. The process runs in real time at a full 30 fps. Compared to implementing such an algorithm on modern multi-core CPUs, CUDA gives a two- to three-fold performance increase - and that, you must agree, is quite a lot.

Video conversion, audio filtering

The Badaboom video converter was the first to use CUDA to speed up conversion.

It's nice to watch a new release in FullHD quality on a big screen. But you can't take a big screen on the road, and a FullHD video codec will make the low-power processor of a mobile gadget hiccup. Conversion comes to the rescue - but most of those who have dealt with it in practice complain about the long conversion time. That is understandable: the process is routine and well suited to parallelization, and its execution on a CPU is not very optimal.

But CUDA copes with it with a bang. The first sign was the Badaboom converter from Elemental. The Badaboom developers made the right decision in choosing CUDA: tests show that it converts a standard hour-and-a-half movie to iPhone/iPod Touch format in less than twenty minutes, whereas on the CPU alone the same process takes over an hour.

CUDA helps professional music lovers too. Any of them would give half a kingdom for an effective FIR crossover - a set of filters that split the sound spectrum into several bands. The process is very labor-intensive and, with a large volume of audio material, forces a sound engineer to go off and "smoke" for several hours. A CUDA-based FIR crossover implementation speeds it up by hundreds of times.
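To see why a FIR crossover parallelizes so well: each output sample is an independent dot product of the signal with the filter coefficients, so one CUDA thread can compute one sample. A simplified sketch of mine (names and sizes are illustrative; a real crossover would run one such filter per frequency band):

// y[i] = sum over k of h[k] * x[i - k]  -- one thread per output sample.
__global__ void fir(const float* x, const float* h, float* y,
                    int n, int taps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < taps; ++k) {
        int j = i - k;
        if (j >= 0)               // treat samples before the start as zero
            acc += h[k] * x[j];
    }
    y[i] = acc;
}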

CUDA Future

Having made GPGPU a reality, CUDA is not resting on its laurels. As happens everywhere, a feedback principle is at work: not only does the architecture of nVidia video processors influence the development of CUDA SDK versions, but CUDA itself forces nVidia to reconsider the architecture of its chips. An example of this feedback is the nVidia ION platform, whose second version is specially optimized for CUDA tasks. This means that even in relatively inexpensive hardware solutions consumers will receive all the power and brilliant capabilities of CUDA.

The nvcc compiler is designed to translate host code (the main, controlling code) and device code (the hardware code, in files with the .cu extension) into object files suitable for assembling the final program or library in any development environment, for example NetBeans.

The CUDA architecture uses a grid memory model, cluster thread modeling, and SIMD instructions. It is applicable not only to high-performance graphics computing but also to a variety of scientific computations on nVidia video cards. Scientists and researchers use CUDA widely in fields including astrophysics, computational biology and chemistry, fluid dynamics modeling, electromagnetic interactions, computed tomography, seismic analysis and more. CUDA can connect to applications that use OpenGL and Direct3D. CUDA is cross-platform software for operating systems such as Linux, Mac OS X and Windows.

On March 22, 2010, nVidia released CUDA Toolkit 3.0, which contained support for OpenCL.

Equipment

The CUDA platform first appeared on the market with the release of the eighth-generation NVIDIA G80 chip and is present in all subsequent series of graphics chips used in the GeForce, Quadro and NVidia Tesla accelerator families.

The first series of hardware to support the CUDA SDK, the G8x, had a 32-bit single-precision vector processor using the CUDA SDK as the API (CUDA supports the C double type, but on these chips its precision is demoted to 32-bit floating point). The later GT200 processors support 64-bit precision (SFU only), but performance is significantly worse than for 32-bit precision (because there are only two SFUs per streaming multiprocessor, while there are eight scalar processors). The GPU organizes hardware multithreading, which allows all of the GPU's resources to be used. This opens up the prospect of shifting the functions of a physics accelerator onto the graphics accelerator (nVidia PhysX is an example of such an implementation). It also opens wide possibilities for using graphics hardware for complex non-graphical calculations: for example, in computational biology and other branches of science.

Advantages

Compared to the traditional approach of organizing general-purpose computing through graphics APIs, the CUDA architecture offers, in particular:

  • scattered memory access: code can read from arbitrary addresses in memory;
  • shared memory, 16 KB per multiprocessor, which the threads of a block can use as a fast, program-managed cache;
  • more efficient data transfers between CPU memory and video memory;
  • full hardware support for integer and bitwise operations.

Restrictions

  • Functions executed on the device do not support recursion (CUDA Toolkit 3.1 added support for pointers and recursion) and have some other limitations
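In practice the restriction looks like this: the recursive variant below compiles only for compute capability 2.0 devices with CUDA Toolkit 3.1 or later, while for older GPUs it has to be rewritten as a loop (an illustrative sketch of mine):

// Recursive device function: requires sm_20 and CUDA Toolkit 3.1+.
__device__ int factorial(int k)
{
    return (k <= 1) ? 1 : k * factorial(k - 1);
}

// Loop version that also works on compute capability 1.x devices.
__device__ int factorial_iter(int k)
{
    int r = 1;
    for (; k > 1; --k) r *= k;
    return r;
}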

Supported GPUs and graphics accelerators

The list of devices from equipment manufacturer Nvidia with declared full support for CUDA technology is provided on the official Nvidia website: CUDA-Enabled GPU Products (English).

In fact, the following devices on the PC hardware market currently support CUDA technology:

Specification version | GPU | Video cards
1.0 | G80, G92, G92b, G94, G94b | GeForce 8800GTX/Ultra, 9400GT, 9600GT, 9800GT, Tesla C/D/S870, FX4/5600, 360M, GT 420
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600 GSO, 9800GTX/GX2, GTS 250, GT 120/30/40, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GeForce GTX 260, GTX 275, GTX 280, GTX 285, GTX 295, Tesla C/M1060, S1070, Quadro CX, FX 3/4/5800
2.0 | GF100, GF110 | GeForce (GF100) GTX 465, GTX 470, GTX 480, Tesla C2050, C2070, S/M2050/70, Quadro Plex 7000, Quadro 4000, 5000, 6000; GeForce (GF110) GTX 560 Ti 448, GTX 570, GTX 580, GTX 590
2.1 | GF104, GF114, GF116, GF108, GF106 | GeForce 610M, GT 430, GT 440, GTS 450, GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti, 500M, Quadro 600, 2000
3.0 | GK104, GK106, GK107 | GeForce GTX 690, GTX 680, GTX 670, GTX 660 Ti, GTX 660, GTX 650 Ti, GTX 650, GT 640, GeForce GTX 680MX, GeForce GTX 680M, GeForce GTX 675MX, GeForce GTX 670MX, GTX 660M, GeForce GT 650M, GeForce GT 645M, GeForce GT 640M
3.5 | GK110 |
Nvidia GeForce for desktop computers
GeForce GTX 590
GeForce GTX 580
GeForce GTX 570
GeForce GTX 560 Ti
GeForce GTX 560
GeForce GTX 550 Ti
GeForce GTX 520
GeForce GTX 480
GeForce GTX 470
GeForce GTX 465
GeForce GTX 460
GeForce GTS 450
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GT 240
GeForce GT 220
GeForce 210
GeForce GTS 150
GeForce GT 130
GeForce GT 120
GeForce G100
GeForce 9800 GX2
GeForce 9800 GTX+
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 8800 GTS 512
GeForce 8800 GT
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8500 GT
GeForce 8400GS
Nvidia GeForce for mobile computers
GeForce GTX 580M
GeForce GTX 570M
GeForce GTX 560M
GeForce GT 555M
GeForce GT 540M
GeForce GT 525M
GeForce GT 520M
GeForce GTX 485M
GeForce GTX 480M
GeForce GTX 470M
GeForce GTX 460M
GeForce GT 445M
GeForce GT 435M
GeForce GT 425M
GeForce GT 420M
GeForce GT 415M
GeForce GTX 285M
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 360M
GeForce GTS 350M
GeForce GTS 160M
GeForce GTS 150M
GeForce GT 335M
GeForce GT 330M
GeForce GT 325M
GeForce GT 240M
GeForce GT 130M
GeForce G210M
GeForce G110M
GeForce G105M
GeForce 310M
GeForce 305M
GeForce 9800M GTX
GeForce 9800M GT
GeForce 9800M GTS
GeForce 9700M GTS
GeForce 9700M GT
GeForce 9650M GS
GeForce 9600M GT
GeForce 9600M GS
GeForce 9500M GS
GeForce 9500M G
GeForce 9300M GS
GeForce 9300M G
GeForce 9200M GS
GeForce 9100M G
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600M GS
GeForce 8400M GT
GeForce 8400M GS
Nvidia Tesla *
Tesla C2050/C2070
Tesla M2050/M2070/M2090
Tesla S2050
Tesla S1070
Tesla M1060
Tesla C1060
Tesla C870
Tesla D870
Tesla S870
Nvidia Quadro for desktop computers
Quadro 6000
Quadro 5000
Quadro 4000
Quadro 2000
Quadro 600
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3700
Quadro FX 1700
Quadro FX 570
Quadro FX 470
Quadro FX 380 Low Profile
Quadro FX 370
Quadro FX 370 Low Profile
Quadro CX
Quadro NVS 450
Quadro NVS 420
Quadro NVS 290
Quadro Plex 2100 D4
Quadro Plex 2200 D2
Quadro Plex 2100 S4
Quadro Plex 1000 Model IV
Nvidia Quadro for mobile computing
Quadro 5010M
Quadro 5000M
Quadro 4000M
Quadro 3000M
Quadro 2000M
Quadro 1000M
Quadro FX 3800M
Quadro FX 3700M
Quadro FX 3600M
Quadro FX 2800M
Quadro FX 2700M
Quadro FX 1800M
Quadro FX 1700M
Quadro FX 1600M
Quadro FX 880M
Quadro FX 770M
Quadro FX 570M
Quadro FX 380M
Quadro FX 370M
Quadro FX 360M
Quadro NVS 5100M
Quadro NVS 4200M
Quadro NVS 3100M
Quadro NVS 2100M
Quadro NVS 320M
Quadro NVS 160M
Quadro NVS 150M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M
  • Models Tesla C1060, Tesla S1070, Tesla C2050/C2070, Tesla M2050/M2070, Tesla S2050 allow GPU calculations with double precision.

Features and Specifications of Various Versions

Feature support (features not listed here are supported for all compute capabilities), shown as the minimum compute capability (version) required:

Integer atomic functions operating on 32-bit words in global memory: from 1.1
atomicExch() operating on 32-bit floating point values in global memory: from 1.1
Integer atomic functions operating on 32-bit words in shared memory: from 1.2
atomicExch() operating on 32-bit floating point values in shared memory: from 1.2
Integer atomic functions operating on 64-bit words in global memory: from 1.2
Warp vote functions: from 1.2
Double-precision floating-point operations: from 1.3
Atomic functions operating on 64-bit integer values in shared memory: from 2.x
Floating-point atomic addition operating on 32-bit words in global and shared memory: from 2.x
_ballot(): from 2.x
_threadfence_system(): from 2.x
_syncthreads_count(), _syncthreads_and(), _syncthreads_or(): from 2.x
Surface functions: from 2.x
3D grid of thread blocks: from 2.x
Technical specifications, by compute capability (versions 1.0 / 1.1 / 1.2 / 1.3 / 2.x):

Maximum dimensionality of a grid of thread blocks: 2 (1.x), 3 (2.x)
Maximum x-, y-, or z-dimension of a grid of thread blocks: 65535
Maximum dimensionality of a thread block: 3
Maximum x- or y-dimension of a block: 512 (1.x), 1024 (2.x)
Maximum z-dimension of a block: 64
Maximum number of threads per block: 512 (1.x), 1024 (2.x)
Warp size: 32
Maximum number of resident blocks per multiprocessor: 8
Maximum number of resident warps per multiprocessor: 24 (1.0-1.1), 32 (1.2-1.3), 48 (2.x)
Maximum number of resident threads per multiprocessor: 768 (1.0-1.1), 1024 (1.2-1.3), 1536 (2.x)
Number of 32-bit registers per multiprocessor: 8 K (1.0-1.1), 16 K (1.2-1.3), 32 K (2.x)
Maximum amount of shared memory per multiprocessor: 16 KB (1.x), 48 KB (2.x)
Number of shared memory banks: 16 (1.x), 32 (2.x)
Amount of local memory per thread: 16 KB (1.x), 512 KB (2.x)
Constant memory size: 64 KB
Cache working set per multiprocessor for constant memory: 8 KB
Cache working set per multiprocessor for texture memory: device dependent, between 6 KB and 8 KB
Maximum width for a 1D texture reference bound to a CUDA array: 8192 (1.x), 32768 (2.x)
Maximum width for a 1D texture reference bound to linear memory: 2^27
Maximum width and number of layers for a 1D layered texture reference: 8192 x 512 (1.x), 16384 x 2048 (2.x)
Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array: 65536 x 32768 (1.x), 65536 x 65535 (2.x)
Maximum width, height, and number of layers for a 2D layered texture reference: 8192 x 8192 x 512 (1.x), 16384 x 16384 x 2048 (2.x)
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel: 128
Maximum width for a 1D surface reference bound to a CUDA array: not supported (1.x), 8192 (2.x)
Maximum width and height for a 2D surface reference bound to a CUDA array: not supported (1.x), 8192 x 8192 (2.x)
Maximum number of surfaces that can be bound to a kernel: not supported (1.x), 8 (2.x)
Maximum number of instructions per kernel: 2 million
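As an illustration of one line in the feature table: integer atomic functions on global memory (available from compute capability 1.1) let many threads safely increment shared counters, for example when building a histogram (a sketch of mine; names are illustrative):

// Each thread classifies one value and atomically increments a bin
// counter in global memory. atomicAdd on 32-bit integers requires
// compute capability 1.1 or higher.
__global__ void histogram(const float* data, int n,
                          unsigned int* bins, int nbins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int b = (int)(data[i] * nbins);      // data assumed to lie in [0, 1)
    if (b < 0) b = 0;
    if (b >= nbins) b = nbins - 1;
    atomicAdd(&bins[b], 1u);             // safe concurrent increment
}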

Example

texture<float, 2> tex;   // texture reference (must be at file scope)

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);          // read through the texture cache
    odata[y * width + x] = c;
}

// Host code:
cudaArray* cu_array;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

// Allocate the array
cudaMallocArray(&cu_array, &desc, width, height);

// Copy image data to the array
cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

// Run the kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<<gridDim, blockDim>>>(d_odata, height, width);
cudaUnbindTexture(tex);

The same idea via the PyCUDA bindings:

import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

drv.init()
dev = drv.Device(0)
ctx = dev.make_context()

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

print dest - a * b   # should print an array of zeros

CUDA as a subject in universities

As of December 2009, the CUDA software model was taught at 269 universities around the world. In Russia, training courses on CUDA are given at St. Petersburg Polytechnic University; P. G. Demidov Yaroslavl State University; Moscow, Nizhny Novgorod, St. Petersburg, Tver, Kazan, Novosibirsk, Omsk and Perm state universities; Novosibirsk State Technical University; the International University of Nature, Society and Man "Dubna"; Ivanovo State Power Engineering University; Belgorod State University; Bauman MSTU; the Mendeleev Russian Chemical Technological University; and the Interregional Supercomputer Center of the RAS. In addition, in December 2009 it was announced that the first Russian scientific and educational center "Parallel Computing", located in Dubna, had begun operating; its tasks include training and consulting on solving complex computing problems on GPUs.

In Ukraine, courses on CUDA are taught at the Kiev Institute of System Analysis.


In the development of modern processors there is a tendency toward a gradual increase in the number of cores, which raises their parallel computing capabilities. However, GPUs have long been available that significantly outdo CPUs in this respect, and some companies have already taken note. The first attempts to use graphics accelerators for non-graphics computing were made in the late 1990s, but only the appearance of shaders gave impetus to a completely new technology, and in 2003 the concept of GPGPU (General-Purpose Graphics Processing Units) appeared. An important role in the development of this initiative was played by BrookGPU, a special extension of the C language. Before BrookGPU appeared, programmers could work with GPUs only through the Direct3D or OpenGL APIs. Brook let developers work in a familiar environment, while the compiler itself, using special libraries, implemented low-level interaction with the GPU.

Such progress could not fail to attract the attention of the leaders of this industry - AMD and NVIDIA - who began developing their own software platforms for non-graphics computing on their video cards. No one knows the nuances and features of their products better than the GPU developers themselves, which lets these same companies optimize the software package for specific hardware solutions as efficiently as possible. NVIDIA is currently developing the CUDA (Compute Unified Device Architecture) platform; AMD calls its similar technology CTM (Close To Metal), or AMD Stream Computing. We will look at some of CUDA's capabilities and evaluate in practice the computing capabilities of the G92 graphics chip on a GeForce 8800 GT video card.

But first, a few nuances of performing calculations on GPUs. Their main advantage is that the graphics chip is designed from the outset to execute multiple threads, whereas each core of a conventional CPU executes a stream of sequential instructions. Any modern GPU is a multiprocessor consisting of several computing clusters with many ALUs in each. The most powerful modern chip, the GT200, consists of 10 such clusters with 24 stream processors in each. The tested GeForce 8800 GT video card, based on the G92 chip, has seven large computing units with 16 stream processors each. CPUs use SIMD SSE blocks for vector calculations (single instruction, multiple data: one instruction is executed on multiple data), which requires packing the data into four-component vectors. The GPU processes threads in scalar fashion: one instruction is applied across several threads (SIMT - single instruction, multiple threads). This saves developers from packing data into vectors and allows arbitrary branching in threads. Each GPU compute unit has direct memory access, and the video memory bandwidth is higher thanks to several separate memory controllers (the top-end GT200 has 8 channels of 64 bits each) and high operating frequencies.
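A practical consequence of SIMT: every thread executes ordinary scalar code and may branch on its own data, something that would require explicit masking with SSE. An illustrative sketch of mine:

// Scalar per-thread code with data-dependent branching.
// Under SIMT every thread takes its own path; the hardware serializes
// divergent branches within a warp automatically.
__global__ void clamp_negatives(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] < 0.0f)      // branch decided per thread
            data[i] = 0.0f;
    }
}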

In general, in certain tasks when working with large amounts of data, GPUs are much faster than CPUs. Below you see an illustration of this statement:


The chart shows the growth dynamics of CPU and GPU performance since 2003. NVIDIA likes to cite these figures in its marketing documents, but they are only theoretical calculations; in reality the gap may, of course, turn out to be much smaller.

Be that as it may, there is huge GPU potential that can be used, and using it requires a specific approach to software development. All of this is provided by the CUDA hardware-software environment, which consists of several software levels: the high-level CUDA Runtime API and the low-level CUDA Driver API.
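The difference between the two levels, in a nutshell: the Runtime API creates the GPU context implicitly, while the Driver API makes every step explicit. A sketch (the module and kernel names are illustrative; my_kernel is assumed to be a __global__ function defined elsewhere):

#include <cuda_runtime.h>   // high-level Runtime API
#include <cuda.h>           // low-level Driver API

// Runtime API: the context is created implicitly on first use.
float* d;
cudaMalloc((void**)&d, 1024 * sizeof(float));
my_kernel<<<4, 256>>>(d);          // <<<>>> launch syntax
cudaFree(d);

// Driver API: device, context and module are managed by hand.
CUdevice   dev;
CUcontext  ctx;
CUmodule   mod;
CUfunction fn;
cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);
cuModuleLoad(&mod, "my_kernel.cubin");     // precompiled GPU code
cuModuleGetFunction(&fn, mod, "my_kernel");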


CUDA uses the standard C language for programming, which is one of its main advantages for developers. Out of the box, CUDA includes the BLAS (basic linear algebra) and FFT (Fourier transform) libraries. CUDA can also interoperate with the OpenGL and DirectX graphics APIs, allows low-level development, and features an optimized distribution of data streams between the CPU and GPU. CUDA calculations run simultaneously with graphics, unlike the similar AMD platform, which launches a special virtual machine for GPU computation. But such "cohabitation" is also fraught with errors if the graphics API creates a heavy load while CUDA is running: graphics operations still have higher priority. The platform is compatible with 32- and 64-bit versions of Windows XP, Windows Vista, MacOS X and various Linux distributions. The platform is open, and on the site, in addition to special video card drivers, you can download the CUDA Toolkit and CUDA Developer SDK packages, which include a compiler, a debugger, standard libraries and documentation.
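For example, with the bundled FFT library (CUFFT) a Fourier transform on the GPU comes down to a few calls. A minimal sketch for an in-place complex transform (filling the input buffer is omitted):

#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int NX = 1024;                 // one-dimensional transform size
    cufftComplex* data;
    cudaMalloc((void**)&data, NX * sizeof(cufftComplex));
    // ... fill `data` with input samples (e.g. cudaMemcpy from the host) ...

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);           // plan: 1 batch of NX points
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // execute in place
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}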

As for the practical implementation of CUDA, for a long time the technology was used only for narrow mathematical calculations in particle physics, astrophysics, medicine, financial market forecasting and the like. But it is gradually moving closer to ordinary users; in particular, Photoshop plug-ins are appearing that can use GPU computing power. A special page lists all the programs that use the capabilities of NVIDIA CUDA.

As a practical test of the new technology on the MSI NX8800GT-T2D256E-OC video card, we will use the TMPGEnc program. The product is commercial (the full version costs $100), but MSI video cards come with a bonus 30-day trial version. You can download this version from the developer's website, but to install TMPGEnc 4.0 XPress MSI Special Edition you need the original driver disc from the MSI card: without it the program will not install.

To display the most complete information about CUDA computing capabilities and compare them with other video adapters, you can use the special CUDA-Z utility. This is what it reports about our GeForce 8800GT video card:




Compared to the reference models, our sample runs at higher frequencies: the raster domain is 63 MHz above nominal, the shader units are 174 MHz faster, and the memory is 100 MHz faster.

We will compare the conversion speed of the same HD video when calculating only using the CPU and with additional activation of CUDA in the TMPGEnc program on the following configuration:

  • Processor: Pentium Dual-Core E5200 2.5 GHz;
  • Motherboard: Gigabyte P35-S3;
  • Memory: 2x1GB GoodRam PC6400 (5-5-5-18-2T)
  • Video card: MSI NX8800GT-T2D256E-OC;
  • Hard drive: 320GB WD3200AAKS;
  • Power supply: CoolerMaster eXtreme Power 500-PCAP;
  • Operating system: Windows XP SP2;
  • TMPGEnc 4.0 XPress 4.6.3.268;
  • Video card drivers: ForceWare 180.60.
For the tests, the processor was overclocked to 3 GHz (11.5x261 MHz) and to 4 GHz (11.5x348 MHz), with the RAM at 835 MHz in both cases. The video was Full HD 1920x1080, one minute and twenty seconds long. To create additional load, a noise-reduction filter was enabled with its settings left at default.


Encoding was performed with the DivX 6.8.4 codec. All codec quality settings were left at default; multithreading was enabled.


Multithreading support in TMPGEnc is initially enabled in the CPU/GPU settings tab. CUDA is also activated in the same section.


As the screenshot above shows, filter processing with CUDA is enabled, but the hardware video decoder is not. The program documentation warns that enabling the latter increases file processing time.

Based on the results of the tests, the following data was obtained:


At 4 GHz with CUDA enabled we gained only a couple of seconds (about 2%), which is not particularly impressive. But at the lower frequency, activating this technology saves about 13% of the time, which will be quite noticeable when processing large files. Still, the results are not as impressive as expected.

The TMPGEnc program has a CPU and CUDA load indicator; in this test configuration it showed CPU load at about 20% and the graphics core at the remaining 80%. As a result, we get the same 100% as when converting without CUDA, even though there may be almost no time difference (but it does exist). The small 256 MB memory capacity is not a limiting factor either: judging by RivaTuner's readings, no more than 154 MB of video memory was used during operation.



Conclusions

The TMPGEnc program is one of those bringing CUDA technology to the masses. Using the GPU in this program speeds up video processing and significantly relieves the central processor, letting the user comfortably do other tasks at the same time. In our specific example, the GeForce 8800GT 256MB video card slightly improved the timings when converting video on an overclocked Pentium Dual-Core E5200 processor. But it is clearly visible that as the frequency decreases, the gain from activating CUDA grows; on weak processors the gain from its use will be much greater. Given this dependence, it is quite logical to assume that as load increases (for example, with very many additional video filters), the results of a CUDA system will stand out with an even larger delta in encoding time. Also remember that the G92 is not the most powerful chip at the moment; more modern video cards will provide significantly higher performance in such applications. However, while the application runs the GPU is not fully loaded, and the load distribution probably depends on each specific configuration, namely on the processor/video card combination, which in the end can give a larger (or smaller) percentage gain from CUDA activation. In any case, for those who work with large volumes of video data, this technology will still save considerable time.

True, CUDA has not yet gained widespread popularity, and the quality of software working with this technology needs improvement. In the TMPGEnc 4.0 XPress program we reviewed, the technology did not always work: the same video could be re-encoded several times, and then suddenly, on the next run, the CUDA load would be 0%. And this phenomenon was completely random, on completely different operating systems. The program also refused to use CUDA when encoding to the XviD format, though there were no problems with the popular DivX codec.

As a result, so far CUDA can significantly raise the performance of personal computers only in certain tasks. But the scope of such technology will expand, and the trend of increasing core counts in conventional processors points to growing demand for parallel multi-threaded computing in modern software. It is no accident that all the industry leaders have recently latched onto the idea of combining CPU and GPU within one unified architecture (remember the much-advertised AMD Fusion). Perhaps CUDA is one stage in this unification process.



DirectX – a set of low-level application programming interfaces (APIs) for creating games and other high-performance multimedia applications. Includes support for high-performance 2D and 3D graphics, sound and input devices.

Direct3D (D3D) – an interface for displaying three-dimensional primitives (geometric bodies). Part of DirectX.

OpenGL (Open Graphics Library) – a specification defining a programming-language-independent, cross-platform programming interface for writing applications that use two-dimensional and three-dimensional computer graphics. Includes over 250 functions for drawing complex 3D scenes from simple primitives. Used in video games, virtual reality and scientific visualization. On the Windows platform it competes with Direct3D.

OpenCL (Open Computing Language) – a framework for writing computer programs involving parallel computation on various graphics processors (GPU) and central processors (CPU). The OpenCL framework includes a programming language and an application programming interface (API). OpenCL provides parallelism at the instruction level and at the data level and is an implementation of the GPGPU technique.

GPGPU (General-Purpose Graphics Processing Units) – a technique of using a graphics processing unit (GPU), i.e. a video card, for the general-purpose computations usually performed by the central processor.

Shader – a program for shading synthesized images, used in three-dimensional graphics to determine the final parameters of an object or image. It typically includes arbitrarily complex descriptions of light absorption and scattering, texture mapping, reflection and refraction, shading, surface displacement and post-processing effects. Complex surfaces can be visualized using simple geometric shapes.

Rendering – visualization; in computer graphics, the process of producing an image from a model using software.

SDK (Software Development Kit) – a set of software development tools.

CPU (Central Processing Unit) – the central (micro)processor; the device that executes machine instructions; the piece of hardware responsible for performing computational operations (specified by the operating system and application software) and for coordinating the operation of all devices.

GPU (Graphics Processing Unit) – the graphics processor; a discrete device in a personal computer or game console that performs graphics rendering (visualization). Modern GPUs are very efficient at processing and displaying computer graphics realistically. The graphics processor in modern video adapters serves as a 3D graphics accelerator, but in some cases it can also be used for computation (GPGPU).

CPU problems

For a long time, the performance growth of traditional processors came mainly from a steady increase in clock frequency (about 80% of performance was determined by clock frequency) together with an increase in the number of transistors on a chip. But further raising the clock frequency (above roughly 3.8 GHz the chips simply overheat!) runs into a number of fundamental physical barriers, since the process technology has come close to atomic dimensions (a silicon atom is approximately 0.543 nm across):

First, as the die shrinks and the clock frequency rises, transistor leakage current increases. This leads to higher power consumption and greater heat emission;

Second, the benefits of higher clock speeds are partially negated by memory access latency, as memory access times do not keep up with increasing clock speeds;

Third, for some applications traditional serial architectures become inefficient as clock speeds rise, due to the so-called "von Neumann bottleneck", the performance limitation that results from a sequential flow of computation. Meanwhile, resistive-capacitive signal-transmission delays grow, forming an additional bottleneck associated with raising the clock frequency.

CPU and GPU development

In parallel with this, the development of CPUs and GPUs went on (and goes on!):

November 2008 – Intel introduced the 4-core Intel Core i7 line, based on the new-generation Nehalem microarchitecture. The processors run at clock frequencies of 2.6-3.2 GHz and are made on a 45-nm process.

December 2008 – deliveries began of the 4-core AMD Phenom II 940 (codename Deneb). It runs at 3 GHz and is produced on a 45-nm process.

May 2009 – AMD introduced the ATI Radeon HD 4890, a GPU with the core clock raised from 850 MHz to 1 GHz: the first graphics processor running at 1 GHz. Thanks to the frequency increase, the chip's computing power grew from 1.36 to 1.6 teraflops. The processor contains 800 (!) computing cores and supports GDDR5 video memory, DirectX 10.1, ATI CrossFireX and all the other technologies inherent in modern video card models. The chip is manufactured on 55-nm technology.

Main differences between the GPU and the CPU

The distinctive features of the GPU (compared with the CPU) are:

– an architecture aimed at maximizing the speed of calculating textures and complex graphic objects;

– the peak power of a typical GPU is much higher than that of a CPU;

– thanks to a specialized pipeline architecture, the GPU is much more efficient at processing graphic information than the CPU.

"Crisis of the genre"

"Genre crisis" for matured by 2005 - that’s when they appeared. But, despite the development of technology, the increase in productivity of conventional decreased noticeably. At the same time performance GPU continues to grow. So, by 2003, this revolutionary idea crystallized - use the computing power of graphics for your needs. GPUs have become increasingly used for “non-graphical” computing (physics simulation, signal processing, computational mathematics/geometry, database operations, computational biology, computational economics, computer vision, etc.).

The main problem was the lack of a standard GPU programming interface. Developers used OpenGL or Direct3D, but this was very inconvenient. NVIDIA Corporation (one of the largest manufacturers of graphics, media and communications processors, as well as wireless media processors; founded in 1993) took up the development of a unified and convenient standard, and introduced CUDA technology.

How it started

2006 – NVIDIA demonstrates CUDA™; the beginning of a revolution in GPU computing.

2007 – NVIDIA releases the CUDA architecture (the original version of the CUDA SDK appeared on February 15, 2007); the technology is named "Best New Product" by Popular Science magazine and "Readers' Choice" by HPCWire.

2008 – NVIDIA CUDA technology wins the "Technical Excellence" category from PC Magazine.

What is CUDA

CUDA (Compute Unified Device Architecture) – an architecture (a set of software and hardware) that allows general-purpose computations to be performed on the GPU, with the GPU effectively acting as a powerful coprocessor.

NVIDIA CUDA™ technology is the only C-language development environment that lets developers create software solving complex computational problems in less time, thanks to the processing power of GPUs. Millions of CUDA-capable GPUs are already at work around the world, and thousands of programmers are already using the (free!) CUDA tools to accelerate applications and solve the most complex, resource-intensive tasks: from video and audio encoding to oil and gas exploration, product modeling, medical imaging and scientific research.

CUDA gives the developer the ability, at his own discretion, to organize access to the graphics accelerator's instruction set, manage its memory, and organize complex parallel calculations on it. A CUDA-capable graphics accelerator becomes a powerful programmable open architecture, similar to today's CPUs. All this provides the developer with low-level, distributed and high-speed access to the hardware, making CUDA a necessary foundation for serious high-level tools such as compilers, debuggers, mathematical libraries and software platforms.

Uralsky, a leading NVIDIA technology specialist, compares the GPU and the CPU this way: "A CPU is an SUV. It drives always and everywhere, but not very fast. A GPU is a sports car. On a bad road it simply won't go anywhere, but give it a good surface and it will show a speed an SUV has never even dreamed of!.."

CUDA technology capabilities