AI Algorithm Acceleration Using GPUs (Artificial Intelligence)

INTRODUCTION

Graphics Processing Units (GPUs) have been evolving very quickly, turning into high-performance programmable processors. Although GPUs were designed to compute graphics algorithms, their power and flexibility make them a very attractive platform for general-purpose computing. In recent years they have been used to accelerate calculations in physics, computer vision, artificial intelligence, database operations, etc. (Owens, 2007).

This paper presents an approach to general-purpose computing with GPUs, followed by a description of artificial intelligence algorithms, based on Artificial Neural Networks (ANNs) and Evolutionary Computation (EC), that have been accelerated using the GPU.

BACKGROUND

General-Purpose Computation using Graphics Processing Units (GPGPU) consists of using the GPU as an alternative platform for parallel computing, taking advantage of the high performance provided by the graphics processor (General-Purpose Computation Using Graphics Hardware Website; Owens, 2007).

There are several reasons that justify using the GPU for general-purpose computing (Luebke, 2006):

• Latest-generation GPUs are very fast compared with current processors. For instance, an NVIDIA 8800 GTX card has a computing capability of approximately 330 GFLOPS, whereas an Intel Core 2 Duo 3.0 GHz processor has a capability of only about 48 GFLOPS.

• GPUs are highly programmable. In recent years the programming capabilities of graphics chips have grown considerably, replacing fixed-function engines with programmable ones, such as the pixel and vertex engines. Moreover, this has led to the appearance of high-level languages that ease their programming.


• GPUs evolve faster than CPUs. GPU performance currently increases by about 1.7x to 2.3x per year, whereas CPU performance increases by about 1.4x. The pressure exerted by the video game market is one of the main reasons for this evolution, forcing companies to improve graphics hardware continuously.

• GPUs use high-precision data types. Although early graphics hardware was designed to work with low-precision data types, internal calculations are now computed using 32-bit floating-point numbers.

• Graphics cards are inexpensive relative to the capabilities they provide. Nowadays, GPUs are affordable for any user.

• GPUs are highly parallel and can have multiple processors, enabling high-performance parallel arithmetic calculations.

Nevertheless, there are some obstacles. First, not all algorithms fit the GPU programming model, because GPUs are designed for computationally intensive parallel algorithms (Harris, 2005). Second, GPUs can be difficult to use, mainly because:

• The GPU programming model is different from the CPU one.

• GPUs are designed for graphics algorithms and, therefore, for graphics programming. Implementing general-purpose algorithms on the GPU is quite different from traditional implementations.

• Programming capabilities have limitations and restrictions. Most functions in GPU programming languages are very specific and dedicated to calculations in graphics algorithms.

• GPU architectures vary considerably due to their fast evolution and the incorporation of new features.

Therefore, it is not easy to port an algorithm developed for the CPU to run on a GPU.

Overview of the Graphics Pipeline

Nowadays GPUs perform their computations following a common structure called the graphics pipeline. The graphics pipeline (Akenine-Moller, 2002) is composed of a set of stages that are executed sequentially inside the GPU, allowing graphics algorithms to be computed. Recent hardware is made up of four main elements. First, the vertex processors receive vertex arrays from the CPU and perform the necessary transformations from their positions in space to their final positions on the screen. Second, the primitive assembly builds graphics primitives (for instance, triangles) using connectivity information between the different vertices. Third, in the rasterizer, those graphics primitives are discretized and turned into fragments. A fragment represents a potential pixel and contains the information (color, depth, etc.) necessary to generate the final color of a pixel. Finally, in the fragment processors, fragments become pixels whose final color is written to a target buffer, which can be the screen buffer or a texture.

At present, GPUs have multiple vertex and fragment processors that perform operations in parallel. Both are programmable using small pieces of code called vertex and fragment programs, respectively. In recent years, high-level programming languages such as Cg/HLSL (Mark, 2003; HLSL Shaders) and GLSL (OpenGL Shading Language Information Site) have been released, making these processors easier to program.

The GPU Programming Model

There is a big difference between programming CPUs and GPUs, mainly because of their different programming models. GPUs are based on the stream programming model (Owens, 2005a; Luebke, 2006; Owens, 2007), where all data are represented by streams, a stream being a sorted set of data of the same type. A kernel operates on entire streams, taking input data from one or more streams and producing one or more output streams. The main characteristic of a kernel is that it operates on whole streams rather than on individual elements. The typical use of a kernel is the evaluation of a function over each element of an input stream, which is called a map operation. Other kernel operations are expansions, reductions, filters, etc. (Buck, 2004; Horn, 2005; Owens, 2007). The outputs generated by a kernel are always based on its input streams, which means that, inside the kernel, the calculations made on one element never depend on the other elements. In the stream programming model, applications are built by connecting multiple kernels. An application can be represented as a dependency graph where each node is a kernel and each edge represents a data stream between kernels (Owens, 2005b; Lefohn, 2005).
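
As a minimal illustration of this model, the following sketch shows what a map kernel could look like when written as a GLSL fragment program; the name inputStream is purely illustrative and not taken from any cited work. Each fragment reads one element of the input stream, stored in a texture, and writes its square to the output stream, which is collected in a texture through the render-to-texture feature described below.

    // Hypothetical map kernel in GLSL: one input stream element per texel.
    uniform sampler2D inputStream;   // input data stream (read-only texture)

    void main()
    {
        // gl_TexCoord[0] identifies the stream element processed by this fragment.
        float x = texture2D(inputStream, gl_TexCoord[0].st).r;
        gl_FragColor = vec4(x * x);   // map operation: y = x^2
    }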

The behavior of the graphics pipeline is similar to the stream programming model. Data flow through each stage, where the output feeds the next one. Stream elements (vertex or fragment arrays) are processed independently by kernels (vertex or fragment programs), and their output can in turn be received by other kernels.

The stream programming model allows efficient computation, because kernels operate on independent elements of a set of input streams and can be executed on hardware like the GPU, which processes vertex and fragment streams in parallel. This makes parallel computing possible without the complexity of traditional parallel programming models.

Computational Resources on GPU

Different computational resources are available for implementing algorithms on the GPU (Harris, 2005; Owens, 2007). On one side, current GPUs have two different parallel programmable processors: vertex and fragment processors. Vertex processors process vertex streams (points with associated properties such as position, color, normal, etc.). A vertex processor applies a vertex program to transform each input vertex to its position on the screen. Fragment processors process fragment streams; they apply a fragment program to each fragment to calculate the final color of the pixel. In addition to using the attributes of each fragment, these processors can access other data streams, such as textures, while generating each pixel. Textures can be seen as an interface to read-only memory.
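
As a sketch of how these resources cooperate in GPGPU, a common setup is to draw a screen-sized quad so that one fragment is generated per output element; the generic pass-through vertex program below (an assumption for illustration, not tied to any cited work) only transforms the quad's vertices and forwards their texture coordinates, which the rasterizer then interpolates for every fragment.

    // Hypothetical pass-through vertex program used in GPGPU computations.
    void main()
    {
        gl_TexCoord[0] = gl_MultiTexCoord0;                       // stream element index
        gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;   // final screen position
    }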

Another resource available on the GPU is the rasterizer. It generates fragments from triangles built from vertex and connectivity information. The rasterizer makes it possible to generate a larger output data set from a smaller input one, because it interpolates the properties of the vertices of each triangle (such as color, texture coordinates, etc.) for every generated fragment.

One of the essential features of GPUs is render-to-texture. It allows the pixels generated by the fragment processors to be stored in a texture instead of the screen buffer, and it is currently the only mechanism for obtaining output data directly from GPU computations. Render-to-texture cannot be thought of as an interface to read-write memory, because the fragment processor can read data from a texture multiple times but can write to it only once, at the end of processing each fragment.

ARTIFICIAL INTELLIGENCE ALGORITHMS ON GPU

Using the stream programming model as well as the resources provided by graphics hardware, Artificial Intelligence algorithms can be parallelized and therefore accelerated. The parallel and computationally intensive nature of these algorithms makes them good candidates for implementation on the GPU.

Consider the evolution process of a genetic algorithm, where a fitness value needs to be computed for each individual. The population can be considered a data stream and the fitness function a kernel that processes this stream. On the GPU, for instance, the data stream is represented as a texture, whereas the kernel is implemented as a fragment program. Each individual's fitness is obtained in an output stream, also represented by a texture and retrieved through the render-to-texture feature.
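
A minimal sketch of such a fitness kernel is shown below, assuming that each texel of the population texture packs one individual with four genes in its RGBA channels and using the sphere function only as an example; both assumptions are illustrative and not taken from any cited work.

    // Hypothetical fitness kernel: one individual (four genes) per texel.
    uniform sampler2D population;   // input stream: the current population

    void main()
    {
        vec4 genes = texture2D(population, gl_TexCoord[0].st);
        float fitness = dot(genes, genes);   // sphere function: sum of squared genes
        gl_FragColor = vec4(fitness);        // stored in the output texture via render-to-texture
    }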

Recently, several works have parallelized mainly ANN and EC algorithms, as described in the following sections.

Artificial Neural Networks

Bohn (1998) used GPGPU to reduce training time for Kohonen's feature maps. In this case, the bigger the map, the greater the time reduction obtained with the GPU. On 128×128 maps, times were similar on CPU and GPU, but on 512×512 maps the GPU was almost 3.5 times faster than the CPU, increasing to 5.3 times faster on 1024×1024 maps. This was one of the first GPGPU implementations, made on a non-programmable graphics system, a SiliconGraphics Infinite Reality workstation.

Later, with programmable hardware, Oh (2004) used the GPU to accelerate the process of obtaining the output of a multilayer perceptron ANN. The developed system was applied to pattern recognition, obtaining computing times 20x lower than the CPU implementation.
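
As an illustration of the idea, a fragment program can compute one layer of a multilayer perceptron by letting each fragment evaluate one neuron; in the sketch below, the texture layout and the fixed layer size are assumptions for illustration, not the scheme used by Oh (2004).

    // Hypothetical kernel for one perceptron layer: each fragment computes the
    // output of one neuron as a weighted sum of the previous layer's
    // activations followed by a sigmoid activation.
    const int NUM_INPUTS = 16;   // assumed size of the previous layer
    uniform sampler2D inputs;    // 1 x NUM_INPUTS texture: previous layer activations
    uniform sampler2D weights;   // NUM_INPUTS x M texture: column = input, row = neuron

    void main()
    {
        float sum = 0.0;
        for (int i = 0; i < NUM_INPUTS; ++i) {
            float u = (float(i) + 0.5) / float(NUM_INPUTS);             // texel center of input i
            float x = texture2D(inputs,  vec2(u, 0.5)).r;               // activation of input i
            float w = texture2D(weights, vec2(u, gl_TexCoord[0].t)).r;  // weight from input i to this neuron
            sum += w * x;
        }
        gl_FragColor = vec4(1.0 / (1.0 + exp(-sum)));                   // sigmoid output of this neuron
    }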

Considering another kind of ANN, Zhongwen (2005) used GPGPU to reduce the computing time for training Self-Organizing Maps (SOMs). The bigger the SOM, the greater the reduction: whereas for 128×128-neuron maps the computing time was similar on CPU and GPU, for 512×512-neuron maps the training process was 4x faster with the GPU implementation.
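
The parallel part of SOM training maps naturally onto a fragment program. The following sketch, whose texture layout and names are assumptions rather than Zhongwen's implementation, computes the distance of every map unit to the current input sample; the best matching unit is then found by a reduction over the resulting distance texture.

    // Hypothetical SOM kernel: one map unit's weight vector (RGBA) per texel.
    uniform sampler2D weights;    // weight vectors of all map units
    uniform vec4 inputVector;     // current training sample

    void main()
    {
        vec4 diff = texture2D(weights, gl_TexCoord[0].st) - inputVector;
        gl_FragColor = vec4(dot(diff, diff));   // squared Euclidean distance to the input
    }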

Bernhard (2005) used the GPU to simulate spiking neuron models. This ANN model both requires intensive calculations and has a parallel nature, so it fits GPGPU computation very well. The authors made different implementations depending on the neural network application. In the first case, an image segmentation algorithm was implemented using a locally-excitatory, globally-inhibitory Spiking Neural Network (SNN); in this experiment, the authors obtained results up to 10x faster. In the second case, SNNs were applied to image segmentation using an algorithm based on histogram clustering, where the ANN minimized the objective function; here the speed was also improved by up to 10 times.

Seoane (2007) showed how to accelerate multilayer perceptron ANN training with genetic algorithms. GPGPU techniques for computing the ANN allowed the process to be accelerated by up to 11 times.

The company Evolved Machines (EvolvedMachines Website) uses the high performance of GPUs to simulate neural computation, obtaining results up to 100x faster than CPU computation.

Evolutionary Computation

Among EC-related works, Yu (2005) describes how parallel genetic algorithms can be mapped onto low-cost graphics hardware. In this approach, chromosomes and fitness values are stored in textures, and fitness calculation and genetic operators are implemented as fragment programs on the GPU. Different population sizes applied to the Colville minimization problem were used for testing, with larger populations yielding greater time reductions. For a 128×128 population, computing the genetic operators on the GPU was 11.8 times faster than on the CPU, whereas for a 512×512 population that ratio increased to 20.1. For the fitness function computation, the ratios were 7.9 and 17.1, respectively.

In another work, Wong (2006) implemented hybrid genetic algorithms on the GPU incorporating the Cauchy mutation operator. All algorithm steps except random number generation were implemented in graphics hardware. In this approach, a pseudo-deterministic method was proposed for the selection process, allowing significant running-time reductions. The GPU implementation was 3x faster than the CPU one.
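
A sketch of such a mutation kernel is shown below. Following the division of work described by Wong (2006), the uniform random numbers are generated on the CPU and uploaded as a texture, while the fragment program maps them through the inverse Cauchy distribution function; the names and texture layout are illustrative assumptions, not the paper's actual implementation.

    // Hypothetical Cauchy mutation kernel: one individual (four genes) per texel.
    uniform sampler2D parents;    // current population
    uniform sampler2D randoms;    // uniform random numbers in (0, 1) generated on the CPU
    uniform float scale;          // Cauchy scale parameter

    void main()
    {
        vec4 genes = texture2D(parents, gl_TexCoord[0].st);
        vec4 u     = texture2D(randoms, gl_TexCoord[0].st);
        vec4 delta = scale * tan(3.14159265 * (u - 0.5));   // inverse Cauchy CDF
        gl_FragColor = genes + delta;                        // mutated offspring
    }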

Fok (2007) showed how to implement evolutionary algorithms on the GPU. Since the crossover operators of GAs require more complex calculations than the mutation ones, the authors studied a GPU implementation of Evolutionary Programming, which uses only mutation operators. Tests with the Cauchy distribution on 5 different optimization problems obtained results between 1.25 and 5 times faster.

FUTURE TRENDS

Nowadays GPUs are very powerful and they are evolving quite fast. On one side, there are more and more programmable elements in GPUs; on the other, programming languages are becoming more full-featured. As a result, there are more and more implementations of different kinds of general-purpose algorithms that take advantage of these features.

In the Artificial Intelligence field the number of developments is rather low, in spite of the great number of existing algorithms and their high computing requirements. Using GPUs to extend existing implementations seems very promising. For instance, some examples of speeding up ANN simulation have been shown, but there is comparatively little work on accelerating training. Likewise, the same ideas can be applied to implement other kinds of ANN architectures or AI techniques, such as genetic programming, where there are no developments yet.

CONCLUSION

This paper has introduced general-purpose programming on GPUs. GPUs have been shown to be powerful parallel processors whose programming capabilities allow them to be used for computationally intensive general-purpose algorithms. Based on this idea, existing GPU implementations of AI models such as ANNs and EC have been described, all achieving considerable reductions in computing time.

General-purpose computing on the GPU and its use to accelerate AI algorithms provide great advantages, and constitute an essential contribution in applications where computing time is a decisive factor.

KEY TERMS

Fragment: Potential pixel containing all the information (color, depth, etc.) necessary to generate the final pixel color.

Fragment Processor: Graphics system element that receives a set of fragments as input and processes them to obtain pixels, writing them to a target buffer. Present GPUs have multiple fragment processors that work in parallel and can be programmed using fragment programs.

Graphics Pipeline: Architecture oriented to three-dimensional graphics, composed of several stages that run sequentially.

Graphics Processing Unit (GPU): Electronic device designed for graphics rendering in computers. Its architecture is specialized for graphics calculations.

General-Purpose Computation on GPUs (GPGPU): Trend dedicated to implementing general-purpose algorithms using graphics devices, called GPUs. At the moment, the high programmability and performance of GPUs allow developers to run classical algorithms on these devices to speed up non-graphics applications, especially algorithms with a parallel nature.

Pixel: Abbreviation of picture element, used to refer to the points of a graphics image.

Rasterizer: Graphics pipeline element that, starting from graphics primitives, provides the appropriate fragments to a target buffer.

Render-to-Texture: GPU feature that allows storing the fragment processor output in a texture instead of the screen buffer.

Stream Programming Model: Parallel programming model based on defining, on one side, sets of input and output data, called streams, and on the other, computationally intensive operations, called kernel functions, that are applied sequentially to the streams.

Texture: In the computer graphics field, a digital image used to modify the appearance of a three-dimensional object. The operation that wraps a texture around an object is called texture mapping. In GPGPU, a texture can be considered a data stream.

Vertex: In the computer graphics field, a clearly defined point in three-dimensional space, which is processed by the graphics pipeline. Relationships can be established between vertices (like triangles) to assemble structures that define a three-dimensional object. In GPGPU, a vertex array can be considered a data stream.

Vertex Processor: Graphics system component that receives a set of 3D vertices as input and processes them to obtain 2D screen positions. Present GPUs have multiple vertex processors that work in parallel and can be programmed using vertex programs.
