GpuScript Key Features

How can GpuScript achieve 23 PFLOPS on a 2020 GeForce RTX 3070 GPU (20.3 TFLOPS of peak single-precision performance), computing faster than integrated FPGAs and TPU/MXUs?

 

FPGAs and v4 TPUs are lightning fast, at over an exaflop for half-precision floating point (4 decimal digits), but memory transfer slows them down about 4,000 times, to a maximum of 275 TFLOPS. GpuScript computes with 9 decimal digits and runs about 100 times faster than the TPU theoretical limit. TPUs only speed up matrix multiplication, a tiny portion of an application. GpuScript speeds up the entire application, eliminating memory transfer bottlenecks. In addition, the high speed is achieved by operating on a dense matrix generated from a boundary element model.

 

GpuScript is 32 million times faster than the CPU, 4,000 times faster than HLSL, OpenGL, CUDA, or other GPU languages, and 1,000 times faster than the theoretical limit of the GPU. GpuScript is a paradigm shift in programming.

 

 

 

Speed

Without GpuScript: Any speedup, even 1.2 times faster than the CPU, is considered a great success and a major improvement.

With GpuScript: Runs several orders of magnitude faster*, not only faster than any CPU programming language, but faster than any other GPU programming language, including CUDA.

Memory Transfer

Without GpuScript: Small and simple GPU routines require frequent memory transfer between the CPU and GPU. Few, if any, large and complex programs have been implemented on the GPU, so memory transfer information for these cases is not available.

With GpuScript: Automatically transfers memory between the GPU and CPU only when necessary. Because GpuScript supports object-oriented programming, large portions of the code can run on the GPU. Instead of only the critical code, almost all of the code can run on the GPU. This can significantly reduce or eliminate memory transfer between the CPU and GPU. Data can often be sent to the GPU once and may never need to be transferred back to the CPU.

Debugging

Without GpuScript: Debugging features are absent, difficult to use, or very limited. This is perhaps the most significant reason why programmers are not writing large, complex programs on the GPU.

With GpuScript: Easy to debug. Programs can be written in Visual Studio with auto-completion, syntax highlighting, and full debugging support**.

Learning

Without GpuScript: Difficult to learn and master, similar to assembly programming on the CPU.

With GpuScript: Easy to learn. The entire CPU and GPU program can be written in C#, without the need to learn GPU programming languages.

Interface Between User, CPU & GPU

Without GpuScript: Requires careful planning before implementation, as modifications can be tedious, bug-prone, and time-consuming.

With GpuScript: Builds the user interface and a shell for the entire CPU and GPU program automatically by examining variables and methods in code***, increasing programming productivity by at least 50 times.
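The idea of transferring memory "only when necessary", described under Memory Transfer above, can be illustrated with a small sketch. This is not GpuScript code or its actual mechanism: the LazyBuffer class, its method names, and its counters are hypothetical, both "sides" are plain Python lists, and Python is used only for illustration. The sketch assumes a simple stale-flag scheme in which a copy happens only when one side needs data that the other side changed.

```python
class LazyBuffer:
    """Toy buffer that copies between a 'CPU' list and a 'GPU' list on demand.

    Hypothetical illustration only: nothing here is part of GpuScript.
    """

    def __init__(self, values):
        self._cpu = list(values)
        self._gpu = []
        self._gpu_stale = True    # GPU copy is out of date
        self._cpu_stale = False   # CPU copy is out of date
        self.uploads = 0          # CPU -> GPU copies performed
        self.downloads = 0        # GPU -> CPU copies performed

    def write_cpu(self, index, value):
        self._sync_to_cpu()
        self._cpu[index] = value
        self._gpu_stale = True

    def read_cpu(self, index):
        self._sync_to_cpu()
        return self._cpu[index]

    def run_kernel(self, fn):
        """Apply fn to every element on the simulated GPU side."""
        self._sync_to_gpu()
        self._gpu = [fn(x) for x in self._gpu]
        self._cpu_stale = True

    def _sync_to_gpu(self):
        if self._gpu_stale:
            self._gpu = list(self._cpu)
            self._gpu_stale = False
            self.uploads += 1

    def _sync_to_cpu(self):
        if self._cpu_stale:
            self._cpu = list(self._gpu)
            self._cpu_stale = False
            self.downloads += 1


buf = LazyBuffer([1.0, 2.0, 3.0])
buf.run_kernel(lambda x: x * 2)   # first kernel triggers the only upload
buf.run_kernel(lambda x: x + 1)   # data is already on the GPU side
buf.run_kernel(lambda x: x * x)   # still no transfer
first = buf.read_cpu(0)           # first CPU read triggers one download
```

In this sketch, three consecutive kernels cost one upload and one download in total, which is the pattern the text describes: the more of the program that runs on the GPU, the fewer transfers are needed.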

 

 

* GpuScript can compute over 700,000 4096×4096 matrix multiplications in a second, equivalent to 1 matrix multiplication in 1.44 nanoseconds. With 32 million floating-point operations in a single matrix multiply, this results in 23 PFLOPS supercomputer speed. A 4096-sample FFT runs in 3 nanoseconds. Only GpuScript can exceed the theoretical limit of the GPU, by using matrix scaling, converting floating-point operations to integer operations, using group-shared memory, and using intrinsic functions. Small routines run fast, but speedups for large, complex programs are significantly greater. For example, one program required 15 minutes to load 200 million points from a text file. GpuScript completed the same task in 4 seconds: 3 seconds to read the file and transfer the text to the GPU, a millisecond to convert the text to binary, and a second to transfer the data back to the CPU and save it to disk.
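The relationship between the quantities in this footnote (operations per call, time per call, calls per second, and FLOPS) can be written out as a short calculation. This is only a sketch of the arithmetic, in Python for illustration. The helper names are hypothetical, and it assumes that "32 million" operations means 2 × 4096² = 33,554,432. The per-call time and the RTX 3070 peak are the figures quoted in the text.

```python
def seconds_per_call(calls_per_second):
    """Convert a throughput (calls per second) into a time per call."""
    return 1.0 / calls_per_second


def flops(ops_per_call, seconds):
    """Floating-point operations per second for one call of known cost."""
    return ops_per_call / seconds


N = 4096
ops_per_multiply = 2 * N * N     # assumption: the "32 million" in the text
time_per_multiply = 1.44e-9      # seconds per multiply, as quoted
peak_rtx_3070 = 20.3e12          # peak single-precision FLOPS, as quoted

rate = flops(ops_per_multiply, time_per_multiply)
pflops = rate / 1e15                          # rate expressed in PFLOPS
ratio_to_peak = rate / peak_rtx_3070          # multiple of the quoted peak
calls_per_second = 1.0 / time_per_multiply    # throughput implied by the time
```

With these inputs the rate comes to roughly 23 PFLOPS, a little over 1,100 times the quoted 20.3 TFLOPS peak. The same helpers allow a per-call time and a calls-per-second figure to be checked against each other: a time of 1.44 ns per call corresponds to about 694 million calls per second.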

 

** A programmer can run any GPU method partially or fully on the CPU for debugging:

a. The CPU is considerably slower than the GPU, so the ability to run only part of a method on the CPU is critical for debugging large and complex programs.

b. The CPU has thread limitations, so it cannot create a thread for each GPU thread. GpuScript does not create CPU threads; it does everything on a single thread. This is especially critical for debugging GPU kernels that use group-shared memory.
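One way a single CPU thread can stand in for a whole GPU thread group, including group-shared memory, is to run the threads one after another within each phase between synchronization barriers. The sketch below illustrates that general technique with a sum reduction. It is not GpuScript's implementation, which the text does not describe: the function name is hypothetical, Python is used only for illustration, and the group size is assumed to be a power of two.

```python
def emulate_group_sum(data, group_size):
    """Sum each group of `group_size` items, emulating GPU threads in a loop.

    Each pass of an inner `for tid` loop plays the role of all threads in a
    group executing one phase; the end of the loop acts as the barrier.
    """
    if group_size <= 0 or group_size & (group_size - 1):
        raise ValueError("group_size must be a positive power of two")

    results = []
    for start in range(0, len(data), group_size):
        shared = [0.0] * group_size   # stands in for group-shared memory

        # Phase 1: every "thread" loads one item (padding with zeros).
        for tid in range(group_size):
            i = start + tid
            shared[tid] = data[i] if i < len(data) else 0.0

        # Phase 2..n: tree reduction, halving the active threads each step.
        stride = group_size // 2
        while stride > 0:
            for tid in range(stride):
                # Reads come from the upper half, which no thread writes in
                # this phase, so sequential order matches parallel order.
                shared[tid] += shared[tid + stride]
            stride //= 2

        results.append(shared[0])
    return results


sums = emulate_group_sum([1, 2, 3, 4, 5, 6, 7, 8], group_size=4)
```

Because every phase runs to completion before the next begins, a debugger can stop at any thread index and inspect the shared array without dealing with real concurrency.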

 

*** Write a C# class with variables, methods, enumerations, structs, internal classes, and some code, with the same structure as a normal C# class. An optional list of instructions can be specified before each variable or method:

a. The display name of the button, textbox, checkbox, etc.

b. The status bar description shown when the mouse hovers over the user interface element

c. The display format, such as displaying a currency with a dollar sign and 2 decimal places

d. Units, such as mm, km, ft, day, PFLOPS, etc.

e. The valid range, whether to use a scrollbar, and the default value

f. Conditions for showing or hiding the variable

g. A simple task to perform when the value changes. More complex tasks can be specified later in code.

h. Whether the textbox contents should be hidden, as for a password

i. Whether the user interface element should be read-only

j. Whether the variable references an internal or external library

k. How the variable or method is accessed by and run on the GPU

l. How the data is organized in trees, groups, lists, and tables
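The general idea of attaching instructions to a variable and generating user interface elements from them can be sketched as follows. This is not GpuScript syntax, which is written in C# and is not shown in the text. The sketch uses Python dataclass metadata purely for illustration, and the `ui` helper, the `Settings` class, the widget kinds, and the rules for choosing them are all hypothetical.

```python
from dataclasses import dataclass, field, fields


def ui(label, description="", unit="", value_range=None,
       read_only=False, password=False):
    """Pack user interface hints into a metadata dictionary (hypothetical)."""
    return {"label": label, "description": description, "unit": unit,
            "range": value_range, "read_only": read_only,
            "password": password}


@dataclass
class Settings:
    width: float = field(default=10.0, metadata=ui(
        "Width", "Plate width", unit="mm", value_range=(0.0, 100.0)))
    secret: str = field(default="", metadata=ui("Password", password=True))
    speed: float = field(default=0.0, metadata=ui(
        "Speed", unit="PFLOPS", read_only=True))


def build_ui(cls):
    """Derive a list of widget descriptions from the annotated fields."""
    widgets = []
    for f in fields(cls):
        m = f.metadata
        if m["password"]:
            kind = "password box"
        elif m["range"] is not None:
            kind = "slider"
        elif m["read_only"]:
            kind = "label"
        else:
            kind = "textbox"
        widgets.append({"name": f.name, "kind": kind, "label": m["label"],
                        "unit": m["unit"], "default": f.default})
    return widgets


widgets = build_ui(Settings)
```

Adding, removing, or re-annotating a field changes the generated widget list without any separate interface code, which is the productivity argument the text makes.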

 

When the GpuScript class is saved, it compiles automatically. If compilation succeeds, GpuScript builds separate classes and files that:

a. Contain all the user interface elements, organized in trees, groups, and grids

b. Contain the CPU and GPU code with variables and methods

c. Import and link libraries

d. Generate and link GPU computation and graphics code

e. Link in all the GPU kernel methods

f. Specify GPU thread groups

g. Allocate and initialize GPU buffers
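As an illustration of what specifying GPU thread groups involves, the sketch below shows the ceiling division commonly used to decide how many thread groups cover a given number of items, and the bounds check each thread then needs. It is not GpuScript's generated code: the function names are hypothetical, the group size of 256 is an assumed example value, and Python is used only for illustration.

```python
def thread_group_count(item_count, group_size):
    """Number of thread groups needed so every item gets one thread."""
    if item_count < 0 or group_size <= 0:
        raise ValueError("item_count must be >= 0 and group_size > 0")
    return (item_count + group_size - 1) // group_size   # ceiling division


def thread_to_item(group_id, local_id, group_size, item_count):
    """Map a (group, local thread) pair to an item index, or None if unused."""
    index = group_id * group_size + local_id
    return index if index < item_count else None


groups = thread_group_count(1000, 256)            # groups needed for 1000 items
last_item = thread_to_item(3, 231, 256, 1000)     # last thread with real work
idle = thread_to_item(3, 255, 256, 1000)          # thread past the end of data
```

The last group is usually only partly filled, so threads whose index falls past the end of the data must skip their work.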

The programmer can modify the GpuScript class at any time to reorganize the user interface and to add, remove, or modify user interface elements, variables, or methods.

The programmer can override GPU code in libraries. This avoids having to reorganize data when calling or returning from library routines. Libraries can directly access parent code and data, and the parent code can directly access library code and data. This allows libraries to be fully customized and run at full speed.