GpuScript Key Features
How can GpuScript achieve 23 PFLOPS on a 2020 GeForce RTX 3070 GPU
(20.3 TFLOPS of peak single-precision performance), computing faster than
integrated FPGAs and TPU/MXUs?
FPGAs and v4 TPUs are lightning fast, at over an exaflop for
half-precision floating point (4 decimal digits), but memory transfer slows
them down by a factor of about 4,000, to a maximum of 275 TFLOPS. GpuScript
has 9 decimal digits of precision
and runs 100 times faster than the TPU theoretical limit. TPUs only speed up
matrix multiplication, a tiny portion of an application. GpuScript speeds up
the entire application, eliminating memory transfer bottlenecks. In addition,
the high speed is achieved by operating on a dense matrix generated from a
boundary element model.
GpuScript is 32 million times faster than the CPU, 4,000 times faster than
HLSL, OpenGL, CUDA, or other GPU languages, and 1,000 times faster than the
theoretical limit of the GPU. GpuScript is a paradigm shift in programming.
| | Without GpuScript | With GpuScript |
|---|---|---|
| Speed | Any speedup, even 1.2 times faster than the CPU, is considered a great success and major improvement. | Runs several orders of magnitude faster*, not only faster than any CPU programming language, but faster than any other GPU programming language, including CUDA. |
| Memory Transfer | Small and simple GPU routines require frequent memory transfer between the CPU and GPU. Few, if any, large and complex programs have been implemented on the GPU, so memory transfer for these cases is not available. | Automatically transfers memory between the GPU and CPU only when necessary. Because Object Oriented Programming is supported, large portions of the code can be run on the GPU. Instead of only running critical code on the GPU, almost all of the code can run on the GPU. This can significantly reduce or eliminate memory transfer between the CPU and GPU. Data can often be sent to the GPU once and may never need to be transferred back to the CPU. |
| Debugging | Debugging features are absent, difficult to use, or very limited. This is perhaps the most significant reason why programmers are not writing large, complex programs on the GPU. | Easy to debug. Programs can be written in Visual Studio with auto-completion, syntax highlighting, and full debugging support**. |
| Learning | Difficult to learn and master, similar to assembly programming on the CPU. | Easy to learn. The entire CPU and GPU program can be written in C# without the need to learn GPU programming languages. |
| Interface Between User, CPU & GPU | Requires careful planning before implementation, as modifications can be tedious, bug-prone, and time-consuming. | Builds the user interface and a shell for the entire CPU and GPU program automatically by examining variables and methods in code***, increasing programming productivity by at least 50 times. |
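The "only when necessary" transfer policy in the Memory Transfer row can be pictured with a small dirty-flag model. This is a minimal sketch in Python, written purely for illustration: the class `MirroredBuffer` and its methods are hypothetical names, the "GPU" copy is just a second Python list, and nothing here is taken from GpuScript's actual implementation.

```python
class MirroredBuffer:
    """Toy model of a buffer that has a CPU copy and a GPU copy.

    Hypothetical illustration: the "GPU" side is an ordinary Python list,
    and none of these names are part of GpuScript.
    """

    def __init__(self, data):
        self.cpu = list(data)
        self.gpu = []
        self.cpu_dirty = True    # CPU copy is newer than the GPU copy
        self.gpu_dirty = False   # GPU copy is newer than the CPU copy
        self.transfers = 0       # number of simulated CPU<->GPU copies

    def _upload_if_needed(self):
        if self.cpu_dirty:
            self.gpu = list(self.cpu)
            self.cpu_dirty = False
            self.transfers += 1

    def _download_if_needed(self):
        if self.gpu_dirty:
            self.cpu = list(self.gpu)
            self.gpu_dirty = False
            self.transfers += 1

    def run_kernel(self, fn):
        """Apply fn to every element of the simulated GPU copy."""
        self._upload_if_needed()
        self.gpu = [fn(x) for x in self.gpu]
        self.gpu_dirty = True

    def read(self):
        """Return the data on the CPU, downloading only if the GPU changed it."""
        self._download_if_needed()
        return list(self.cpu)


buf = MirroredBuffer([1.0, 2.0, 3.0])
for _ in range(100):
    buf.run_kernel(lambda x: x * 1.01)   # 100 kernels, data stays on the "GPU"
result = buf.read()
print(buf.transfers)  # 2: one upload before the first kernel, one download at the end
```

In this model, one hundred consecutive kernels cost a single upload and a single download, which is the pattern the table describes: data is sent once and comes back only when the CPU actually asks for it.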
* GpuScript can compute over 700,000 4096×4096 matrix multiplications in a
second, equivalent to 1 matrix multiplication in 1.44 nanoseconds. With 32
million floating point operations in a single matrix multiply, this results
in 23 PFLOPS supercomputer speed. A 4096-sample FFT runs in 3 nanoseconds.
Only GpuScript can exceed the theoretical limit of the GPU, by using matrix
scaling, transferring floating point operations to integers, using
group-shared memory, and using intrinsic functions. Small routines run fast,
but speedups for large complex programs are significantly greater. For
example, one program required 15 minutes to load 200 million points from a
text file. GpuScript completed the same task in 4 seconds: 3 seconds to read
the file and transfer the text to the GPU, a millisecond to convert the text
to binary, and then a second to transfer the data back to the CPU and save
it to disk.
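The arithmetic in this footnote can be reproduced in a few lines. The sketch below, in Python for illustration, takes the per-multiplication time and the file-loading times directly from the text. It assumes that the "32 million floating point operations" count one multiply and one add per element of a 4096×4096 matrix (2 × 4096²); that reading is an assumption, not something the footnote states.

```python
n = 4096                          # matrix dimension from the footnote
flops_per_multiply = 2 * n * n    # assumption: one multiply + one add per matrix element
seconds_per_multiply = 1.44e-9    # per-multiplication time quoted in the footnote

flops = flops_per_multiply / seconds_per_multiply
pflops = flops / 1e15             # 1 PFLOPS = 1e15 floating point operations per second

# File-loading example: 15 minutes before, 4 seconds after.
load_speedup = (15 * 60) / 4

print(flops_per_multiply)   # 33554432
print(round(pflops, 1))     # 23.3
print(load_speedup)         # 225.0
```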
** A programmer can run any GPU method partially or fully on the CPU for
debugging:
a. The CPU is considerably slower than the GPU, so this feature is critical
for debugging large and complex programs.
b. The CPU has thread limitations, so it cannot possibly create a thread for
each GPU thread. GpuScript does not create CPU threads, but does everything
using a single thread. This is especially critical for debugging GPU kernels
that use group-shared memory.
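One general way to run a group-shared-memory kernel on a single CPU thread is to execute it phase by phase: loop over every thread id for the code between two barriers, then move on to the next phase. The sketch below, in Python for illustration, applies that idea to a sum reduction. It shows the general technique only; the function name is hypothetical and GpuScript's own mechanism may differ.

```python
def group_sum_single_thread(values, group_size=8):
    """Emulate one thread group's shared-memory sum reduction on one CPU thread."""
    if len(values) != group_size or group_size & (group_size - 1) != 0:
        raise ValueError("expects exactly group_size values, group_size a power of two")

    shared = [0] * group_size            # stands in for group-shared memory

    # Phase 0: every emulated thread stores its value in shared memory.
    for tid in range(group_size):
        shared[tid] = values[tid]
    # Barrier: satisfied because the loop over all thread ids has finished.

    stride = group_size // 2
    while stride > 0:
        # One phase between barriers: threads with tid < stride add in the
        # partner value at tid + stride. Those partner slots are only read in
        # this phase, so sequential order gives the same result as parallel.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2                     # barrier before the next phase

    return shared[0]


print(group_sum_single_thread([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Because each phase completes for all thread ids before the next begins, a debugger can stop at any line and inspect `shared` in a consistent state, with no CPU threads created.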
*** Write a C# class with variables, methods, enumerations, structs, internal
classes and some code, with the same structure as a normal C# class. An
optional list of instructions can be specified before each variable or method:
a. Display name of the button, textbox, checkbox, etc.
b. Status bar description when the mouse hovers over the user interface element
c. The display format, such as to display a currency with a dollar sign and 2 decimal places
d. Units, such as mm, km, ft, day, PFLOPS, etc.
e. Valid range, whether to use a scrollbar, default value
f. Conditions for showing or hiding the variable
g. Simple task to do when the value changes. More complex tasks can be specified later in code.
h. If the textbox contents should be hidden for a password
i. If the user interface element should be read-only
j. If the variable references an internal or external library
k. How the variable or method is accessed by and run on the GPU
l. How the data is organized in trees, groups, lists, and tables
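The idea of attaching instructions to variables and deriving user interface elements from them can be sketched with field metadata. The example below uses Python dataclasses for illustration. The metadata keys (`name`, `format`, `units`, `range`, `hidden`) and the function `build_ui_rows` are invented for this sketch and are not GpuScript's syntax, which is written in C#.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Settings:
    # Hypothetical annotations: display name, format, units, range, password flag.
    price: float = field(default=9.5, metadata={
        "name": "Unit price", "format": "${:.2f}", "range": (0.0, 100.0)})
    length: float = field(default=12.0, metadata={
        "name": "Length", "units": "mm", "range": (0.0, 500.0)})
    password: str = field(default="", metadata={
        "name": "Password", "hidden": True})


def build_ui_rows(obj):
    """Return (label, displayed text) pairs derived from the field metadata."""
    rows = []
    for f in fields(obj):
        meta = f.metadata
        value = getattr(obj, f.name)
        if meta.get("hidden"):
            text = "*" * len(str(value))           # mask password-style fields
        else:
            text = meta.get("format", "{}").format(value)
        if "units" in meta:
            text += " " + meta["units"]
        rows.append((meta.get("name", f.name), text))
    return rows


for label, text in build_ui_rows(Settings(password="abc")):
    print(f"{label}: {text}")
```

The point of the design is that the class remains the single source of truth: changing an annotation changes the generated interface, with no separate layout code to keep in sync.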
When the GpuScript class is saved, it automatically compiles. If compilation
succeeds, GpuScript builds separate classes and files that:
a. Contain all the user interface elements, organized in trees, groups, and grids
b. Contain the CPU and GPU code with variables and methods
c. Import and link libraries
d. Generate and link GPU computation and graphics code
e. Link in all the GPU kernel methods
f. Specify GPU thread groups
g. Initialize and allocate GPU buffers
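Two of the generated pieces, thread group specification and buffer allocation, come down to simple sizing arithmetic in any GPU compute framework. The sketch below, in Python for illustration, shows the usual ceiling division for the group count and a byte-size calculation for a buffer. The function names and the default group size and element size are assumptions for this example, not values taken from GpuScript.

```python
def thread_group_count(item_count, threads_per_group=64):
    """Thread groups needed so that every item gets one thread (ceiling division)."""
    if item_count < 0 or threads_per_group <= 0:
        raise ValueError("item_count must be >= 0 and threads_per_group > 0")
    return (item_count + threads_per_group - 1) // threads_per_group


def buffer_size_bytes(item_count, bytes_per_item=4):
    """Bytes to allocate for item_count elements (4 bytes = one 32-bit float)."""
    return item_count * bytes_per_item


elements = 4096 * 4096                          # one 4096x4096 matrix
print(thread_group_count(elements, 1024))       # 16384
print(thread_group_count(1000, 64))             # 16 (the last group is partly idle)
print(buffer_size_bytes(elements))              # 67108864
```

When the item count is not a multiple of the group size, the last group has surplus threads, so a kernel normally checks its thread id against the item count before touching the buffer.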
The programmer can modify the GpuScript class at any time to reorganize the
user interface and to add, remove, or modify user interface elements,
variables, or methods.
The programmer can override GPU code in libraries. This
avoids having to reorganize data when calling or returning
from library routines. Libraries can directly access parent code and
data, and the parent code can directly access library code and data. This
allows libraries to be fully customized and run at full speed.