Minimal Single Symbol ELF For Using GL

With the exception of graphics, everything else on Linux can easily be done via kernel system calls. For those working in a custom language, using GL adds the complexity nightmare of dynamic linking. However, there is a way to reduce this complexity to the following (which enables a simpler ELF),

(1.) Link to libdl.so only.
(2.) Use only one symbol: dlsym().

Using dlsym() one can get the address of dlopen(), and together these enable manually opening libraries. Unfortunately some GL libraries require that the caller have the full libc treatment (everything that happens between _start() and main()). Implementing all that crap in a custom language is a great way to gain the full realization that the Linux people lost track of the K.I.S.S. principle a long time ago. Instead one can just dlopen() libc.so itself and manually pass through the complexity filter via __libc_start_main(). Here is a proof-of-concept in C for x86-64 which lacks a bunch of important stuff, but works well enough to get into main() and then print "ok" via a Linux syscall.

static inline long Linux1(long num, long arg1) {
    long ret;
    asm volatile ("syscall" : "=a" (ret) : "a" (num), "D" (arg1)
        : "cc", "memory", "%rcx", "%rdx", "%rsi", "%r8", "%r9", "%r10", "%r11");
    return ret;
}

static inline long Linux3(long num, long arg1, long arg2, long arg3) {
    long ret;
    asm volatile ("syscall" : "=a" (ret) : "a" (num), "D" (arg1), "S" (arg2), "d" (arg3)
        : "cc", "memory", "%rcx", "%r8", "%r9", "%r10", "%r11");
    return ret;
}

void* dlsym(void*, const char*);

typedef void* (*dlopenF)(const char*, int);

#define RTLD_NOW 2

typedef int (*__libc_start_mainF)(int (*)(int, char**, char**), int, char**, void (*)(void), void (*)(void), void (*)(void), void *);

static void None(void) { }

static int Main(int a, char** b, char** c) {
    char msg[] = "ok\n";
    Linux3(1, 1, (long)msg, 3); /* write(1, msg, 3) */
    Linux1(231, 0);             /* exit_group(0) */
    return 0;                   /* unreachable */
}

void _start(void) {
    dlopenF dlopen = (dlopenF)dlsym(0, "dlopen");
    void* libc = dlopen("libc.so", RTLD_NOW); /* may need "libc.so.6" on systems where libc.so is a linker script */
    __libc_start_mainF __libc_start_main = (__libc_start_mainF)dlsym(libc, "__libc_start_main");
    __libc_start_main(Main, 0, 0, None, None, None, 0);
}

This can be built with,

gcc -std=gnu99 src.c -ldl -Os -fno-unwind-tables -fno-exceptions -nostdlib

Even given the unnecessary garbage GNU ld dumps in the ELF, this binary will be under 3K, which is almost half the size of compiling "void main(void) { }" using the standard libs. Going to assembly or some other custom language and manually building the ELF can make this much smaller.


Why GL's ARB_Bindless_Texture is Awesome

Unquestionably GL bindless is well suited to NVIDIA hardware. After all, they wrote the extension that ARB_bindless_texture was based on. Over time I have slowly started to understand how it is a great match for AMD's GCN hardware as well...

Quick GCN Hardware Review
Each non-buffer texture in GL can be accessed by either a floating point offset or an integer texel offset. On GCN the integer texel offset case requires only a 16-byte or 32-byte descriptor depending on texture type. The floating point offset case requires an extra 16-byte sampler descriptor. GCN has the ability to block load descriptors into the scalar register file 4, 8, 16, 32, or 64 bytes at a time with a single instruction.

GL vs DX With Regards to Sampler Style
The DX style of separate samplers and textures on GCN would require two block load instructions for the first texture accessed with a new sampler (one load for the texture descriptor the other for the sampler descriptor). Each additional texture using the same sampler would require just one block load instruction.

The GL style of a combined sampler and texture on GCN requires only one block load instruction for any texture. The texture and sampler descriptors, sitting side by side in memory, can be fetched together with just one 32-byte or 64-byte scalar block load instruction. As long as the K$ is not a limiter, the GL case could be faster (fewer instructions). This could be more important as wavefront occupancy goes down (fewer options to multi-issue scalar, vector, and memory instructions).

A Bindless Indirection can be an Optimization?
Let's look in detail at a situation which would be common for a Clustered Forward based engine. Typically objects are drawn with a material shader, and this material shader will have a selection of new textures in combination with some textures shared with the prior shader, such as per-view textures (to fetch lights, etc.). If the graphics API builds a unique buffer of descriptors for each draw call, then there is extra CPU overhead to copy all the per-view texture descriptors into each draw call's chunk of the buffer. Another option would be an API which enables keeping a unique cached table/material of descriptors in a buffer (table changes per draw). However in that case during rendering, after each table change, parts of the K$ need to get reloaded for the per-view descriptors which might already exist in the cache at a different address. Using the indirection of GL bindless could in theory reduce CPU setup (less to build), and also give a better chance of keeping a warmed K$ between draw calls (as the much larger descriptors shared between shaders have the same address).

Using Bindless in a "Binding" Way
Another option for using bindless, which was recently brought to my attention, is just replacing glBindTextures() with glProgramUniformHandleui64vARB(). The latter sets the traditional GL sampler bind points with a collection of bindless handles, and in theory has the ability to remove the driver overhead and validation of the legacy glBindTextures() path. This also in theory opens up the option for the driver to skip the bindless indirection if that indirection would possibly be slower on some kinds of hardware. Lastly this option is 100% compatible with engines which need to support GPUs which have no ARB_bindless_texture support.

AMD's Socketed Kabini

Semiaccurate on the Socketed Kabini

CHIP________ CORES GHz MemMhz L2MB $__
Athlon_ 5350 4____ 2.0 1600__ 2___ $59
Athlon_ 5150 4____ 1.6 1600__ 2___ $49
Sempron 3850 4____ 1.3 1600__ 2___ $39
Sempron 2650 2____ 1.4 1333__ 1___ ???

Interesting options for those who need maximum single GPU performance, but can accept minimum CPU performance and a PCIe x4 limit, all at minimum cost. With USB 3.0 support, one can run this kind of machine from a fast USB stick. The ASRock AM1B-ITX mITX motherboard for the AM1 platform is just $40. Total build cost for a case-less system could be around,

$40_ motherboard
$50_ Athlon 5150
$60_ 8GB memory (2x 4GB)
$70_ 550W power supply
$570 AMD R9 290X GPU
$780 total


Practical Assembly

Market Evolution
The market has evolved such that x86-64 is the only practical consumer platform which supports an attached GPU which has medium to high performance. Personally I'd rather work on ARM-64 as it is a much better ISA, but market forces have made that impossible. The first sign was that AMD (an x86 company) secured all the console wins. Then when Microsoft decided to limit Windows on ARM to Metro only, that effectively killed ARM-64 as an interesting platform for desktop graphics.

One positive side effect of this is that for those who build graphics which require 100W and up, there is only one ISA, x86-64, which is, by design, fully backwards compatible. Thus for those who are brave, ISA portability is no longer an excuse to avoid assembly.

Where Traditional Assembly Goes Wrong
The core problem with traditional assembly usage is the complete lack of practical front-end tools for the language. A second problem is that of developers attempting to program assembly in a C style, using the platform ABI, resulting in a massive mess of constantly pushing and popping everything from a stack around functions.

There is a much better way to look at x86-64 as a platform for solving problems. First, while x86-64 has only 16 integer registers, this is somewhat mitigated by the ISA's built-in ability to take one operand from a memory location. This provides enough effective working space to factor register allocation from inside calls to outside calls. This is a powerful simplification tool, as many algorithms can then effectively have a programmer-defined global register allocation. Different subsets of the program running at different times simply alias the meaning of various registers. Likewise the stack is no longer used for data (it is then used only to interop with dynamic link libraries) and a pure data flow based design is much easier to realize. Assembly, with a good front-end tool, becomes easier to write than C or C++.

What is a Good Front-End Tool?
My choice has always been something similar to the Forth language. A language front-end should be powerful enough to use code to generate code. For instance a language front-end should be able to load a file, transform that file the way an executable would, and produce an inline data structure when compiling. In contrast, think about how limited the "#include" functionality is in the C pre-processor.

I have been leveraging Forth-like interpreters for ages now. For instance, my prior FarrarFocus website was written in a server-side Forth-like language which was interpreted by PHP and generated client-side JavaScript. One of the advantages of using a Forth style front-end is that the assembler itself can be defined in the language. Also, in contrast to the hundreds of MB of install required to compile "hello world" in gcc or clang, a simple complete Forth based system could easily be under 16K.

Here is what I've found to be useful over the years as front-end language syntax,

\comment is anything between slashes\
0deafb175- \hex number starting with 0-9, with postfix sign\
builtin \evaluate a built-in symbol\
.symbol \push the value of a symbol on the data stack\
:symbol \pop the top of data stack into symbol\
{ symbol \define a symbol as some code\ }
`symbol \evaluate a defined symbol\
'symbol \push the address of a symbol on the data stack\
"string" \push the address of a string on the data stack\

Note the use of a character prefix to define the usage of symbols. When editing, I use simple colored syntax highlighting to make code meaning easy to understand visually. It is important to note the data stack is a construct of the interpreter (not code generation). The language works with postfix notation: "a + b" is written "a b +". Now for some examples of parts of writing an x86-64 assembler, starting with the syntax of the assembly,

\somewhere setup the aliased global register allocation\
0 :dst \the symbol "dst" is register 0 or RAX\
1 :address \the symbol "address" is register 1 or RCX\
\then in code use symbols instead of register names or numbers\
.dst .address 10 `@MOV8, \mov rax,[rcx+16]\

The "@MOV8," is used for the 64-bit form of the MOV instruction in the form where the source operand is a load (the @ means load) such as [reg] or [reg+imm8] or [reg+imm32]. The comma in this case "," means write to the output file. The language is really flexible. The above code could be also written as follows,

0 :RAX 1 :RCX
.RAX :dst .RCX :address
.dst .address 10 `@MOV8, \mov rax,[rcx+16]\

Here are the bits of an assembler (written in the front-end language) which could implement the "@MOV8" symbol. I'm writing these in reverse order,

{ @MOV8, 8b `@OP28, } \MOV is opcode 8b then call common @OP2 for 64-bit\

\pull args from stack then write opcode\
{ @OP28, :$op :$imm :$rm :$r `REX8, .$op ,1 `@MODrm, }
\generate 64-bit REX prefix\
{ REX8, .$r 1 >> 4 & .$rm 3 >> + 48 + ,1 }
\generate MODrm byte and offset in [r] [r+imm8] [r+imm32] cases\
{ @MODrm, .$imm .@MODrm_0, if0 `#1? .@MODrm_4, if!0 40 :$mod `MODrm, .$imm ,1 }
{ @MODrm_0, 0 :$mod `MODrm, drp; }
{ @MODrm_4, 80 :$mod `MODrm, .$imm ,4 drp; }
\write out MODrm byte\
{ MODrm, .$r 7 & 8 * .$rm 7 & + .$mod + ,1 }
\check if immediate is 8-bit\
{ #1? .$imm dup s1>2 sub }

Typically the selection of builtin ops for the front-end would match whatever makes it easy for the programmer to write code. The builtin ops in this case work as follows,

2 ,1 \append 2 as one byte to the memory block of the executable\
2 ,4 \append 2 as one 32-bit word to the memory block of the executable\
.a .b >> \pop a and b from stack as 64-bit integers, push a>>b (shift right)\
.a .b & \... a&b (and)\
.a .b + \... a+b (add)\
{ bob \some defined symbol\ }
.a .bob if0 \if a==0 then call bob\
.a .bob if!0 \if a!=0 then call bob\
drp; \drop top value of return stack, later returns to caller's caller\
s1>2 \sign extend the top of data stack from 1 byte to 2 bytes\

The process of writing code for this kind of system becomes more an exercise of generating the words (or symbols) of the language a programmer would like to use to solve a problem, then writing sentences and paragraphs of code. For instance here is an example implementation of a simple data stack,

0 :stk \stack top pointer in RAX\
{ Stk+ .stk 8 `#ADD8, }
{ Stk- .stk 8- `#ADD8, }
{ @Stk0 \reg\ .stk 0 `@MOV8, }
{ @Stk1 \reg\ .stk 8- `@MOV8, }
{ !Stk \reg\ .stk 0 `!MOV8, }

And a simple program to implement something which takes two 64-bit values off the stack, adds them, and pushes the result (all using the above interface),

1 :stk \later decided data stack pointer is in RCX\
.tempA `@Stk0 .tempB `@Stk1 .tempA .tempB `ADD8, `Stk- .tempA `!Stk

This is only starting to scratch the surface of how such a front-end can be used to generate code...


GCN and Wavefront Occupancy

References: GCN Architecture Whitepaper | Sea Islands ISA

A Compute Unit (CU) is partitioned into four separate SIMD units.
Each SIMD unit has the capacity of 1 to 10 wavefronts.
Once launched, wavefronts do not migrate across SIMD units.
The CU can decode and issue 5 instructions/clk for one SIMD unit.
It takes 4 clocks to issue across all four SIMD units.
The 5 instructions need to come from different wavefronts.
The 5 instructions need to be of different types.
Each SIMD unit has 512 scalar registers.
Each SIMD unit has 256 vector registers.
Each CU has 64KB of LDS space.

Occupancy Table
Waves = Wavefronts per SIMD unit (4 SIMD units/CU)
Scalar = Maximum number of scalar registers/wavefront
Vector = Maximum number of 64-wide vector registers/wavefront
LdsW/I = Maximum amount of LDS space per vector lane per wavefront in 32-bit words
Issue = Maximum number of instructions which can issue per clock

Waves Scalar Vector LdsW/I Issue
1____ 128___ 256___ 64____ 1
2____ 128___ 128___ 32____ 2
3____ 128___ 85____ 21____ 3
4____ 128___ 64____ 16____ 4
5____ 102___ 51____ 12____ 5
6____ 85____ 42____ 10____ 5
7____ 73____ 36____ 9_____ 5
8____ 64____ 32____ 8_____ 5
9____ 56____ 28____ 7_____ 5
10___ 51____ 25____ 6_____ 5

Shaders can easily get issue limited when the number of wavefronts becomes small. Without at least 3 wavefronts per SIMD unit, the device cannot tri-issue a {vector, scalar, and memory} group of operations. This will be further limited by the number of wavefronts which are blocked during execution (say because they are waiting on memory). Abusing instruction level parallelism and increasing register pressure at the expense of occupancy can result in low ALU utilization.


Related to Filtering for VR

Ideas for high quality VR warp and reconstruction filtering for higher-end GPUs:

(1.) Instead of applying a resolve or filter before the warp, apply during the warp.

(2.) Start with a ray-tracing example with temporally changing stratified sampling. For all samples before the warp, use a pre-pass to compute and then store out the screen position for the sample after the warp. Then when filtering for the warp, fetch this position per sample in the reconstruction filter and use it to compute the weighting for the filter kernel. Weight is a function of the distance from the post-warp sample position to the center of the back-buffer pixel. Note one can also rotate and scale the offset before computing the distance (and weight) to reshape a circular kernel into an ellipse aligned to the gradient of curvature of the warp. The core optimization here is to avoid doing the warp computation many times per output pixel.

(3.) Post-warp screen position output needs sub-pixel precision for quality filtering. If using FP16 to store position, make sure to use a signed pixel offset with {0,0} at the center of the eye. This doubles precision, and places the high precision area in the center of the eye's view where it is needed the most. Remember FP16 represents -2048 to +2048 exactly, but -4096 to +4096 with a gap of 2. So process the left and right eyes separately when filtering. This way the worst precision, which is 1/2 out from the center of each view, is steps of 960/4096 pixels for the DK2.

(4.) The reconstruction filter can do the inverse warp computation once to get a location in the source framebuffer, then use fixed immediate offsets from that position to do fast sampling of a large fixed pattern. Use more than four samples; 12 samples is a good starting point.

(5.) For raster based engines with fixed sample positions, warp and kernel weights can be pre-computed. Per back-buffer pixel, compute the source framebuffer pattern center offset. Then store out weights for all the samples in the pattern. These weights will vary per pixel, and can even be compressed. Compute the maximum weight per sample in the fixed pattern for all pixels in the back buffer. Use this per sample maximum as a constant to scale the sample weight before packing.

(6.) The Oculus DK1 can support higher resolution in X by computing filtering weights separately for {red,green,blue}. Not sure what to do yet with the display of the DK2.

(7.) Correcting for chromatic aberration requires computing separate filter weights for {red,green,blue}. However towards the center of the eye, the sample pattern can remain constant. One can use different shader source based on the distance of a screen tile from the eye. Tiles at the edges of the screen, for instance, need different sample patterns.

(8.) If ray tracing (or ray marching) with the Oculus (I ray march at home), use temporally stratified sampling which is evenly distributed in the warped space. Do not just shoot rays in a regular grid.

(9.) If forward rendering, especially if on NVIDIA hardware, instead of rendering higher resolution, render with MSAA and some amount of super-sampling on. In OpenGL this can be done using the glMinSampleShadingARB() extension. This can enable for instance 4xMSAA shading at 2 samples per pixel, or 8xMSAA shading at 2 or 4 samples per pixel. The end result of this is a better sample pattern distribution for the resolve filter. Also it enables a reduction in shading rate compared to regular grid super-sampling.

(10.) It is possible to up-sample during the reconstruction filter on devices which cannot do super-sampled shading at good frame rates. In this case it is still a good idea to use the MSAA trick for a better base sampling pattern.

(11.) Temporal filtering is very hard and beyond the scope of this blog post. It is possible to use temporal filtering for either jittered AA or noise reduction with stratified sampling and a ray traced (or marched) engine.

(12.) Crazy optimization which I haven't tried yet. Could work with a pair (or more) of output pixels during the reconstruction filtering pass. Have the pair share the same slightly larger fixed pattern of input samples. This better amortizes the cost of the inverse warp computation, and the cost of fetching all the samples. Might increase the number of samples/pixel which can be done in the same amount of time. Output of the pixel shader in this case would be surface stores instead of ROP.

(13.) Continuing that line of thought, in the pre-computed weights situation, one can reuse the weights for both the left and right eye. In this case, a pixel's pair is the mirror across the center of the back-buffer. Surface stores in this case would be coherent...

Brigade 3.0 Preview on Youtube


New Laptop + VR = Fail

Talked to a tech at a custom laptop builder today. Heard some disturbing news. If I wanted a single dGPU laptop which had ability to scan-out from the dGPU, I was limited to Asus models G750JW/G750JX/G750JH, all of which support NVIDIA GPUs only and definitely not Maxwell (they said all of the current Maxwell GPUs are Optimus only). All other single dGPU laptops they carry with NV or AMD only have scan-out from the iGPU.

From the IHV perspective this lowers cost: the dGPU does not need to burn area and power on display hardware which gets duplicated. Makes sense for the average case. Except my target is semi-portable VR, and I cannot afford the latency and variability of the dGPU to iGPU back-buffer copy.

If I'm reading the specs right, MXM (mobile PCIe) 3.0 can only get 8 GB/s max, and MXM 3.1 tops out at 16 GB/s. Not sure what laptops actually get in practice. At 8 GB/s, the time for a 1080p frame transfer would be around 1ms. At 120Hz that is around 1/8 of the frame time. Typically the last draws into the back buffer are blending operations for the UI, and not something which one would want to do across PCIe. Blending operations on modern cards are the wrong way to do things anyway (easily export limited). An alternative would be to have the last drawing operation be a full screen triangle composite pass which would apply some amount of post processing in conjunction with adding the UI. Just 1ms on say a GTX 880M at 1080p has capacity for maybe around 60 tex/pixel, 80 bytes/pixel, and 1.4 Kflops/pixel. It might be hard in some engines to fully utilize the GPU on a PCIe store limited final post process pass taking 1ms. However what I typically do involves heavy filtering which can saturate the GPU in this case.

From my understanding, the bigger problem is that the driver "handles" it for you in the worst way possible. Currently I know of no interface which enables the application to get a texture handle to the physical CPU-side back buffer in the iGPU+dGPU situation. My understanding is that instead there is some CPU/GPU sync point which enables CPU side code to then schedule the GPU->CPU copy, which then adds a long variable latency.

NVScene Demos