Parallel Noise Generation

Re a related Twitter Post ...

Concerning making tile-able textures for grain or noise, and getting various desired properties: my preference is towards algorithms which parallelize trivially. The shadertoy referenced in the tweet generates a noise pattern by starting out with some poor non-random noise, then applying filters (ending with a high-pass) to transform it into something pleasing to the eye. I'd advise always finishing with the technique I outlined in this GPUOpen post, which remaps the texture to a perfect distribution of values (see the follow-up post as well) while maintaining its original form (this applies to both techniques in this post).
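A minimal sketch of the remap idea (my own illustration, not the GPUOpen implementation; names are mine): sort pixels by value, then replace each with its rank scaled into [0,1]. That forces a perfectly flat histogram while preserving the ordering, and therefore the form, of the original noise.

```c
#include <stdlib.h>

// Rank-based remap: replace each pixel with its rank scaled into [0,1],
// giving a perfectly flat histogram while preserving the ordering (the
// "form") of the original noise. Ties break by index, so the output is
// always an exact permutation of the uniform ramp.
typedef struct { float value; int index; } PixelRank;

static int PixelRankCompare(const void* a, const void* b) {
  const PixelRank* pa = (const PixelRank*)a;
  const PixelRank* pb = (const PixelRank*)b;
  if (pa->value < pb->value) return -1;
  if (pa->value > pb->value) return 1;
  return pa->index - pb->index;
}

// In-place remap of 'count' pixels to a perfect uniform distribution.
void RemapToUniform(float* pixels, int count) {
  PixelRank* ranks = (PixelRank*)malloc(sizeof(PixelRank) * (size_t)count);
  for (int i = 0; i < count; ++i) { ranks[i].value = pixels[i]; ranks[i].index = i; }
  qsort(ranks, (size_t)count, sizeof(PixelRank), PixelRankCompare);
  for (int i = 0; i < count; ++i)
    pixels[ranks[i].index] = (float)i / (float)(count - 1); // perfect ramp in [0,1]
  free(ranks);
}
```

On the GPU the same effect can be had with a parallel sort plus scatter; the CPU version above is just the shortest way to state the idea.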

A second technique I've leveraged in the past is to work with one {x,y} coordinate for a grain position distributed in a regular grid array of grains. Starting with the perfect honeycomb distribution (start from a regular grid position, where every other row is shifted to the left or right, and make the grid have proper aspect ratio for honeycomb),


Then permuting grain position by some function (which could be a noise function with various distributions based on frequency, or perhaps some kind of clustered rotation of points by nearest cluster, etc). This process typically results in an undesired look. Then applying various passes on the array where the position of each grain is filtered against the positions of its pre-filtered neighbors (dependent only on the prior pass). The point being to re-shape the array into something which has a more visually pleasing feel. The filter can work with a hex neighborhood (the neighbors depend on whether the grain is on an even or odd row),

_x_x_ ... ab_ ... _ab
x_x_x ... c*d ... c*d
_x_x_ ... ef_ ... _ef

The filter could be something as simple as relaxing the position of each point (push the point towards being equidistant from its neighbors, but not so much that the array resets to a honeycomb). After getting grains distributed as desired, the result can be used directly as {x,y} coordinates, or transformed back into an image of grain (which could be a different resolution image).
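A hedged sketch of the grid setup and one relaxation pass (names and the exact filter are illustrative, not from any shipped implementation). The honeycomb shifts every other row by half a cell and spaces rows by sqrt(3)/2, so all six neighbors start equidistant; the pass then nudges each grain toward the centroid of its prior-pass neighbors:

```c
#include <math.h>

// Honeycomb grain grid: every other row shifts by half a cell, with row
// spacing scaled by sqrt(3)/2 so neighbor distances match.
typedef struct { float x, y; } Grain;

void HoneycombInit(Grain* g, int cols, int rows, float spacing) {
  float rowStep = spacing * (float)sqrt(3.0) * 0.5f;
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) {
      g[r * cols + c].x = ((float)c + ((r & 1) ? 0.5f : 0.0f)) * spacing;
      g[r * cols + c].y = (float)r * rowStep;
    }
}

// One relaxation pass: nudge each grain toward the centroid of its hex
// neighbors from the PREVIOUS pass (src), writing into dst, so each pass
// depends only on the prior one. Keep 'amount' small so the array does not
// snap back to a perfect honeycomb. Requires cols >= 2 and rows >= 2.
void RelaxPass(const Grain* src, Grain* dst, int cols, int rows, float amount) {
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) {
      // Which diagonal columns are neighbors depends on row parity
      // (odd rows are the ones shifted right by half a cell).
      int d = (r & 1) ? 0 : -1;
      int nc[6][2] = { {c-1,r},{c+1,r},{c+d,r-1},{c+d+1,r-1},{c+d,r+1},{c+d+1,r+1} };
      float cx = 0.0f, cy = 0.0f; int n = 0;
      for (int i = 0; i < 6; ++i) {
        int x = nc[i][0], y = nc[i][1];
        if (x < 0 || x >= cols || y < 0 || y >= rows) continue;
        cx += src[y * cols + x].x; cy += src[y * cols + x].y; ++n;
      }
      const Grain* p = &src[r * cols + c];
      dst[r * cols + c].x = p->x + amount * (cx / (float)n - p->x);
      dst[r * cols + c].y = p->y + amount * (cy / (float)n - p->y);
    }
}
```

Note that an interior point of a perfect honeycomb is already at its neighbors' centroid, so the pass only moves grains that the permutation step displaced; this trivially parallelizes since each output grain reads only prior-pass data.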


The Great MacOS 9

This "An OS 9 odyssey: Why these Mac users won’t abandon 16-year-old software" is an awesome article. OS 9 was the peak of Apple operating systems. Low latency, instant response. If only the industry didn't fabricate internet "standards" complexity at a rate which the older machines could never dream of supporting.


Thinking "Clearly" About 4K

The real advantage of this console "upgrade cycle" is that now a developer should be able to produce a good 1080p @ 60 Hz game without losing pixel quality.

However what is likely going to happen instead is 1st party titles under 4K marketing pressure will try for 4K, accepting a reduction of pixel quality, and get stuck at 30Hz. Games built around 30Hz won't scale to 60Hz due to CPU loads, etc, and we don't get 60Hz this round in the general case. Instead games will generally offer some amount of extra super-sampling for 1080p on upgrade consoles.

Beating the System
If I were developing a console game for the "upgrade generation", I'd take advantage of 4K by dropping render resolution to 960x540 (yeah) in the game, then use a stylized CRT up-sampler. The advantage of 4K: 540p shader-scaled to 2160p has 4 lines per rendered line, so it is possible to have the scan-line effect without losing as much brightness (three mostly solid lines with one 50% darker gap line is only a 1/8 drop in total brightness). 540p@60Hz on a PS4 Pro is 135 Kflop/pixel, or roughly a 4.6x scaling of perf/pixel from a 1080p@30Hz PS4 title. That would be a transformative visual upgrade.


Majority of devs will render way under 4K,
adopting up-scaling temporal AA to get to 4K,
and will stay with 30Hz.

The why,

Xbox 360 was a 720p@30Hz target with 9 Kflop/pix.
Many developers did less than 720p rendering on the 360.

Xbox One is a 1080p@30Hz target with 19 Kflop/pix.
So roughly double Xbox 360 perf/pix, a transformative change.
Many developers continue to do less than 1080p rendering on Xbox One.

PS4 is a 1080p@30Hz target with 29 Kflop/pix.
I'm going to take this as the baseline Kflop/pix requirement 
for a current generation visual standard.

PS4 PRO in my opinion is a good 1080p@60Hz target with 34 Kflop/pix,
as that maintains the PS4's perf/pixel standard.
To hit 4K@30Hz is 17 Kflop/pix, 
which is a downgrade to an Xbox One perf/pix level.

Xbox Scorpio, using a 6.0 Tflop number, at 4K@30Hz is 24 Kflop/pix, 
which does *not* hit the PS4 perf/pix quality standard,
but does maintain an Xbox One perf/pix level.
So as with 360 and XB1, devs will continue to render under native resolution.


Using "Google Supplied" numbers for Tflops,

 360 -> .24 Tflops [Xbox 360    ]
 XB1 -> 1.2 Tflops [Xbox One    ]
 PS4 -> 1.8 Tflops [PS4         ]
 PRO -> 4.2 Tflops [PS4 Pro     ]
 SCO -> 6.0 Tflops [Xbox Scorpio]

And looking at flop/pix in units of 1000,

                                    360   XB1   PS4   PRO   SCO
 ================================= ===== ===== ===== ===== =====
  960 x _540 @  30 Hz =  16 Mpix/s   15    77   116   270   386
 1280 x  720 @  30 Hz =  28 Mpix/s    9    43    65   152   217
  960 x  540 @  60 Hz =  31 Mpix/s    8  __39_   58   135   193
 1280 x  720 @  60 Hz =  55 Mpix/s    4    22    33    76   109
  960 x  540 @ 120 Hz =  62 Mpix/s    4    19    29    68    96
 1920 x 1080 @  30 Hz =  62 Mpix/s    4    19  __29_   68    96
 1280 x  720 @ 120 Hz = 111 Mpix/s    2    11    16    38    54
 1920 x 1080 @  60 Hz = 124 Mpix/s    2    10    14  __34_ __48_
 1920 x 1080 @ 120 Hz = 249 Mpix/s    1     5     7    17    24
 3840 x 2160 @  30 Hz = 249 Mpix/s    1     5     7    17    24
 3840 x 2160 @  60 Hz = 498 Mpix/s    0     2     4     8    12
 3840 x 2160 @ 120 Hz = 995 Mpix/s    0     1     2     4     6
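The table arithmetic is just peak flops divided by pixels per second. A tiny helper (my own, not from any SDK) reproduces the entries from the Tflop list above:

```c
// Kflop per pixel: peak flops divided by pixels per second, rounded to the
// nearest thousand flops, matching the table entries above.
static int KflopPerPix(double tflops, int w, int h, int hz) {
  double pixPerSec = (double)w * (double)h * (double)hz;
  return (int)(tflops * 1.0e12 / pixPerSec / 1000.0 + 0.5); // round to Kflop
}
```

For example, PS4 at 1080p@30Hz is KflopPerPix(1.8, 1920, 1080, 30), which lands on the 29 Kflop/pix baseline used above.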


Transistor Count Thoughts

Wikipedia's Transistor Count Page
Really interesting page on Wikipedia. Amazing how many of the original cache-free Acorn RISC Machines will fit in the transistor budget of modern processors. The rest of this post is some high level thinking about the compromises required to scale ALU density upwards by simplifying and shrinking core size down to something sized like an ARM2.

_____________________ARM2 ~ _______30,000 transistors ~ ______1 ARM2
____________________80386 ~ ______275,000 transistors ~ ______9 ARM2
__________________Pentium ~ ____3,100,000 transistors ~ ____103 ARM2
____________1st Pentium 4 ~ ___42,000,000 transistors ~ ____800 ARM2
_____________________Cell ~ __241,000,000 transistors ~ __8,033 ARM2
_____________Apple A8 SOC ~ 2,000,000,000 transistors ~ _66,666 ARM2
22-core Xeon Broadwell-E5 ~ 7,200,000,000 transistors ~ 240,000 ARM2
________________AMD FuryX ~ 8,900,000,000 transistors ~ 296,666 ARM2

Dividing the 512 GB/s of external bandwidth for FuryX across a variable number of ARM2-sized cores clocked at 1 GHz suggests that, as on-chip ALU count scales beyond what a GPU can do, cores must mostly consume on-chip generated data. Also, given that GPU on-chip routing networks typically have only some small integer multiple of off-chip bandwidth, this suggests not the classic GPU formula for production and consumption of on-chip data (meaning not routing through some coherent L2, but rather more neighbor to neighbor, or very localized).

___1 ARM2 ~ 512 GB/s ~ 512 B/op
__16 ARM2 ~ _32 GB/s ~ _32 B/op
_256 ARM2 ~ __2 GB/s ~ __2 B/op
__4K ARM2 ~ 128 MB/s ~ __8 op/B <--- FuryX is a 4K core GPU
_64K ARM2 ~ __8 MB/s ~ 128 op/B
256K ARM2 ~ __2 MB/s ~ 512 op/B
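The split in the table is back-of-envelope arithmetic (my helper, units as in the table): each core at 1 GHz retires roughly 1e9 ops/s, so dividing 512 GB/s evenly gives bytes-per-op of 512/N, flipping to ops-per-byte once N exceeds 512:

```c
// 512 GB/s split across N ARM2-sized cores at 1 GHz (~1e9 op/s each):
// bandwidth per core in GB/s equals bytes available per op.
static double BytesPerOp(int cores) { return 512.0 / (double)cores; }
// The inverse: ops each core must execute per byte of off-chip traffic.
static double OpsPerByte(int cores)  { return (double)cores / 512.0; }
```

At the FuryX-like 4K-core point this is already 8 ops per byte, which is why the text concludes cores must mostly consume on-chip generated data.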

Below, looking at this from another perspective: take 6 G transistors for SRAM cells, divide into N cores, and look at the limit of SRAM bytes per core (counting nothing beyond 6 transistors per bit in this approximation). If one wanted to scale to massive numbers of simple small cores, the amount of on-chip memory per core would be tiny. This suggests that sharing of instruction RAMs and on-chip memories becomes one of the major design challenges.

___1 ~ 128 MB
__16 ~ __8 MB
_256 ~ 512 KB
__4K ~ _32 KB
_64K ~ __2 KB
256K ~ 512 B
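The same approximation in code (mine; binary units assumed, matching the 128 MB top line): 6 G transistors at 6 transistors per bit is 2^30 bits, or 128 MiB total, divided evenly across cores:

```c
// SRAM budget per core: 6 G transistors of 6T SRAM = 2^30 bits = 128 MiB
// total on chip, split evenly across N cores (binary units assumed).
static long long SramBytesPerCore(int cores) {
  long long totalBits = 6LL * (1LL << 30) / 6; // 2^30 bits
  return totalBits / 8 / cores;                // bytes per core
}
```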

This next table looks at 64K cores clocked at 1 GHz running at 64 frames/second, or roughly 1024 Gop/frame. Then take that 1024 Gop/frame budget divided by the number of instructions fetched from off-chip memory per frame, giving a rough idea of the level of instruction reuse required. The table tops out at 4 G instructions: at a 16-bit instruction width, that would fully utilize the 512 GB/s of off-chip bandwidth.

__4 G instructions ~ __256 usage average/instruction ~ ___full usage of off-chip bandwidth
128 M instructions ~ __8 K usage average/instruction ~ ___1/32 usage of off-chip bandwidth
__4 M instructions ~ 256 K usage average/instruction ~ _1/1024 usage of off-chip bandwidth
128 K instructions ~ __8 M usage average/instruction ~ 1/32768 usage of off-chip bandwidth

Suggests that the majority of program workflow must traverse similar code paths, either through SIMD or looping or something else. Another important aspect to this problem is looking at random access (unique) vs broadcast (same) for filling instruction RAMs. Starting with assuming instruction RAMs are not shared across cores, and cores are running random programs (non-SIMD).

4 Ginst/frame / 64 Kcores = 64 Kinst/core/frame ~ ___full usage of off-chip bandwidth
_____________________________8 Kinst/core/frame ~ ____1/8 usage of off-chip bandwidth
_____________________________1 Kinst/core/frame ~ ___1/64 usage of off-chip bandwidth
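The reuse numbers above come from one division (helper is mine; "G" and "M" taken as decimal here, so the table's "8 K" is the rounded 8000): 64K cores times 1e9 ops/s over 64 frames/s is 1024 Gop/frame, divided by unique instructions fetched from off-chip per frame:

```c
// Average required reuse per fetched instruction: the per-frame op budget
// (64K cores * 1e9 op/s / 64 frames/s = 1024 Gop/frame) divided by the
// number of unique instructions pulled from off-chip memory per frame.
// At 16-bit instructions, 4e9 inst/frame * 2 B * 64 frames/s = 512 GB/s,
// i.e. full off-chip bandwidth.
static double ReusePerInstruction(double instPerFrame) {
  double opsPerFrame = 65536.0 * 1.0e9 / 64.0; // 1024 Gop/frame
  return opsPerFrame / instPerFrame;
}
```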

If broadcast is not used, programs need to be pinned to a given core across multiple frames. This suggests that as cores/chip increases, broadcast update of on-chip RAMs is critical. Meaning, if supporting unique control paths per core, the window of code must be the same across many cores.

These tiny poor estimations of large scale effects paint a very clear picture of why GPUs are SIMD machines with SIMD units clustered and sharing memories. I'm personally interested in figuring out what comes after the GPU, meaning what does to the GPU what the GPU did to the CPU in terms of ALU density on a chip. This post talks in terms of the classic model of an ALU connected to a memory, fetching operands and sending results back into the memory. Perhaps we are at the point where the next form of scaling requires leaving that model behind?


GPU Parking Lot

Push Model
Perhaps the GPU parking lot, aka the register file waiting on long latency returns, is a side effect of not having the ability to issue a load which pushes data to a different SIMD unit's register file? If loads could be issued and returned somewhere else, one could possibly split a problem into 2 components: the part figuring out how to route memory traffic, and the part consuming the memory traffic. No call and return, thus no parking of state after loads.


Uber Shader Unrolling

Looking at running a compute only pass, no graphics waves to contend with on the machine, so it becomes relatively easy to think about occupancy. Target 4 waves per SIMD via a 4 wave work-group (one work-group per SIMD unit). That provides 4 work-groups sharing a Compute Unit (CU) and L1 cache. This is only 16/40 occupancy, but enough in theory to maintain a good amount of multi-issue of waves to functional units. Each wave gets 64 VGPRs, each work-group gets 16KB LDS (16 32-bit words/invocation on average).

In this fixed context, one can leverage compile-time unrolling to manage variable register allocation for different sub-shaders in an uber-shader. Unrolling as in running more than one instance of the shader in the uber-shader at a given time.

Unroll 2 in parallel = 32 VGPRs, 8 words LDS
Unroll 3 in parallel = 21 VGPRs, 5 words LDS
Unroll 4 in parallel = 16 VGPRs, 4 words LDS
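The unroll arithmetic above is just the fixed per-wave budget divided by instance count (helpers are mine; the 64 VGPR / 16 LDS-word figures come from the occupancy target in the preceding paragraph):

```c
// Fixed budget per wave from the occupancy target: 64 VGPRs per wave and
// 16 LDS words per invocation, split across N shader instances unrolled
// in parallel (integer division, matching the table).
static int VgprsPerInstance(int unroll)    { return 64 / unroll; }
static int LdsWordsPerInstance(int unroll) { return 16 / unroll; }
```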

But it doesn't have to be this rigid. One can have variable blending of N parallel instances: as register usage starts to drain from one instance, start ramping up another. This also enables instances to share intermediate computations.

This more "pinned task" model with unrolling would in theory, in some cases (maybe really short shaders like particle blending), allow better utilization of the machine than separate kernel launches for everything. During shader start, as the shader ramps up, the VGPRs allocated to it are under-utilized. During ramp down towards exit, VGPRs are also under-utilized. Unrolling can blend the fill and drain.

Clearly there is also a question of unrolling out of instruction cache.


Vulkan From Scratch Part 2

Continuing posting when I find some time to work on the "from-scratch" Vulkan engine...

Review From Last Time
Bringing up on Windows first this time; will get to Linux later. Got a basic system interface on Windows without non-system libraries. Still have {gamepad, audio, usb} interfaces to bring up. Got a basic Vulkan window rendering a full-screen compute PSO into the back-buffer. Switched to always using UNDEFINED layout as the source for the back buffer, to remove a permutation in the command buffer baking. Have the new simplified rapid prototyping working.

Mechanics of Batch File
No "make" system, just a simple shell script to run the development cycle per platform. I include the source of all the helper programs inside the single source file, then comment them out in the shell script unless I need to recompile them, using #define's to control what actually gets compiled. Two helpers: "glsl.exe" takes the single source file and prefixes it with "#version 450" for use as GLSL shader source, outputting to "tmp.comp". Then "head.exe" converts the "comp.spv" output of "glslangValidator.exe" to a header file. The argument to both is the shader ID, for which I use a two digit number. The shell script compiles shaders, cleans up temporaries, compiles the program (aka "rom.exe"), runs the program, and depending on exit code relaunches or quits the shell script. The script below has only one shader currently,

@echo off
@rem cl /nologo /O1 /Oi /Os /Oy /fp:fast /DCPU_ /DGLSL_ /Feglsl.exe /Tprom.c
@rem cl /nologo /O1 /Oi /Os /Oy /fp:fast /DCPU_ /DHEAD_ /Fehead.exe /Tprom.c
:loop
glsl.exe 00
glslangValidator.exe -V tmp.comp
head.exe 00
del tmp.comp
del comp.spv
cl /nologo /O1 /Oi /Os /Oy /fp:fast /DCPU_ /DGAME_ /Ferom.exe /Tprom.c
rom.exe
if /i %ERRORLEVEL% equ 0 goto :eof
goto :loop

Mechanics of Graphics Abstraction Interface
This should be thought of as work in progress as it will change during bring up. I'm posting this because it shows just how simple a Compute-Generates-Graphics style engine can be in Vulkan. There are exactly 2 Descriptor Sets, one for the even and one for the odd frames. They contain everything, so no need to ever think about "binding" anything.


// Initialize the Vulkan interface.
// Inputs are,
//  (1.) Descriptor pool setup.
//  (2.) Descriptor set layout.
{ S_ VkDescriptorPoolSize count[1]={
  S_ VkDescriptorSetLayoutBinding bind[1]={
  GfxInit(count,1,bind,1); }

// Used to compile PSOs (this eventually will be parallelized when necessary).
// Specialization constants.
// This example passes in the size of the frame buffer.
F4 con[]={ F4_(wndR->x), F4_(wndR->y) };
// Include the shader source files.
#include "s00.h"
// Generate PSOs.
U8 pso[1];

// Build the baked command buffers.
// Leaving a lot out here, will return later when this is cleaned up more.
U8 cmd=GfxBegin(cmdIdx);

// Loop forever replaying the groups of {even, odd} command buffers.

Specialization Constants
Absolutely great feature to have in the API. Set up constants which can have overrides set at PSO compile-time. Enables factoring out evaluation of various expressions to compile-time, setting array sizes, etc. Perfect for my setup, as I can just pass in the frame size (and later other important things). Translates to this in the current dummy bring-up shader,

 // Part of what I use for cleaner types...
 #define F4 float
 #define F4x2 vec2
 #define F4x3 vec3
 #define F4x4 vec4

 // Specialization constants with default values.
 layout(constant_id=0) const F4 SCRX_=1920.0;
 layout(constant_id=1) const F4 SCRY_=1080.0;

 // Bind "everything", which thus far is just the back buffer.
 layout (set=0,binding=0,rgba8) writeonly uniform image2D img[1];

 // Showing a specific shader, in this case the dummy shader for bring up.
 #ifdef S00_
  layout (local_size_x=16,local_size_y=16) in;
  void main() { imageStore(img[0],S4x2(gl_GlobalInvocationID.xy),
   F4x4(F4(gl_GlobalInvocationID.x)*(1.0/SCRX_),F4(gl_GlobalInvocationID.y)*(1.0/SCRY_),1.0,1.0)); }

What's Next
Setting up a double buffered SSBO for the even/odd frames for game state. Each frame reads from the prior frame's game state and builds the new frame's game state. Everything updated GPU-side via compute dispatches. Yes, burning a whole wave to do some scalar processing once in a while (like computing the new player position and view-matrix, etc), but in practice the scalar work is in the noise in terms of run-time, so it doesn't matter.

Then on to simple CPU->GPU uploads (for late latched IO) and GPU->CPU downloads (for saving current state, etc)...


Blink Mechanic for Fast View Switch for VR

As seen in the SIGGRAPH Realtime section for Bound for PS4 PSVR around 42 minutes in this video. Great to see someone making good use of the "blink mechanic" to quickly switch view in VR. Scene quickly transitions to black, to simulate eyelid closing, followed by fading back to a new view, simulating eyelid opening.

My recommendation for the ideal VR display interface used this mechanic. Specifically "blink" to hide the transition of exclusive ownership of the Display between a "Launcher / VR Portal" application and the game. The advantages of exclusive app-owned Display for VR on PC would have been tremendous. For instance it then becomes possible to,

(1.) Fully saturate the GPU. No more massive sections of GPU idle time.

(2.) Render directly to the post-warped final image for a 2x reduction in shaded samples for Compute-generated non-triangle-based rendering, and pixel perfect image quality.

(3.) Factor out shading to Async Compute, and only generate the view right before v-sync. Rendering just in time is better than time-warp: no more incorrect transparency, no more doubling of visual error for dynamic objects which are moving differently than the head camera tracking.

(4.) Race the beam for the ultimate in low latency.

(5.) Great MGPU scaling (app owned display cuts MGPU transfer cost by 4x).

(6.) Have any GPU programming API, even compute APIs, be able to work well in VR without complex cross-process interop.

(7.) Etc.

Ultimately no one in the PC space implemented this, and thus all my R&D on the ultimate VR experience got locked out and blocked by external-process VR compositors, pushing me personally out of VR, and back to flat 3D where I can still actually push the limits, with good frame delivery, without artifacts, and with perfect image quality.


Vulkan - How to Deal With the Layouts of Presentable Images

EDITS: Turns out it is possible to make the PRESENT source into an UNDEFINED source, so the first use permutation is only needed for apps which read from the prior presented output. Leaving the rest as originally written ...

Continuing my posts on building a Vulkan based Compute based Graphics engine from scratch (no headers, no libraries, no debug tools, no using existing working code)...

Interesting to Google something and already get hits on Vulkan questions on Stack Overflow - How to Deal With the Layouts of Presentable Images. Turns out one of the frustrating aspects of Vulkan is the WSI or presentation interface. Three specific things make this a pain in the butt, quoting from the Vulkan spec.

(1.) "Use of a presentable image must occur only after the image is returned by vkAcquireNextImageKHR, and before it is presented by vkQueuePresentKHR. This includes transitioning the image layout and rendering commands."

(2.) "The order in which images are acquired is implementation-dependent. Images may be acquired in a seemingly random order that is not a simple round-robin."

(3.) "Let n be the total number of images in the swapchain, m be the value of VkSurfaceCapabilitiesKHR::minImageCount, and a be the number of presentable images that the application has currently acquired (i.e. images acquired with vkAcquireNextImageKHR, but not yet presented with vkQueuePresentKHR). vkAcquireNextImageKHR can always succeed if a<=n-m at the time vkAcquireNextImageKHR is called. vkAcquireNextImageKHR should not be called if a>n-m ..."

The last part (3.) roughly translates to: you are not guaranteed the ability to acquire all images at any one time. Putting all these problems together,

(A.) No way to robustly pre-transition all images into VK_IMAGE_LAYOUT_PRESENT_SRC_KHR before entering normal run-time conditions. Instead you have to special case the transition the 1st time acquire returns a given index. IMO this adds unnecessary complexity for absolutely no benefit, and makes it really easy to introduce bugs. I've seen online Vulkan examples violate rule (1.).

(B.) No way to ensure a simplified round-robin order even in cases where it is physically impossible to get anything other than round-robin (such as full-screen flip with v-sync on and a 2 deep swap chain).

Working Around the Problem
This problem infuriates me personally because of all the wasted time required to add complexity for no benefit. Likewise being forced to double buffer instead of front buffer is also a large waste of time for a regression in latency. Since my engine is command buffer replay based (no command buffers are generated after init-time), I ended up needing 8 baked command buffer permutations.

(1.) Even frame pre-acquire.
(2.) Odd frame pre-acquire.
(3.) Even frame post-acquire image index 0.
(4.) Even frame post-acquire image index 1.
(5.) Odd frame post-acquire image index 0.
(6.) Odd frame post-acquire image index 1.
(7.) Transition from UNDEFINED to PRESENT_SRC for image index 0.
(8.) Transition from UNDEFINED to PRESENT_SRC for image index 1.

The workaround I have for needing to special case transitions on 1st acquire of a given index is to run the transition-from-UNDEFINED command buffer instead of the one which normally draws into the frame. So there is a possibility of randomly seeing one extra black frame after init time. This IMO is all throw-away code anyway once I can get some kind of front-buffer access.
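A sketch of the selection logic over the 8 baked command buffers (all names here are hypothetical, mine, not from the engine): indices 0-5 follow the list above, and 6-7 are the one-time UNDEFINED-to-PRESENT_SRC transitions submitted the first time each image index comes back from acquire, accepting the possible single black frame instead of baking more permutations:

```c
// The 8 baked command buffers, in the order listed in the post.
enum { CMD_EVEN_PRE, CMD_ODD_PRE,
       CMD_EVEN_IMG0, CMD_EVEN_IMG1, CMD_ODD_IMG0, CMD_ODD_IMG1,
       CMD_INIT_IMG0, CMD_INIT_IMG1 };

// Pick the post-acquire command buffer to submit this frame. On the first
// acquire of a given image index, submit the UNDEFINED->PRESENT_SRC
// transition buffer instead of the normal draw buffer (one possible black
// frame right after init, no extra special-case baking).
static int PickPostAcquire(int oddFrame, int imageIndex, int* seen) {
  if (!seen[imageIndex]) {
    seen[imageIndex] = 1;
    return imageIndex ? CMD_INIT_IMG1 : CMD_INIT_IMG0;
  }
  if (oddFrame) return imageIndex ? CMD_ODD_IMG1 : CMD_ODD_IMG0;
  return imageIndex ? CMD_EVEN_IMG1 : CMD_EVEN_IMG0;
}
```

The pre-acquire buffers (indices 0-1) are submitted unconditionally each frame before vkAcquireNextImageKHR, so they need no such selection.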

Interesting to look back at the bugs I had to deal with en route to getting the basic example of a compute shader rendering into a back-buffer. Really only 2 bugs. One: I forgot to call vkGetDeviceQueue(), which was trivial to find and fix. The other: when creating the swap chain I accidentally set imageExtent.width to the height and left imageExtent.height at zero. No amount of type-checking would ever help in finding that bug. Didn't see any errors, so it took a while of re-inspecting the code to see what I had screwed up.

In hindsight, after knowing what to do, using Vulkan was actually quite easy.


On Killing WIN32?

Many years ago I used to be a dedicated reader of Ars, but it slowly transitioned to something a little too biased for my taste, so I avoid it, but thanks to twitter, it is possible to get sucked into a highly controversial article: "Tim Sweeney claims that Microsoft will remove Win32, destroy Steam".

My take on this is quite simple. Everyone in this industry who has lived long enough to have programmed in the C64 era, has witnessed a universal truth on every mass market platform: the freedom and access to the computer by the user or programmer is reduced annually at a rate which is roughly scaling with the complexity of the software and hardware.

The emergent macro-level behavior is undeniable. Human nature is undeniable. It is possible to continuously limit freedom as long as it is done slowly enough that each micro-level regression of freedom falls under the instantaneous tolerance required to act. Translated: humans are lazy, humans adapt fast, and humans don't live long. Each new generation lacks the larger perspective of the last, and starts ignorant of what had been lost.

The reason why computers and freedom are so important is that computers are on a crash course to continue deeper and deeper integration with our lives. I believe ultimately humans will transcend the limits of our biology, blurring the lines between the mind and machine. Seems rather important at that apex to have the individual freedoms we have today, the privacy of our thoughts, etc.

In the short term, as a developer, I'm also somewhat concerned about whether the infants that will grow up to replace the generation I started in will have the same opportunities I had: the same ability to get access to the hardware, the freedom to implement their dreams, and, if they so choose, the ability to make a living doing so in a free market, controlling their own destiny, selling their own product, without a larger controlling interest gating that process.

WIN32 is one such manifestation of that freedom.

There are some very obvious trends in the industry specifically in the layers of complexity being introduced either in hardware or software. For example, virtualization in hardware mixed with more attempts to sandbox software. Or the increased distance one has to the display hardware. Look at VR, you as an application developer are locked out of the display, and have to pass through a graphics API interopt layer which does a significant amount of extra processing in a separate process. Or perhaps the "service-ication" of software to subscription models. Or perhaps the HDR standard removing your ability to control tone-mapping. Or perhaps it is just the complexity of the API which makes it no longer practical to do what was done before, even if it is still actually possible.

Following the trends to their natural conclusion perhaps paints a different picture for system APIs like WIN32. They don't go away per se, they just get virtualized behind so many layers that it becomes impossible to gain the advantages those APIs had when they were direct. That is one of the important freedoms which is eventually lost.

One of the best examples of this phenomenon is how the new generation perceives old arcade games. Specifically as, games with incorrect color (CRT gamma around 2.5 being presented as sRGB without conversion), giant exactly square pixels (never happened on CRTs), with dropped frames (arcade had crystal clear no-jitter on v-sync animation), with high latency input due to emulation in a browser for example (arcade input was instant in contrast), with more latency due to swap-chains added in the program (arcade hardware generated images on scan-out), with added high latency displays (HDTVs and their +100 milliseconds, vs instant CRTs), and games with poor button and joystick quality (arcade controls are a completely different experience). Everything which made arcades awesome was lost in the emulation translation.

Returning to the article, I don't believe there is any risk in WIN32 being instantly deprecated, because if that was to happen, it would be a macro-level event well beyond the tolerance level required to trigger action. The real risk is the continued slow extinction.


Simplified Vulkan Rapid Prototyping

Nothing simple about using Vulkan, so this title is a little misleading ...
Trying something new for my next Vulkan based at-home prototyping effort, building from scratch for 64-bit machines only. Building a simplified version of my prior rapid prototyping system. This version, on code change, instead of reloading a DLL, actually re-compiles and restarts the program. My theory is that restart time is going to be lower than the time it takes to recompile shaders. I'm not concerned with re-filling the GPU with baked data because I don't ever use much, and never have much non-runtime-regeneratable state either. The program is required, somewhat like a "save snapshot" game emulator, to be able to instantly restart to where it was running before (at the time of the last snapshot). This has some interesting advantages, like error handling becoming trivial: just exit the program and restart! For correct handling of things like VK_ERROR_DEVICE_LOST or VK_ERROR_SURFACE_LOST_KHR, just exit. No need to have two binaries (one for development, one for release), as I never use debug.

I've got only one source file, with #defines to enable keeping both GLSL and C code in the same file. Also I've got no includes, to optimize for compile time. Notice on Windows, "vulkan.h" ultimately includes "windows.h", for example to get the HWND and HINSTANCE types, so sans rolling your own version of the headers, the compile dips into the massive platform include tree. Re-rolling only what I need from the Vulkan headers is quite frankly a nightmare of work due to Vulkan verbosity, but should be mostly over soon. In the process I've also made an un-type-safe (yeah) version of the Vulkan API, returning to base system types, so I never have to bother with silly compile warnings. All handles are just 64-bit pointers, etc. It works great. I was beyond having type-safety bugs from birth, being brought up on assembly first. The bugs I have now are more like, "the last time I worked on this was a month ago, and I forgot to call vkGetDeviceQueue(), but already wrote code out-of-order using the queue handle". As any programmer, out of habit, I first blamed the driver, and ultimately realized that I was the idiot instead.

Part of the motivation for this design is out of laziness. Since Vulkan requires SPIR-V input, and I work in GLSL, I need to call "glslangValidator.exe" to convert my GLSL into SPIR-V, and I sure didn't feel like writing a complex system to be spawning processes from inside my app. So I have a shell script per platform which does, {compile shaders, convert SPIR-V binaries to headers which are included in the program, recompile the program, launch program, then repeat}.

Engine design is trivial as well, just setting up baked command buffers and then replaying them until exit. Everything compute based, and dispatch indirect based to manage variability. No graphics makes using Vulkan quite easy relatively speaking, no graphics state, no render passes, trivial transitions.

I'm debating whether to eventually release basic source for this project. On one hand it is a good example of a Windows/Linux Vulkan app from scratch. On the other hand, my code is very much in a shorthand which looks alien to other humans (likely the inverse of how C++ looks totally alien to me). For example, the following (which might get wrapped poorly by the browser) is my implementation of everything I need for printf style debugging writing to the terminal.

// A background message thread which handles printing.
// This works around the problem of slow console print on Windows.
// This also allows single point to override to stream to file, etc.
// Multiple threads can send messages simultaneously.
// It would be faster to queue messages per thread, but this isn't about speed, but rather mostly debug.
// Merging per message gives proper idea of sequence of events across threads.
// This has spin waits in case of overflow panic, so set limits so overflow panic never happens. 
 // Defaults, must be a power of two.
 // Number of characters in ring.
 #ifndef KON_BUF_MAX
  #define KON_BUF_MAX 32768
 #endif
 // Number of messages in ring.
 #ifndef KON_SIZE_MAX
  #define KON_SIZE_MAX 1024
 #endif
 // Maximum message size for macro message generation.
 #ifndef KON_CHR_MAX
  #define KON_CHR_MAX 1024
 #endif
 typedef struct {
  A_(64) U1 buf[KON_BUF_MAX*2]; // Buffer for messages, double size for overflow.
  A_(64) U4 size[KON_SIZE_MAX]; // Size of messages.
  A_(64) U8 atomReserved[1];    // Amount reserved: packed {MSB 32-bit buffer bytes, LSB 32-bit size count}.
  U8 atom[1];                   // Next: packed {MSB 32-bit buffer offset, LSB 32-bit size offset}.
  C2 write;                     // Function to write to console (adr,size). 
  U4 frame;                     // Updated +1 every time the writer goes to sleep (used for drain).
 } KonT;
 S_ KonT kon_[1];
 #define konR TR_(KonT,kon_)
 #define konV TV_(KonT,kon_)
 // Begin KON_CHR_MAX macro message.
 #define K_ { U1 konMsg[KON_CHR_MAX]; U1R konPtr=U1R_(konMsg)
 // Ends.
 #define KON_MSG KonWrite(konMsg,U4_(U8_(konPtr)-U8_(konMsg)))
 #define KE_ KON_MSG; }
 #define KW_ KON_MSG; KonWake(); }
 #define KD_ KON_MSG; KonDrain(); }
 #define KN_ konPtr[0]='\n'; konPtr++
 // Ends with newline.
 #define KNE_ KN_; KE_
 #define KNW_ KN_; KW_
 #define KND_ KN_; KD_
 // Append numbers.
 #define KH_(a) konPtr=Hex(konPtr,a)
 #define KU1_(a) konPtr=HexU1(konPtr,a)
 #define KU2_(a) konPtr=HexU2(konPtr,a)
 #define KU4_(a) konPtr=HexU4(konPtr,a)
 #define KU8_(a) konPtr=HexU8(konPtr,a)
 #define KS1_(a) konPtr=HexS1(konPtr,a)
 #define KS2_(a) konPtr=HexS2(konPtr,a)
 #define KS4_(a) konPtr=HexS4(konPtr,a)
 #define KS8_(a) konPtr=HexS8(konPtr,a)
 // Append decimal.
 #define KDec1_(a) konPtr=Dec1(konPtr,a)
 #define KDec2_(a) konPtr=Dec2(konPtr,a)
 #define KDec3_(a) konPtr=Dec3(konPtr,a)
 // Append raw data.
 #define KR_(a,b) do { U4 konSiz=U4_(b); CopyU1(konPtr,U1R_(a),konSiz); konPtr+=konSiz; } while(0)
 // Append character.
 #define KC_(a) konPtr[0]=U1_(a); konPtr++
 // Append zero terminated compile time immediate C-string.
 #define KZ_(a) CopyU1(konPtr,Z_(a),sizeof(a)-1); konPtr+=sizeof(a)-1
 // Append non-compile time immediate C-string.
 #define KZZ_(a) KR_(a,ZeroLen(U1R_(a)))
 // Quick message for debug.
 #define KQ_(a) K_; KZ_(a); KD_
 // Quick decimal.
 #define KDec2Dot3_(a) KDec2_(a/1000); KC_('.'); KDec3_(a%1000)
 #define KDec3Dot3_(a) KDec3_(a/1000); KC_('.'); KDec3_(a%1000)
 S_ void KonWake(void) { SigSet(SIG_KON); }
 // Unpack components from atom.
 I_ U4 KonSize(U8 atom) { return U4_(atom); } 
 I_ U4 KonBuf(U8 atom) { return U4_(atom>>U8_(32)); }
 // Unpack components from atom and mask.
 I_ U4 KonMaskSize(U8 atom) { return KonSize(atom)&(KON_SIZE_MAX-1); }
 I_ U4 KonMaskBuf(U8 atom) { return KonBuf(atom)&(KON_BUF_MAX-1); }
 // Reserve space to write message.
 I_ U8 KonReserve(U4 bytes) { return AtomAddU8(konV->atomReserved,(U8_(bytes)<<32)+1); } 
 // Release space reservation.
 S_ void KonRelease(U4 bytes,U4 msgs) { AtomAddU8(konV->atomReserved,(-(U8_(bytes)<<32))+(-U8_(msgs))); }
 // Check if reservation under limits.
 S_ U4 KonOk(U8 atom) { return (KonSize(atom)<=KON_SIZE_MAX)&&(KonBuf(atom)<=KON_BUF_MAX); }
 // Get next write position.
 I_ U8 KonNext(U4 bytes) { return AtomAddU8(konV->atom,(U8_(bytes)<<32)+1); }
 // Copy in message.
 S_ void KonCopy(U8 atom,U1R adr,U4 bytes) { CopyU1(konR->buf+KonMaskBuf(atom),adr,bytes);
  AtomSwapU4(konV->size+KonMaskSize(atom),bytes); }
 // Used for debug busy wait until message is displayed.
 S_ void KonDrain(void) { U4 f=konV->frame; while(f==konV->frame) { SigSet(SIG_KON); ThrYield(); } }
 // Write message to console.
 S_ void KonWrite(U1R adr,U4 bytes) { while(1) { 
  if(KonOk(KonReserve(bytes))) { KonCopy(KonNext(bytes),adr,bytes); return; }
  KonRelease(bytes,1); KonWake(); ThrYield(); } }
 // Background thread which sends messages to the actual console.
 S_ U8 KonThread(U8 unused) { U4 bufOffset=0; U4 sizeOffset=0; 
  while(1) { U4 bytes=0; U4 msgs=0; SigWait(SIG_KON,1000); SigReset(SIG_KON);
   while(1) { U4 size=konV->size[sizeOffset]; bytes+=size;       
    // If not zero need to force clear before adjusting free counts, to mark as unused entry.
    if(size) konV->size[sizeOffset]=0; 
    // Force write if would wrap, or found zero size message.
    if(((bufOffset+bytes)>=KON_BUF_MAX)||(size==0)) {
     KonRelease(bytes,msgs); bytes=0; msgs=0;
     // If hit zero size break (zero size means rest of messages are empty).
     if(size==0) break; }
    msgs++; sizeOffset=(sizeOffset+1)&(KON_SIZE_MAX-1); } 
    // Only advance frame after draining.
   BarC(); konV->frame++; } } 
 S_ void KonInit(void) { konR->write=C2_(ConWrite); ThrOpen(KonThread,THR_KON); }
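
For readers not fluent in the shorthand above, the core trick in KonReserve/KonRelease/KonOk can be sketched in plain C11. All names here are invented for illustration; this is a sketch of the idea, not the actual implementation:

```c
#include <stdatomic.h>
#include <stdint.h>

// Plain-C11 sketch (invented names) of the packed reservation atom above:
// the MSB 32 bits count reserved buffer bytes, the LSB 32 bits count
// reserved messages, so one fetch-add reserves space in both rings at once.
#define BUF_BYTES_MAX 32768u // mirrors KON_BUF_MAX
#define MSGS_MAX      1024u  // mirrors KON_SIZE_MAX

static _Atomic uint64_t reserved; // packed {bytes:32, msgs:32}

// Reserve one message of 'bytes' bytes; returns the post-add packed value.
static uint64_t Reserve(uint32_t bytes) {
 uint64_t add = ((uint64_t)bytes << 32) + 1u;
 return atomic_fetch_add(&reserved, add) + add; }

// Undo a reservation which exceeded the limits.
static void Release(uint32_t bytes, uint32_t msgs) {
 atomic_fetch_sub(&reserved, ((uint64_t)bytes << 32) + (uint64_t)msgs); }

// Check a post-add packed value against both ring limits.
static int ReserveOk(uint64_t atom) {
 return ((uint32_t)atom <= MSGS_MAX) && ((uint32_t)(atom >> 32) <= BUF_BYTES_MAX); }
```

A writer calls Reserve, checks ReserveOk, and on failure calls Release, wakes the drain thread, and retries, which mirrors the KonWrite loop above.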


Why Motion Blur Trivially Breaks Tone-Mapping - Part 2

Continuing from last post...

The question remains how to "fix it"? Any "fix" requires that the post-tone-mapped image is physically energy conserving under motion blur. As in energy conserving from the perspective of the viewer looking at the display. This requires, by definition, knowledge of the tone-mapping transfer function. On conventional displays this is not a problem, the application owns the tone-mapping transfer function.

However all the standards bodies for the new HDR Displays (HDR-10, Dolby Vision, ...) decided to do something rather rash: they took away the ability for you the developer to tone-map, and gave that step to the display OEM!


Now for a long but important technical tangent: for those who haven't been following what is happening in the display world, here is a refresher. Until recently you have enjoyed the freedom of targeting a display with enough knowledge of the output transfer function to do what you need. Basically on PCs you use sRGB, on Macs you use Gamma 2.2, and on HDTVs you use Rec709.

This ends with "HDR" Displays.

HDR Display signals like HDR-10 have switched from display-relative to an absolute nits scale signal with absolute wide-gamut primaries, both of which are far outside the realm of what a consumer display can output. Each display has a different capacity. Luminance range for the HDR signal is {0-10000 nits}, but it is roughly only {0-500 nits} for an OLED display (yeah, likely not even a stop brighter than the LDR screen you are reading this post from). Gamut for the signal is Rec 2020, but gamut for the displays is around P3. The only consumer devices which are similar in gamut to Rec 2020 are the MicroVision-based pico laser projectors, and related licensed products, which have existed for a while. In order to reach such a wide gamut they resort to ultra-narrow primaries which have a side effect: metamerism (meaning all viewers see something different on the display; they are impossible to calibrate for multiple human viewers). There has also existed a market selling displays over 5000 nits: the outdoor LED sign industry. While LCD HDR TVs are driven by LED back lights, adapting outdoor sign LEDs would require water cooling, and a power draw which is very far outside the range of what is acceptable in the consumer space. Point being, the range of the HDR signal will remain outside the realm of consumer displays for the foreseeable future: they will always be tone-mapped when driven by these new "absolute" scale signals.

In fact, the HDR standards *require* the display to tone-map the input signal and reduce to the range that the display can output. But the tone-mapper is up to the display OEM (or possibly some other 3rd party in the signal output chain in the future for non-TV cases). OEMs like to have a collection of different tone-mapping transfer functions depending on TV settings like "standard", "movie mode", "vivid", etc. You as a developer have no way of knowing what the user selected, and each TV can be different, even within the same product due to firmware updates.

So yes, the HDR TV standard for tone-mapping is effectively random!

Double Ouch!

Many developers understand what it means to target "random", because existing HDTVs already have this problem with things like contrast and saturation settings, just not to the extent of HDR TVs. The only way to author content is to purchase a large collection of displays, then play whack-a-mole: find the cases with the visually worst output, and keep re-adjusting the content until it looks ok on the largest number of displays. The problem with this, besides the expensive iteration cycles, as many color professionals know, is that when you cannot target calibrated displays, you also cannot push to the limits of the displays (especially in the darks); you must play it safe with content, and accept that your visual message gets diluted before your consumer sees it.

But it gets better (well at least sarcastically speaking): you also cannot really calibrate these new HDR displays!

Triple Ouch!

LED-driven LCDs and OLEDs each have different problems. Let's start with LCDs. The UHD HDR certification label requires an output contrast range which is physically impossible for LCDs to display without cheating. Quoting the above press release for the part applying to LCDs: "More than 1000 nits peak brightness and less than 0.05 nits black level". Taken literally that means 1000/0.05 or a 20000:1 contrast ratio.

PC LCD displays reached their ultra-cheap pricing by cutting panel quality; they typically top out at around 1000:1 ANSI contrast. Below is a table yanked from TFT Central showing some recent measured examples.

The best LCD displays are around 5000:1 ANSI contrast. The HDR TV industry high-end LCD models use LCDs which have around 4000:1 ANSI contrast. So LCDs are anywhere between 2 to 5 stops away from the minimum requirements for the HDR label.

Now on to the cheating part: enter Local Dimming. The back-light of these LCDs is driven by a regular grid of a hundred or so LEDs (called zones), each of which can be controlled individually. It then becomes possible to very coarsely adjust black level by dropping down peak white level in a given zone. Let's go through an example of how this works with an LCD panel capable of 2000:1 ANSI contrast,

full _______ bright zone -> _2000:1 contrast ratio from display peak to black level in zone
1 stop_ less bright zone -> _4000:1 contrast ratio from display peak to black level in zone
2 stops less bright zone -> _8000:1 contrast ratio from display peak to black level in zone
3 stops less bright zone -> 16000:1 contrast ratio from display peak to black level in zone
4 stops less bright zone -> 32000:1 contrast ratio from display peak to black level in zone
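
The table above follows from a simple relation: dimming a zone's backlight by N stops lowers that zone's black level by 2^N while the brightest zone stays at display peak. A sketch (function name invented):

```c
#include <stdint.h>

// Contrast ratio from display peak white to the black level inside a
// zone dimmed by 'stops' stops, for a panel of native ANSI contrast
// 'panelContrast' (2000:1 in the table above). Dimming a zone divides
// its black level by 2^stops while the brightest zone stays at peak.
static uint32_t ZoneContrast(uint32_t panelContrast, uint32_t stops) {
 return panelContrast << stops; }
```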

Now let's see what this looks like visually on a real display with some simple test cases. First a white line going through the center of the screen, then a white line outlining the screen. The image on the left is on a display without Local Dimming and represents what the output is supposed to look like; the image on the right is on a display with Local Dimming. Images below yanked from Gizmodo.

This visually shows exactly why it is impossible to calibrate a Local Dimming display, because the error introduced into the signal by the TV exceeds all sane standards for calibration. Quite literally these displays have horrible "uniformity". The error introduced by the display in the blacks can range 2-5 stops depending on the quality of the LCD panel.

Bringing this back to authoring content, the display introduces its own "square-ish bloom filter". This "square-ish bloom" doesn't move as the bright content moves by a fraction of a zone, and it doesn't actually scale in intensity with the average intensity of the output signal. Instead even a tiny amount of highlight will cause the full "bloom", because the display needs to fire the associated zone at peak to correctly reproduce the brightness of just a few pixels. Also the color of the "bloom" might not track the color of the associated highlight(s). No self-respecting developer would ever release a game with such poor quality "bloom".

This becomes even more of a problem with real-time games. Existing games often have bright UI elements wrapping the screen, or overlaid on top of game content. The reason this UI content is bright, is because it needs to be, in order to be visible over in-game content. Existing LDR PC displays often reach roughly 400 nits, so these new "HDR" LCD displays are only a little over 1 stop brighter (UHD HDR cert minimum is 1000 nits). Taking the example of a 2000:1 ANSI contrast UHD "HDR" labeled display, any 400 nit white text in that UI for instance is going to bring the nearby zones to close to 1 stop from peak, which drops the effective contrast ratio to around 4000:1, and will introduce "square-ish bloom" if the game content is dark.

High quality "non-HDR-labeled" displays from the past will actually produce much better quality output than current HDR LCDs, and they don't require any new HDR signal to do so. For example, the circa 2013 Eizo Foris FG2421 120 Hz LCD as reviewed by TFT Central has around a 5000:1 ANSI contrast ratio without resorting to Local Dimming, and has a low-persistence mode for gaming. Older top-of-the-line discontinued Plasma displays are much better than these new HDR LCDs because they don't have Local Dimming. Plasma unfortunately was ended by the display industry. Like OLED, Plasma is not as energy efficient as the LEDs driving the LCD back lights, so it gets lower APL peaks. Let's look at one of the best, the Pioneer Elite KURO PRO-110FD Plasma HDTV from 2007. The KURO's ANSI contrast is APL limited, measured around 3239:1. That doesn't really tell the full story, as real content is non-APL limited, at which point the contrast ratio is measured around 10645:1, which is over double the true contrast ratio of the best current HDR LCDs (meaning what is possible without major signal degradation).

Now lets look at OLED.

OLEDs don't have back lights, so they do not suffer from the woes of Local Dimming; they do however have a "darker" problem, quite literally in fact. The problem is described quite well by the latest HDTV Test's LG OLED55E6 4K HDR OLED TV Review: "the E6V’s default [Brightness] position of “50” has been purposely set up from factory to crush some shadow detail so that most users won’t pick up its near-black foibles. Once we raised [Brightness] to its correct reference value, we could see that the television was applying dithering to shadowed areas to better mask above-black blockiness."

The OLEDs crush the blacks because they have severe problems in near-black uniformity. From some personal testing and calibration on a slightly older LG OLED, I could see uniformity problems similar to burn-in that were larger in the darks than a few steps of 8-bit sRGB output. So while absolute black is quite dark, they don't have enough accuracy in the darks to reproduce anything well when APL is dropped down to levels appropriate for a dark room good for HDR viewing. These OLEDs can look ok in day time viewing because the default black crush in combination with ambient reflection on the screen masks the problem in the darks.

The black crush problem is a symptom of the larger problem facing OLED TVs. This problem is also described quite well by HDTV Test's review of the LG OLED65G6P HDR OLED TV: "We calibrated several G6s in both the ISF Night and ISF Day modes, and found that with about 200-250 hours run-in time on each unit, the grayscale tracking was consistently red deficient and had a green tint as a result (relative to a totally accurate reference). We witnessed the same inaccuracy on the European Panasonic CZ950/CZ952 OLED (which of course also uses a WRGB panel from LG Display)."

OLEDs cannot hold calibration.

OLED as a technology suffers from gradual pixel decay issues. These latest LG OLEDs resort to a 10-20 minute "Cleaning" attempted auto-re-calibration cycle when the TV is in powered standby after around 4 hours of viewing time. Even this cannot solve the problem. The accuracy of the calibration is poor, which is why OLEDs have such a problem reproducing anything interesting in the darks.

Returning to "Fixing" the Motion Blur Issue

So as established above you are effectively screwed on newer HDR Displays if you use the HDR input signal, tone-mapping is out of your hands. It is possible on HDR displays to still select the "PC" input and drive the display without tone-mapping using a traditional non-HDR signal. Often the HDR TVs (given they really are not that bright anyway) still support their full brightness range in that mode. Some HDR TVs I believe even support disabling Local Dimming. However there is no guarantee this will be the case or will remain the case. Also it is placing a large burden on the consumer to be tech savvy enough to correctly navigate the TVs options and do the right thing.

Until game developers organize en masse and throw their combined weight into forcing a display property query and pass-through signal standard, based on measured minimal pixel-level signal degradation as the cert metric, you will be forced to live with all the problems of the new HDR standards.

On non-HDR-signal output there are a lot of interesting options for "fixing" the motion blur issue.

Let's break this down into non-photo-real and photo-real categories. The non-photo-real case is what I'm personally interested in because it aligns with one of my current at-home projects involving a state-of-the-art Vulkan based engine. My intended graphics pipeline does {auto-exposure, color-grading and tone-mapping} of graphics elements into the limited display range, then afterwards does a linear composite applying linearly processed motion-blur and DOF to the elements, thus side-stepping the problem. Everything done post-tone-mapping is energy preserving, so motion does not change Average Picture Level.

One of the interesting things I've found when playing around with low persistence 160 Hz output, most recently playing through DOOM on a CRT, is that motion blur still plays an important role in visual quality. One would think at 160 Hz that motion blur simply isn't necessary. This is true as long as it is possible to keep camera motion slow enough, and the eye always tracks exactly with the motion. But with twitch games where the camera can spin instantly, the motion can get fast enough such that when the eye is not actively matching the camera rotation, it is possible to see discontinuous strobes of the image at the 160 Hz refresh. The eye/mind picks up on edges perpendicular to the direction of assumed motion. Introducing some amount of motion blur which effectively removes those perpendicular edges enables the mind to accept a continuously moving image.

For real-time graphics it is impossible to do "correct" motion blur because of the display limits and because we lack eye tracking. Instead motion blur serves a different purpose: to best mask artifacts so the mind falls into a deeper relationship with the moving picture. The desire is that the mind lives in what it imagines the world to be based on what it is seeing, without being distracted back into the real world by awareness of things like {triangles, pixels, scan-and-hold, aliasing, flickering, etc}. This is where the true magic happens, and why games like INSIDE and LIMBO by Playdead, which are effectively visually flawless, are so deeply engaging.

As for the photo-real case with content outside the capacity of the display, I believe it would be quite interesting to re-engineer motion blur to be APL conserving as observed by the gamer, but to maintain correct linear behavior. With bloom turned off on a still image, the actual brightness of various specular highlights is very hard to know due to highlight compression done in the tone-mapper. This hints at one possible compromise. Apply motion blur linearly so the color is correct, but with the pixel intensity as seen post-tone-mapped, so the APL as observed by the gamer is constant. Then linearly add a very diffuse bloom to the motion blurred image such that the visual hint of virtual scene brightness remains. This bloom is computed prior to motion blur using the original pre-tone-mapped full-dynamic-range color. The bloom must be diffuse enough that the motion length would not have caused a perceptual visual difference in the bloom if computed using traditional motion blur. Bloom in this case is more like a colored graduated filter, or like a fog on the glass.
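
Here is a hedged 1D sketch of that compromise. All names are invented and the stand-in x/(1+x) curve replaces a real tone-mapper; the point is only the ordering: tone-map first, apply an energy-conserving blur in display space (APL untouched), then add a very diffuse bloom computed from the pre-tone-map HDR signal.

```c
#include <stddef.h>

#define N 8

// Stand-in compressive tone-map curve (not a real grading pipeline).
static double Tonemap(double x) { return x / (1.0 + x); }

// Energy-conserving wrap-around box blur: preserves the image mean (APL).
static void Blur(const double* src, double* dst, int radius) {
 for (int i = 0; i < N; i++) {
  double sum = 0.0;
  for (int j = -radius; j <= radius; j++) sum += src[(i + j + N) % N];
  dst[i] = sum / (double)(2 * radius + 1); } }

// Average Picture Level of an image.
static double Apl(const double* img) {
 double sum = 0.0;
 for (int i = 0; i < N; i++) sum += img[i];
 return sum / (double)N; }

// APL-conserving "motion blur": blur the tone-mapped image, then add a
// very wide pre-computed bloom derived from the HDR input.
static void AplConservingBlur(const double* hdr, double* out, int radius, double bloomGain) {
 double tm[N], bloom[N];
 for (int i = 0; i < N; i++) tm[i] = Tonemap(hdr[i]);
 Blur(tm, out, radius);       // display-space blur: APL unchanged
 Blur(hdr, bloom, N / 2 - 1); // very diffuse bloom from the HDR signal
 for (int i = 0; i < N; i++) out[i] += bloomGain * Tonemap(bloom[i]); }
```

Because the blur runs on the already tone-mapped values, the APL seen by the viewer is constant under any motion length; the bloom term then re-adds the hint of virtual scene brightness.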

There are other interesting ideas around this, but that is all I have time for this time ...


Why Motion Blur Trivially Breaks Tone-Mapping

Some food for visual thought ...

You have a scene with a bunch of High Dynamic Range (HDR) light sources, or bright secondary reflections. Average Picture Level (APL) of the scene as displayed is relatively low, which is standard practice and expected behavior for displayed HDR scenes. Because the display cannot output the full dynamic range of the scene, the scene is tone-mapped before display, and thus the groups of pixels representing highlights appear not as bright as they should.

Now the camera moves, and the scene gets motion blur applied in engine prior to tone-mapping. All of a sudden the scene gets brighter. The APL increases. In fact the larger the motion, the brighter the scene gets.

You, yes you the reader, have seen this before in many games. And technically speaking the game engine isn't doing anything wrong.

What is happening is that those small bright areas of the scene get blurred, distributing their energy over more pixels. Each of these blurred pixels has less intensity than the original source. As the intensity lowers, it falls further away from the aggressive areas of highlight compression, approaching tonality which can be reproduced by the display. So the APL increases, because less of the scene's output energy is getting limited by the tone-mapper.
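
A tiny numeric experiment shows the effect (the x/(1+x) curve is a stand-in, not any engine's actual tone-mapper): keep total linear energy constant, spread a highlight over more pixels before tone-mapping, and the post-tone-map APL rises.

```c
#define W 8

// Stand-in compressive tone-map (highlight compression like x/(1+x)).
static double Tm(double x) { return x / (1.0 + x); }

// Post-tone-map APL of an HDR image after an optional wrap-around box
// blur (the engine-side motion blur) applied in linear space.
static double AplAfterTonemap(const double* hdr, int blurRadius) {
 double apl = 0.0;
 for (int i = 0; i < W; i++) {
  double sum = 0.0; int n = 0;
  for (int j = -blurRadius; j <= blurRadius; j++) { sum += hdr[(i + j + W) % W]; n++; }
  apl += Tm(sum / (double)n); }
 return apl / (double)W; }
```

With one 64.0 highlight among black pixels, the displayed APL climbs monotonically with blur radius even though the linear energy never changes, which is exactly the brightening described above.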

The irony in this situation is that as motion and motion blur increase, the scene is actually getting closer to its correct energy conserving visual representation as displayed.

Note the same effect applies as more of the scene gets out of focus during application of Depth of Field.

Re Twitter: Thoughts on Vulkan Command Buffers

Because twitter is too short ...

API Review
In Vulkan the application is free to create multiple VkCommandPools each of which can be used to allocate multiple VkCommandBuffers. However effectively only one VkCommandBuffer per VkCommandPool can be actively recording commands at a given time. The intent of this design is to avoid having a mutex when command buffer recording needs to allocate new CPU|GPU memory.

Usage/Problem Case
The following hypothetical situation is my best understanding of the usage case as presented in fragmented max 140 character messages on twitter. Say one had a 16-core CPU, where each core did a variable amount of command buffer recording. The application will need at a minimum 16 VkCommandPools in order to have 16 instances of command buffer recording going in parallel (one per core). Say the application has a peak of 256 command buffers generated per frame, and cores pull a job to write a command buffer from some central queue. Now given CPU threading and preemption is effectively random, it is possible in the worst case that only one thread on the machine has to generate all 256 command buffers. In Vulkan there are two obvious methods one could attempt to manage this situation,

(1.) Could pre-allocate 256 VkCommandBuffers on the 16 VkCommandPools, resulting in needing 4096 VkCommandBuffer objects total. Unfortunately AMD's Vulkan driver currently has higher than desired minimum allocated memory for each VkCommandBuffer. On the plus side there is an active bug, number 98777 (if you want to reference this in an email to AMD), for resolving this issue.

(2.) Could alternatively allocate then free VkCommandBuffers at run-time each frame.

Once bug 98777 is resolved with a driver fix, option (1.) would be the preferred solution from the above two options.

Digging Deeper
Part of what concerns me personally about this usage case is that it implies building an engine where VkCommandPool is effectively pinned to a specific CPU thread, and then randomly asymmetrically loading each VkCommandPool! For example, say in the typical case each CPU thread builds on average the same amount of command buffers in terms of CPU and GPU memory consumption. In this mostly symmetrical load pattern, the total memory utilization of each VkCommandPool will be relatively balanced. Now say at some frequency one of the threads, chosen randomly, and its associated VkCommandPool, is loaded with 50% of the frame's command buffers in terms of memory utilization. If VkCommandPools "pool" memory and keep it, then over time each VkCommandPool would end up "pooling" 50% of the memory required for all the frame's command buffers. Which in this case would be roughly 8 times what is required.

This problem isn't really Vulkan specific, it is a fundamental problem on anything which does deferred freeing of a resource. The amount of over-subscription in random asymmetrical load is a function of the delay before deferred free. Which ultimately becomes a balancing act between the overhead in run-time or synchronization cost for dynamic allocation, against the extra memory required.

Possible Better Solution?
Might be better to un-pin VkCommandPool from CPU thread. Then instead use a few more VkCommandPools than CPU threads, and have each CPU grab exclusive access to a random VkCommandPool at run-time to use to build command buffers for jobs until after a set timeout, at which point it releases a given VkCommandPool, and then chooses the next free one to start work again. Note there is no mutex in here for acquire/release pool, but rather a lock-free atomic access to a bit array in say a 64-bit word.
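
The acquire/release on a bit array can be sketched in a few lines of C11 (names invented; GCC/Clang's __builtin_ctzll assumed). Acquire claims the lowest free bit with a compare-and-swap loop; release just ORs the bit back in. No mutex anywhere:

```c
#include <stdatomic.h>
#include <stdint.h>

// One 64-bit word holds a free bit per command pool.
#define POOL_COUNT 20 // a few more pools than CPU threads

static _Atomic uint64_t freePools = (1ull << POOL_COUNT) - 1ull;

// Returns an exclusively-owned pool index, or -1 if none free.
static int PoolAcquire(void) {
 uint64_t bits = atomic_load(&freePools);
 while (bits) {
  int idx = __builtin_ctzll(bits); // lowest free pool
  uint64_t bit = 1ull << idx;
  // On failure 'bits' is reloaded with the current value; just retry.
  if (atomic_compare_exchange_weak(&freePools, &bits, bits & ~bit)) return idx; }
 return -1; }

static void PoolRelease(int idx) {
 atomic_fetch_or(&freePools, 1ull << idx); }
```

A thread would call PoolAcquire, record command buffers from that pool until its timeout expires, then PoolRelease and grab a different pool.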

In this situation, assuming CPU/GPU memory overhead for a command buffer scales roughly with CPU load of filling said command buffer, regardless of how asymmetrical the mapping is of jobs to CPU threads, the VkCommandPools get loaded relatively symmetrically.

Another thing about CPU threading which is rather important IMO, is that the OS will preempt CPU threads randomly after they have taken a job, which can cause random pipeline bubbles. As long as this is a problem, it might be desirable to preempt the OS's preemption and instead manually yield execution to another CPU thread at a point which ensures no pipeline bubbles (i.e. after finishing a job and releasing a lock on a queue, etc). The idea being to transform the OS's perception of the thread from a "compute-bound" thread (something which always runs until preemption) to something which looks like an interactive "IO-bound" thread (something which ends in self blocking). Maybe it is possible to do this by having more worker threads than physical/virtual CPU threads, and waking another worker, then blocking until woken again. Something to think about...

Transferring Command Buffers Across Pools?
I'll admit here I've been so Vulkan focused that I'm currently out of touch with how exactly DX12 works. Seems like the twitter claim is that the Vulkan design is fundamentally flawed because VkCommandBuffer is locked to a VkCommandPool at allocation-time, instead of being set at begin-recording-time like DX12. This sounds to me like the same as (2.) at the top of this post: effectively making "Allocate" and "Free" very fast for command buffers in a given pool, just "Allocate" is now effectively "Begin Recording" in the DX12 model. Meaning just shuffling work around to different API entry points. Assigning the Pool at "Begin Recording" time does not do anything to solve the asymmetric Pool loading problem caused by the desire to have Pools pinned to CPU threads for this usage case.

Baking Command Buffers - And Replaying
As the number of command buffers increases, one is effectively factoring out the sorting/predication of commands which would otherwise be baked into one command buffer, and deferring that sorting/predication until batch command buffer submit time. As command buffer size gets smaller, it can cross the threshold where it becomes more expensive to generate the tiny command buffers, than to cache them and just place them into the batch submit. So if say one had roughly 256 command buffers in effectively everything outside of shadow generation and drawing, meaning everything from compute based lighting through post processing, it is likely better to just cache baked command buffers instead of always regenerating them.

My personal preference is effectively "compute-generated-graphics": rendering with compute only, mixed with fully baked command buffer replay (no command buffer generation after init time), and indirect dispatch to manage adjusting the amount of work to run per frame ...


LED Displays

Gathering information to attempt to understand what is required to drive indoor LED sign based displays...

256x128 2:1 letter box display (NES was 256 pixels wide).

How do LED Modules Work?
Adafruit provides one description of how to drive a 32x16 LED module. Attempting a rough translation: LEDs are either on or off. The 32x16 panel can only drive 64 LEDs at one time, organized as two 32x1 lines 8 rows apart. Scanning starts with lines {0,8}, then {1,9}, then {2,10}, and so on.

Panels are designed to be chained, driven by a 16-bit connector which provides 2 pixels per clock (one pixel for top and one for bottom scan-line). Looks like some other grouped LED panels go up to 128x128, driven by 4 row chunks of 128x32, each built from two chained 64x32 panels. Seems like the 64x32 panels are driven with 2 lines of 64 pixels (based on the addition of one extra address bit). Could not find a good description of chaining yet.

Seems like the 64x32 panels have roughly a 1/16 duty cycle (meaning only 1/16 of the LEDs are active at any one time). LED displays are low-persistence high-frame-rate displays with binary pixels. Based on this thread they can drive one cable at 40 MHz. So a 128x128 panel with 4 cables would be roughly 80M pixels / (128*32 pixel/frame) = 19.5 thousand frames per second.
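
The sub-frame-rate estimate above can be written down directly (function name invented; 40 MHz clock and 2 pixels per clock per the thread referenced above):

```c
#include <stdint.h>

// Binary sub-frame rate for one chained panel chunk: each cable clocks
// 'pxPerClock' pixels per cycle into a chunk of chunkW x chunkH LEDs.
static uint32_t SubFrameHz(uint32_t clockHz, uint32_t pxPerClock,
                           uint32_t chunkW, uint32_t chunkH) {
 return (clockHz * pxPerClock) / (chunkW * chunkH); }
```

80M pixels/s over a 4096-pixel chunk gives roughly 19.5 thousand binary sub-frames per second; at 120 Hz with half-black frames that is about 19531/240, or roughly 81 sub-frames left for PWM.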

The basic Pulse Width Modulation (PWM) to modulate brightness would transform this low-persistence display into something effectively scan-and-hold, just with a lot of micro-strobed sub-frames doing PWM across the effective "scan-and-hold" period. Getting something truly low-persistence is more of a challenge. These displays can be over 1500 nits (even with a 1/16 duty cycle). So one option for lower persistence is to actually insert black frames between frames, dropping the scan-and-hold time.

A 120 Hz frame rate provides 8.333 ms of frame time; switching to half black frames would drop to 4.16 ms (which isn't yet low persistence IMO), and would reduce it to a 750 nit display (half the brightness), leaving roughly 80 or so sub-frames for PWM.

A 240 Hz frame rate at half black frames could be at the right compromise between lost contrast and low persistence. A 480 Hz frame rate with no black frames might be able to provide full contrast, and low enough persistence, but likely would need some seriously good temporal dithering.