HDFury Nano GX: HDMI to VGA

Got my HDFury Nano GX, now have the ability to run my little CRT from the HDMI out of the laptop with the 120Hz screen and a GTX 880M...

The Nano is simply awesome. Can now also run the PS4 on a VGA CRT with this device, way better than the PlasmaHDTV I've been using. When the device needs to decrypt HDMI signals, it needs the cord for USB power. My little CRT can do 720p60Hz, and the tiny amount of super-sampling from the PS4's downscaler in combination with the CRT scanline bloom, low latency, and low persistence creates an awesome visual.

Running from the GTX 880M with NVIDIA drivers in Linux worked right out of the box at 720p60Hz also on the little CRT. I ran with 720p GPU-downsampled from 1080p to compare apples to apples. Yes 60Hz flickers with a white screen, but with the typically low intensity content I run, I don't really notice the flicker. Comparing the 120Hz LCD and the CRT at 60Hz is quite interesting. The CRT definitely looks better motion wise. The 120Hz LCD has no strobe backlight, so it has 4-8x higher persistence than the CRT at 60Hz. Very fast motion is easy to track visually on the CRT. When the eye tracking fails, it still does not look as bad as the 120Hz LCD. The 120Hz LCD is harder to track in fast motion without seeing what looks similar to full-shutter 4-tap motion blur at 30Hz. It is still visually obvious that the frame sits for a bit in time.

In terms of responsiveness, the 120Hz LCD is direct driven by the 880M. The loop from GPU LUT reprogram to reading the value on a color calibration sensor is just 8 ms. My test application also minimizes latency of input by reading controller input directly from CPU memory right before view dependent rendering. Even with that, the 120Hz LCD definitely feels more responsive. The 8ms difference between CRT at 60Hz and LCD at 120Hz seems to make an important difference.
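The persistence and latency figures above follow from simple frame-time arithmetic. A minimal sketch, assuming a non-strobed LCD holds each frame for the full refresh interval; the CRT persistence range here is only what the "4-8x" figure implies, not a measurement.

```python
def frame_time_ms(refresh_hz):
    """Duration of one refresh interval in milliseconds."""
    return 1000.0 / refresh_hz

# Assumption: without a strobe backlight the LCD shows the frame for the whole interval.
lcd_hold_ms = frame_time_ms(120)
crt_frame_ms = frame_time_ms(60)

# "4-8x higher persistence" implies a CRT persistence between hold/8 and hold/4.
crt_persistence_ms = (lcd_hold_ms / 8.0, lcd_hold_ms / 4.0)

# Frame-time gap between 60Hz and 120Hz, the "8ms difference" in the text.
latency_gap_ms = crt_frame_ms - lcd_hold_ms

print("LCD hold: %.2f ms" % lcd_hold_ms)
print("Implied CRT persistence: %.2f to %.2f ms" % crt_persistence_ms)
print("60Hz vs 120Hz frame-time gap: %.2f ms" % latency_gap_ms)
```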

Thoughts on Motion Blur
Motion blur on the CRT at 60Hz in my mind is completely not necessary. Motion blur on the 120Hz LCD (or even 60Hz LCD) is something I would not waste perf on any more. However it does seem as if the entire point of motion blur for "scan-and-hold" displays like LCDs is to simply reduce the confusion that the human visual system is subjected to. Specifically just to break the hard edges of objects in motion, so as to reduce the blur confusion the mind is getting from the full persistence "hold". Seems like if motion blur is used at 60Hz and above on an LCD, it is much better to just limit it to a very short size with no banding.

Nano and NVIDIA Drivers in X
Had to manually adjust the xorg conf to get anything outside 720p60Hz to work. I disabled the driver from using the EDID data, went back to classic horz and vert ranges. Noticed a bunch of issues with the NVIDIA Drivers and/or hardware,

(a.) Down-sampling at scan-out has some limits, for example going from 1920x1080 to 640x480 won't work. However scaling only height or width with virtual panning in the unscaled direction does work. This implies that either the driver has a bug, or more likely that the hardware does not have enough on-chip line buffer for the scaler to do such high reductions.

(b.) Up-sampling at scan-out from NVIDIA GPUs is completely useless because they all introduce ringing (hopefully some day they will fix that, I haven't tried Maxwell GPUs yet).

(c.) Metamodes won't do under 31KHz horizontal frequency modes. Instead it forces the dead ugly "doublescan".

(d.) Skipping metamodes and falling back to modelines has a bug where small resolutions automatically extend out the virtual panning resolution in width only, but have broken panning. I still have not found a workaround for this. The modelines even under 31KHz seem to work however (must use "ModeValidation" options to turn off the safety checks).

The Nano does work with frequencies outside 60Hz: managed to get 85Hz going at 640x480. Seems as if the Nano also supports low resolutions and arcade horizontal frequencies (a modeline with 320x240 around 60Hz worked). Unfortunately I'm somewhat limited in testing by the limits of the CRT. It also seemed as if I could get the 880M to actually use an arcade horizontal frequency. But I don't have a way to validate this yet. Won't be sure until I eventually grab another converter (VGA to Component supporting 240p) and try driving a NTSC TV with 240p like old consoles did.
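Whether a modeline lands above or below the 31KHz line can be checked from its numbers: horizontal frequency is pixel clock divided by total pixels per line, and vertical refresh is horizontal frequency divided by total lines. A small sketch; the 640x480 timing is the standard VESA one, while the 320x240 numbers are hypothetical values picked for illustration and not the modeline I actually used.

```python
def modeline_rates(pixel_clock_mhz, htotal, vtotal):
    """Return (horizontal kHz, vertical Hz) for a progressive modeline."""
    hfreq_khz = pixel_clock_mhz * 1000.0 / htotal
    vrefresh_hz = hfreq_khz * 1000.0 / vtotal
    return hfreq_khz, vrefresh_hz

# Standard VESA timing: Modeline "640x480" 25.175 640 656 752 800 480 490 492 525
vga_h, vga_v = modeline_rates(25.175, 800, 525)

# Hypothetical 320x240 arcade-style timing (illustrative numbers only).
arcade_h, arcade_v = modeline_rates(6.7, 426, 262)

print("640x480: %.2f kHz, %.2f Hz" % (vga_h, vga_v))
print("320x240: %.2f kHz, %.2f Hz" % (arcade_h, arcade_v))
```

The first lands right around 31.5 kHz (classic VGA), the second around 15.7 kHz, which is the range old arcade monitors and NTSC TVs expect.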


Thoughts on Display Color Calibration for Games

On Apple products across the board, the factory tonal configuration is Gamma 2.2 not sRGB. Using an sRGB backbuffer is totally useless, instead whatever shader converts from linear high dynamic range to display target needs to manually do the pow(). Typically this step is manual anyway, because that is required to properly dither the floating point color to 8-bit per channel output. On the plus side, Apple products are so well calibrated and matched even between desktop and mobile, that anyone with a color calibrated authoring pipeline can target the hardware and the consumer will experience the artist's intent. This is simply awesome.
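As a rough sketch of that manual step: convert linear to Gamma 2.2 with a pow(), then dither while quantizing to 8 bits. This version adds one LSB of uniform noise in the encoded space, which is the simplest form and not necessarily the "correct" dither; the function name is made up for the example.

```python
import random

def encode_gamma22_dithered(linear, rng):
    """Linear [0,1] -> 8-bit code using Gamma 2.2 and 1 LSB of uniform dither."""
    encoded = max(0.0, min(1.0, linear)) ** (1.0 / 2.2)
    noise = rng.random() - 0.5                   # uniform in [-0.5, 0.5) LSB
    code = int(encoded * 255.0 + 0.5 + noise)    # argument is >= 0, so int() floors
    return max(0, min(255, code))

# Over many frames the dithered codes average out to the unquantized value.
rng = random.Random(1)
target = 0.2 ** (1.0 / 2.2) * 255.0
codes = [encode_gamma22_dithered(0.2, rng) for _ in range(20000)]
print("target %.3f, temporal average %.3f" % (target, sum(codes) / len(codes)))
```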

On the PC side, and from what I could see on a very small sampling of the fragmented Android space, sRGB is a better match (than Gamma 2.2) to default device factory calibration. This is not surprising given that both sRGB and Rec.709 (and later) HDTV standards adopted a linear segment close to black. The idea being that the linear segment enables a better perceptual distribution given a fixed set of bits.

The disadvantage of encodings like sRGB, which mix both a linear segment and gamma curve into the tonal curve, is that "correct" manual dithering can be more expensive (because the conversion is much more expensive). Given that all realtime digital content should use temporal dithering to avoid output banding, Apple's choice of fixed Gamma 2.2 seems like a much better choice. However...
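For reference, the two encodings side by side. The sRGB constants are the standard ones; the branch for the linear segment is the extra cost compared to a single pow().

```python
def srgb_encode(c):
    """Linear [0,1] -> sRGB: linear segment near black, power curve above it."""
    if c <= 0.0031308:
        return 12.92 * c
    return 1.055 * c ** (1.0 / 2.4) - 0.055

def gamma22_encode(c):
    """Linear [0,1] -> pure Gamma 2.2."""
    return c ** (1.0 / 2.2)

# The curves are close in the mids and differ most near black.
for linear in (0.001, 0.01, 0.18, 0.5, 1.0):
    print("%.3f  sRGB %.4f  Gamma2.2 %.4f"
          % (linear, srgb_encode(linear), gamma22_encode(linear)))
```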

On the Topic of Banding
TN panels often are 6-bit/channel, with temporal dithering. Plasma hits the other end of the extreme (maybe 1 or 2-bit/channel?) with extreme temporal dithering (600 Hz). In both these cases, applications need to manually dither beyond the exact amount required for 8-bit output (display dither is too conservative). Also in both these cases, the application temporal dither can mix in bad ways with the display's temporal dithering. My current feeling is that the correct solution to this problem is to replace the "correct" temporal dither with a film grain with a gamma response (like film) applied in the linear HDR colorspace. This film grain would have a minimum amount even in the light areas which is large enough to serve as the temporal dither (to remove banding on worst case target). Also the film grain would be between 1.5 and 2 pixels in size, so that it does not conflict with a display's 1 pixel sized temporal dithering. The end result of this, is that sRGB again is a fine target, and Gamma 2.2 requires extra shader overhead.

White Point Calibration
Seems best to just target the D65 (daylight filtered 6500K) of sRGB. Knowing that: (a.) displays will be +/- that value, (b.) the mind automatically adapts to small differences in white point, and (c.) that white point will change +/- towards darks as well even on a given display. The cause of (c.) is that even if the display is calibrated to D65, the native black point of the display typically is not D65, and the only way to fix that is to raise the black level (adding intensity to some channels, reducing contrast), which is not something OEMs and users want.

Simple Production Calibration Goals
So the goal of simple calibration of displays is to get the {R,G,B} LUTs to provide a D65 white, through the entire gray scale, with the sRGB tonal curve, with the exception that somewhere in the darks, the very dark grey color will start to color shift to the native black point color tint, and then terminate at something which is not fully black. The color gamut of the display ultimately decides saturation scaling, which changes per type of display.

Simple In-Game Controls
Display gamma is the wild west, so at least need some user-adjustable gamma. Something like "move the slider until the dark symbol is just barely visible". Still not sure if user-adjustable offset is required as well.
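A minimal sketch of what such a slider could drive, applied to the already display-encoded value. The function name, the parameter names, and the additive form of the offset are all assumptions for illustration.

```python
def apply_user_gamma(encoded, user_gamma=1.0, offset=0.0):
    """Adjust a display-encoded value in [0,1]; user_gamma > 1 lifts the darks."""
    if user_gamma <= 0.0:
        raise ValueError("user_gamma must be positive")
    value = max(0.0, min(1.0, encoded)) ** (1.0 / user_gamma)
    # Hypothetical optional offset, clamped so output stays in range.
    return max(0.0, min(1.0, value + offset))

# The "barely visible dark symbol" would be drawn at some low code value.
for slider in (0.8, 1.0, 1.2):
    print("gamma %.1f -> dark symbol at %.4f" % (slider, apply_user_gamma(0.05, slider)))
```

Black and white stay fixed under the gamma term, only the distribution in between moves, which is why a separate offset might still be needed for displays with crushed blacks.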

Scifi Reading Suggestion List from Twitters

Accelerando - Charles Stross
Altered Carbon - Richard K. Morgan
Anathem - Neal Stephenson
Blindsight - Peter Watts
Blue Remembered Earth - Alastair Reynolds
Book of the New Sun: Series of 4 books - Gene Wolfe
Commonwealth Saga: Pandora's Star, Judas Unchained - Peter F. Hamilton
Cryptonomicon - Neal Stephenson
The Culture Series - Iain M. Banks
Deepness in the Sky - Vernor Vinge
Diamond Age - Neal Stephenson
Dune Series - Frank Herbert
A Fire Upon the Deep - Vernor Vinge
Leviathan Wakes - James S. A. Corey
Lord of Light - Roger Zelazny
Heechee Series - Frederik Pohl
House of Suns - Alastair Reynolds
Hyperion Cantos: Hyperion, ... - Dan Simmons
In Her Name Series - Michael R. Hicks
The Martian - Andy Weir
The Moon is a Harsh Mistress - Heinlein
Neptune's Brood - Charles Stross
Nexus - Ramez Naam
The Night's Dawn Trilogy: The Reality Dysfunction, The Neutronium Alchemist, The Naked God - Peter F. Hamilton
Old Man's War - John Scalzi
Only Forward - Michael Marshall Smith
Player of Games - Iain M. Banks
Redshirts - John Scalzi
Revelation Space - Alastair Reynolds
Sandkings - G.R.R Martin
Silo Series: Wool, Shift, Dust - Hugh Howey
Singularity Sky, Iron Sunrise - Charles Stross
The Skinner - Neal Asher
Solaris - Stanislaw Lem
Star Wolf - David Hamilton
The Unincorporated Man - Dani Kollin and Eytan Kollin
Tuf Voyaging - G.R.R Martin
Windup Girl - Paolo Bacigalupi



Placed up the X Window Manager I've been using for over a decade on GitHub. It provides simple full or split screen tiled windowing with virtual windows. Nothing more. Ideal for me, probably not ideal for you. Works great now that GIMP can be placed in "single window mode".

 Simple yet very useful single screen X Window Manager.
 Designed to minimize wasted user time interacting with windows.
 No configuration files.
 Tiny x86-64 binary.

 ALT+ESC .......... Close window.
 ALT+TAB .......... Cycle through window list on virtual screen (like Windows).
 ALT+` ............ Cycle window shape between full, and tiled positions.
 ALT+1 ............ Switch virtual screen left.
 ALT+2 ............ Switch virtual screen right.
 ALT+3 ............ Move focus window to virtual screen left.
 ALT+4 ............ Move focus window to virtual screen right.

 The windows list is ordered as follows,

  { most recently used, 2nd most recently used, ..., last used }

 While ALT is held down pressing TAB will cycle through list,
 going to the last recently used window from the current window.
 It will wrap around at the end.
 After ALT is released, the list is updated.
 The new current window is moved to the front of the list.

 Only requires a C compiler and the X11 library.
 Try something like,

  gcc minwm.c -Os -o minwm -I/usr/X11/include -L/usr/X11/lib -lX11
  strip minwm

 Then setup your .xinitrc file like,

  xrdb -merge $HOME/.Xresources
  xterm -rv -ls +sb -sl 4096 &
  exec $HOME/minwm

 Then run xinit and then start programs from the terminal.
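The ALT+TAB ordering described in the notes above can be modeled in a few lines. This is a sketch of the described behavior only, not code from minwm.c; the class and method names are invented, and it assumes each TAB press steps one entry further down the most-recently-used list.

```python
class MruCycler:
    """Most-recently-used window list with ALT+TAB style cycling."""

    def __init__(self, windows):
        self.order = list(windows)   # front of list = current window
        self.cursor = 0              # position in the list while ALT is held

    def press_tab(self):
        """ALT held, TAB pressed: step to the next entry, wrapping at the end."""
        if not self.order:
            return None
        self.cursor = (self.cursor + 1) % len(self.order)
        return self.order[self.cursor]

    def release_alt(self):
        """ALT released: move the selected window to the front of the list."""
        if not self.order:
            return None
        selected = self.order.pop(self.cursor)
        self.order.insert(0, selected)
        self.cursor = 0
        return selected

wm = MruCycler(["xterm", "gimp", "browser"])
wm.press_tab()          # selects gimp
wm.release_alt()        # gimp moves to the front
print(wm.order)
```

Because the list only updates on ALT release, a quick ALT+TAB always toggles between the two most recent windows.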


Next Generation OpenGL Initiative Details from Khronos BOF

OpenGL Ecosystem BOF 2014


Cross vendor project between OpenGL and OpenGL ES working groups:
- Chair = Tom Olson (ARM)
- IL Group Chair = Bill Licea-Kane (Qualcomm)
- API Spec Editors = Graham Sellers (AMD) and Jeff Bolz (NVIDIA)

Committed to adopting a portable intermediate language for shaders.
Compatibility break from existing OpenGL.
Starting from first principles.
Multi-thread friendly.
Greatly reduced CPU overhead.
Full support for tiled and direct renderers.
Explicit control: application tells driver what it wants.



Link to the Shadertoy example.

Growing up in the era of the CRT "CGA" Arcade Monitor was just awesome. Roughly 320x240 or lower resolution at 60 Hz with a low persistence display. Mix that with stunning pixel art. One of the core reasons I got into the graphics industry.

Built the above Shadertoy example to show what I personally like in attempting to simulate that old look and feel on modern LCD displays. The human mind is exceptionally good at filling in hidden visual information. The dark gaps between scanlines enable the mind to reconstruct a better image than what is actually there. The right most panel adds a quick attempt at a shadow mask. It is nearly impossible to do a good job simulating that because the LCD cannot get bright enough. The compromise in the shader example is to rotate the mask 90 degrees to reduce chromatic aberration. The mask could definitely be improved, but this is a great place to start...

Feel free to use/modify the shader. Hopefully I'll get lucky and have the option to turn on the vintage scanline look when I play those soon to be released games with awesome pixel art!


Vintage Programming

A photo (not a screenshot) of one of my home vintage development environments running on modern fast PCs. Shot shows colored syntax highlighted source to the compiler of the language I use most often (specifically the part which generates the ELF header for Linux). More on this below.

This is running 640x480 on a small mid 90's VGA CRT which supports around 1000 lines. So no garbage double scan and no horrible squares for pixels. Instead a high quality analog display running at 85 Hz. The font is my 6x11 fixed size programming font.

This specific compiler binary on x86-64 Linux is under 1700 bytes.

A Language
The language is ultra primitive, it does not include a linker, or anything to do code generation, there is no debugger (and it frankly is not needed as debuggers are slower than instant run-time recompile/reload style development). Instead the ELF (or platform) header for the binary, and the assembler or secondary language which actually describes the program, is written in the language itself.

Over the years I've been playing with either languages which are in classic text form, or languages which require custom editors and are in a binary form. This A language is the classic text source form. All the variations of languages I've been interested in are heavily influenced by Color Forth.

This A compiler works in 2 passes, the first both parses and translates the source into x86-64 machine code. Think of this as factoring out the interpreter into the parser. The second pass simply calls the entry point of the source code to interpret the source (by running the existing generated machine code). After that whatever is written in the output buffer gets saved to a file.

Below is the syntax for the A language. A symbol is an untyped 64-bit value in memory. Like Forth there is a separate data and return stack.

012345- \compile: push -0x12345 on the data stack\
,c3 \write a literal byte into the compile stream\
symbol \compile: call to symbol, symbol value is a pointer to function\
'symbol \compile: pop top of data stack, if value is true, call symbol\
`symbol \copy the symbol data into the compile stream, symbol is {32-bit pointer, 32-bit size}\
:symbol \compile: pop data stack into symbol value\
.symbol \compile: push symbol value onto data stack\
%symbol \compile: push address of symbol value onto data stack\
"string" \compile: push address of string, then push size of string on the data stack\
{ symbol ... } \define a function, symbol value set to head of compile stream\

And that is the A language. The closing "}" writes out the 32-bit size to the packed {32-bit pointer, 32-bit size} symbol value, and also adds an extra RET opcode to avoid needing to add one at the end of every define. There is one other convention missing in the above description, there is a hidden register used for the pointer to the output buffer.

Writing Parts of the Language in the Language
The first part of any source file is a collection of opcodes, like the { xor ,48 ... } at the top of the image which is the raw x86-64 machine code to do the following in traditional assembly language (rax = top of data stack, rbx points to second data stack entry),

XOR rax, [rbx]
SUB rbx, 8

This collection of opcodes generates symbols which form the stack based language the interpreter uses. They would get used like `xor in the code (the copy symbol to compile stream syntax). For instance `long pops the top of the data stack and writes out 8-bytes to the output buffer, and `asm pushes the output buffer pointer onto the data stack.
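For anyone wanting to see where the bytes in such an opcode define come from, here is a sketch that hand-encodes the two instructions above using the standard x86-64 encodings (REX.W prefix, opcode, ModRM). The helper names are made up, and the output is the encoding of those two instructions as written, which is not guaranteed to match byte for byte what is in the image.

```python
RAX, RBX = 0, 3          # x86-64 register numbers
REX_W = 0x48             # REX prefix with W=1 (64-bit operand size)

def modrm(mod, reg, rm):
    """Build a ModRM byte from its three fields."""
    return (mod << 6) | (reg << 3) | rm

def xor_rax_mem_rbx():
    """XOR rax, [rbx] : REX.W + 33 /r, mod=00 selects [rbx] with no displacement."""
    return bytes([REX_W, 0x33, modrm(0b00, RAX, RBX)])

def sub_rbx_imm8(imm):
    """SUB rbx, imm8 : REX.W + 83 /5 ib, mod=11 selects the register itself."""
    return bytes([REX_W, 0x83, modrm(0b11, 5, RBX), imm & 0xFF])

code = xor_rax_mem_rbx() + sub_rbx_imm8(8)

# Print in the ",hex" literal byte syntax of the A language.
print(" ".join(",%02x" % b for b in code))
```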

I use this stack based language to then define an assembler (in the source code), and then I write code in the assembler using the stack based language as effectively the ultimate macro language. For instance if I was to describe the `xor command in the assembly it would look as follows,

{ xor .top .stk$ 0 X@^ .stk$ 8 #- }

Which is really hard to read without syntax coloring (sorry my HTML is lazy). For naming, the "X" = 64-bit extended, the "@" = load, and the "#" = immediate. So the "X@^" means assemble "XOR reg,[mem+imm]". The symbols "top" and "stk$" contain the numbers of the registers for the top of the stack and the pointer to the second item on the stack respectively.

Compiler Parser
The compiler parsing pass is quite easy, just a character jump table based on prefix character to a function which parses the {symbol, number, comment, white space, etc}. These functions don't return, they simply jump to the next thing to parse. As symbol strings are read they are hashed into a register and bit packed into two extra 64-bit registers (lower 4-bits/character in one register, upper 3-bits/character in another register). This packing makes string compare easy later when probing. Max symbol string is 16 characters. Hash table is a simple linear probing style, but with an array of 2 entries per hash value filling one cacheline. Each hash table entry has the following 8-byte values {lower bits of string, upper bits of string, pointer to symbol storage, unused}. The symbol storage is allocated from another stack (which only grows). Upon lookup, if a symbol isn't in the hash table it is added with new storage. Symbols never get deleted.
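A sketch of the packing and probing described above, in Python rather than machine code. Two entries of four 8-byte values is 64 bytes, which is where the one-cacheline figure comes from. The hash function, table size, and all names here are placeholders, since the post does not specify them.

```python
MAX_CHARS = 16   # 16 chars * 4 bits = 64 bits, 16 chars * 3 bits = 48 bits
WAYS = 2         # entries per hash value (2 * 32 bytes = one 64-byte cacheline)

def pack_symbol(name):
    """Pack 7-bit characters: low 4 bits of each into lo, high 3 bits into hi."""
    if len(name) > MAX_CHARS:
        raise ValueError("symbol longer than 16 characters")
    lo = hi = 0
    for i, ch in enumerate(name):
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("7-bit ASCII only")
        lo |= (code & 0xF) << (4 * i)
        hi |= (code >> 4) << (3 * i)
    return lo, hi

def hash_symbol(name):
    """Placeholder hash; the real compiler's hash is not described."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFFFFFFFFFF
    return h

class SymbolTable:
    def __init__(self, buckets=1024):
        self.table = [[None] * WAYS for _ in range(buckets)]
        self.storage = []            # symbol values, only ever grows

    def lookup(self, name):
        """Return the storage index for name, adding it if missing."""
        lo, hi = pack_symbol(name)
        bucket = hash_symbol(name) % len(self.table)
        for _ in range(len(self.table)):
            entries = self.table[bucket]
            for way in range(WAYS):
                if entries[way] is None:             # empty slot: insert here
                    self.storage.append(0)
                    entries[way] = (lo, hi, len(self.storage) - 1)
                    return entries[way][2]
                if entries[way][:2] == (lo, hi):     # two integer compares
                    return entries[way][2]
            bucket = (bucket + 1) % len(self.table)  # linear probe
        raise RuntimeError("symbol table full")

table = SymbolTable()
print(table.lookup("xor"), table.lookup("asm"), table.lookup("xor"))
```

Since symbols never get deleted, hitting an empty slot during a probe is proof the symbol is not in the table, so lookup and insert are the same walk.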

Highly Recommend PowerNotebooks.com

Got a custom notebook from powernotebooks.com, and I'd highly recommend them for anyone else looking for a new laptop. They have an interesting practice of providing faster orders and a few percent off for paying cash (via a few different methods). Their customer service for traditional phone calls is also quite awesome. I learned a few things talking to their technically knowledgeable staff.


HRAA And Coverage and Related Topics

Michal Drobot's HRAA Slides, great talk, I've read it a few times now. Really good seeing people get serious about solving the aliasing problem.

Coverage Fail Case?
Start with a simple example of two color and depth samples {N,S} (with associated coverage samples). And two extra coverage samples {w,e}. In the following pattern,


Starting with all cleared (unknown) case,


Render a triangle in the foreground which covers {S,w,e},


Now render a triangle in the background which covers {N,S,w,e}, this for instance could be a skybox. The N sample passes the depth test, the S sample fails the depth test, and the {w,e} coverage samples get set to unknown (coverage samples have no depth, raster unit does not know which triangle is in front, because a coverage sample's associated depth sample won't work in sloped cases). The result being the same as if there are no coverage samples,


Front-to-back drawing order (the best order for performance) clears out coverage information. Only back-to-front draw order (the worst for overdraw) builds coverage as front triangles evict and optionally replace coverage sample association.

This is ultimately why I abandoned the idea of using coverage samples for reconstruction.

Since then I've learned that coverage might work if there is a front-to-back full z-pre-pass, followed by rendering back-to-front with depth test passing if depth is nearer or equal. This process would likely restore coverage. This likely explains why EQAA and CSAA actually seemed to work when they were first introduced, because engines did do z-pre-passes at that time. Back when I tried coverage based reconstruction I never did a z-pre-pass (couldn't afford to submit the geometry again).

The CRAA LUT in Michal Drobot's paper is a great idea for a z-pre-passing engine working on a platform which provides coverage information.

For any hardware which provides programmable sample locations in a granularity of 2x2 pixels (or beyond), one can do better than the flipquad setup. On slide 70, notice blue samples of two pixels {0,2} and {1,3} align on vertical lines.

Tried before a few times, never found it to work well as just blending in more of the source as a function of approaching full pixel offset, perhaps I did something wrong, going to need to try this again!

Abdul Bezrati: Real-time lighting via Light Linked List


Seems like the basic idea is as follows,

Render G-buffer.
Get min/max depth for each 8x8 tile.
For 1/64 area (1600/8 by 900/8), raster lights with software depth test.
For each tile, build linked list of lights intersecting tile.
Linked lists of {half depthMin, half depthMax, uint8 lightIndex, uint24 nextStructureIndex}.
Keeping light bounds helps avoid shading when tile has large depth range.
Full screen pass shading for all lights intersecting a tile.
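To make the sizes in the steps above concrete: the list node is 8 bytes, and the tile grid for 1600x900 is 200 by 112.5, so presumably rounded up to 113 rows. A sketch using Python's struct module ("e" is IEEE half precision); placing lightIndex in the low 8 bits of the 32-bit word is my assumption, as is the rounding.

```python
import math
import struct

def pack_node(depth_min, depth_max, light_index, next_index):
    """Pack {half depthMin, half depthMax, uint8 lightIndex, uint24 next} into 8 bytes."""
    if not (0 <= light_index < 256 and 0 <= next_index < (1 << 24)):
        raise ValueError("index out of range")
    # Assumed bit layout: light index in bits 0..7, next pointer in bits 8..31.
    return struct.pack("<eeI", depth_min, depth_max, light_index | (next_index << 8))

def unpack_node(blob):
    """Inverse of pack_node."""
    depth_min, depth_max, word = struct.unpack("<eeI", blob)
    return depth_min, depth_max, word & 0xFF, word >> 8

# Tile grid for a 1600x900 target with 8x8 tiles (rounding up is an assumption).
tiles = (math.ceil(1600 / 8), math.ceil(900 / 8))

node = pack_node(0.5, 2.0, 7, 12345)
print(len(node), unpack_node(node), tiles)
```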

Would This be Faster?
Given the maximum of 256 lights, have a fixed 32-bytes per tile which is a bitmask of the lights intersecting the tile. Raster the lights via AtomicOr() with no return (no latency to hide), setting the bit in the bitmask. At deferred shading time per workgroup (workgroup shades a tile), first load the bitmask into LDS, then in groups of lights which fit in the LDS, do a ballot based scan of the remaining lights in the bitmask, load the active bit lights into the LDS, then switch to shading pixels with the light data, then repeat.
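A CPU-side sketch of that bitmask idea, just to show the data flow: 256 lights is 256 bits, hence 32 bytes per tile. The Python integer stands in for the 32-byte mask, atomic_or stands in for the GPU AtomicOr(), and the batching stands in for loading groups of lights into LDS. Function names and the batch size are made up.

```python
MAX_LIGHTS = 256                      # 256 bits = 32 bytes per tile

def atomic_or(tile_masks, tile, light_index):
    """Stand-in for a no-return AtomicOr(): mark light as touching the tile."""
    if not 0 <= light_index < MAX_LIGHTS:
        raise ValueError("light index out of range")
    tile_masks[tile] |= 1 << light_index

def light_batches(mask, batch_size):
    """Scan the set bits of a tile mask, yielding groups that fit in local memory."""
    batch = []
    while mask:
        lowest = mask & -mask                  # isolate lowest set bit
        batch.append(lowest.bit_length() - 1)  # its index is the light index
        mask ^= lowest                         # clear it
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

tile_masks = [0] * 4
for light in (3, 200, 64, 3):                  # duplicates are harmless with OR
    atomic_or(tile_masks, 0, light)

for batch in light_batches(tile_masks[0], 2):
    print("shade tile 0 with lights", batch)
```

The appeal versus the linked list is that rastering lights needs no allocation and no returned value from the atomic, at the cost of losing the per-light depth bounds.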


NVIDIA's Project Denver

NVIDIA Blogs on Project Denver

I'm reading this press release as follows,

ARMv8 64-bit Processor.
Hardware decode of ARMv8 instructions (see above image).
Seems like similar area: dual core 3-way A15 @ 2.3 GHz -- single core 7-way Denver @ 2.5 GHz.
Run-time "Dynamic Code Optimization" into a 128MB chunk of DRAM backed by a 128KB cache.
7-way looks like (see above image): 2 Load/store units, 2 FPUs, 2 Integer ALUs, 1 Branch unit?

Wronski: Volumetric Fog SIGGRAPH 2014
