20140717

Bad Industry Humor: Computer Engineering Hall of Shame

4K HDTV - Timing is critical, need to sell these 4K TVs before the next generation of 300 dpi smart-phone-addicted permanently-nearsighted kids realize the TV is not in focus when sitting on the couch. That way we can use the proceeds from the 4K gold rush to invest in the Lasik industry to ready the next generation for 8K. Which needs to happen before Google's self driving cars program takes off, because after that no one will need to pass an eye exam.

Loading Libraries Outside The Lower 32-bit Address Space on a 64-bit Machine - Forced 64-bit address for any library call, because given the slop in modern software engineering, looks like we are going to need over 4GB of executable code really soon.

Operating Systems Without Raw Human Input Interfaces for Full Screen Applications - Because we need to protect the world from viruses and hackers that read PS3 controller input, just in case you might want to enter your etrade password using a game controller instead of the keyboard.

Dynamic Linking Everything - At 20GB/$ for drive space, saving a few cents by not static linking helps ensure my hundred other Linux comrades can fork this beautiful package manager then promptly dump it after they get infatuated by forking something else.

Sendfile in Linux 2.6.x - Because syscall backwards compatibility does not matter now that we have these great package managers, and because only an idiot would actually want OS support for zero-copy file copy.

Graphics Drivers 100's of MB in Size - Because we are secretly getting into the game distribution business, one set of shaders at a time.

ELF - If "hello world" is not bigger than 4KB, then clearly the system is not complicated enough.

Compositing Window Managers - Because like bell-bottoms, it is going to be a while before rectangular windows without animation or transparency come back into fashion.

Systems Mounting Everything Read-Write with Hundreds of Unknown OS Background Jobs - Because after manufacturing has left the country, the US is readying for the days when the only export it has left is outsourcing its massive IT and security infrastructure.

Cloud Serving Games - Because with years of desensitization to dropped frames, targeting and missing 30 Hz, non-game-mode HDTVs, and 1080-pee Youtube, this newer generation can no longer tell the difference.

Chains of Random Access First Read Page Faults When First Loading Large Dynamically Linked Applications - Because virtual memory is super important to handle the complexity we built on after designing for machines with only 1MB of memory.

Computing Devices Which are Not Allowed to Have Compilers or Desktops - Because it is important for a device to be limited to just one purpose, like making phone calls. And besides we don't have any patents useful for extorting the wireless keyboard and mouse industry.

HDR HDTVs - If we play our cards right, we can put the tanning bed industry out of business.

Private Class Members - After successfully factoring out the ability to understand the code via awesome feats of abstraction and templatization, I need to protect it from being used or changed.

ELF - Why use a 32-bit index for a symbol? We are CS majors, and we need to put all this advanced string management and hashing knowledge into something. Of course we slept through the factoring part of basic algebra: why do at compile time on one build machine what you can do at run-time on every phone instead!

Software Complexity - Is like peeing into a small pool: when only one person does it, it is awful, but now that everyone pees in the pool, children just grow up thinking that pool water was always yellow.

Phones 2084 - Government mandates loan insurance to cover the loan parents take out to cover the IP patent pool amortized into the cost of the device required for their children to connect to the internet.

Graphics Conferences 2099 - We charge people early just for thinking about giving a talk at the conference, but that is ok because we simultaneously send out spam for thought insurance so you do not have to worry about having a thought which has already been patented.

20140715

Infinite Projection Matrix Notes

If you are reading this and have to deal with GL/GLES vendors not supporting DX style [0 to 1] clip space, please talk to your hardware/OS vendors and ask for it!!

Clip Coordinates (CC)
Output of the vertex shader,
GL: gl_Position
DX: SV_Position



Normalized Device Coordinates (NDC)
The following transform is done after the vertex shader,

NDC = float4(CC.xyz * rcp(CC.w), 1.0);

On both GL and DX NDC.xy are [-1 to 1] ranged. On DX NDC.z is [0 to 1] ranged. On GL NDC.z is [-1 to 1] ranged and this can cause precision problems (see below in Window Coordinate transform). Anything outside the range is clipped by hardware unless in DX11 DepthClipEnable=FALSE, or in GL glEnable(GL_DEPTH_CLAMP) is used to clamp clipped geometry to the near and far plane.


Window Coordinates (WC) : DX11
The following transform is done in hardware,

float3 WC = float3(
NDC.x * (W*0.5) + (X + W*0.5),
NDC.y * (H*(-0.5)) + (Y + H*0.5),
NDC.z * (F-N) + N);


With parameters specified by RSSetViewports(),

X = D3D11_VIEWPORT.TopLeftX;
Y = D3D11_VIEWPORT.TopLeftY;
W = D3D11_VIEWPORT.Width;
H = D3D11_VIEWPORT.Height;
N = D3D11_VIEWPORT.MinDepth;
F = D3D11_VIEWPORT.MaxDepth;


Fractional viewport parameters for X and Y are supported on DX11. DX10 does not support fractional viewports, and DX11 feature level 9 implicitly casts to DWORD internally, ensuring fractional viewports will not work. Not sure if DX11 feature level 10 supports fractional viewports or not. Both N and F are required to be in the [0 to 1] range. It is better to not use the viewport transform to modify depth and instead fold any transform into the application projection shader code. For best precision, N should be zero.
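
For reference, a minimal sketch of filling out the viewport with N=0 and F=1 (assuming d3d11.h is included, and that an ID3D11DeviceContext* named context plus backBufferWidth/backBufferHeight already exist; the names are just for illustration),

D3D11_VIEWPORT viewport;
viewport.TopLeftX = 0.0f;
viewport.TopLeftY = 0.0f;
viewport.Width = (float)backBufferWidth;
viewport.Height = (float)backBufferHeight;
viewport.MinDepth = 0.0f; // N=0 for best precision
viewport.MaxDepth = 1.0f; // F=1, fold any depth remap into the projection instead
context->RSSetViewports(1, &viewport);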


Window Coordinates (WC) : GL
The following transform is done in hardware,

float3 WC = float3(
NDC.x * (W*0.5) + (X + W*0.5),
NDC.y * (H*0.5) + (Y + H*0.5),
NDC.z * ((F-N)*0.5) + (N+F)*0.5);



Window Coordinates (WC) : OpenGL 4.2 and OpenGL ES 2.0
These versions of GL have the following form to specify input parameters,

glDepthRangef(GLclampf N, GLclampf F); // ES and some versions of GL
glDepthRange(GLclampd N, GLclampd F); // GL
glViewport(GLint X, GLint Y, GLsizei W, GLsizei H);

Note the inputs to glDepthRange*() are clamped to [0 to 1] range. This ensures NDC.z is biased by a precision destroying floating point addition. The defaults N=0 and F=1 result in,

WC.z = NDC.z * 0.5 + 0.5;

When computed with standard 32-bit floating point, I believe this in theory has only just enough precision for a 24-bit integer depth buffer.


Window Coordinates (WC) : OpenGL 4.3 and OpenGL ES 3.0
These versions of OpenGL dropped the clamp type resulting in,

glDepthRangef(GLfloat N, GLfloat F);

Both specs say, "If a fixed-point representation is used, the parameters n and f are clamped to the range [0;1] when computing zw". So in theory for floating point depth buffers the following can be specified,

glDepthRangef(-1.0f, 1.0f);

Which results in no precision destroying floating point addition in the Window Coordinate transform,

WC.z = NDC.z * 1.0 + 0.0;

However at least some vendor(s) still do the clamp anyway. To get around the clamp in GL one can use GL_NV_depth_buffer_float which provides glDepthRangedNV() which is supported by both AMD and NVIDIA.
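
A minimal sketch of selecting the depth range at init time (HasGLExtension() is a stand-in for however the application queries the extension string, not a real GL call),

if (HasGLExtension("GL_NV_depth_buffer_float"))
    glDepthRangedNV(-1.0, 1.0); // unclamped, removes the *0.5+0.5 bias entirely
else
    glDepthRangef(0.0f, 1.0f);  // fall back to the default clamped range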


Projection Matrix
General form,

X 0 0 0
0 Y 0 0
0 0 A 1
0 0 B 0


The A=0 case provides the highest precision. The next highest precision comes from A=1 or A=-1; failing that, ideally choose A to have an exact representation in floating point. In the low precision cases, it is better for precision to break the model view projection matrix into two parts, at the cost of just one extra scalar multiply-accumulate operation overall,

// Constants
float4 ConstX = ModelViewMatrixX * X;
float4 ConstY = ModelViewMatrixY * Y;
float4 ConstZ = ModelViewMatrixZ;
float2 ConstAB = float2(A, B);

// Vertex shader work (Vertex is the float3 model-space position)
float3 View = float3(
dot(Vertex, ConstX.xyz) + ConstX.w,
dot(Vertex, ConstY.xyz) + ConstY.w,
dot(Vertex, ConstZ.xyz) + ConstZ.w);

float4 Projected = float4(
View.x,
View.y,
View.z * ConstAB.x + ConstAB.y,
View.z);



Projection Matrix : Infinite Reversed (1=near, 0=far)
This is both the fastest and highest precision path. For DX, or GL using glDepthRangedNV(-1.0, 1.0),

X 0 0 0
0 Y 0 0
0 0 0 1
0 0 N 0


This can be optimized to the following,

// Constants
float4 ConstX = ModelViewMatrixX * X;
float4 ConstY = ModelViewMatrixY * Y;
float4 ConstZ = ModelViewMatrixZ;
float ConstN = N; // Ideally N=1 and no constant is needed.

// Vertex shader work
float4 Projected = float4(
dot(Vertex, ConstX.xyz) + ConstX.w,
dot(Vertex, ConstY.xyz) + ConstY.w,
ConstN,
dot(Vertex, ConstZ.xyz) + ConstZ.w);
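
For reference, a minimal C++ sketch of how X, Y, and N might be derived (assuming a vertical field of view in radians, aspect = width/height, and the row-vector [0 to 1] convention of the matrix above),

#include <math.h>

// fovY = vertical field of view in radians, aspect = width/height,
// nearZ = view-space distance to the near plane.
void InfiniteReversedProjParams(float fovY, float aspect, float nearZ,
                                float* X, float* Y, float* N)
{
    *Y = 1.0f / tanf(fovY * 0.5f); // vertical scale
    *X = *Y / aspect;              // horizontal scale
    *N = nearZ;                    // NDC.z = N/View.z: 1 at the near plane, 0 at infinity
}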


For GL without glDepthRangedNV(-1.0, 1.0),

X 0 0   0
0 Y 0   0
0 0 -1  1
0 0 2*N 0


Which can be optimized to the following,

// Constants
vec4 ConstX = ModelViewMatrixX * X;
vec4 ConstY = ModelViewMatrixY * Y;
vec4 ConstZ = ModelViewMatrixZ;
float ConstN = 2.0 * N; // Ideally N=1 and no constant is needed.

// Vertex shader work (Vertex is the vec3 model-space position)
vec4 Projected;
Projected.w = dot(Vertex, ConstZ.xyz) + ConstZ.w;
Projected.xyz = vec3(
dot(Vertex, ConstX.xyz) + ConstX.w,
dot(Vertex, ConstY.xyz) + ConstY.w,
ConstN - Projected.w);



References
http://www.humus.name/Articles/Persson_CreatingVastGameWorlds.pdf
http://www.geometry.caltech.edu/pubs/UD12.pdf
http://outerra.blogspot.com/2012/11/maximizing-depth-buffer-range-and.html

20140712

VR Topics: Racing Scan-Out + Filtering/Noise

Process

(1.) [A] Do view independent work.

(2.) Read the latest prediction of head position and orientation for the time at which the frame gets displayed. This reads from a client-side persistent mapped buffer, then writes into a uniform buffer on the GPU (see the sketch after this list). A real-time background CPU job updates the prediction each time new sensor data arrives.

(3.) [B] Do view dependent work which all rendering depends on.

(4.) Render frame into the front buffer, racing scan-out. Must render in the coarse granularity order in which the front buffer gets scanned out. Given that raster order is vendor dependent, this process involves splitting the frame into some number of stacked blocks (where block width is the width of the frame). Each block gets rendered independently in scan-out order. Blocks must be large enough that they can fully fill the GPU with work.
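
A minimal sketch of the prediction buffer from step (2.), assuming a GL 4.4 context (ARB_buffer_storage); the Prediction struct and variable names are made up,

struct Prediction { float position[3]; float orientation[4]; };

// Init: create a persistently mapped buffer the background CPU prediction job writes into.
GLuint buffer;
glGenBuffers(1, &buffer);
glBindBuffer(GL_UNIFORM_BUFFER, buffer);
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_UNIFORM_BUFFER, sizeof(Prediction), NULL, flags);
Prediction* mapped = (Prediction*)glMapBufferRange(GL_UNIFORM_BUFFER, 0, sizeof(Prediction), flags);

// Background CPU job: overwrite with the newest prediction whenever sensor data arrives.
// *mapped = latestPrediction;
// Per frame, right before the view dependent work, copy into the uniform buffer the shaders
// actually read (for example with glCopyBufferSubData()).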

Latency
Racing scan-out might be good for a little over half a frame of latency reduction in practice.

Below is an overly simplified example display frame (timing is made up, I removed v-blank, etc). Eight blocks are used. Refresh rate is 8 ms/frame, so scan-out per block is 1 ms. Prediction is updated every millisecond. The display flashes 1 ms after scan-out finishes, and stays lit for just 2 ms, followed by 6 ms of darkness. A frame has 4/3 ms of view independent work, and 4/3 ms of view dependent work before rendering the frame. Each block takes 2/3 ms to draw. The numbers are all made up to enable drawing an easy ASCII diagram of what is going on,

444555666777000111222333444555666777000111222333 scan-out
__AAAABBBB0011223344556677______________________ GPU work for one frame
__AAAA__________________________________________ view independent work
_____||_________________________________________ read prediction
______BBBB______________________________________ view dependent work
__________00____________________________________ GPU work for block 0
____________000_________________________________ scan-out for block 0
________________________77______________________ GPU work for block 7
_________________________________777____________ scan-out for block 7
_______________________________________XXXXXX___ global display
___PPP__________________________________________ prediction jitter
___---------------------------------------______ latency
444555666777000111222333444555666777000111222333 scan-out


In this made-up example, total latency would be at best around 13 ms for an 8 ms scan-out (125 fps). If back-buffer rendering, latency will be longer than 18 ms for an 8 ms scan-out,

000111222333444555666777000111222333444555666777000111222 scan-out
AAAABBBB0011223344556677_________________________________ GPU work for one frame
_______________________||________________________________ swap
________________________------------------------_________ scan-out
_PPP_____________________________________________________ prediction jitter
___________________________________________________XXXXXX global display
_-----------------------------------------------------___ latency


Future Hardware Wish-list
Drive scan-out at the peak rate of the bus and sleep, instead of driving scan-out at the display rate. The faster the better.

Implementation Challenges
Assuming ray-tracing instead of rasterization (can ray-trace in the warped space), a fully pull-model based engine (just re-submit the same commands each frame), and a relatively fixed cost per frame, there are still challenges left.

Synchronizing the CPU to the GPU with front-buffer rendering is not well supported by any API. Need something to stall the "read prediction" step until a set amount of time before scan-out of the next frame. For GPUs which support volatile reads which pass L2 and get through to the system bus, one could poll a time value written by a background CPU thread. Will need some way to calibrate this system, perhaps as hacky at first as the user dialing in the delay until just before any tearing happens.

Post processing must happen while rendering each block: no screen-space effects. This means bloom needs to be replaced in the common case when diffuse bloom is being used to fake atmospheric scattering effects. For quality filtering, each block must have at least a 2 pixel stencil. Dealing with chromatic aberration is the larger problem, which requires an even larger stencil.

One option is to go monochrome in green only, removing the need for any chromatic aberration based filtering,




Filtering and Noise
If DK2 is around 2Mpix/frame at 75Hz, that is a lot of pixels to push. Pixel quality can be broken down into various components,

(a.) Antialiasing. Does geometry snap to pixels?
(b.) Sharpness. What is the maximum frequency of detail in the scene?
(c.) Resolution. What granularity of pixels do geometric edges move by?

With ray-marching, sharpness can be directly related to LOD, or how close one gets to the actual surface. Resolution is relatively independent of the number of rays shot per frame. With a high quality sample pattern it is easy to resolve to a frame which has more pixels than the number of rays shot per frame. With VR, I'd argue that antialiasing and resolution are more important than sharpness because sub-pixel motion is critical for depth perception. On top of that, textures are virtually useless because they look like images painted on toys instead of real geometry. In the spectrum of options, GPU ray-tracing based methods have serious advantages over GPU raster methods simply because of the flexibility of sample distribution. With ray based methods, sharpness ends up being a function of how much GPU perf is available. Lower perf can mean a less sharp frame, but still native spatial resolution to get the same sub-pixel parallax.

I'm highly biased towards using something which feels like temporal film grain to both remove the illusion of rendering perfection and mask rendering artifacts. Probably as a holdover from the fact that I use a Plasma HDTV as a primary monitor, I stick to grain around 1.5 pixels in diameter. Plasma HDTVs use temporal error diffusion, so single-pixel noise can result in artifacts. Not sure yet what is best for VR, but guessing the grain should be at or slightly higher than the frequency of detail in the scene.

An example of low sharpness, but full resolution, high quality antialiasing, and grain,

20140625

Unreal Engine 4 "Rivalry" Demo -- Google I/O 2014

Normally I don't talk about work related stuff here, but this is quite awesome. The "Rivalry" demo is the GL ES 3.1 AEP path running the DX11 based desktop UE4 engine. Same crazy fat G-buffer, deferred shading, reflection probes, screen space reflections, temporal AA algorithm, etc. Using the scalability options that desktop has.

All of this on a tablet platform: Tegra K1.

I'm really happy that Epic has the balls to just capture the YouTube video below exactly how it looks on device, WYSIWYG! They could have just dumped out still frames generated with massive super-sampling and all settings turned up to GTX Titan level, but they did not. I know this because all the artifacts of the depth of field and motion blur are there, stuff I'm hoping to get around to fixing later this year, and unfortunately for the sake of time, my fix for the screen space reflection flash on scene changes didn't make it into that capture (it actually looks better on device).

20140623

Sad Day for High-End ARM

Nvidia Abandons 64-bit Denver Chip for Servers - I guess this was to be expected, as HPC likely only exists when highly subsidized by the volume of high-end desktop products. After Microsoft was successful in completely destroying the market for desktop ARM chips by crippling Windows on ARM (making it Metro only): no volume desktop chips = no cost-competitive HPC products. The loser in this market is the consumer. 64-bit ARM for desktop would have been an awesome platform.

20140622

No Traditional Dynamic Memory Allocation

To provide some meaning to a prior tweet... Outside of current and prior day jobs, I never use traditional dynamic memory allocation. All allocations are done up front at application load. On modern CPUs the cost of virtual memory is always there, so might as well use it. At load time, allocate virtual address space for the maximum practical memory usage of the various data in the application. The virtual address space is initially backed by a read-only common zero-fill page (no physical memory allocated). On write, the OS modifies the page table and backs the written pages with unique zero-filled physical pages. Can also preempt the OS write-fault and manually force various OSes to pre-back used virtual memory space (for real-time applications).
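
A minimal Linux/POSIX sketch of this pattern (on Windows the analog would be VirtualAlloc with MEM_RESERVE/MEM_COMMIT); the sizes are just for illustration,

#include <string.h>
#include <sys/mman.h>

void* ReserveAppMemory()
{
    // At load: reserve address space for the maximum practical usage. Anonymous pages
    // are zero-fill and only get unique physical backing on first write.
    size_t maxBytes = (size_t)64 << 30; // e.g. 64 GB of address space
    void* base = mmap(0, maxBytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    // Optional for real-time use: pre-fault the part that will actually be used,
    // paying the write-fault cost up front instead of during the frame.
    size_t usedBytes = (size_t)1 << 30;
    memset(base, 0, usedBytes);
    return base;
}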

The reduction of complexity, runtime cost, and development cost enabled by this practice is massive.

This practice is a direct analog to what is required to do efficient GPU programming: lay out data by usage locality into 1D/2D/3D arrays/tables/textures, with indexes or handles linking different data structures. Programs are designed around transforms of data, by some mix of gather/scatter. Bits of the larger application are cut up into manageable parallel pieces which can be debugged/tested/replaced/optimized individually. Capture and replay of the entire program is relatively easy. Synchronization is factored into coarse granularity signals and barriers. Scaling this development up to a large project requires an architect who can lay out the high-level network of how data flows through the program. Then sub-architects who own self-contained sub-networks. Then individuals who provide the programs and details of parts of each sub-network.
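
A minimal sketch (invented names) of what that layout tends to look like,

#include <stdint.h>
#include <vector>

// Data grouped by usage locality into flat tables, linked by plain indexes instead of
// pointers, so whole tables can be transformed, captured, and replayed independently.
struct Particles
{
    std::vector<float> posX, posY, posZ;  // hot: touched by the simulation transform
    std::vector<uint32_t> materialIndex;  // cold: handle into the material table below
};

struct Materials
{
    std::vector<float> emission;          // gathered via materialIndex when shading
};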

My programming language of choice to feed this system is nothing standard, but rather something built around an instant run-time edit/modify/test cycle which works from within the program itself. Something similar to Forth (a tiny fully expressive programming language which requires no parsing) which can also express assembly, run-time assembled into the tiny programs which process data in the application network.

It should be obvious that the practices which enable fast shader development and fast development of the GPU side of a rendering pipeline can directly apply to everything else as well, and also elegantly solve the problem in a way which scales on a parallel machine.

20140621

Filtering and Reconstruction : Points and Filtered Raster Without MSAA

Just ran across these old screen captures (open images in a new window, they are likely getting cropped), stuff I'm no longer working on, but might re-invest in sometime in the future...

Reconstruction with Points or Surfels

The above shot is an HDR monochrome image (converted to color) with a little grain and bloom, of a fractal with lots of sub-pixel geometry rendered in a 360-degree fisheye with point based reconstruction. There is an average of 2 samples/pixel (using 1 bin/pixel but 2x the screen area). The GPU keeps a tree of the scene representation which is updated based on visibility, and then bins all the leaf nodes each frame. Finally there is a reconstruction pass which takes {color, sub-pixel offset} for all the binned points and then reconstructs the scene. Reconstruction is a simple gaussian filter which does a weighted average of a neighborhood of points with weights based on distance from pixel center. For HDR, pre-convert to color*rcp(luma+1) as well to remove the HDR fire-fly problem (not energy conserving), remembering to un-convert after filtering. Binning of points uses some stratified jitter so that each frame gets a slightly different collection of points for filtering. Did not use any temporal filtering. The result is a quality analog-feeling reconstruction which has soothing temporal noise. Need about 2 samples/pixel for good quality if not using temporal filtering. With really good temporal filtering (see Brian Karis's talk at Siggraph) only need on average one shaded sample/pixel/frame.
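
A minimal sketch of the rcp(luma+1) conversion (the luma weights here are an assumption, not necessarily what was used),

struct Color { float r, g, b; };

static float Luma(Color c) { return 0.2126f*c.r + 0.7152f*c.g + 0.0722f*c.b; }

// Before the weighted average: scale by 1/(luma+1) so fire-flies cannot dominate.
static Color ToFilterSpace(Color c)
{
    float s = 1.0f / (Luma(c) + 1.0f);
    return Color{c.r*s, c.g*s, c.b*s};
}

// After filtering: luma of a converted value is luma/(luma+1), so invert with 1/(1-luma').
static Color FromFilterSpace(Color c)
{
    float s = 1.0f / (1.0f - Luma(c));
    return Color{c.r*s, c.g*s, c.b*s};
}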

Binning for the above image was done using standard point raster, everything running in the vertex shader, with a near-null fragment shader. This works ok on NVIDIA GPUs, but I would not advise it for AMD. Future forward, with GPUs which have 64-bit integer min/max atomics and the ability to run a full cache-line of atomics in one clock, the binning process should be quite fast. Bin layout in cache-lines must have good 2D locality (atomics on texture not buffer). Evaluate points in groups which have good spatial locality. Ideally should hide shading operations in the shadow of atomic throughput. If there is not enough work to fill that shadow, then try pre-resolving atomic collisions in shared memory. Pack pixels into 64-bits {MSB logZ, LSB color and a few bits of sub-pixel offset}. The color is stored scaled by rcp(luma+1) and uses dithering in the conversion.
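
A minimal sketch of that 64-bit packing (how color and sub-pixel offset fit in the low 32 bits is left abstract; assuming logZ grows with distance so an atomic min keeps the nearest point per bin),

#include <stdint.h>
#include <string.h>

uint64_t PackBin(float logZ, uint32_t colorAndOffset)
{
    // IEEE bits of a non-negative float compare in the same order as the float itself.
    uint32_t zBits;
    memcpy(&zBits, &logZ, 4);
    return ((uint64_t)zBits << 32) | (uint64_t)colorAndOffset;
}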

Points can early out by fetching destination pixels and checking if they are more distant than the current framebuffer result. Just make sure to schedule the load early to avoid the latency problem. Can extend this theory to scene traversal and do a hierarchical z-buffer pre-pass: render anti-conservative points into the hierarchy in a compute shader. This process would involve culling paths in the scene graph, and appending a list of nodes (point groups) which need to get rendered. Load balancing this traversal is a challenge which is worthy of another blog post.

Common to both point based techniques and ray-marching with fixed cost/frame algorithms is that, depending on traversal cost, some percentage of screen pixels might have holes or might have fractional coverage. It is possible to hole fill leveraging temporal reprojection, also a topic worthy of another blog post.

Reconstruction with Rotated Rendering and 2x Super-Sampling


Another monochrome image (converted to color), this time of a lot of unlit boxes. This image is also using a slight image warp and some out-of-focus vignette. This was generated by rendering an average of 2x super-sampling, but rendering the frame rotated. This removes the garbage sample pattern of regular non-rotated rendering. The result with a proper reconstruction filter (I used gaussian for performance with weights based on sample distance from pixel center) is something which is similar to 3xMSAA (horizontal and vertical gradients have three visible primary steps if the frame rotation angle is chosen correctly). There is a high cost in memory for rendering rotated. However the pre-z-fill (set to near plane) pass to mask out the invisible extents is super fast.
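
A minimal sketch of where the extra memory goes: the axis-aligned render target must bound the rotated screen rectangle (W, H in pixels, A the frame rotation in radians; names are just for illustration),

#include <math.h>

void RotatedTargetSize(float W, float H, float A, float* outW, float* outH)
{
    float c = fabsf(cosf(A));
    float s = fabsf(sinf(A));
    *outW = W*c + H*s;
    *outH = W*s + H*c;
}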

20140612

Oculus, the VR Revolution, and Dual GPUs

I first tried the Oculus DK1 when I worked at NVIDIA, playing UDK Game (an FPS) for about an hour or so attempting to make myself vomit. Let me back up a minute and note that the first time I tried a serious 3D gaming session was also at NVIDIA years earlier, playing Team Fortress 2 with the devtech team using a 3D monitor. Being a 3D noob I did what anyone who normally plays about a foot away from an HDTV would do. I played with my face as close to the screen as possible with the separation set to some extreme setting to compensate. The next morning I woke up and experienced vertigo for the first time. Fast forwarding back to the DK1, instead of filling a bucket, I felt nothing. Playing an FPS in Oculus had no effect on me.

At Epic I've tried a few of the Epic Oculus demos, first the Roller Coaster demo on a DK1. To my surprise I actually felt a little presence off the initial drop. Also on a DK1 I tried the Elemental demo where the particle effects, also to my surprise, felt real. Eyes tracking the fast moving particles momentarily removed the sense of giant pixels, and convinced my mind that I was somewhere else. Later I tried demos on the DK2: the Board Game and the Couch Knight demo. It is striking how much better the DK2 experience is. Beyond that, there was this realization that something as simple as playing an avatar (the Knight) from a 3rd person perspective (the Couch) is amazingly fun in VR.

Fast forward to this E3. The 3rd person game is going to have a major comeback with VR. Developers are starting to unearth the formulas which work really well for the next generation of gaming, the first major consumer VR generation.



Finally a Great Use for Dual GPUs
The common case for dual GPUs for Oculus won't be to render separate stereo frames. Rather it will be to have two Oculus headsets plugged into one machine. Want local 2 player: have a two GPU box. Want local 3 player: have a three-way GPU box. The next Neo Geo console of gaming is a multi-high-end-GPU Oculus machine. 3rd person local multi-player is going to be awesome!

20140611

The 24 Hours of Bindless Continues

More Bindless - Continuing the bindless chain blog run, great to think through this stuff in more detail...

"What I’m getting at is that two 4x loads and one 8x load might end up having the same cost with enough occupancy" - If there was no advantage to 8x or 16x block loads, then either an engineer would leave it out of the design, or it is in there for future chips which might have more than 16B/clk of transfer out of K$ (since lines are 64B). Guessing the 16B/clk of transfer is really a limit of the scalar register file write ports. Guessing the advantage of larger than 4x block loads might be for cases like, issue an 16x block load, then issue a longer latency operation like a texture fetch. Meaning the 16x block load can continue writing to the register file in parallel with the texture fetch which at the same time only needs a read from the register file. If instead the code issued four 4x reads, it would delay the texture fetch.

"ideal is probably to allow the texture/sampler descriptors to be freely mixed" - The end game of what you are suggesting ends up being that all vendors directly expose their own low-level API to their unique hardware design. Then let the developers decide how to best use it. I do like that idea, I just don't believe I could convince the vendors to do that. Vendors already have that option given the ability to write extensions. Still need some common base which is portable and fast on all vendors for those who don't have the ability to target all hardware and who want to target future hardware which they don't know about yet.

"Today’s bind model basically has the app providing arrays of pointers which get chased to build contiguous descriptor blocks. Instead, the app could just provide a contiguous descriptor block" - GCN is special in that descriptors are loaded into the scalar register file. Think about the other possible GPU design options which don't involve loading descriptors into a shader register file. All of those possibilities take either an index or offset or pointer of a descriptor in the texture instruction. Which means that the hardware is doing the indirection already even when you don't use bindless. The indirection is free in that case. If there is some performance issue to worry about it might be the difference of loading a constant index/offset/pointer into a register vs using an immediate in the texture instruction instead. If the constant access can be used in place of a register in the opcode, or dual issued loaded for free, or the shader is not ALU bound, then it would not matter.

"Binding is just moving the table pointer." - Again the idea here, not including GCN, is to leverage the legacy path of immediate index textures. So every state change gets a fresh mini-table built inside the giant table. Often devs would just ring buffer these mini-tables inside the giant table. Those looking to actually remove CPU overhead, given a large enough giant table size, would just cache those mini-tables and not rebuild them (for the most part, a majority of what is drawn in one frame was drawn in the prior frame). A given texture descriptor might be duplicated in multiple mini-tables (different combinations of textures per table, or even different orderings). This design requires keeping track of all those descriptor copies and patching all of them on resource streaming. This I'm not wild about.

20140609

E3 Part 1

My must-play list (below). Must be a sign of the times, the majority are indie titles. The Order 1886 definitely sets the bar for visual quality thus far...

ABZU
Below
Chasm
Hyper Light Drifter
Inside
No Man's Sky
The Order 1886
The Witness