20160626

FPGA Processor Rethinking : Part 2

Continued from prior post...

Why Not MIMD
Jumping to the conclusion first, I still wouldn't adopt a MIMD architecture for a highly parallel machine. This post is going to attempt to describe why, and outline the rough design that I've been pursuing for a SIMD machine.

INSTRUCTION ISSUES - With MIMD the amount of instruction RAM per core would be tiny. GRVI in the 8 CPU cluster configuration from the GRVI Phalanx Paper has just 4KB per core. Ideally instructions need to be small, with an ISA with good code density (like Forth, like the J1), but ultimately this will drop ALU IPC. The J1 Paper provides an example: 35% of instructions are ALU. I'd rather be able to issue loads and stores and DSP ALU (d=a*b+c) all in the same cycle, which requires VLIW. SIMD amortizes the large instructions, with a broadcast of the decoded control signals, or a hierarchical instruction decode across the fan out (based on whatever is optimal). I'd like to also have parallel ability to pack and unpack sub-words (to maintain good information density in the core's local scratch RAM), as well as broadcast uniform constants.

ROUTING ISSUES - Or rather timing issues. The Hoplite Paper outlines one of the core problems. Routing the communication of random MIMD jobs requires throttling message injection to avoid saturating the network, which would otherwise result in some drastically horrible message latencies. Seems like Hoplite requires a max message injection rate of roughly 0.5/max(gridWidth,gridHeight) (seems like the router needs to remain roughly half empty to avoid a latency cliff). Optimizing for machines with highly volatile data-dependent run-time performance can be a nightmare. SIMD ensures the work synchronization required for highly efficient communication, as well as makes it easy to implement a majority of the common forms of parallel communication.
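
To get a feel for how quickly that ceiling drops with grid size, here is a minimal sketch of the rough estimate above. The 0.5 fill factor is the guess from this post, not a figure taken from the Hoplite Paper, and `max_injection_rate` is a hypothetical helper name.

```python
def max_injection_rate(grid_width, grid_height, fill=0.5):
    """Rough ceiling on messages injected per node per clock.

    Assumption: the router has to stay about half empty (fill=0.5).
    That is the guess from the text, not a published Hoplite number.
    """
    return fill / max(grid_width, grid_height)

for w, h in [(4, 4), (8, 8), (16, 8)]:
    print(f"{w}x{h} grid: {max_injection_rate(w, h):.4f} msgs/node/clk")
```

Under this estimate, doubling the longer grid dimension halves the injection rate each node can sustain.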

Current Design Thoughts
An 18-bit/word massively wide SIMD machine driven by VLIW with the following per lane,

One DSP
One 1024 entry x 18-bit Block RAM (2 ports both either Read or Write)
One 32 entry x 18-bit Register File (32x2Q SLICEM x 9)
One hyper-cube router


Rough machine limits/lane if targeting a 512 lane machine (using same FPGA from prior post),

11552 SLICEM / 512 = 22
22098 SLICEL / 512 = 43


To place this into context, each SLICEL has the ability to MUX four 4-bit values into one 4-bit value, given a 2-bit control signal. Each 18-bit word requires 4.5 SLICEs to implement a 4:1 choice. That gives 43/4.5 or roughly nine 4:1 choices for the SLICELs for driving connections between {reg-file, BRAM, DSP, and router}. Not including using left over SLICEMs, and not including all the other misc things required. Likely impossible to hit a 512 lane target, won't know until I move on to implementing the cores.
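
The per-lane budget above is simple enough to recompute for other lane counts. A small sketch, using only the slice totals and the 4-bit-per-SLICEL mux figure quoted in this post (`lane_budget` is a hypothetical helper, and it ignores every other use of the slices):

```python
SLICEM_TOTAL = 11552    # slices that support distributed RAM
SLICEL_TOTAL = 22098    # logic-only slices
WORD_BITS = 18
MUX_BITS_PER_SLICE = 4  # one SLICEL muxes four 4-bit values into one 4-bit value

def lane_budget(lanes):
    """Return (SLICEM/lane, SLICEL/lane, whole 18-bit 4:1 muxes/lane)."""
    slicem = SLICEM_TOTAL // lanes
    slicel = SLICEL_TOTAL // lanes
    slices_per_mux = WORD_BITS / MUX_BITS_PER_SLICE   # 4.5 slices per 4:1 choice
    return slicem, slicel, int(slicel / slices_per_mux)

print(lane_budget(512))   # the 512 lane target
print(lane_budget(256))   # fallback if 512 lanes will not fit
```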

BIT-PIPELINED? - Prior was thinking about doing a bit-pipelined machine (1-bit per lane, 16 lanes grouped). Each SLICE can be configured into 4 pairs of 5:1 LUTs each which can make a full adder. The bit-pipelined multiply would require 4 SLICEs per 16 1-bit lanes for each adder. Easy to burn the entire SLICEL budget on the 3 stage pipeline for just an N-bit x 8-bit multiplier (not including the control structures). And would need a huge amount of instructions to drive that machine. Also the routing required roughly 18x the switch registers (compared to current thinking). So tabled the idea for now.

ROUTER - An analogy for this "router" is to think of a railroad switching station, with a set of parallel tracks or "lanes" each connected by a set of switches which can exchange trains between two tracks, but with switches that require lanes to cross over each other (as in a bridge over tracks not involved in the exchange). The trains in this case are really small, ideally one word of information, but likely sub-words due to area (more on this later).

The general idea is to set up the switching station based on the data-driven needs of the communication, then allow the data payload to flow through the switch. For instance when sorting a packet per lane, for each of the log2(lanes) sorting sub-passes of a bitonic sort, the sorting key gets passed through the switching station, setting switches, then the data payload passes through at a cost proportional to size of the payload. The switching station takes out what would otherwise be log2(lanes/pow2(sub_pass)) passes per sub-pass for the payload transfer for the bitonic sort.
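
A software sketch of the "keys set the switches, payload follows" idea, using a textbook bitonic sorting network. This is not the hardware design: it is a sequential model in which every compare-exchange crosses exactly one hypercube edge (`i ^ j` with `j` a power of two), and the function names are made up for illustration.

```python
def bitonic_switch_settings(keys):
    """Sort a power-of-two number of keys and record every swap decision."""
    n = len(keys)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    a = list(keys)
    settings = []                 # per sub-pass: (edge j, lower lanes that swap)
    k = 2
    while k <= n:                 # merge size doubles each pass
        j = k // 2
        while j >= 1:             # edges walked from high bit to low bit
            swaps = set()
            for i in range(n):
                partner = i ^ j   # neighbor across one hypercube edge
                if partner > i:
                    ascending = (i & k) == 0
                    if a[i] != a[partner] and (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
                        swaps.add(i)
            settings.append((j, swaps))
            j //= 2
        k *= 2
    return a, settings

def replay(payload, settings):
    """Push a payload through the switches without looking at keys again."""
    p = list(payload)
    for j, swaps in settings:
        for i in swaps:
            p[i], p[i ^ j] = p[i ^ j], p[i]
    return p

sorted_keys, settings = bitonic_switch_settings([7, 3, 6, 0, 5, 1, 4, 2])
print(sorted_keys, "".join(replay("abcdefgh", settings)))
```

The recorded settings are the only thing the payload needs, which is why the payload cost scales with payload size rather than with another round of compares.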

The series of connections are hyper-cube edges, which means if you take the binary value of the lane index, there is one connection per bit in that lane index. For example, a 512-lane SIMD machine has a 9-bit lane index, and thus 9 connections (9 edges in the 9 dimensional hyper-cube). In order to make a switching station which is useful for basic parallel communication, the ordering in time of the connections matters greatly. Broadcast and other data replication/expansion algorithms need the switches ordered from LSB to MSB, while sorting and data merging algorithms need the switching ordered from MSB to LSB. Continuing the 512-lane example, each of the 9 pipeline stages of the switching station requires,

(1.) Output from the prior pipeline stage.
(2.) The associated LSB-first ordered edge from the prior pipeline stage.
(3.) The associated MSB-first ordered edge from the prior pipeline stage.
(4.) 2 switch control bits (select between 1, 2, 3, or some extra value).

For a 9 pipeline stage switching station, this would exceed the entire SLICEL budget. So likely going to run the switch at half-word or lower granularity.
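
For reference, the hyper-cube wiring is just "flip one bit of the lane index". The sketch below lists a lane's neighbors and models a one-to-all broadcast from lane 0 with the edges walked LSB-first. It is a sequential software model with made-up function names, and a plain one-to-all broadcast like this would also complete with other edge orderings. The LSB-first order is used only to match the text.

```python
def hypercube_neighbors(lane, lane_bits):
    """One neighbor per bit of the lane index."""
    return [lane ^ (1 << bit) for bit in range(lane_bits)]

def broadcast(source_value, lane_bits):
    """Copy a value from lane 0 to every lane, one edge dimension per stage."""
    lanes = 1 << lane_bits
    data = [None] * lanes
    data[0] = source_value
    for bit in range(lane_bits):                 # LSB-first edge ordering
        for lane in range(lanes):
            partner = lane ^ (1 << bit)
            if data[lane] is not None and data[partner] is None:
                data[partner] = data[lane]       # lanes holding data double each stage
    return data

print(hypercube_neighbors(0, 9))
print(broadcast("x", 9).count("x"))
```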

MOVING ON - Choosing VLIW ISA, and the exact structure for the cores is hard work. Validating the design has all the hardware features I'd like for solving problems is also challenging. Left off thinking about multi-precision math, and how to best manage the 3 cycle DSP latency.

20160623

FPGA Processor Rethinking

Have been very inspired by Jan Gray's GRVI Phalanx.


MIMD
Outside of work, I've been slowly attempting to work up a paper design for my own massively parallel FPGA based computer, looking to close on something to actually build. Mostly been working on SIMD based machines, without a serious focus on MIMD, until I read Jan's paper, which set me off in another direction: how would I build a MIMD machine in a Xilinx FPGA?

Basics
Looking at using a board with the fastest Artix-7. Collecting numbers,

33650 - CLB Slices
740 - DSP Slices
365 - 36 Kbit Block RAMs (each which can be split into two 18 Kbit BRAMs)

If could use all 740 DSP Slices and could maintain a 375 MHz clock (number borrowed from Jan's paper), that would be,

740 DSPs * 2 ops/clk * 375 MHz = 0.555 Tops/sec

Talking about effectively trying for a 740 core MIMD machine. Definitely won't be able to realize that peak, and in comparison to GPUs like FuryX at 8.6 Tops/sec (and 32-bit instead of 18-bit), this number seems small at first. Except if this little FPGA machine was driving an old TV like a console at around NES resolution (same width, less height),

PC driving 2560x1440 which at 8.6 Tops/sec is 2.3 Mops/pixel/sec.
FPGA driving 256x192 which at 0.555 Tops/sec is 11.3 Mops/pixel/sec.

Which is complete insanity levels of performance per pixel for the FPGA machine for a vintage arcade box, even if only reaching 1/4 of that performance.
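
The arithmetic behind those numbers, as a quick sketch. All inputs are the figures quoted in this post and the helper names are made up for illustration.

```python
def peak_ops_per_sec(dsps, ops_per_clk, clock_hz):
    """Peak throughput if every DSP issues every clock."""
    return dsps * ops_per_clk * clock_hz

def ops_per_pixel(ops_per_sec, width, height):
    """Peak ops/sec available per output pixel."""
    return ops_per_sec / (width * height)

fpga = peak_ops_per_sec(740, 2, 375e6)   # 0.555 Tops/sec
pc = 8.6e12                              # FuryX figure from the text

print(f"PC   {ops_per_pixel(pc, 2560, 1440) / 1e6:.1f} Mops/pixel/sec")
print(f"FPGA {ops_per_pixel(fpga, 256, 192) / 1e6:.1f} Mops/pixel/sec")
```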

So Let the Fun Begin
DSPs are limited by having an 18-bit input for the "d=a*b+c" operation, Block RAMs are natively 18-bits wide, so naturally this is going to be an 18-bit computer. Working backwards to get rough design constraints, first breaking down how many CLB slices support distributed RAMs, and dividing everything by the 740 DSP slices.

11552 SLICEM / 740 = 15
22098 SLICEL / 740 = 29
  730 BRAMs  / 740 = ~1 (18 Kbit)

Block RAMs are dual ported, so they don't have the ports required to keep the DSPs filled. Each SLICEM in contrast can be used as a Quad-Port 32 x 2-bit RAM, which looks like a good target for a register file (can sustain 3 read ports for the DSP op). Will need 9 SLICEMs to support a 32 entry x 18-bit register file. These SLICEMs only support 1 write port, and I want 2 (for parallel ability to write into the register file while doing DSP ops). So register file will need to be at least 2 banks, for a total of 18 SLICEMs. High level register file design will limit the peak number of DSPs used.
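
A rough sketch of how the register file caps the core count, using only the numbers above. It assumes SLICEMs are spent on nothing but register files, so the result is an optimistic upper bound, and the function names are hypothetical.

```python
SLICEM_TOTAL = 11552
DSP_TOTAL = 740
WORD_BITS = 18
BITS_PER_SLICEM = 2      # each SLICEM used as a quad-port 32 x 2-bit RAM

def slicem_per_register_file(banks):
    """SLICEMs for a 32 entry x 18-bit register file with the given bank count."""
    return (WORD_BITS // BITS_PER_SLICEM) * banks     # 9 per bank

def max_cores(banks):
    """Cores supportable if SLICEMs were used only for register files."""
    return min(DSP_TOTAL, SLICEM_TOTAL // slicem_per_register_file(banks))

print(slicem_per_register_file(2), max_cores(2))
```

With two banks the 18 SLICEMs per core exceed the 15 available per DSP, so under these assumptions the SLICEM count, not the DSP count, becomes the limit.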

Initial target of one 1024 entry x 18-bit Block RAM per core for data (roughly 16x the capacity of the register file). If clustering 8 CPUs together, that is 8K x 18-bit words for data, via 8-way banking. I'm thinking about doing something quite rash, and only supporting aligned 8-word (8 * 18-bits/word) block loads and stores from this data RAM, both for the CPU register file and the message router. This in comparison with the GRVI would replace the 2:1 concentrators and 4x4 crossbar. Instead the CPU would have 4 8-word regions in one of the banks which could be accessed for block load/store, in parallel with the other bank of 32-word register file used for DSP operations. Effectively the CPU would be modal with a binary switch, one bank used for block load/store to setup for next group of computation, while the other bank is used for computation. Then switch to do math on the loaded data, and load new data in the other bank. It is a very restrictive and simple design but something I think can work quite well in practice.

Also thinking for instruction RAM, sharing one BRAM between two CPUs. But switching to block loads, so all 8 CPUs in a cluster can share 4K x 18-bit words of BRAM. This involves an ISA design which can compute the branch target early enough in the pipeline.

More next time...

20160606

Knupath: Push Model

Another MIMD machine of tiny processors connected by an on-chip network,
Former NASA Exec Brings Stealth Machine Learning Chip to Light
{"We wanted to have the processing in immediate vicinity of the memory—a push model. You don’t need the cache, you don’t need to do fetch. We didn’t design this just for processing, we balanced communications and processing in memory to keep balance. It’s a communicator—there’s a router right in the middle of it," Goldin explains. Unfortunately, the reason they signed a contract for the first chip, which came out in 2015, is because of that eDRAM feature, which put each of the tDSPs right adjacent to memory for immediate contact. While the next variant of their chip won’t be able to use it, they have found a suitable workaround, although they were not able to provide details as of yet.}

If the first diagram is accurate,

Cluster: 8 DSPs sharing 2 MB of eMEM (guessing eDRAM?) + 256 KB sMEM (static RAMs for program binaries?).
Super Cluster: 8 clusters connected with a full 8x8 crossbar.
Chip: 4 super clusters connected with a full 4x4 crossbar.

And from the second diagram,

Chip 4x4 crossbar has 16 bidirectional ports at 10 Gbit/sec per port.

20160517

VK_AMD_rasterization_order Time Saver

(1.) Download and use the latest vulkan.h from here.

(2.) Add VK_AMD_RASTERIZATION_ORDER_EXTENSION_NAME (this string, "VK_AMD_rasterization_order", is defined in vulkan.h) to your VkDeviceCreateInfo.ppEnabledExtensionNames.

(3.) Then use the extension, for example as follows,
// using "static" here to have structure pre-zeroed, feel free to clear instead, etc
static VkPipelineRasterizationStateRasterizationOrderAMD orderAMD;
orderAMD.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_RASTERIZATION_ORDER_AMD;
orderAMD.rasterizationOrder = VK_RASTERIZATION_ORDER_RELAXED_AMD;

// append it into the pNext of raster state
VkPipelineRasterizationStateCreateInfo createInfo;
...
createInfo.pNext = &orderAMD;

20160424

Parallel Project : Register File

Referencing,
7 Series FPGAs Overview
7 Series FPGAs Memory Resources

Register File
Starting with the first constraint of the design, how to layout the register file for the SIMD machine. Artix-7 XC7A200T is the target which has capacity for 365 36-Kbit block RAMs (32-Kbit data, 4-Kbit parity). Block RAMs are symmetrical dual port (read or write). Each port takes a 16-bit bit indexed address, and returns a 36-bit (32-bits data, 4-bits parity) bit addressed sliding window into the memory.

The ALU design requires 2 read ports and 1 write port. It would be possible to get 2 read ports by duplicating writes and splitting memory to 2 copies. However I'd like to be able to use all of the memory. Since the majority of operations are going to stream through consecutive addresses in the register file, can leverage the sliding window to get 2 consecutive bits per read, and pipeline such that even clocks pull 2-bits for the first read operand, and odd clocks pull 2-bits for the 2nd read operand. So ignoring parity, one possible configuration is to set the write port to only write 16-bits (16 lanes), and have the read port fetch 32-bits,

BIT ADDRESS SIMD LANE
--- ------- ---------
0   n       (lane & 15) = 0
1   n       (lane & 15) = 1
2   n       (lane & 15) = 2
...
15  n       (lane & 15) = 15
16  n+1     (lane & 15) = 0
17  n+1     (lane & 15) = 1
18  n+1     (lane & 15) = 2
...
31  n+1     (lane & 15) = 15
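
The table above reduces to a divide and a modulo. A minimal sketch, assuming register bit n for all 16 lanes sits in 16 consecutive block RAM bits (which is what the table implies). The helper names are made up for illustration.

```python
LANES_PER_BLOCK = 16

def window_bit_to_lane(bit, n):
    """Map bit 0..31 of the 32-bit read window to (register bit address, lane & 15)."""
    assert 0 <= bit < 2 * LANES_PER_BLOCK
    return n + bit // LANES_PER_BLOCK, bit % LANES_PER_BLOCK

def block_ram_bit_address(n):
    """Bit-indexed block RAM address where the window for register bit n starts."""
    return n * LANES_PER_BLOCK

for bit in (0, 15, 16, 31):
    print(bit, window_bit_to_lane(bit, n=5))
```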

In this example each 36-Kbit block RAM supports 16 SIMD lanes each with 2-Kbit of register file. Throughput of the register file provides an upper bound on performance. For a given clock rate, and assuming 256 block RAMs (70%) are used for the register file, limiting throughput is:

1-bit ops/second = clockRate * 256 block rams * 16 lanes

A 128 MHz machine would thus be limited to,

128 M * 256 blocks * 16 lanes = around 512 Gop/sec

Where each op provides 1-bit of a variable precision ALU operation. My target for video output is the NES resolution: 256x224 at around 59.94 Hz (non-interlaced) and only NTSC monochrome (works on any classic TV). Target is roughly 3.4 Mpix/second. Dividing out to ops/pixel = 150 Kop/pixel. Even if this rough estimate is way too optimistic (which it is), going to have a tremendous amount of perf/pixel for this neo-vintage arcade machine.
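
The same estimate as a few lines of arithmetic, using only the numbers from this post (helper names are made up for illustration). In decimal units the limit works out to about 524 Gop/sec, so the 512 figure above presumably counts 1024 Mop per Gop. Either way it lands near 150 Kop per pixel.

```python
def limiting_bit_ops_per_sec(clock_hz, block_rams, lanes_per_block):
    """Upper bound on 1-bit ops/second set by register file throughput."""
    return clock_hz * block_rams * lanes_per_block

def pixels_per_sec(width, height, refresh_hz):
    return width * height * refresh_hz

ops = limiting_bit_ops_per_sec(128e6, 256, 16)
pix = pixels_per_sec(256, 224, 59.94)
print(f"{ops / 1e9:.0f} Gop/sec, {pix / 1e6:.1f} Mpix/sec, {ops / pix / 1e3:.0f} Kop/pixel")
```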

Continuing, 256 blocks * 16 lanes = 4096 lane SIMD machine, but the XC7A200T only has 33,650 slices. Using 70% again: 33650 slices * 0.7 / 4096 lanes = 5.7 slices/lane. Yes way too optimistic. Slices per lane for the 1-bit/clock return ALU will limit the peak number of lanes in the machine, which will reduce limiting throughput, but on the positive side, will result in more memory per lane. Can work backwards from possible block RAM configurations to get slice/ALU targets, and this time going to round down to only using 16k slices,

256 blocks * 16 lanes,  2-Kbit/lane,  4 slices/lane
256 blocks *  8 lanes,  4-Kbit/lane,  8 slices/lane
256 blocks *  4 lanes,  8-Kbit/lane, 16 slices/lane
256 blocks *  2 lanes, 16-Kbit/lane, 32 slices/lane
256 blocks *  1 lanes, 32-Kbit/lane, 64 slices/lane
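
The table above can be regenerated from three constants, which makes it easy to try a different slice budget. A small sketch with a hypothetical `configurations` helper:

```python
BLOCKS = 256
DATA_KBIT_PER_BLOCK = 32      # 36-Kbit block RAM minus parity
SLICE_BUDGET = 16 * 1024      # "16k slices"

def configurations(slice_budget=SLICE_BUDGET):
    """Rows of (lanes/block, total lanes, Kbit/lane, slices/lane)."""
    rows = []
    for lanes_per_block in (16, 8, 4, 2, 1):
        lanes = BLOCKS * lanes_per_block
        rows.append((lanes_per_block, lanes,
                     DATA_KBIT_PER_BLOCK // lanes_per_block,
                     slice_budget // lanes))
    return rows

for lanes_per_block, lanes, kbit, slices in configurations():
    print(f"{BLOCKS} blocks * {lanes_per_block:2d} lanes = {lanes:4d} lanes, "
          f"{kbit:2d}-Kbit/lane, {slices:2d} slices/lane")
```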

More Constraints
Block RAMs only have byte, not bit, write enable, so if it is possible to reach the 2 lane or better configuration, this rules out building any kind of per-lane predication which depends on disabling register file writes per lane. Going to work through ALU design without needing lane write enable. Likewise original thoughts on NPU have been transformed quite a lot based on real constraints. Focusing on fused ALU+NPU: where ALU operands can come from other edges in the network graph.

More next time ...

20160423

Parallel Machine Thoughts

Been very slowly working up towards learning everything required to build my own highly parallel computer in an FPGA. This process has in some ways been on-going for years, getting comfortable building languages and working on little OS prototypes, etc. On the way taking inspiration from other people's projects like the JayStation and Linus Åkesson's Parallelogram, and realizing that not-too-expensive FPGAs have reached the performance level where it is possible to simulate vintage arcade machines like the NeoGeo. This post represents some of the higher level thinking on the current design direction which is highly motivated by the CM-1.

Fusing MEM and ALU
One traditional design construct which I'm intending to toss out is the macro-level separation of memory and ALU. The FPGA I have in mind currently is the Digilent Nexys Video Artix-7 FPGA Trainer Board which has the largest 7 series Xilinx chip which can be used with the free version of Vivado. The Artix-7 has 13-mega-bits of on-chip block ram. To place this into context, the 2nd Amiga, the A2000, started with 8-mega-bits, and a little over 7 MHz CPU. Effectively I'm looking at carving out the block rams across a large number of tiny processors connected by an on-chip network, treating the external DDR3 more like volatile "storage" instead of "memory". I'm intending to prototype (or just design if I fail) something for fun and learning (something I could build a little arcade game on), but if made into actual hardware could hopefully be scaled up to stacked chips like HBM but instead of being composed of just memory, would have arrays of paired eDRAM and ALU cells.

Contrasting with GPUs
The majority of on-chip memory in a GPU is effectively a parking lot for the state of parallel jobs waiting on off-chip memory access, or high latency hits into on-chip caches. As a direct result of this, one of the software programming model constructs a GPU effectively deprecates, is the usage of the large CPU stacks. Likewise deprecated, the model of writing large software libraries invoked with "call and return" traversing huge library dependency chains.

However the GPU is still built around the data analog to "call and return", the "fetch memory" construct (the VMEM load/sample opcodes in GCN for example). The important property of the "fetch memory" construct is that it is a blocking operation.

Designing Around All Non-Blocking Constructs
GPU is designed around vector gather (aka the texture fetch), with only limited ability to scatter, either locally into a tiny memory like LDS divided across workgroups, or globally into L2, but no ability to directly scatter into the largest on-chip memory: the register file (aka the GPU's parking lot). One of the best evolutionary changes between the CPU and GPU was moving the ALU for global atomics from the shader cores to the other side of the crossbar close to the L2. I'd like to continue this evolution to the next logical step.

An inverse of the gather machine is a machine built around scatter, a machine built around non-blocking "fire-and-forget". Conceptually the tiny processors become "pinned jobs" which are persistently running instead of mostly parked, and the "scatter" is on-chip parallel message routing between the register files of these tiny pinned jobs.

Designing Around Cross-Lane Operations
Looking to repurpose the area of the traditional GPU memory sub-system, like caches and texture units, along with the ports used in the register file for load/store, instead for what is effectively a second functional unit which specializes in doing cross-lane operations in parallel with the ALU on a giant SIMD machine. These cross-lane operations become the "scatter" or parallel message routing network which I'm going to call the "NPU" (short for Network Processing Unit). This NPU would be able to do things like broadcast (on a local to global scale), sorting (when paired with ALU operations), scatter, etc. NPU's purpose is to re-organize data for the needs of computation in parallel with computation.

Transposed ALU Design
The ALU I have in mind is very similar to the CM-1: an ultra-wide SIMD machine with a bit addressed register file and a very simple ALU. Where conventional SIMD processor opcodes use an immediate register index to address and fetch something like 1 32-bit word per lane, the design I have in mind instead would fetch 1-bit for 32 words instead (using the same register index). The opcode register index is thus a bit index (shifts are amortized into the cost of the bit addressed register file), and the ALU, instead of retiring one 32-bit result per op, retires one bit for 32 words. This introduces the ability to do variable bit-width operations, but greatly increases the wall clock time it takes to compute anything (no longer optimizing for latency here because there are no non-register file memory reads).
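
To make the transposed layout concrete, here is a software sketch of a bit-serial add across 32 lanes. Each "plane" is one Python integer holding bit b of every lane, so one loop step retires one result bit for all lanes at once. This models the idea only, not the actual ALU, and the function names are made up for illustration.

```python
LANES = 32

def transpose(values, bits):
    """Turn one integer per lane into one bit-plane per bit index."""
    planes = []
    for b in range(bits):
        plane = 0
        for lane, v in enumerate(values):
            plane |= ((v >> b) & 1) << lane
        planes.append(plane)
    return planes

def untranspose(planes, lanes=LANES):
    """Rebuild one integer per lane from the bit-planes."""
    return [sum(((p >> lane) & 1) << b for b, p in enumerate(planes))
            for lane in range(lanes)]

def bit_serial_add(x_planes, y_planes):
    """Lane-wise add, one result bit for every lane per step (LSB first)."""
    carry = 0
    out = []
    for x, y in zip(x_planes, y_planes):
        out.append(x ^ y ^ carry)                  # full adder sum, all lanes at once
        carry = (x & y) | (carry & (x ^ y))        # full adder carry, all lanes at once
    return out   # carry out of the top bit is dropped (wraps modulo 2**bits)

xs = list(range(LANES))
ys = [3 * i + 1 for i in range(LANES)]
print(untranspose(bit_serial_add(transpose(xs, 8), transpose(ys, 8)))[:4])
```

The bit width is just the number of planes passed in, which is where the variable precision comes from.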

In order to make it possible to compute multiplies faster, I'm looking at an ALU design which includes a pipelined tree reduction, with a mux at the end to select which element of the tree to return to the register file (implemented in block rams). Very rough diagram below for a 4-bit tree. Each [_] is an adder, and with carry and control logic not drawn. The "y" input is transferred across the tree with a one clock delay. Each of the [a] through [d] adders have an input which feeds back to itself for the next cycle, with a control input which can selectively override with the new "x" input to the module.
bit in x --->o-----o-----o-----o
             |     |     |     |
bit in y -->[a]-->[b]-->[c]-->[d]
             |\   /       \   /
             | [e]         [f]
             |  | \       / 
             |  |  \     /  
             |  |    [g]    
             |  |     |     
bit out z <--o--o-----o
The number of leaves of the tree acts as some multiplier on the throughput of multiply, while having side effects on the critical path timing, and increasing area, etc. I'm still exploring what other uses this tree construct could have to see what opcode control signals I should support to reconfigure.

Bit-Banked Register File
The FPGA's block rams are dual port, and I need 3 ports for the ALU, and more for the NPU. To work around this problem, I'm looking at banking the register file at the bit address level. Multi-precision ALU ops and multi-bit message passing (NPU) will stream through a series of addresses, making it easy to have streams start at different addresses modulo the banking multiplier, in theory at least. This adds an associated complexity to the compiler for code generation, or to the human if hand assembling code. Not sure yet if this idea is going to work out.
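
A sketch of the scheduling constraint this creates for the code generator. The assumptions are mine and simplified: bank = bit address modulo the bank count, every stream advances one address per clock, and each bank serves one access per clock (the real block rams have two ports, so this is pessimistic). `conflict_free` is a hypothetical helper.

```python
def bank_of(bit_address, banks):
    return bit_address % banks

def conflict_free(starts, length, banks):
    """True if streams stepping one address per clock never share a bank."""
    for clk in range(length):
        used = [bank_of(s + clk, banks) for s in starts]
        if len(set(used)) != len(used):
            return False
    return True

print(conflict_free([0, 17, 34], length=16, banks=4))   # start residues 0, 1, 2
print(conflict_free([0, 16, 33], length=16, banks=4))   # start residues 0, 0, 1
```

Because all streams advance in lockstep, it is enough for the start addresses to differ modulo the bank count, which is the part the compiler (or human) has to arrange.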

NPU
Finding this to be the most challenging part of the project, producing a design which is not too large in comparison with the ALU, etc. I'm working from the idea of using the same CM-1 style hypercube network, but not sure about edge node design (CM-1 has 16 ALUs at each edge). Currently doubt I want anything like a full 16-lane crossbar in an edge.

From a high level view, messages are stored in the register file, and the NPU can conditionally swap two SIMD lanes' messages, with the ability to conditionally write the data payload of the swapped message instead to a second area of register file for a message which reached the destination (this way the NPU can continue to use the message scratch area for routing other messages). Lane connectivity is limited by whatever fixed network layout I end up going with. Decided on using the register file as the message buffering area to allow the "program" to control the full routing algorithm (instead of having say a fixed petit cycle as in the CM-1 and a separate memory, etc). This also effectively limits the design to message swaps (like a parallel sorting network), but enables the possibility of doing broadcast and other things requiring message duplication.

Messages are composed of {message enable bit, variable bit relative lane address, variable bit payload}. Some message passing won't need the relative lane address (fixed network flow instead of data driven). Also parts of the message could be read out of order. For instance to build up the NPU register which enables the message transfer to actually write to the lane's register file, for routing like in the CM-1, the bit representing the active edge of the petit cycle would be read first.
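
A sketch of one possible message layout and of how a relative lane address steers a message across hypercube edges. Both the field order (enable in bit 0, then address, then payload) and the choice of relative address = source XOR destination are assumptions made for illustration. Collisions between messages are ignored, and the function names are hypothetical.

```python
def pack_message(enable, rel_addr, payload, addr_bits, payload_bits):
    """Pack {enable, relative lane address, payload}, enable in the lowest bit."""
    assert 0 <= rel_addr < (1 << addr_bits)
    assert 0 <= payload < (1 << payload_bits)
    return (enable & 1) | (rel_addr << 1) | (payload << (1 + addr_bits))

def unpack_message(msg, addr_bits, payload_bits):
    enable = msg & 1
    rel_addr = (msg >> 1) & ((1 << addr_bits) - 1)
    payload = (msg >> (1 + addr_bits)) & ((1 << payload_bits) - 1)
    return enable, rel_addr, payload

def route(src_lane, rel_addr, addr_bits):
    """Lanes visited when each set bit of rel_addr crosses that hypercube edge."""
    path, lane = [src_lane], src_lane
    for bit in range(addr_bits):          # one edge dimension per step
        if (rel_addr >> bit) & 1:
            lane ^= 1 << bit
            path.append(lane)
    return path

msg = pack_message(1, 0b101, 42, addr_bits=9, payload_bits=18)
print(unpack_message(msg, 9, 18), route(3, 0b101, 9))
```

Reading the address bit for the active edge first, as described above, corresponds to testing `(rel_addr >> bit) & 1` before touching the payload.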

The NPU design I'm looking at has one read and one write port into the register file, if message throughput is a bit per clock per 1-bit lane in the SIMD machine. The alternative is to drive messages at half rate and use the one FPGA block ram port which can be read or write. This has some appeal because then only 4 ports are needed across the ALU and NPU combined. And lastly I'm thinking about just merging ALU and NPU, so that network inputs can directly be used as ALU input.

On Paper
Current work involves building towards a fixed ALU+NPU logic design which gets driven by SIMD control fan-out. Once I'm happy with the ALU+NPU design, I'm back to verilog to see how many of these I can place in the FPGA. Then I need to think more about building the "sequencer" or CPU-style scalar unit which drives the control lines for the massive SIMD machine.

Wrestled about actually blogging about this for a long time, because like anything not tied to the day job with a real deadline, I'm likely to scrap everything and do something else. And this is my first attempt at somewhat serious hardware design, which is certain to contain horrible fail moments, which hopefully others can enjoy at my expense. On the other hand, unlike dealing with software, this project has brought back the sense of wonder I had as a kid which otherwise since has transformed into sarcasm after dealing with how the industry has since evolved.

20160422

The End of HDMI to Analog HDfury

According to an email today to prior customers like myself, the makers of the great HDFury devices, have been forced to discontinue selling all HDMI to Analog devices within a week. Likely the last chance to purchase one before your only option is the used online market. As someone who depends on these devices to drive my analog CRTs from modern digital equipment for personal fair use, I'm sad to see the continued decline of my rights and ability to just use my computers for basic fundamental things like analog output. Hopefully the HDFury business continues with other products and hopefully at some point they release some kind of Display Port to Analog converter.

20160420

April 2016 LG OLED65G6P HDR OLED TV Review on HDTVtest

LG OLED65G6P HDR OLED TV Review on HDTVtest
Now April 2016, OLED HDR TVs reach 744 nits (still under one stop brighter peak than your typical PC panel). While April 2016 LCD HDR TVs are upwards of 1000 nits with one TV reaching 1300 nits. OLED is more efficient than Plasma but cannot hit the power efficiency of LEDs backing LCDs. 2016 LED backed LCD TVs are still limited in contrast, reaching upwards of only 5000:1 ANSI contrast. While OLED's selling point is the dark blacks and huge contrast, this is still a serious pain point for the technology scaled up to TV sizes. I've been watching this space hoping to eventually replace my Plasma while I still can get an OLED at 1080p, but the technology just is not there yet. Quotes from the HDTVtest review from this April 2016,

"We calibrated several G6s in both the ISF Night and ISF Day modes, and found that with about 200-250 hours run-in time on each unit, the grayscale tracking was consistently red deficient and had a green tint as a result (relative to a totally accurate reference)"

"Case in point: we originally calibrated one of the units and added some targeted touch-ups at 70% stimulus, only to find that, 60 hours later, the same adjustments that had gained us totally flat grayscale at the time of calibration were no longer ideal."

"LG have placed the option to run the self-correction process in the user menu (previously, it was only accessible in the service mode.) ... It takes a little over an hour, and interrupting the TV before it’s finished will require you to start again."

"The LG 65G6’s [Brightness] control (which governs black level) is set to discard some dark-scene details by default. We found that we had to raise it by a decent number of clicks from the factory position of “50” in order to avoid this. ... After doing this, during our dark-scene, dark-room testing, we noticed the blacks “floating”. During cuts to black, we could see swathes of non-black areas lighted on the panel, which is probably why LG crushed blacks by default."

"The LG OLED65G6’s input lag measured as being 34ms using the Leo Bodnar testing device."

20160409

ELF

Random source from one of my prior languages which generates an ELF header for an x86-64 Linux binary with the dlsym() symbol.

\============================================================================
                              [ELF] LINUX
-----------------------------------------------------------------------------
http://www.sco.com/developers/gabi/2000-07-17/ch4.symtab.html
http://blog.markloiseau.com/2012/05/tiny-64-bit-elf-executables/
============================================================================\
{ ElfPh! \align memsz filesz paddr,vaddr,offset type\ 
 `word .elfPhXWR `word `dup `dup `long `long `long `long `long `long }
{ ElfDs! \val tag\ `long `long }
{ ElfSym! \size value shndx other type bind name\ 
 `word 10 `mul `add `byte `byte `half `long `long }
{ ElfStr( $ .ElfStr `neg `add }
{ ElfStr) `text 0 `byte }
\===========================================================================\
{ Elf
\---------------------------------------------------------------------------\
\ELF HEADER\
\e_ident\     00010102464c457f `long
\reserved\    0  `long
\e_type\      2  `half \ET_EXEC\
\e_machine\   3e `half \X86-64\
\e_version\   1  `word \EV_CURRENT\
\e_entry\     .ElfEntry `long
\e_phoff\     .ElfPh `long
\e_shoff\     0  `long
\e_flags\     0  `word
\e_ehsize\    40 `half
\e_phentsize\ 38 `half
\e_phnum\     .ElfPh# `half
\e_shentsize\ 40 `half
\e_shnum\     0  `half
\e_shstrndx\  0  `half
\---------------------------------------------------------------------------\
\PROGRAM HEADER\
$ :ElfPh
7 :elfPhXWR \PF_X+PF_W+PF_R\
3 :ElfPh#
1       .elfIs# .elfIs# .ElfIs  3 \PT_INTERP\  ElfPh!
.BUILD# .BUILD# .ElfEnd 0       1 \PT_LOAD\    ElfPh!
8       .elfDs# .elfDs# .ElfDs  2 \PT_DYNAMIC\ ElfPh!
\---------------------------------------------------------------------------\
\DYNAMIC SECTION\
$ :ElfDs
.elfLib% 1 \DT_NEEDED\  ElfDs!
.ElfHsh  4 \DT_HASH\    ElfDs!
.ElfStr  5 \DT_STRTAB\  ElfDs!
.ElfSym  6 \DT_SYMTAB\  ElfDs!
.ElfRel  7 \DT_RELA\    ElfDs!
.elfRel# 8 \DT_RELASZ\  ElfDs!
18       9 \DT_RELAENT\ ElfDs!
0        0              ElfDs!
.ElfDs $# :elfDs#
\---------------------------------------------------------------------------\
\SYMBOL TABLE\
10 $- \overlap\
$ :ElfSym 
0 0         0 0               0              0            0          ElfSym!
0 .ElfDlSym 0 0 \STV_DEFAULT\ 1 \STT_OBJECT\ 2 \STB_WEAK\ .elfDlSym% ElfSym!
\---------------------------------------------------------------------------\
\RELOCATION TABLE\
$ :ElfRel .ElfDlSym `long 7 \R_X86_64_JUMP_SLOT\ `word 1 `word 0 `long .ElfRel $# :elfRel#
\---------------------------------------------------------------------------\
\DLSYM\
$ :ElfDlSym 0 `long
\---------------------------------------------------------------------------\
\HASH TABLE\
$ :ElfHsh 1 `word 2 `word 1 `word 0 `word 0 `word
\---------------------------------------------------------------------------\
\INTERPRETER STRING\
$ :ElfIs "/lib/ld-linux-x86-64.so.2" `text 0 `byte .ElfIs $# :elfIs#
\---------------------------------------------------------------------------\
\DYNAMIC STRING TABLE\
1 $- \overlap\
$ :ElfStr
0 `byte
ElfStr( :elfLib%   "libdl.so.2" ElfStr)
ElfStr( :elfDlSym% "dlsym"      ElfStr) }
\===========================================================================\
{ Elf( 0 `asm! 0 $0! $_0 Elf }
{ Elf) `asm :ElfEnd }
{ Elf! $ :ElfEntry }
And the associated "readelf" on a binary generated with this code,
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1e7
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         3
  Size of section headers:           64 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections to group in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  INTERP         0x00000000000001bc 0x00000000000001bc 0x00000000000001bc
                 0x000000000000001a 0x000000000000001a  RWE    1
      [Requesting program interpreter: /lib/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000001e48 0x0000000002000000  RWE    2000000
  DYNAMIC        0x00000000000000e8 0x00000000000000e8 0x00000000000000e8
                 0x0000000000000080 0x0000000000000080  RWE    8

Dynamic section at offset 0xe8 contains 8 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000004 (HASH)               0x1a8
 0x0000000000000005 (STRTAB)             0x1d5
 0x0000000000000006 (SYMTAB)             0x158
 0x0000000000000007 (RELA)               0x188
 0x0000000000000008 (RELASZ)             24 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x0000000000000000 (NULL)               0x0

There are no relocations in this file.

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

Histogram for bucket list length (total of 1 buckets):
 Length  Number     % of total  Coverage
      0  0          (  0.0%)
      1  1          (100.0%)    100.0%

No version information found in this file.
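The single-bucket histogram above follows from the hash table in the source: one bucket, two chain entries, with bucket 0 pointing at symbol 1 (dlsym). Below is a rough C sketch of how a loader would pick the first candidate symbol from such a table, assuming the classic System V hash function. The names elf_hash, elf_first_symbol and kHashTable are made up here for illustration and are not part of the original code,

```c
#include <assert.h>
#include <stdint.h>

/* Classic System V ELF symbol hash, the function DT_HASH tables are built for. */
static uint32_t elf_hash(const char *name) {
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h = (h << 4) + *p;
        uint32_t g = h & 0xf0000000u;
        if (g) h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* The five words from the HASH TABLE listing:
   nbucket, nchain, bucket[0], chain[0], chain[1]. */
static const uint32_t kHashTable[] = { 1, 2, 1, 0, 0 };

/* Symbol index where the chain walk for 'name' starts (0 means empty bucket). */
static uint32_t elf_first_symbol(const uint32_t *table, const char *name) {
    uint32_t nbucket = table[0];
    assert(nbucket != 0);
    const uint32_t *bucket = table + 2;  /* buckets follow nbucket and nchain */
    return bucket[elf_hash(name) % nbucket];
}
```

With only one bucket the hash value never matters: every name lands in bucket 0, and the chain holds just the null symbol and dlsym.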
And a hex dump,
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 3e 00 01 00 00 00  e7 01 00 00 00 00 00 00  |..>.............|
00000020  40 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |@...............|
00000030  00 00 00 00 40 00 38 00  03 00 40 00 00 00 00 00  |....@.8...@.....|
00000040  03 00 00 00 07 00 00 00  bc 01 00 00 00 00 00 00  |................|
00000050  bc 01 00 00 00 00 00 00  bc 01 00 00 00 00 00 00  |................|
00000060  1a 00 00 00 00 00 00 00  1a 00 00 00 00 00 00 00  |................|
00000070  01 00 00 00 00 00 00 00  01 00 00 00 07 00 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  48 1e 00 00 00 00 00 00  |........H.......|
000000a0  00 00 00 02 00 00 00 00  00 00 00 02 00 00 00 00  |................|
000000b0  02 00 00 00 07 00 00 00  e8 00 00 00 00 00 00 00  |................|
000000c0  e8 00 00 00 00 00 00 00  e8 00 00 00 00 00 00 00  |................|
000000d0  80 00 00 00 00 00 00 00  80 00 00 00 00 00 00 00  |................|
000000e0  08 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
000000f0  01 00 00 00 00 00 00 00  04 00 00 00 00 00 00 00  |................|
00000100  a8 01 00 00 00 00 00 00  05 00 00 00 00 00 00 00  |................|
00000110  d5 01 00 00 00 00 00 00  06 00 00 00 00 00 00 00  |................|
00000120  58 01 00 00 00 00 00 00  07 00 00 00 00 00 00 00  |X...............|
00000130  88 01 00 00 00 00 00 00  08 00 00 00 00 00 00 00  |................|
00000140  18 00 00 00 00 00 00 00  09 00 00 00 00 00 00 00  |................|
00000150  18 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000170  0c 00 00 00 21 00 00 00  a0 01 00 00 00 00 00 00  |....!...........|
00000180  00 00 00 00 00 00 00 00  a0 01 00 00 00 00 00 00  |................|
00000190  07 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
000001a0  00 00 00 00 00 00 00 00  01 00 00 00 02 00 00 00  |................|
000001b0  01 00 00 00 00 00 00 00  00 00 00 00 2f 6c 69 62  |............/lib|
000001c0  2f 6c 64 2d 6c 69 6e 75  78 2d 78 38 36 2d 36 34  |/ld-linux-x86-64|
000001d0  2e 73 6f 2e 32 00 6c 69  62 64 6c 2e 73 6f 2e 32  |.so.2.libdl.so.2|
000001e0  00 64 6c 73 79 6d 00 33  ff be b8 14 00 00 ff 15  |.dlsym.3........|
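For reference, the dynamic section in the dump above (file offset 0xe8, 0x80 bytes) is just eight 16-byte tag/value pairs. Here is a small C sketch that mirrors those pairs and resolves where the DT_NEEDED name lives. The Dyn type and the dyn_find / needed_name_offset helpers are hypothetical names for illustration, and the values are copied from the readelf output,

```c
#include <stddef.h>
#include <stdint.h>

/* One Elf64_Dyn entry: a 64-bit tag followed by a 64-bit value or address. */
typedef struct { uint64_t tag, val; } Dyn;

/* The eight entries as they appear at file offset 0xe8. */
static const Dyn kDynamic[] = {
    { 1, 0x001 }, /* DT_NEEDED: offset of the library name in the string table */
    { 4, 0x1a8 }, /* DT_HASH */
    { 5, 0x1d5 }, /* DT_STRTAB */
    { 6, 0x158 }, /* DT_SYMTAB */
    { 7, 0x188 }, /* DT_RELA */
    { 8, 0x018 }, /* DT_RELASZ: 24 bytes in total */
    { 9, 0x018 }, /* DT_RELAENT: 24 bytes per entry */
    { 0, 0x000 }, /* DT_NULL terminator */
};

/* Return the value stored for 'tag', or 'missing' if DT_NULL is reached first. */
static uint64_t dyn_find(const Dyn *d, uint64_t tag, uint64_t missing) {
    for (; d->tag != 0; d++)
        if (d->tag == tag) return d->val;
    return missing;
}

/* Address of the needed library's name: DT_STRTAB + DT_NEEDED. */
static uint64_t needed_name_offset(const Dyn *d) {
    return dyn_find(d, 5, 0) + dyn_find(d, 1, 0);
}
```

Addresses equal file offsets in this binary because the LOAD segment maps offset 0 to address 0. needed_name_offset(kDynamic) gives 0x1d6, which is where libdl.so.2 starts in the dump; the string table itself begins at 0x1d5, on the terminating zero of the interpreter string (the one byte overlap in the source). RELASZ divided by RELAENT gives exactly one relocation, the dlsym slot.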
No Dynamic Linking Example
And source for an ELF which only uses syscalls and no dynamic linking for comparison,
\===============================================================
            64-BIT ELF FOR KERNEL ONLY INTERFACE
----------------------------------------------------------------
http://blog.markloiseau.com/2012/05/tiny-64-bit-elf-executables/
===============================================================\
{ Elf
\e_ident\     00010102464c457f `long
\reserved\    0  `long
\e_type\      2  `half
\e_machine\   3e `half
\e_version\   1  `word
\e_entry\     .ElfEntry `long
\e_phoff\     .ElfPh `long
\e_shoff\     0  `long
\e_flags\     0  `word
\e_ehsize\    40 `half
\e_phentsize\ 38 `half
\e_phnum\     1  `half
\e_shentsize\ 0  `half
\e_shnum\     0  `half
\e_shstrndx\  0  `half
\______________________________________________________________\
`asm :ElfPh
\p_type\   1 `word
\p_flags\  7 `word
\p_offset\ 0 `long
\p_vaddr\  0 `long
\p_paddr\  0 `long
\p_filesz\ .ElfEnd `long
\p_memsz\  4000000 `long
\p_align\  2000000 `long }
\______________________________________________________________\
{ Elf( 0 `asm! 0 $0! Elf }
{ Elf) `asm :ElfEnd }
{ ElfEntry! $ :ElfEntry }
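For readers who don't know the language above, here is roughly the same header written as a C sketch. Numbers in the listing are hex (40 is the 64-byte ELF header, 38 is the 56-byte program header), and each word stores its value little-endian, which is why 00010102464c457f comes out as the 7f 45 4c 46 magic. The helpers put_le and emit_elf are hypothetical names, and the entry and file_size parameters stand in for .ElfEntry and .ElfEnd,

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append the low 'n' bytes of 'v' in little-endian order; returns the new offset. */
static size_t put_le(uint8_t *out, size_t at, uint64_t v, int n) {
    for (int i = 0; i < n; i++) out[at++] = (uint8_t)(v >> (8 * i));
    return at;
}

/* Emit the ELF64 header plus one PT_LOAD program header (120 bytes in total). */
static size_t emit_elf(uint8_t *out, uint64_t entry, uint64_t file_size) {
    size_t at = 0;
    at = put_le(out, at, 0x00010102464c457fULL, 8); /* e_ident: 7f 'E' 'L' 'F' 02 01 01 00 */
    at = put_le(out, at, 0, 8);         /* rest of e_ident, reserved */
    at = put_le(out, at, 2, 2);         /* e_type: executable */
    at = put_le(out, at, 0x3e, 2);      /* e_machine: x86-64 */
    at = put_le(out, at, 1, 4);         /* e_version */
    at = put_le(out, at, entry, 8);     /* e_entry */
    at = put_le(out, at, 0x40, 8);      /* e_phoff: right after this header */
    at = put_le(out, at, 0, 8);         /* e_shoff: no sections */
    at = put_le(out, at, 0, 4);         /* e_flags */
    at = put_le(out, at, 0x40, 2);      /* e_ehsize */
    at = put_le(out, at, 0x38, 2);      /* e_phentsize */
    at = put_le(out, at, 1, 2);         /* e_phnum */
    at = put_le(out, at, 0, 2);         /* e_shentsize */
    at = put_le(out, at, 0, 2);         /* e_shnum */
    at = put_le(out, at, 0, 2);         /* e_shstrndx */
    at = put_le(out, at, 1, 4);         /* p_type: PT_LOAD */
    at = put_le(out, at, 7, 4);         /* p_flags: read, write, execute */
    at = put_le(out, at, 0, 8);         /* p_offset */
    at = put_le(out, at, 0, 8);         /* p_vaddr */
    at = put_le(out, at, 0, 8);         /* p_paddr */
    at = put_le(out, at, file_size, 8); /* p_filesz */
    at = put_le(out, at, 0x4000000, 8); /* p_memsz */
    at = put_le(out, at, 0x2000000, 8); /* p_align */
    assert(at == 0x40 + 0x38);
    return at;
}
```

Called with an entry of 0x5f6 and a file size of 0x6aa, this should reproduce the first 0x78 bytes of the binary shown further down.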
The "readelf" results,
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x5f6
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         1
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections to group in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000000006aa 0x0000000004000000  RWE    2000000

There is no dynamic section in this file.

There are no relocations in this file.

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

No version information found in this file.
And the hex dump (I'm too lazy so this is going to just include the full binary, which happens to be the compiler I use for the language both these ELF headers are written in),
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 3e 00 01 00 00 00  f6 05 00 00 00 00 00 00  |..>.............|
00000020  40 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |@...............|
00000030  00 00 00 00 40 00 38 00  01 00 00 00 00 00 00 00  |....@.8.........|
00000040  01 00 00 00 07 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  aa 06 00 00 00 00 00 00  00 00 00 04 00 00 00 00  |................|
00000070  00 00 00 02 00 00 00 00  69 6e 2e 61 00 6f 75 74  |........in.a.out|
00000080  2e 61 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |.a..............|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000b0  00 00 00 00 01 02 03 04  05 06 07 08 09 00 00 00  |................|
000000c0  00 00 00 00 0a 0b 0c 0d  0e 0f 00 00 00 00 00 00  |................|
000000d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000e0  00 00 00 00 0a 0b 0c 0d  0e 0f 00 00 00 00 00 00  |................|
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000100  00 00 00 65 05 65 05 65  05 65 05 65 05 65 05 65  |...e.e.e.e.e.e.e|
00000110  05 65 05 65 05 65 05 65  05 65 05 65 05 65 05 65  |.e.e.e.e.e.e.e.e|
*
00000140  05 65 05 65 05 d7 03 4b  04 d7 03 d7 03 ab 03 d7  |.e.e...K........|
00000150  03 06 04 d7 03 d7 03 d7  03 d7 03 05 05 d7 03 71  |...............q|
00000160  03 d7 03 a0 04 a0 04 a0  04 a0 04 a0 04 a0 04 a0  |................|
00000170  04 a0 04 a0 04 a0 04 2b  03 d7 03 d7 03 d7 03 d7  |.......+........|
00000180  03 d7 03 d7 03 d7 03 d7  03 d7 03 d7 03 d7 03 d7  |................|
*
000001b0  03 d7 03 d7 03 d7 03 d7  03 d7 03 40 05 d7 03 d7  |...........@....|
000001c0  03 d7 03 9b 02 d7 03 d7  03 d7 03 d7 03 d7 03 d7  |................|
000001d0  03 d7 03 d7 03 d7 03 d7  03 d7 03 d7 03 d7 03 d7  |................|
*
000001f0  03 d7 03 d7 03 d7 03 d7  03 05 03 d7 03 d4 02 d7  |................|
00000200  03 82 05 48 89 91 08 00  50 00 48 89 99 10 00 50  |...H....P.H....P|
00000210  00 48 89 a9 00 00 50 00  8b c5 83 c5 08 c3 b9 c5  |.H....P.........|
00000220  9d 1c 81 33 d2 33 db 40  0f b6 06 83 c6 01 48 69  |...3.3.@......Hi|
00000230  c9 93 01 00 01 48 33 c8  48 c1 e2 03 48 03 d0 48  |.....H3.H...H..H|
00000240  83 e2 f0 48 c1 e3 04 83  e0 0f 48 03 d8 40 0f b6  |...H......H..@..|
00000250  06 83 c6 01 83 f8 20 0f  87 d1 ff ff ff c1 e1 06  |...... .........|
00000260  83 c1 e0 83 c1 20 81 e1  e0 ff 7f 00 4c 3b b9 00  |..... ......L;..|
00000270  00 50 00 0f 84 8a ff ff  ff 48 3b 91 08 00 50 00  |.P.......H;...P.|
00000280  0f 85 dd ff ff ff 48 3b  99 10 00 50 00 0f 85 d0  |......H;...P....|
00000290  ff ff ff 48 8b 81 00 00  50 00 c3 e8 7e ff ff ff  |...H....P...~...|
000002a0  8b 90 00 00 20 00 8b 00  48 8b 08 48 89 0f 48 8b  |.... ...H..H..H.|
000002b0  48 08 48 89 4f 08 48 8b  48 10 48 89 4f 10 03 fa  |H.H.O.H.H.H.O...|
000002c0  40 0f b6 06 83 c6 01 8b  c8 03 c9 0f b7 89 03 01  |@...............|
000002d0  00 00 ff e1 41 8b 40 fc  41 83 e8 04 8b cf 2b 08  |....A.@.A.....+.|
000002e0  89 88 00 00 20 00 b9 c3  00 00 00 40 88 0f 83 c7  |.... ......@....|
000002f0  01 40 0f b6 06 83 c6 01  8b c8 03 c9 0f b7 89 03  |.@..............|
00000300  01 00 00 ff e1 83 c6 01  e8 11 ff ff ff 41 89 00  |.............A..|
00000310  41 83 c0 04 48 89 38 40  0f b6 06 83 c6 01 8b c8  |A...H.8@........|
00000320  03 c9 0f b7 89 03 01 00  00 ff e1 e8 ee fe ff ff  |................|
00000330  2b c7 83 e8 07 89 47 03  b9 48 89 00 00 66 89 0f  |+.....G..H...f..|
00000340  b9 05 00 00 00 40 88 4f  02 b9 48 8b 03 83 89 4f  |.....@.O..H....O|
00000350  07 b9 eb 08 00 00 66 89  4f 0b 83 c7 0d 40 0f b6  |......f.O....@..|
00000360  06 83 c6 01 8b c8 03 c9  0f b7 89 03 01 00 00 ff  |................|
00000370  e1 e8 a8 fe ff ff 2b c7  83 e8 0e 89 47 0a 48 b9  |......+.....G.H.|
00000380  48 89 43 08 83 c3 08 48  48 89 0f b9 8b 05 00 00  |H.C....HH.......|
00000390  66 89 4f 08 83 c7 0e 40  0f b6 06 83 c6 01 8b c8  |f.O....@........|
000003a0  03 c9 0f b7 89 03 01 00  00 ff e1 e8 6e fe ff ff  |............n...|
000003b0  89 47 08 48 b9 48 89 43  08 83 c3 08 b8 48 89 0f  |.G.H.H.C.....H..|
000003c0  83 c7 0c 40 0f b6 06 83  c6 01 8b c8 03 c9 0f b7  |...@............|
000003d0  89 03 01 00 00 ff e1 83  c6 ff e8 3f fe ff ff 2b  |...........?...+|
000003e0  c7 83 e8 06 89 47 02 b9  ff 15 00 00 66 89 0f 83  |.....G......f...|
000003f0  c7 06 40 0f b6 06 83 c6  01 8b c8 03 c9 0f b7 89  |..@.............|
00000400  03 01 00 00 ff e1 e8 13  fe ff ff 8b 00 2b c7 83  |.............+..|
00000410  e8 0f 89 47 0b 48 b9 48  85 c0 48 8b 03 8d 5b 48  |...G.H.H..H...[H|
00000420  89 0f b9 f8 0f 00 00 66  89 4f 08 b9 85 00 00 00  |.......f.O......|
00000430  40 88 4f 0a 83 c7 0f 40  0f b6 06 83 c6 01 8b c8  |@.O....@........|
00000440  03 c9 0f b7 89 03 01 00  00 ff e1 8b ce 40 0f b6  |.............@..|
00000450  06 83 c6 01 83 f8 22 0f  85 f0 ff ff ff 8b c6 2b  |......"........+|
00000460  c1 83 c0 ff 89 4f 05 89  47 11 b9 48 89 43 08 89  |.....O..G..H.C..|
00000470  0f b9 b8 00 00 00 40 88  4f 04 48 b9 48 89 43 10  |......@.O.H.H.C.|
00000480  83 c3 10 b8 48 89 4f 09  83 c7 15 40 0f b6 46 01  |....H.O....@..F.|
00000490  83 c6 02 8b c8 03 c9 0f  b7 89 03 01 00 00 ff e1  |................|
000004a0  33 c9 40 0f b6 80 83 00  00 00 48 c1 e1 04 48 03  |3.@.......H...H.|
000004b0  c8 40 0f b6 06 83 c6 01  83 f8 30 0f 83 e1 ff ff  |.@........0.....|
000004c0  ff 48 8b d1 48 f7 da 8d  5e 01 83 f8 2d 48 0f 44  |.H..H...^...-H.D|
000004d0  ca 0f 44 f3 48 89 4f 09  48 b9 48 89 43 08 83 c3  |..D.H.O.H.H.C...|
000004e0  08 48 48 89 0f b9 b8 00  00 00 40 88 4f 08 83 c7  |.HH.......@.O...|
000004f0  11 40 0f b6 06 83 c6 01  8b c8 03 c9 0f b7 89 03  |.@..............|
00000500  01 00 00 ff e1 40 0f b6  06 40 0f b6 88 83 00 00  |.....@...@......|
00000510  00 40 0f b6 46 01 40 0f  b6 80 83 00 00 00 48 c1  |.@..F.@.......H.|
00000520  e1 04 48 03 c8 40 88 0f  83 c7 01 40 0f b6 46 03  |..H..@.....@..F.|
00000530  83 c6 04 8b c8 03 c9 0f  b7 89 03 01 00 00 ff e1  |................|
00000540  40 0f b6 06 83 c6 01 83  f8 5c 0f 85 f0 ff ff ff  |@........\......|
00000550  40 0f b6 46 01 83 c6 02  8b c8 03 c9 0f b7 89 03  |@..F............|
00000560  01 00 00 ff e1 40 0f b6  06 83 c6 01 83 f8 20 0f  |.....@........ .|
00000570  86 f0 ff ff ff 8b c8 03  c9 0f b7 89 03 01 00 00  |................|
00000580  ff e1 48 b9 d9 bf 6f 7a  22 42 bc 83 48 ba 40 96  |..H...oz"B..H.@.|
00000590  00 00 00 00 00 00 48 bb  71 e6 00 00 00 00 00 00  |......H.q.......|
000005a0  e8 b8 fc ff ff bb 00 00  20 01 ba 00 00 30 01 bf  |........ ....0..|
000005b0  00 00 30 00 8b 00 ff d0  8d 9f 00 00 d0 ff bf 7d  |..0............}|
000005c0  00 00 00 be 41 02 00 00  ba c0 01 00 00 b8 02 00  |....A...........|
000005d0  00 00 0f 05 8b e8 8b fd  be 00 00 30 00 8b d3 b8  |...........0....|
000005e0  01 00 00 00 0f 05 8b fd  b8 03 00 00 00 0f 05 b8  |................|
000005f0  e7 00 00 00 0f 05 48 83  e4 f0 bf 78 00 00 00 be  |......H....x....|
00000600  00 00 00 00 b8 02 00 00  00 0f 05 8b d8 48 b8 ff  |.............H..|
00000610  ff ff ff ff ff ff ff 48  89 05 e2 f9 01 00 bf 00  |.......H........|
00000620  00 00 00 be 00 00 02 00  ba 08 00 02 00 41 ba 08  |.............A..|
00000630  00 00 00 b8 0d 00 00 00  0f 05 8b fb be 10 00 02  |................|
00000640  00 b8 05 00 00 00 0f 05  8b fb be 00 00 10 00 48  |...............H|
00000650  8b 15 ea f9 01 00 b8 00  00 00 00 0f 05 8b fb b8  |................|
00000660  03 00 00 00 0f 05 48 b8  00 00 00 00 7f 7f 7f 7f  |......H.........|
00000670  48 8b 1d c9 f9 01 00 48  89 83 00 00 10 00 bd 00  |H......H........|
00000680  00 d0 00 be 00 00 10 00  bf 00 00 40 01 41 b8 00  |...........@.A..|
00000690  00 10 01 45 33 ff 40 0f  b6 06 83 c6 01 8b c8 03  |...E3.@.........|
000006a0  c9 0f b7 89 03 01 00 00  ff e1                    |..........|

20160403

Compiling on Linux Without Libc

For reference, just reposting some of the inline asm bits from one of my engines to jump start compiling without libc...

The shell script that compiles forces C (-x c), since I often use the cpp extension which defaults to C++, and links no libraries except libdl (-nostdlib -ldl).
gcc -x c e.cpp -o e.bin -std=gnu99 -nostdlib -ldl ...
Note that output from "ldd" will show libc even with -nostdlib, because libdl depends on libc, even when the binary only ever uses say 2 external symbols from libdl {dlopen() and dlsym()}. The linux-vdso entry is the kernel-provided mapping which gives some calls a fast path that bypasses the normal syscall. Some "ldd" output,
linux-vdso.so.1 (0x00007fff763f9000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f3cb75ac000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f3cb7209000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3cb77b0000)

Rolling Your Own Main
Running without libc means jumping in from _start instead, and then doing a little assembly to set up the correct environment (note the manual stack alignment).
// Pulled from elsewhere in the engine...
#define ER_ __restrict
#define ES_ static
typedef unsigned char EU1;
typedef signed int ES4;
typedef EU1 *ER_ EU1R;

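// Enter without libc,
ES_ void main(ES4 argc, EU1R *ER_ argv) { ERomMain(argc, argv); EDie(); }
__asm__(
  ".text\n"
  ".global _start\n"
  "_start:\n"
  "xor %rbp,%rbp\n"    // clear the frame pointer to mark the outermost frame
  "pop %rdi\n"         // argc is on top of the stack at entry
  "mov %rsp,%rsi\n"    // after the pop, rsp points at argv[0]
  "andq $-16,%rsp\n"   // manual 16-byte stack alignment
  "call main\n");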

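The entry code above leans on the x86-64 System V layout of the stack at process start: argc first, then the argv pointers, a null, the environment pointers, and another null. Here is a minimal C sketch of that layout, with StartArgs and parse_initial_stack as hypothetical names (the original code only passes argc and argv along),

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    long   argc;
    char **argv;
    char **envp;
} StartArgs;

/* 'sp' is the stack pointer value at _start, viewed as an array of
   pointer-sized slots: sp[0] holds argc, sp[1..argc] hold argv. */
static StartArgs parse_initial_stack(char **sp) {
    StartArgs a;
    a.argc = (long)(uintptr_t)sp[0];
    a.argv = sp + 1;
    a.envp = a.argv + a.argc + 1;  /* skip argv[] and its null terminator */
    return a;
}
```

This is the same thing "pop %rdi" and "mov %rsp,%rsi" do in two instructions; the environment pointer is one more add away if it is ever needed.
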
Syscalls
Sorry in advance, this may wrap. Only the 64-bit x86-64 interface is shown below. Syscalls take 0 to 6 arguments, so just 7 inline asm functions are enough to reach any syscall. The return value is often technically signed (a negative value means an error), but I use unsigned everywhere out of habit, with a typecast when I need the signed result. I grab syscall numbers from the Linux source, and make my own headers for what I need (which is not much).
// Copied from elsewhere in the engine...
#define EI_ static inline __attribute__((always_inline))
typedef unsigned long EU8;

// Linux syscall access.
EI_ EU8 ELnx0(EU8 num) { EU8 ret;
  asm volatile("syscall":"=a"(ret):"a"(num):
    "cc","memory","%rcx","%rdx","%rdi","%rsi","%r8","%r9","%r10","%r11");
  return ret; }
EI_ EU8 ELnx1(EU8 num, EU8 ar1) { EU8 ret;
  asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1):
    "cc","memory","%rcx","%rdx","%rsi","%r8","%r9","%r10","%r11");
  return ret; }
EI_ EU8 ELnx2(EU8 num, EU8 ar1, EU8 ar2) { EU8 ret;
  asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2):
    "cc","memory","%rcx","%rdx","%r8","%r9","%r10","%r11");
  return ret; }
EI_ EU8 ELnx3(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3) { EU8 ret;
  asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3):
    "cc","memory","%rcx","%r8","%r9","%r10","%r11");
  return ret; }
EI_ EU8 ELnx4(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3, EU8 ar4) { EU8 ret;
  register EU8 lar4 asm("r10") = ar4;
  asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3),"r"(lar4):
    "cc","memory","%rcx","%r8","%r9","%r11");
  return ret; }
EI_ EU8 ELnx5(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3, EU8 ar4, EU8 ar5) { EU8 ret;
  register EU8 lar4 asm("r10") = ar4; register EU8 lar5 asm("r8") = ar5;
  asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3),"r"(lar4),"r"(lar5):
    "cc","memory","%rcx","%r9","%r11");
  return ret; }
EI_ EU8 ELnx6(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3, EU8 ar4, EU8 ar5, EU8 ar6) { EU8 ret;
  register EU8 lar4 asm("r10") = ar4; register EU8 lar5 asm("r8") = ar5; register EU8 lar6 asm("r9") = ar6;
  asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3),"r"(lar4),"r"(lar5),"r"(lar6):
    "cc","memory","%rcx","%r11");
  return ret; }
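
As a usage sketch, here is how a wrapper like these turns into an actual call, plus the error check mentioned above. The kernel reports errors as small negative values (-4095 to -1), and on x86-64 it only destroys rcx and r11 besides rax, so the clobber lists above are on the conservative side. The names lnx_is_err, lnx_errno, lnx3 and lnx_write are made up for this example; write being syscall number 1 is the x86-64 numbering,

```c
typedef unsigned long EU8; /* same typedef as above */

/* Errors come back as small negative values in the range [-4095, -1]. */
static inline int lnx_is_err(EU8 ret) { return ret > (EU8)-4096L; }
static inline int lnx_errno(EU8 ret)  { return lnx_is_err(ret) ? (int)-(long)ret : 0; }

#if defined(__linux__) && defined(__x86_64__)
/* Three-argument raw syscall, same register assignment as ELnx3. */
static inline EU8 lnx3(EU8 num, EU8 a1, EU8 a2, EU8 a3) {
    EU8 ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(num), "D"(a1), "S"(a2), "d"(a3)
                     : "cc", "memory", "rcx", "r11");
    return ret;
}

/* write(fd, buf, len) is syscall 1 on x86-64 Linux. */
static inline EU8 lnx_write(int fd, const void *buf, EU8 len) {
    return lnx3(1, (EU8)fd, (EU8)buf, len);
}
#endif
```

A caller would do something like "EU8 r = lnx_write(1, msg, len); if (lnx_is_err(r)) ..." and only cast to signed when the errno value itself is needed.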

20160331

Practical Example of Things Linux Supporters Should Fix

Based on a true story ... literally happening right now, it's March 31st and Hyper Light Drifter has just been released. I'd like to play the game, and wow it is supposed to actually have Linux support.

Step 1: I go to www.heart-machine.com and see if I can just pay the developer directly and download a Linux binary of the game. As a fellow developer I'd much rather pay a developer directly for a game that I'm interested in than purchase it through a third party who takes a cut of the pie.

Fail 1: No ability to buy the game directly, must go through Steam.

Step 2: I go to store.steampowered.com and attempt to download and install Steam on my Linux box.

Fail 2: Clicking the "Download" button takes me here: repo.steampowered.com/steam/archive/precise/steam_latest.deb but with an error message from the server.

Forbidden

You don't have permission to access /steam/archive/precise/steam_latest.deb on this server.
Apache/2.2.22 (Ubuntu) Server at repo.steamstatic.com Port 80


Ok, so for whatever reason the Steam download link for Linux is/was broken. But even if it worked, the download would be useless to me, as it is only a Debian package, and I'm not running Ubuntu, so I have no way to install that package without manually installing something which can unpack DEB files, then attempting to manually install whatever my specific Linux box doesn't have which Steam might depend on.

No easy way to get the game, no easy way to install Steam. This is an example of the typical "Linux experience" today.

I'm still going to get the game, but I'm waiting for the PS4 version. I'd like to support Linux, but quite frankly that is impossible as long as "It Just Doesn't Work".


A Better Time
Another story. Back when I worked for Wolfram Research ages ago, on the UNIX part of the Mathematica frontend and on the UNIX audio support, Mathematica ran on something on the order of 9 UNIX/BSD platforms: Linux/SunOS/AIX/Solaris/etc. Mathematica was statically linked. It just worked. Period.

Seriously, it just worked.

You got a fantastic experience regardless of what platform you were on, or what Linux distro you had on your machine.


Dependency Nightmare
Between the time when I was writing a fork() based audio engine in Mathematica and now attempting to find a way to just give the developer money and get a working Linux binary of a game I'd like to play, the industry adopted this nightmare policy of dynamic linking to a hairball of rolling release libraries.

Now nothing ever works.

Seriously, a 1TB hard drive is $50. The idea that there is a need to dynamically link to save space is utter insanity.


Problem Practices
If you are "supporting" Linux and falling into one of these cases, you are doing serious damage to any effort to make Linux a viable platform.

(1.) Distributing a {insert a specific Linux distro package} file instead of a tgz file which works on any Linux machine.

(2.) Distributing a binary which dynamically links to a tree-sized library dependency chain.

(3.) Requiring a user or developer to manually build your software.


Taking Responsibility
As a Linux developer I strive to link to just one thing, libdl.so. I only use direct syscalls except for {OpenGL/Vulkan, ALSA, Xlib} and those I dlopen() when possible. If syscall mapping changes for a Linux version, I'll just release different binaries. This is about as anti-dependency as is possible, and it best ensures things just work.

For me it was trivial to not use libc and go direct to native system calls. It saves me a tremendous amount of time and pain by NOT depending on other people's broken software libraries and instead writing things directly.

As for the "real world" there are examples of the "right way" to do "libraries". Like STB (stb_howto.txt is a great read). Single file headers with little or no dependencies which a developer includes into a project instead of linking to. Problem solved.
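
A minimal sketch of that single-file pattern, for anyone who hasn't seen it. The library name tiny_clamp and its macro are hypothetical; the convention of defining an implementation macro in exactly one source file before the include is the one the stb headers use,

```c
/* In exactly one .c file of the project, define the implementation macro
   before including the header; every other file includes it plainly. */
#define TINY_CLAMP_IMPLEMENTATION

/* ---- contents of the hypothetical tiny_clamp.h start here ---- */
#ifndef TINY_CLAMP_H
#define TINY_CLAMP_H
/* Declaration: visible to every file that includes the header. */
int tiny_clamp(int v, int lo, int hi);
#endif /* TINY_CLAMP_H */

#ifdef TINY_CLAMP_IMPLEMENTATION
/* Definition: compiled once, into the file that defined the macro. */
int tiny_clamp(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}
#endif /* TINY_CLAMP_IMPLEMENTATION */
/* ---- end of tiny_clamp.h ---- */
```

Nothing to link, nothing to install, and the code ships inside the binary.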

Instant Soft-Reboot to Prior Machine Snapshot

An idea for practical simple memory protection on an otherwise unprotected system. Works similarly in concept to emulator instant save and restore, but applied to a full system. Reserve 50% of memory for a full machine state snapshot. Deny access to this snapshot memory except during save and restore (i.e. the pages are just not in the page table). Hook up a hotkey combination to save or restore full machine state. On crash, simply restore from the snapshot (instant soft-reboot). Design around one full machine barrier per frame for a safe snapshot point.
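
A toy C sketch of the save and restore halves, just to make the idea concrete. The Machine struct, the 4096 byte size and the function names are all hypothetical, and hiding the snapshot half from the page table is not modeled here, only the copy,

```c
#include <stdint.h>
#include <string.h>

enum { LIVE_BYTES = 4096 }; /* stand-in for half of physical memory */

typedef struct {
    uint8_t live[LIVE_BYTES];     /* memory the running system may touch */
    uint8_t snapshot[LIVE_BYTES]; /* hidden half, only touched by save and restore */
    int     has_snapshot;
} Machine;

/* Call at the per-frame barrier, when no work is in flight. */
static void machine_save(Machine *m) {
    memcpy(m->snapshot, m->live, LIVE_BYTES);
    m->has_snapshot = 1;
}

/* Instant soft-reboot: roll the live half back. Returns 0 if nothing was saved. */
static int machine_restore(Machine *m) {
    if (!m->has_snapshot) return 0;
    memcpy(m->live, m->snapshot, LIVE_BYTES);
    return 1;
}
```

On a real machine the two copies would be the only code allowed to see the snapshot pages, and the hotkey or crash handler would call machine_restore.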

20160330

PAGE_EXECUTE_READWRITE

Been very quiet about my background language and OS prototypes, but they slowly continue as I explore various options. One common theme across everything I've tried is a dependence on the simultaneous ability to execute, read, and write to a page of memory. Each language defers compilation until run-time. The binary is a self-compiling position-independent executable which leverages a global dictionary to specialize position-dependent and run-time-dependent code for a given situation at any point during execution. This means quite literally that the line between code and data has been removed. Data generated at run-time contains baked code specific to the data. And so on...
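
For a concrete taste of what that page permission buys, here is a small C sketch for Linux, where the counterpart of PAGE_EXECUTE_READWRITE is an mmap with PROT_READ | PROT_WRITE | PROT_EXEC. It is not how my prototypes do it, just the smallest case of data becoming code: write a "mov eax, imm32; ret" into the page and call it. The name run_generated is made up, and hardened systems may refuse such a mapping, in which case the function reports -1,

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc with strict -std= flags */
#include <string.h>
#include <sys/mman.h>

/* Generate code that returns 'value', run it from a page which is readable,
   writable and executable at once. Returns -1 if the page cannot be mapped
   or the target is not x86-64 (so -1 is ambiguous as an input). */
static int run_generated(int value) {
#if defined(__x86_64__)
    unsigned char code[6] = { 0xb8, 0, 0, 0, 0, 0xc3 }; /* mov eax, imm32; ret */
    memcpy(code + 1, &value, 4);                        /* patch the immediate */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return -1;
    memcpy(page, code, sizeof code);                    /* data becomes code */
    int (*fn)(void) = (int (*)(void))page;
    int result = fn();
    munmap(page, 4096);
    return result;
#else
    (void)value;
    return -1;
#endif
}
```

The same page could be patched again and called again without any permission flip in between, which is the whole point.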

20160304

Tim Sweeney on UWP - And My Thoughts on the Topic

The Guardian Op-Ed by Tim Sweeney : Microsoft wants to monopolise games development on PC. We must fight it

If the Industry Wants an Alternative It's Free to Make That Alternative!
There has never been a better time to do so either.
Great quality Linux Vulkan drivers are soon to be here from all three of the major desktop GPU manufacturers. An engine designed for Vulkan is going to run fantastic on Linux. Linux, with some major distro re-shaping (below), can be made into an awesome alternative desktop OS, with traditional WinNT era ideals. As an open system, the Linux platform has the opportunity to be shaped into an OS which has a console-like Quality of Service guarantee for games and VR. For example, the base kernel already has support for real-time priority and pinned CPU memory. The opportunity for gaining benefits on Linux for VR is tremendous: think about OS guaranteed chunks of time on CPU and GPU timed exactly to HMD sync rate, in combination with guaranteed resident memory with absolutely no hitching or paging.

Valve still has an amazing opportunity if they would take a reformed SteamOS to the desktop instead of just the living room. Valve has the PC OEM link required with Steam to still pull this off on a pre-installed system.

The effort required to build a reformed Linux starts with fully scrapping the traditional Linux distro, and starting from scratch. Reformed distro needs to break conventions core to the politics of the Linux scene: reformed distro runs binary self contained packages dependent on only the kernel, hardware libraries like Vulkan, and Steam libraries. App install needs to be as easy as dragging a folder into an applications folder on the drive. App delete needs to be as easy as deleting the app's folder. Linux dependency nightmare is removed from the equation.

Treat this like an embedded Linux device. Start with only the Linux kernel, a light-weight replacement for glibc (like musl), and light-weight shell tools (like busybox), then add back only what is required to get the core consumer and developer needs supplied. Things like Wine for some Windows binary compatibility, web browser and associated plugins, movie viewer, mixer control, audio player, browser, Steam, working with USB sticks, decompressing/compressing files, etc, all work out of the box. Scrap the "you need to be a systems programmer" standard Linux configuration system. Something like one file, with an easy GUI control panel for standard users, would be a better option. Run from RAM after boot with the exception of large apps.

Everything on the reformed Linux distro either works out of the box, or isn't included. User experience is paramount. The go/no-go gauge for this system is whether a non-computer person can use it without instruction and without any kind of frustration.

Seriously it is a Less Than a Year Effort
I'd offer some of my at-home personal-time to contribute to an organized public effort to build this reformed distro. Years ago I developed and maintained a minimized personal-use from-scratch Linux distro which fit on a 100 MB ZIP disk. All this effort needs is a group of people who all agree enough to make forward progress, who are willing to see the thing to completion. Employers who might benefit from this, could start by sponsoring some official at-work time to work on the effort. Etc.

An alternative doesn't exist because of lack of coordination and action from people who care enough to do something about it. Single individuals don't stand a chance on their own. I've re-pitched this idea for years with no buy-in, would love to hear from others who might be interested in making it happen.

For this to work, GL or Vulkan "ports" of DX11 based Windows games aren't going to cut it. Game devs who want an alternative need to take responsibility for solving the chicken and egg problem: providing a sustained incentive for a consumer to buy into the platform alternative you are asking for, even though it starts without enough market share to justify the effort in the short term. Consoles get this with a critical mass of exclusive 1st party games. A new PC platform can only get this with a sustained migration, until a critical mass point is reached such that it is possible to be the lead platform.

First step is giving consumers a choice without compromise. Release with performance and quality parity on Linux and Windows. The start is taking Vulkan seriously, retooling for the new API to leverage what it is capable of. Use Vulkan to get simultaneous Win7/8/10 support and Linux support from one effort. This is a win-win situation: even if the Linux effort never takes off, the investment in Vulkan on Win7/8 will enable investment in better tech which isn't constrained by having a DX11 fallback rendering path. Major game efforts on Vulkan will force IHVs to dedicate effort to tuning and bug fixing drivers. As developers, you choose what you want to be good by launching a game using the API. Real titles are the best form of QA testing, as they have real-world coverage :) Have Linux working during development: a WIN32 and Linux Kernel interface portability layer is trivial to make.

Closing
The industry continuing on its current trajectory won't bring back the conventions of the past. The opportunity to possibly change that exists, if people are willing to band together to build it.