## 20160626

### FPGA Processor Rethinking : Part 2

Continued from prior post...

Why Not MIMD
Jumping to the conclusion first, I still wouldn't adopt a MIMD architecture for a highly parallel machine. This post is going to attempt to describe why, and outline the rough design that I've been pursuing for a SIMD machine.

INSTRUCTION ISSUES - With MIMD the amount of instruction RAM per core would be tiny. GRVI in the 8 CPU cluster configuration from the GRVI Phalanx Paper has just 4KB per core. Ideally instructions need to be small, with an ISA with good code density (like Forth, like the J1), but ultimately this will drop ALU IPC. The J1 Paper provides an example: 35% of instructions are ALU. Rather be able to issue loads and stores and DSP ALU (d=a*b+c) all in the same cycle, which requires VLIW. SIMD amortizes the large instructions, with a broadcast of the decoded control signals, or a hierarchical instruction decode across the fan out (based on whatever is optimal). I'd like to also have parallel ability to pack and unpack sub-words (to maintain good information density in the core's local scratch RAM), as well as broadcast uniform constants.

ROUTING ISSUES - Or rather timing issues. The Hoplite Paper outlines one of the core problems. Routing the communication of random MIMD jobs requires throttling message injection to avoid saturating the network, which would otherwise result in some drastically horrible message latencies. Seems like Hoplite requires something like roughly 0.5/max(gridWidth,gridHeight) message injection rate max (seems like the router needs to remain roughly half empty to avoid a latency cliff). Optimizing for machines with highly volatile data-dependent run-time performance can be a nightmare. SIMD ensures the work synchronization required for highly efficient communication, as well as makes it easy to implement a majority of the common forms of parallel communication.

Current Design Thoughts
An 18-bit/word massively wide SIMD machine driven by VLIW with the following per lane,

One DSP
One 1024 entry x 18-bit Block RAM (2 ports both either Read or Write)
One 32 entry x 18-bit Register File (32x2Q SLICEM x 9)
One hyper-cube router

Rough machine limits/lane if targeting a 512 lane machine (using same FPGA from prior post),

11552 SLICEM / 512 = 22
22098 SLICEL / 512 = 43

To place this into context, each SLICEL has the ability to MUX four 4-bit values into one 4-bit value, given a 2-bit control signal. Each 18-bit word requires 4.5 SLICEs to implement a 4:1 choice. That gives 43/4.5 or roughly 9 4:1 choices for the SLICELs for driving connections between {reg-file, BRAM, DSP, and router}. Not including using left over SLICEMs, and not including all the other misc things required. Likely impossible to hit a 512 lane target, won't know until I move on to implementing the cores.
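The per-lane budget above is just integer division, so here is a rough sketch that reproduces it. The slice totals and the 4-bits-per-SLICEL mux figure come from the text; the function names are made up for illustration, and the result ignores everything else that has to fit in a lane.

```python
# Slice counts quoted in the post for the target FPGA.
SLICEM_TOTAL = 11552
SLICEL_TOTAL = 22098
WORD_BITS = 18
MUX_BITS_PER_SLICE = 4  # one SLICEL muxes four 4-bit values into one 4-bit value


def per_lane_budget(lanes):
    """Return (SLICEMs per lane, SLICELs per lane), rounded down."""
    return SLICEM_TOTAL // lanes, SLICEL_TOTAL // lanes


def word_mux_choices(slicel_per_lane):
    """How many word-wide 4:1 muxes fit in one lane's SLICEL budget."""
    slices_per_word_mux = WORD_BITS / MUX_BITS_PER_SLICE  # 4.5 slices per 18-bit word
    return slicel_per_lane / slices_per_word_mux


slicem, slicel = per_lane_budget(512)
print(slicem, slicel, word_mux_choices(slicel))
```

Calling `per_lane_budget(740)` gives the one-core-per-DSP budget instead of the 512 lane one.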

BIT-PIPELINED? - Prior was thinking about doing a bit-pipelined machine (1-bit per lane, 16 lanes grouped). Each SLICE can be configured into 4 pairs of 5:1 LUTs each which can make a full adder. The bit-pipelined multiply would require 4 SLICEs per 16 1-bit lanes for each adder. Easy to burn the entire SLICEL budget on the 3 stage pipeline for just an N-bit x 8-bit multiplier (not including the control structures). And would need a huge amount of instructions to drive that machine. Also the routing required roughly 18x the switch registers (compared to current thinking). So tabled the idea for now.

ROUTER - An analogy for this "router" is to think of a railroad switching station, with a set of parallel tracks or "lanes" each connected by a set of switches which can exchange trains between two tracks, but with switches that require lanes to cross over each other (as in a bridge over tracks not involved in the exchange). The trains in this case are really small, ideally one word of information, but likely sub-words due to area (more on this later).

The general idea is to setup the switching station based on the data-driven needs of the communication, then allow the data payload to flow through the switch. For instance when sorting a packet per lane, for each of the log2(lanes) sorting sub-pass of a bitonic sort, the sorting key gets passed through the switching station, setting switches, then the data payload passes through at a cost proportional to size of the payload. The switching station takes out what would otherwise be log2(lanes/pow2(sub_pass)) passes per sub-pass for the payload transfer for the bitonic sort.
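For reference, a minimal software sketch of a textbook bitonic sorting network, where every compare-exchange pairs a lane with the lane whose index differs in exactly one bit (one hyper-cube edge). This is not the switching station itself and only moves keys, not payloads; the function name and the list-of-lanes model are assumptions for illustration.

```python
def bitonic_sort_lanes(keys):
    """Sort one key per lane with a bitonic network; len(keys) must be a power of two."""
    n = len(keys)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    lanes = list(keys)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # one sub-pass per hyper-cube edge (bit j of the lane index)
            nxt = list(lanes)
            for lane in range(n):
                partner = lane ^ j                 # neighbor across this edge
                ascending = (lane & k) == 0        # sort direction of this lane's block
                lo = min(lanes[lane], lanes[partner])
                hi = max(lanes[lane], lanes[partner])
                keep_low = (lane < partner) == ascending
                nxt[lane] = lo if keep_low else hi
            lanes = nxt        # all lanes exchange in lock-step, SIMD style
            j //= 2
        k *= 2
    return lanes


print(bitonic_sort_lanes([3, 7, 4, 8, 6, 2, 1, 5]))
```

Every lane does the same compare-exchange each step, which is the property that makes the pattern a fit for a SIMD machine.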

The series of connections are hyper-cube edges, which means if you take the binary value of the lane index, there is one connection per bit in that lane index. For example, a 512-lane SIMD machine has a 9-bit lane index, and thus 9 connections (9 edges in the 9 dimensional hyper-cube). In order to make a switching station which is useful for basic parallel communication, the ordering in time of the connections matters greatly. Broadcast and other data replication/expansion algorithms need the switches ordered from LSB to MSB, while sorting and data merging algorithms need the switching ordered from MSB to LSB. Continuing the 512-lane example, each of the 9 pipeline stages of the switching station requires,

(1.) Output from the prior pipeline stage.
(2.) The associated LSB-first ordered edge from the prior pipeline stage.
(3.) The associated MSB-first ordered edge from the prior pipeline stage.
(4.) 2-switch control bits (select between 1,2,3 or some extra value).

For a 9 pipeline stage switching station, this would exceed the entire SLICEL budget. So likely going to run the switch at half-word or lower granularity.
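A small sketch of the hyper-cube wiring described above, plus a broadcast that walks the edges from LSB to MSB, one edge per stage. This is a software model only, under the assumption that each stage simply copies from the neighbor across that stage's edge; the function names are hypothetical and nothing here models the 2-bit switch control or sub-word granularity.

```python
def hypercube_neighbors(lane, dims):
    """Lanes reachable over one edge: flip each bit of the lane index in turn."""
    return [lane ^ (1 << bit) for bit in range(dims)]


def broadcast(values, source, dims):
    """Replicate values[source] to every lane, one hyper-cube edge per stage, LSB first."""
    n = 1 << dims
    data = list(values)
    has = [lane == source for lane in range(n)]  # which lanes hold the broadcast value
    for bit in range(dims):                      # stage order: LSB to MSB
        nxt_data, nxt_has = list(data), list(has)
        for lane in range(n):
            partner = lane ^ (1 << bit)
            if has[partner] and not has[lane]:
                nxt_data[lane] = data[partner]
                nxt_has[lane] = True
        data, has = nxt_data, nxt_has
    return data


print(hypercube_neighbors(0, 9))
print(broadcast(list(range(8)), source=5, dims=3))
```

The number of lanes holding the value doubles each stage, so 9 stages cover all 512 lanes.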

MOVING ON - Choosing VLIW ISA, and the exact structure for the cores is hard work. Validating the design has all the hardware features I'd like for solving problems is also challenging. Left off thinking about multi-precision math, and how to best manage the 3 cycle DSP latency.

## 20160623

### FPGA Processor Rethinking

Have been very inspired by Jan Gray's GRVI Phalanx.

MIMD
Outside of work, I've been slowly attempting to work up a paper design for my own massively parallel FPGA based computer, looking to close on something to actually build. Mostly been working on SIMD based machines, without a serious focus on MIMD, until I read Jan's paper, which set me off in another direction, how would I build a MIMD machine in a Xilinx FPGA?

Basics
Looking at using a board with the fastest Artix-7. Collecting numbers,

33650 - CLB Slices
740 - DSP Slices
365 - 36 Kbit Block RAMs (each which can be split into two 18 Kbit BRAMs)

If could use all 740 DSP Slices and could maintain a 375 MHz clock (number borrowed from Jan's paper), that would be,

740 DSPs * 2 ops/clk * 375 MHz = 0.555 Tops/sec

Talking about effectively trying for a 740 core MIMD machine. Definitely won't be able to realize that peak, and in comparison to GPUs like FuryX at 8.6 Tops/sec (and 32-bit instead of 18-bit), this number seems small at first. Except if this little FPGA machine was driving an old TV like a console at around NES resolution (same width, less height),

PC driving 2560x1440 which at 8.6 Tops/sec is ___2.3 Mops/pixel/sec.
FPGA driving 256x192 which at 0.5 Tops/sec is __11.3 Mops/pixel/sec.

Which is complete insanity levels of performance per pixel for the FPGA machine for a vintage arcade box, even if only reaching 1/4 of that performance.
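The arithmetic behind those numbers, as a quick sketch. The DSP count, clock, resolutions, and the 8.6 Tops/sec GPU figure are the ones quoted above; the function names are just for illustration. Note the 11.3 figure comes from the unrounded 0.555 Tops/sec peak.

```python
def peak_ops_per_sec(dsps, ops_per_clk, clock_hz):
    """Peak throughput if every DSP retires ops_per_clk operations each clock."""
    return dsps * ops_per_clk * clock_hz


def ops_per_pixel(ops_per_sec, width, height):
    """Ops per second available to each pixel of the output resolution."""
    return ops_per_sec / (width * height)


fpga_peak = peak_ops_per_sec(740, 2, 375_000_000)  # multiply-add counted as 2 ops
gpu_peak = 8.6e12

print(fpga_peak / 1e12, "Tops/sec")
print(round(ops_per_pixel(gpu_peak, 2560, 1440) / 1e6, 1), "Mops/pixel/sec (PC)")
print(round(ops_per_pixel(fpga_peak, 256, 192) / 1e6, 1), "Mops/pixel/sec (FPGA)")
```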

So Let the Fun Begin
DSPs are limited by having an 18-bit input for the "d=a*b+c" operation, Block RAMs are natively 18-bits wide, so naturally this is going to be an 18-bit computer. Working backwards to get rough design constraints, first breaking down how many CLB slices support distributed RAMs, and dividing everything by the 740 DSP slices.

11552 SLICEM / 740 = 15
22098 SLICEL / 740 = 29
__730 _BRAMS / 740 = ~1 (18 Kbit)

Block RAMs are dual ported, so they don't have the ports required to keep the DSPs filled. Each SLICEM in contrast can be used as a Quad-Port 32 x 2-bit RAM, which looks like a good target for a register file (can sustain 3 read ports for the DSP op). Will need 9 SLICEMs to support a 32 entry x 18-bit register file. These SLICEMs only support 1 write port, and want 2 (for parallel ability to write into the register file while doing DSP ops). So register file will need to be at least 2 banks, for a total of 18 SLICEMs. High level register file design will limit the peak number of DSPs used.
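To make the last point concrete, a rough sketch of the SLICEM count for the register file against the per-DSP budget. The 2 bits per SLICEM, the 18-bit word, and the slice totals are from the text; the function name and the "register file is the only SLICEM consumer" simplification are assumptions.

```python
import math

SLICEM_TOTAL = 11552
DSP_TOTAL = 740


def regfile_slicems(word_bits=18, bits_per_slicem=2, banks=1):
    """SLICEMs needed for a 32-entry register file built from 32 x 2-bit quad-port RAMs."""
    return math.ceil(word_bits / bits_per_slicem) * banks


one_bank = regfile_slicems()              # single write port
two_banks = regfile_slicems(banks=2)      # two banks to get a second write port
budget_per_dsp = SLICEM_TOTAL // DSP_TOTAL
max_cores = SLICEM_TOTAL // two_banks     # upper bound if nothing else used SLICEMs

print(one_bank, two_banks, budget_per_dsp, max_cores)
```

With two banks the register file alone is over the per-DSP SLICEM budget, so under these assumptions the core count tops out below the 740 DSPs.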

Initial target of one 1024 entry x 18-bit Block RAM per core for data (roughly 16x the capacity of the register file). If clustering 8 CPUs together, that is 8K x 18-bit words for data, via 8-way banking. I'm thinking about doing something quite rash, and only supporting aligned 8-word (8 * 18-bits/word) block loads and stores from this data RAM, both for the CPU register file and the message router. This in comparison with the GRVI would replace the 2:1 concentrators and 4x4 crossbar. Instead the CPU would have 4 8-word regions in one of the banks which could be accessed for block load/store, in parallel with the other bank of 32-word register file used for DSP operations. Effectively the CPU would be modal with a binary switch, one bank used for block load/store to setup for next group of computation, while the other bank is used for computation. Then switch to do math on the loaded data, and load new data in the other bank. It is a very restrictive and simple design but something I think can work quite well in practice.

Also thinking for instruction RAM, sharing one BRAM between two CPUs. But switching to block loads, so all 8 CPUs in a cluster can share 4K x 18-bit words of BRAM. This involves an ISA design which can compute the branch target early enough in the pipeline.

More next time...

## 20160606

### Knupath: Push Model

Another MIMD machine of tiny processors connected by an on-chip network,
Former NASA Exec Brings Stealth Machine Learning Chip to Light
{"We wanted to have the processing in immediate vicinity of the memory—a push model. You don’t need the cache, you don’t need to do fetch. We didn’t design this just for processing, we balanced communications and processing in memory to keep balance. It’s a communicator—there’s a router right in the middle of it," Goldin explains. Unfortunately, the reason they signed a contract for the first chip, which came out in 2015, is because of that eDRAM feature, which put each of the tDSPs right adjacent to memory for immediate contact. While the next variant of their chip won’t be able to use it, they have found a suitable workaround, although they were not able to provide details as of yet.}

If the first diagram is accurate,

Cluster: 8 DSPs sharing 2 MB of eMEM (guessing eDRAM?) + 256 KB sMEM (static RAMs for program binaries?).
Super Cluster: 8 clusters connected with a full 8x8 crossbar.
Chip: 4 super clusters connected with a full 4x4 crossbar.

And from the second diagram,

Chip 4x4 crossbar has 16 bidirectional ports at 10 Gbit/sec per port.
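Multiplying out that hierarchy gives the per-chip totals. This is a sketch that takes my reading of the two diagrams at face value, so the inputs are guesses as noted above, and the variable names are only for illustration.

```python
DSPS_PER_CLUSTER = 8
CLUSTERS_PER_SUPER_CLUSTER = 8
SUPER_CLUSTERS_PER_CHIP = 4
EMEM_MB_PER_CLUSTER = 2
SMEM_KB_PER_CLUSTER = 256
CROSSBAR_PORTS = 16
GBIT_PER_PORT = 10

clusters_per_chip = CLUSTERS_PER_SUPER_CLUSTER * SUPER_CLUSTERS_PER_CHIP
dsps_per_chip = clusters_per_chip * DSPS_PER_CLUSTER
emem_mb_per_chip = clusters_per_chip * EMEM_MB_PER_CLUSTER
smem_mb_per_chip = clusters_per_chip * SMEM_KB_PER_CLUSTER / 1024
crossbar_gbit_per_sec = CROSSBAR_PORTS * GBIT_PER_PORT  # per direction, all ports busy

print(dsps_per_chip, "DSPs,", emem_mb_per_chip, "MB eMEM,", smem_mb_per_chip, "MB sMEM")
print(crossbar_gbit_per_sec, "Gbit/sec aggregate at the chip crossbar")
```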

## 20160517

### VK_AMD_rasterization_order Time Saver

(2.) Add VK_AMD_RASTERIZATION_ORDER_EXTENSION_NAME (this string, "VK_AMD_rasterization_order", is defined in vulkan.h) to your VkDeviceCreateInfo.ppEnabledExtensionNames.

(3.) Then use the extension, for example as follows,
// using "static" here to have structure pre-zeroed, feel free to clear instead, etc
static VkPipelineRasterizationStateRasterizationOrderAMD orderAMD;
orderAMD.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_RASTERIZATION_ORDER_AMD;
orderAMD.rasterizationOrder = VK_RASTERIZATION_ORDER_RELAXED_AMD;

// append it into the pNext of raster state
VkPipelineRasterizationStateCreateInfo createInfo;
...
createInfo.pNext = &orderAMD;

## 20160424

### Parallel Project : Register File

Referencing,
7 Series FPGAs Overview
7 Series FPGAs Memory Resources

Register File
Starting with the first constraint of the design, how to layout the register file for the SIMD machine. Artix-7 XC7A200T is the target which has capacity for 365 36-Kbit block RAMs (32-Kbit data, 4-Kbit parity). Block RAMs are symmetrical dual port (read or write). Each port takes a 16-bit bit indexed address, and returns a 36-bit (32-bits data, 4-bits parity) bit addressed sliding window into the memory.

The ALU design requires 2 read ports and 1 write port. It would be possible to get 2 read ports by duplicating writes and splitting memory to 2 copies. However I'd like to be able to use all of the memory. Since the majority of operations are going to stream through consecutive addresses in the register file, can leverage the sliding window to get 2 consecutive bits per read, and pipeline such that even clocks pull 2-bits for the first read operand, and odd clocks pull 2-bits for the 2nd read operand. So ignoring parity, one possible configuration is to set the write port to only write 16-bits (16 lanes), and have the read port fetch 32-bits,

BIT ADDRESS SIMD LANE
--- ------- ---------
0   n       (lane & 15) = 0
1   n       (lane & 15) = 1
2   n       (lane & 15) = 2
...
15  n       (lane & 15) = 15
16  n+1     (lane & 15) = 0
17  n+1     (lane & 15) = 1
18  n+1     (lane & 15) = 2
...
31  n+1     (lane & 15) = 15
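The table above as a small sketch: the position of a bit inside one block RAM, and the rows of the 32-bit read window. The layout (16 lanes per write, two consecutive addresses per read) is the one from the table; the function names are hypothetical.

```python
LANES_PER_BRAM = 16
READ_WINDOW_BITS = 32


def bram_bit_index(address, lane):
    """Bit position inside the block RAM for register bit `address` of `lane`."""
    return address * LANES_PER_BRAM + (lane & 15)


def read_window(base_address):
    """Rows of (bit, address, lane & 15) for the 32-bit window read at base_address."""
    return [
        (bit, base_address + bit // LANES_PER_BRAM, bit % LANES_PER_BRAM)
        for bit in range(READ_WINDOW_BITS)
    ]


for row in read_window(0)[:3]:
    print(row)
```

One read therefore returns register bits n and n+1 for all 16 lanes sharing the block RAM.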

In this example each 36-Kbit block RAM supports 16 SIMD lanes each with 2-Kbit of register file. Throughput of the register file provides an upper bound on performance. For a given clock rate, and assuming 256 block RAMs (70%) are used for the register file, limiting throughput is:

1-bit ops/second = clockRate * 256 block rams * 16 lanes

A 128 MHz machine would thus be limited to,

128 M * 256 blocks * 16 lanes = around 512 Gop/sec

Where each op provides 1-bit of a variable precision ALU operation. My target for video output is the NES resolution: 256x224 at around 59.94 Hz (non-interlaced) and only NTSC monochrome (works on any classic TV). Target is roughly 3.4 Mpix/second. Dividing out to ops/pixel = 150 Kop/pixel. Even if this rough estimate is way too optimistic (which it is), going to have a tremendous amount of perf/pixel for this neo-vintage arcade machine.
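The same estimate as a quick sketch. Inputs are the numbers from the text (128 MHz, 256 block RAMs, 16 lanes each, 256x224 at 59.94 Hz); the names are for illustration. In decimal units the product is about 524 G 1-bit ops per second, which reads as 512 G if a G is counted as 1024 M, and either way lands near 150 Kop/pixel.

```python
CLOCK_HZ = 128_000_000
BLOCK_RAMS = 256
LANES_PER_BLOCK = 16


def limiting_throughput(clock_hz, block_rams, lanes_per_block):
    """Upper bound on 1-bit ops per second set by register file throughput."""
    return clock_hz * block_rams * lanes_per_block


bit_ops_per_sec = limiting_throughput(CLOCK_HZ, BLOCK_RAMS, LANES_PER_BLOCK)
pixels_per_sec = 256 * 224 * 59.94
bit_ops_per_pixel = bit_ops_per_sec / pixels_per_sec

print(bit_ops_per_sec / 1e9, "G 1-bit ops/sec")
print(pixels_per_sec / 1e6, "Mpix/sec")
print(bit_ops_per_pixel / 1e3, "K 1-bit ops/pixel")
```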

Continuing, 256 blocks * 16 lanes = 4096 lane SIMD machine, but the XC7A200T only has 33,650 slices. Using 70% again: 33650 slices * 0.7 / 4096 lanes = 5.7 slices/lane. Yes way too optimistic. Slices per lane for the 1-bit/clock return ALU will limit the peak number of lanes in the machine, which will reduce limiting throughput, but on the positive side, will result in more memory per lane. Can work backwards from possible block RAM configurations to get slice/ALU targets, and this time going to round down to only using 16k slices,

256 blocks * 16 lanes,  2-Kbit/lane,  4 slices/lane
256 blocks *  8 lanes,  4-Kbit/lane,  8 slices/lane
256 blocks *  4 lanes,  8-Kbit/lane, 16 slices/lane
256 blocks *  2 lanes, 16-Kbit/lane, 32 slices/lane
256 blocks *  1 lanes, 32-Kbit/lane, 64 slices/lane
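That table falls out of two divisions per row. A sketch, assuming "16k slices" means 16384 and that each block RAM contributes 32 Kbit of data; the function name is made up.

```python
BLOCK_RAMS = 256
DATA_KBIT_PER_BLOCK = 32
SLICE_BUDGET = 16 * 1024  # "16k slices", assumed to mean 16384


def lane_config(lanes_per_block):
    """Return (total lanes, Kbit of register file per lane, slices per lane)."""
    total_lanes = BLOCK_RAMS * lanes_per_block
    kbit_per_lane = DATA_KBIT_PER_BLOCK // lanes_per_block
    slices_per_lane = SLICE_BUDGET // total_lanes
    return total_lanes, kbit_per_lane, slices_per_lane


for lanes_per_block in (16, 8, 4, 2, 1):
    total, kbit, slices = lane_config(lanes_per_block)
    print(f"{lanes_per_block:2d} lanes/block: {total:4d} lanes, "
          f"{kbit:2d}-Kbit/lane, {slices:2d} slices/lane")
```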

More Constraints
Block RAMs only have byte, not bit, write enable, so if it is possible to reach the 2 lane or better configuration, this rules out building any kind of per-lane predication which depends on disabling register file writes per lane. Going to work through ALU design without needing lane write enable. Likewise original thoughts on NPU have been transformed quite a lot based on real constraints. Focusing on fused ALU+NPU: where ALU operands can come from other edges in the network graph.

More next time ...

## 20160423

### Parallel Machine Thoughts

Been very slowly working up towards learning everything required to build my own highly parallel computer in an FPGA. This process has in some ways been on-going for years, getting comfortable building languages and working on little OS prototypes, etc. On the way taking inspiration from other people's projects like the JayStation and Linus Åkesson's Parallelogram, and realizing that not-too-expensive FPGAs have reached the performance level where it is possible to simulate vintage arcade machines like the NeoGeo. This post represents some of the higher level thinking on the current design direction which is highly motivated by the CM-1.

Fusing MEM and ALU
One traditional design construct which I'm intending to toss out is the macro-level separation of memory and ALU. The FPGA I have in mind currently is the Digilent Nexys Video Artix-7 FPGA Trainer Board which has the largest 7 series Xilinx chip which can be used with the free version of Vivado. The Artix-7 has 13-mega-bits of on-chip block ram. To place this into context, the 2nd Amiga, the A2000, started with 8-mega-bits, and a little over 7 MHz CPU. Effectively I'm looking at carving out the block rams across a large number of tiny processors connected by an on-chip network, treating the external DDR3 more like volatile "storage" instead of "memory". I'm intending to prototype (or just design if I fail) something for fun and learning (something I could build a little arcade game on), but if made into actual hardware could hopefully be scaled up to stacked chips like HBM but instead of being composed of just memory, would have arrays of paired eDRAM and ALU cells.

Contrasting with GPUs
The majority of on-chip memory in a GPU is effectively a parking lot for the state of parallel jobs waiting on off-chip memory access, or high latency hits into on-chip caches. As a direct result of this, one of the software programming model constructs a GPU effectively deprecates, is the usage of the large CPU stacks. Likewise deprecated, the model of writing large software libraries invoked with "call and return" traversing huge library dependency chains.

However the GPU is still built around the data analog to "call and return", the "fetch memory" construct (the VMEM load/sample opcodes in GCN for example). The important property of the "fetch memory" construct is that it is a blocking operation.

Designing Around All Non-Blocking Constructs
GPU is designed around vector gather (aka the texture fetch), with only limited ability to scatter, either locally into a tiny memory like LDS divided across workgroups, or globally into L2, but no ability to directly scatter into the largest on-chip memory: the register file (aka the GPU's parking lot). One of the best evolutionary changes between the CPU and GPU was moving the ALU for global atomics from the shader cores to the other side of the crossbar close to the L2. I'd like to continue this evolution to the next logical step.

An inverse of the gather machine is a machine built around scatter, a machine built around non-blocking "fire-and-forget". Conceptually the tiny processors become "pinned jobs" which are persistently running instead of mostly parked, and the "scatter" is on-chip parallel message routing between the register files of these tiny pinned jobs.

Designing Around Cross-Lane Operations
Looking to repurpose the area of the traditional GPU memory sub-system, like caches and texture units, along with the ports used in the register file for load/store, instead for what is effectively a second functional unit which specializes in doing cross-lane operations in parallel with the ALU on a giant SIMD machine. These cross-lane operations become the "scatter" or parallel message routing network which I'm going to call the "NPU" (short for Network Processing Unit). This NPU would be able to do things like broadcast (on a local to global scale), sorting (when paired with ALU operations), scatter, etc. NPU's purpose is to re-organize data for the needs of computation in parallel with computation.

Transposed ALU Design
The ALU I have in mind is very similar to the CM-1: an ultra-wide SIMD machine with a bit addressed register file and a very simple ALU. Where conventional SIMD processor opcodes use an immediate register index to address and fetch something like 1 32-bit word per lane, the design I have in mind instead would fetch 1-bit for 32 words instead (using the same register index). The opcode register index is thus a bit index (shifts are amortized into the cost of the bit addressed register file), and the ALU, instead of retiring one 32-bit result per op, retires one bit for 32 words. This introduces the ability to do variable bit-width operations, but greatly increases the wall clock time it takes to compute anything (no longer optimizing for latency here because there are no non-register file memory reads).
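A software sketch of what "one bit for many words per op" looks like, using a bit-plane layout: plane k is an integer whose bit i holds bit k of lane i's value. The add below retires one plane per step with a full adder applied across all lanes at once. The layout and function names are assumptions for illustration, not the planned hardware.

```python
def to_bit_planes(values, width):
    """Transpose per-lane words into `width` bit-planes (bit i of a plane = lane i)."""
    planes = []
    for bit in range(width):
        plane = 0
        for lane, value in enumerate(values):
            plane |= ((value >> bit) & 1) << lane
        planes.append(plane)
    return planes


def from_bit_planes(planes, lanes):
    """Inverse transpose: rebuild one word per lane from the bit-planes."""
    return [
        sum(((plane >> lane) & 1) << bit for bit, plane in enumerate(planes))
        for lane in range(lanes)
    ]


def bit_serial_add(a_planes, b_planes):
    """Add two transposed operands, one bit-plane (one 'op') per step, LSB first."""
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)                  # sum bit for every lane at once
        carry = (a & b) | (carry & (a ^ b))        # carry bit for every lane at once
    return out                                     # carry-out dropped: wraps modulo 2**width


a = to_bit_planes([1, 2, 3, 250], width=8)
b = to_bit_planes([5, 6, 7, 10], width=8)
print(from_bit_planes(bit_serial_add(a, b), lanes=4))
```

The operand width is just the number of planes streamed through, which is where the variable bit-width ability comes from, at the cost of one step per bit.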

In order to make it possible to compute multiplies faster, I'm looking at an ALU design which includes a pipelined tree reduction, with a mux at the end to select which element of the tree to return to the register file (implemented in block rams). Very rough diagram below for a 4-bit tree. Each [_] is an adder, and with carry and control logic not drawn. The "y" input is transferred across the tree with a one clock delay. Each of the [a] through [d] adders have an input which feeds back to itself for the next cycle, with a control input which can selectively override with the new "x" input to the module.
bit in x --->o-----o-----o-----o
             |     |     |     |
bit in y -->[a]-->[b]-->[c]-->[d]
             |\   /       \   /
             | [e]         [f]
             |  | \       /
             |  |  \     /
             |  |    [g]
             |  |     |
bit out z <--o--o-----o

The number of leaves of the tree acts as some multiplier on the throughput of multiply, while having side effects on the critical path timing, and increasing area, etc. I'm still exploring what other uses this tree construct could have to see what opcode control signals I should support to reconfigure.

Bit-Banked Register File
The FPGA's block rams are dual port, and I need 3 ports for the ALU, and more for the NPU. To work around this problem, I'm looking at banking the register file at the bit address level. Multi-precision ALU ops and multi-bit message passing (NPU) will stream through a series of addresses, making it easy to have streams start at different addresses modulo the banking multiplier, in theory at least. This adds an associated complexity to the compiler for code generation, or the human, if hand assembling code. Not sure yet if this idea is going to work out.
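A toy model of the banking idea, to show why start addresses matter. It assumes the bank is simply the bit address modulo the number of banks, that all streams advance one address per clock, and (pessimistically) that each bank services one access per clock. All of that, and the function names, are assumptions for illustration.

```python
def bank_of(bit_address, banks):
    """Bank selected by a bit address under simple modulo banking."""
    return bit_address % banks


def conflict_free(stream_starts, length, banks):
    """True if streams advancing in lock-step never hit the same bank on the same clock."""
    for step in range(length):
        used = [bank_of(start + step, banks) for start in stream_starts]
        if len(set(used)) != len(used):
            return False
    return True


# Two read operands and one write result streaming through 18 bits, 4 banks.
print(conflict_free([0, 33, 66], length=18, banks=4))  # starts differ modulo 4
print(conflict_free([0, 32, 66], length=18, banks=4))  # 0 and 32 share a bank every clock
```

Since every stream advances at the same rate, it is enough for the start addresses to differ modulo the bank count, which is the constraint the code generator (or the human) would have to maintain.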

NPU
Finding this to be the most challenging part of the project, producing a design which is not too large in comparison with the ALU, etc. I'm working from the idea of using the same CM-1 style hypercube network, but not sure about edge node design (CM-1 has 16 ALUs at each edge). Currently doubt I want anything like a full 16-lane crossbar in an edge.

From a high level view, messages are stored in the register file, and the NPU can conditionally swap two SIMD lanes' messages, with the ability to conditionally write the data payload of the swapped message instead to a second area of register file for a message which reached the destination (this way the NPU can continue to use the message scratch area for routing other messages). Lane connectivity is limited by whatever fixed network layout I end up going with. Decided on using the register file as the message buffering area to allow the "program" to control the full routing algorithm (instead of having say a fixed petit cycle as in the CM-1 and a separate memory, etc). This also effectively limits the design to message swaps (like a parallel sorting network), but enables the possibility of doing broadcast and other things requiring message duplication.

Messages are composed of {message enable bit, variable bit relative lane address, variable bit payload}. Some message passing won't need the relative lane address (fixed network flow instead of data driven). Also parts of the message could be read out of order. For instance to build up the NPU register which enables the message transfer to actually write to the lane's register file, for routing like in the CM-1, the bit representing the active edge of the petit cycle would be read first.

The NPU design I'm looking at has one read and one write port into the register file, if message throughput is a bit per clock per 1-bit lane in the SIMD machine. The alternative is to drive messages at half rate and use the one FPGA block ram port which can be read or write. This has some appeal because then only 4 ports are needed across the ALU and NPU combined. And lastly I'm thinking about just merging ALU and NPU, so that network inputs can directly be used as ALU input.

On Paper
Current work involves building towards a fixed ALU+NPU logic design which gets driven by SIMD control fan-out. Once I'm happy with the ALU+NPU design, I'm back to verilog to see how many of these I can place in the FPGA. Then I need to think more about building the "sequencer" or CPU-style scalar unit which drives the control lines for the massive SIMD machine.

Wrestled about actually blogging about this for a long time, because like anything not tied to the day job with a real deadline, I'm likely to scrap everything and do something else. And this is my first attempt at somewhat serious hardware design, which is certain to contain horrible fail moments, which hopefully others can enjoy at my expense. On the other hand, unlike dealing with software, this project has brought back the sense of wonder I had as a kid which otherwise since has transformed into sarcasm after dealing with how the industry has since evolved.

## 20160422

### The End of HDMI to Analog HDfury

According to an email today to prior customers like myself, the makers of the great HDFury devices, have been forced to discontinue selling all HDMI to Analog devices within a week. Likely the last chance to purchase one before your only option is the used online market. As someone who depends on these devices to drive my analog CRTs from modern digital equipment for personal fair use, I'm sad to see the continued decline of my rights and ability to just use my computers for basic fundamental things like analog output. Hopefully the HDFury business continues with other products and hopefully at some point they release some kind of Display Port to Analog converter.

## 20160420

### April 2016 LG OLED65G6P HDR OLED TV Review on HDTVtest

LG OLED65G6P HDR OLED TV Review on HDTVtest
Now April 2016, OLED HDR TVs reach 744 nits (still under one stop brighter peak than your typical PC panel). While April 2016 LCD HDR TVs are upwards of 1000 nits with one TV reaching 1300 nits. OLED is more efficient than Plasma but cannot hit the power efficiency of LEDs backing LCDs. 2016 LED backed LCD TVs are still limited in contrast, reaching upwards of only 5000:1 ANSI contrast. While OLED's selling point is the dark blacks and huge contrast, this is still a serious pain point for the technology scaled up to TV sizes. I've been watching this space hoping to eventually replace my Plasma while I still can get an OLED at 1080p, but the technology just is not there yet. Quotes from the HDTVtest review from this April 2016,

"We calibrated several G6s in both the ISF Night and ISF Day modes, and found that with about 200-250 hours run-in time on each unit, the grayscale tracking was consistently red deficient and had a green tint as a result (relative to a totally accurate reference)"

"Case in point: we originally calibrated one of the units and added some targeted touch-ups at 70% stimulus, only to find that, 60 hours later, the same adjustments that had gained us totally flat grayscale at the time of calibration were no longer ideal."

"LG have placed the option to run the self-correction process in the user menu (previously, it was only accessible in the service mode.) ... It takes a little over an hour, and interrupting the TV before it’s finished will require you to start again."

"The LG 65G6’s [Brightness] control (which governs black level) is set to discard some dark-scene details by default. We found that we had to raise it by a decent number of clicks from the factory position of “50” in order to avoid this. ... After doing this, during our dark-scene, dark-room testing, we noticed the blacks “floating”. During cuts to black, we could see swathes of non-black areas lighted on the panel, which is probably why LG crushed blacks by default."

"The LG OLED65G6’s input lag measured as being 34ms using the Leo Bodnar testing device."

## 20160409

### ELF

Random source from one of my prior languages which generates an ELF header for an x86-64 Linux binary with the dlsym() symbol.

\============================================================================
[ELF] LINUX
-----------------------------------------------------------------------------
http://www.sco.com/developers/gabi/2000-07-17/ch4.symtab.html
http://blog.markloiseau.com/2012/05/tiny-64-bit-elf-executables/
============================================================================\
word .elfPhXWR word dup dup long long long long long long }
{ ElfDs! \val tag\ long long }
{ ElfSym! \size value shndx other type bind name\
word 10 mul add byte byte half long long }
{ ElfStr( $.ElfStr neg add }
{ ElfStr) text 0 byte }
\===========================================================================\
{ Elf
\---------------------------------------------------------------------------\
\ELF HEADER\
\e_ident\     00010102464c457f long
\reserved\    0 long
\e_type\      2 half \ET_EXEC\
\e_machine\   3e half \X86-64\
\e_version\   1 word \EV_CURRENT\
\e_entry\     .ElfEntry long
\e_phoff\     .ElfPh long
\e_shoff\     0 long
\e_flags\     0 word
\e_ehsize\    40 half
\e_phentsize\ 38 half
\e_phnum\     .ElfPh# half
\e_shentsize\ 40 half
\e_shnum\     0 half
\e_shstrndx\  0 half
\---------------------------------------------------------------------------\
\PROGRAM HEADER\
$ :ElfPh
7 :elfPhXWR \PF_X+PF_W+PF_R\
3 :ElfPh#
1       .elfIs# .elfIs# .ElfIs  3 \PT_INTERP\  ElfPh!
.BUILD# .BUILD# .ElfEnd 0       1 \PT_LOAD\    ElfPh!
8       .elfDs# .elfDs# .ElfDs  2 \PT_DYNAMIC\ ElfPh!
\---------------------------------------------------------------------------\
\DYNAMIC SECTION\
$:ElfDs
.elfLib% 1 \DT_NEEDED\  ElfDs!
.ElfHsh  4 \DT_HASH\    ElfDs!
.ElfStr  5 \DT_STRTAB\  ElfDs!
.ElfSym  6 \DT_SYMTAB\  ElfDs!
.ElfRel  7 \DT_RELA\    ElfDs!
.elfRel# 8 \DT_RELASZ\  ElfDs!
18       9 \DT_RELAENT\ ElfDs!
0        0              ElfDs!
.ElfDs$# :elfDs#
\---------------------------------------------------------------------------\
\SYMBOL TABLE\
10 $- \overlap\$ :ElfSym
0 0         0 0               0              0            0          ElfSym!
0 .ElfDlSym 0 0 \STV_DEFAULT\ 1 \STT_OBJECT\ 2 \STB_WEAK\ .elfDlSym% ElfSym!
\---------------------------------------------------------------------------\
\RELOCATION TABLE\
$:ElfRel .ElfDlSym long 7 \R_X86_64_JUMP_SLOT\ word 1 word 0 long .ElfRel$# :elfRel#
\---------------------------------------------------------------------------\
\DLSYM\
$:ElfDlSym 0 long
\---------------------------------------------------------------------------\
\HASH TABLE\
$ :ElfHsh 1 word 2 word 1 word 0 word 0 word
\---------------------------------------------------------------------------\
\INTERPRETER STRING\
$:ElfIs "/lib/ld-linux-x86-64.so.2" text 0 byte .ElfIs$# :elfIs#
\---------------------------------------------------------------------------\
\DYNAMIC STRING TABLE\
1 $- \overlap\$ :ElfStr
0 byte
ElfStr( :elfLib%   "libdl.so.2" ElfStr)
ElfStr( :elfDlSym% "dlsym"      ElfStr) }
\===========================================================================\
{ Elf( 0 asm! 0 $0!$_0 Elf }
{ Elf) asm :ElfEnd }
{ Elf! :ElfEntry }

And the associated "readelf" on a binary generated with this code,

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1e7
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         3
  Size of section headers:           64 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections to group in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  INTERP         0x00000000000001bc 0x00000000000001bc 0x00000000000001bc
                 0x000000000000001a 0x000000000000001a  RWE    1
      [Requesting program interpreter: /lib/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000001e48 0x0000000002000000  RWE    2000000
  DYNAMIC        0x00000000000000e8 0x00000000000000e8 0x00000000000000e8
                 0x0000000000000080 0x0000000000000080  RWE    8

Dynamic section at offset 0xe8 contains 8 entries:
  Tag                Type       Name/Value
  0x0000000000000001 (NEEDED)   Shared library: [libdl.so.2]
  0x0000000000000004 (HASH)     0x1a8
  0x0000000000000005 (STRTAB)   0x1d5
  0x0000000000000006 (SYMTAB)   0x158
  0x0000000000000007 (RELA)     0x188
  0x0000000000000008 (RELASZ)   24 (bytes)
  0x0000000000000009 (RELAENT)  24 (bytes)
  0x0000000000000000 (NULL)     0x0

There are no relocations in this file.

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

Histogram for bucket list length (total of 1 buckets):
  Length  Number  % of total  Coverage
  0       0       (  0.0%)
  1       1       (100.0%)    100.0%

No version information found in this file.
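For anyone cross-checking the header fields without the custom language, here is a rough equivalent of just the 64-byte ELF header using Python's struct module. The field values are the ones from the source and the readelf output above (entry 0x1e7, program headers at offset 64, 3 program headers); the function name is made up, and this builds the header only, not a runnable binary.

```python
import struct

# ELF64 header, little endian: e_ident[16], e_type, e_machine, e_version, e_entry,
# e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize,
# e_shnum, e_shstrndx.
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")


def build_elf64_header(entry, phoff, phnum, shentsize=64):
    """Pack an ET_EXEC x86-64 ELF header with no section headers."""
    ident = bytes([
        0x7F, 0x45, 0x4C, 0x46,  # magic: 0x7f 'E' 'L' 'F'
        2,                       # ELFCLASS64
        1,                       # ELFDATA2LSB (little endian)
        1,                       # EV_CURRENT
        0,                       # System V ABI
    ]) + bytes(8)                # ABI version and padding
    return ELF64_EHDR.pack(
        ident,
        2,          # e_type: ET_EXEC
        0x3E,       # e_machine: x86-64
        1,          # e_version: EV_CURRENT
        entry,      # e_entry
        phoff,      # e_phoff
        0,          # e_shoff: no section headers
        0,          # e_flags
        64,         # e_ehsize
        56,         # e_phentsize
        phnum,      # e_phnum
        shentsize,  # e_shentsize
        0,          # e_shnum
        0,          # e_shstrndx
    )


header = build_elf64_header(entry=0x1E7, phoff=64, phnum=3)
print(" ".join(f"{byte:02x}" for byte in header))
```

The sizes written as 40 and 38 in the source are hex, which is why they show up as 64 and 56 bytes in readelf.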
And a hex dump,

00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 e7 01 00 00 00 00 00 00 |..>.............|
00000020 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |@...............|
00000030 00 00 00 00 40 00 38 00 03 00 40 00 00 00 00 00 |....@.8...@.....|
00000040 03 00 00 00 07 00 00 00 bc 01 00 00 00 00 00 00 |................|
00000050 bc 01 00 00 00 00 00 00 bc 01 00 00 00 00 00 00 |................|
00000060 1a 00 00 00 00 00 00 00 1a 00 00 00 00 00 00 00 |................|
00000070 01 00 00 00 00 00 00 00 01 00 00 00 07 00 00 00 |................|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 00 00 00 00 48 1e 00 00 00 00 00 00 |........H.......|
000000a0 00 00 00 02 00 00 00 00 00 00 00 02 00 00 00 00 |................|
000000b0 02 00 00 00 07 00 00 00 e8 00 00 00 00 00 00 00 |................|
000000c0 e8 00 00 00 00 00 00 00 e8 00 00 00 00 00 00 00 |................|
000000d0 80 00 00 00 00 00 00 00 80 00 00 00 00 00 00 00 |................|
000000e0 08 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
000000f0 01 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 |................|
00000100 a8 01 00 00 00 00 00 00 05 00 00 00 00 00 00 00 |................|
00000110 d5 01 00 00 00 00 00 00 06 00 00 00 00 00 00 00 |................|
00000120 58 01 00 00 00 00 00 00 07 00 00 00 00 00 00 00 |X...............|
00000130 88 01 00 00 00 00 00 00 08 00 00 00 00 00 00 00 |................|
00000140 18 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 |................|
00000150 18 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000170 0c 00 00 00 21 00 00 00 a0 01 00 00 00 00 00 00 |....!...........|
00000180 00 00 00 00 00 00 00 00 a0 01 00 00 00 00 00 00 |................|
00000190 07 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 |................|
000001a0 00 00 00 00 00 00 00 00 01 00 00 00 02 00 00 00 |................|
000001b0 01 00 00 00 00 00 00 00 00 00 00 00 2f 6c 69 62 |............/lib|
000001c0 2f 6c 64 2d 6c 69 6e 75 78 2d 78 38 36 2d 36 34 |/ld-linux-x86-64|
000001d0 2e 73 6f 2e 32 00 6c 69 62 64 6c 2e 73 6f 2e 32 |.so.2.libdl.so.2|
000001e0 00 64 6c 73 79 6d 00 33 ff be b8 14 00 00 ff 15 |.dlsym.3........|

No Dynamic Linking Example
And source for an ELF which only uses syscalls and no dynamic linking for comparison,

\===============================================================
64-BIT ELF FOR KERNEL ONLY INTERFACE
----------------------------------------------------------------
http://blog.markloiseau.com/2012/05/tiny-64-bit-elf-executables/
===============================================================\
{ Elf
\e_ident\ 00010102464c457f long
\reserved\ 0 long
\e_type\ 2 half
\e_machine\ 3e half
\e_version\ 1 word
\e_entry\ .ElfEntry long
\e_phoff\ .ElfPh long
\e_shoff\ 0 long
\e_flags\ 0 word
\e_ehsize\ 40 half
\e_phentsize\ 38 half
\e_phnum\ 1 half
\e_shentsize\ 0 half
\e_shnum\ 0 half
\e_shstrndx\ 0 half
\______________________________________________________________\
asm :ElfPh
\p_type\ 1 word
\p_flags\ 7 word
\p_offset\ 0 long
\p_vaddr\ 0 long
\p_paddr\ 0 long
\p_filesz\ .ElfEnd long
\p_memsz\ 4000000 long
\p_align\ 2000000 long }
\______________________________________________________________\
{ Elf( 0 asm! 00! Elf }
{ Elf) asm :ElfEnd }
{ ElfEntry! :ElfEntry }

The "readelf" results,

ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x5f6
Start of program headers: 64 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 1
Size of section headers: 0 (bytes)
Number of section headers: 0
Section header string table index: 0

There are no sections in this file.

There are no sections to group in this file.

Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x00000000000006aa 0x0000000004000000 RWE 2000000

There is no dynamic section in this file.

There are no relocations in this file.

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

No version information found in this file.

And the hex dump (I'm too lazy so this is going to just include the full binary, which happens to be the compiler I use for the language both these ELF headers are written in),

00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 f6 05 00 00 00 00 00 00 |..>.............|
00000020 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |@...............|
00000030 00 00 00 00 40 00 38 00 01 00 00 00 00 00 00 00 |....@.8.........|
00000040 01 00 00 00 07 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 aa 06 00 00 00 00 00 00 00 00 00 04 00 00 00 00 |................|
00000070 00 00 00 02 00 00 00 00 69 6e 2e 61 00 6f 75 74 |........in.a.out|
00000080 2e 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |.a..............|
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000000b0 00 00 00 00 01 02 03 04 05 06 07 08 09 00 00 00 |................|
000000c0 00 00 00 00 0a 0b 0c 0d 0e 0f 00 00 00 00 00 00 |................|
000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000e0 00 00 00 00 0a 0b 0c 0d 0e 0f 00 00 00 00 00 00 |................|
000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000100 00 00 00 65 05 65 05 65 05 65 05 65 05 65 05 65 |...e.e.e.e.e.e.e|
00000110 05 65 05 65 05 65 05 65 05 65 05 65 05 65 05 65 |.e.e.e.e.e.e.e.e|
*
00000140 05 65 05 65 05 d7 03 4b 04 d7 03 d7 03 ab 03 d7 |.e.e...K........|
00000150 03 06 04 d7 03 d7 03 d7 03 d7 03 05 05 d7 03 71 |...............q|
00000160 03 d7 03 a0 04 a0 04 a0 04 a0 04 a0 04 a0 04 a0 |................|
00000170 04 a0 04 a0 04 a0 04 2b 03 d7 03 d7 03 d7 03 d7 |.......+........|
00000180 03 d7 03 d7 03 d7 03 d7 03 d7 03 d7 03 d7 03 d7 |................|
*
000001b0 03 d7 03 d7 03 d7 03 d7 03 d7 03 40 05 d7 03 d7 |...........@....|
000001c0 03 d7 03 9b 02 d7 03 d7 03 d7 03 d7 03 d7 03 d7 |................|
000001d0 03 d7 03 d7 03 d7 03 d7 03 d7 03 d7 03 d7 03 d7 |................|
*
000001f0 03 d7 03 d7 03 d7 03 d7 03 05 03 d7 03 d4 02 d7 |................|
00000200 03 82 05 48 89 91 08 00 50 00 48 89 99 10 00 50 |...H....P.H....P|
00000210 00 48 89 a9 00 00 50 00 8b c5 83 c5 08 c3 b9 c5 |.H....P.........|
00000220 9d 1c 81 33 d2 33 db 40 0f b6 06 83 c6 01 48 69 |...3.3.@......Hi|
00000230 c9 93 01 00 01 48 33 c8 48 c1 e2 03 48 03 d0 48 |.....H3.H...H..H|
00000240 83 e2 f0 48 c1 e3 04 83 e0 0f 48 03 d8 40 0f b6 |...H......H..@..|
00000250 06 83 c6 01 83 f8 20 0f 87 d1 ff ff ff c1 e1 06 |...... .........|
00000260 83 c1 e0 83 c1 20 81 e1 e0 ff 7f 00 4c 3b b9 00 |..... ......L;..|
00000270 00 50 00 0f 84 8a ff ff ff 48 3b 91 08 00 50 00 |.P.......H;...P.|
00000280 0f 85 dd ff ff ff 48 3b 99 10 00 50 00 0f 85 d0 |......H;...P....|
00000290 ff ff ff 48 8b 81 00 00 50 00 c3 e8 7e ff ff ff |...H....P...~...|
000002a0 8b 90 00 00 20 00 8b 00 48 8b 08 48 89 0f 48 8b |.... ...H..H..H.|
000002b0 48 08 48 89 4f 08 48 8b 48 10 48 89 4f 10 03 fa |H.H.O.H.H.H.O...|
000002c0 40 0f b6 06 83 c6 01 8b c8 03 c9 0f b7 89 03 01 |@...............|
000002d0 00 00 ff e1 41 8b 40 fc 41 83 e8 04 8b cf 2b 08 |....A.@.A.....+.|
000002e0 89 88 00 00 20 00 b9 c3 00 00 00 40 88 0f 83 c7 |.... ......@....|
000002f0 01 40 0f b6 06 83 c6 01 8b c8 03 c9 0f b7 89 03 |.@..............|
00000300 01 00 00 ff e1 83 c6 01 e8 11 ff ff ff 41 89 00 |.............A..|
00000310 41 83 c0 04 48 89 38 40 0f b6 06 83 c6 01 8b c8 |A...H.8@........|
00000320 03 c9 0f b7 89 03 01 00 00 ff e1 e8 ee fe ff ff |................|
00000330 2b c7 83 e8 07 89 47 03 b9 48 89 00 00 66 89 0f |+.....G..H...f..|
00000340 b9 05 00 00 00 40 88 4f 02 b9 48 8b 03 83 89 4f |.....@.O..H....O|
00000350 07 b9 eb 08 00 00 66 89 4f 0b 83 c7 0d 40 0f b6 |......f.O....@..|
00000360 06 83 c6 01 8b c8 03 c9 0f b7 89 03 01 00 00 ff |................|
00000370 e1 e8 a8 fe ff ff 2b c7 83 e8 0e 89 47 0a 48 b9 |......+.....G.H.|
00000380 48 89 43 08 83 c3 08 48 48 89 0f b9 8b 05 00 00 |H.C....HH.......|
00000390 66 89 4f 08 83 c7 0e 40 0f b6 06 83 c6 01 8b c8 |f.O....@........|
000003a0 03 c9 0f b7 89 03 01 00 00 ff e1 e8 6e fe ff ff |............n...|
000003b0 89 47 08 48 b9 48 89 43 08 83 c3 08 b8 48 89 0f |.G.H.H.C.....H..|
000003c0 83 c7 0c 40 0f b6 06 83 c6 01 8b c8 03 c9 0f b7 |...@............|
000003d0 89 03 01 00 00 ff e1 83 c6 ff e8 3f fe ff ff 2b |...........?...+|
000003e0 c7 83 e8 06 89 47 02 b9 ff 15 00 00 66 89 0f 83 |.....G......f...|
000003f0 c7 06 40 0f b6 06 83 c6 01 8b c8 03 c9 0f b7 89 |..@.............|
00000400 03 01 00 00 ff e1 e8 13 fe ff ff 8b 00 2b c7 83 |.............+..|
00000410 e8 0f 89 47 0b 48 b9 48 85 c0 48 8b 03 8d 5b 48 |...G.H.H..H...[H|
00000420 89 0f b9 f8 0f 00 00 66 89 4f 08 b9 85 00 00 00 |.......f.O......|
00000430 40 88 4f 0a 83 c7 0f 40 0f b6 06 83 c6 01 8b c8 |@.O....@........|
00000440 03 c9 0f b7 89 03 01 00 00 ff e1 8b ce 40 0f b6 |.............@..|
00000450 06 83 c6 01 83 f8 22 0f 85 f0 ff ff ff 8b c6 2b |......"........+|
00000460 c1 83 c0 ff 89 4f 05 89 47 11 b9 48 89 43 08 89 |.....O..G..H.C..|
00000470 0f b9 b8 00 00 00 40 88 4f 04 48 b9 48 89 43 10 |......@.O.H.H.C.|
00000480 83 c3 10 b8 48 89 4f 09 83 c7 15 40 0f b6 46 01 |....H.O....@..F.|
00000490 83 c6 02 8b c8 03 c9 0f b7 89 03 01 00 00 ff e1 |................|
000004a0 33 c9 40 0f b6 80 83 00 00 00 48 c1 e1 04 48 03 |3.@.......H...H.|
000004b0 c8 40 0f b6 06 83 c6 01 83 f8 30 0f 83 e1 ff ff |.@........0.....|
000004c0 ff 48 8b d1 48 f7 da 8d 5e 01 83 f8 2d 48 0f 44 |.H..H...^...-H.D|
000004d0 ca 0f 44 f3 48 89 4f 09 48 b9 48 89 43 08 83 c3 |..D.H.O.H.H.C...|
000004e0 08 48 48 89 0f b9 b8 00 00 00 40 88 4f 08 83 c7 |.HH.......@.O...|
000004f0 11 40 0f b6 06 83 c6 01 8b c8 03 c9 0f b7 89 03 |.@..............|
00000500 01 00 00 ff e1 40 0f b6 06 40 0f b6 88 83 00 00 |.....@...@......|
00000510 00 40 0f b6 46 01 40 0f b6 80 83 00 00 00 48 c1 |.@..F.@.......H.|
00000520 e1 04 48 03 c8 40 88 0f 83 c7 01 40 0f b6 46 03 |..H..@.....@..F.|
00000530 83 c6 04 8b c8 03 c9 0f b7 89 03 01 00 00 ff e1 |................|
00000540 40 0f b6 06 83 c6 01 83 f8 5c 0f 85 f0 ff ff ff |@........\......|
00000550 40 0f b6 46 01 83 c6 02 8b c8 03 c9 0f b7 89 03 |@..F............|
00000560 01 00 00 ff e1 40 0f b6 06 83 c6 01 83 f8 20 0f |.....@........ .|
00000570 86 f0 ff ff ff 8b c8 03 c9 0f b7 89 03 01 00 00 |................|
00000580 ff e1 48 b9 d9 bf 6f 7a 22 42 bc 83 48 ba 40 96 |..H...oz"B..H.@.|
00000590 00 00 00 00 00 00 48 bb 71 e6 00 00 00 00 00 00 |......H.q.......|
000005a0 e8 b8 fc ff ff bb 00 00 20 01 ba 00 00 30 01 bf |........ ....0..|
000005b0 00 00 30 00 8b 00 ff d0 8d 9f 00 00 d0 ff bf 7d |..0............}|
000005c0 00 00 00 be 41 02 00 00 ba c0 01 00 00 b8 02 00 |....A...........|
000005d0 00 00 0f 05 8b e8 8b fd be 00 00 30 00 8b d3 b8 |...........0....|
000005e0 01 00 00 00 0f 05 8b fd b8 03 00 00 00 0f 05 b8 |................|
000005f0 e7 00 00 00 0f 05 48 83 e4 f0 bf 78 00 00 00 be |......H....x....|
00000600 00 00 00 00 b8 02 00 00 00 0f 05 8b d8 48 b8 ff |.............H..|
00000610 ff ff ff ff ff ff ff 48 89 05 e2 f9 01 00 bf 00 |.......H........|
00000620 00 00 00 be 00 00 02 00 ba 08 00 02 00 41 ba 08 |.............A..|
00000630 00 00 00 b8 0d 00 00 00 0f 05 8b fb be 10 00 02 |................|
00000640 00 b8 05 00 00 00 0f 05 8b fb be 00 00 10 00 48 |...............H|
00000650 8b 15 ea f9 01 00 b8 00 00 00 00 0f 05 8b fb b8 |................|
00000660 03 00 00 00 0f 05 48 b8 00 00 00 00 7f 7f 7f 7f |......H.........|
00000670 48 8b 1d c9 f9 01 00 48 89 83 00 00 10 00 bd 00 |H......H........|
00000680 00 d0 00 be 00 00 10 00 bf 00 00 40 01 41 b8 00 |...........@.A..|
00000690 00 10 01 45 33 ff 40 0f b6 06 83 c6 01 8b c8 03 |...E3.@.........|
000006a0 c9 0f b7 89 03 01 00 00 ff e1 |..........|

## 20160403

### Compiling on Linux Without Libc

For reference, just reposting some of the inline asm bits from one of my engines to jump start compiling without libc...

Shell script to compile forces C (-x c) since I often use the cpp extension which defaults to C++, and forces no libraries except libdl (-nostdlib -ldl).

gcc -x c e.cpp -o e.bin -std=gnu99 -nostdlib -ldl ...

Note output from "ldd" will show libc even with -nostdlib because libdl depends on libc, even when the binary only ever uses say 2 external symbols from libdl {dlopen() and dlsym()}. The linux-vdso is mapped for syscall bypass kernel fast path.
Some "ldd" output,

linux-vdso.so.1 (0x00007fff763f9000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f3cb75ac000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f3cb7209000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3cb77b0000)

Rolling Your Own Main
Running without libc means jumping in from _start instead, and then doing a little assembly to set up the correct environment (note the manual stack alignment).
// Pulled from elsewhere in the engine...
#define ER_ __restrict
#define ES_ static
typedef unsigned char EU1;
typedef signed int ES4;
typedef EU1 *ER_ EU1R;

// Enter without libc,
ES_ void main(ES4 argc, EU1R *ER_ argv) { ERomMain(argc, argv); EDie(); }
__asm__(
".text\n"
".global _start\n"
"_start:\n"
"xor %rbp,%rbp\n"
"pop %rdi\n"
"mov %rsp,%rsi\n"
"andq $-16,%rsp\n"
"call main\n");


Syscalls
Sorry in advance, this may wrap. Showing only the 64-bit x86-64 interface below. Syscalls have 0 to 6 arguments, so you need just 7 inline asm functions to access any syscall. The return is often technically signed (a negative value means error), but I use unsigned everywhere out of habit, with a typecast when I need the signed result. I grab syscall numbers from the Linux source, and make my own headers for what I need (which is not much).
// Copied from elsewhere in the engine...
#define EI_ static inline __attribute__((always_inline))
typedef unsigned long EU8;

// Linux syscall access.
EI_ EU8 ELnx0(EU8 num) { EU8 ret;
asm volatile("syscall":"=a"(ret):"a"(num):
"cc","memory","%rcx","%rdx","%rdi","%rsi","%r8","%r9","%r10","%r11");
return ret; }
EI_ EU8 ELnx1(EU8 num, EU8 ar1) { EU8 ret;
asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1):
"cc","memory","%rcx","%rdx","%rsi","%r8","%r9","%r10","%r11");
return ret; }
EI_ EU8 ELnx2(EU8 num, EU8 ar1, EU8 ar2) { EU8 ret;
asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2):
"cc","memory","%rcx","%rdx","%r8","%r9","%r10","%r11");
return ret; }
EI_ EU8 ELnx3(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3) { EU8 ret;
asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3):
"cc","memory","%rcx","%r8","%r9","%r10","%r11");
return ret; }
EI_ EU8 ELnx4(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3, EU8 ar4) { EU8 ret;
register EU8 lar4 asm("r10") = ar4;
asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3),"r"(lar4):
"cc","memory","%rcx","%r8","%r9","%r11");
return ret; }
EI_ EU8 ELnx5(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3, EU8 ar4, EU8 ar5) { EU8 ret;
register EU8 lar4 asm("r10") = ar4; register EU8 lar5 asm("r8") = ar5;
asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3),"r"(lar4),"r"(lar5):
"cc","memory","%rcx","%r9","%r11");
return ret; }
EI_ EU8 ELnx6(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3, EU8 ar4, EU8 ar5, EU8 ar6) { EU8 ret;
register EU8 lar4 asm("r10") = ar4; register EU8 lar5 asm("r8") = ar5; register EU8 lar6 asm("r9") = ar6;
asm volatile("syscall":"=a"(ret):"a"(num),"D"(ar1),"S"(ar2),"d"(ar3),"r"(lar4),"r"(lar5),"r"(lar6):
"cc","memory","%rcx","%r11");
return ret; }
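
As a usage sketch for the wrappers above (not pulled from the engine): the snippet below repeats a 3-argument wrapper in the same shape as ELnx3 so it stands alone, then uses it for write(). The names DemoLnx3, DemoLnxIsErr, DemoLnxErrno, and DemoWrite are made up for illustration. The syscall number 1 for write and the convention that failures come back as -errno in the range -4095 to -1 are the standard x86-64 Linux ones.

```c
typedef unsigned long EU8;

// Hypothetical stand-in with the same shape as ELnx3 above:
// syscall number in rax, arguments in rdi, rsi, rdx.
static inline EU8 DemoLnx3(EU8 num, EU8 ar1, EU8 ar2, EU8 ar3) {
  EU8 ret;
  __asm__ volatile("syscall"
                   : "=a"(ret)
                   : "a"(num), "D"(ar1), "S"(ar2), "d"(ar3)
                   : "cc", "memory", "rcx", "r8", "r9", "r10", "r11");
  return ret;
}

// x86-64 Linux syscall number for write (from the kernel's syscall table).
enum { DEMO_SYS_WRITE = 1 };

// The kernel reports failure as -errno, so as an unsigned value an error
// lands in the top 4095 values of the 64-bit range.
static inline int DemoLnxIsErr(EU8 ret) { return ret > (EU8)-4096L; }

// Typecast to signed only where the errno value is actually wanted.
static inline long DemoLnxErrno(EU8 ret) {
  return DemoLnxIsErr(ret) ? -(long)ret : 0;
}

// write(fd, buf, len) without libc; returns bytes written or an error value.
static EU8 DemoWrite(EU8 fd, const char *buf, EU8 len) {
  return DemoLnx3(DEMO_SYS_WRITE, fd, (EU8)buf, len);
}
```

Calling DemoWrite(1, "hello\n", 6) writes to stdout, and passing an invalid descriptor gives a return value for which DemoLnxIsErr() is true.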


## 20160331

### Practical Example of Things Linux Supporters Should Fix

Based on a true story ... literally happening right now, it's March 31st and Hyper Light Drifter has just been released. I'd like to play the game, and wow it is supposed to actually have Linux support.

Step 1: I go to www.heart-machine.com and see if I can just pay the developer directly and download a Linux binary of the game. As a fellow developer I'd much rather pay a developer directly for a game that I'm interested in than purchase through a third party who takes a cut of the pie.

Fail 1: No ability to buy the game directly, must go through Steam.

Step 2: I go to store.steampowered.com and attempt to download and install Steam on my Linux box.

Fail 2: Clicking the "Download" button takes me here: repo.steampowered.com/steam/archive/precise/steam_latest.deb but with an error message from the server.

Forbidden

You don't have permission to access /steam/archive/precise/steam_latest.deb on this server.
Apache/2.2.22 (Ubuntu) Server at repo.steamstatic.com Port 80

Ok, so for whatever reason the Steam download link for Linux is/was broken. But even if it worked, the download would be useless to me, as it is only a Debian package, and I'm not running Ubuntu, so I have no way to install that package without manually installing something which can unpack DEB files, then attempting to manually install whatever my specific Linux box doesn't have which Steam might depend on.

No easy way to get the game, no easy way to install Steam. This is an example of the typical "Linux experience" today.

I'm still going to get the game, but I'm waiting for the PS4 version. I'd like to support Linux, but quite frankly that is impossible as long as "It Just Doesn't Work".

A Better Time
Another story: back when I worked for Wolfram Research ages ago, working on the UNIX part of the Mathematica frontend and also building the UNIX audio support, Mathematica ran on something on the order of 9 UNIX/BSD platforms: Linux/SunOS/AIX/Solaris/etc. Mathematica was statically linked. It just worked. Period.

Seriously, it just worked.

You got a fantastic experience regardless of what platform you were on, or what Linux distro you had on your machine.

Dependency Nightmare
Between the time when I was writing a fork() based audio engine in Mathematica and now attempting to find a way to just give the developer money and get a working Linux binary of a game I'd like to play, the industry adopted this nightmare policy of dynamic linking to a hairball of rolling release libraries.

Now nothing ever works.

Seriously, a 1TB hard drive is $50. The idea that there is a need to dynamically link to save space is utter insanity.

Problem Practices
If you are "supporting" Linux and falling into one of these cases, you are doing serious damage to any effort to make Linux a viable platform.

(1.) Distributing a {insert a specific Linux distro package} file instead of a tgz file which works on any Linux machine.

(2.) Distributing a binary which dynamically links to a tree-sized library dependency chain.

(3.) Requiring a user or developer to manually build your software.

Taking Responsibility
As a Linux developer I strive to link to just one thing, libdl.so. I only use direct syscalls except for {OpenGL/Vulkan, ALSA, Xlib} and those I dlopen() when possible. If syscall mapping changes for a Linux version, I'll just release different binaries. This is about as anti-dependency as is possible, and it best ensures things just work.

For me it was trivial to not use libc and go direct to native system calls. It saves me a tremendous amount of time and pain by NOT depending on other people's broken software libraries and instead writing things directly.

As for the "real world" there are examples of the "right way" to do "libraries". Like STB (stb_howto.txt is a great read). Single file headers with little or no dependencies which a developer includes into a project instead of linking to. Problem solved.

### Instant Soft-Reboot to Prior Machine Snapshot

An idea for practical, simple memory protection on an otherwise unprotected system. It works similarly in concept to emulator instant save and restore, but applied to a full system. Reserve 50% of memory for a full machine state snapshot. Deny access to this snapshot memory except during save and restore (i.e. the pages are just not in the page table). Hook up a hotkey combination to save or restore full machine state. On crash, simply restore from the snapshot (instant soft-reboot). Design around one full machine barrier per frame for a safe snapshot point.
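
A toy user-space sketch of the save/restore flow, just to make the idea concrete. Everything here is hypothetical (DemoMachine, DemoSave, DemoRestore, the 4096 byte state size). A real version would snapshot all of machine memory and keep the snapshot half out of the page table, which this sketch does not attempt.

```c
#include <string.h>

enum { DEMO_STATE_BYTES = 4096 };  // hypothetical size of the "machine state"

typedef struct {
  unsigned char live[DEMO_STATE_BYTES];      // half the running system may touch
  unsigned char snapshot[DEMO_STATE_BYTES];  // reserved half, touched only by save/restore
  int hasSnapshot;                           // nonzero once a save has happened
} DemoMachine;

// Called at the per-frame barrier, when the state is known to be consistent.
static void DemoSave(DemoMachine *m) {
  memcpy(m->snapshot, m->live, DEMO_STATE_BYTES);
  m->hasSnapshot = 1;
}

// Called on the restore hotkey or after a crash. Returns 0 if there is
// nothing to restore yet, 1 after copying the snapshot back.
static int DemoRestore(DemoMachine *m) {
  if (!m->hasSnapshot) return 0;
  memcpy(m->live, m->snapshot, DEMO_STATE_BYTES);
  return 1;
}
```

The 50% cost is visible directly in the struct: the snapshot half is as large as the live half.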

## 20160330

Been very quiet about my background language and OS prototypes, but they slowly continue as I explore various options. One common theme across everything I've tried is a dependence on simultaneous ability to execute, read, and write to a page of memory. Each language defers compilation until run-time. The binary is a self-compiling position-independent executable which leverages a global dictionary to specialize position-dependent and run-time-dependent code for a given situation at any point during execution. This means quite literally that the line between code and data has been removed. Data generated at run-time contains baked code specific to the data. And so on...
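
A minimal C sketch of what "execute, read, and write to a page" looks like from user space on x86-64 Linux, assuming the system allows a page mapped with all three permissions at once (some hardened configurations refuse this). It is not how the prototypes above work internally; DemoRunGenerated is a made-up name, and the six bytes are the x86-64 encoding of "mov eax, imm32" followed by "ret", with a run-time value baked into the instruction.

```c
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS 0x20  // x86-64 Linux value, in case strict ISO C mode hides it
#endif

typedef int (*DemoFn)(void);

// Generates a tiny function at run-time that returns 'value', calls it,
// and returns the result. Returns -1 if the page could not be mapped.
static int DemoRunGenerated(int value) {
  // mov eax, imm32 ; ret  (x86-64 only)
  unsigned char code[6] = { 0xb8, 0, 0, 0, 0, 0xc3 };
  memcpy(code + 1, &value, 4);  // bake the run-time value into the code

  // One page that is readable, writable, and executable at the same time.
  void *page = mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (page == MAP_FAILED) return -1;

  memcpy(page, code, sizeof code);  // write code as data
  DemoFn fn = (DemoFn)page;         // then treat the same bytes as code
  int result = fn();

  munmap(page, 4096);
  return result;
}
```

The value passed in ends up inside the instruction stream rather than in a variable, which is the small-scale version of data carrying baked code.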

## 20160304

### Tim Sweeney on UWP - And My Thoughts on the Topic

The Guardian Op-Ed by Tim Sweeney : Microsoft wants to monopolise games development on PC. We must fight it

If the Industry Wants an Alternative It's Free to Make That Alternative!
There has never been a better time to do so either.
Great quality Linux Vulkan drivers are soon to be here from all three of the major desktop GPU manufacturers. An engine designed for Vulkan is going to run fantastic on Linux. Linux, with some major distro re-shaping (below), can be made into an awesome alternative desktop OS, with traditional WinNT era ideals. As an open system, the Linux platform has the opportunity to be shaped into an OS which has a console-like Quality of Service guarantee for games and VR. For example, the base kernel already has support for real-time priority and pinned CPU memory. The opportunity for gaining benefits on Linux for VR is tremendous: think about OS guaranteed chunks of time on CPU and GPU timed exactly to HMD sync rate, in combination with guaranteed resident memory with absolutely no hitching or paging.
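
For reference, the two kernel features mentioned above are reachable today through sched_setscheduler() and mlockall(). A minimal sketch, assuming the process has the privileges or rlimits needed (otherwise both calls simply fail); DemoRequestGameQos is a made-up name, and this is nowhere near a full Quality of Service guarantee.

```c
#include <sched.h>
#include <sys/mman.h>

// Asks the kernel for real-time scheduling and resident memory.
// Returns a bitmask: bit 0 set if SCHED_FIFO was granted,
// bit 1 set if current and future pages were locked in RAM.
static int DemoRequestGameQos(void) {
  int granted = 0;

  struct sched_param sp;
  sp.sched_priority = sched_get_priority_min(SCHED_FIFO);  // lowest real-time priority
  if (sched_setscheduler(0, SCHED_FIFO, &sp) == 0) granted |= 1;

  // Pin everything mapped now and everything mapped later: no paging.
  if (mlockall(MCL_CURRENT | MCL_FUTURE) == 0) granted |= 2;

  return granted;
}
```

On a stock distro an unprivileged process typically gets neither, which is the kind of thing a reformed distro could grant to a game by default.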

Valve still has an amazing opportunity if they would take a reformed SteamOS to the desktop instead of just the living room. Valve has the PC OEM link required with Steam to still pull this off on a pre-installed system.

The effort required to build a reformed Linux starts with fully scrapping the traditional Linux distro, and starting from scratch. Reformed distro needs to break conventions core to the politics of the Linux scene: reformed distro runs binary self contained packages dependent on only the kernel, hardware libraries like Vulkan, and Steam libraries. App install needs to be as easy as dragging a folder into an applications folder on the drive. App delete needs to be as easy as deleting the app's folder. Linux dependency nightmare is removed from the equation.

Treat this like an embedded Linux device. Start with only the Linux kernel, a light-weight replacement for glibc (like musl), and light-weight shell tools (like busybox), then add back only what is required to get the core consumer and developer needs supplied. Things like Wine for some Windows binary compatibility, web browser and associated plugins, movie viewer, mixer control, audio player, browser, Steam, working with USB sticks, decompressing/compressing files, etc, all work out of the box. Scrap the "you need to be a systems programmer" standard Linux configuration system. Something like one file, with an easy GUI control panel for standard users, would be a better option. Run from RAM after boot with the exception of large apps.

Everything on the reformed Linux distro either works out of the box, or isn't included. User experience is paramount. The go/no-go gauge for this system is whether a non-computer person can use it without instruction and without any kind of frustration.

Seriously it is a Less Than a Year Effort
I'd offer some of my at-home personal-time to contribute to an organized public effort to build this reformed distro. Years ago I developed and maintained a minimized personal-use from-scratch Linux distro which fit on a 100 MB ZIP disk. All this effort needs is a group of people who all agree enough to make forward progress, who are willing to see the thing to completion. Employers who might benefit from this, could start by sponsoring some official at-work time to work on the effort. Etc.

An alternative doesn't exist because of lack of coordination and action from people who care enough to do something about it. Single individuals don't stand a chance on their own. I've re-pitched this idea for years with no buy-in, would love to hear from others who might be interested in making it happen.

For this to work, GL or Vulkan "ports" of DX11-based Windows games aren't going to cut it. Game devs who want an alternative need to take responsibility for solving the chicken-and-egg problem: providing a sustained incentive for a consumer to buy into the platform alternative you are asking for, even though it starts without enough market share to justify the effort in the short term. Consoles get this with a critical mass of exclusive 1st party games. A new PC platform can only get this with a sustained migration, until a critical mass point is reached such that it is possible to be the lead platform.

First step is giving consumers a choice without compromise. Release with performance and quality parity on Linux and Windows. The start is taking Vulkan seriously, retooling for the new API to leverage what it is capable of. Using Vulkan to get simultaneous Win7/8/10 support and Linux support from one effort. This is a win-win situation, even if the Linux effort never takes, the investment in Vulkan on Win7/8 will enable investment in better tech which isn't constrained by having a DX11 fallback rendering path. Major game efforts on Vulkan will force IHVs to dedicate effort to tuning and bug fixing drivers. As developers, you choose what you want to be good by launching a game using the API. Real titles are the best form of QA testing, as they have real-world coverage :) Have Linux working during development: a WIN32 and Linux Kernel interface portability layer is trivial to make.

Closing
The industry continuing on its current trajectory won't bring back the conventions of the past. The opportunity to possibly change that exists, if people are willing to band together to build it.
