OS Project : 7 - PS/2 and Misc

Just started back up again after a long break. Finished up the PS/2 driver. Simplified keyboard interface to one 64-bit bit array in memory, and a simple polling function.

   __0 __1 __2 __3 __4 __5 __6 __7 __8 __9 __A __B __C __D __E __F
0| 0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
1| G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V
2| W   X   Y   Z   `   -   =   [   ]   \   ;   '   ,   .   /   SPC

Key down sets a bit. Key release clears bits for {SHF, CTL, ALT} only. It is up to the application to clear bits otherwise. Ended up being a good design choice as keys like '\' immediately generate release commands even when held down on my keyboard. Key bitmap organized to make editor usage (like key-to-hex conversions) simple. Keys limited to what I expect to use, and supporting everything on typical PC arcade controllers.

Humans and modern computers share a common thread: neither is any good at multitasking. The context of multitasking in this post is not async IO, but rather running multiple applications at the same time on the same CPU core. Suspect that the needs of multitasking change quite a bit at the point where nearly everything on the machine is perceptually instantaneous: as it would be once the 30 years of 40% yearly compounding complexity dept is removed. SSDs have throughput similar to last-level cache on early P4 era PCs. Logical end point: CPU cores owned by the applications themselves, no single-core multitasking, no preemption. App using power control to drop CPU core to lowest power state required to sustain a hard real-time user interface.



Kickstarter Link: Voxelnauts

"Voxelnauts: one universe, infinite worlds, with possibilities limited only by your imagination. Your adventure awaits - come build it"

As a young kid I grew up building things out of legos. Had one box which collected all the legos from a few sets. Would set out to build things and imagine what they could be in a virtual world. Voxelnauts looks to be the modern realization of that, but done in a way which was impossible 30+ years ago: ability to feel the presence of what you build in a massive real-life scale, to share that with others, and interact in a persistent world...


OS Project : 6 - Hashing

Been a while since I had any time to work on this project. Renamed as I've done some major design changes...

Dropping source-less programming part. Source-less worked great for x86 programming, however there are a few things I didn't like about it. Little issues like having 2 words for each common forth construct like "CALL WORD" instead of just "WORD". Larger issues being the requirement to have another editor to generate complex data or non-x86 machine code.

From the Ashes
Pulled the old work into a new bootloader. This time rapid prototyping using just NASM. Eventually I'll have to rewrite the bootloader in the source/language I end up creating. Switched back to 64-bit long mode (see why below). Running with 1GB pages since long mode forces page tables.

Re-Thinking The Forth Dictionary
Time off from this project brought forth some new ideas addressing the original motivation for going source-less: (a.) don't like linear searches at interpreter-time through large dictionaries (simple forth implementation), (b.) don't like how hashing (complex forth implementation) takes a bunch of ALU-time, (c.) really don't like how dynamic hashing can leave 75% of the cache unused, (d.) and still want to have full 32-bit values in source without multiple source tokens.

Solving (d.) is easy, just switch to 64-bit source tokens. Source is now 2x the size, but 32-bit immediates are trivial, and words can now use 10 characters instead of 5. Source is a prefetchable linear read, so the "more memory for more simple" trade is worth it to me.

Now Can Hashing be Made Better?
Solving (b.): how about factoring everything in the hashing algorithm except the "AND" operation into the string encoding itself? The tokens encode strings as 60-bit integers. Just need to switch from a generic 6-bits per character, to something where the characters are encoded in a way that they effect all bits.

One thing I tried is to use a reversible cellular automata (CA) like Wolfram Rule 30R. So strings in tokens are kept in memory in the form after being processed through N steps of applying the CA. Editor to display the string simply has to reverse the CA process.

To test the theory I first grabbed an english dictionary, then hashed to {256, 512, 1K, 2K, ... 512K, 1 M} entry tables, and tracked the {min,max,average} table bin counts. Comparing the reversible CA to just the 64-bit finalizer in MurmurHash3 (not reversible but provides a good baseline): both have similar results.

Also tested against a dictionary with all possible {1,2,3} character strings (including symbols and numbers). Results are again similar.

Second I tried a recursive interval encoding: each character is encoded in the interval of the parent character. First character starts with the full 0 to 2^60-1 interval. Character 63 is a special terminal character (signals no more characters are encoded). Character 63 gets a smaller interval, allowing the other characters to get an extended interval designed to spread out the value when ANDed to a power of two. This works similar in theory to arithmetic encoding, except this has constant probabilities. Multiplers used to encode each character (starting with the 1st character and going to the last),

64*64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63*63,
64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63,
64*64*64*64*64*64*64 +63*63*63*63*63*63,
64*64*64*64*64*64 +63*63*63*63*63,
64*64*64*64*64 +63*63*63*63,
64*64*64*64 +63*63*63,
64*64*64 +63*63,
64*64 +63,
64 +1,

This evenly distributes the effect of a character across all bits. The synthetic all {1,2,3} length string case performs better with this encoding. The english dictionary does not see much of an improvement.

A forth option would be to just use arithmetic encoding, or predictive arithmetic encoding on the string. I suspect this would have similar results on the english dictionary as the above fixed interval encoder had on the all possible strings case. However I didn't try it, because I don't have any great way to get symbol probabilities for source that does not exist yet, and current results are probably good enough.

Hash Cache Utilization
Solving (c.) hash cache utilization: how about a software hash cache. Source tends to have a small working set of words, even if the dictionary gets quite large. Now that hashing to different size tables is free via AND, I can split the hash table into two parts: a 256 entry table (or larger) which will effectively always be filled and in hardware cache, and a much larger table for software misses which still wastes 75% of the hardware cacheline. The software hash cache contains the following per entry,

{4-byte data, 4-byte larger table address, 8-byte string}

The larger table is inclusive. Eviction of the small hash uses the larger table address, to avoid searching (or probing) for the right entry.

Moving On
This should have the right combination of being simple enough and fast enough to avoid triggering the "something is wrong" re-design impulse again.

The high-level view I have of the OS is effectively an ultra-high-power micro-controller that boots into a forth editor like a C64 boots into basic...


Source-Less Programming : 5

Boot Loader Bring-up
Managed to get the boot loader done, which includes the following steps,

(1.) Move the stack seg:pointer (since next step overwrites it).
(2.) Use BIOS to read the other 62 512-byte sectors for the first track.
(3.) Use BIOS to switch to 80x50 text mode and load custom character glyphs.
(4.) Use BIOS to set EGA text palette to 0-15 with 0 for overscan.
(5.) Program VGA palette registers for those 16 colors.
(6.) Use BIOS to enable A20.
(7.) Turn off interrupts, and relocate the image's 63 sectors to zero.
(8.) Load zero entry IDT, minimal 3 entry GDT.
(9.) Enable protected mode and jump to the 3rd sector.

The 2nd 512-byte sector contains the 8x8 character bitmaps for the first 64 characters. The majority of the time was spent making a nice font, getting colors the way I wanted, and prototyping editor look and feel (without building it).

Didn't feel like fully hand assembling 16-bit x86 machine code for the boot loader, so I used NASM and hexdump to accellerate the process (to provide machine code I could pad out to 32-bit alignment). Also wrote a quick C based tool to bootstrap the process of building the loader. Something which would enable me to easily build out an annotated image, and show a print out in the console of what I'd be seeing in the editor. Here is a shot of a bit of the scratch C code I used to make the font,

Here is a shot in QEMU of the loader displaying the font,

And another shot from QEMU showing the pallet,

What the Current Annotated Image Looks Like
Below is a shot captured from the terminal window output of the C tool. I'm using 3 cache lines for the loader code.

Grey lines separate the 512-byte sectors. Memory address on the left in grey. Each pair of lines shows half a x86 cacheline. The blue to white shows the 5 character/word annotation strings (now using the extra 2 bits of the label for color). The red hex show the image data. Not using {GET,ABS,REL} tagged words in this part, so everything in the bootloader is just hand assembled 16-bit machine code, and this is not representative of what the rest of the system will look like. The rest of the system will have {GET opcode} followed by {HEX} or {ABS} for opcode immediates (easy to write). The 16-bit code is {HEX} mixed opcode and immediates, quite a bit different (hard to write).

Some hints on the annotations,

Everything is in base 16. AX is TOP so I don't bother with "A=9000" (which wouldn't fit anyway), instead I just write "9000" (the A= is implied). The "!" means store so "SSSP!" is storing TOP (or AX) into both SS and SP. The "B=200" means BX=200h. In this 16-bit x86 case I use 3E to pad out opcodes to 32-bit. The "X" = SI, "Y" = DI, "F" = BP.

Next Step
Ground work is done, next step is to bring up the opcode dictionary for {GET} words, then write a little IDE driver to get access to load the rest of the image, and to be able to save in the editor. After that, write the drawing code for the editor, then a mini PS/2 driver for the input, then write editor input handling. Then I have a full OS ready to start on a real machine.


Source-Less Programming : 4

Still attempting to fully vet the design before the bootstrap reboot...

DAT words in the edit image need to maintain their source address in the live image This way on reload the live data can be copied over, and persistent data gets saved to disk. DAT annotations no longer have 30 bits of free space, instead they have a live address. When live address is zero. then DAT words won't maintain live data. This way read-only data can be self-repairing (as long as the annotations don't get modified). Going to use a different color for read-only DAT words. New persistent data DAT words will reference their edit-image hex value before reload (then get updated to the live address).

REL words always get changed on reload (self repairing). No need to keep the live address. REL is only used for relative branching x86 opcodes. Don't expect to have any run-time (non-edit-time) self-modifying of relative branch addresses. Given that branching to a relative branch opcode immedate is not useful, the LABEL of a REL word is only useful as a comment.

GET words also get changed on reload (self repairing). GET is only designed for opcodes and labeled constants. GET words will often be LABELed as a named branch/call target. Been thinking about removing GET, and instead making a new self-annotating word (display searches for a LABELed DAT word with the same image value, then displays the LABEL instead of HEX). This opens up the implicit possibility of mis-annotations. Would be rare for opcodes given they are large 32-bit values. But for annotating things like data structure immediate offsets, this will be a problem (4 is the second word offset in any structure).

ABS words always get changed on reload (self repairing). ABS words are targets for self-modifying code/data, so they also need LABELs. Reset on reload presents a problem in that ABS cannot be used to setup persistent data unless that persistent data is constant or only built/changed in the editor. But this limitation makes sense in the context that ABS addresses in live data structures can get invalidated by moving stuff around in memory. The purpose of ABS is edit-time relinking.

Source-Less Programming : 3

Annotation Encoding
Refined from last post, two 32-bit annotation words per binary image word,


..............................00 - DAT : hex data

Going to switch to just 2 lines per word displayed in the editor, Only DAT annotations show hex value, other types show LABEL of referenced address in the place of the hex value. So no need for an extra note. In practice will be using some amount of binary image memory to build up a dictionary of DAT words representing all the common somewhat forth like opcodes, then GET words in the editor to build up source.

Need to redo the bootloader from floppy to harddrive usage, and switch even the bootloader's 16-bit x86 code to 32-bit aligned LABEL'ed stuff so the final editor can edit the bootloader. Prior was avoiding manually assembling the 16-bit x86 code in the boot loader, but might as well ditch NASM and use something else to bootstrap everything.


Source-Less Programming : 2

Continuing with what will either be an insanely great or amazingly stupid project...

Making slow progress with bits of free-time after work, far enough thinking through the full editor design to continue building. Decided to ditch 64-bit long mode for 32-bit protected mode. Not planning on using the CPU for much other than driving more parallel friendly hardware... so this is mostly a question of limiting complexity. Don't need 16 registers and the REX prefix is too ugly for me to waste time on any more. The 32-bit mode uses much more friendly mov reg,[imm32] absolute addressing, also with ability to use [EBP+imm32] without an SIB byte (another thing I mostly avoid). Unfortunately still need relative addresses for branching. 32-bit protected mode thankfully doesn't require page tables unlike 64-bit long mode. Can still pad out instructions to 32-bits via reduntant segment selectors.

Source-Less Analog to Compile-Time Math?
Compile-time math is mostly for the purpose of self-documenting code: "const uint32_t willForgetHowICameUpWithThisNumber = startHere + 16 * sizeof(lemons);". The source-less analog is to write out the instructions to compute the value, execute that code at edit time, then have anotations for 32-bit data words which automatically pull from the result when building 32-bit words for opcode immediates for the new binary image.

Reduced Register Usage Via Self Modifying Code
Sure, kills the trace cache in two ways, what do I care. Sometimes the easist way to do something complex is to just modify the opcode immediates before calling into the function...

What Will Annotations Look Like?
The plan so far is for the editor to display a grid of 8x8 32-bit words. Each word is colored according to a tag annotation {data, absolute address, relative address, pull value}. Each word has two extra associated annotations {LABEL, NOTE}. Both are 5 6-bit character strings. Words in grid get drawn showing {LABEL, HEX VALUE, NOTE} as follows,


The LABEL provides a name for an address in memory (data or branch address). Words tagged with absolute or relative addresses or pull value show in the NOTE field the LABEL of the memory address they reference. Words tagged with data use NOTE to describe the opcode, or the immediate value. Editor when inserting a NOTE can grab the data value from other words with the same NOTE (so only need to manually assemble an opcode once). Edit-time insert new words, delete words, and move blocks of words, all just relink the entire edit copy of the binary image. ESC key updates a version number in the edit copy, which the excuting copy sees triggering it to replace itself with the edit copy.

Boot Strapping
I'm bootstrapping the editor in NASM in a way that I'll be able to see and edit later at run-time. This is a time consuming process to get started because instead of using NASM to assemble code, I need to manually write the machine code to get the 32-bit padded opcodes. Once enough of the editor is ready, I need a very tiny IDE/PATA driver to be able to store to the disk image. Then I can finish the rest of the editor in the editor. Then I'll also be self hosted outside the emulator and running directly on an old PC with a non-USB keyboard, but with a proper PCIe slot...

Look No Triangles : Scatter vs Gather

There are a bunch of people working-on and succeeding in non-triangle rendering. With GPU perf still climbing IMO it is possible to return to the golden age of a different kind of software rendering: the kind done in a pipeline built out of compute shaders.

In my sphere tracing of math based SDF fields I was purely ALU bound, tracing to the limit of floating point precision. The largest performance win was found by doing a many-level hierarchical trace (starting with very coarse grain empty space skipping). But the limit of all this is just a log reduction of the number of steps in the search, still requires many search steps per pixel. And when doing a memory based trace (instead of a math based trace) the search is just a very long latency chain with divergent access patterns. Tracing via searching on the GPU hits a wall. To make matters worse when tracing, the ALU units are loaded up with work involved in tracing, instead of something useful.

The alternative to this is to switch to a mostly scatter based design. A large amount of the tree structure traversed each frame in a gather based approach is similar across frames. Why not just have the tree stored mostly expanded in memory based on the needs of the view. Then expand or collapse the tree based on the new visibility needs of the next frame. Rendering is then a mostly scatter process which reads leaves in the tree once. Reads of memory can now be coherent, and ALU can be used for things more interesting than search. Scatter will be somewhat divergent, but that cost can be managed by loading up enough useful ALU work in parallel. There are a lot of ways to skin this. Nodes of the tree can be bricks. Bricks can be converted into little view based depth sprites, then binned into tiles and composited. Seems as if bricks converted into triangle meshes and rasterized is the popular path now, but still using the CPU to feed everything. This could get much more interesting when the GPU is generating the cached geometry bricks: artistically controlled procedual volume generation...


From Scratch Bug 2 : Source-Less Programming

This is a disruptive idea which comes back periodically: source-less programming. Is it possible to efficiently program at a level even lower than an assembler?

The general idea is that the editor is now something similar to an enhanced hex editor which edits a binary image directly. Lowest memory is split into three parts {{running image, annotations for edit image}, edit image}. The {running image, annotations for edit image} is the boot image. The {edit image} is working space which gets snapshot replacement for {running image} on a safe transition zone. The "annotation" section is what enables human understanding of the binary image.

One way to keep the system very simple is to always work in 32-bit words. A 32-bit word in memory is one of four things {data, opcode, absolute address, rip relative address}. Data is easily handled by the hex editor part. The annotation could provide a name for the data or a comment. Opcodes in x86 can be complex but it is easy to simplify. Start with something similar to forth zero-operand and one-operand operations (calls, etc). Make all operations padded to 32-bit alignment (x86-64 can use the 2e ignored prefix to pad). A call for instance becomes {32-bit opcode for call, 32-bit rip relative branch address}. Or a global load becomes {32-bit opcode for load, 32-bit rip relative branch address}. Annotation memory can provide a name for the opcode. Annotation can provide a tag for each word in memory which marks if the memory is a relative or absolute address (word gets a different color based on tag similar to color forth). Addresses can be auto annotated by displaying the annotation associated with the destination. Editor works on the {edit image}, with insert and delete of words automatically adjusting all the words tagged as address (effectively relinking at edit time). The {edit image} can also keep a mapping of original {running image} address so that it is possible to view the live values of any data. Editor provides something to easily be able to write an annotation and have the associated opcode or address automatically get updated. For example type the opcode name and the 32-bit value is automatically updated. Very simple and yet powerful minimal tool.


Pixel Art and Slot Mask Pitch

This and the prior post are all shots from the same late model Arcade CRT, a 29" SVGA Makvision which can scan 30-40KHz and 47-90Hz. I'm cheating somewhat in taking a Metal Slug screen shot and displaying it on a non-15KHz monitor. Metal Slug was roughly 304x224 if I'm remembering right, so ultra low resolution to enable a 60Hz scan-out on CGA CRTs.

Arcade titles over the years with CRTs had a range of monitors and resolutions. Displays would provide a different look depending on the Slot Mask Pitch (effectively the number of dots for a given scanline). In this next shot I'm driving the monitor near it's lowest resolution (at roughly 312 lines), then using H-size and V-size control to enlarge the screen shot as much as possible (showing maybe 250 lines on a 600 line display, so higher slots/line count than the Metal Slug titles). The 29" Makvision is a flat screen and thus suffers from moire patterns more than a curved display. In order to get the classic scan-line look (which is caused by scanning only half the display's lines to get double the frame rate), this shot has the moire reduction turned off (which keeps the beam from having vertical line jitter, which would otherwise cause lines to blend together).

Alternatively I can drive this monitor at 800x600 and then set the moire reduction to blend scan lines. This is to simulate various Arcade games which displayed a relatively higher resolution compared the display slot mask pitch (lower slots/line count, the other extreme). The prior post's image was somewhere in-between these two examples.

Indie vs Real Slug Fest

If you see squares you are doing it wrong. The classic pixel art masters never intended for it to look as ugly as exact square pixels.

Shot from Metal Slug. The shot on the right is from a photo of an arcade CRT monitor.


From Scratch Bug

Inspired by Jaymin's JayStation2 effort and remembering a past life building custom OSs for early x86 machines, haven't been able to avoid the custom OS bug any longer. It starts easy with a harmless QEMU install, followed by a 512-byte bootloader switching to 80x50 text mode and installing a custom 48 character Forth font, then bring up of a Forth assembler/editor, then on to the pain of modern PCI and USB driver bring-up... with the eventual goal of a tiny bootable USB thumb system.

Amazingly refreshing to not have the OS telling you NO. Or the API telling you NO. Modern systems are all about the NO. Systems I grew up on were all about the YES.

Reworking my language from scratch, trying something new, replacing the Forth data stack with a new concept, but maintaining zero operand opcodes. Not sure if the idea will pan out. Dropping everything but 32-bit word support from the language, no need to interop with other software. No more 8/16/64-bit loads or stores (can still just inline machine code if required). Still running in x86-64 64-bit mode, so return stack PUSH/POP/CALL/RET is still a 64-bit stack operation, just don't need that 64-bit address space or 64-bit pointers anywhere else. Trying padding out all x86 opcodes to 32-bit alignment. This makes the 32-bit immediate 32-bit aligned. Wastes space, gives up some perf? Why would I care when most of the CPU side of the system fits in the L1 cache. Dropping paging, dropping interrupts, dropping everything, none of that stuff is needed.

Reworking an editor and binary source encoding. Switching to 32-bit tokens with 5 character max strings. 48 character character-set. Doing something horrible with font design: 1=I, 0=O, 2=Z, 5=S, etc. All caps font with no non-vertical or non-horizonal lines. Actually looks awesome. When you don't have to interop with the NO machine, long symbol names are not required. Color Forth like editors have almost no state. It is magical how they function simultaneously as an editor/assembler/console/debugger/UI/etc. Take the idea of "editor-time-words", words embedded in the source code which are evaluated when the block of source is drawn to the screen. Becomes possible to build out UI tools in the source. Can have an editor-time word read system data and draw in real-time updates in the source code itself. Editor-time words are just like any other word in the system, just color tagged to only be evaluated at draw time.

Minimal systems are a blessing, more so when you have only minimal free time to work on them.