{"id":28234924,"url":"https://github.com/rasky/small64","last_synced_at":"2025-06-12T22:30:51.136Z","repository":{"id":288971194,"uuid":"946265156","full_name":"rasky/small64","owner":"rasky","description":"Small64 - The first Nintendo 64 4K intro","archived":false,"fork":false,"pushed_at":"2025-05-11T17:38:42.000Z","size":533,"stargazers_count":19,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-06T22:59:56.430Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rasky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-10T21:48:24.000Z","updated_at":"2025-05-23T15:41:50.000Z","dependencies_parsed_at":"2025-04-20T19:37:30.893Z","dependency_job_id":"ba4e34b1-3c40-4fae-9797-862ad662d01b","html_url":"https://github.com/rasky/small64","commit_stats":null,"previous_names":["rasky/small64"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/rasky/small64","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasky%2Fsmall64","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasky%2Fsmall64/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasky%2Fsmall64/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasky%2Fsmall64/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rasky","download_url":"https://codeload.github.com/ra
sky/small64/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasky%2Fsmall64/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259541558,"owners_count":22873714,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-18T22:14:38.761Z","updated_at":"2025-06-12T22:30:51.121Z","avatar_url":"https://github.com/rasky.png","language":"C++","readme":"# Small64 - The first 4k on Nintendo 64\n\nSubmitted at Revision 2025.\n\nThis document gives technical insights into how Small64 works and how we\nachieved it.\n\n## Quick recap of N64 hardware\n\nThis is a short description of the N64 hardware, just to set the stage for\nthe challenge:\n\n * MIPS R4300 64-bit CPU running at 93.750 MHz, with FPU.\n * 4 MiB RDRAM, plus 4 MiB available through the expansion pak.\n * RSP 32-bit coprocessor, running at 62.500 MHz, based on a custom 32-bit\n   MIPS core plus custom SIMD extensions (128-bit registers with 8 16-bit\n   fixed point slots). This is basically a DSP with internal 4K IMEM + 4K DMEM\n   static memory for code and data, plus DMA access to RDRAM. This is where 3D\n   transform and lighting normally run.\n * RDP: Rasterizer doing screen-space triangle drawing with texture mapping,\n   perspective correction, Z-buffering, etc. 
Not programmable.\n * Audio: just a plain DAC playing back 16-bit stereo samples from RDRAM.\n\n## How a standard N64 ROM works\n\nNormally, an N64 ROM has the following layout:\n\n           ----------------------\n              Header (64 bytes)\n           ----------------------\n              IPL3 (4032 bytes)\n           ----------------------\n                 Actual game\n\n\n\n           ----------------------\n\nWhen powering on the console, the CPU runs some bootstrap code burnt into\nthe silicon, called IPL1/2. That code accesses the cartridge,\nloads the IPL3 into some static memory, verifies that its checksum matches\na hardcoded value (that is stored in a security chip in the cartridge itself),\nand then jumps to it.\n\nIPL3 is stored in the game cartridge, but is actually a piece of boot code\nprovided by Nintendo that forms the last stage of the secure boot of the\nconsole. IPL3 initializes the main RDRAM of the system, which also involves\na complex current calibration process, and then loads the actual game\n(it assumes a 1 MiB flat binary) and jumps to it. This is where game developers\nactually started providing their own code.\n\nThe 64-byte header of the ROM contains several pieces of metadata about the ROM,\nstarting with the title, the region, etc. Most of it is just conventional data though,\nnot really needed at runtime.\n\n## How small64 works\n\nTo actually make a 4 KiB intro, we only have one option:\n\n           ----------------------\n            Small64 (4096 bytes)\n           ----------------------\n\nSo our intro *has to be the header and the IPL3*. This means that the intro\nmust also take care of initializing RDRAM. Moreover, **the intro itself has to\nmatch the hardcoded checksum that IPL1/2 is going to calculate**, otherwise\nit will not boot.\n\nInserting some code in the header is a common technique on PC too, so we just\nadapted the same logic to the Nintendo 64. 
In our case, there is only one part\nthat we need to preserve: bytes 2, 3, and 4, which are used by IPL1/2 to configure\nthe ROM access speed. Everything else can be used. As a nice touch, we also\nstore the ASCII name at offset 0x20 as expected in ROMs, so that it is\ndisplayed correctly by flashcart menus and ROM managers.\n\nTo boot the console, we had to write our own *compact* RDRAM initialization\nroutine. This was a bit challenging if you consider that the full initialization\nprocess was only reverse engineered at the end of 2023 (!). Up until 2023,\neverybody still used an IPL3 payload ripped from commercial games to perform\nthe initialization. In 2023, Libdragon published the first unencumbered, open-source\nimplementation of IPL3, which has been in use in homebrew productions since.\n\nLibdragon's code performs initialization as documented by Rambus datasheets,\nso the process is a bit cumbersome and involves various steps including\noutput current calibration. For small64, we went for a totally bare-bones\napproach, where we basically bang hardcoded values into various registers in\nsequence. Current calibration is not performed: we only use a fixed value\nthat appears to work perfectly fine on most consoles, at least when they are\nsemi-warm. In the end, our RAM init code is just 0x2ec bytes, before compression.\n\nIt seems that because of this, the intro fails to boot on a cold console; if\nthat's the case for you, just run another ROM for a few seconds and then try\nagain. It should then work. 
We're hopefully going to fix this in a follow-up\nrelease.\n\n## Compression\n\nFor ROM compression, we needed something that could run on the MIPS CPU,\neven *before* the RAM is initialized, as we wanted to also compress the RAM init code.\nLibdragon ships various algorithms including Shrinkler, which is the \"grandfather\"\nof the famous Crinkler used by most intros on PC.\n\nFor small64, we went instead with the upkr tool (https://github.com/exoticorn/upkr)\nbecause it showed a slightly better compression ratio, especially because it\nallows for 4-byte parity for literal contexts, which gives a bit of an advantage\non MIPS, whose opcodes are 32-bit.\n\nTo improve upkr's ratio, we also wrote our own custom section ordering tool called\n\"swizzle\". This tool basically runs a simulated annealing optimizer to\ntry different permutations of code and data sections (functions and data),\nto find an order that maximizes the ratio. This technique is also used by\nCrinkler, but we didn't have time to write a full linker, so it basically\nworks with ELF .o files, and outputs a linker script (order.ld) containing\nthe correct order for the GCC linker to produce the final binary.\n\nSwizzle is able to save about 50 compressed bytes, which is quite a bit when\nyou are fighting for every byte!\n\nThe final payload of the intro is 164 KiB (or 37 KiB if you exclude the BSS\nsegment), which compresses down to 3786 bytes. Not bad! 
We include the BSS\nsegment in the compression because zeros compress very well, and so that the\ndecompression will clear that memory.\n\n## Small64 boot process\n\nLet's now see how the ROM is actually laid out:\n\n           ---------------------------------\n              Stage 0 ROM (0x0000 - 0x003F)\n           ---------------------------------\n               Stage 0 (0x0040 - 0x0168)\n           ---------------------------------\n           Compressed RDRAM init (0x168 - ~0x232)\n           ---------------------------------\n            Compressed Intro (~0x232 - ~0xFFC)\n           ---------------------------------\n           IPL2 hash matching cookie (0xFFC-0xFFF)\n           ---------------------------------\n\nSo how do you decompress an intro if RAM is not available? That's what we do:\n\n * IPL2 loads what it believes to be the \"IPL3\" (offsets 0x40-0x1000) into \n   DMEM (RSP static RAM for data).\n * Stage 0 is where execution starts. It's offset 0x40 in the ROM, which is\n   where the IPL3 entrypoint is.\n * Stage 0 contains the upkr decompression code and the payload for the next\n   stages. It decompresses Stage 1 (RDRAM init) into IMEM. Notice that part of\n   Stage 0 code is put in what is normally the header space (0x0-0x3F), and\n   since that part isn't loaded into DMEM by IPL2, it is run directly from ROM.\n * After decompression, it jumps to IMEM to run Stage 1.\n * Now Stage 1 runs. This is the RDRAM init code. It initializes RDRAM so that\n   we finally have our 4 MiB of RAM available for the intro. Then, it jumps\n   back to Stage 0.\n * Stage 0 now runs decompression a second time. This time, it decompresses\n   Stage 2 (compressed intro) to RDRAM. To be precise, it also decompresses\n   again Stage 1 to RDRAM because Stage 1 and 2 are solidly compressed as \n   a single payload to improve ratio, so Stage 1 must be decompressed before\n   Stage 2.\n * Then, Stage 0 jumps to the Stage 2 entrypoint in RDRAM. 
And now the\n   actual intro begins!\n\nWow, quite a journey! All in all, we managed to place the first compressed\nbyte of the intro at offset 0x232, meaning that the intro itself has to fit\ninto 3534 (compressed) bytes.\n\n## GPU hash cracking for an intro?\n\nWe reserved the last 4 bytes of the ROM for the bruteforcing cookie. What is it?\n\nAs explained above, IPL1/2 will verify the IPL3 checksum using a bespoke algorithm,\nto make sure it matches a hardcoded value. If this check fails, the ROM won't boot,\nso we need to make sure our final ROM matches this checksum. How is that possible?\n\nWe perform GPU hash cracking (technically, a \"pre-image attack\") by tweaking the\nlast 8 bytes of the ROM, testing millions and millions of values\nuntil we find one that matches the requested checksum. This is a process\nthat can take multiple hours on modern GPUs (e.g. ~18/24 hours on an Apple M1 Pro).\nThis technique is also used by Libdragon to release their own open-source\nIPL3s, so [tooling for this](https://github.com/Polprzewodnikowy/ipl3hasher-new) was already available.\n\nTo perform the cracking we use two sets of free bits (called X bits and\nY bits, respectively, by the tool). The X bits must be the last 32 bits of the ROM,\nso there's not much to do (as explained, we reserve them). The Y bits instead\ncan be anywhere in the ROM; the tool supports specifying up to 32 bits for Y\nbits, though most signing can succeed with only 20 of them. Since our ROM is\npretty full, we use another tool we wrote ([mips_free_bits.py](https://github.com/rasky/small64/blob/main/tools/mips_free_bits.py))\nto search for empty bits in stage0. In fact, many MIPS opcodes don't really use\nall of the 32 bits that make up the opcode, but leave a few of them undefined. 
The\nVR4300 CPU luckily just ignores those, so we can use them as our Y bits for\nthe tool.\n\n\n## Music\n\nThe initial experiments started with [dollchan bytebeat\ntool](https://dollchan.net/bytebeat/), but it is essentially JavaScript,\nwhich uses double floats internally. The hunch was that floats would be too slow\non N64 for rendering multichannel music in real time and it would be better to\nuse integers (but 64-bit ones!). However, doubles only have 52 bits in the\nfraction, so converting the dollchan bytebeat to 64-bit integers on the N64\nwould likely be a headache. It would be better to compose the music using 64-bit\nintegers in the first place. Thus, new tools were needed.\n\nThe new tool was a simple VST2 instrument, written in Go, using the [vst2\nlibrary](https://github.com/pipelined/vst2). In the first iteration, it had:\n  * All math was based on 64-bit integers. \n  * Two oscillators (sinusoidal, saw, triangle, square, and noise)\n  * ADSR envelope, with linear slopes\n  * Delay effect\n  * One second-order filter per instrument (low/band/high/notch), with adjustable frequency and resonance\n  * Pitch drop, with exponentially dropping frequency\n  *  One global reverb (shared by all instruments), ported from [4klang/Sointu](https://github.com/vsariola/sointu)\n\nAnyway, this was like WAAAY over the *speed* budget when tested on the emulator:\nN64 wasn't able to even render a single instrument with the reverb in real time!\nSo, things had to be simplified a lot. After removing the reverb, delay, and\njust keeping one oscillator per channel, the machine was still only fast enough\nto render 4 channels. The song had 8 instruments, so we had to make every two\ninstruments share the same channel and ensure that in the composition, these\nchannels had no overlapping notes.\n\nNext problem was the *size* budget. 
The first versions of the song were\nconsuming around 1.5k (after compression) and given the inefficiencies of\ncompressing the MIPS instruction encodings, the visuals were going to need all\nthe bytes they could get. Thus, several further features had to be removed from\nthe synth.\n\nIn the end, the synth had:\n  * 4 channels / 8 instruments, with every two instruments sharing a channel\n  * Adjustable sustain-release envelope, with fixed length sustain per instrument\n  * 1 oscillator (sinusoidal, triangle, noise)\n  * Filter per instrument (high or band)\n\nThe song was composed in [MuLab](https://www.mutools.com/) with the note data\nexported into a .mid file, and a quick converter turned this into linear\narrays. There was no need for a pattern/order-list kind of data storage,\nbecause the LZ-type decompression of UPKR already handled the repetitive\npatterns very well. There were 4 tracks, with note numbers 1-127 representing\nthe triggering of instrument #1 of that channel (track), while note numbers\n129-255 represented the triggering of instrument #2 of that channel.\n\nFinally, to get the instrument settings from the MuLab project file, the vst2\ninstrument was programmed to store its settings as JSON in the DAW project\n(using the GetChunk/SetChunk mechanism). 
We were happy to see that MuLab did not\napply any compression to its project files, so it was relatively easy to write a\nscript to scrape the JSONs from the MuLab project file.\n\n\n## 3D Graphics\n\n### Overview\n\nDrawing 3D meshes on the N64 requires a good amount of code.\u003cbr\u003e\nWhile there is dedicated hardware in the form of a rasterizer (the RDP), it will only process 2D screen-space triangles.\u003cbr\u003e\nOn top of that, the format expects them to be pre-processed into starting points and slopes, instead of just a set of 3 vertices (see [RDP Triangle](https://n64brew.dev/wiki/Reality_Display_Processor/Commands#0x08_through_0x0F_-_Fill_Triangle) for more information).\u003cbr\u003e\nThis means that the entire 3D pipeline, from transformation, lighting and clipping to slope calculations, needs to be done in software.\u003cbr\u003e\n\nWhile it is possible to do this on the CPU, games will offload this onto the RSP with special code called \"microcode\" or \"ucode\".\u003cbr\u003e\n\nThe main advantage here is the 32 vector registers with 8 lanes each (all 16-bit integers).\u003cbr\u003e\nOn top of that, it can also execute a scalar and a vector instruction at the same time under certain conditions.\u003cbr\u003e\n\nUsually the hardest part of writing ucode is understanding all the nuances of the rather special instruction set, as well as keeping it fast.\u003cbr\u003e\nInstructions can stall each other with complicated rules, requiring manual re-ordering (see [RSP Pipeline](https://n64brew.dev/wiki/Reality_Signal_Processor/CPU_Pipeline)).\u003cbr\u003e\n\nNormally all of this is written directly in assembly, due to a lack of compiler support.\u003cbr\u003e\nThere is however a high-level language called [RSPL](https://github.com/HailToDodongo/rspl) which was developed together with one of the homebrew 3D ucodes, [Tiny3D](https://github.com/HailToDodongo/tiny3d), which this demo loosely used for reference.\n\n### Demo Ucode\n\nFor this demo the ucode had 
to be written way differently than you usually would.\u003cbr\u003e\nInstead of running fully in parallel, it is synced with the CPU, which avoids having to implement a command queue system.\u003cbr\u003e\nThe idea is that the RSP is halted by default, while the CPU can set up the data for the next task.\u003cbr\u003e\n\nMesh data in the form of unindexed triangles is generated once in RDRAM via the CPU.\u003cbr\u003e\nThe exact location and size are hardcoded in the ucode.\u003cbr\u003e\n\nThe other input parameters that change over time (rotation, scaling) are written directly to DMEM.\u003cbr\u003e\nNote that this can only be done here since the RSP is halted, otherwise it may cause bus conflicts.\u003cbr\u003e\n\nEach frame the CPU will calculate a fixed-point transformation matrix only containing rotation.\u003cbr\u003e\nThis in turn can be used to rotate both the vertices and normals at the same time.\u003cbr\u003e\nScaling is provided as a single scalar which gets applied after that.\u003cbr\u003e\n\nOnce the RSP is started it will then load the parameters, and start processing the triangles by streaming them in one by one.\u003cbr\u003e\nEach one gets transformed, has lighting and effects applied, and is finally converted and sent to the RDP for rasterization.\u003cbr\u003e\nLastly it will stop itself, halting the processor.\u003cbr\u003e\n\nIn order to reduce instructions a lot of things were removed, including: perspective, input data for UVs and color, clipping and rejection, as well as some precision for the final RDP slopes.\u003cbr\u003e\n\nOne of the challenges with that was to work with the compression in mind.\u003cbr\u003e\nFor example the automatic reordering of RSPL was disabled for this demo.\u003cbr\u003e\nWhile reordering may not change the size, it creates less \"uniform\" code, which compresses worse.\u003cbr\u003e\nAs a tradeoff, however, this made the code run way slower than it could.\u003cbr\u003e\n\nThe process of how the RDP will fetch data was also 
changed.\u003cbr\u003e\nNormally you can point it, via a register, to a buffer in RDRAM from which to fetch commands.\u003cbr\u003e\nWhile this gives you a lot of memory to work with, it requires a DMA from the RSP to get it there.\u003cbr\u003e\nInstead we point it directly to DMEM to avoid that, with the tradeoff of dealing with the 4 KiB DMEM size and reduced performance once more due to more syncing.\u003cbr\u003e\n\nHowever, only reducing code doesn't make for a great demo, so a few effects were squeezed in.\u003cbr\u003e\nAll of them work by using the existing data with little extra code:\n\n#### UV-Gen\nUVs are generated based on the X and Y components of the screen-space normals.\u003cbr\u003e\nThis is a simplified form of spherical texture coordinates.\u003cbr\u003e\nThe RDP will later draw a texture with that, where the texture data itself is simply random data in RDRAM.\u003cbr\u003e\nThis gives the torus a metallic appearance.\u003cbr\u003e\n\n#### Fresnel\nBy taking the Z-component of the normals and scaling it, we can get a cheap Fresnel factor.\u003cbr\u003e\nThat factor is later used to blend towards another random texture for the colored outline effect.\u003cbr\u003e\n\n#### Specular\nThe specular highlights are faked by taking the previous Fresnel factor and inverting it.\u003cbr\u003e\nThis effectively simulates a light pointing directly at the screen.\u003cbr\u003e\nBy multiplying it with itself a few times, which also compresses nicely, we can get a sharp highlight.\u003cbr\u003e\nThis value is passed as vertex color to the RDP, and is then added on top of the texture color.\u003cbr\u003e\n\n#### Explosion Effect\n\nA scalar can be passed in which will displace a triangle.\u003cbr\u003e\nThis will take the first vertex of a triangle together with parts of the matrix to generate an offset in screen-space.\u003cbr\u003e\nThe exact calculations have no deeper meaning and were chosen randomly.\u003cbr\u003e\nBy adding this scaled displacement onto all 
vertices of a triangle, they will start flying across the screen.\u003cbr\u003e\n","funding_links":[],"categories":["Programming"],"sub_categories":["C"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasky%2Fsmall64","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frasky%2Fsmall64","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasky%2Fsmall64/lists"}