{"id":17360644,"url":"https://github.com/certik/bcompiler","last_synced_at":"2025-04-14T23:32:05.044Z","repository":{"id":54096347,"uuid":"42843204","full_name":"certik/bcompiler","owner":"certik","description":"Mirror of http://www.rano.org/bcompiler.tar.gz, with a bootstrap script","archived":false,"fork":false,"pushed_at":"2021-03-09T12:52:13.000Z","size":43,"stargazers_count":87,"open_issues_count":1,"forks_count":17,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-28T12:21:11.794Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://rano.org/bcompiler.html","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/certik.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-21T04:07:53.000Z","updated_at":"2025-03-16T16:48:01.000Z","dependencies_parsed_at":"2022-08-13T06:40:58.020Z","dependency_job_id":null,"html_url":"https://github.com/certik/bcompiler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certik%2Fbcompiler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certik%2Fbcompiler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certik%2Fbcompiler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certik%2Fbcompiler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/certik","download_url":"https://codeload.github.com/certik/bcompiler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repos
itories_count":248978970,"owners_count":21192883,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T19:26:59.684Z","updated_at":"2025-04-14T23:32:04.804Z","avatar_url":"https://github.com/certik.png","language":"Shell","funding_links":[],"categories":["Assemblers"],"sub_categories":["Self-hosted hex assemblers"],"readme":"\tBootstrapping a simple compiler from nothing\n\t============================================\n\nThis document describes how I implemented a tiny compiler for a toy\nprogramming language somewhat reminiscent of C and Forth. The funny\nbit is that I implemented the compiler in the language itself without\ndirectly using any previously existing software. So I started by\nwriting raw machine code in hexadecimal and then, through a series of\nbootstrapping steps, gradually made programming easier for myself\nwhile implementing better and better \"languages\".\n\nThe complete source code for all the stages is in a tar archive:\n\u003chttp://www.rano.org/bcompiler.tar.gz\u003e. This text is the README file\nfrom that archive. So, if you are reading this on-line, you can fetch\nthe tar archive and continue off-line, if you prefer.\n\nThe code only runs on i386-linux, though it would be easy to port it\nto another operating system on i386, and probably not at all hard to\nport it to a different architecture.\n\n\nHEX1: the boot loader\n---------------------\n\nYou could input a short program into the memory of an early computer\nby using switches on its front panel. 
This short program might then\nread in a longer program from punched cards. To write a program on\npunched cards you did not need an editor program, as you could write\nnew cards using an electro-mechanical card punch and manually insert\nand remove cards from the deck. So, if we were using an early\ncomputer, we could really implement a compiler without using any\nexisting software. Unfortunately, a modern PC has neither front panel\nswitches nor a punched card reader, so you need some software running\non the machine just to read in a new program. In fact, you probably\nneed some rather complex software running on the machine: just take a\nlook at /usr/src/linux/drivers/block/floppy.c, for example.\n\nSince we are doing this on a PC running Linux, we have to define some\nother starting point. Rather than use the raw hardware, we start with\nthese facilities:\n\n - an operating system;\n\n - a simple text editor (or we could use Emacs and pretend it's a\n   simple text editor);\n\n - a shell that lets us run a program with file descriptors connected\n   to particular files (this way the programs we write only need to\n   read from and write to file descriptors and do not have to know\n   about opening files);\n\n - an initial program to convert hexadecimal to binary so that we can\n   compose our first programs in hexadecimal, using the text editor,\n   and then \"compile\" them to binary in order to run them (this\n   corresponds roughly to the program that you might enter into an\n   early computer using front panel switches).\n\nOur initial program is hex1.he (the source in hexadecimal) or hex1\n(the binary). 
If you want to check that hex1 really is the binary\ncorresponding to hex1.he, you can do a hex dump of it:\n\n\tod -Ax -tx1 hex1\n\nIf you use hex1 to process hex1.he the result is hex1 again:\n\n\t./hex1 \u003c hex1.he | diff - hex1\n\nSo we can think of hex1 as a trivial bootstrapping compiler for a\nlanguage called HEX1.\n\nApart from comments and white space, the syntax of HEX1 is\n/([0-9a-f]{2})*/. Comments start with '#' and continue to the end of\nthe line. The semantics of HEX1 is the semantics of machine code,\nwhich is rather complex. Fortunately we can restrict ourselves to a\ntiny subset of the full instruction set.\n\nIn hex1.he I have put the corresponding assembler code in comments\nnext to the machine code. The file starts with two ELF headers: a\n52-byte file header and a 32-byte program header. It is not necessary\nto understand all the fields in the ELF header. The most interesting\nfields are:\n\n* e_entry, which specifies where execution should begin. Here it is\n0x08048054, which is directly after the ELF headers (labelled _start).\n\n* p_vaddr and p_paddr, which specify the target address in memory.\nHere it is 0x08048000, which is standard for Linux binaries.\n\n* p_filesz and p_memsz, which should be set to the length of the file.\nIt seems not to matter if you put a larger number here, and I will\nmake use of that later, though here I have put the correct value.\n\n(For more information about ELF do a web search. SCO and Intel have\nsome useful on-line documents.)\n\nThe code at _start is a loop that reads pairs of hex digits by calling\ngethex and outputs bytes by calling putchar. Next comes putchar, which\nuses the \"write\" system call. Then gethex, which calls getchar and\ncontains a loop for skipping over comments. The ASCII characters\n[0-9a-f] are converted correctly to the values 0 to 15; everything\nbelow '0' (48) is treated as a space and ignored; other characters are\nmisconverted, as there is no error detection. 
The function getchar\nuses the \"read\" system call, and calls \"exit\" at the end of the file.\n\n\nHEX2: one-character labels\n--------------------------\n\nWriting machine code in hex is not much fun. The worst part is\ncalculating the addresses for branch, jump and call instructions. Here\nI am using relative addresses, so I have to recalculate the address\nevery time I change the length of the code between an instruction and\nits target. It would be no better if I were using absolute addresses:\nthen I would have to change all references to locations after the\nchange.\n\nSo the first feature I add for my convenience is a function for\ncomputing relative addresses. Instead of writing\n\n\t# function:\n\t\t...\n\t\te8 cc ff ff ff\t\t# call function\n\nI will be able to write:\n\n\t.F\t\t\t# function:\n\t\t...\n\t\te8 F\t\t\t# call function\n\nHEX2 automatically fills in the correct 4-byte relative address.\n\nUnfortunately, I still have to use HEX1 to implement the first version\nof HEX2, so, to keep the implementation simple, I only allow\none-character labels and backwards references to them. And there is no\nerror detection for an undefined label.\n\nThe syntax of HEX2 is ([0-9a-f]{2}|\\.L|L)*, where L is any character\nabove 32 apart from [0-9a-f].\n\nThe first implementation of HEX2 is hex2a.he. If you compare the ELF\nheaders in hex1.he and hex2a.he you will notice that I have changed\np_flags. This is to make the program writable as well as executable.\nNormal programs consist of several sections, in particular a text\nsection, which contains the program itself, and a data section. The\ntext section is executable, but not writable, and the data section is\nwritable, but not executable. In hex1.he I did not need to write any\ndata to memory, so I only had a text section. 
In hex2a.he I need to\nwrite data to memory, but I can not be bothered with separate\nsections, so I use a single section which is both executable and\nwritable.\n\nThere are only two pieces of data: \"pos\" is a 32-bit counter to keep\ntrack of our location as we output the binary, and \"label\" is a\n259-byte table to record the values of the labels. Why 259 bytes? This\nis because I forgot to multiply by 4. I should have used a table of\n256 4-byte values, one for each possible one-character label, and\ncalculated the address as (table + char * 4). Since I forgot to\nmultiply by 4, I only need 259 bytes for my table, and I have to avoid\nusing labels that are close to one another: if I use 'm', then I\ncannot use 'j', 'k', 'l', 'n', 'o' or 'p'. It would be easy to fix\nthis bug immediately, but it is even easier to work around it for now\nand fix it a bit later.\n\nWe can \"compile\" hex2a.he using hex1:\n\n\t./hex1 \u003c hex2a.he \u003e hex2a \u0026\u0026 chmod +x hex2a\n\nSince HEX2 is a superset of HEX1, hex2a.he can also compile itself:\n\n\t./hex2a \u003c hex2a.he | diff - hex2a\n\nTo test the new facility, I made hex2b.he from hex2a.he by replacing\nnumerical addresses by symbolic ones wherever possible. Compiling\nhex2b.he gives the same binary as hex2a.he:\n\n\t./hex2a \u003c hex2b.he | diff - hex2a\n\nIn hex2c.he I fix the \"multiply by 4\" bug. It is easier to fix the bug\nnow that I can use labels and do not have to manually modify relative\naddresses. 
In hex2c.he I also replace some 1-byte relative addresses\nby 4-byte relative addresses, so that I can use labels, and I have\ninserted blocks of NOPs at the end of file to make the precise value\nof e_entry less critical.\n\nWe can compile hex2c.he using hex2a/hex2b or using itself:\n\n\t./hex2a \u003c hex2c.he \u003e hex2c \u0026\u0026 chmod +x hex2c\n\t./hex2c \u003c hex2c.he | diff - hex2c\n\n\nHEX3: four-character labels and a lot of calls\n----------------------------------------------\n\nOne-character labels are a bit restrictive, so let us implement\nfour-character labels. If labels have exactly four characters we can\nstore them neatly in 32-bit words!\n\nThe syntax of HEX3 is /([0-9a-f]{2}|:....|\\.....)*/, and now we will\nintroduce some very basic error detection. The compiler can report\nthree different kinds of error, which it will do using its exit code:\n\n exit code 1: syntax error\n exit code 2: redefined label\n exit code 3: undefined label\n\nSince it is a single-pass compiler, only backwards references to\nlabels are permitted.\n\nThe first implementation of HEX3 was hex3a.he, written in HEX2:\n\n\t./hex2c \u003c hex3a.he \u003e hex3a \u0026\u0026 chmod +x hex3a\n\nIt is not possible to compile hex3a.he with hex3a itself, as HEX3 is\nnot compatible with HEX2.\n\nI created hex3a.he by making successive small changes to hex2c.he. The\nsystem call brk() is used to get memory for an arbitrarily large\nsymbol table. Absolute references to data are avoided by putting a\nfunction (.z / get_p) in front of the static data area that returns\nthe address of the following data.\n\nHaving created hex3a.he, I started work on hex3b.he, an implementation\nof HEX3 written in HEX3. Initially hex3b.he was just hex3a.he\ntranslated to the new syntax, but I then gradually rewrote it to make\nmuch greater use of labels and functions. 
In the final version, after\na certain point in the file, everything is done using only these\ninstruction groups:\n\n - push a constant onto the stack:  68 XX XX XX XX\n - call a named function:           e8 .LABEL\n - unconditional jump:              e9 .LABEL\n - conditional branch:              58 85 c0 0f 85 .LABEL\n - push an address onto the stack:  68 .LABEL e8 .reab\n\nThe last instruction group consists of a push instruction followed by\na call instruction, but the two may not be separated: the function\n\"reab\" converts the relative address on the stack to an absolute\naddress by adding its return address and subtracting 5.\n\nWe can compile hex3b.he using hex3a or itself:\n\n\t./hex3a \u003c hex3b.he \u003e hex3b \u0026\u0026 chmod +x hex3b\n\t./hex3b \u003c hex3b.he | diff - hex3b\n\n\nHEX4: any-length labels and implicit calls\n------------------------------------------\n\nWhen implementing hex3b.he we found that it is possible to define all\ncomplex functions in terms of simpler functions by using a tiny subset\nof all the possible machine instructions: branch, call, jump and a few\nothers.\n\nIn HEX4 we use an even smaller set of instructions and generate those\ninstructions implicitly.\n\nIn HEX4 there are five types of token:\n\n - in-line code or data ('58, '59)\n - define label (:data, :loop, :func)\n - instruction: push constant (10, 42)\n - instruction: push label address (\u0026func, \u0026loop)\n - instruction: call label address (+, -, jump, branch, func)\n\nTokens must be separated by white space and the type of token is\nrecognised from the first character. Labels can have any length - but\nwe implement them with a simple hash function, so there is a risk of\nspurious redefined label errors.\n\nThe jump and branch instruction groups from HEX3 are implemented by\nfunctions. 
A \"push label address\" instruction must always be followed\nimmediately by a call to one of the functions that can understand a\nrelative address: address, branch, jump. The \"address\" function\n(formerly \"reab\") converts the relative address to an absolute\naddress, which can be stored and used later.\n\nThe predefined functions are:\n\nStack manipulation: drop dup rot pick swap\nArithmetic: + - * / % \u003c\u003c \u003e\u003e log\nComparisons: \u003c \u003c= == != \u003e= \u003e\nBitwise logic: \u0026 | ^ ~\nMemory access: @ = c@ c=\nFlow of control, using immediate relative address: branch call\nFlow of control, using stored absolute address: call\nAddress conversion: address\nArray support: [] []\u0026 []= c[] c[]\u0026 c[]=\nAccess of arguments and variables: arg arg\u0026 arg= var var\u0026 var=\nFunction support: enter vars xreturnx xreturn0 xreturn1\nDynamic memory: wsize sbrk / malloc free realloc\nSystem calls: exit in out\n\n- All operations take arguments and return results to the stack.\n\n- Comparisons return 0 or 1.\n\n- All data are words, except for c@, c=, c[], c[]\u0026, c[]=, which\noperate on bytes.\n\n- Any user-defined function must start with \"enter\"; \"vars\" can be\nused straight after \"enter\" to reserve space for N local variables.\n\n- To return from a function, use one of the \"return\" functions. \"X Y\nxreturnx\" means return Y values from a function that took X arguments.\nThe most common cases are Y=0 and Y=1, so \"X xreturn0\" and \"X\nxreturn1\" are provided.\n\n- Like in C, addresses are byte addresses, so we have to multiply by\nwsize when allocating memory with sbrk or malloc.\n\n- \"x y []\" is equivalent to \"x y wsize * + @\"\n\n- As always, no forward references to labels are allowed.\n\nAs with HEX3 there are two implementations of HEX4. The first one,\nhex4a.he, is written in HEX3. 
The second one, hex4b.he, is written in\nHEX4.\n\n\t./hex3b \u003c hex4a.he \u003e hex4a \u0026\u0026 chmod +x hex4a\n\t./hex4a \u003c hex4b.he \u003e hex4b \u0026\u0026 chmod +x hex4b\n\t./hex4b \u003c hex4b.he | diff - hex4b\n\n\nHEX5: structured programming, at last\n-------------------------------------\n\nHEX5 is more like a real structured programming language. There are no\nlonger any labels; instead there are loops and if...(else)...fi\nstructures. The syntax of HEX5 can no longer be described with a\nregular expression; instead we need a context-free grammar:\n\n\tprogram = (hexitem | global | procedure)*\n\thexitem = hexbyte |  \"_def\" symbol\n\thexbyte = /'[0-9a-f][0-9a-f]/\n\tglobal = \"var\" symbol | \"string\" symbol string_literal\n\tstring_literal = /\"([^\"]|\\\\.)*\"/\n\tprocedure = \"def\" args name \"{\" vars body \"}\"\n\targs = symbol*\n\tname = symbol\n\tvars = \"var\" symbol\n\tbody = (number | word | loop | jump | if)*\n\tnumber = /[0-9]+/\n\tword = symbol\n\tloop = \"{\" body \"}\"\n\tjump = \"break\" | \"continue\" | \"until\" | \"while\"\n\tif = \"if\" body \"fi\" | \"if\" body \"else\" body \"fi\"\n\tsymbol = /.+/ except ...\n\nLexical rules:\n\n\tcomment = /#[^\\n]*\\n?/\n\tspace = /\\s/\n\tstring_literal = /\"([^\"]|\\\\.)*\"/\n\ttoken = /\\S+/\n\nThe first implementation of HEX5, written in HEX4, is hex5a.he. This\nis only a very partial implementation, as it would be quite tedious to\nimplement all of HEX5 in HEX4. In particular, there are not yet any\nnamed variables or arguments; access to a function's arguments and\nlocal variables is done using the functions from HEX4. Global\nvariables are implemented with a cunning hack:\n\n\t./hex4b \u003c hex5a.he \u003e hex5a \u0026\u0026 chmod +x hex5a\n\nNext came hex5b.he, which can compile itself, as it is written in a\nsubset of HEX5. 
In hex5b.he I implemented named arguments and\nvariables:\n\n\t./hex5a \u003c hex5b.he \u003e hex5b \u0026\u0026 chmod +x hex5b\n\t./hex5b \u003c hex5b.he | diff - hex5b\n\nThen I wanted to start using those features for implementing further\nfeatures, so I switched to developing hex5c.he, in which I implemented\nstring constants, \"while\", \"until\", \"return0\" and \"return1\":\n\n\t./hex5b \u003c hex5c.he \u003e hex5c \u0026\u0026 chmod +x hex5c\n\n\nBCC: a real language\n--------------------\n\nAll that is needed to turn HEX5 into a tiny structured programming\nlanguage is to separate off the first part of the source, where there\nis in-line machine code and the \"predefined\" and library functions are\nimplemented, into a separate header file. At this point I removed\nreferences to \"hex\" and called the two files \"header.bc\" and \"bcc.bc\".\nThese two files are concatenated for compilation:\n\n\tcat header.bc bcc.bc | ./hex5c \u003e bcc \u0026\u0026 chmod +x bcc\n\nNow bcc can compile itself, of course:\n\n\tcat header.bc bcc.bc | ./bcc \u003e bcc2 \u0026\u0026 chmod +x bcc2\n\tmv bcc2 bcc\n\tcat header.bc bcc.bc | ./bcc | diff - bcc\n\nNote that the bcc produced by hex5 might not be identical to the bcc\nproduced by bcc itself, as I might make some minor improvements to the\ncode generated by bcc. 
But the main improvements to be introduced in\nbcc are:\n\n - proper error messages to stderr instead of just exit codes\n - report undefined symbols\n - a dynamic buffer for tokens so there is no limit to their length\n\n\nWhat next?\n----------\n\nHere are some things that one might want to do with BCC for one's\neducation and entertainment:\n\n - port it to a different operating system or architecture\n   (you could compile to Java byte code, for example)\n\n - think of a neater way of handling return values from functions\n\n - implement a compile-time check for stack underflow\n\n - include a non-bogus implementation of malloc, realloc, free\n\n - use an RB-tree for the symbol table so that the compiler does not\n   take time quadratic in the number of symbols\n\n - think up a way of using BCC to bootstrap GCC ...\n\n\nEdmund GRIMLEY EVANS \u003cedmundo@rano.org\u003e, March 2001\nRevised: March 2002\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcertik%2Fbcompiler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcertik%2Fbcompiler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcertik%2Fbcompiler/lists"}