{"id":17101207,"url":"https://github.com/hackerb9/tokenize","last_synced_at":"2026-01-04T18:14:07.606Z","repository":{"id":49751080,"uuid":"517916659","full_name":"hackerb9/tokenize","owner":"hackerb9","description":"Tokenizer for Model 100 BASIC","archived":false,"fork":false,"pushed_at":"2024-08-04T08:56:44.000Z","size":18900,"stargazers_count":8,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-29T01:56:23.244Z","etag":null,"topics":["basic-programming","compression","cruncher","flex","kyocera-85","model100","tandy200","tokenizer","trs-80model100"],"latest_commit_sha":null,"homepage":"","language":"Stata","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hackerb9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-26T04:49:41.000Z","updated_at":"2024-11-28T20:26:12.000Z","dependencies_parsed_at":"2024-08-04T09:46:30.503Z","dependency_job_id":null,"html_url":"https://github.com/hackerb9/tokenize","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Ftokenize","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Ftokenize/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Ftokenize/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackerb9%2Ftokenize/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hackerb
9","download_url":"https://codeload.github.com/hackerb9/tokenize/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245153896,"owners_count":20569408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["basic-programming","compression","cruncher","flex","kyocera-85","model100","tandy200","tokenizer","trs-80model100"],"created_at":"2024-10-14T15:24:17.461Z","updated_at":"2026-01-04T18:14:07.558Z","avatar_url":"https://github.com/hackerb9.png","language":"Stata","readme":"# Model 100 Tokenizer in C\n\nAn external \"tokenizer\" for the TRS-80 Model 100 BASIC language.\nConverts BASIC programs in ASCII text (`.DO`) to executable BASIC\n(`.BA`) on a host machine. 
Useful for large programs which the Model\n100 cannot tokenize due to memory limitations.\n\n\t$ cat FOO.DO\n\t10 ?\"Hello!\"\n\n\t$ tokenize FOO.DO \n\tTokenizing 'FOO.DO' into 'FOO.BA'\n\n\t$ hd FOO.BA\n\t00000000  0f 80 0a 00 a3 22 48 65  6c 6c 6f 21 22 00        |.....\"Hello!\".|\n\t0000000e\n\n![A screenshot of running the FOO.BA program on a Tandy 102 (emulated\nvia Virtual T)](README.md.d/HELLO.gif \"Running a .BA file on a Tandy 102 (Virtual T)\")\n\nThis program creates an executable BASIC file that works on the Model\n100, the Tandy 102, Tandy 200, Kyocera Kyotronic-85, and Olivetti M10.\nThose five machines have [identical\ntokenization](http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file).\nIt does not (yet) work for the NEC PC-8201/8201A/8300 whose N82 BASIC\nhas a different tokenization.\n\nAdditionally, this project provides a decommenter and cruncher\n(whitespace remover) to save bytes in the tokenized output. This\nallows one to have both well-commented, easy to read source code and a\nsmall executable size.\n\n## Introduction\n\nThe Tandy/Radio-Shack Model 100 portable computer can save its BASIC\nfiles in ASCII (plain text) or in a _“tokenized”_ format where the\nkeywords — such as `FOR`, `IF`, `PRINT`, `REM` — are converted to a\nsingle byte. The Model 100 automatically tokenizes an ASCII program\nwhen it is `LOAD`ed so that it can be `RUN` or `SAVE`d. The tokenized\nformat saves space and loads faster.\n\n### The problem\n\nPrograms for the Model 100 are generally distributed in ASCII format\nwhich is good for portability and easy transfer. However, ASCII files\nhave downsides: \n\n1. `RUN`ning an ASCII program is quite slow because the Model 100 must\n   tokenize it first.\n\n1. 
Large programs can run out of memory (`?OM Error`) when tokenizing\n   on the Model 100 because both the ASCII and the tokenized versions\n   must be in memory simultaneously.\n\n![A screenshot of an OM Error after attempting to tokenize on\na Tandy 102](README.md.d/OMerror.png \"Out of Memory error when\nattempting to tokenize on a Tandy 102 (Virtual T)\")\n\n\n### The solution\n\nThis program tokenizes on a host computer before downloading to the Model 100.\n\n### File extension terminology\n\nTokenized BASIC files use the extension `.BA`. ASCII formatted BASIC\nfiles should be given the extension `.DO` so that the Model 100 will\nsee them as text documents.\n\n\u003cdetails\u003e\u003csummary\u003eClick for details on filename extensions\u003c/summary\u003e\u003cul\u003e\n\n  | Machine Memory Map   | ASCII BASIC | Tokenized BASIC   |\n  |----------------------|-------------|-------------------|\n  | Generic              | `.DO`       | `.BA`             |\n  | Model 100 / 102      | `.100`      | `.BA1`            |\n  | Tandy 200            | `.200`      | `.BA2`            |\n  | NEC PC-8201          | `.NEC.DO`   | `.BA0`, `.NEC.BA` |\n  | Olivetti M10         | `.M10.DO`   | `.B10`, `.M10.BA` |\n  | Kyocera Kyotronic-85 | `.KYO.DO`   | `.B85`, `.KYO.BA` |\n\nThe primary two filename extensions are:\n\n  * **`.DO`** This is the extension the Model 100 uses for plain text\n    BASIC files, but in general can mean any ASCII text document with\n    CRLF[^2] line endings.\n  * **`.BA`** On the Model 100, the .BA extension always means a\n    tokenized BASIC program. However, many \".BA\" files found on the\n    Internet are actually in ASCII format. 
Before the existence of\n    tokenizers like this, one was expected to know that ASCII BASIC\n    files had to be renamed to .DO before downloading to a Model 100.\n\nBASIC programs that use `PEEK`, `POKE`, or `CALL` typically only work\non one specific model of portable computer, even when the BASIC\nlanguage and tokenization are identical. (The only two models which\nhave the same memory map are the Model 100 and Tandy 102.) Using\nPOKE or CALL on the wrong model can crash badly, possibly losing\nfiles. To avoid this, when sharing BASIC programs over the network,\nfilename extensions are sometimes used to indicate which model a\nprogram will work on:\n\n  * `.100` An ASCII BASIC file that includes POKEs or CALLs specific\n\t  to the Model 100 or Tandy 102.\n  * `.200` An ASCII BASIC file specific to the Tandy 200.\n  * `.NEC` (or `.NEC.DO`) An ASCII BASIC file specific to the NEC\n    PC-8201 portables.\n\n  * `.BA1` A tokenized BASIC file specific to the Model 100/102.\n  * `.BA2` A tokenized BASIC file specific to the Tandy 200.\n  * `.BA0` (or `.NEC.BA`) A tokenized BASIC file specific to the NEC\n    portables.\n\nIt is not yet clear what extensions are used to denote the\nKyotronic-85 and M10. Hackerb9 suggests:\n\n  * `.KYO` or `.KYO.DO` An ASCII BASIC file that includes POKEs or\n    CALLs specific to the Kyocera Kyotronic-85.\n  * `.M10` or `.M10.DO` An ASCII BASIC file specific to the Olivetti M10.\n  * `.B85` or `.KYO.BA` A tokenized BASIC file specific to the Kyotronic-85.\n  * `.B10` or `.M10.BA` A tokenized BASIC file specific to the M10.\n\n\u003c/ul\u003e\u003c/details\u003e \u003c!-- Filename extensions --\u003e\n\n## Programs in this project\n\n* **tokenize**: A shell script which ties together all the following\n  tools. 
Most people will only run this program directly.\n\n* **m100-tokenize**: Convert M100 BASIC program from ASCII (.DO)\n  to executable .BA file.\n  \n* **m100-sanity**: Clean up the BASIC source code (sort lines and\n  remove duplicates).\n  \n* **m100-jumps**: Analyze the BASIC source code and output a list of\n  lines that begin with a comment and that are referred to by other\n  lines in the program. Input should have been run through\n  m100-sanity. \n\n* **m100-decomment**: Modify the BASIC source code to remove `REM`\n  and `'` comments, except that it keeps lines mentioned on the command line\n  (output from m100-jumps). This can save a lot of space if the\n  program was well commented.\n\n* **m100-crunch**: Modify the BASIC source code to remove all\n  whitespace in an attempt to save even more space at the expense of\n  readability. \n\n## Compilation \u0026 Installation\n\nPresuming you have `flex` installed, just run `make` to compile.\n\n``` BASH\n$ git clone https://github.com/hackerb9/tokenize\n$ cd tokenize\n$ make\n$ make install\n```\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eOptionally, you can compile by hand\u003c/b\u003e\u003c/summary\u003e\n\n```bash\n flex m100-tokenize.lex  \u0026\u0026  gcc lex.tokenize.c\n```\n\nFlex creates the file lex.tokenize.c from m100-tokenize.lex. The\n`main()` routine is defined in m100-tokenize-main.c, which is\n#included by m100-tokenize.lex.\n\n\u003c/details\u003e\n\n## Usage\n\nOne can either use the `tokenize` wrapper or run the executables manually.\n\n### The tokenize wrapper\n\nThe\n\"[tokenize](https://github.com/hackerb9/tokenize/blob/main/tokenize)\"\nscript is easiest. 
By default, the output will be exactly the\nsame, byte for byte, as a .BA file created on actual hardware.\n\n#### Synopsis\n\n\u003cul\u003e\n\n**tokenize** _INPUT.DO_ [ _OUTPUT.BA_ ]\u003cbr/\u003e\n**tokenize** [ **-d** | **--decomment** ] _INPUT.DO_ [ _OUTPUT.BA_ ]\u003cbr/\u003e\n**tokenize** [ **-c** | **--crunch** ] _INPUT.DO_ [ _OUTPUT.BA_ ]\n\nThe **-d** option decomments before tokenizing.\n\nThe **-c** option decomments _and_ removes all optional\nwhitespace before tokenizing.\n\n\u003c/ul\u003e\n\n#### Example 1: Simplest usage: tokenize filename\n\n\n``` bash\n$ tokenize PROG.DO\nTokenizing 'PROG.DO' into 'PROG.BA'\n```\n\n\u003cdetails\u003e\u003csummary\u003eClick to see more examples\u003c/summary\u003e\u003cul\u003e\n\n#### Example 2: Overwrite or rename\n\n``` bash\n$ tokenize PROG.DO\nOutput file 'PROG.BA' is newer than 'PROG.DO'. \nOverwrite [yes/No/rename]? R\nOld file renamed to 'PROG.BA~'\n```\n\n#### Example 3: Crunching to save space: tokenize -c\n\n``` bash\n$ wc -c M100LE.DO \n17630 M100LE.DO\n\n$ tokenize M100LE.DO \nTokenizing 'M100LE.DO' into 'M100LE.BA'\n\n$ wc -c M100LE.BA\n15667 M100LE.BA\n\n$ tokenize -c M100LE.DO M100LE-crunched.BA\nDecommenting, crunching, and tokenizing 'M100LE.DO' into 'M100LE-crunched.BA'\n\n$ wc -c M100LE-crunched.BA\n6199 M100LE-crunched.BA\n```\n\nIn this case, using `tokenize -c` reduced the BASIC executable from 16\nto 6 kibibytes, which is quite significant on a machine that might\nhave only 24K of RAM. However, this is an extreme example from a well\ncommented program. 
Many Model 100 programs have already been\n\"decommented\" and \"crunched\" by hand to save space.\n\n\u003cul\u003e\n\n_Tip: When distributing a crunched program, it may be a good idea to also\ninclude the original source code to make it easy for people to learn\nfrom, debug, and improve it._\n\n\u003c/ul\u003e\n\n\u003c/ul\u003e\u003c/details\u003e\n\n### Running m100-tokenize and friends manually\n\nIf one decides to not use the `tokenize` script, certain programs\nshould _usually_ be run to process the input before the final\ntokenization step, depending upon what is wanted. m100-sanity is\nstrongly recommended. (See [Abnormal code](#abnormal-code) below.)\n\n``` mermaid\nflowchart LR;\nm100-sanity ==\u003e m100-tokenize\nm100-sanity ==\u003e m100-jumps ==\u003e m100-decomment\nm100-decomment --\u003e m100-crunch --\u003e m100-tokenize\nm100-decomment --\u003e m100-tokenize\n```\n\n| Programs used                                                                   | Effect                                         | Same as     |\n|---------------------------------------------------------------------------------|------------------------------------------------|-------------|\n| m100-sanity\u003cbr/\u003em100-tokenize                                                   | Identical byte-for-byte to a genuine Model 100 | tokenize    |\n| m100-sanity\u003cbr/\u003em100-jumps\u003cbr/\u003em100-decomment\u003cbr/\u003em100-tokenize                 | Saves RAM by removing unnecessary comments     | tokenize -d |\n| m100-sanity\u003cbr/\u003em100-jumps\u003cbr/\u003em100-decomment\u003cbr/\u003em100-crunch\u003cbr/\u003em100-tokenize | Saves even more RAM, removing whitespace       | tokenize -c |\n| m100-tokenize                                                                   | Abnormal code is kept as is                    |             |\n\n\u003cdetails\u003e\u003csummary\u003eClick to see more details about running these programs 
manually\u003c/summary\u003e\u003cp\u003e\u003cul\u003e\n\n### m100-tokenize synopsis\n\n**m100-tokenize** [ _INPUT.DO_ [ _OUTPUT.BA_ ] ]\n\nUnlike `tokenize`, m100-tokenize never guesses the output filename.\nWith no files specified, the default is to use stdin and stdout so it\ncan be used as a filter in a pipeline. The other programs --\nm100-sanity, m100-jumps, m100-decomment, and m100-crunch -- all have the\nsame syntax taking two optional filenames. \n\n### Example usage of m100-tokenize\n\nWhen running m100-tokenize by hand, process the input through the\n`m100-sanity` script first to correct possibly ill-formed BASIC source\ncode. \n\n``` bash\nm100-sanity INPUT.DO | m100-tokenize \u003e OUTPUT.BA\n```\n\nThe above example is equivalent to running `tokenize INPUT.DO\nOUTPUT.BA`.\n\n### Example usage with decommenting\n\nThe m100-decomment program needs help from the m100-jumps program to\nknow when it shouldn't completely remove a commented out line, for\nexample,\n\n``` BASIC\n10 REM This line would normally be removed\n20 GOTO 10   ' ... but now line 10 should be kept.\n```\n\nSo, first, we get the list of line numbers that must be kept in the\nvariable `$jumps` and then we call m100-decomment passing in that list\non the command line.\n\n    jumps=$(m100-sanity INPUT.DO | m100-jumps)\n    m100-sanity INPUT.DO | \n\t    m100-decomment - - $jumps | \n\t\tm100-tokenize \u003e OUTPUT.BA\n\nThe above example is equivalent to running `tokenize -d INPUT.DO\nOUTPUT.BA`.\n\nNote that m100-decomment keeps the entire text of comments which are\nlisted by m100-jumps with the presumption that, as targets of GOTO or\nGOSUB, they are the most valuable remarks in the program. 
(This\nbehaviour may change in the future.)\n\nExample output after decommenting but before tokenizing:\n``` BASIC\n10 REM This line would normally be removed\n20 GOTO 10\n```\n\n### Example usage with crunching\n\nThe m100-crunch program removes all optional space and some other\noptional characters, such as a double-quote at the end of a line or a\ncolon before an apostrophe. It also completely removes the text of\ncomments which may have been preserved by m100-decomment from the\nm100-jumps list. In short, it makes the program extremely hard to\nread, but does save a few more bytes in RAM.\n\n    jumps=$(m100-sanity INPUT.DO | m100-jumps)\n    m100-sanity INPUT.DO | \n\t    m100-decomment - - $jumps | \n\t\tm100-crunch | \n\t\tm100-tokenize \u003e OUTPUT.BA\n\nThe above example is equivalent to running `tokenize -c INPUT.DO\nOUTPUT.BA`.\n\nExample output after crunching but before tokenizing:\n\n``` BASIC\n10REM\n20GOTO10\n```\n\n### An obscure note about stdout stream rewinding \n\nAfter finishing tokenizing, m100-tokenize rewinds the output\nfile in order to correct the **PL PH** line pointers. Rewinding\nfails if the standard output is piped to another program. For\nexample:\n\n  1. m100-tokenize  FOO.DO  FOO.BA\n  2. m100-tokenize  \u003cFOO.DO  \u003eFOO.BA\n  3. m100-tokenize  FOO.DO | cat \u003e FOO.BA\n\nNote that (1) and (2) are identical, but (3) is slightly different.\n\nIn example 3, the output stream cannot be rewound and the line\npointers will all contain \"\\*\\*\" (0x2A2A). 
This does not matter\nfor a genuine Model T computer which ignores **PL PH** in a file,\nbut some emulators are known to be persnickety and balk.\n\nIf you find this to be a problem, please file an issue as it is\npotentially correctable using `open_memstream()`, but hackerb9 does\nnot see the need.\n\n\u003c/ul\u003e\u003c/details\u003e \u003c!-- Running manually --\u003e\n\n\u003cdetails\u003e\u003csummary\u003eClick for details on creating abnormal .BA files.\u003c/summary\u003e\u003cul\u003e\n\n## Abnormal code\n\nThe `tokenize` script always uses the m100-sanity program to clean up\nthe source code, but one can run m100-tokenize directly to\npurposefully create abnormal, yet valid, `.BA` files. These programs\ncannot be created on genuine hardware, but **will** run.\n\nHere is an extreme example.\n\n\u003cb\u003eSource code for \n\u003ca href=\"https://github.com/hackerb9/tokenize/blob/main/degenerate/GOTO10.DO\"\u003e\n\"GOTO 10\"\u003c/a\u003e by hackerb9\u003c/b\u003e\n\n```BASIC\n1 I=-1\n10 I=I+1: IF I MOD 1000 THEN 10 ELSE I=0\n10 PRINT: PRINT\n10 PRINT \"This is line ten.\"\n10 PRINT \"This is also line ten.\"\n5  PRINT \"Line five runs after line ten.\"\n10 PRINT \"Where would GOTO 10 go?\"\n15 PRINT \"  (The following line is 7 GOTO 10)\"\n7 GOTO 10\n8 ERROR \"Line 8 is skipped by GOTO 10.\"\n10 PRINT: PRINT \"It goes to the *next* line ten!\"\n10 FOR T=0 TO 1000: NEXT T\n10 PRINT \"Exceptions: Goes to *first* line ten\"\n10 PRINT \"  if the current line is ten, or\"\n10 PRINT \"  if a line number \u003e 10 is seen.\"\n10\n10 'Shouldn't 10 GOTO 10 go to itself?\n10 'It shouldn't have to search at all.\n10 'Maybe it's a bug in Tandy BASIC?\n10 'Perhaps it checks for (line \u003e= 10)\n10 'before it checks for (line == 10)?\n10\n10 'BTW, \u003e= is one less op than \u003e.\n10 'On 8085, \u003e= is CMP then check CY==0.\n10 'But \u003e also requires checking Z==0.\n10\n10 PRINT\n10 PRINT \"The next line is 9 GOTO 10. 
This time\"\n10 PRINT \"it goes to the FIRST line ten, because\"\n10 PRINT \"line 20 comes before the next line ten.\"\n9 GOTO 10\n20 ERROR \"Line 20 is never reached, but it has an effect because 20\u003e10.\"\n10 ERROR \"This is the final line ten. The previous GOTO 10 won't find it because line 20 comes first.\"\n10\n15\n20 \n0 PRINT \"This program examines how\"\n1 PRINT \"Model T computers run\"\n2 PRINT \"degenerate tokenized BASIC.\"\n3 PRINT \"Trying to load it as a .DO\"\n4 PRINT \"file will not work as\"\n5 PRINT \"Tandy BASIC corrects issues\"\n6 PRINT \"such as duplicate line numbers\"\n7 PRINT \"and out of order lines.\"\n8 PRINT \"Please use hackerb9's\"\n9 PRINT \"pre-tokenized GOTO10.BA.\"\n```\n\nTo run this on a Model 100, one must download the tokenized BASIC file,\n[GOTO10.BA](https://github.com/hackerb9/tokenize/raw/main/degenerate/GOTO10.BA),\nwhich is created from \n[GOTO10.DO](https://github.com/hackerb9/tokenize/raw/main/degenerate/GOTO10.DO)\nlike so:\n\n``` bash\nm100-tokenize GOTO10.DO\n```\n\n\u003c/ul\u003e\u003c/details\u003e \u003c!-- Abnormal code --\u003e\n\n\n## Machine compatibility\n\nAcross the eight Kyotronic-85 sisters, there are actually only two\ndifferent tokenized formats: \"M100 BASIC\" and \"N82 BASIC\". This\nprogram (currently) works only for the former, not the latter.\n\nThe three Radio-Shack portables (Models 100, 102 and 200), the Kyocera\nKyotronic-85, and the Olivetti M10 all share the same tokenized BASIC.\nThat means a single tokenized BASIC file _might_ work for any of\nthose, presuming the program does not use CALL, PEEK, or POKE.\nHowever, the NEC family of portables -- the PC-8201, PC-8201A, and\nPC-8300 -- run N82 BASIC. 
A tokenized N82 BASIC file cannot run on an\nM100 computer and vice versa, even for programs which share the same\nASCII BASIC source code.\n\n### Checksum differences are not a compatibility problem\n\nThe .BA files generated by `tokenize` aim to be exactly the same, byte\nfor byte, as the output from tokenizing on a Model 100 using `LOAD`\nand `SAVE`. There are some bytes, however, which can change and should\nbe ignored when testing if two tokenized programs are identical. \n\n\u003cdetails\u003e\u003csummary\u003eClick to read details on line number pointers...\u003c/summary\u003e\u003cul\u003e\n\nA peculiar artifact of the [`.BA` file format][fileformat] is that it\ncontains pointer locations offset by where the program happened to be\nin memory when it was saved. The pointers in the file are _never_ used\nas they are recalculated when the program is loaded into RAM.\n\nTo account for this variance when testing, the output of this program\nis intended to be byte-for-byte identical to:\n\n1. A Model 100\n2. that has been freshly reset\n3. with no other BASIC programs on it\n4. running `LOAD \"COM:88N1\"` and `SAVE \"FOO\"` while a host computer sends the ASCII BASIC program over the serial port.\n\nWhile the Tandy 102, Kyotronic-85, and M10 also appear to output files\nidentical to the Model 100, the Tandy 200 does not. The 200 has more\nROM than the other Model T computers, so it stores the first BASIC\nprogram at a slightly different RAM location (0xA000 instead of\n0x8000). 
This has no effect on compatibility between machines, but it\ndoes change the pointer offset.\n\nSince two `.BA` files can be the identical program despite having\ndifferent checksums, this project includes the `bacmp` program,\ndescribed below.\n\n[fileformat]: http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file \"Reverse engineered file format documentation\"\n\n\u003c/ul\u003e\u003c/details\u003e \u003c!-- Line number pointers --\u003e\n\n## Why Lex?\n\nThis program is written in\n[Flex](https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/handouts/050%20Flex%20In%20A%20Nutshell.pdf),\na lexical analyzer, because it made implementation trivial. The\ntokenizer itself, m100-tokenize, is mostly just a table of keywords\nand the corresponding byte they should emit. Flex handles special\ncases, like quoted strings and REMarks, easily.\n\nThe downside is that one must have flex installed to _modify_ the\ntokenizer. Flex is _not_ necessary to compile on a machine as flex\ngenerates portable C code. See the tokenize-cfiles.tar.gz in the\ngithub release or run `make cfiles`.\n\n\n## Miscellaneous notes\n\n* Tokenized BASIC files are binary files, not ASCII and as such cannot\n  be transferred easily using the builtin TELCOM program or the `LOAD\n  \"COM:\"` or `SAVE \"COM:\"` commands. Instead, one must use a program\n  such as [TEENY](https://youtu.be/H0xx9cOe97s) which can transfer\n  8-bit data.\n\n* To save in ASCII format on the Model 100, append `, A`:\n  ```BASIC\n  save \"FOO\", A\n  ```\n\n* This program accepts line endings as either `CRLF` (standard for a\n  Model 100 text document) or simply `LF` (UNIX style).\n\n\n## Testing\n\nRun `make check` to test this tokenizer on some [sample Model 100\nprograms](https://github.com/hackerb9/tokenize/tree/main/samples)\nincluding some strange ones designed specifically to exercise peculiar\nsyntax. 
The program `bacmp` is used to compare the generated .BA file\nwith one created on hackerb9's Tandy 200.\n\nFor example, \n[SCRAMB.DO](https://github.com/hackerb9/tokenize/blob/main/samples/SCRAMB.DO) \nhas input that is scrambled and redundant to check if m100-sanity is\nworking correctly. \n\n``` BASIC\n20 GOTO 10\n10 GOTO 10\n10 PRINT \"!dlroW ,olleH\"\n```\n\n## bacmp: BASIC comparator\n\nThe included\n[bacmp.c](https://github.com/hackerb9/tokenize/blob/main/bacmp.c)\nprogram may be useful for others who wish to discover if two tokenized\nBASIC files are identical. Because of the way the Model 100 works, a\nnormal [`cmp`](https://manpages.debian.org/cmp) test will fail. There\nare pointers from each BASIC line to the next which change based upon\nwhere the program happens to be in memory. The `bacmp` program handles\nthat by allowing line pointers to differ as long as they are offset by\na constant amount.\n\nNote that generating correct line pointers is actually unnecessary in\na tokenizer; they could be any arbitrary values. The Model 100 always\nregenerates the pointers when it loads a .BA file. However, some\nemulators insist on having them precisely correct, so this tokenizer\nhas followed suit.\n\n## More information\n\n* Hackerb9 has documented the file format of tokenized BASIC at\n  http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file\n\n## Known Bugs\n\n* None known. [Reports](https://github.com/hackerb9/tokenize/issues/new)\n  are gratefully accepted.\n\n\n## Alternatives\n\nHere are some other ways that you can tokenize Model 100 BASIC when\nthe source code and the tokenized version won't both fit in the M100's\nRAM. \n\n* Perhaps the simplest is to load the ASCII BASIC file to the M100 over\n  a serial port from a PC host. 
On the Model 100, type:\n  \n          LOAD \"COM:88N1\"\n\n  or for a Tandy 200, type\n\n          LOAD \"COM:88N1ENN\"\n  \n  On the host side, send the file followed by the `^Z` ('\\x1A')\n  character to signal End-of-File. See hackerb9's\n  [sendtomodelt](https://github.com/hackerb9/m100le/blob/main/adjunct/sendtomodelt)\n  for a script to make this easier.\n\n* Robert Pigford wrote a Model 100 tokenizer that runs in Microsoft Windows.\n  Includes PowerBASIC source code. \n  http://www.club100.org/memfiles/index.php?\u0026direction=0\u0026order=nom\u0026directory=Robert%20Pigford/TOKENIZE\n\n* Mike Stein's entoke.exe is a tokenizer for Microsoft DOS.\n  http://www.club100.org/memfiles/index.php?\u0026direction=0\u0026order=nom\u0026directory=Mike%20Stein\n  \n* The Model 100's ROM disassembly shows that the tokenization code is quite short.\n  http://www.club100.org/memfiles/index.php?action=downloadfile\u0026filename=m100_dis.txt\u0026directory=Ken%20Pettit/M100%20ROM%20Disassembly\n  \n* The VirtualT emulator can be scripted to tokenize a program through telnet.\n  https://sourceforge.net/projects/virtualt/\n\n[^2]: CRLF means Carriage Return, Line Feed. That is, `0x0D 0x0A`.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Ftokenize","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhackerb9%2Ftokenize","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackerb9%2Ftokenize/lists"}