{"id":13726214,"url":"https://github.com/ocaml-community/sedlex","last_synced_at":"2025-06-23T06:42:36.465Z","repository":{"id":1742409,"uuid":"12734005","full_name":"ocaml-community/sedlex","owner":"ocaml-community","description":"An OCaml lexer generator for Unicode","archived":false,"fork":false,"pushed_at":"2024-11-01T21:41:39.000Z","size":3670,"stargazers_count":240,"open_issues_count":20,"forks_count":43,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-11-01T22:19:59.130Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"OCaml","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ocaml-community.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-09-10T15:55:17.000Z","updated_at":"2024-11-01T21:41:43.000Z","dependencies_parsed_at":"2024-10-22T14:15:37.187Z","dependency_job_id":null,"html_url":"https://github.com/ocaml-community/sedlex","commit_stats":{"total_commits":205,"total_committers":30,"mean_commits":6.833333333333333,"dds":0.551219512195122,"last_synced_commit":"cea846d90578e2b04a9fdd5decc11bb88f9fb35e"},"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ocaml-community%2Fsedlex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ocaml-community%2Fsedlex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ocaml-community%2Fsedlex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/ocaml-community%2Fsedlex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ocaml-community","download_url":"https://codeload.github.com/ocaml-community/sedlex/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224654176,"owners_count":17347683,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T01:02:55.963Z","updated_at":"2024-11-14T16:33:31.943Z","avatar_url":"https://github.com/ocaml-community.png","language":"OCaml","funding_links":[],"categories":["OCaml","Tools","Compilers and Compiler Tools"],"sub_categories":[],"readme":"# sedlex\n\n[![build](https://github.com/ocaml-community/sedlex/actions/workflows/build.yml/badge.svg)](https://github.com/ocaml-community/sedlex/actions/workflows/build.yml)\n\nUnicode-friendly lexer generator for OCaml.\n\nThis package is licensed by LexiFi under the terms of the MIT license.\n\nsedlex was originally written by Alain Frisch\n\u003calain.frisch@lexifi.com\u003e and is now maintained as part of the\nocaml-community repositories on github.\n\n## API\nThe API is documented [here](https://ocaml-community.github.io/sedlex).\n\n## Overview\n\nsedlex is a lexer generator for OCaml, similar to ocamllex, but\nsupporting Unicode.  Contrary to ocamllex, lexer specifications for\nsedlex are embedded in regular OCaml source files.\n\nThe lexers work with a new kind of \"lexbuf\", similar to ocamllex\nLexing lexbufs, but designed to support Unicode, and abstracting from\na specific encoding.  
A single lexer can work with arbitrary encodings\nof the input stream.\n\nsedlex is the successor of the ulex project. Contrary to ulex, which\nwas implemented as a Camlp4 syntax extension, sedlex is based on the\nnew \"-ppx\" technology of OCaml, which allows rewriting OCaml parse\ntrees through external rewriters. (And what better name than \"sed\"\nfor a rewriter?)\n\nLike any -ppx rewriter, sedlex does not touch the concrete syntax of the\nlanguage: lexer specifications are written in source files which comply\nwith the standard grammar of OCaml programs. sedlex reuses the syntax\nfor pattern matching in order to describe lexers (regular expressions\nare encoded within OCaml patterns). A nice consequence is that your\neditor (vi, emacs, ...) won't get confused (indentation, coloring) and\nyou don't need to learn new priority rules. Moreover, sedlex is\ncompatible with any front-end parsing technology: it works fine even\nif you use camlp4 or camlp5, with the standard or revised syntax.\n\n\n## Lexer specifications\n\n\nsedlex adds a new kind of expression to OCaml: lexer definitions.\nThe syntax for the new construction is:\n\n```ocaml\n  match%sedlex lexbuf with\n  | R1 -\u003e e1\n  ...\n  | Rn -\u003e en\n  | _  -\u003e def\n```\n\nor:\n\n```ocaml\n  [%sedlex match lexbuf with \n  | R1 -\u003e e1\n  ...\n  | Rn -\u003e en\n  | _  -\u003e def\n  ]\n```\n\n(The first vertical bar is optional as in any OCaml pattern matching.\nGuard expressions are not allowed.)\n\nwhere:\n- lexbuf is an arbitrary lowercase identifier, which must refer to\n  an existing value of type `Sedlexing.lexbuf`;\n- the Ri are regular expressions (see below);\n- the ei and def are OCaml expressions (called actions) of the same type\n  (the type for the whole lexer definition).\n\nUnlike ocamllex, lexers work on streams of Unicode codepoints, not\nbytes.\n\nThe actions can call functions from the Sedlexing module to extract\n(parts of) the matched lexeme, in the desired encoding.\n\nRegular 
expressions are syntactically OCaml patterns:\n\n- `\"....\"` (string constant): recognize the specified string\n- `'....'` (character constant) : recognize the specified character\n- `i` (integer constant) : recognize the specified codepoint\n- `'...' .. '...'`: character range\n- `i1 .. i2`: range between two codepoints\n- `R1 | R2` : alternation\n- `R1, R2, ..., Rn` : concatenation\n- `Star R` : Kleene star (0 or more repetitions)\n- `Plus R` : equivalent to `R, R*`\n- `Opt R` : equivalent to `(\"\" | R)`\n- `Rep (R, n)` : equivalent to `R{n}`\n- `Rep (R, n .. m)` : equivalent to `R{n, m}`\n- `Chars \"...\"` : recognize any character in the string\n- `Compl R` : assume that `R` is a single-character length regexp (see below)\n  and recognize the complement set\n- `Sub (R1,R2)` : assume that `R1` and `R2` are single-character length regexps (see below)\n  and recognize the set of items in `R1` but not in `R2` (\"subtract\")\n- `Intersect (R1,R2)` : assume that `R1` and `R2` are single-character length regexps (see\n  below) and recognize the set of items which are in both `R1` and `R2`\n- `lid` (lowercase identifier) : reference a named regexp (see below)\n\nA single-character length regexp is a regexp which does not contain (after\nexpansion of references) concatenation, Star, Plus, Opt or string constants\nwith a length different from one.\n\n\n\nNote:\n - The OCaml source is assumed to be encoded in Latin1 (for string\n   and character literals).\n\n\nIt is possible to define named regular expressions with the following\nconstruction, which can appear in place of a structure item:\n\n```ocaml\n  let lid = [%sedlex.regexp? R]\n```\n\nwhere lid is the regexp name to be defined and R its definition.  The\nscope of the \"lid\" regular expression is the rest of the structure,\nafter the definition.\n\nThe same syntax can be used for local binding:\n\n```ocaml\n  let lid = [%sedlex.regexp? 
R] in\n  body\n```\n\nThe scope of \"lid\" is the body expression.\n\n\n## Predefined regexps\n\nsedlex provides a set of predefined regexps:\n- any: any character\n- eof: the virtual end-of-file character\n- xml_letter, xml_digit, xml_extender, xml_base_char, xml_ideographic,\n  xml_combining_char, xml_blank: as defined by the XML recommendation\n- tr8876_ident_char: characters allowed in identifiers according to ISO TR8876\n- cc, cf, cn, co, cs, ll, lm, lo, lt, lu, mc, me, mn, nd, nl, no, pc, pd,\n  pe, pf, pi, po, ps, sc, sk, sm, so, zl, zp, zs: as defined by the\n  Unicode standard (categories)\n- alphabetic, ascii_hex_digit, hex_digit, id_continue, id_start,\n  lowercase, math, other_alphabetic, other_lowercase, other_math,\n  other_uppercase, uppercase, white_space, xid_continue, xid_start: as\n  defined by the Unicode standard (properties)\n\n\n## Running a lexer\n\nSee the interface of the Sedlexing module for a description of how to\ncreate lexbuf values (from strings, streams or channels encoded in\nLatin1, utf8 or utf16, or from integer arrays or streams representing\nUnicode code points).\n\nIt is possible to work with a custom implementation for lex buffers.\nTo do this, you just have to ensure that a module called Sedlexing is\nin the scope of your lexer specifications, and that it defines at least\nthe following functions: start, next, mark, backtrack.  
See the interface\nof the Sedlexing module for more information.\n\n\n\n## Using sedlex\n\nThe quick way:\n\n```\n   opam install sedlex\n```\n\n\nOtherwise, the first thing to do is to compile and install sedlex.\nYou need a recent version of OCaml and [dune](https://dune.build/).\n\n```\n  make\n```\n\n### With findlib\n\nIf you have findlib, you can use it to install and use sedlex.\nThe name of the findlib package is \"sedlex\".\n\nInstallation (after \"make\"):\n\n```\n  make install\n```\n\nCompilation of OCaml files with lexer specifications:\n\n```\n  ocamlfind ocamlc -c -package sedlex.ppx my_file.ml\n```\n\nWhen linking, you must also include the sedlex package:\n\n```\n  ocamlfind ocamlc -o my_prog -linkpkg -package sedlex.ppx my_file.cmo\n```\n\n\nThere is also a sedlex.ppx subpackage containing the code of the ppx\nfilter.  This can be used to build custom drivers (combining several ppx\ntransformations in a single process).\n\n\n### Without findlib\n\nYou can use sedlex without findlib. To compile, you need to run the\nsource file through -ppx rewriter ppx_sedlex. 
Moreover, you need to\nlink the application with the runtime support library for sedlex\n(sedlexing.cma / sedlexing.cmxa).\n\n### With utop\n\nOnce sedlex is installed as per above, simply type\n\n```\n#require \"sedlex.ppx\";;\n```\n\n## Examples\n\nThe `examples/` subdirectory contains several samples of sedlex in use.\n\n## Contributors\n\n- Benus Becker: implementation of Utf16\n- sghost: for Unicode 6.3 categories and properties\n- Peter Zotov:\n  - improvements to the build system\n  - switched parts of ppx_sedlex to using concrete syntax (with ppx_metaquot)\n- Steffen Smolka: port to dune\n- Romain Beauxis:\n  - Implementation of the unicode table extractors\n  - General maintenance\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focaml-community%2Fsedlex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Focaml-community%2Fsedlex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focaml-community%2Fsedlex/lists"}