Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/chenglou/elastic

Something something OCaml
https://github.com/chenglou/elastic
Last synced: about 2 months ago
JSON representation
Something something OCaml
Host: GitHub
URL: https://github.com/chenglou/elastic
Owner: chenglou
License: mit
Created: 2015-12-11T05:12:54.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2015-12-11T06:51:35.000Z (about 9 years ago)
Last Synced: 2024-10-13T11:15:51.034Z (3 months ago)
Language: OCaml
Size: 61.5 KB
Stars: 2
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project

README

        Multilingualization for OCaml source code

=========================================

The _m17n_ package allows using Unicode identifiers in the OCaml source code.

``` ocaml

type 色 =

| @赤

| @黄色

| @緑色

[@@deriving show]

let () = print_endline (show_色 @赤)

```

You're encouraged to use _m17n_.

Installation

------------

_m17n_ currently requires a trunk (unreleased) version of the OCaml

compiler. Thankfully, installing it (and some supplementary packages)

with [OPAM 1.2][opam] is as simple as:

``` sh

opam switch 4.03+trunk

opam pin add -y ppx_tools https://github.com/alainfrisch/ppx_tools

opam pin add -y m17n https://github.com/whitequark/ocaml-m17n

```

Note that _m17n_ is not compatible with [camlp4][].

[opam]: https://opam.ocaml.org

[camlp4]: https://github.com/ocaml/camlp4/

Usage

-----

_m17n_ can be activated using [ocamlfind][]:

``` sh

ocamlfind ocamlc -package m17n -syntax utf8 ...

```

If you are using [ocamlbuild][], add the following to your `_tags` file:

```

<**/*.{ml,mli}>: package(m17n), syntax(utf8)

```

_m17n_ also works in toplevel. It can be activated using:

```

#require "m17n";;

```

[ocamlfind]: http://projects.camlcity.org/projects/findlib.html

[ocamlbuild]: http://nicolaspouillard.fr/ocamlbuild/ocamlbuild-user-guide.html

Features

--------

_m17n_ expects the source code to be valid UTF-8. It extends the identifiers

normally recognized by OCaml to include all Unicode letters and digits.

The case distinction is also preserved; however, the files corresponding to

modules with non-English names must have the first character to be uppercase.

Since few of the world's scripts distinguish between upper and lower case,

a sigil is provided to disambiguate constructor and module names and all

other identifiers. When an identifier is prepended with `@`, it is treated

as if its first letter was uppercase.

See [technical details](#technical-details) for specifics.

_m17n_ is compatible with ppx syntax extensions such as [ppx_deriving][].

_m17n_ includes integration both with the standard OCaml toplevel and

[utop][].

_m17n_ does not add Unicode literals or any runtime support for manipulating

Unicode strings and characters. (See [OCaml pull #80][pr-uchar], [Uutf][], [Uunf][]

and [Uucd][] projects.)

[ppx_deriving]: https://github.com/whitequark/ppx_deriving

[pr-uchar]: https://github.com/ocaml/ocaml/pull/80

[uutf]: http://erratique.ch/software/uutf

[uunf]: http://erratique.ch/software/uunf

[uucd]: http://erratique.ch/software/uucd

[utop]: https://github.com/diml/utop

### Can't look-alike characters like a and а be confusing?

They can.

However, _m17n_ issues a warning if more than one script

is used in an identifier, hopefully handling most of the confusing

cases. You can still use several scripts if you separate them

by underscores, e.g. `show_色`.

Additionally, _m17n_ issues a warning if any two identifiers

look alike enough to be visually confusable.

### Localized error messages?

This will be possible when [PR6696] is fixed.

[PR6696]: http://caml.inria.fr/mantis/view.php?id=6696

### Localized keywords?

It is trivial to add support for these. What's hard is coming up with

a good localization. I tried to make one, but committing it felt

worse than kicking a bucket of kittens. If you have one, please

[open an issue][issue].

### Are RTL scripts supported?

In theory, yes. However I lack ability to verify whether the RTL support

works correctly. [Open an issue][issue] if it is not.

[issue]: https://github.com/whitequark/ocaml-m17n/issues

Technical details

-----------------

### Unicode handling

_m17n_ includes only five changes to the OCaml lexer:

  * U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR),

    U+000C FORM FEED (FF), U+0020 SPACE and U+3000 IDEOGRAPHIC SPACE are

    recognized as whitespace,

  * Characters of [General Category][gc] [Lu][gcv] are recognized as uppercase letters

    at the start of an identifier,

  * Characters of [General Categories][gc] [Ll][gcv], [Lm][gcv], [Lo][gcv], [Lt][gcv] and

    U+005F LOW LINE are recognized as lowercase letters at the start of an identifier,

  * Characters with property [ID_Continue][d1] are recognized as continuation of

    an identifier,

  * U+0040 COMMERCIAL AT makes the following lowercase or unicase letter recognized

    as an uppercase letter. If it is possible to map the letter to upper case,

    this is done.

To summarize, an identifier may start with [ID_Start][d1], and continue

with [ID_Continue][d1].

All identifiers are normalized to [NFC][nf]. However, strings are not normalized.

These rules closely follow the recommendations of [Unicode TR31][tr31].

Formally, _m17n_ conforms to Unicode 6.3 [UAX31 Level 1][C2], observing [R1][]

and [R3][] with the profile specified above and [R4][] unconditionally

with normalization to [NFC][nf].

[gc]: http://www.unicode.org/reports/tr44/#General_Category

[gcv]: http://www.unicode.org/reports/tr44/#General_Category_Values

[d1]: http://unicode.org/reports/tr31/#Default_Identifier_Syntax

[nf]: http://www.unicode.org/reports/tr15/#Norm_Forms

[c2]: http://unicode.org/reports/tr31/#C2

[r1]: http://unicode.org/reports/tr31/#R1

[r3]: http://unicode.org/reports/tr31/#R3

[r4]: http://unicode.org/reports/tr31/#R4

[tr31]: http://unicode.org/reports/tr31/

### Detecting confusable characters

_m17n_ follows [Unicode TR39][tr39] at Unicode 7.0.0 in its handling

of confusable characters.

Within a single identifier chunk (a part of an identifier separated by

U+005F LOW LINE), all characters should satisfy the [Highly Restrictive][highrst]

mixed script detection.

Within a single source chunk, all identifiers should not be [confusable][confusable]

according to the [Mixed-Script Any-Case][confusMA] table.

If any of this is not true, _m17n_ issues a warning.

[tr39]: http://www.unicode.org/reports/tr39

[highrst]: http://www.unicode.org/reports/tr39/#highly_restrictive

[confusable]: http://www.unicode.org/reports/tr39/#Confusable_Detection

[confusMA]: http://www.unicode.org/reports/tr39/#ma

### Interaction with filesystem

OCaml uses module names to search the include path for referenced modules.

As the module names are normalized to [NFC][nf], the queries to the filesystem

use the same form. Different operating systems handle them in different ways:

  * Mac OS X on HFS+ stores the filenames in [NFD][nf] and normalizes all input

    to [NFD][nf]. No edge cases possible.

  * Other *nix systems such as Linux treat filenames as opaque `/`-delimited,

    `NUL`-finalized streams of bytes, however essentially all existing input

    methods normalize to [NFC][nf].

  * Windows treats filenames as opaque streams of UTF-16 characters with

    somewhat [more complex][winfn] interpretation. It performs its own

    case folding (in most cases; case-sensitive Windows filesystems

    [exist][wincs]), but no normalization. Its input methods normalize to

    [NFC][nf] as well.

_m17n_ aims to reduce possible confusion by looking into the include

directories and looking for any OCaml build products whose basenames are

identical to the names of any referenced modules under toNFKC_Casefold

(definition R5 in [Unicode 6.3 section 3.13][u63]), but not as-is.

This measure should be enough to not only catch all instances of

mis-normalized filenames, but also incorrect capitalization and some

look-alike characters.

[winfn]: http://msdn.microsoft.com/en-us/library/aa365247(v=VS.85).aspx

[wincs]: https://support.microsoft.com/KB/100625

[u63]: http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

### Interaction with the OCaml compiler

The OCaml compiler has an `-pp` option which, among other things, allows

to provide a binary that accepts source code and emits an OCaml abstract

syntax tree, thus allowing to implement a custom frontend. (This is

what [camlp4][] uses.)

Additionally, the OCaml compiler exports its internals, including

the parser, in a package `compiler-libs`, thus allowing to avoid

reimplementing the parser in a custom frontend.

The compiler treats the identifiers as opaque tokens almost everywhere.

It does not even concatenate them, which is important, as [NFC][nf]

[is not generally closed under concatenation][nfnotclosed]. The only place

where the compiler actually dissects the strings is the module name → filename

mapping. However, it ignores bytes with values over 127, passing UTF-8

strings through.

Findlib provides an interface that allows registering a preprocessor.

Additionally, it will pass all package include paths to such a preprocessor.

_m17n_ uses all these features and implementation details to provide

a seamless Unicode-aware frontend.

[nfnotclosed]: http://www.unicode.org/reports/tr15/#Concatenation

License

-------

[MIT license](LICENSE.txt)