https://github.com/lindig/ocaml-hyphenate

OCaml module for the hyphenation of words based on TeX hyphenation patterns
Host: GitHub
URL: https://github.com/lindig/ocaml-hyphenate
Owner: lindig
License: bsd-2-clause
Created: 2012-09-09T06:23:39.000Z (about 12 years ago)
Default Branch: master
Last Pushed: 2015-08-22T13:25:55.000Z (about 9 years ago)
Last Synced: 2023-03-25T00:23:29.059Z (over 1 year ago)
Language: OCaml
Size: 219 KB
Stars: 8
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# Hyphenation

This software implements the hyphenation of words as an Objective Caml
module. It is based on the algorithm implemented in TeX and uses the
hyphenation patterns provided for TeX.

## Requirements and Compilation

To compile the software from source code you need:

* A Unix system like Linux or MacOS X. Objective Caml is also available
  for other platforms and the code is portable, but the build process
  makes no attempt to support non-Unix platforms.

* Objective Caml

* Make

* Lipsum (http://github.com/lindig/lipsum). This is a tool to support
  literate programming and is needed to extract the source code from this
  literate program. Lipsum is implemented in Objective Caml as well and you
  could add it as a Git submodule.

To build the software it should suffice to run Make. Please take a look at

the `Makefile`. It supports downloading and building Lipsum.

    $ make

## Demo

Running Make builds a small demo application that can be used to hyphenate

words from a text file or the command line.

    $ ./demo.native Compilation Requirements

    com-pi-la-tion

    re-quire-ments

    $ ./demo.native -h

    demo usage:

    demo -f file.txt          hyphenate words in file.txt

    demo word ..              hyphenate arguments

    demo -h                   emit help

    demo -d                   emit hyphenation patterns

    demo reads words from a file or the command line and

    emits them hyphenated to stdout. Before hyphenation, words are

    turned to lower case. demo uses built-in patterns for

    US English.

    (c) 2012 Christian Lindig 

    https://github.com/lindig/ocaml-hyphenate

## Performance

Hyphenating a word takes about 15 microseconds on a 2.4 GHz MacBook Pro
(3.476 s of wall-clock time for the 234,936 words in `/usr/share/dict/words`):

    time ./demo.native -f /usr/share/dict/words | wc

      234936  234936 2932821

    real	0m3.476s

    user	0m3.120s

    sys	    0m1.102s

The implementation surely leaves a lot of room for optimization, but I
can't justify the effort now. It creates many substrings from the word
being hyphenated, and Objective Caml represents these as copies. A string
representation that only uses indices into an existing string could avoid
the copies. Another candidate for optimization is the representation of
hyphenation patterns, which the implementation stores in a hash table. I
haven't looked into this, but I believe a better representation would take
advantage of the fact that lookup operations are correlated: they result
from sliding a window over the word to be hyphenated, so two consecutive
lookups use keys that are quite similar. It should be possible to exploit
this. The implementation of TeX would be the first place to look for ideas.

## Ideas for Future Work

* The interface could provide an additional function that returns the
  result not as a list of strings but as a list or array of points that
  indicate where to split a word.

* Maybe we need a function that suggests a particular hyphenation based
  on space requirements. But maybe this is better left to client code.

* Implement Unicode handling (and hence give up simple strings for the
  representation of words).
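
As a rough illustration of the first idea, split points can already be derived
from the list of parts that a hyphenation function returns. The helper
`points_of_parts` below is hypothetical and not part of the module's interface;
it is a minimal sketch that reports the character offsets at which the original
word is split.

```ocaml
(* Hypothetical helper, not part of Hyphenate: turn the parts of a
   hyphenated word into the offsets where the original word is split. *)
let points_of_parts (parts : string list) : int list =
  let rec loop offset acc = function
    | [] | [ _ ] -> List.rev acc (* no split point after the last part *)
    | p :: rest ->
        let offset = offset + String.length p in
        loop offset (offset :: acc) rest
  in
  loop 0 [] parts

let () =
  (* parts as printed by the demo for "compilation" *)
  points_of_parts [ "com"; "pi"; "la"; "tion" ]
  |> List.map string_of_int
  |> String.concat " "
  |> print_endline
```

A function inside the module could compute the same offsets directly from the
array of break values, without building the substrings first.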

## References

TeX's hyphenation algorithm is detailed in _The TeXbook_ by Donald Knuth in

Appendix H.

## Copyright

Copyright (c) 2012, 2014 Christian Lindig 

All rights reserved.

Redistribution and use in source and binary forms, with or

without modification, are permitted provided that the following

conditions are met:

1.  Redistributions of source code must retain the above copyright

    notice, this list of conditions and the following disclaimer.

2.  Redistributions in binary form must reproduce the above copyright

    notice, this list of conditions and the following disclaimer in

    the documentation and/or other materials provided with the

    distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND

CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,

INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF

MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE

DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR

CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,

SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT

LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF

USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED

AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT

LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN

ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE

POSSIBILITY OF SUCH DAMAGE.

## Encodings

This code works only with encodings where a character corresponds to a

byte.

## Hyphenate -- The Interface

The core algorithm is implemented in module Hyphenate with an accompanying

interface.

Hyphenation is language specific: hyphenation patterns must be loaded from
a file using `load` before words can be hyphenated using `hyphenate`. A
pattern dictionary (a value of type `t`) is a mutable abstraction.
`hyphenate` splits a given word into disjoint substrings:

    <>=

    (** Module [Hyphenate]: split a word according to TeX hyphenation patterns.

        This module is not Unicode aware.

    

        @author Christian Lindig 

    *)

    

    exception Error of string

    (** signaling errors *)

    

    type t

    (** mutable hyphenation pattern dictionary for a given language *)

    

    type path = string

    (** file path *)

    

    

    <>=

    val make: minpre:int  ->  minpost:int ->  minlen:int  ->  t

    (** Create empty pattern dictionary.

        @param minpre min number of characters before the first hyphen

        @param minpost min characters after the last hyphen

        @param minlen min length of word that gets hyphenated *)

    

    val add:  t -> string -> unit

    (** [add t pat] add pattern [pat] to dictionary [t]; mutates it *)

    

    val load: path -> minpre:int -> minpost:int -> minlen:int -> t

    (** [load path minpre minpost minlen] loads patterns from [path]

        to create a pattern dictionary from it.

        @param path named file with TeX patterns

        @param minpre min number of characters before the first hyphen

        @param minpost min characters after the last hyphen

        @param minlen min length of word that gets hyphenated

        @return a new pattern dictionary

        @raise Error when [path] can't be loaded

        *)

    

    val dump: t -> unit

    (** [dump t] dumps pattern dictionary to stdout for debugging *)

    

    val hyphenate: t -> string -> string list

    (** [hyphenate t word] hyphenates [word] according to patterns [t] *)

    

## Hyphenation Patterns

TeX encodes language-specific hyphenation patterns in files with one pattern

per line:

    .ba5na

    .bas4e

    .ber4

    .be5ra

A pattern is a sequence of letters interspersed with digits. A digit
greater than zero marks (roughly) a position that matters for hyphenation:
odd digits favor a break there, even digits discourage one. In the above
format, the digit `0` may be omitted for brevity. Hence, the patterns
above could be expanded to:

    0.0b0a5n0a0

    0.0b0a0s4e0

    0.0b0e0r4

    0.0b0e5r0a0
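
The expansion can be sketched in a few lines of OCaml. The function `expand`
is a hypothetical helper written for this illustration only; the module itself
does not expose it.

```ocaml
(* Expand a TeX pattern so that every gap carries an explicit digit.
   [expand] is a hypothetical helper for illustration only. *)
let is_digit c = '0' <= c && c <= '9'

let expand (pat : string) : string =
  let buf = Buffer.create ((2 * String.length pat) + 1) in
  (* [pending] holds the digit seen since the last letter, if any *)
  let pending = ref None in
  String.iter
    (fun c ->
      if is_digit c then pending := Some c
      else begin
        (* emit the digit in front of this letter, defaulting to '0' *)
        Buffer.add_char buf (Option.value !pending ~default:'0');
        Buffer.add_char buf c;
        pending := None
      end)
    pat;
  (* the gap after the last letter *)
  Buffer.add_char buf (Option.value !pending ~default:'0');
  Buffer.contents buf

let () =
  List.iter
    (fun p -> Printf.printf "%-8s %s\n" p (expand p))
    [ ".ba5na"; ".bas4e"; ".ber4"; ".be5ra" ]
```

Note that the dot is treated like any other letter here, so a pattern with
_n_ non-digit characters always expands to _n+1_ digits.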

A dot at the beginning or end of a pattern designates the beginning or end
of a word. Before a word is matched against the patterns, a dot is added
at both ends so that a pattern with a dot matches only at the beginning or
end of the word.

A pattern file may contain comments. A comment (just like in TeX) starts

with a `%` and reaches until the end of the line.
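
To make the comment rule concrete, here is a minimal sketch in plain OCaml
that drops comments and collects the remaining patterns. It is not the scanner
used by this project: the names `strip_comment` and `patterns_of_string` are
made up for the example, patterns are assumed to be separated by spaces or
newlines, and no syntax checking is done.

```ocaml
(* Drop a TeX-style comment: everything from the first '%' to the end
   of the line. Hypothetical helper for illustration. *)
let strip_comment (line : string) : string =
  match String.index_opt line '%' with
  | Some i -> String.sub line 0 i
  | None -> line

(* Collect the patterns of a pattern file whose contents are given as a
   string. Assumes patterns are separated by spaces or newlines. *)
let patterns_of_string (text : string) : string list =
  String.split_on_char '\n' text
  |> List.map strip_comment
  |> List.concat_map (String.split_on_char ' ')
  |> List.map String.trim
  |> List.filter (fun s -> s <> "")

let () =
  ".ba5na % a comment\n.bas4e\n% only a comment\n.ber4 .be5ra\n"
  |> patterns_of_string
  |> List.iter print_endline
```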

The file `hyphen.tex` with hyphenation patterns for US English is from the

TeX distribution at

[www.tex.ac.uk](http://www.tex.ac.uk/tex-archive/macros/plain/base/hyphen.tex).

More language-specific patterns can be found at

[tug.org/tex-hyphen](http://tug.org/tex-hyphen)

## Reading Patterns -- The Interface

To read a pattern file we use a lexical scanner. The scanner implements a

function `read` that returns the next pattern.

    <>=

    exception Error of string       (** reports syntax errors in patterns*)

    

    (** hyphenation pattern *)

    type entry =

        | EOF                       (** end of file *)

        | Pattern of string         (** pattern *)

    

    val read:   Lexing.lexbuf -> entry

    (** read next pattern

        @raise Error for syntax errors *)

    

    val words:  Lexing.lexbuf -> string list

    (** read all patterns into a list for testing *)

    

## Reading Patterns -- The Implementation

Below is the general organization of a scanner specification for the
ocamllex scanner generator.

    <>=

    {

        <>

    }

    <>

    {

        <>

    }

    

    

The prologue contains generally useful definitions that can be used

in the rest of the file.

    <>=

    exception Error of string

    let error fmt   = Printf.kprintf (fun msg -> raise (Error msg)) fmt

    

    type entry = EOF | Pattern of string

    

    let get         = Lexing.lexeme

    let (@@) f x    = f x       (* function application *)

    let (@.) f g x  = f (g x)   (* function composition *)

    

    

    <>=

    let digit       = ['0'-'9']

    let lowercase   = ['a'-'z']

    let uppercase   = ['A'-'Z']

    let alpha       = lowercase | uppercase

    let dot         = '.'

    let pat         = dot? digit? (lowercase digit?)+ dot?

    let comment     = '%' [^'\n']* '\n'

    let ws          = [' ' '\t' '\r' '\n']

    

    

Rule `token` recognizes a pattern as it appears in a file, skipping
whitespace and comments.

    <>=

    rule token = parse

        eof         { EOF }

      | ws+         { token lexbuf }

      | comment     { token lexbuf }

      | pat         { Pattern (get lexbuf) }

      | _           { error "illegal pattern: %s" @@ get lexbuf }

    

    

Rule `words` splits the input into words by capturing sequences of
letters. The words are lowercased and collected in a list, which is built
in reverse and therefore must be reversed before it is returned. Be aware
that only `a` to `z` and `A` to `Z` are considered letters that make up
words. The `words` scanner is used only by the demo application to split
a text file into words; module Hyphenate does not use it.

    <>=

    and words ws = parse

         eof        { List.rev ws }

      |  alpha+     { words ((String.lowercase_ascii @@ get lexbuf) :: ws) lexbuf }

      |  _          { words ws lexbuf }  (* skip *)

    

    

    

Function `read` returns the next pattern from the file, or `EOF` when the
input is exhausted. Function `words` returns the words in a file (in lower
case).

    <>=

    let read:   Lexing.lexbuf -> entry          = token

    let words:  Lexing.lexbuf -> string list    = words []

    

## Hyphenate -- The Implementation

This module implements the hyphenation algorithm using hyphenation

patterns. Hyphenation patterns for a language are simply stored in a

hash table mapping strings of length _n_ to _n+1_ possible hyphenation

points. We also remember the maximum and minimum pattern size such that we

can avoid searching for any pattern that falls outside these bounds.

    <>=

    type path       = string      (* a file path *)

    type t   =

        { patterns:             (string, int array) Hashtbl.t (* key, value *)

        ; mutable maxpatlen:    int (* longest key in patterns *)

        ; mutable minpatlen:    int (* shortest key in patterns *)

        ; minpre:               int

        ; minpost:              int

        ; minlen:               int (* shorter words are not hyphenated *)

        }

    

    let make ~minpre ~minpost ~minlen =                 (* create empty value *)

        { patterns          =   Hashtbl.create 4999 (* a prime number *)

        ; maxpatlen         =   0

        ; minpatlen         =   max_int

        ; minpre            =   minpre (* at least 1 char before 1st hyphen *)

        ; minpost           =   minpost

        ; minlen            =   minlen

        }

    

    

    <>=

    exception Error of string

    let error msg = raise (Error msg)

    

    

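Some small utilities. Function `debug` is basically a printf function
for _stderr_. This is not very clever as we would like to avoid evaluating
its arguments when we are not debugging. The chunk defines `debug` twice:
the second definition shadows the first, so swapping the two lines turns
debug output off.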

    <>=

    let debug fmt   = Printf.kprintf (fun msg -> ()) fmt

    let debug fmt   = Printf.kprintf (fun msg -> prerr_string msg) fmt

    

    

    <>=

    let (@.) f g x  = f (g x)   (* function composition *)

    let (@@) f x    = f x

    

    

Function `finally f x cleanup` provides resource cleanup in the presence
of exceptions: `f x` is computed as a result and `cleanup x` is guaranteed
to run afterwards. (In many cases `cleanup` will not use its argument `x`
but it can be convenient to have access to it.)

    <>=

    type 'a result = Success of 'a | Failed of exn

    let finally f x cleanup =

        let result =

            try Success (f x) with exn -> Failed exn

        in

            cleanup x;

            match result with

            | Success y  -> y

            | Failed exn -> raise exn
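
The following sketch shows the behaviour for both a normal return and an
exception. It restates `finally` with the standard `Ok`/`Error` constructors so
that it is self-contained; the counter `cleanups` and the function `count` are
made up for the example.

```ocaml
(* Same behaviour as [finally] above, restated so the example runs alone. *)
let finally f x cleanup =
  let result = try Ok (f x) with exn -> Error exn in
  cleanup x;
  match result with Ok y -> y | Error exn -> raise exn

(* hypothetical cleanup action that only counts how often it ran *)
let cleanups = ref 0
let count _ = incr cleanups

(* normal return: the result of [f x] is passed through *)
let ok = finally (fun n -> n + 1) 41 count

(* exception: cleanup still runs, then the exception is re-raised *)
let failed =
  try finally (fun _ -> failwith "boom") 0 count with Failure msg -> msg

let () = Printf.printf "%d %s %d\n" ok failed !cleanups
```

On OCaml 4.08 and later, the standard library's `Fun.protect ~finally` covers a
similar need.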

    

    

Below are functions that split a pattern as it is read from a file

into a pattern value.

First some predicates to classify characters as letters and digits. Every
character that is not a digit counts as a letter, including the dot.
`int_of` computes the integer value of a digit.

    <>=

    let is_digit = function

        | '0'..'9' -> true

        | _        -> false

    

    let is_letter = not @. is_digit

    

    let int_of (c:char): int =

        assert ('0' <= c && c <= '9');

        Char.code c - Char.code '0'

    

    

Function `foldstr f zero str` iterates over string `str` from left to

right and applies `f` to each character and an intermediate value. The

initial intermediate value is `zero` and the next one is the value returned

by `f` in the previous iteration.

    <>=

    let foldstr f zero str =

        let limit = String.length str in

        let rec loop i acc =

            if   i = limit

            then acc

            else loop (i+1) (f acc str.[i])

        in

            loop 0 zero
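
Two small uses may help to see how the accumulator is threaded through the
string. The functions `digit_sum` and `rev_chars` are made up for this example;
`foldstr` is repeated so that the snippet runs on its own.

```ocaml
let foldstr f zero str =
  let limit = String.length str in
  let rec loop i acc =
    if i = limit then acc else loop (i + 1) (f acc str.[i])
  in
  loop 0 zero

(* Sum of all digits that occur in a string *)
let digit_sum : string -> int =
  foldstr
    (fun sum c ->
      if '0' <= c && c <= '9' then sum + Char.code c - Char.code '0' else sum)
    0

(* Characters of a string in reverse order: each character is consed
   onto the list built so far *)
let rev_chars (s : string) : char list = foldstr (fun acc c -> c :: acc) [] s

let () = Printf.printf "%d\n" (digit_sum ".be5ra")
```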

    

    

`letters` counts the number of letters in a string.

    <>=

    let letters (word:string): int =

        foldstr (fun n c -> if is_letter c then n+1 else n) 0 word

    

    

Function `normalize` takes a pattern as it is read from a pattern file and

splits it into a pattern -- a string of length _n_ and an array of size

_n+1_. For a letter at position _i_ in the string, the array indices _i_

and _i+1_ assign a value to points before and after the letter that

indicates its suitability as a breakpoint. The function first creates a

string of spaces and an array initialized with zeroes and fills both as it

scans the pattern that was read from the file. We rely on the scanner to
guarantee that the `texword` argument has a suitable format.

    <>=

    let normalize (texword:string): string * int array =

        let n      = letters texword in

        let word   = Bytes.make n ' ' in

        let breaks = Array.make (n+1) 0 in

        let scan i c =

            if is_letter c

            then (Bytes.set word i c    ; i+1)

            else (breaks.(i) <- int_of c; i  )

        in

            ( ignore (* int *) (foldstr scan 0 texword)

            ; Bytes.to_string word, breaks

            )
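
As a worked example, the sketch below normalizes the pattern `.ba5na`. It is a
compact restatement written with `String.iter` so that it runs on its own; it
is meant to behave like `normalize` above but is not the project's code.

```ocaml
let is_digit c = '0' <= c && c <= '9'

(* Split a TeX pattern into its non-digit characters and the break
   values around them. Compact restatement for illustration. *)
let normalize (texword : string) : string * int array =
  (* count the non-digit characters first to size both results *)
  let n = ref 0 in
  String.iter (fun c -> if not (is_digit c) then incr n) texword;
  let word = Bytes.make !n ' ' in
  let breaks = Array.make (!n + 1) 0 in
  let i = ref 0 in
  String.iter
    (fun c ->
      if is_digit c then breaks.(!i) <- Char.code c - Char.code '0'
      else begin
        Bytes.set word !i c;
        incr i
      end)
    texword;
  (Bytes.to_string word, breaks)

let () =
  let word, breaks = normalize ".ba5na" in
  Printf.printf "%s %s\n" word
    (String.concat " " (Array.to_list (Array.map string_of_int breaks)))
```

The digit `5` ends up at index 3 of the array, the gap between the `a` at
index 2 and the `n` at index 3 of the key `.bana`.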

    

For debugging, we join a word and its break points into a string again

that can be easily printed. In a sense, `join` is a dual to `normalize`.

    <>=

    let join (word:string) (breaks:int array): string =

        assert (Array.length breaks = String.length word + 1);

        let i2c i = Char.chr (i + Char.code '0') in

        let str = Bytes.make (Array.length breaks + String.length word) ' ' in

        for i = 0 to String.length word - 1 do

            ( Bytes.set str (i*2)   (i2c breaks.(i))

            ; Bytes.set str (i*2+1) word.[i]

            )

        done;

        Bytes.set str (String.length word * 2) @@ i2c breaks.(String.length word);

        Bytes.to_string str

    

    

`add` adds a new pattern to dictionary `t` and maintains the `minpatlen`

and `maxpatlen` fields.

    <>=

    let add (t:t) pattern: unit =

        let word, breaks = normalize pattern in

            ( Hashtbl.add t.patterns word breaks

            ; t.maxpatlen <- max t.maxpatlen (String.length word)

            ; t.minpatlen <- min t.minpatlen (String.length word)

            )

    

    

The `slide` function takes a string and a function `f` and applies it to

all substrings of length `n`, starting on the left of the string. Hence, a

window of size `n` is slid over the string and each content is passed to

`f`. In addition, `f` is passed the index of the first character of the

substring.

    <>=

    let slide (n:int) (str:string) (f:int -> string -> unit): unit =

        assert (n > 0);

        assert (n <= String.length str);

        for i = 0 to String.length str - n do

            f i (String.sub str i n)

        done
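
To see which windows are visited, the sketch below collects them in a list.
The helper `windows` is made up for the example, and `slide` is repeated
without its assertions so that the snippet is self-contained.

```ocaml
let slide (n : int) (str : string) (f : int -> string -> unit) : unit =
  for i = 0 to String.length str - n do
    f i (String.sub str i n)
  done

(* Hypothetical helper: all windows of size [n] with their start index *)
let windows (n : int) (str : string) : (int * string) list =
  let acc = ref [] in
  slide n str (fun i s -> acc := (i, s) :: !acc);
  List.rev !acc

let () =
  List.iter (fun (i, s) -> Printf.printf "%d:%s " i s) (windows 3 ".cat.");
  print_newline ()
```

A string of length _k_ has _k-n+1_ windows of size _n_, and each call to
`String.sub` allocates a fresh copy of the window.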

    

    

The basic idea to compute hyphenation points using patterns is as follows:

given a string, we slide windows of increasing size 1, 2, 3, ... over this

string. Every window content is taken as a key to look up an associated

array of integers that assigns numbers to every point before, after, and

within the string, that is, all possible hyphenation points. For a window

of size _m_ the integer array has size _m+1_, which is the number of

hyphenation points for a string of size _m_.

When we plan to hyphenate a word of size _n_, the various sliding windows

retrieve associated hyphenation points which are combined into one array of

size _n+1_ such that each hyphenation point has an integer assigned.

A short array of size _m+1_ is combined with a larger array of size _n+1_

point by point: the large array starts with all zeros. As the smaller array

is slid over the large array, the value in the large array at position _i_

is the maximum of the existing value and the value in the smaller array.

Combining a small integer array with a large array is implemented by
function `combine`. It takes a `small` and a `large` array as arguments
and the index `first` of the element in the large array that is aligned
with the first element (0) of the smaller array. The large array is
updated in place.

    <>=

    let combine ~(first:int) ~(small:int array) ~(large:int array): unit =

        assert (Array.length small + first <= Array.length large);

        for i = first to first + Array.length small - 1 do

            large.(i) <- max small.(i-first) large.(i)

        done
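
A small worked example shows how the maximum is taken point by point. The
arrays are made up for illustration; `combine` is repeated without its
assertion so that the snippet runs on its own.

```ocaml
let combine ~(first : int) ~(small : int array) ~(large : int array) : unit =
  for i = first to first + Array.length small - 1 do
    large.(i) <- max small.(i - first) large.(i)
  done

let large = Array.make 5 0

let () =
  (* large becomes 0 0 5 0 0 *)
  combine ~first:1 ~small:[| 0; 5; 0 |] ~large;
  (* the 4 loses against the existing 5, the 2 wins against 0:
     large becomes 0 0 5 2 0 *)
  combine ~first:2 ~small:[| 4; 2 |] ~large;
  Array.iter (Printf.printf "%d ") large;
  print_newline ()
```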

    

    

Function `load` reads a pattern file for a language and returns a pattern
dictionary of type `t`. We make sure that the file gets closed even if
some exception is raised (most likely due to syntax errors detected in the
scanner).

    <>=

    let load' minpre minpost minlen io: t =

        let lexbuf      = Lexing.from_channel io      in

        let t           = make minpre minpost minlen  in

        let rec loop lb =

            match Hyphenate_reader.read lb with

                | Hyphenate_reader.EOF -> t

                | Hyphenate_reader.Pattern(pattern) ->

                    ( add t pattern

                    ; loop lb

                    )

        in

            loop lexbuf

    

    let load (path:path) ~minpre ~minpost ~minlen =

        let io = try open_in path with Sys_error(msg) -> error msg

        in

            finally (load' minpre minpost minlen) io close_in

    

    

Hyphenation by splitting a word into substrings can finally happen when

we know about the hyphenation points. For a word of size _n_ there are

_n+1_ potential hyphenation points (including before and after the word).

Array `breaks` of size _n+1_ assigns a value to each hyphenation point. A

possible hyphenation is found, if the assigned value is odd. For example,

here are the hyphenation points for _hyphenation_:

    0h0y3p0h0e2n5a4t2i0o2n0

     h y-p h e n-a t i o n

The first hyphen corresponds to the hyphenation point with value 3, the
second to the point with value 5.

Using odd values to indicate hyphenation points is just a convention that
is used in TeX's hyphenation patterns. Likewise, even numbers are used to
discourage hyphenation. Since the maximum is always used (in function
`combine`), a higher number overrides a lower one, but numbers can't cancel
each other out. The higher the number, the stronger the case for or
against hyphenation.
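
The odd/even convention can be tried out on the values shown for
_hyphenation_. The function `show` is a hypothetical helper for this
illustration: it puts a hyphen into every inner gap with an odd value and, to
stay short, ignores any minimum number of characters before the first or after
the last hyphen.

```ocaml
(* Insert a hyphen at every inner gap whose value is odd. [breaks.(i)]
   is the value of the gap in front of [word.[i]]; the gaps before and
   after the word are ignored. Hypothetical helper for illustration. *)
let show (word : string) (breaks : int array) : string =
  let buf = Buffer.create (2 * String.length word) in
  String.iteri
    (fun i c ->
      if i > 0 && breaks.(i) mod 2 = 1 then Buffer.add_char buf '-';
      Buffer.add_char buf c)
    word;
  Buffer.contents buf

let () =
  (* values taken from 0h0y3p0h0e2n5a4t2i0o2n0 *)
  print_endline (show "hyphenation" [| 0; 0; 3; 0; 0; 2; 5; 4; 2; 0; 2; 0 |])
```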

Function `split` implements splitting a word into parts at hyphenation

points. It takes all clues from the array `breaks` that assigns a value to

each potential hyphenation point as explained above.

Break point _i_ belongs to the gap between characters _i-1_ and _i_:

    word    . h y p h e n a t i o n .                word
    word    0 1 2 3 4 5 6 7 8 9 0 1 2     index  for word (last digit)
    breaks 0 1 2 3 4 5 6 7 8 9 0 1 2 3    index  for breaks (last digit)
    breaks 0 0 0 3 0 0 2 5 4 2 0 2 0 0    value  for breaks
            . h y-p h e n-a t i o n .

    minpre          = 1
    minpost         = 2
    length word     = 13 (including both dots)
    length breaks   = 14
    first           = minpre + 1        = 2
    last            = 13 - minpost - 1  = 10

    <>=

    let take string first last  = String.sub string first (last - first + 1)

    let is_odd (n:int): bool    = n mod 2 = 1

    

    let next_hp breaks first last i =

        let rec loop i =

            if i > last || i >= Array.length breaks then None

            else if i < first                       then loop (i+1)

            else if is_odd breaks.(i)               then Some i

            else loop (i+1)

        in

            loop i

    

    <>=

    let split (t:t) (word:string) (breaks:int array): string list =

        (* debug "split %s %s\n" word (join word breaks); *)

        assert (Array.length breaks = String.length word + 1);

        let word_len   = String.length word in

        let first_char = 1 in (* word *)

        let last_char  = word_len  - 2 in (* word *)

        let first_bp   = t.minpre + 1 in (* breaks *)

        let last_bp    = word_len - t.minpost - 1 in (* breaks *)

        (* debug "first:%d last:%d first_bp:%d last_bp:%d\n"

            first_char last_char first_bp last_bp; *)

        let rec loop i (* word *) acc =

            match next_hp breaks first_bp last_bp (i+1) with

            | None   -> take word i last_char :: acc

            | Some j -> loop j (take word i (j-1) :: acc)

        in

            List.rev @@ loop first_char []

    

    

To hyphenate a word, we put a dot `.` at the beginning and end and use a

sliding window to find all matching patterns in the pattern dictionary.

The result is an array that tells us about good hyphenation points.  When a

matching pattern is found, the corresponding hyphenation points are

combined with the ones already found. We try to find all patterns in the
dictionary up to the maximum pattern length or the length of the word,
whichever is shorter.

    <>=

    let hyphenate t (word:string): string list =

        if String.length word < t.minlen then [word] else

        let word   = "." ^ word ^ "." in

        let len    = String.length word in

        let breaks = Array.make (len + 1) 0 in

        let lookup pos substr:unit =

            ( () (* debug "%s%s\n" (String.make pos ' ') substr *)

            ; combine pos (Hashtbl.find t.patterns substr) breaks

            ; () (* debug "%s\n" (join word breaks) *)

            ) in

        let lookup' pos substr = try lookup pos substr with Not_found -> ()

        in

            for i = t.minpatlen to min t.maxpatlen len do

                slide i word lookup'

            done;

            (* debug "%s\n" (join word breaks); *)

            split t word breaks
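
The whole procedure can be condensed into a toy program. This is a sketch
under stated assumptions, not the module's implementation: the three patterns
are made up for the example and are not taken from `hyphen.tex`, all window
sizes are tried instead of only those between the shortest and longest
pattern, and `minpre` is assumed to be at least 1.

```ocaml
(* Toy end-to-end sketch; the patterns below are made up. *)
let is_digit c = '0' <= c && c <= '9'

(* "n5a" becomes ("na", [|0; 5; 0|]) *)
let normalize (pat : string) : string * int array =
  let letters = Buffer.create 8 and digits = ref [] and pending = ref 0 in
  String.iter
    (fun c ->
      if is_digit c then pending := Char.code c - Char.code '0'
      else begin
        digits := !pending :: !digits;
        pending := 0;
        Buffer.add_char letters c
      end)
    pat;
  (Buffer.contents letters, Array.of_list (List.rev (!pending :: !digits)))

let patterns : (string, int array) Hashtbl.t = Hashtbl.create 16

let () =
  List.iter
    (fun p ->
      let key, breaks = normalize p in
      Hashtbl.replace patterns key breaks)
    [ "y3p"; "e2n"; "n5a" ]

let hyphenate ~minpre ~minpost (word : string) : string list =
  let dotted = "." ^ word ^ "." in
  let len = String.length dotted in
  let breaks = Array.make (len + 1) 0 in
  (* slide windows of every size over the dotted word, merge by maximum *)
  for n = 1 to len do
    for pos = 0 to len - n do
      match Hashtbl.find_opt patterns (String.sub dotted pos n) with
      | None -> ()
      | Some small ->
          Array.iteri
            (fun k v -> breaks.(pos + k) <- max v breaks.(pos + k))
            small
    done
  done;
  (* cut at odd values, respecting minpre and minpost *)
  let first_bp = minpre + 1 and last_bp = len - minpost - 1 in
  let parts = ref [] and start = ref 1 in
  for j = first_bp to last_bp do
    if breaks.(j) mod 2 = 1 then begin
      parts := String.sub dotted !start (j - !start) :: !parts;
      start := j
    end
  done;
  (* the last part ends just before the trailing dot *)
  List.rev (String.sub dotted !start (len - 1 - !start) :: !parts)

let () =
  print_endline
    (String.concat "-" (hyphenate ~minpre:2 ~minpost:3 "hyphenation"))
```

With these three patterns the word is split after `hy` and after `phen`;
raising `minpost` far enough suppresses the second split because too few
characters would remain after it.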

    

    

    <>=

    let dump t =

        let print key value =

            Printf.printf "%s %s\n" key (join key value)

        in

            Hashtbl.iter print t.patterns

    

## Demo Client

We provide a small demo client.

    <>=

    exception Error of string

    

    let error msg = raise (Error msg)

    let (@@) f x  = f x

    

    

Function `finally f x cleanup` provides resource cleanup in the presence
of exceptions: `f x` is computed as a result and `cleanup x` is guaranteed
to run afterwards. (In many cases `cleanup` will not use its argument `x`
but it can be convenient to have access to it.)

    <>=

    type 'a result = Success of 'a | Failed of exn

    let finally f x cleanup =

        let result =

            try Success (f x) with exn -> Failed exn

        in

            cleanup x;

            match result with

            | Success y  -> y

            | Failed exn -> raise exn

    

    

    

    <>=

    let usage this =

        List.iter prerr_endline

        [ this ^ " usage:"

        ; ""

        ; this ^ " -f file.txt          hyphenate words in file.txt"

        ; this ^ " word ..              hyphenate arguments"

        ; this ^ " -h                   emit help"

        ; this ^ " -d                   emit hyphenation patterns"

        ; ""

        ; this^" reads words from a file or the command line and"

        ; "emits them hyphenated to stdout. Before hyphenation, words are"

        ; "turned to lower case. "^this^" uses built-in patterns for"

        ; "US English."

        ; ""

        ; "(c) 2012, 2014 Christian Lindig "

        ; "https://github.com/lindig/ocaml-hyphenate"

        ]

    

    

`process` hyphenates a word, joins the parts together using a hyphen and

emits it.

    <>=

    let process lang word =

        print_endline @@ String.concat "-" @@ Hyphenate.hyphenate lang word

    

    let words' io       = Hyphenate_reader.words @@ Lexing.from_channel io

    let words_in path   = finally words' (open_in path) close_in

    

    let main () =

        let argv        = Array.to_list Sys.argv in

        let this        = Filename.basename

                            @@ List.hd argv in

        let args        = List.tl argv in

        let language    = Hyphenate_us.t in

            match args with

            | ["-f"; path]  -> List.iter (process language) (words_in path)

            | "-h" :: _     -> usage this

            | ["-d"]        -> Hyphenate.dump Hyphenate_us.t

            | word :: _     -> List.iter (process language)

                                (List.map String.lowercase_ascii args)

            | _             -> usage this

    

    let _ = main (); exit 0