https://github.com/unhammer/lfgalign

Master's thesis: Syntactic phrase alignment
Host: GitHub
URL: https://github.com/unhammer/lfgalign
Owner: unhammer
License: gpl-3.0
Created: 2010-01-29T21:10:34.000Z (about 15 years ago)
Default Branch: master
Last Pushed: 2011-01-16T21:29:00.000Z (over 14 years ago)
Last Synced: 2024-12-27T00:41:50.859Z (4 months ago)
Topics: alignment, lfg, paper, syntax, treebank
Language: Prolog
Homepage: http://bora.uib.no/handle/1956/5003
Size: 23.3 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.markdown
- License: COPYING
        
About
=====

`lfgalign` is a program for aligning corresponding f-structures and
c-structures of LFG-analysed parallel sentences. The analyses should
be in the
[XLE format](http://www2.parc.com/isl/groups/nltt/xle/doc/xle.html#Prolog_Output),
and preferably manually disambiguated from grammars that have been
written using common analysis principles (see the
[Xpar project description](http://xpar.b.uib.no/project-description/)). One
may optionally supply word-translations (e.g. from word alignments or
translational dictionaries) in order to improve the predicate
alignment.

There is an
[article about lfgalign](https://github.com/unhammer/lfgalign/raw/master/article/lfgalign-art.pdf)
that describes the method; see also the
[master's thesis](https://github.com/unhammer/lfgalign/raw/master/thesis/lfgalign.pdf)
(in Norwegian).

Usage
=====

Prerequisites:

- [asdf](http://common-lisp.net/project/asdf/), which is bundled with
  [SBCL](http://www.sbcl.org/) as well as with the less common Common
  Lisps.
- [lisp-unit](http://github.com/OdonataResearchLLC/lisp-unit)
  (optional, for regression tests)

Make a symlink from your asdf `systems` directory to `lfgalign.asd` in
this directory (you can do the same for `lisp-unit`). Since I installed
SBCL using [clbuild](http://common-lisp.net/project/clbuild/), my
`systems` directory was at `/path/to/clbuild/systems`, but you can find
yours by evaluating `asdf:*central-registry*` in your interpreter after
requiring `'asdf`.
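
If you would rather not create symlinks, the source directory can also be added to the registry directly. A minimal sketch, assuming the repository was cloned to the hypothetical path `/path/to/lfgalign/` (the trailing slash matters, since the entry has to be a directory pathname):

```lisp
(require 'asdf)

;; Inspect the directories asdf currently searches for .asd files.
(print asdf:*central-registry*)

;; Hypothetical checkout location; replace it with the directory that
;; actually contains lfgalign.asd.
(push #p"/path/to/lfgalign/" asdf:*central-registry*)
```

This only lasts for the current session unless it is placed in your Lisp init file.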

Running in the interpreter
--------------------------

Load the package in your interpreter with

    (asdf:operate 'asdf:load-op 'lfgalign)

Switch to the `lfgalign` package:

    (in-package :lfgalign)

(If you're using it through Emacs with SLIME, you can load from the REPL
with `, load RET lfgalign RET` and switch with `, in RET lfgalign RET`.)

You can then run the regression tests with

    (lisp-unit:run-tests)

    

The function `evaluate` in the file `eval.lisp` shows how to load two
Prolog files into analysis tables, create an empty LPT table, run
f-alignment, ranking and c-alignment, and finally print some
not-very-formatted output.

Running from the command-line
-----------------------------

A very preliminary command-line interface using SBCL is available. You
should be able to align two Prolog files by simply running

    ./align.sh source.pl target.pl

    

although it assumes SBCL is installed in `usr`; you can set the
correct paths to SBCL and your asdf systems directory (where you
symlinked to `lfgalign.asd`) first by doing e.g.:

    

    export LISP=/l/c/clbuild/target/bin/sbcl
    export LISPCORE=/l/c/clbuild/target/lib/sbcl/sbcl.core
    export ASDFSYSTEMS=/l/c/clbuild/systems/

Common Lisp command-line interfaces are unfortunately not very
standardised.

Alternative c-structure alignment
---------------------------------

The global variable `*pro-affects-c-linking*` controls whether
unlinked pro-elements may hinder linking c-structure nodes of two
predicates. Setting this to `t` or `nil` toggles two alternative ways
of linking c-structures in the cases where one language has
pro-elements and the other does not, and the pro-element is linked on
the f-structure level.
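
Assuming the variable is defined with `defvar` or `defparameter` (and is therefore special), it can be set once for a session or rebound temporarily around a single call. A minimal sketch: to stay runnable without lfgalign loaded, it declares a stand-in variable with an assumed default of `t`, and `current-linking-mode` is a hypothetical placeholder for whatever code reads the variable.

```lisp
;; Stand-in for the real variable; the default value T is an assumption.
(defvar *pro-affects-c-linking* t)

(defun current-linking-mode ()
  "Hypothetical reader of the flag, standing in for the c-alignment code."
  (if *pro-affects-c-linking* :pro-may-block :pro-ignored))

;; Rebind the flag for the duration of one call only; the global value
;; is restored automatically when the LET form exits.
(defun linking-mode-with-pro-ignored ()
  (let ((*pro-affects-c-linking* nil))
    (current-linking-mode)))
```

With the real package loaded, the same `let` form could wrap the alignment call, which makes it easy to compare both linking alternatives on one sentence pair.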

Central functions
=================

lfgalign
----------

`lfgalign.lisp` currently does the following:

- collect c-structure trees: `maketree`

- find the topmost c-node in an f-domain: `topnode`

- find a c-node referenced by f-structure variable: `treefind`

- find f-structure predicate from variable, traversing equivalent
  f-vars: `get-pred` and `unravel`

- find arguments, adjuncts, lemma and lexical expression of a
  predicate/f-var: `get-args`, `get-adjs`, `lemma`, `L`

- keep tables of LPT correspondences (lookup with `LPT?` ensures a
  "pro" is an LPT of a noun as defined by `noun?`)

- find all set-unique combinations of links of source arguments with
  target args/adjuncts, and target arguments with source args/adjuncts
  (excluding adj-adj links): `argalign` (if given LPT tables, this
  removes combinations where at least one link is non-LPT)

- `outer-pred` creates a fake "sentence pred" with id -1, that has 0
  as an argument and, as adjuncts: any unreferenced preds in the
  f-structure (preds that are not arguments/adjuncts reachable through
  0)

- `f-align` combines the above and recursively tries to align all
  arguments in all permutations of argument-argument/adjunct pairs,
  creating a decision tree of sorts; `flatten` spreads this out into
  several simple lists.

- `rank` uses `rank-helper` and `rank-branch` to turn the output from
  `f-align` into a single flat, ranked list of links for input into
  `c-align`.

- `add-links` takes a flat f-alignment and a tree, and creates a table
  of type `LL-splits`. Each node `n` is added to a list in the table,
  where the index of the list is the set of alignments of pre-terminal
  nodes dominated by node `n` (so several nodes may have the same
  index).

- `c-align` takes a flat f-alignment and finds the `LL-splits`
  of source and target trees, intersecting that on the keys to find
  which nodes can be aligned.
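
The idea behind `add-links` and `c-align` can be illustrated on toy data. The following is a minimal sketch and not the code in `lfgalign.lisp`: the tree format (nested lists of `(node-id child...)`), the link alist, and all `toy-` names are assumptions made for this example, and nodes that dominate no linked pre-terminal are simply skipped.

```lisp
;; LINKS is an alist from pre-terminal id to a link number.
(defun toy-link-set (tree links)
  "Sorted list of the link numbers of all pre-terminals dominated by TREE."
  (destructuring-bind (id &rest children) tree
    (if (null children)
        (let ((link (cdr (assoc id links))))
          (when link (list link)))
        (sort (remove-duplicates
               (copy-list (loop for child in children
                                append (toy-link-set child links))))
              #'<))))

(defun toy-ll-splits (tree links
                      &optional (table (make-hash-table :test #'equal)))
  "Table from link set to the ids of all nodes dominating exactly that set."
  (let ((key (toy-link-set tree links)))
    (when key                       ; nodes without any link are skipped
      (push (first tree) (gethash key table))))
  (dolist (child (rest tree))
    (toy-ll-splits child links table))
  table)

(defun toy-c-align (src-tree src-links trg-tree trg-links)
  "List of (link-set source-nodes target-nodes) for keys in both tables."
  (let ((src (toy-ll-splits src-tree src-links))
        (trg (toy-ll-splits trg-tree trg-links))
        (result '()))
    (maphash (lambda (key src-nodes)
               (let ((trg-nodes (gethash key trg)))
                 (when trg-nodes
                   (push (list key src-nodes trg-nodes) result))))
             src)
    result))

;; Three linked words; the target side has a different word order.
(defparameter *toy-alignment*
  (toy-c-align '(s1 (np1 (n1)) (vp1 (v1) (np2 (n2))))
               '((n1 . 1) (v1 . 2) (n2 . 3))
               '(s2 (vp2 (w2) (w3)) (w1))
               '((w1 . 1) (w2 . 2) (w3 . 3))))
```

Here `(assoc '(2 3) *toy-alignment* :test #'equal)` pairs `vp1` with `vp2`, because both dominate exactly the pre-terminals carrying links 2 and 3, while the link set `(1)` is shared by `n1`, `np1` and `w1`.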

  

prolog-import
----------

`prolog-import.lisp` parses an XLE Prolog file and puts everything into
a hash table. Keys are f-structure variable numbers for the
f-structure, while the c-structure parts are keyed by the names
of the parts (subtree, terminal, phi, cproj, fspan, semform_data,
surfaceform), the values being alists with unique id keys. If we turn
it all back into an assoc-list, we get e.g.:

    ((0 ("VFORM" . "fin") ("CLAUSE-TYPE" . "decl") ("TNS-ASP" . 10)
      ("POLARITY" . 5) ("CHECK" . 1) ("SUBJ" . 3) ("PRED" "qePa" 2 (3) NIL))
     (3 ("PERS" . "3") ("NUM" . "sg") ("CASE" . "erg") ("ANIM" . "+") ("NTYPE" . 6)
      ("CHECK" . 4) ("PRED" "Abrams" 0 NIL NIL))
     (4 ("_AGR-POS" . "left") ("_POLARITY" . 5)) (6 ("NSYN" . "proper"))
     (1 ("_TENSEGROUP" . "aor") ("_TENSE" . "aor") ("_PERIOD" . "+")
      ("_MAIN-CL" . "+") ("_AGR" . "both") ("_MORPH-SYNT" . 7) ("_IN-SITU" . 2))
     (|in_set| ("NO-PV" . 22) (3 . 2))
     (7 ("_SYNTAX" . "unerg") ("_PERF-PV" . "-") ("_LEXID" . "V2746-3")
      ("_CLASS" . "MV") ("_AGR" . 8))
     (8 ("_OBJ" . 9)) (9 ("PERS" . "3") ("NUM" . "sg"))
     (10 ("TENSE" . "past") ("MOOD" . "indicative") ("ASPECT" . "perf"))
     (21 ("o::" . 22))
     (|subtree| (2 "PROP" NIL 1) (4 "V_SUFF_BASE" NIL 5) (6 "V_SUFF_BASE" NIL 7)
      (8 "V_SUFF_BASE" NIL 9) (10 "V_SUFF_BASE" NIL 11) (12 "V_SUFF_BASE" NIL 13)
      (14 "V_SUFF_BASE" NIL 15) (18 "V_BASE" NIL 17) (28 "PERIOD" NIL 22)
      (118 "PROPP" NIL 2) (141 "IPfoc[main,-]" NIL 118) (281 "V" NIL 18)
      (283 "V" 281 14) (284 "V" 283 12) (285 "V" 284 10) (286 "V" 285 8)
      (287 "V" 286 6) (288 "V" 287 4) (293 "I[main,-]" NIL 288)
      (398 "Ibar[main,-]" NIL 293) (401 "IPfoc[main,-]" 141 398)
      (454 "ROOT" NIL 401) (457 "ROOT" 454 28))
     (|phi| (1 . 3) (2 . 3) (4 . 0) (5 . 0) (6 . 0) (7 . 0) (8 . 0) (9 . 0)
      (10 . 0) (11 . 0) (12 . 0) (13 . 0) (14 . 0) (15 . 0) (17 . 23) (18 . 0)
      (22 . 0) (28 . 0) (118 . 3) (141 . 0) (281 . 0) (283 . 0) (284 . 0) (285 . 0)
      (286 . 0) (287 . 0) (288 . 0) (293 . 0) (398 . 0) (401 . 0) (454 . 0)
      (457 . 0))
     (|terminal| (1 "abramsma" (1)) (5 "+Obj3" (3)) (7 "+Subj3Sg" (3))
      (9 "+Aor" (3)) (11 "+Base" (3)) (13 "+Unerg" (3)) (15 "+V" (3))
      (17 "qePa-2746-3" (3)) (22 "." (22)))
     (|cproj| (17 . 21)) (|semform_data| (2 18 10 14) (0 2 1 9))
     (|fspan| (3 1 9) (0 1 16))
     (|surfaceform| (22 "." 15 16) (3 "iqePa" 10 15) (1 "abramsma" 1 9)))
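
To show how such a structure can be read, here is a minimal sketch that looks up attributes in a two-entry excerpt of the assoc-list above. It works on the alist form shown here rather than on the hash table the importer actually builds, the `toy-` names are hypothetical, and reading the first element of a `PRED` entry as the lemma is an assumption based on the example data.

```lisp
;; Excerpt of the assoc-list above, restricted to two f-structure variables.
(defparameter *toy-analysis*
  '((0 ("VFORM" . "fin") ("CLAUSE-TYPE" . "decl") ("SUBJ" . 3)
       ("PRED" "qePa" 2 (3) nil))
    (3 ("PERS" . "3") ("NUM" . "sg") ("CASE" . "erg")
       ("PRED" "Abrams" 0 nil nil))))

(defun toy-attribute (var attribute analysis)
  "Value of ATTRIBUTE under f-structure variable VAR, or NIL if absent."
  (cdr (assoc attribute (cdr (assoc var analysis)) :test #'equal)))

(defun toy-lemma (var analysis)
  "First element of the PRED entry of VAR, read here as the lemma."
  (first (toy-attribute var "PRED" analysis)))
```

For example, `(toy-attribute 0 "SUBJ" *toy-analysis*)` gives the variable `3`, and passing that variable to `toy-lemma` gives `"Abrams"`.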

We collect the eq-vars (equivalent variables) into a doubly-linked
circular list (so we can easily look up a member and get all
equivalents).
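
The lookup this enables (give one variable, get all its equivalents) can be sketched with a simpler structure. The example below deliberately swaps the circular list for a hash table in which every member of a class points to the same list; `toy-eq-classes` and its input format of `(a . b)` pairs are assumptions for illustration, not the importer's own code.

```lisp
(defun toy-eq-classes (pairs)
  "Table mapping each variable to the list of all variables equivalent to it.
PAIRS is a list of (a . b) conses stating that A and B are equivalent."
  (let ((table (make-hash-table)))
    (loop for (a . b) in pairs
          do (let ((merged (union (gethash a table (list a))
                                  (gethash b table (list b)))))
               ;; Point every member of the merged class at the new list.
               (dolist (var merged)
                 (setf (gethash var table) merged))))
    table))

(defun toy-equivalents (var table)
  "All variables equivalent to VAR (including VAR), in ascending order."
  (sort (copy-list (gethash var table)) #'<))
```

With the pairs `((1 . 2) (2 . 5) (7 . 8))`, looking up `5` yields the class containing 1, 2 and 5, and a variable that never occurred in a pair yields `NIL`.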

We signal an error if the file is not disambiguated (as indicated by
the `select` and `choice` fields in the Prolog file). Otherwise, we
filter out non-selected parses from the file, keeping only the ones
equivalent to the selected parse (see `filter-equiv`, `in-disjunction`
and `disambiguated?`).

TODO
====

- Use LPT-check as a k-best ranking criterion rather than a binary
  cut-off.

- SPEC and POSS features may lead to PREDs that are not arguments or
  adjuncts of anything else (e.g. determiners, possessors); we need
  some principled method of aligning these.

- The program only uses `dset3` of the dsets; rename it (make a class?)
  and deprecate the others.

- Could perhaps make argument calls a bit more concise by making a
  class `alignment`, containing constants `tab_s`, `tab_t`, creating
  constants `tree_s` and `tree_t` on init, and storing `LPT` and
  `f-alignments`.