https://github.com/arademaker/sick


Host: GitHub
URL: https://github.com/arademaker/sick
Owner: arademaker
Created: 2020-07-14T15:50:49.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2022-06-29T18:39:11.000Z (almost 4 years ago)
Last Synced: 2025-12-27T20:58:26.650Z (6 months ago)
Language: Python
Size: 17.1 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.org

README

#+title: treebanking SICK dataset with ERG

** data

- https://www.aclweb.org/anthology/L14-1314/
- http://marcobaroni.org/composes/sick.html
- https://zenodo.org/record/2787612#.X0E1SS2z31A

SICK is also listed at http://nlpprogress.com/english/semantic_textual_similarity.html

SICK contains 9,840 pairs of sentences. The tab-separated text file has a header line followed by one line for each pair, with the following fields (shown here one per line):

#+BEGIN_EXAMPLE
pair_ID: 1
sentence_A: A group of kids is playing in a yard and an old man is standing in the background
sentence_B: A group of boys in a yard is playing and a man is standing in the background
entailment_label: NEUTRAL
relatedness_score: 4.5
entailment_AB: A_neutral_B
entailment_BA: B_neutral_A
sentence_A_original: A group of children playing in a yard, a man in the background.
sentence_B_original: A group of children playing in a yard, a man in the background.
sentence_A_dataset: FLICKR
sentence_B_dataset: FLICKR
SemEval_set: TRAIN
#+END_EXAMPLE
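
The record above can be read programmatically. The following is a minimal sketch, assuming that the header line of SICK.txt uses exactly the field names shown in the example and that fields are tab-separated; the function name =read_pairs= is hypothetical and not part of this repository.

#+BEGIN_SRC python
import csv
import io

# Field names copied from the example record above. Assumption: the header
# line of SICK.txt uses exactly these names, in this order.
FIELDS = [
    "pair_ID", "sentence_A", "sentence_B", "entailment_label",
    "relatedness_score", "entailment_AB", "entailment_BA",
    "sentence_A_original", "sentence_B_original",
    "sentence_A_dataset", "sentence_B_dataset", "SemEval_set",
]


def read_pairs(handle):
    """Yield one dict per sentence pair from a tab-separated SICK file."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["relatedness_score"] = float(row["relatedness_score"])
        yield row


# In-memory stand-in for SICK.txt: the header plus the record shown above.
ROW = [
    "1",
    "A group of kids is playing in a yard and an old man is standing in the background",
    "A group of boys in a yard is playing and a man is standing in the background",
    "NEUTRAL", "4.5", "A_neutral_B", "B_neutral_A",
    "A group of children playing in a yard, a man in the background.",
    "A group of children playing in a yard, a man in the background.",
    "FLICKR", "FLICKR", "TRAIN",
]
SAMPLE = "\t".join(FIELDS) + "\n" + "\t".join(ROW) + "\n"

pairs = list(read_pairs(io.StringIO(SAMPLE)))
print(pairs[0]["sentence_A"])
print(pairs[0]["entailment_label"], pairs[0]["relatedness_score"])
#+END_SRC

To read the real file, pass =open("SICK.txt", encoding="utf-8", newline="")= instead of the in-memory sample.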

For treebanking, only the individual sentences matter, and many of
them are repeated across pairs. For instance, the sentence 'A man is
playing a guitar' occurs in 63 pairs.

#+BEGIN_EXAMPLE
% awk -F "\t" -v OFS="\n" 'NR > 1 {print $2,$3}' SICK.txt | wc -l
19680
% awk -F "\t" -v OFS="\n" 'NR > 1 {print $2,$3}' SICK.txt | sort | uniq | wc -l
6076
% awk -F "\t" -v OFS="\n" 'NR > 1 {print $2,$3,$8,$9}' SICK.txt | sort | uniq | wc -l
7985
#+END_EXAMPLE
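
The first two awk pipelines above can also be sketched in Python, which makes it easy to see how often each sentence occurs. This is an illustrative sketch only: the function name =sentence_counts= and the toy rows are hypothetical, and the column positions (2 and 3 for sentence_A and sentence_B) are taken from the awk commands.

#+BEGIN_SRC python
from collections import Counter


def sentence_counts(lines):
    """Count occurrences of each sentence in columns 2 and 3 of a SICK file."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i == 0:
            continue  # skip the header line, like NR > 1 in awk
        fields = line.rstrip("\n").split("\t")
        counts.update(fields[1:3])  # sentence_A and sentence_B
    return counts


# Toy rows standing in for SICK.txt (only the first three columns).
lines = [
    "pair_ID\tsentence_A\tsentence_B\n",
    "1\tA man is playing a guitar\tA man is playing a flute\n",
    "2\tA man is playing a guitar\tA woman is singing\n",
]

counts = sentence_counts(lines)
total = sum(counts.values())   # corresponds to the first `wc -l`
unique = len(counts)           # corresponds to `sort | uniq | wc -l`
print(total, unique)
print(counts.most_common(1))
#+END_SRC

With the real data, =sentence_counts(open("SICK.txt", encoding="utf-8"))= should give totals comparable to the awk output, and =counts.most_common()= lists the most repeated sentences first.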

** data preparation

1. obtain the SICK.txt file (note that I made a few manual edits
   to fix errors in the original SICK.txt)

2. create the profiles by running data/compact.sh

3. process the profiles with ACE/ERG (see data/proc-profile.sh)

4. create data/sample.txt from sentences.txt

** grammar compilation

grammar compilation (trunk version):

#+BEGIN_SRC bash
ace -g ~/hpsg/terg/ace/config.tdl -G erg.dat
#+END_SRC

** profile and fftb treebanking

profile construction:

#+BEGIN_SRC bash
mkprof -r ~/logon/lingo/erg/tsdb/gold/mrs/relations -i data/sample.txt data/golden
art -a "ace -g erg.dat -O --disable-generalization" -f data/golden
#+END_SRC

with ACE/PyDelphin:

#+BEGIN_SRC bash
delphin mkprof --input sample.txt --relations ~/hpsg/logon/lingo/lkb/src/tsdb/skeletons/english/Relations --skeleton data/golden
delphin process data/golden -g erg.dat --full-forest --options='--disable-generalization'
#+END_SRC

treebanking:

#+BEGIN_SRC bash
fftb -g erg.dat --webdir /usr/local/fftb/assets/ data/sample
#+END_SRC

The annotation took approximately 6 hours.

** profile processing with ACE

#+BEGIN_SRC bash
delphin process -g erg.dat -o "-n 1" -s data/golden data/parsed
#+END_SRC

** comparing the profiles

#+BEGIN_SRC bash
% delphin edm golden parsed
Precision: 0.9637710992177851
Recall: 0.9683557394002068
F-score: 0.9660579799855565
#+END_SRC
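
The F-score reported above is the harmonic mean of precision and recall. The sketch below checks that relation against the reported numbers and illustrates precision and recall on a toy example. It is a simplification and not PyDelphin's implementation: the triples are hypothetical, and the real EDM metric is computed over the dependency triples extracted from the two profiles.

#+BEGIN_SRC python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def precision_recall(gold, test):
    """Precision and recall of a set of test triples against gold triples."""
    matched = len(gold & test)
    return matched / len(test), matched / len(gold)


# Values reported by `delphin edm golden parsed` above.
reported_f = f_score(0.9637710992177851, 0.9683557394002068)
print(reported_f)

# Hypothetical triples, for illustration only.
gold = {("e2", "ARG1", "x3"), ("e2", "ARG2", "x8"), ("x3", "NAME", "_man_n_1"),
        ("x8", "NAME", "_guitar_n_1")}
test = {("e2", "ARG1", "x3"), ("e2", "ARG2", "x8"), ("x3", "NAME", "_man_n_1"),
        ("x8", "NAME", "_flute_n_1")}
p, r = precision_recall(gold, test)
print(p, r, f_score(p, r))
#+END_SRC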

** solver for underspecified scopes

We used https://github.com/coli-saar/utool to solve the
underspecified scopes of quantifiers. In effect, this process also
tests the consistency of the MRS structures.

Download utool and start the server with:

: java -Xmx8g -server -jar utool/Utool-3.4.jar server --logging --warmup

then execute:

: python solver.py > solver.txt
