https://github.com/arademaker/sick
https://github.com/arademaker/sick
Last synced: about 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/arademaker/sick
- Owner: arademaker
- Created: 2020-07-14T15:50:49.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-06-29T18:39:11.000Z (almost 4 years ago)
- Last Synced: 2025-12-27T20:58:26.650Z (6 months ago)
- Language: Python
- Size: 17.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.org
Awesome Lists containing this project
README
#+title: treebanking SICK dataset with ERG
** data
- https://www.aclweb.org/anthology/L14-1314/
- http://marcobaroni.org/composes/sick.html
- https://zenodo.org/record/2787612#.X0E1SS2z31A
also mentioned at http://nlpprogress.com/english/semantic_textual_similarity.html
SICK contains 9,841 pairs of sentences, the text file contains one line for each pair:
#+BEGIN_EXAMPLE
pair_ID: 1
sentence_A: A group of kids is playing in a yard and an old man is standing in the background
sentence_B: A group of boys in a yard is playing and a man is standing in the background
entailment_label: NEUTRAL
relatedness_score: 4.5
entailment_AB: A_neutral_B
entailment_BA: B_neutral_A
sentence_A_original: A group of children playing in a yard, a man in the background.
sentence_B_original: A group of children playing in a yard, a man in the background.
sentence_A_dataset: FLICKR
sentence_B_dataset: FLICKR
SemEval_set: TRAIN
#+END_EXAMPLE
Considering the sentences only for treebanking, we have many
repetitions. For instance, the sentence 'A man is playing a guitar'
occurs in 63 pairs.
#+BEGIN_EXAMPLE
% awk -F "\t" -v OFS="\n" 'NR > 1 {print $2,$3}' SICK.txt | wc -l
19680
% awk -F "\t" -v OFS="\n" 'NR > 1 {print $2,$3}' SICK.txt | sort | uniq | wc -l
6076
% awk -F "\t" -v OFS="\n" 'NR > 1 {print $2,$3,$8,$9}' SICK.txt | sort | uniq | wc -l
7985
#+END_EXAMPLE
** data preparation
1. obtain the SICK.txt file (note that I made few manual editions
to FIX errors in the original SICK.txt)
2. create the profiles running data/compact.sh
3. process the profiles with ACE/ERG (see data/proc-profile.sh)
4. create the data/sample.txt from the sentences.txt
** grammar compilation
grammar compilation (trunk version):
#+BEGIN_SRC bash
ace -g ~/hpsg/terg/ace/config.tdl -G erg.dat
#+END_SRC
** profile and fftb treebanking
profile construction:
#+BEGIN_SRC bash
mkprof -r ~/logon/lingo/erg/tsdb/gold/mrs/relations -i data/sample.txt data/golden
art -a "ace -g erg.dat -O --disable-generalization" -f data/golden
#+END_SRC
with ACE/PyDelphin:
#+BEGIN_SRC bash
delphin mkprof --input sample.txt --relations ~/hpsg/logon/lingo/lkb/src/tsdb/skeletons/english/Relations --skeleton data/golden
delphin process data/golden -g erg.dat --full-forest --options='--disable-generalization'
#+END_SRC
treebanking:
#+BEGIN_SRC bash
fftb -g erg.dat --webdir /usr/local/fftb/assets/ data/sample
#+END_SRC
The annotation was done in aprox. 6 hours.
** profile processing with ACE
#+BEGIN_SRC bash
delphin process -g erg.dat -o "-n 1" -s data/golden data/parsed
#+END_SRC
** comparing the profiles
#+BEGIN_SRC bash
% delphin edm golden parsed
Precision: 0.9637710992177851
Recall: 0.9683557394002068
F-score: 0.9660579799855565
#+END_SRC
** solver for underspecified scopes
We used https://github.com/coli-saar/utool to solve the
underspecified scopes of quantifiers. This process actually test
the consistency of the MRS structures.
Download utool and start the server with:
: java -Xmx8g -server -jar utool/Utool-3.4.jar server --logging --warmup
then execute:
: python solver.py > solver.txt