Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

https://github.com/linas/learn-experiments-en
Language learning experiments config files
- Host: GitHub
- URL: https://github.com/linas/learn-experiments-en
- Owner: linas
- Created: 2022-05-20T01:27:07.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-03-26T01:20:36.000Z (9 months ago)
- Last Synced: 2024-10-20T10:31:16.956Z (2 months ago)
- Topics: nlp-machine-learning, opencog
- Language: Shell
- Homepage: https://wiki.opencog.org/w/Language_learning
- Size: 199 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Restart of English language-learning experiments
================================================
Restarted July 2021 ... ongoing, through Oct 2022.

Directories:
* run-1 -- attempt to have a perfect run-through of everything. (July 2021)
* run-2 -- work with a flawed copy of tranche-1 only
* run-3 -- Similarity Smackdown - compare MI to overlap, jaccard, etc. (Aug 2021)
* run-4 -- Deep trimming of datasets (Sept 2021)
* run-5 -- Trimming of word-pair files.
* run-6 -- All similarities between top-ranked words.
* run-7 -- Exploring one merge at a time.
* run-8 -- Exploring one merge at a time, w/shapes, and more merges.
* run-9 -- Bringup of production ranked merge.
* run-10 -- Four different merge experiments, crashed around 15 classes
* run-11 -- Precise & imprecise merge, crashed, flawed
* run-12 -- Attempted compute of similarities. Broken.
* run-13 -- Staging area; no experiments.
* run-14 -- Clustering attempts; get about 100-200 merges deep, then blah.
(later; revive for some frame debugging work.)
* run-15 -- Deep compute of MI similarities and GOE similarities. (Sept 2022)
* run-16 -- Cluster attempts with GOE similarities.
* run-17 -- Link-Grammar API development & test. (Oct 2022)
* run-18 -- Integrated pair+MST+clustering bring-up. (Nov 2022)

Spindled databases
==================
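Stashing a finished database into the archive can be sketched in shell. This is a minimal, hedged sketch: the archive path comes from this README, while the `archive_db` helper and the `DRYRUN` guard are hypothetical conventions, not part of the repo.

```shell
#!/bin/sh
# Hypothetical helper for stashing a finished RocksDB database into the
# archive directory. DRYRUN=1 (the default) only prints what would happen.
ARCHIVE="${ARCHIVE:-/home2/linas/src/novamente/data/rocks-archive}"
DRYRUN="${DRYRUN:-1}"

archive_db() {
    db="$1"
    if [ ! -e "$db" ]; then
        echo "no such database: $db" >&2
        return 1
    fi
    if [ "$DRYRUN" = "1" ]; then
        echo "would copy $db -> $ARCHIVE/"
    else
        mkdir -p "$ARCHIVE"
        cp -a "$db" "$ARCHIVE/"
    fi
}

# Example (dry run): archive_db run-1-en_pairs-tranche-1.rdb
```

Copying with `cp -a` preserves the RocksDB directory contents and timestamps; run with `DRYRUN=0` to actually copy.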
Archived databases are in
```
/home2/linas/src/novamente/data/rocks-archive
```

Databases in the `~/data` directory
===================================
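Several of the databases in this section are noted as large. Their on-disk sizes can be checked with a small helper; a hedged sketch, where `list_rdb_sizes` is a hypothetical name and only the `*.rdb` naming convention is taken from this README:

```shell
#!/bin/sh
# Hypothetical helper: print the on-disk size of every *.rdb database
# under a given directory. RocksDB databases may be files or directories;
# `du -sh` handles both.
list_rdb_sizes() {
    dir="$1"
    for db in "$dir"/*.rdb; do
        [ -e "$db" ] || continue    # glob matched nothing
        du -sh "$db"
    done
}

# Usage: list_rdb_sizes ~/data
```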
The assorted `run-1-*.rdb` databases are "master copies" of the best
runs, with the processing correctly applied. These took a long time
to generate, and need to be archived. They are imperfect: right from
the get-go, there's some bug with escaping quotes that needlessly
pollutes these files. However, that bug has not yet been found or
fixed, and nothing has been re-run, so the below will do.

* `run-1-en_pairs-tranche-1.rdb` -- run-1 guten-tranche-1 only.
* `run-1-en_pairs-tranche-12.rdb` -- run-1 guten-tranche-1 and 2.
* `run-1-en_pairs-tranche-12*.rdb` -- etc.

The above are large.
* `run-1-en_mpg-tranche-1234.rdb` -- ?? I guess mpg-parsed ??
Huge. See Diary Part Two.

* `run-1-marg-tranche-123.rdb` -- Described in Diary Part Three,
pages 6 and 8 ... contains word pairs and also disjuncts, and MM^T
marginals for disjuncts (and maybe word-pair marginals?)

This takes 50 GB to load word pairs, and 60 GB to do anything with
them.

* `run-1-t1*-trim-1-1-1.rdb` -- MPG-parsed and trimmed to remove words,
disjuncts, and sections with a count of 1. Includes MM^T marginals
(but for word-disjunct pairs only) and redone pair marginals.
This amount of trimming was not enough! See below.

* `run-1-t1*-tsup-1-1-1.rdb` -- As above, but also removed all words and disjuncts
with a support of only 1. See run-5, run-6 and Diary Part Three.
This is the correct amount to trim! Includes MM^T marginals.
... but the MM^T marginals are only on the (w,d) pairs, and NOT
on the shapes. The merge work needs shapes!

* `run-1-t12-tsup-1-1-1.rdb` has ... 7.5K x 7.8K words, total of 7.4M
word-pairs; takes 6.0 GB to load word-pairs.
has 7.1K x 66K (w,d) matrix, 270K word-disjunct pairs.
Needs only an additional 0.4 GB to also load (w,d), for a total of 6.4 GB.

XXX -- there is an issue: the matrix summary says 7.4M word-word
pairs, but there are only 3.4M atoms. So the summary is pre-trim.

* `run-1-t123-tsup-1-1-1.rdb` has ... 44K x 44K words, total of 18.5M
word-pairs; takes 13.6 GB to load word-pairs. 20 minutes to load.
has 11.3K x 136K (w,d) matrix, 560K word-disjunct pairs.
Needs only an additional 0.9 GB to also load (w,d), for a total of 14.5 GB.

* `run-1-t1*-shape.rdb` -- Copy of above, with MM^T marginals on shapes.
This is on the fat side, as it still retains the original
word-pairs. It also contains the (un-needed) support and MM^T
on the shapeless (w,d) pairs.

* `r9-sim-200.rdb` -- See Diary.
* `r9-sim-200+entropy.rdb`
* `r9-sim-200+mi.rdb`
* `r14-sim200.rdb` -- quality connector set, with the top 200 words with MI
in them.

Junk Databases
==============
The following were generated in various experiments, but do not
need to be archived; they can be deleted.

* `r2-en_pairs.rdb` -- just pair counts for guten-tranche-1 only.
The above is missing 400+ files; there were some crashes.
Part of run-2 -- includes MI.

* `r2-mpg_parse.rdb` -- MPG disjunct counts for the above, aka run-2.
This includes the MM^T marginals.

* `r3-*.rdb` -- Assorted Similarity Smackdown databases.
* `r6-similarity-tsup.rdb` -- copy of run-1-t1234-tsup-1-1-1.rdb with MI for
word-pairs between the top 1200 wordvecs computed.

* `r7-merge.rdb` -- individual merge experiments.
* `r8-merge.rdb` -- individual merge experiments.
* `r9-merge.rdb` -- Bringup of production merge.

**...THE END...**