**Status:** Archive (code is provided as-is, no updates expected)

DeepType: Multilingual Entity Linking through Neural Type System Evolution
--------------------------------------------------------------------------

This repository contains the code necessary for designing and evolving type systems, and for training the corresponding neural type classifiers. To read more about this technique and our results, [see this blog post](https://blog.openai.com/discovering-types-for-entity-disambiguation/) or [read the paper](https://arxiv.org/abs/1802.01021).

Authors: Jonathan Raiman & Olivier Raiman

Our latest approach to learning symbolic structures from data allows us to discover a set of task-specific constraints on a neural network in the form of a type system, to guide its understanding of documents, and to obtain state-of-the-art accuracy at [recognizing entities in natural language](https://en.wikipedia.org/wiki/Entity_linking). Recognizing entities in documents can be quite challenging since there are often millions of possible answers. However, when using a type system to constrain the options to only those that semantically "type check," we shrink the answer set and make the problem dramatically easier to solve.
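As a minimal sketch of this idea (not code from this repository; the mention, candidates, and type labels below are invented for illustration), a type system acts as a filter over an entity-linking candidate set:

```python
# Hypothetical candidate table: each candidate entity carries a set of types.
CANDIDATES = {
    "jaguar": [
        ("Jaguar Cars", {"organization"}),
        ("jaguar (animal)", {"animal"}),
        ("Jaguar (band)", {"organization", "music"}),
    ],
}

def type_check(mention, predicted_types):
    """Keep only the candidates whose types intersect the predicted types."""
    return [name for name, types in CANDIDATES[mention]
            if types & predicted_types]

# If a type model reads the context and predicts "animal",
# the millions-of-answers problem collapses to one survivor:
print(type_check("jaguar", {"animal"}))        # ['jaguar (animal)']
print(type_check("jaguar", {"organization"}))  # ['Jaguar Cars', 'Jaguar (band)']
```

The real system predicts types with a neural network over the document context; this sketch only shows why a correct type prediction shrinks the answer set.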
Our new results suggest that learning types is a very strong signal for understanding natural language: if types were given to us by an oracle, we find that it is possible to obtain accuracies of 98.6-99% on two benchmark tasks, [CoNLL (YAGO)](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/) and the [TAC KBP 2010 challenge](https://pdfs.semanticscholar.org/b7fb/11ef06b0dcdc89ef0a5507c6c9ccea4206d8.pdf).

### Data collection

Get the wikiarticle → wikidata mapping (all languages), plus anchor tags, redirections, category links, and statistics (per language). To store all wikidata ids, their key properties (`instance of`, `part of`, etc.), a mapping from all wikipedia article names to a wikidata id, and wikipedia anchor tags and links for three languages, English (en), French (fr), and Spanish (es), run:

```
export DATA_DIR=data/
./extraction/full_preprocess.sh ${DATA_DIR} en fr es
```

### Create a type system manually and check oracle accuracy

To build a graph projection using a set of rules inside `type_classifier.py`
(or any Python file containing a `classify` method), and a set of nodes
that should not be traversed in `blacklist.json`:

```
export LANGUAGE=fr
export DATA_DIR=data/
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py
```

To save a graph projection as a numpy array, along with a list of classes, to a
directory stored in `CLASSIFICATION_DIR`:

```
export LANGUAGE=fr
export DATA_DIR=data/
export CLASSIFICATION_DIR=data/type_classification
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py --export_classification ${CLASSIFICATION_DIR}
```

To use the saved graph projection on wikipedia data and test how discriminative this
classification is (oracle performance), edit the config file to change the classification used, then run:

```
export DATA_DIR=data/
python3 extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}
```

### Obtain learnability scores for types

```bash
export DATA_DIR=data/
python3 extraction/produce_wikidata_tsv.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR} sample_data.tsv
python3 learning/evaluate_learnability.py sample_data.tsv --out report.json --wikidata ${DATA_DIR}wikidata/
```

See `learning/LearnabilityStudy.ipynb` for a visual analysis of the AUC scores.

### Evolve a type system

```bash
python3 extraction/evolve_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR} --method cem --penalty 0.00007
```

The method can be `cem`, `greedy`, `beam`, or `ga`; the penalty is the soft constraint on the size of the type system (lambda in the paper).

#### Convert a type system solution into a trainable type classifier

The output of `evolve_type_system.py` is a set of types (root + relation) that can be used to build a type system.
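The `--penalty` flag above trades oracle accuracy against type-system size. A hypothetical sketch of such a penalized score (the repo's actual scoring code may differ; the numbers below are invented):

```python
def score(accuracy, num_types, penalty=0.00007):
    """Penalized fitness: accuracy minus a soft cost per type (lambda in the paper)."""
    return accuracy - penalty * num_types

# A larger type system must buy enough accuracy to justify its size:
small = score(0.970, num_types=20)    # 0.970 - 0.0014
large = score(0.972, num_types=120)   # 0.972 - 0.0084
print(small > large)  # True: +0.002 accuracy does not pay for +100 types
```

Raising the penalty pushes the search (`cem`, `greedy`, `beam`, or `ga`) toward smaller type systems.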
To create a config file that can be used to train an LSTM, use the jupyter notebook `extraction/TypeSystemToNeuralTypeSystem.ipynb`.

### Train a type classifier using a type system

For each language, create a training file:

```
export LANGUAGE=en
python3 extraction/produce_wikidata_tsv.py extraction/configs/${LANGUAGE}_disambiguator_config_export.json /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_train.tsv --relative_to /Volumes/Samsung_T3/tahiti/2017-12/
```

Then, for each language, create an H5 file containing the mapping from tokens to their entity ids in Wikidata:

```
export LANGUAGE=en
python3 extraction/produce_windowed_h5_tsv.py /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_train.tsv /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_train.h5 /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_dev.h5 --window_size 10 --validation_start 1000000 --total_size 200500000
```

Create a training config with all languages, `my_config.json`. Paths to the datasets are relative to the config file (e.g. you can place it in the same directory as the dataset h5 files). [Note: set `wikidata_path` to where you extracted the wikidata information, and `classification_path` to where you exported the classifications with `project_graph.py`.]
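As an aside, the windowed train/dev split that `produce_windowed_h5_tsv.py` performs can be sketched conceptually as follows (a hypothetical illustration only; the real script streams TSV rows into H5 files, and its handling of labels and offsets may differ):

```python
def window_rows(rows, window_size, validation_start):
    """Cut a long (token, label) sequence into fixed-size windows and
    route windows starting at or after `validation_start` to the dev set."""
    train, dev = [], []
    for start in range(0, len(rows), window_size):
        window = rows[start:start + window_size]
        (dev if start >= validation_start else train).append(window)
    return train, dev

# Toy sequence of 25 rows, mirroring the --window_size / --validation_start flags:
rows = [("tok%d" % i, "Q%d" % i) for i in range(25)]
train, dev = window_rows(rows, window_size=10, validation_start=20)
print(len(train), len(dev))  # 2 1
```

With the flags shown above, rows before offset 1000000 become training windows of 10 tokens each, and later rows become the dev set.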
See `learning/configs` for a pre-written config covering English, French, Spanish, German, and Portuguese.

```
{
    "datasets": [
        {
            "type": "train",
            "path": "en_train.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {
                    "column": 1,
                    "objective": "type",
                    "classification": "type_classification"
                }, ...
            ],
            "comment": "#//#"
        },
        {
            "type": "dev",
            "path": "en_dev.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {
                    "column": 1,
                    "objective": "type",
                    "classification": "type_classification"
                }, ...
            ],
            "comment": "#//#"
        }, ...
    ],
    "features": [
        {
            "type": "word",
            "dimension": 200,
            "max_vocab": 1000000
        }, ...
    ],
    "objectives": [
        {
            "name": "type",
            "type": "softmax",
            "vocab": "type_classes.txt"
        }, ...
    ],
    "wikidata_path": "wikidata",
    "classification_path": "classifications"
}
```

Launch training on a single GPU:

```
CUDA_VISIBLE_DEVICES=0 python3 learning/train_type.py my_config.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000 --name TypeClassifier --weight_noise 1e-6 --save_dir my_great_model --anneal_rate 0.9999
```

Several key parameters:

- `name`: main scope for the model's variables; avoids name clashes when multiple classifiers are loaded.
- `batch_size`: how many examples are used for training simultaneously; large values can cause out-of-memory issues.
- `max_epochs`: length of training before auto-stopping. In practice this number should be larger than needed.
- `fused`: glue all output layers into one, and do a single matrix multiply (recommended).
- `hidden_sizes`: how many stacked LSTMs are used, and their sizes (here 2, each with 200 dimensions).
- `cudnn`: use faster CuDNN kernels for training.
- `anneal_rate`: shrink the learning rate by this amount every 33000 training steps.
- `weight_noise`: sprinkle Gaussian noise with this standard deviation on the weights of the LSTM (regularizer, recommended).

#### To test that training works

You can check that training works as expected using the dummy training set under `learning/test`, which contains a part-of-speech CRF objective and a cats-vs-dogs log-likelihood objective:

```bash
python3 learning/train_type.py learning/test/config.json
```

### Installation

#### Mac OSX

```
pip3 install -r requirements.txt
pip3 install wikidata_linker_utils_src/
```

#### Fedora 25

```
sudo dnf install redhat-rpm-config
sudo dnf install gcc-c++
sudo pip3 install marisa-trie==0.7.2
sudo pip3 install -r requirements.txt
pip3 install wikidata_linker_utils_src/
```
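Returning to the training flags for a moment: the `anneal_rate` behavior described above (multiply the learning rate by the rate every 33000 steps) can be sketched as a step-wise exponential schedule. This is a conceptual illustration, not the repo's training code:

```python
def learning_rate(base_lr, step, anneal_rate=0.9999, interval=33000):
    """Step-wise exponential decay: one multiplicative shrink per interval."""
    return base_lr * anneal_rate ** (step // interval)

print(learning_rate(0.001, 0))      # 0.001 (no intervals elapsed yet)
print(learning_rate(0.001, 32999))  # 0.001 (still within the first interval)
# After 33000 steps the rate has been multiplied by 0.9999 once.
```

With `--anneal_rate 0.9999` the decay is very gentle, which matches its use alongside very long training runs (`--max_epochs 10000`).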