https://github.com/dataoneorg/onto-dataonepython
Clone of Bitbucket ndigiuseppe/dataonepython ontology coverage project
- Host: GitHub
- URL: https://github.com/dataoneorg/onto-dataonepython
- Owner: DataONEorg
- Created: 2014-01-24T07:09:34.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2014-01-24T07:18:43.000Z (over 12 years ago)
- Last Synced: 2025-01-30T21:17:15.977Z (over 1 year ago)
- Language: Python
- Size: 31.4 MB
- Stars: 0
- Watchers: 6
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README
README
This document describes the Python packages in the DataONE ontology coverage project.
package:
corpusFetcher:
This package fetches a corpus and normalizes it. It also normalizes a thesaurus and performs part-of-speech tagging. However, to use the part-of-speech package, you must first download and install the Natural Language Toolkit (NLTK). The link is http://nltk.org/install.html
classes:
fetchCorpus: This is the main file to call. It gets the corpus from a remote location and normalizes it completely, running the steps in the right order
get_metadata: This file will call cn.dataone.org to get various corpus files and store them
lovins: This is a stemmer that follows the Lovins stemming algorithm
paicehusk: This is a stemmer that follows the Paice/Husk stemming algorithm
porter: This is a stemmer that follows the Porter stemming algorithm
porter2: This is a stemmer that follows the Snowball (Porter2) stemming algorithm
removeNumbers: This is part of the normalization process that removes words that do not contain enough English characters
removePunct: This is part of the normalization process that removes all punctuation
removeStop: This is part of the normalization process that removes stop words
removeUpper: This is part of the normalization process that turns everything to lower case
stemWords: This is part of the normalization process that stems all words
These files normalize the same corpus in a pipeline fashion: each step writes out a new file. The files are placed in the "data" directory of the parent directory. The final product is a file called finishedCorpus_6.txt
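The normalization steps above can be sketched in memory as a chain of small functions. This is only an illustration, not the project's code: the function names, the stop-word list, the letter-ratio threshold, the step order, and the naive suffix stripper (a stand-in for the Lovins, Paice/Husk, or Porter stemmers) are all assumptions. The real modules write one file per step rather than passing strings.

```python
import string

# Hypothetical stop-word list for illustration; the project's real list is not shown.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def remove_upper(text):
    """Turn everything to lower case."""
    return text.lower()


def remove_punct(text):
    """Replace each punctuation character with a space so words stay separated."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_numbers(text, min_ratio=0.5):
    """Drop words whose share of ASCII letters is below min_ratio (assumed threshold)."""
    kept = []
    for word in text.split():
        letters = sum(1 for ch in word if ch in string.ascii_letters)
        if letters / len(word) >= min_ratio:
            kept.append(word)
    return " ".join(kept)


def remove_stop(text):
    """Drop stop words."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def naive_stem(word):
    """Stand-in stemmer: strips a few common suffixes. Not Porter or Lovins."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_words(text):
    """Stem every word with the stand-in stemmer."""
    return " ".join(naive_stem(w) for w in text.split())


def normalize(text):
    """Run the steps in one plausible order (the project's actual order is not documented here)."""
    for step in (remove_upper, remove_punct, remove_numbers, remove_stop, stem_words):
        text = step(text)
    return text


if __name__ == "__main__":
    print(normalize("The Sensors, in 2013, recorded 42 readings of H2O!"))
```

Keeping each step as its own function mirrors the one-file-per-step layout described above, so a step can be swapped or reordered without touching the others.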
OntologyWorker:
classes:
fetchOntology: This file takes a list of URLs and downloads the ontologies from the web. It is written specifically to gather the SWEET ontologies and is not generalized.
ontologyStemmer: This file stems the class names in an ontology and then overwrites it. Pass in as a parameter a directory containing OWL ontologies (only those are stemmed)
partOfSpeachTagger:
classes:
PoSTagger: This file takes a string as an argument and returns a list of tuples, each pairing a word with its PoS tag. It filters out everything except nouns, adverbs, adjectives, and verbs
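A minimal sketch of the filtering described above, assuming the tags are Penn Treebank tags as produced by NLTK's default English tagger. The names filter_tags and tag_and_filter are hypothetical and are not taken from PoSTagger itself. The tagging helper needs NLTK and its tokenizer and tagger data to be installed; the filter alone has no dependencies.

```python
# Penn Treebank tag prefixes for nouns, adjectives, adverbs, and verbs.
KEPT_PREFIXES = ("NN", "JJ", "RB", "VB")


def filter_tags(tagged):
    """Keep only (word, tag) tuples whose tag marks a noun, adjective, adverb, or verb."""
    return [(word, tag) for word, tag in tagged if tag.startswith(KEPT_PREFIXES)]


def tag_and_filter(text):
    """Tokenize and tag a string with NLTK, then filter the result."""
    import nltk  # imported lazily so filter_tags works without NLTK installed

    return filter_tags(nltk.pos_tag(nltk.word_tokenize(text)))


if __name__ == "__main__":
    sample = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
              ("runs", "VBZ"), ("very", "RB"), ("over", "IN")]
    print(filter_tags(sample))
```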
thesaurusFixer:
classes:
MergeKeyValuePairs: Stemming a thesaurus file can cause some keys to become identical. This file merges those keys and their corresponding values within a file. You need to pass in 2 arguments (the input and output paths); otherwise it falls back to a hardcoded path that (likely) does not exist
ThesaurusStemmer: This file takes a thesaurus file and stems all the words within it. You need to pass in 2 arguments (the input and output paths); otherwise it falls back to a hardcoded path that (likely) does not exist
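The key merging that becomes necessary after stemming can be sketched as follows. The thesaurus file format is not documented here, so this example assumes the entries have already been parsed into (key, list of values) pairs; the function name and that representation are hypothetical.

```python
def merge_key_value_pairs(pairs):
    """Merge entries that share a key, keeping each value once in first-seen order."""
    merged = {}
    for key, values in pairs:
        bucket = merged.setdefault(key, [])
        for value in values:
            if value not in bucket:
                bucket.append(value)
    return merged


if __name__ == "__main__":
    # After stemming, "rivers" and "river" might both have become "river".
    stemmed = [
        ("river", ["stream"]),
        ("lake", ["pond"]),
        ("river", ["creek", "stream"]),
    ]
    print(merge_key_value_pairs(stemmed))
```

A list is used for the values instead of a set so that the merged output keeps a stable order, which makes the resulting file easier to compare between runs.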
wordNet:
classes:
wordNetHandler: This file uses WordNet to generate synonyms for specific "words" (i.e. a string containing a single word). However, because WordNet's synonym generator is ... bad, it's not used.
Directories:
data:
This directory contains a variety of folders, including the various levels of the normalized corpus.