https://github.com/rmraya/terms
Term extraction from XLIFF 2.0
java terminology-extraction yake
- Host: GitHub
- URL: https://github.com/rmraya/terms
- Owner: rmraya
- License: epl-1.0
- Created: 2021-01-14T11:59:45.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-11-02T10:57:13.000Z (17 days ago)
- Last Synced: 2024-11-02T11:27:41.368Z (17 days ago)
- Topics: java, terminology-extraction, yake
- Language: Java
- Homepage:
- Size: 1.74 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Terms Extractor
Java tools for extracting terms from XLIFF 2.0 files.
This project is based on the paper *YAKE! Keyword extraction from single documents using multiple local features* by Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes and Adam Jatowt.
## Requirements for building
- Java 21 (get it from [https://adoptium.net/](https://adoptium.net/))
- Apache Ant 1.10.14 or newer (get it from [https://ant.apache.org/bindownload.cgi](https://ant.apache.org/bindownload.cgi))

### Building
Follow these steps to build the project:
```bash
git clone https://github.com/rmraya/Terms.git
cd Terms
ant
```

A binary distribution will be created in the `dist` folder.
## Usage
Execute `dist/extractTerms.sh` or `dist\extractTerms.bat` and the program will display the following usage information:
``` bash
INFO: Usage:termExtractor [-version] [-help] -xliff xliffFile [-output outputFile] [-minFreq frequency] [-maxLenght length] [-maxScore score] [-generic]
Where:
-version: (optional) Display version information and exit
-help: (optional) Display this usage information and exit
-xliff: The XLIFF file to process
-output: (optional) The output file where the terms will be written
-maxLenght: (optional) The maximum number of words in a term. Default: 3
-minFreq: (optional) The minimum frequency for a term to be considered. Default: 3
-maxScore: (optional) The maximum score for a term to be considered. Default: 0.001
-generic: (optional) Include terms with relevance < 1.0. Default: false
```

By default, the program extracts terms with a minimum frequency of 3, a maximum length of 3 words, and a maximum score of 0.001. Terms with a relevance less than 1.0 are excluded by default.
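The defaults described above can be read as four independent filters on each candidate term. The following is a minimal sketch of that reading, not the project's actual code: the `TermFilterSketch` class, the `Candidate` record, and the sample values are hypothetical, and the inclusive comparisons (`>=`, `<=`) are an assumption, since the usage text does not say how boundary values are treated.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of the documented filters; not the project's real classes. */
class TermFilterSketch {

    /** One candidate term; field names are illustrative only. */
    record Candidate(String term, double score, int frequency, double relevance) {
        /** Number of whitespace-separated words in the term. */
        int wordCount() {
            return term.trim().split("\\s+").length;
        }
    }

    // Defaults as documented in the usage text.
    static final int DEFAULT_MIN_FREQ = 3;
    static final int DEFAULT_MAX_LENGTH = 3;
    static final double DEFAULT_MAX_SCORE = 0.001;

    /** Returns true when the candidate passes all filters (boundaries assumed inclusive). */
    static boolean accept(Candidate c, int minFreq, int maxLength,
                          double maxScore, boolean generic) {
        if (c.frequency() < minFreq) return false;   // -minFreq
        if (c.wordCount() > maxLength) return false; // -maxLenght
        if (c.score() > maxScore) return false;      // -maxScore
        // Assumption: without -generic, terms with relevance < 1.0 are dropped.
        return generic || c.relevance() >= 1.0;
    }

    /** Applies accept() to every candidate and keeps the ones that pass. */
    static List<Candidate> filter(List<Candidate> candidates, int minFreq,
                                  int maxLength, double maxScore, boolean generic) {
        return candidates.stream()
                .filter(c -> accept(c, minFreq, maxLength, maxScore, generic))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data, only to exercise each filter once.
        List<Candidate> sample = List.of(
                new Candidate("translation memory", 0.0004, 5, 1.8),
                new Candidate("file", 0.0009, 12, 0.6),    // relevance below 1.0
                new Candidate("segment", 0.0200, 7, 1.3),  // score above 0.001
                new Candidate("xliff", 0.0002, 2, 2.1));   // frequency below 3
        for (Candidate c : filter(sample, DEFAULT_MIN_FREQ, DEFAULT_MAX_LENGTH,
                                  DEFAULT_MAX_SCORE, false)) {
            System.out.println(c.term());
        }
    }
}
```

In this sketch, passing `true` as the last argument mirrors the `-generic` switch, so a candidate such as `file` above would also be kept.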
The program writes a CSV (comma-separated values) file named after the supplied XLIFF file, with the `.csv` extension, containing the following columns:
|Column| Description|
|:--:|--|
|#| The candidate term number|
|Term| The term candidate|
|Score| The term score, calculated using the values from the remaining columns.|
|Casing| Incidence of the term case when not used at the start of a sentence. The underlying rationale is that uppercase terms tend to be more relevant than lowercase ones.|
|Position| Incidence of the term position in the XLIFF file. The rationale is that relevant keywords tend to appear at the very beginning of a document, whereas words occurring in the middle or at the end of a document tend to be less important.|
|Frequency| The number of occurrences of the term in the XLIFF file.|
|Relevance| Inverse of the normalized term frequency. The rationale is that common words are less relevant than rare ones.|
|Relatedness| A value which aims to determine the dispersion of a candidate term with regards to its specific context, calculated considering the words that appear before and after the term in the same sentence.|
|Different| A measurement of how often a candidate term appears within different sentences. It reflects the assumption that candidates which appear in many different sentences have a higher probability of being important.|

## Credits
Stop word lists were taken from [https://github.com/Alir3z4/stop-words](https://github.com/Alir3z4/stop-words). Supported languages are:
- Arabic
- Bulgarian
- Catalan
- Czech
- Danish
- Dutch
- English
- Finnish
- French
- German
- Gujarati
- Hindi
- Hebrew
- Hungarian
- Indonesian
- Malaysian
- Italian
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Spanish
- Swedish
- Turkish
- Ukrainian
- Vietnamese
- Persian/Farsi