{"id":21019340,"url":"https://github.com/rmraya/terms","last_synced_at":"2025-12-30T06:58:20.378Z","repository":{"id":259000759,"uuid":"329601906","full_name":"rmraya/Terms","owner":"rmraya","description":"Term extraction from XLIFF 2.0","archived":false,"fork":false,"pushed_at":"2024-11-21T20:07:15.000Z","size":1845,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-20T12:45:06.390Z","etag":null,"topics":["java","terminology-extraction","yake"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rmraya.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-14T11:59:45.000Z","updated_at":"2024-11-21T20:07:23.000Z","dependencies_parsed_at":"2024-11-02T11:23:27.052Z","dependency_job_id":null,"html_url":"https://github.com/rmraya/Terms","commit_stats":null,"previous_names":["rmraya/terms"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmraya%2FTerms","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmraya%2FTerms/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmraya%2FTerms/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmraya%2FTerms/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rmraya","download_url":"https://codeload.github.com/rmraya/Terms/tar.gz/refs/heads/main","host":{"name":"GitHu
b","url":"https://github.com","kind":"github","repositories_count":243447641,"owners_count":20292455,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","terminology-extraction","yake"],"created_at":"2024-11-19T10:31:18.954Z","updated_at":"2025-12-30T06:58:20.373Z","avatar_url":"https://github.com/rmraya.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Terms Extractor\n\nJava tools for extracting terms from XLIFF 2.0 files.\n\nThis project is based on the paper *YAKE! Keyword extraction from single documents using multiple local features* by Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes and Adam Jatowt.\n\n## Features\n\n- **Monolingual Term Extraction**: Extract terms from source text in XLIFF files\n- **Bilingual Term Extraction**: Extract translation pair candidates from XLIFF files with confirmed translations\n- **Automatic Deduplication**: Intelligent merging of similar terms\n- **Multiple Quality Filters**: Co-occurrence, mutual best match, and relevance-based filtering\n\n## Requirements for building\n\n- Java 21 (get it from [https://adoptium.net/](https://adoptium.net/))\n- Gradle 9.2 or newer (get it from [https://gradle.org/install/](https://gradle.org/install/))\n\n### Building\n\nFollow these steps to build the project:\n\n```bash\ngit clone https://github.com/rmraya/Terms.git\ncd Terms\ngradle\n```\n\nA binary distribution will be created in `/dist` folder.\n\n## Usage\n\n### Monolingual Term Extraction\n\nExecute `dist/extractTerms.sh` or `dist\\extractTerms.cmd` and the program will 
display the following usage information:\n\n``` bash\nINFO: Usage:\n\n    termExtractor [-version] [-help] -xliff xliffFile [-output outputFile] [-minFreq frequency] [-maxLength length] [-maxScore score] [-generic] [-debug]\n\nWhere:\n\n        -version:   (optional) Display version information and exit\n        -help:      (optional) Display this usage information and exit\n        -xliff:     The XLIFF file to process\n        -output:    (optional) The output file where the terms will be written\n        -maxLength: (optional) The maximum number of words in a term. Default: 3\n        -minFreq:   (optional) The minimum frequency for a term to be considered. Default: 3\n        -maxScore:  (optional) The maximum score for a term to be considered. Default: 0.001\n        -generic:   (optional) Include terms with relevance \u003c 1.0. Default: false\n        -debug:     (optional) Enable debug mode with detailed logging. Default: false\n```\n\nBy default, the program extracts terms with a minimum frequency of 3, a maximum length of 3 words, and a maximum score of 10.0. All terms (both single-word and multi-word) are included by default.\n\nUse the `-relevant` flag to exclude single-word terms and focus only on multi-word terms and proper nouns (words with unusual capitalization patterns).\n\n**Output Format:**\n\nThe program writes a CSV (comma-separated values) file with the same name as the supplied XLIFF file with the `.csv` extension, containing the following columns:\n\n|Column| Description|\n|:--:|--|\n|#| The candidate term number|\n|Term| The term candidate|\n|Score| The term score, calculated using the values from the remaining columns.|\n|Casing| Incidence of the term case when not used at the start of a sentence. The underlying rationale is that uppercase terms tend to be more relevant than lowercase ones.|\n|Position| Incidence of the term position in the XLIFF file. 
The rationale is that relevant keywords tend to appear at the very beginning of a document, whereas words occurring in the middle or at the end of a document tend to be less important.|\n|Frequency| The number of occurrences of the term in the XLIFF file.|\n|Relevance| Inverse of the normalized term frequency. The rationale is that common words are less relevant than rare ones.|\n|Relatedness| A value which aims to determine the dispersion of a candidate term with regard to its specific context, calculated considering the words that appear before and after the term in the same sentence.|\n|Different| A measurement of how often a candidate term appears within different sentences. It reflects the assumption that candidates which appear in many different sentences have a higher probability of being important.|\n\n### Bilingual Term Extraction\n\nExecute `dist/bilingualExtractor.sh` or `dist\\bilingualExtractor.cmd` to extract translation pair candidates from bilingual XLIFF files:\n\n``` bash\nbilingualExtractor [-version] [-help] -xliff xliffFile [-output outputFile] \n                   [-minFreq frequency] [-maxLength length] [-maxScore score]\n                   [-minCoOccurrence count] [-maxPairs limit] [-minCoOccurrenceRatio ratio]\n                   [-debug]\n\nWhere:\n\n        -version:              (optional) Display version information and exit\n        -help:                 (optional) Display this usage information and exit\n        -xliff:                The XLIFF file to process (must contain translations with state=\"final\")\n        -output:               (optional) The output CSV file. Default: xliffFile_bilingual.csv\n        -maxLength:            (optional) Maximum number of words in a term. Default: 5\n        -minFreq:              (optional) Minimum frequency for a term. Default: 3\n        -maxScore:             (optional) Maximum YAKE score for a term. 
Default: 10.0\n        -minCoOccurrence:      (optional) Minimum times terms must co-occur. Default: 1\n        -maxPairs:             (optional) Maximum number of pairs to output (0 = unlimited). Default: 0\n        -minCoOccurrenceRatio: (optional) Minimum ratio of co-occurrence to total occurrences. Default: 0.7\n        -debug:                (optional) Enable debug mode with detailed logging. Default: false\n```\n\n**How It Works:**\n\n1. Processes only segments with `state=\"final\"` (confirmed translations)\n2. Extracts terms separately from source and target text using the YAKE algorithm\n3. Identifies term pairs that co-occur in the same segments\n4. Applies mutual best match filtering: keeps only pairs where each term's best match is the other\n5. Filters by co-occurrence count and ratio\n6. Deduplicates pairs, keeping the best-scoring variants\n\n**Quality Filters:**\n\n- **Mutual Best Match**: Ensures each source term's highest co-occurrence target is the paired target term, and vice versa. This eliminates false pairs from terms that merely appear in the same segment.\n- **Co-occurrence Ratio**: The default of 0.7 means terms must co-occur in at least 70% of segments where either term appears\n- **Minimum Length**: Terms must be at least 2 characters (eliminates single letters)\n\n**Output Format:**\n\nCSV file with the following columns:\n\n|Column|Description|\n|:--:|--|\n|Source Term|The source language term|\n|Source Score|YAKE score for the source term (lower is better)|\n|Source Frequency|Number of occurrences of source term|\n|Target Term|The target language term|\n|Target Score|YAKE score for the target term (lower is better)|\n|Target Frequency|Number of occurrences of target term|\n|Shared Segments|Segment numbers where both terms co-occur|\n|Co-occurrence Count|Number of segments where both terms appear together|\n\n## Term Deduplication\n\nThe program automatically deduplicates extracted terms using two strategies:\n\n1. 
**Case-insensitive matching**: Merges terms that differ only in capitalization (e.g., \"Machine Learning\" and \"machine learning\")\n2. **Similarity matching**: Merges terms that are similar based on Levenshtein distance with an 85% similarity threshold, including:\n   - Substring relationships (e.g., \"learning\" vs \"machine learning\")\n   - Minor spelling variations\n\nWhen duplicates are found, the program keeps the variant with the lowest score (best in YAKE), or if scores are equal, the one with the highest frequency.\n\n## Credits\n\nStop word lists extracted from [https://github.com/Alir3z4/stop-words](https://github.com/Alir3z4/stop-words). Supported languages are:\n\n- Arabic\n- Bulgarian\n- Catalan\n- Czech\n- Danish\n- Dutch\n- English\n- Finnish\n- French\n- German\n- Gujarati\n- Hindi\n- Hebrew\n- Hungarian\n- Indonesian\n- Malaysian\n- Italian\n- Norwegian\n- Polish\n- Portuguese\n- Romanian\n- Russian\n- Slovak\n- Spanish\n- Swedish\n- Turkish\n- Ukrainian\n- Vietnamese\n- Persian/Farsi\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frmraya%2Fterms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frmraya%2Fterms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frmraya%2Fterms/lists"}