{"id":13741786,"url":"https://github.com/braunefe/Gargantua","last_synced_at":"2025-05-08T22:32:29.906Z","repository":{"id":146609980,"uuid":"47689799","full_name":"braunefe/Gargantua","owner":"braunefe","description":null,"archived":false,"fork":false,"pushed_at":"2015-12-09T12:32:58.000Z","size":141,"stargazers_count":12,"open_issues_count":1,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-08-03T04:08:57.487Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-2.1","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/braunefe.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-12-09T12:25:01.000Z","updated_at":"2024-02-27T15:54:32.000Z","dependencies_parsed_at":"2023-04-15T12:23:02.017Z","dependency_job_id":null,"html_url":"https://github.com/braunefe/Gargantua","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braunefe%2FGargantua","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braunefe%2FGargantua/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braunefe%2FGargantua/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braunefe%2FGargantua/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/braunefe","download_url":"https://codeload.github.com/braunefe/Gargantua/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224782047,"owners_count":17369078,"icon_url":"http
s://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:01:02.626Z","updated_at":"2024-11-15T12:31:17.978Z","avatar_url":"https://github.com/braunefe.png","language":"C++","funding_links":[],"categories":["FieldDB Webservices/Components/Plugins","Academic Research Paper-Specific Repositories"],"sub_categories":["Utilities","FieldDB Webservices/Components/Plugins"],"readme":"Sentence aligner presented in:\n\nFabienne Braune, Alexander Fraser (2010). Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING) - Posters, Beijing, China, August. \n\nIf you have any problems using this software, please contact: \nbraune.fabienne@gmail.com\n\nThis sentence aligner is optimized for the Europarl tagging format. \nIf your corpus is split into chapters, speakers or paragraphs, please convert your tags to the Europarl format.\n\n-----------------------------\n0. Clean up if already used\n-----------------------------\nchmod u+x clean.sh\n./clean.sh\n\n----------------------\n1. Prepare data\n-----------------------\na) The sentences to be aligned need to be in pairs of files with the same name and the extension .txt.\nb) Each file in one language has to have a corresponding file (with the same name) in the other language. 
In order to remove documents that are in one language only, you can use the Perl script remove-non-parallel-files.perl, which comes with the aligner.\nc) Each file has to contain one sentence per line and spaces between words.\nd) In order to sentence split and tokenize files, the split-sentences and tokenize scripts provided with the Europarl corpus can be used:\n\nhttp://www.statmt.org/europarl/\n\nImportant note 1: it is recommended to first sentence split the corpus in order to obtain the untokenized data. In a second step, tokenize the sentence split files in order to obtain the tokenized data.\n\n-----------------------\n2. Prepare filesystem\n-----------------------\nIn the directory SentenceAligner make the following directories:\n\nchmod u+x prepare-filesystem.sh\n./prepare-filesystem.sh\n\nMove (or link) your data in the corresponding directories:\n\nmv (ln -s) your_untokenized_source_language_files/* corpus_to_align/source_language_corpus_untokenized\nmv (ln -s) your_untokenized_target_language_files/* corpus_to_align/target_language_corpus_untokenized\nmv (ln -s) your_tokenized_source_language_files/* corpus_to_align/source_language_corpus_tokenized\nmv (ln -s) your_tokenized_target_language_files/* corpus_to_align/target_language_corpus_tokenized\n\nImportant note 2: If the data remains untokenized (i.e., the *_tokenized and *_untokenized files are the same), put the data in ALL directories.\n\n---------------------\n3. Prepare Data\n---------------------\nchmod u+x prepare-data.sh (includes lowercasing the data)\n./prepare-data.sh\n\n---------------------------------------------------------\n4. Compile source code\n---------------------------------------------------------\ncd src\nmake clean\nmake\n\n\n------------------------\n5. 
Run Aligner\n------------------------\n./sentence-aligner\n\t\t\t\t\t\t\t\t\t\t\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbraunefe%2FGargantua","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbraunefe%2FGargantua","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbraunefe%2FGargantua/lists"}