{"id":13741190,"url":"https://github.com/moses-smt/salm","last_synced_at":"2025-04-09T03:32:26.889Z","repository":{"id":12093631,"uuid":"14681980","full_name":"moses-smt/salm","owner":"moses-smt","description":"SALM: Suffix Array and its Applications in Empirical Language Processing by Joy","archived":false,"fork":false,"pushed_at":"2017-12-22T16:02:03.000Z","size":380,"stargazers_count":11,"open_issues_count":1,"forks_count":5,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-23T23:12:12.759Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moses-smt.png","metadata":{"files":{"readme":"Readme","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-11-25T09:55:10.000Z","updated_at":"2020-07-19T06:15:16.000Z","dependencies_parsed_at":"2022-09-14T00:20:54.435Z","dependency_job_id":null,"html_url":"https://github.com/moses-smt/salm","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fsalm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fsalm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fsalm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fsalm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moses-smt","download_url":"https://codeload.github.com/moses-smt/salm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":24797395
1,"owners_count":21026738,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:00:56.498Z","updated_at":"2025-04-09T03:32:26.492Z","avatar_url":"https://github.com/moses-smt.png","language":"C++","funding_links":[],"categories":["Software"],"sub_categories":["Utilities"],"readme":"SALM: Suffix Array tool kit for empirical Language Manipulations.\r\nBy Joy, joy@cs.cmu.edu\r\n\r\n1) Download the source code from: http://projectile.is.cs.cmu.edu/research/public/tools/salm/salm.htm or http://www.sourceforge.net/projects/salm\r\n2) Build binaries:\r\n\ta) For Linux platform:\r\n\t\tcd Distribution/Linux\r\n\t\tmake allO32 (for 32-bit platform)\r\n\t\tor\r\n\t\tmake allO64 (for 64-bit platform)\r\n\t\t\r\n\t\tBinaries are created under Bin/Linux\r\n\t\t\r\n\tb) For Win32 platform\r\n\t\tOpen the project files under Distribution/Win32 and use Visual C++ to build the executables.\r\n\t\tExecutables are placed under Bin/Win32\r\n\t\t\r\n3) Index a corpus.\r\n\tThe first step is to index a corpus using the IndexSA program.\r\n\tThere is no limit on the size of the corpus as long as there is enough RAM.\r\n\tA corpus of N words requires 9N bytes of memory during indexing.\r\n\t\r\n\tAnother constraint is that no sentence can have more than 254 words.\r\n\t\r\n\tSynopsis of IndexSA:\r\n\t\tIndexSA corpusFileName [existingIDVocabularyFile]\r\n\t\r\n\tThe optional existingIDVocabularyFile can be used to specify an existing vocabulary.\r\n\tIt will be updated if the corpus contains words that are new to the existing vocabulary.\r\n\tThis is useful if several corpora need to share a 
common vocabulary.\r\n\t\r\n\r\n4) Applications\r\n\tThe key functions for suffix array applications are provided in the classes C_SuffixArraySearchApplicationBase and C_SuffixArrayScanningBase.\r\n\tPlease check the documentation and API for more details.\r\n\t\r\n\tSample programs include:\r\n\t\r\n\t\tFrequencyOfNgrams: \r\n\t\t\tOutput the frequency of an n-gram in the training corpus\r\n\t\t\t\r\n\t\tNGramMatchingStat4TestSet\t\t\r\n\t\t\tOutput the n-gram token matching statistics of a test set\r\n\t\r\n\t\tNgramTypeInTestSetMatchedInCorpus\r\n\t\t\tOutput the n-gram type matching statistics of a test set\r\n\t\t\t\r\n\t\tNgramMatchingFreq4Sent\r\n\t\t\tOutput the frequencies of all the embedded n-grams in a sentence\r\n\t\t\t\r\n\t\tNgramMatchingFreqAndNonCompositionality4Sent\r\n\t\t\tOutput the non-compositionalities of the embedded n-grams in a sentence\r\n\t\t\t\r\n\t\tFilterDuplicatedSentences\r\n\t\t\tFilter out duplicated sentences in the training corpus and output the unique ones\r\n\t\t\t\t\t\r\n\t\tCollectNgramFreqCount\r\n\t\t\tGiven a list of n-grams and a list of training corpora indexed by their suffix arrays, collect the counts of the n-grams in these corpora. E.g. 
given a Chinese word list, one can collect the frequency of these words (as character n-grams) from several large corpora (segmented into characters).\r\n\r\n\t\tCalcCountOfCounts\r\n\t\t\tOutput the count-of-counts information of a corpus\r\n\t\t\t\r\n\t\tOutputHighFreqNgram\r\n\t\t\tSpecified by a configuration file, output the n-gram types that have frequencies higher than the threshold\r\n\t\r\n\t\tTypeTokenFreqInCorpus\r\n\t\t\tOutput the type/token statistics of the corpus\r\n\r\n5) Questions, comments and suggestions?\r\nPlease email joy+salm@cs.cmu.edu\r\n   ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoses-smt%2Fsalm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoses-smt%2Fsalm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoses-smt%2Fsalm/lists"}