{"id":48680840,"url":"https://github.com/sea-boat/TextAnalyzer","last_synced_at":"2026-06-15T12:01:46.493Z","repository":{"id":195794501,"uuid":"90715956","full_name":"sea-boat/TextAnalyzer","owner":"sea-boat","description":"A text analyzer which is based on machine learning,statistics and dictionaries that can analyze text.  So far, it supports hot word extracting, text classification, part of speech tagging, named entity recognition, chinese word segment, extracting address, synonym, text clustering, word2vec model, edit distance, chinese word segment, sentence similarity,word sentiment tendency, name recognition, idiom recognition, placename recognition, organization recognition, traditional chinese recognition, pinyin transform.","archived":false,"fork":false,"pushed_at":"2018-08-20T07:39:57.000Z","size":36200,"stargazers_count":210,"open_issues_count":6,"forks_count":75,"subscribers_count":21,"default_branch":"master","last_synced_at":"2026-01-25T01:54:16.311Z","etag":null,"topics":["segment","sentence-similarity","speech-tagging","synonyms","text-analyzer","text-classification","word2vec-model"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sea-boat.png","metadata":{"files":{"readme":"README.md","changelog":"change_log.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-05-09T07:30:08.000Z","updated_at":"2026-01-23T03:36:42.000Z","dependencies_parsed_at":"2023-09-19T16:54:16.438Z","dependency_job_id":null,"html_url":"https://github.com/sea-boat/TextAnalyzer","commit_stats":null,"previous_names":["sea-boat/textanalyzer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sea-boat/
TextAnalyzer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sea-boat%2FTextAnalyzer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sea-boat%2FTextAnalyzer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sea-boat%2FTextAnalyzer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sea-boat%2FTextAnalyzer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sea-boat","download_url":"https://codeload.github.com/sea-boat/TextAnalyzer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sea-boat%2FTextAnalyzer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34361403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["segment","sentence-similarity","speech-tagging","synonyms","text-analyzer","text-classification","word2vec-model"],"created_at":"2026-04-11T01:00:36.071Z","updated_at":"2026-06-15T12:01:46.482Z","avatar_url":"https://github.com/sea-boat.png","language":"Java","funding_links":[],"categories":["人工智能"],"sub_categories":["自然语言处理"],"readme":"# TextAnalyzer\r\n\r\nA text analyzer which is based on machine learning, statistics and 
dictionaries that can analyze text.\r\n\r\nSo far, it supports hot word extraction, text classification, part-of-speech tagging, named entity recognition, Chinese word segmentation, address extraction, synonyms, text clustering, the word2vec model, edit distance, sentence similarity, word sentiment tendency, name recognition, idiom recognition, place name recognition, organization recognition, traditional Chinese recognition, and pinyin transformation.\r\n\r\n# Features\r\n\r\n***extracting hot words from text.***\r\n1. gathering statistics by frequency.\r\n2. gathering statistics by the tf-idf algorithm.\r\n3. gathering statistics with an additional score factor.\r\n\r\n***extracting address from text.***\r\n\r\n***synonym can be recognized***\r\n\r\n***SVM Classificator***\r\n\r\nThis analyzer supports classifying text with an SVM, which involves vectorizing the text. We can train on the samples and then classify with the resulting model.\r\n\r\nFor convenience, the model, tf-idf and vectors will be stored.\r\n\r\n***kmeans clustering \u0026\u0026 xmeans clustering***\r\n\r\nThis analyzer supports clustering text with kmeans and xmeans.\r\n\r\n***vsm clustering***\r\n\r\nThis analyzer supports clustering text with vsm.\r\n\r\n***part of speech tagging***\r\n\r\nIt is implemented with an HMM model and decoded with the Viterbi algorithm.\r\n\r\n***google word2vec model***\r\n\r\nThis analyzer supports using a word2vec model.\r\n\r\n***chinese word segment***\r\n\r\nThis analyzer supports Chinese word segmentation.\r\n\r\n***edit distance***\r\n\r\nThis analyzer supports calculating edit distance at the char level or the word level.\r\n\r\n***sentence similarity***\r\n\r\nThis analyzer supports calculating similarity between two sentences.\r\n\r\n\r\n# How To Use\r\n\r\n***it is as simple as this***\r\n\r\n## Extracting Hot Words\r\n\r\n1. indexing a document to get a docId.\r\n\r\n```\r\nlong docId = TextIndexer.index(text);\r\n```\r\n\r\n2. 
extracting by docId.\r\n\r\n```\r\n HotWordExtractor extractor = new HotWordExtractor();\r\n List\u003cResult\u003e list = extractor.extract(0, 20, false);\r\n if (list != null) for (Result s : list)\r\n    System.out.println(s.getTerm() + \" : \" + s.getFrequency() + \" : \" + s.getScore());\r\n```\r\n\r\nEach result contains a term, its frequency and its score.\r\n\r\n```\r\n失业证 : 1 : 0.31436604\r\n户口 : 1 : 0.30099702\r\n单位 : 1 : 0.29152703\r\n提取 : 1 : 0.27927202\r\n领取 : 1 : 0.27581802\r\n职工 : 1 : 0.27381304\r\n劳动 : 1 : 0.27370203\r\n关系 : 1 : 0.27080503\r\n本市 : 1 : 0.27080503\r\n终止 : 1 : 0.27080503\r\n```\r\n\r\n## Extracting Address\r\n\r\n```\r\nString str =\"xxxx\";\r\nAddressExtractor extractor = new AddressExtractor();\r\nList\u003cString\u003e list = extractor.extract(str);\r\n```\r\n\r\n## SVM Classificator\r\n\r\n1. training the samples.\r\n\r\n```\r\nSVMTrainer trainer = new SVMTrainer();\r\ntrainer.train();\r\n```\r\n\r\n2. predicting text classification.\r\n\r\n```\r\ndouble[] data = trainer.getWordVector(text);\r\ntrainer.predict(data);\r\n```\r\n\r\n## Kmeans Clustering \u0026\u0026 Xmeans Clustering\r\n\r\n```\r\nList\u003cString\u003e list = DataReader.readContent(KMeansCluster.DATA_FILE);\r\nint[] labels = new KMeansCluster().learn(list);\r\n```\r\n\r\n## VSM Clustering\r\n\r\n```\r\nList\u003cString\u003e list = DataReader.readContent(VSMCluster.DATA_FILE);\r\nList\u003cString\u003e labels = new VSMCluster().learn(list);\r\n```\r\n\r\n## Part Of Speech Tagging\r\n```\r\nHMMModel model = new HMMModel();\r\nmodel.train();\r\nViterbiDecoder decoder = new ViterbiDecoder(model);\r\ndecoder.decode(words);\r\n```\r\n\r\n## Define Your Own Named Entity\r\n\r\nMITIE is an information extraction library from the MIT NLP team; its GitHub repository is https://github.com/mit-nlp/MITIE .\r\n\r\n***train total\\_word\\_feature\\_extractor***\r\n\r\nPrepare your word set; you can put it into a txt file in the 'data' directory.\r\n\r\nThen run the commands 
below:\r\n\r\n```\r\ngit clone https://github.com/mit-nlp/MITIE.git\r\ncd MITIE\r\ncd tools\r\ncd wordrep\r\nmkdir build\r\ncd build\r\ncmake ..\r\ncmake --build . --config Release\r\nwordrep -e data\r\n```\r\n\r\nFinally you get the total\\_word\\_feature\\_extractor model.\r\n\r\n\r\n***train ner\\_model***\r\n\r\nWe can use Java, C++ or Python to train the ner model; in every case, the total\\_word\\_feature\\_extractor model is required for training.\r\n\r\nif Java,\r\n\r\n```\r\nNerTrainer nerTrainer = new NerTrainer(\"model/mitie_model/total_word_feature_extractor.dat\");\r\n```\r\n\r\n\r\nif C++,\r\n\r\n```\r\nner_trainer trainer(\"model/mitie_model/total_word_feature_extractor.dat\");\r\n```\r\n\r\nif Python,\r\n\r\n```\r\ntrainer = ner_trainer(\"model/mitie_model/total_word_feature_extractor.dat\")\r\n```\r\n\r\n\r\n***build shared library***\r\n\r\nRun the commands below:\r\n\r\n```\r\ncd mitielib\r\nD:\\MITIE\\mitielib\u003emkdir build\r\nD:\\MITIE\\mitielib\u003ecd build\r\nD:\\MITIE\\mitielib\\build\u003ecmake ..\r\nD:\\MITIE\\mitielib\\build\u003ecmake --build . 
--config Release --target install\r\n```\r\n\r\nThe install step then reports the following:\r\n\r\n```\r\n-- Install configuration: \"Release\"\r\n-- Installing: D:/MITIE/mitielib/java/../javamitie.dll\r\n-- Installing: D:/MITIE/mitielib/java/../javamitie.jar\r\n-- Up-to-date: D:/MITIE/mitielib/java/../msvcp140.dll\r\n-- Up-to-date: D:/MITIE/mitielib/java/../vcruntime140.dll\r\n-- Up-to-date: D:/MITIE/mitielib/java/../concrt140.dll\r\n```\r\n\r\n\r\n## Word2vec\r\nThe word2vec model path must be set as a JVM system property at startup, for example `-Dword2vec.path=D:\\Google_word2vec_zhwiki1710_300d.bin`.\r\n\r\nUsing the Google model:\r\n\r\n```\r\nWord2Vec vec = Word2Vec.getInstance(true);\r\nSystem.out.println(\"狗|猫: \" + vec.wordSimilarity(\"狗\", \"猫\"));\r\n```\r\n\r\nUsing the Java model:\r\n\r\n```\r\nWord2Vec vec = Word2Vec.getInstance(false);\r\nSystem.out.println(\"狗|猫: \" + vec.wordSimilarity(\"狗\", \"猫\"));\r\n```\r\n\r\n\r\n## Segment\u0026Search\r\n```\r\nDictSegment segment = new DictSegment();\r\nSystem.out.println(segment.seg(\"我是中国人\"));\r\nSystem.out.println(segment.Search(\"我在广州市\"));\r\n```\r\n\r\n## Edit Distance\r\nchar level,\r\n\r\n```\r\nCharEditDistance cdd = new CharEditDistance();\r\ncdd.getEditDistance(\"what\", \"where\");\r\ncdd.getEditDistance(\"我们是中国人\", \"他们是日本人吖，四贵子\");\r\ncdd.getEditDistance(\"是我\", \"我是\");\r\n```\r\n\r\nword level,\r\n\r\n```\r\nList list1 = new ArrayList\u003cString\u003e();\r\nlist1.add(new EditBlock(\"计算机\",\"\"));\r\nlist1.add(new EditBlock(\"多少\",\"\"));\r\nlist1.add(new EditBlock(\"钱\",\"\"));\r\nList list2 = new ArrayList\u003cString\u003e();\r\nlist2.add(new EditBlock(\"电脑\",\"\"));\r\nlist2.add(new EditBlock(\"多少\",\"\"));\r\nlist2.add(new EditBlock(\"钱\",\"\"));\r\ned.getEditDistance(list1, list2);\r\n```\r\n\r\n## Sentence Similarity\r\n\r\n```\r\nString s1 = \"我们是中国人\";\r\nString s2 = \"他们是日本人，四贵子\";\r\nSentenceSimilarity ss = new SentenceSimilarity();\r\nSystem.out.println(ss.getSimilarity(s1, s2));\r\ns1 = \"我们是中国人\";\r\ns2 = 
\"我们是中国人\";\r\nSystem.out.println(ss.getSimilarity(s1, s2));\r\n```\r\n\r\n## Get Synonym via Cilin Dictionary\r\n\r\n```\r\nCilinDictionary dict = CilinDictionary.getInstance();\r\nSet\u003cString\u003e code = dict.getCilinCoding(\"人类\");\r\nSystem.out.println(dict.getCilinWords(code.iterator().next()));\r\n[全人类, 生人, 人类]\r\n```\r\n\r\n## Words' Similarity by Cilin\r\n```\r\nString s1 = \"中国人\";\r\nString s2 = \"炎黄子孙\";\r\nCilinSimilarity cs = new CilinSimilarity();\r\nSystem.out.println(cs.getSimilarity(s1, s2));\r\ns1 = \"汽车\";\r\ns2 = \"摩托\";\r\nSystem.out.println(cs.getSimilarity(s1, s2));\r\n```\r\n\r\n## Get Hownet Glossary\r\n```\r\nHownetGlossary glossary = HownetGlossary.getInstance();\r\nCollection\u003cTerm\u003e coll = glossary.getTerms(\"人类\");\r\nfor (Term t : coll)\r\n  System.out.println(t);\r\n```\r\n\r\n## Get Hownet Sememe\r\n```\r\nHownetSememe sememe = HownetSememe.getInstance();\r\nCollection\u003cString\u003e coll = sememe.getDefine(\"用具\");\r\nfor (String t : coll)\r\n  System.out.println(t);\r\n```\r\n\r\n## Hownet Words Similarity\r\n```\r\nHownetSimilarity hownetSimilarity = new HownetSimilarity();\r\nSystem.out.println(\"hownet similarity : \" + hownetSimilarity.getSimilarity(\"中国\", \"美国\"));\r\n```\r\n\r\n## Get Pinyin \r\n```\r\nSystem.out.println(PinyinUtil.getInstance().getPinyin(\"哈哈\"));\r\nSystem.out.println(PinyinUtil.getInstance().getPinyin(\"中\"));\r\nSystem.out.println(PinyinUtil.getInstance().getPinyin(\"中国\"));\r\n```\r\n\r\n## Pinyin Similarity\r\n```\r\nString s1 = \"今天\";\r\nString s2 = \"明天\";\r\nPinyinSimilarity cs = new PinyinSimilarity();\r\nSystem.out.println(cs.getSimilarity(s1, s2));\r\n```\r\n\r\n## Information Extractor\r\n### usage\r\nWe have provided Python and Java APIs for extractor,choose one of them.\r\n\r\n### python\r\ndo a predict by this below,\r\n```\r\npython crf_ner.py predict \"测试文本\" \"../model/crf.model\"\r\n```\r\n\r\n### java \r\n\r\n```\r\nList list = JCYExtractor.getIDs(text);\r\n\r\nlist = 
JCYExtractor.getNames(text);\r\n\r\nJCYExtractor.getAddrs(text);\r\n```\r\n\r\n### train a model \r\n1. Collect a corpus.\r\n2. Tag the corpus; we support the labels below,\r\n\r\n```\r\n# IB : ID beginning\r\n# IE : ID ending\r\n# IM : ID middle\r\n# U : unlabeled\r\n# PB : person beginning\r\n# PE : person ending\r\n# PM : person middle\r\n# BB : birthday beginning\r\n# BM : birthday middle\r\n# BE : birthday ending\r\n# LB : location beginning\r\n# LM : location middle\r\n# LE : location ending\r\n```\r\n\r\nfor example,\r\n\r\n```\r\n被\tU\r\n不\tU\r\n起\tU\r\n诉\tU\r\n人\tU\r\n伍\tPB\r\n某\tPM\r\n某\tPE\r\n，\tU\r\n```\r\n\r\n3. Put all samples into the `data/jcy_data/train` directory.\r\n4. Call the `train` function in the `crf_ner.py` script; the model is written to the `model` directory under the name `crf.model`.\r\n\r\n\r\n## Word Tendency\r\n\r\n```\r\nWordSentimentTendency tendency = new WordSentimentTendency();\r\nSystem.out.println(tendency.getTendency(\"高兴\"));\r\nSystem.out.println(tendency.getTendency(\"伤心\"));\r\n```\r\n\r\n## Chinese\u0026English Name Recognition\r\n\r\n```\r\nNameDict.get().searchName(\"汪建是华大基因董事长\");\r\nNameDict.get().searchEnglishName(\"Tom and Jim are my friends\");\r\n```\r\n\r\n## Idiom Recognition\r\n\r\n```\r\nIdiomDict.get().searchIdiom(\"从前有个人阿谀奉承\");\r\n```\r\n\r\n## Placename Recognition\r\n\r\n```\r\nPlacenameDict.get().searchPlacename(\"我住在天河北路，不在广州大道中，在天河区\");\r\n```\r\n\r\n## Organization Recognition\r\n\r\n```\r\nOrganizationDict.get().searchOrganization(\"去阿里巴巴找朋友\");\r\n```\r\n\r\n## Traditional Chinese Recognition\r\n\r\n```\r\nList\u003cInteger\u003e list = TraditionalDict.get().prefixSearch(\"1隻大狗\");\r\nfor(int i:list)\r\n\tSystem.out.println(TraditionalDict.get().getStringByIndex(i));\r\n```\r\n\r\n\r\n## Pinyin Transform 
\r\n\r\n```\r\nPinyinDict.get().getStringByIndex(PinyinDict.get().exactlySearch(\"一心一意\"));\r\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsea-boat%2FTextAnalyzer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsea-boat%2FTextAnalyzer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsea-boat%2FTextAnalyzer/lists"}