{"id":13593581,"url":"https://github.com/soberqian/TopicModel4J","last_synced_at":"2025-04-09T05:31:46.026Z","repository":{"id":43652141,"uuid":"190860520","full_name":"soberqian/TopicModel4J","owner":"soberqian","description":"TopicModel4J: A Java Package for Topic Models (Contain LDA, Collapsed Variational Bayesian Inference for LDA, author-topic model, BTM, dirichlet multinomial mixture model, DPMM, Dual Sparse Topic Model, GaussianLDA, Hierarchical Dirichlet processes, Labeled LDA, Link LDA, Pseudo-document-based Topic Model,  Sentence LDA and so on)","archived":false,"fork":false,"pushed_at":"2023-02-04T08:50:25.000Z","size":11401,"stargazers_count":29,"open_issues_count":5,"forks_count":8,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-06T15:43:46.543Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soberqian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-06-08T07:48:31.000Z","updated_at":"2024-10-19T01:29:15.000Z","dependencies_parsed_at":"2022-08-22T19:10:54.969Z","dependency_job_id":"e251b1c1-f117-46c2-a4dc-f89f7b571cda","html_url":"https://github.com/soberqian/TopicModel4J","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soberqian%2FTopicModel4J","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soberqian%2FTopicModel4J/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soberqian%2FTopicModel4J/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soberqian%2FTopicModel4J/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soberqian","download_url":"https://codeload.github.com/soberqian/TopicModel4J/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247986768,"owners_count":21028886,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:01:21.781Z","updated_at":"2025-04-09T05:31:43.786Z","avatar_url":"https://github.com/soberqian.png","language":"Java","funding_links":[],"categories":["Java","Models"],"sub_categories":["Hierarchical Dirichlet Process (HDP) [:page_facing_up:](https://papers.nips.cc/paper/2004/file/fb4ab556bc42d6f0ee0f9e24ec4d1af0-Paper.pdf)"],"readme":"# Software Introduction\nThis Java package corresponds to my research paper as follows:\u003cbr /\u003e\nQian Y, Jiang Y, Chai Y, et al. TopicModel4J: A Java Package for Topic Models[J]. arXiv preprint arXiv:2010.14707, 2020.\u003cbr /\u003e\n\u003cbr /\u003e\nThis package is about Topic Models for Natural Language Processing (NLP). And **it provides an easy-to-use interface for researchers and data analysts**.\u003cbr /\u003e\n\nMotivations：I develop this Java package to promote related research about Topic Models for Natural Language Processing (NLP). 
\u003cbr /\u003e\n\nWhen submitting my research paper to a journal, I will publicly release all the source code.\u003cbr /\u003e\n\n# Jar Dependency\nTo use this package, you need to download the following Java jars: commons-math3-3.5.jar, lingpipe-4.1.0.jar, stanford-corenlp-3.9.1-models.jar, stanford-corenlp-3.9.1-sources.jar, stanford-corenlp-3.9.1.jar. The stanford-corenlp 3.9.1 jars can be downloaded from this website: http://central.maven.org/maven2/edu/stanford/nlp/stanford-corenlp/3.9.1/.\n\n# Data Preprocessing for NLP\nThis software can do the following text preprocessing:\u003cbr /\u003e\n* (1) Split sentences into words.\u003cbr /\u003e\n* (2) Lowercase the words and perform lemmatization.\u003cbr /\u003e\n* (3) Remove useless characters, URLs and stop words.\u003cbr /\u003e\n\nThe first example is as follows:\u003cbr /\u003e\n```java\nimport java.util.ArrayList;\nimport com.topic.utils.FileUtil;\npublic class RawDataProcess {\n\t/**\n\t * Functions:\n\t * \n\t * (1) Split sentences into words\n\t * (2) Lowercase the words and perform lemmatization\n\t * (3) Remove special characters (e.g., #, % and \u0026), URLs and stop words\n\t * \n\t * @author: Yang Qian\n\t */\n\tpublic static void main(String[] args) {\n\t\t\tString line = \"http://t.cn/RAPgR4n Artificial intelligence is a known phenomenons \"\n\t\t\t\t\t+ \"in the world today. Its root started to build years \"\n\t\t\t\t\t+ \"ago but the tree started to grow long after. 
Months ago when our beloved google assistant made her first \"\n\t\t\t\t\t+ \"call to book a haircut appointment in the Google IO event,\";\n\t\t\t//get all word for a document\n\t\t\tArrayList\u003cString\u003e words = new ArrayList\u003cString\u003e();\n\t\t\t//lemmatization using StanfordCoreNLP\n\t\t\tFileUtil.getlema(line, words);\n\t\t\t//remove noise words\n\t\t\tString text = FileUtil.RemoveNoiseWord(words);\n\t\t\tSystem.out.println(text);\n\t}\n}\n```\nRunning this code, we can obtain the following results:\u003cbr /\u003e\n```java\nartificial intelligence phenomenon world today root start build year ago tree start grow long month ago beloved google assistant make call book haircut appointment Google IO event\n```\nIf we want deal a file which a line represent one document. For example,\n```java\nWe present a new algorithm for domain adaptation improving upon a discrepancy minimization algorithm, (DM), previously shown to outperform a number of algorithms for this problem. \nWe investigated the feature map inside deep neural networks (DNNs) by tracking the transport map. We are interested in the role of depth--why do DNNs perform better than shallow models?\n```\nWe denote this file as 'rawdata'. 
It can be processed with the following code:\n```java\nimport java.io.IOException;\nimport java.util.ArrayList;\nimport com.topic.utils.FileUtil;\n\npublic class RawDataProcessing {\n\t/**\n\t * Functions:\n\t * \n\t * (1) Split sentences into words\n\t * (2) Lowercase the words and perform lemmatization\n\t * (3) Remove special characters (e.g., #, % and \u0026), URLs and stop words\n\t * \n\t * @author: Yang Qian\n\t */\n\tpublic static void main(String[] args) throws IOException {\n\t\t//read data\n\t\tArrayList\u003cString\u003e docLines = new ArrayList\u003cString\u003e();\n\t\tFileUtil.readLines(\"data/rawdata\", docLines, \"gbk\");\n\t\tArrayList\u003cString\u003e doclinesAfter = new ArrayList\u003cString\u003e();\n\t\tfor(String line : docLines){\n\t\t\t//get all words for a document\n\t\t\tArrayList\u003cString\u003e words = new ArrayList\u003cString\u003e();\n\t\t\t//lemmatization using StanfordCoreNLP\n\t\t\tFileUtil.getlema(line, words);\n\t\t\t//remove noise words\n\t\t\tString text = FileUtil.RemoveNoiseWord(words);\n\t\t\tdoclinesAfter.add(text);\n\t\t}\n\t\t// write data\n\t\tFileUtil.writeLines(\"data/rawdata_process\", doclinesAfter, \"gbk\");\n\t}\n}\n```\n\n# Algorithm for NLP\nThe algorithms in this package include **Latent Dirichlet Allocation (LDA), Biterm Topic Model (BTM), Author-topic Model (ATM), Dirichlet Multinomial Mixture Model (DMM), Dual-Sparse Topic Model (DSTM), Labeled LDA, Link LDA, Sentence-LDA, Pseudo-document-based Topic Model (PTM), Hierarchical Dirichlet processes, Collaborative topic Model (CTM), Gaussian LDA and so on**. Below, I introduce how to use the package to run some of these algorithms.\n\n## Latent Dirichlet Allocation (Collapsed Gibbs sampling)\nReference: (1) Griffiths T. Gibbs sampling in the generative model of latent dirichlet allocation[J]. 2002.\u003cbr /\u003e\n           (2) Heinrich G. Parameter estimation for text analysis[R]. 
Technical report, 2005.\u003cbr /\u003e\nThe following code calls the LDA algorithm to process text:\u003cbr /\u003e\n```java\nimport com.topic.model.GibbsSamplingLDA;\n\npublic class LDAGibbsSamplingTest {\n\n\tpublic static void main(String[] args) {\n\t\tGibbsSamplingLDA lda = new GibbsSamplingLDA(\"data/rawdata_process_lda\", \"gbk\", 50, 0.1,\n\t\t\t\t0.01, 500, 50, \"data/ldaoutput/\");\n\t\tlda.MCMCSampling();\n\n\t}\n}\n```\nwhere the constructor GibbsSamplingLDA() is:\n```java\npublic GibbsSamplingLDA(String inputFile, String inputFileCode, int topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\nThe input file ('rawdata_process_lda') contains many documents, for example: \u003cbr /\u003e\n\n![input file](https://img-blog.csdnimg.cn/2019060820040440.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9xaWFueWFuZy1oZnV0LmJsb2cuY3Nkbi5uZXQ=,size_16,color_FFFFFF,t_70#pic_center)\n\nRunning LDAGibbsSamplingTest.java, we obtain the result after some iterations. \u003cbr /\u003e\n![Running](https://img-blog.csdnimg.cn/20190608200759730.png#pic_center)\n\nThe output consists of 'LDAGibbs_topic_word_50.txt' and 'LDAGibbs_doc_topic50.txt'. 
The content of 'LDAGibbs_topic_word_50.txt' likes: \u003cbr /\u003e\n```java\nTopic:1\nstudy :0.03364301916742469\nstudent :0.029233711281785802\nonline :0.01600578762486915\ngame :0.01502594142806051\nteacher :0.012739633635507014\nsocial :0.01192309513816648\nactivity :0.010453325842953519\nexamine :0.01029001814348541\ntechnology :0.00980009504508109\n...\n\nTopic:2\nfuzzy :0.07505158709641029\nmethod :0.031024330934552934\ndecision :0.02585387024650563\ncriterion :0.021780173946831995\npropose :0.021310132066100423\nbase :0.017706477647158363\nnumber :0.016609713258784693\nproblem :0.015982990751142595\nuncertainty :0.013632781347484729\nset :0.012692697586021583\nmake :0.012536016959111058\npaper :0.012379336332200534\nrisk :0.011752613824558436\n...\n```\n\n##  Latent Dirichlet Allocation (Collapsed Variational Bayesian Inference)\nWe also use Collapsed Variational Bayesian Inference (CVBI) for learning the parameters of LDA. \u003cbr /\u003e\nReference: (1)Teh Y W, Newman D, Welling M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation[C]//Advances in neural information processing systems. 2007: 1353-1360. \u003cbr /\u003e\n(2)Asuncion A, Welling M, Smyth P, et al. On smoothing and inference for topic models[C]//Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 2009: 27-34. 
\u003cbr /\u003e\n\nThe following code calls the CVB algorithm to process text:\u003cbr /\u003e\n```java\nimport com.topic.model.CVBLDA;\n\npublic class CVBLDATest {\n\n\tpublic static void main(String[] args) {\n\t\tCVBLDA cvblda = new CVBLDA(\"data/rawdata_process_lda\", \"gbk\", 30, 0.1,\n\t\t\t\t0.01, 200, 50, \"data/ldaoutput/\");\n\t\tcvblda.CVBInference();\n\t}\n}\n```\nwhere the constructor CVBLDA() is:\n```java\npublic CVBLDA(String inputFile, String inputFileCode, int topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\nRunning CVBLDATest.java, we obtain results like those of LDAGibbsSamplingTest.java. \u003cbr /\u003e\n\n## Labeled LDA\nWe use Gibbs sampling to implement the Labeled LDA algorithm. \u003cbr /\u003e\nReference: Ramage D, Hall D, Nallapati R, et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. 
Association for Computational Linguistics, 2009: 248-256.\u003cbr /\u003e\n\nThe following code is to call the Labeled LDA algorithm for processing text:\u003cbr /\u003e\n```java\nimport com.topic.model.LabeledLDA;\n\npublic class LabeledLDATest {\n\n\tpublic static void main(String[] args) {\n\t\tLabeledLDA llda = new LabeledLDA(\"data/rawdata_process_author\", \"gbk\", 0.1,\n\t\t\t\t0.01, 500, 50, \"data/ldaoutput/\");\n\t\tllda.MCMCSampling();\n\t}\n}\n```\n\nWhere the constructor method LabeledLDA() is:\u003cbr /\u003e\n```java\npublic LabeledLDA(String inputFile, String inputFileCode,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\nThe input file ('rawdata_process_author') contains many document with labels, like: \u003cbr /\u003e\n```java\n457720--578743--643697--840908--874627--975162--1058302--1275106--1368496--1769120--1769130--2135000\tpaper present indoor navigation range strategy monocular camera exploit architectural orthogonality indoor environment introduce method estimate range vehicle state monocular camera visionbased SLAM navigation strategy assume indoor indoorlike manmade environment layout previously unknown gpsdenied representable energy base feature point straight architectural line experimentally validate propose algorithm fully selfcontained microaerial vehicle mav sophisticated onboard image processing slam capability building enable small aerial vehicle fly tight corridor significant technological challenge absence gps signal limited sense option experimental result show systemis limit capability camera environmental entropy\n273266--1065537--1120593--1474359--1976664--2135000\tglobalisation education increasingly topic discussion university worldwide hand industry university leader emphasise increase awareness influence global marketplace skill graduate time emergence tertiary education export market prompt university develop international recruitment strategy offer 
international student place undergraduate graduate degree programme article examine phenomenon globalisation emergence global intercultural collaboration delivery education effort global intercultural collaboration offer institution student learn successful approach\n```\n\nWhere the label and the document are segmented by '\\t'. The label can be String character.\nRunning the LabeledLDATest.java, we can output two files (LabeledLDA_topic_word.txt and LabeledLDA_doc_topic.txt). \u003cbr /\u003e \nThe contents of 'LabeledLDA_topic_word.txt' like: \u003cbr /\u003e \n```java\nTopic:1\nsystem :0.008885972224685621\ncar :0.008885972224685621\nmf :0.007112325074049769\nstalk :0.0053386779234139165\nspeed :0.0053386779234139165\nyear :0.0053386779234139165\n...\n\nTopic:2\nresidual :0.017458207100978618\nlease :0.015278655652666681\ncash :0.015278655652666681\nplan :0.013099104204354743\ncar :0.010919552756042806\nprice :0.00874000130773087\ntexas :0.00874000130773087\nbuy :0.006560449859418931\n...\n```\n## Partially Labeled Dirichlet Allocation\nGibbs sampling for Partially Labeled Dirichlet Allocation\n\nReference:Ramage D, Manning C D, Dumais S. Partially labeled topic models for interpretable text mining[C]//Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 
ACM, 2011: 457-465.\n```java\npackage example;\n\nimport com.topic.model.PLDA;\n\npublic class PLDATest {\n\n\tpublic static void main(String[] args) {\n\t\tPLDA plda = new PLDA(\"data/rawdata_process_author\", \"gbk\", 3, 0.1,\n\t\t\t\t0.01, 500, 50, \"data/ldaoutput/\");\n\t\tplda.MCMCSampling();\n\n\t}\n\n}\n```\nwhere the constructor PLDA() is:\u003cbr /\u003e\n```java\npublic PLDA(String inputFile, String inputFileCode,int label_topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\n\n\n## Sentence LDA\nWe use collapsed Gibbs sampling to implement Sentence-LDA.\u003cbr /\u003e \nReference: (1) Jo Y, Oh A H. Aspect and sentiment unification model for online review analysis[C]//Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 2011: 815-824.\u003cbr /\u003e \n(2) Büschken J, Allenby G M. Sentence-based text analysis for customer reviews[J]. Marketing Science, 2016, 35(6): 953-975.\u003cbr /\u003e \n\nThe following code calls the Sentence LDA algorithm to process text:\u003cbr /\u003e\n```java\nimport com.topic.model.SentenceLDA;\n\npublic class SentenceLDATest {\n\n\tpublic static void main(String[] args) {\n\t\tSentenceLDA sentenceLda = new SentenceLDA(\"data/rawdata_sentenceLDA\", \"gbk\", 50, 0.1,\n\t\t\t\t0.01, 500, 50, \"data/ldaoutput/\");\n\t\tsentenceLda.MCMCSampling();\n\t}\n}\n```\n\nwhere the constructor SentenceLDA() is:\u003cbr /\u003e\n```java\npublic SentenceLDA(String inputFile, String inputFileCode, int topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\nThe input file ('rawdata_sentenceLDA') contains many documents, for example: \u003cbr /\u003e\n\n```java\nfundamental step software design process selection refinement implementation data abstraction--step traditionally involve investigate expect performance system refinement 
abstraction select single alternative minimize performance cost metric--paper reformulate design step allow refinement datum abstraction computation--reformulation reflect fact implementation data abstraction dependent behavior exhibit object abstraction--behavior vary object computation single refinement inappropriate--framework present understanding represent variation behavior object potential multiple implementation--framework base static partitioning object disjoint implementation class static partitioning class implementation region dynamic partitioning class implementation region--framework analytic tool useful investigate expect performance multiple implementation describe detail\npreface front matter full preface advance design production computer hardware bring people direct contact computer--similar advance design production computer software require order increase contact rewarding--smalltalk-80 system result decade research create computer software produce highly functional interactive contact personal computer system--book detailed account smalltalk-80 system--divide major part Part overview concept syntax programming language--Part annotated illustrated specification system functionality--Part design implementation moderate-size application--Part specification smalltalk-80 virtual machine\n```\nEach line is one document, and sentences within a document are separated by '--'. \u003cbr /\u003e\nRunning SentenceLDATest.java produces two output files (SentenceLDA_doc_topic50.txt and SentenceLDA_topic_word_50.txt). \u003cbr /\u003e \n\n## BTM\nWe use collapsed Gibbs sampling to implement the biterm topic model.\u003cbr /\u003e \nReference: (1) Cheng X, Yan X, Lan Y, et al. Btm: Topic modeling over short texts[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(12): 2928-2941.\u003cbr /\u003e \n(2) Yan X, Guo J, Lan Y, et al. A biterm topic model for short texts[C]//Proceedings of the 22nd international conference on World Wide Web. 
ACM, 2013: 1445-1456.\u003cbr /\u003e \nThe following code calls the BTM algorithm to process text:\u003cbr /\u003e\n```java\nimport com.topic.model.BTM;\n\npublic class BTMTest {\n\n\tpublic static void main(String[] args) {\n\t\tBTM btm = new BTM(\"data/shortdoc.txt\", \"gbk\", 15, 0.1,\n\t\t\t\t0.01, 1000, 30, 50, \"data/ldaoutput/\");\n\t\tbtm.MCMCSampling();\n\t}\n}\n```\n\nwhere the constructor BTM() is:\u003cbr /\u003e\n```java\npublic BTM(String inputFile, String inputFileCode, int topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords, int windowS,\n\t\t\tString outputFileDir)\n```\nThe input file ('shortdoc.txt') contains one short document per line. The first five documents are: \u003cbr /\u003e\n\n```java\niphone crack iphone \nadding support iphone announced \nyoutube video guy siri pretty love \nrim made easy switch iphone yeah \nrealized ios \n```\nRunning BTMTest.java produces four output files:\u003cbr /\u003e\n![BTM output files](https://img-blog.csdnimg.cn/20190608205429516.png#pic_center)\n\nThe contents of 'BTM_topic_word_15.txt' look like: \u003cbr /\u003e \n```java\nTopic:1\nlove :0.06267534660746875\nmarket :0.04905619262931387\nnexus :0.04360853103805192\nshare :0.03271320785552802\nvideo :0.02998937705989704\nwow :0.02998937705989704\nbeautiful :0.02998937705989704\nshit :0.02998937705989704\n...\n\nTopic:2\nscream :0.05755999328746434\nandroid :0.05036799079423681\nshit :0.04557332246541846\ngame :0.03838131997219093\nhaven :0.03598398580778175\ntalk :0.03118931747896339\npeople :0.028791983314554216\nmango :0.026394649150145038\njob :0.02399731498573586\nnice :0.02399731498573586\n...\n```\n\n##  Pseudo-document-based Topic Model\nCollapsed Gibbs sampling in the generative model of Pseudo-document-based Topic Model\u003cbr /\u003e \nReference: Zuo Y, Wu J, Zhang H, et al. 
Topic modeling of short texts: A pseudo-document view[C]//Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 2016: 2105-2114.\u003cbr /\u003e \n\nThe following code is to call the PTM algorithm for processing text:\u003cbr /\u003e\n```java\nimport com.topic.model.PseudoDTM;\n\npublic class PTMTest {\n\n\tpublic static void main(String[] args) {\n\t\tPseudoDTM ptm = new PseudoDTM(\"data/shortDocument.txt\", \"gbk\", 300, 50, 0.1, 0.1,\n\t\t\t\t0.01, 500, 50, \"data/ldaoutput/\");\n\t\tptm.MCMCSampling();\n\t}\n\n}\n```\n\nWhere the constructor method PseudoDTM() is:\u003cbr /\u003e\n\n```java\npublic PseudoDTM(String inputFile, String inputFileCode, int pDocumentNumber, int topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, double inputLambada, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\n\n\nThe input file ('shortDocument.txt') contains many document (5 documents), like: \u003cbr /\u003e\n```java\n470 657\n2139 3204 3677\n109 111 448 2778 2980 3397 3405 3876\n117 4147\n66 375\n```\nThe output contains three file \u003cbr /\u003e:\n\n![在这里插入图片描述](https://img-blog.csdnimg.cn/20190704084048257.png)\n\nThe contents of 'PseudoDTM_topic_word_50.txt' like: \u003cbr /\u003e \n```\nTopic:1\n837 :0.04213507251351584\n447 :0.032695443233502104\n3217 :0.029262850768042567\n579 :0.026688406418947912\n407 :0.024972110186218144\n2567 :0.024113962069853258\n2954 :0.024113962069853258\n...\n\nTopic:2\n159 :0.05377295861916353\n172 :0.04270856384155786\n59 :0.03701830367021781\n850 :0.03670217810514336\n65 :0.033224796889324434\n412 :0.0316441690639522\n69 :0.03132804349887775\n587 :0.029747415673505515\n703 :0.028166787848133274\n802 :0.02627003445768659\n153 :0.02468940663231435\n146 :0.022792653241867668\n3683 :0.022160402111718772\n...\n```\n\n##  Author-topic Model\nCollapsed Gibbs sampling for author-topic model\n\nReference:Rosen-Zvi M, Griffiths T, Steyvers M, et al. 
The author-topic model for authors and documents[C]//Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004: 487-494.\n```java\nimport com.topic.model.AuthorTM;\n\npublic class ATMTest {\n\n\tpublic static void main(String args[]) throws Exception{\n\t\tAuthorTM lda = new AuthorTM(\"/home/qianyang/dualsparse/rawdata_process_author\", \"gbk\", 25, 0.1,\n\t\t\t\t0.01, 500, 50, \"/home/qianyang/dualsparse/output/\");\n\t\tlda.MCMCSampling();\n\t}\n\n}\n```\nWhere the constructor method AuthorTM() is:\u003cbr /\u003e\n\n```java\npublic AuthorTM(String inputFile, String inputFileCode, int topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\nThe output result:\u003cbr /\u003e\n```java\n// output the result\nSystem.out.println(\"write topic word ...\" );\nwriteTopWordsWithProbability();\nSystem.out.println(\"write author topic ...\" );\nwriteAuthorTopic();\nSystem.out.println(\"write ranked topic author ...\" );\nwriteTopicAuthor();\n```\nThe output file contains:\u003cbr /\u003e\n![在这里插入图片描述](https://img-blog.csdnimg.cn/20190704085409460.png)\n\nWe run the code in Linux Server.\n\nThe example (contains two documents) of the input file likes:\u003cbr /\u003e\n```java\nVolkswagen Golf--BMW\tjhend925 correct gti heat seat include trim\nKia Soul--Ford Escape--Toyota RAV4\tcar_man current lease number Soul Exclaim premium package market quote pathetic offer walk dealership contact EXACT vehicle\n```\n\nThe contents of 'authorTM_topic_author_25.txt' like: \u003cbr /\u003e \n\n```\nTopic:1\nLexus IS 200t :0.82801393728223\nGMC Acadia :0.7883767535070141\nSaturn L300 :0.775535590877678\nLexus NX 300h :0.7683754512635379\nPorsche Cayenne :0.7153568853640953\nAudi R8 :0.610393466963623\nOldsmobile Alero :0.5796109993293093\n...\n\nTopic:2\nBMW X6 :0.32407809110629066\nLincoln Continental :0.255003599712023\nAudi A5 :0.2263959390862944\nFord Edge 
:0.1896140350877193\nCadillac ATS-V :0.1740510697032436\nPontiac G5 :0.1566591422121896\nLexus NX 300 :0.13647570703408266\nVolkswagen Tiguan :0.13225579761068165\n...\n```\n\nThe contents of 'authorTM_topic_word25.txt' like: \u003cbr /\u003e \n\n```java\nTopic:1\ndrive :0.08731467480511887\nawd :0.06541822490968394\npost :0.054768834886097705\ntime :0.03145971080386051\nbase :0.030427371975043478\nrate :0.029014697788241225\nhigh :0.027765024469146922\ndoor :0.02695002013060716\nshow :0.024179005379571968\n...\n\nTopic:2\nperson :0.028071566434497267\nmiles/year :0.02691409539297589\nhood :0.01939053362308692\narticle :0.015918120498522776\nmax :0.014760649457001397\nconsole :0.013313810655099673\nmassachusetts :0.013313810655099673\nrubber :0.013024442894719327\nsection :0.010709500811676568\n...\n```\n\n## LinkLDA\n\nCollapsed Gibbs sampling in the generative model of Link LDA\n\nReference:(1)Erosheva E, Fienberg S, Lafferty J. Mixed-membership models of scientific publications[J]. Proceedings of the National Academy of Sciences, 2004, 101(suppl 1): 5220-5227.\n\n(2)Su S, Wang Y, Zhang Z, et al. Identifying and tracking topic-level influencers in the microblog streams[J]. Machine Learning, 2018, 107(3): 551-578.\n\n(3)(Probabilistic inference formula):Ling G, Lyu M R, King I. Ratings meet reviews, a combined approach to recommend[C]//Proceedings of the 8th ACM Conference on Recommender systems. 
ACM, 2014: 105-112.\n\n```java\nimport com.topic.model.LinkLDA;\n\npublic class LinkLDATest {\n\n\tpublic static void main(String args[]) throws Exception{\n\t\tLinkLDA linklda = new LinkLDA(\"data/rawdata_process_link\", \"gbk\", 50, 0.1,\n\t\t\t\t0.01,0.01, 200, 50, \"data/linkldaoutput/\");\n\t\tlinklda.MCMCSampling();\n\t}\n\n}\n```\nWhere the constructor method LinkLDA() is:\u003cbr /\u003e\n\n```java\npublic LinkLDA(String inputFile, String inputFileCode, int topicNumber,\n\t\t\tdouble inputAlpha, double inputBeta,double inputGamma, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\n\nThe example of the input file likes:\u003cbr /\u003e\n```java\n457720--578743--643697--840908--874627--975162--1058302--1275106\tpaper present indoor navigation range strategy monocular camera exploit architectural orthogonality indoor environment introduce method estimate range vehicle state monocular camera visionbased SLAM navigation strategy assume\n273266--1065537--1120593--1474359--1976664--2135000\tglobalisation education increasingly topic discussion university worldwide hand industry university leader emphasise increase awareness influence global marketplace skill graduate time emergence tertiary education export market prompt\n...\n```\nThe output file contains:\u003cbr /\u003e\n![在这里插入图片描述](https://img-blog.csdnimg.cn/20190704102827484.png)\n\nThe contents of 'topic_link_LinkLDA_50.txt' like: \u003cbr /\u003e \n```\nTopic:1\n2135000 :0.1134694432807372\n44875 :0.00726720143952733\n891558 :0.0041754331180947285\n129986 :0.002629548957378428\n798508 :0.002629548957378428\n891548 :0.002474960541306798\n1760887 :0.0023203721252351675\n739898 :0.0023203721252351675\n307246 :0.0023203721252351675\n34076 :0.0018566068770202774\n...\n\nTopic:2\n2135000 :0.06079089083918171\n369235 :0.017583984276502394\n1777208 :0.012677206244147646\n392342 :0.007770428211792896\n422114 :0.006271134924128945\n1777200 :0.0061348355343413125\n1777102 
:0.004635542246677361\n329348 :0.004226644077314466\n857174 :0.003954045297739202\n207251 :0.003954045297739202\n124072 :0.0038177459079515703\n...\n```\nThe contents of 'topic_word_LinkLDA_50.txt' like: \u003cbr /\u003e \n```\nTopic:1\nmodel :0.04919313426495349\ndistribution :0.031032042220064535\nestimate :0.021740932811592357\nparameter :0.019003608793232284\nprobability :0.018661443290937274\nestimation :0.016003080542337587\nrandom :0.011923414938050936\nfunction :0.011581249435755926\nmethod :0.010002024040548192\nanalysis :0.009659858538253182\nvariable :0.009554576845239334\nestimator :0.00931769303595817\n...\n\nTopic:2\ngraph :0.08478070592614391\nset :0.02273205662600254\nvertex :0.021495890950958215\nnumber :0.01871451818210849\nedge :0.017066297282049395\ngive :0.01401022102985649\nresult :0.01085113097140989\nshow :0.01023304813388773\nprove :0.009374599748440285\nclass :0.00896254452342551\ndegree :0.007692040912963291\nn :0.007520351235873802\npath :0.007486013300455904\n...\n```\nThe contents of 'doc_topic_LinkLDA_50.txt' like: \u003cbr /\u003e \n```\n0.08243243243243242\t0.0013513513513513514\t0.02837837837837838 ...\n0.28383838383838383\t0.00101010101010101\t0.00101010101010101 ...\n```\n\n## DMM\nCollapsed Gibbs Sampling for DMM\nReference:(1)Yin J, Wang J. A dirichlet multinomial mixture model-based approach for short text clustering[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014: 233-242.\n\n(2)Nguyen D Q. jLDADMM: A Java package for the LDA and DMM topic models[J]. 
arXiv preprint arXiv:1808.03835, 2018.\n\n```java\nimport com.topic.model.DMM;\n\npublic class DMMTest {\n\n\tpublic static void main(String[] args) {\n\t\tDMM dmm = new DMM(\"data/shortdoc.txt\", \"gbk\", 15, 0.1,\n\t\t\t\t0.01, 500, 50, \"data/ldaoutput/\");\n\t\tdmm.MCMCSampling();\n\n\t}\n\n}\n```\nWhere the constructor method DMM() is:\u003cbr /\u003e\n\n```java\npublic DMM(String inputFile, String inputFileCode, int clusterNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\n\nThe example of the input file likes:\u003cbr /\u003e\n```java\niphone crack iphone \nadding support iphone announced \nyoutube video guy siri pretty love \nrim made easy switch iphone yeah \nrealized ios \ncurrent blackberry user bit disappointed move android iphone \n...\n```\nThe output file contains:\u003cbr /\u003e\n![在这里插入图片描述](https://img-blog.csdnimg.cn/20190704110118259.png)\n\nThe contents of 'DMM_cluster_word_15.txt' like: \u003cbr /\u003e \n```\nTopic:1\nscience :0.050034954559073204\nwindows :0.04004793768101468\ncomputer :0.04004793768101468\nresearch :0.04004793768101468\nandroid :0.030060920802956158\nsearch :0.030060920802956158\nadd :0.030060920802956158\nstart :0.030060920802956158\nshows :0.030060920802956158\nimprovements :0.030060920802956158\n...\n\nTopic:2\niphone :0.06536929406386731\ngreat :0.0523084960491086\nios :0.04577809704172925\nloving :0.04577809704172925\ntime :0.039247698034349895\ngood :0.039247698034349895\nsearch :0.039247698034349895\nsleep :0.03271729902697055\nman :0.026186900019591196\nfacebook :0.026186900019591196\nnice :0.026186900019591196\nworld :0.026186900019591196\nhelps :0.026186900019591196\npaying :0.019656501012211846\n...\n```\nThe contents of 'DMM_doc_cluster15.txt' like: \u003cbr /\u003e \n```\n1\n11\n9\n6\n6\n6\n1\n6\n7\n8\n1\n0\n...\n```\nThe contents of 'DMM_theta_15.txt' like: \u003cbr /\u003e 
\n```\n0.052684144818976285\n0.06766541822721599\n0.07515605493133583\n0.07016229712858926\n0.10262172284644196\n0.06267166042446942\n0.10761548064918852\n0.04769038701622972\n0.06267166042446942\n0.050187265917603\n0.037702871410736576\n0.0651685393258427\n0.05518102372034957\n0.09762796504369538\n0.04769038701622972\n```\n\n## DPMM\nCollapsed Gibbs Sampling for DPMM\n\nReference: (1) Yin J, Wang J. A model-based approach for text clustering with outlier detection[C]//2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016: 625-636.\n\n(2) https://github.com/junyachen/GSDPMM\n\nThis algorithm is similar to DMM, so the implementation uses the same data structures for both algorithms (DMM and DPMM).\n\n```java\nimport com.topic.model.DPMM;\n\npublic class DPMMTest {\n\n\tpublic static void main(String[] args) {\n\t\tDPMM dpmm = new DPMM(\"data/shortdoc.txt\", \"gbk\", 5, 0.1,\n\t\t\t\t0.01, 1500, 50, \"data/ldaoutput/\");\n\t\tdpmm.MCMCSampling();\n\n\t}\n\n}\n```\nThe signature of the DPMM() constructor is:\u003cbr /\u003e\n```java\npublic DPMM(String inputFile, String inputFileCode, int initClusterNumber,\n\t\t\tdouble inputAlpha, double inputBeta, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\n\n## HDP\nSampling based on the Chinese restaurant franchise\n\nReference: (1) Teh Y W, Jordan M I, Beal M J, et al. Sharing clusters among related groups: Hierarchical Dirichlet processes[C]//Advances in neural information processing systems. 
2005: 1385-1392.\n\n(2) https://github.com/arnim/HDP\n\n```java\nimport com.topic.model.HDP;\n\npublic class HDPTest {\n\n\tpublic static void main(String[] args) {\n\t\tHDP hdp = new HDP(\"data/rawdata_process_lda\", \"gbk\", 10, 1, 0.01,\n\t\t\t\t0.1, 1000, 50, \"data/ldaoutput/\");\n\t\thdp.MCMCSampling();\n\n\t}\n\n}\n```\nThe signature of the HDP() constructor is:\u003cbr /\u003e\n```java\npublic HDP(String inputFile, String inputFileCode, int initTopicNumber,\n\t\t\tdouble inputAlpha, double inputBeta, double inputGamma, int inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\nThe output files are:\u003cbr /\u003e\n![Output files](https://img-blog.csdnimg.cn/201907041126015.png)\n\nThe contents of 'HDP_topic_word_36.txt' look like: \u003cbr /\u003e \n```\nTopic:1\nmethod :0.030071773821525403\nmodel :0.01782642193628322\nfunction :0.012205604677483528\nequation :0.012125307288072103\ndistribution :0.011643522951603558\nparameter :0.011201887309840727\nnumerical :0.011121589920429302\nresult :0.01072010297337218\nproblem :0.010639805583960757\npresent :0.009957277773963652\npropose :0.00975653430043509\n...\n\nTopic:2\nalgorithm :0.06079583823439637\nproblem :0.050096888430722096\npropose :0.025803978876496988\noptimization :0.023915928911142702\nsolution :0.021209723960801567\nsolve :0.016111989054345\nsearch :0.014790354078597003\nresult :0.012461759121326719\npaper :0.012398824122481576\nmethod :0.011895344131720435\ntime :0.010573709155972437\nshow :0.010447839158282152\n...\n```\n\n## Dual-Sparse Topic Model\nCollapsed Variational Bayesian Inference for Dual-Sparse Topic Model\n\nReference: Lin T, Tian W, Mei Q, et al. The dual-sparse topic model: mining focused topics and focused terms in short text[C]//Proceedings of the 23rd international conference on World wide web. 
ACM, 2014: 539-550.\n\n```java\nimport com.topic.model.DualSparseLDA;\n\npublic class DualSparseLDATest {\n\n\tpublic static void main(String[] args) {\n\t\tDualSparseLDA slda = new DualSparseLDA(\"data/shortdoc.txt\", \"gbk\", 10, 1.0, 1.0, 1.0, 1.0, 0.1, 1E-12, 0.1, 1E-12, 500, 60, \"data/dualsparse/\");\n\t\tslda.CVBInference();\n\t}\n\n}\n```\nThe signature of the DualSparseLDA() constructor is:\u003cbr /\u003e\n```java\npublic DualSparseLDA(String inputFile, String inputFileCode, int topicNumber,\n\t\t\tdouble inputS, double inputR, double inputX, double inputY, \n\t\t\tdouble inputGamma, double inputGamma_bar, double inputPi, double inputPi_bar,\n\t\t\tint inputIterations, int inTopWords,\n\t\t\tString outputFileDir)\n```\n\nThe output files are:\u003cbr /\u003e\n\n![Output files](https://img-blog.csdnimg.cn/20190704161409153.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoberqian%2FTopicModel4J","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoberqian%2FTopicModel4J","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoberqian%2FTopicModel4J/lists"}