{"id":21067165,"url":"https://github.com/idf/documentminer","last_synced_at":"2025-08-06T11:13:25.129Z","repository":{"id":23346118,"uuid":"26706826","full_name":"idf/DocumentMiner","owner":"idf","description":"Discover What is Inside the Search Engine Index","archived":false,"fork":false,"pushed_at":"2015-08-13T12:49:53.000Z","size":8363,"stargazers_count":1,"open_issues_count":4,"forks_count":1,"subscribers_count":2,"default_branch":"develop","last_synced_at":"2025-06-04T05:33:35.253Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/idf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-11-16T07:10:12.000Z","updated_at":"2015-03-23T09:32:29.000Z","dependencies_parsed_at":"2022-08-21T23:10:36.566Z","dependency_job_id":null,"html_url":"https://github.com/idf/DocumentMiner","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/idf/DocumentMiner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idf%2FDocumentMiner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idf%2FDocumentMiner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idf%2FDocumentMiner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idf%2FDocumentMiner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/idf","download_url":"https://codeload.github.com/idf/DocumentMiner/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/id
f%2FDocumentMiner/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269067528,"owners_count":24354296,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-06T02:00:09.910Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T18:04:46.762Z","updated_at":"2025-08-06T11:13:25.104Z","avatar_url":"https://github.com/idf.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"Document Miner\n=============\n\nFinal Year Project\n\n## Set Up\n### Source Code \nInitial setup\n```bash\ngit clone git@github.com:idf/DocumentMiner.git\ncd ./DocumentMiner\ngit submodule init\ngit submodule update --recursive \n```\nNormally, the `develop` branch is used.  \n\nUpdate \n```bash\ngit submodule foreach git pull origin master\n```\n\n### Java Dependencies\nThis is a multi-module project, managed by Maven. You should configure `commons-util`, `km_*`, and `rake4j` as modules. The dependencies should be resolved automatically by Maven, as indicated in pom.xml. 
\n\n### Web Dependencies\nWeb dependencies are managed by bower.  \n* To install them with bower, change into the directory of `bower.json`: `cd ./km-web/src/main/webapp/`\n```bash\nbower install\n```\n\nAdditional dependencies\n```bash\nwget https://bootswatch.com/yeti/bootstrap.css -O bower_components/bootstrap/dist/css/bootstrap-yeti.css\n```\n\n### Binary Dependencies\nDownload [CLUTO](http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download), and specify the path to CLUTO in the configuration file.\n\n### Configurations\n[Configurations](https://github.com/idf/DocumentMiner/blob/develop/km-common/src/main/java/km/common/Config.java)  \n[Configuration XML](https://github.com/idf/DocumentMiner/blob/develop/km-common/src/main/resources/settings.xml)  \n* The configuration logic is controlled by [Settings.java](https://github.com/idf/DocumentMiner/blob/develop/km-common/src/main/java/km/common/Settings.java)\n\n## Offline Components\n### Topic Modeling \n1. Manually add the mallet dependencies (km-mallet/lib/mallet_deps.jar) to the km-mallet module.\n1. Download [stopwords](http://www.lextek.com/manuals/onix/stopwords2.html) to /mallet/stoplist/en.txt\n1. Run km.crawler.postprocess.ToCSV, which takes posts.txt as input and outputs a CSV file.\n1. Run km.mallet.preprocess.DataImportUnigram, which takes the CSV file generated in the previous step and outputs it in the mallet-specific format.\n1. Run km.mallet.topic.TrainTopicUnigram, which takes the file generated in the previous step and outputs two files: keys and topics.\n\n### Indexing\n1. Run km.crawler.postprocess.SortPostPerThread, which will generate post_sorted.txt\n1. Run km.lucene.indexing.PostIndexer, which will generate the post index.\n\n### Clustering\n1. Run km.lucene.applets.collocations.Driver to get the RAKE index based on post clustering. \n* It may take some time.\n\n## Online Components\n### Collocation Analysis\n1. 
Run km.lucene.applets.collocations.TermCollocationExtractor to see the collocation results in the CLI.\n\n### Web\n1. Install the Glassfish server ([download Java EE](http://www.oracle.com/technetwork/java/javaee/downloads/java-ee-sdk-7-downloads-1956236.html)).\n1. Then start the server with the km-web war deployed. \n\n## Indexes\n### Submodules\n[.gitmodules](https://github.com/idf/DocumentMiner/blob/develop/.gitmodules)  \n\n### Utils\n* [LuceneUtils](https://github.com/idf/DocumentMiner/blob/develop/km-lucene/src/main/java/util/LuceneUtils.java)\n\n## Features\n1. term co-occurrences for term query;\n1. phrase co-occurrences for term query;\n1. term co-occurrences for phrase query;\n1. phrase co-occurrences for phrase query.  \nAnd many more.\n\n### Co-occurrence process\n* [README](https://github.com/idf/DocumentMiner/blob/develop/km-lucene/src/main/java/km/lucene/applets/collocations)\n\n## Search Engine Interface in AngularJS\n* Through Web Service: JavaXS\n* Web dependencies: [bower.json](https://github.com/idf/DocumentMiner/blob/develop/km-web/src/main/webapp/bower.json)\n\n## Component Diagram\n![](/img/DocumentMinerComponent.png) \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidf%2Fdocumentminer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fidf%2Fdocumentminer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidf%2Fdocumentminer/lists"}