{"id":13591893,"url":"https://github.com/baidu/Familia","last_synced_at":"2025-04-08T18:31:09.161Z","repository":{"id":45377725,"uuid":"94299010","full_name":"baidu/Familia","owner":"baidu","description":"A Toolkit for Industrial Topic Modeling","archived":false,"fork":false,"pushed_at":"2021-07-01T08:28:33.000Z","size":6255,"stargazers_count":2637,"open_issues_count":28,"forks_count":596,"subscribers_count":158,"default_branch":"master","last_synced_at":"2024-11-05T03:11:21.124Z","etag":null,"topics":["lda","nlp","sentence-lda","topic-modeling","topic-models","twe"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baidu.png","metadata":{"files":{"readme":"README.EN.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-14T06:48:21.000Z","updated_at":"2024-10-13T14:51:13.000Z","dependencies_parsed_at":"2022-09-09T16:10:42.660Z","dependency_job_id":null,"html_url":"https://github.com/baidu/Familia","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2FFamilia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2FFamilia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2FFamilia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2FFamilia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/baidu","download_url":"https://codeload.github.com/baidu/Familia/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","rep
ositories_count":223339248,"owners_count":17129293,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lda","nlp","sentence-lda","topic-modeling","topic-models","twe"],"created_at":"2024-08-01T16:01:03.409Z","updated_at":"2024-11-06T12:32:16.486Z","avatar_url":"https://github.com/baidu.png","language":"C++","funding_links":[],"categories":["C++","Models","Chinese NLP Toolkits 中文NLP工具"],"sub_categories":["Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)","Information Extraction 信息提取"],"readme":"\u003ca href=\"http://github.com/baidu/Familia\"\u003e\n\t\u003cimg style=\"vertical-align: top;\" src=\"https://raw.githubusercontent.com/wiki/baidu/Familia/img/logo.png?raw=true\" alt=\"logo\" height=\"140px\"\u003e\n\u003c/a\u003e\n\n[![Build Status][image-1]][1]\n[![License][image-2]]()\n\n**Familia** is an open-source project that implements three popular topic models based on large-scale industrial data: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE). **Familia** also offers several tools, including lda-infer and lda-query-doc-sim, and can be applied to tasks such as document classification, document clustering, and personalized recommendation. Because training topic models is expensive, we will continue to release pre-trained topic models built on various types of large-scale data.  \n\n## News!!!\nWe recently released Familia's LDA models in [PaddleHub](https://github.com/PaddlePaddle/PaddleHub) 1.8. 
Three models are available, each trained on a different dataset: lda_news, lda_novel, and lda_webpage.\n\nPaddleHub is easy to use. The steps below use lda_news as an example.\n\n1. Install the PaddlePaddle deep learning framework, which PaddleHub requires. For installation instructions, see [PaddlePaddle Quick Installation](https://www.paddlepaddle.org.cn/install/quick).\n\n2. Install PaddleHub: `pip install paddlehub`\n\n3. Install the lda_news model: `hub install lda_news`\n\n4. API usage\n``` python\nimport paddlehub as hub\n\n# Load the LDA model trained on news data.\nlda_news = hub.Module(name=\"lda_news\")\n\n# Distance between two documents: Jensen-Shannon divergence (jsd) and Hellinger distance (hd).\n# The inputs are Chinese; roughly \"How is the weather today, is it good for an outing?\"\n# and \"The weather feels nice today, we can go out and have some fun.\"\njsd, hd = lda_news.cal_doc_distance(doc_text1=\"今天的天气如何，适合出去游玩吗\", doc_text2=\"感觉今天的天气不错，可以出去玩一玩了\")\n# jsd = 0.003109, hd = 0.0573171\n\n# Similarity between a short query and a long document.\nlda_sim = lda_news.cal_query_doc_similarity(query='百度搜索引擎', document='百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。')\n# LDA similarity = 0.06826\n\n# Keywords of a document, each with its similarity to the document.\nresults = lda_news.cal_doc_keywords_similarity('百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。')\n# [{'word': '百度', 'similarity': 0.12943492762349573},\n#  {'word': '信息', 'similarity': 0.06139783578769882},\n#  {'word': '找到', 'similarity': 0.055296603463188265},\n#  {'word': '搜索', 'similarity': 0.04270794098349327},\n#  {'word': '全球', 'similarity': 0.03773627056367886},\n#  {'word': '超过', 'similarity': 0.03478658388202199},\n#  {'word': '相关', 'similarity': 0.026295857219683725},\n#  {'word': '获取', 'similarity': 0.021313585287833996},\n#  {'word': '中文', 'similarity': 0.020187103312009513},\n#  {'word': '搜索引擎', 'similarity': 0.007092890537169911}]\n```\nA more detailed introduction and usage guide can be found here: https://www.paddlepaddle.org.cn/hublist?filter=en_category\u0026value=SemanticModel\n\n\n## Introduction\nFor details of the topic models implemented by **Familia**, see the [papers on topic models][3].\n\nApplications of topic models generally fall into two categories: Semantic Representation and Semantic 
Matching.\n\n- **Semantic Representation**\n\n    Topic models mine hidden dimensions (topics) from a document collection and generate semantic representations of documents. These representations can be used as features for document classification, document content analysis, and CTR prediction.\n\n- **Semantic Matching**\n\n    We offer two methods to compute semantic similarity between documents:\n    -\tShort-long similarity: semantic similarity between a short text and a long document, which can be applied to keyword extraction and query-document similarity.\n    -\tLong-long similarity: semantic similarity between two long documents, which can be applied to matching a user profile against a news article.\n\nFor more details, see the [**Familia Wiki**][4].\n\n## Compilation\nThe required third-party libraries are `gflags-2.0`, `glog-0.3.4`, and `protobuf-2.5.0`. The compiler must support `C++11` (`g++ \u003e= 4.8`); Linux and macOS are supported. The dependencies are downloaded and installed automatically by running the following script.\n\n\t$ sh build.sh\n\n## Download\n\t$ cd model\n\t$ sh download_model.sh\n\nFor more details, see [Models][5].\n\n## Demo\nThe **Familia** demo includes the following functions:\n-\t**Semantic Representation**\n\tuse topic models to infer the topic distribution of the input document.\n\n-\t**Semantic Matching**\n\tcompute semantic similarity between short-long or long-long documents.\n\n-\t**Topic Show**\n\tshow the top words of each topic to help users interpret it.\n\nFor more details, see [Demos][6].\n\n## Tips\n* If libglog.so, libgflags.so, or other dynamic libraries cannot be found, add third\\_party to the environment variable `LD_LIBRARY_PATH`.\n\n\t`export LD_LIBRARY_PATH=./third_party/lib:$LD_LIBRARY_PATH`\n\n## Contact\n[GitHub Issues][7]\n\n{familia} at baidu.com\n\n## Citation\n\nThe following article describes the Familia project and industrial 
cases powered by topic modeling. It consolidates and translates the Chinese documentation from the website. We recommend citing this article by default.\n\nDi Jiang, Yuanfeng Song, Rongzhong Lian, Siqi Bao, Jinhua Peng, Huang He and Hua Wu. 2018. [Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering][8]. arXiv preprint arXiv:1808.03733.\n\n\t@article{jiang2018familia,\n  \t  author = {Di Jiang and Yuanfeng Song and Rongzhong Lian and Siqi Bao and Jinhua Peng and Huang He and Hua Wu},\n  \t  title = {{Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering}},\n  \t  journal = {arXiv preprint arXiv:1808.03733},\n  \t  year = {2018}\n\t}\n\nFurther Reading: [Federated Topic Modeling][11]\n\n## Copyright and License\n\nFamilia is provided under the [BSD-3-Clause License][9].\n\n[1]:\thttp://travis-ci.org/baidu/Familia\n[3]:\thttps://github.com/baidu/Familia/wiki/%E5%8F%82%E8%80%83%E6%96%87%E7%8C%AE\n[4]:\thttps://github.com/baidu/Familia/wiki\n[5]:\thttps://github.com/baidu/Familia/blob/master/model/README.md\n[6]:\thttps://github.com/baidu/Familia/wiki/Demo%E4%BD%BF%E7%94%A8%E6%96%87%E6%A1%A3\n[7]:\thttps://github.com/baidu/Familia/issues\n[8]:\thttps://arxiv.org/abs/1808.03733v2\n[9]:\tLICENSE\n[11]:   https://github.com/baidu/Familia/blob/master/papers/FTM.pdf\n\n[image-1]:\thttps://travis-ci.org/baidu/Familia.svg?branch=master\n[image-2]:\thttps://img.shields.io/pypi/l/Django.svg\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaidu%2FFamilia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaidu%2FFamilia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaidu%2FFamilia/lists"}