{"id":20051569,"url":"https://github.com/ryanchao2012/okbot","last_synced_at":"2026-04-24T21:31:38.258Z","repository":{"id":86737252,"uuid":"84339447","full_name":"ryanchao2012/okbot","owner":"ryanchao2012","description":"A conversation retrieval engine based on PTT corpus","archived":false,"fork":false,"pushed_at":"2017-08-23T05:36:18.000Z","size":1724,"stargazers_count":1,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2026-03-29T18:41:29.553Z","etag":null,"topics":["chatbot","crawler","django","ptt"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ryanchao2012.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-08T16:02:32.000Z","updated_at":"2019-04-27T11:40:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"cb877984-3ee3-4354-8ee6-397072099f7e","html_url":"https://github.com/ryanchao2012/okbot","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ryanchao2012/okbot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanchao2012%2Fokbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanchao2012%2Fokbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanchao2012%2Fokbot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanchao2012%2Fokbot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ryan
chao2012","download_url":"https://codeload.github.com/ryanchao2012/okbot/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryanchao2012%2Fokbot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32241578,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-24T13:21:15.438Z","status":"ssl_error","status_checked_at":"2026-04-24T13:21:15.005Z","response_time":64,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","crawler","django","ptt"],"created_at":"2024-11-13T12:03:58.510Z","updated_at":"2026-04-24T21:31:38.252Z","avatar_url":"https://github.com/ryanchao2012.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"MarginalBear\n============\n\n**MarginalBear** is a chit-chatbot with a conversation retrieval engine based on PTT corpus.\nThe core modules in this repo are: ``crawl_app``, ``ingest_app`` and ``chat_app``, and we use ``Django`` to manage these apps.\n\n\u003cimg src=\"res/icon.png\" width=\"150\"\u003e\u003cimg src=\"res/qrcode.png\" width=\"130\"\u003e\n\n\u003cimg src=\"res/line.png\" width=\"140\"\u003e\u003cimg src=\"res/messenger.png\" width=\"150\"\u003e\n\n\nChat Demos\n----------\n\n\u003cimg src=\"res/demo7.png\" width=\"300\"\u003e\n\u003cimg src=\"res/demo6.png\" width=\"300\"\u003e\n\u003cimg src=\"res/demo1.png\" 
width=\"300\"\u003e\n\u003cimg src=\"res/demo2.png\" width=\"300\"\u003e\n\u003cimg src=\"res/demo3.png\" width=\"300\"\u003e\n\u003cimg src=\"res/demo4.png\" width=\"300\"\u003e\n\n\nSetting Demos in Django Admin\n-----------------------------\n\n#### Crawler setting:\n\n\u003cimg src=\"res/admin_spider_config.png\" width=\"400\"\u003e\n\n#### Blacklist setting:\n\n\u003cimg src=\"res/admin_spider_blacklist.png\" width=\"400\"\u003e\n\n#### Vocabulary:\n\n\u003cimg src=\"res/admin_vocab.png\" width=\"400\"\u003e\n\n\nPTT-Crawler\n-----------\nCrawlers are implemented with the ``scrapy`` framework, and the logic is defined under the ``crawl_app/spider/`` directory. Each crawled article is stored as one record in a jsonline file, formatted as follows:\n\n    \"url\": \u003curl\u003e,\n    \"date\": \u003carticle-publish-date\u003e,\n    \"title\": \u003ctitle\u003e,\n    \"author\": \u003cauthor\u003e,\n    \"content\": \u003carticle-body\u003e,\n    \"push\": \u003clist of comment-string\u003e,\n\n\nTo build the conversation corpus, we pair the ``title`` and ``push`` fields to mimic Q\u0026A behavior. Here are some examples:\n\n    \u003ctitle\u003e as Q              \u003cpush\u003e as A\n    綜藝玩很大是不是走下坡了      很久沒看了  都是老梗\n    該怎麼挽回好友？             就算挽回 以後也會因為別的事離開你\n    妹妹想去補習，該怎麼辦        其實你沒有妹妹\n\nIn English, these pairs read roughly as: \"Is 綜藝玩很大 going downhill?\" / \"Haven't watched it in ages, it's all old gags\"; \"How do I win back a good friend?\" / \"Even if you do, they will leave you over something else later\"; \"My little sister wants to go to cram school, what should I do?\" / \"Actually, you don't have a little sister\".\n\nFurther data cleaning is handled by ``ingest_app``.\n\nEach crawler handles articles from only one PTT forum. Since user habits in different forums (e.g. gossiping, sex, mantalk, etc.) are usually quite different, we may apply specific rules to each crawler.\nTo manage these crawlers easily, the crawl engine is integrated with Django. In the Django admin interface, we can easily create different rules to filter out noisy articles. 
A rule is a blacklist consisting of the ``phrases`` to be filtered and a ``type`` that names the field of the crawled item to check. The types are:\n\n- ``title``: applies to the ``title`` field of crawled items.\n- ``push``: applies to the ``push`` field of crawled items.\n- ``author``: applies to the ``author`` field of crawled items.\n- ``audience``: applies to the commenter of the ``push`` field.\n\nA blacklist can be defined in admin as:\n\n    \"type\": title,\n    \"phrase\": 公告, Re:, Fw:, 投稿, 水桶,\n\nThis means the crawler should drop an item if the article's title contains one of these phrases (公告 \"announcement\", 投稿 \"submission\", 水桶 \"ban\", plus the reply and forward prefixes). With this configuration, each crawler can be equipped with multiple rules that target different kinds of censored content.\n\n\nA spider can be defined in admin as:\n\n    \"tag\": Gossiping,  # forum name\n    \"entry\": https://www.ptt.cc/bbs/Gossiping/index{index}.html,\n    \"page\": 250,   # pages to crawl in a crawl task\n    \"offset\": 50,  # the distance from the newest page\n    \"freq\": 1,     # crawl frequency, used with crontab, e.g. daily\n    \"blacklist\": [\u003crule1\u003e, \u003crule2\u003e, ...],\n    \"start\": -1,   # start page index\n    \"end\": -1,     # end page index\n    \"status\": debug, # pass or debug\n\nWhen a spider is created, run this command to check whether the config is valid:\n\n    ./manage.py okbot_update_spider \u003ctag\u003e\n\nThe ``start`` and ``end`` indexes are updated according to the ``page`` and ``offset`` settings. If everything goes fine, the ``status`` changes to ``pass``, meaning the spider is ready to fire:\n\n    ./manage.py okbot_crawl \u003ctag\u003e\n\nAfter a crawl task is issued, a job log is generated. When the task finishes, a statistical summary is recorded and can be viewed in admin, e.g.:\n\n    \"name\": \"Gossiping\",\n    \"item_num\": \"3227\",\n    \"drop_num\": \"10\",\n    \"title\": \"mean: 19.2, std: 4.3\",\n    \"url\": \"mean: 56.0, std: 0.0\",\n    \"author\": \"mean: 16.7, std: 4.2\",\n    \"date\": \"mean: 24.0, std: 0.0\",\n   
 \"push\": \"mean: 17.4, std: 9.4\",\n    \"content\": \"mean: 269.3, std: 350.1\"\n\nFinally, we use crontab to manage daily crawl jobs. You can find the handler script in ``crawl_ingest.py``.\n\n\nIngester\n--------\nThis module \"ingests\" crawled data into the database and does three things:\n\n1. Build vocabularies by tokenizing (with ``jieba``) articles' titles.\n2. Index every article.\n3. Build the ``ManyToMany`` relation (inverted indexing) between vocabularies and articles.\n\nThe tasks are wrapped into a command:\n\n    ./manage.py okbot_ingest --jlpath \u003cjsonline-file\u003e --tokenizer \u003ctokenizer\u003e\n\nThe script only supports ``postgresql``. If you use the postgresql backend with Django, provide these environment variables and the command should work:\n\n- `OKBOT_DB_USER`\n- `OKBOT_DB_NAME`\n- `OKBOT_DB_PASSWORD`\n\nThe vocabulary will be listed in Django admin.\nSince the retrieval mechanism works with an inverted index, you should label words with high document frequency as ``stopword``, or the retrieval process will be very slow.\n\n\n\nChatbot\n-------\n\nThe bot is deployed on both the messenger and line platforms. You can find the api implementation in ``chat_app/views.py``. When the bot receives a query, the engine finds the related articles through the inverted index, then calculates the ``jaccard`` or ``bm25`` similarity, along with some other features, between the query and the articles' titles. After ranking the articles, the bot picks a \"comment\" from the top-ranking articles as a response. 
You can find the ranking algorithm and its implementation in ``chat_app/bots.py``.\nA ``word2vec`` model (built with the ``gensim`` package) is also applied to queries to generate similar phrases and enrich the search information.\n\nOther features:\n\n- Chat rules table\n- Chat tree/caching\n- Jieba tag weighting table\n\n\n\nEvaluation\n----------\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryanchao2012%2Fokbot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fryanchao2012%2Fokbot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryanchao2012%2Fokbot/lists"}