{"id":13794378,"url":"https://github.com/ghosthamlet/CHN","last_synced_at":"2025-05-12T21:31:29.594Z","repository":{"id":140564010,"uuid":"196845650","full_name":"ghosthamlet/CHN","owner":"ghosthamlet","description":"Hacker news on Console with auto classifer and recommender in reactjs style code","archived":false,"fork":false,"pushed_at":"2019-08-29T11:10:52.000Z","size":26385,"stargazers_count":14,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-18T09:32:44.120Z","etag":null,"topics":["console-ui","deeplearning","hacker-news","hackernews","machinelearning","reactjs","sklearn","spacy","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ghosthamlet.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-07-14T14:22:34.000Z","updated_at":"2024-03-09T06:46:13.000Z","dependencies_parsed_at":"2024-01-07T06:41:03.729Z","dependency_job_id":null,"html_url":"https://github.com/ghosthamlet/CHN","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ghosthamlet%2FCHN","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ghosthamlet%2FCHN/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ghosthamlet%2FCHN/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ghosthamlet%2FCHN/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gho
sthamlet","download_url":"https://codeload.github.com/ghosthamlet/CHN/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253825075,"owners_count":21970125,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["console-ui","deeplearning","hacker-news","hackernews","machinelearning","reactjs","sklearn","spacy","word2vec"],"created_at":"2024-08-03T23:00:39.928Z","updated_at":"2025-05-12T21:31:25.453Z","avatar_url":"https://github.com/ghosthamlet.png","language":"Jupyter Notebook","funding_links":[],"categories":["Clients"],"sub_categories":["CLI \u0026 TUI"],"readme":"\u003ch1 align=\"center\"\u003eHacker News on Console(CHN)\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\nA text-based interface (TUI) to view and interact with Hacker News from your console.\u003cbr\u003e\nIt includes an automatic classifier and a recommender based on your upvotes and favorites.\u003cbr\u003e\nThe UI code is written in a reactjs style, easy and familiar for developers who like reactjs.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg alt=\"title image\" src=\"data/title-image.png\"/\u003e\n\u003c/p\u003e\n\n\n## Table of Contents\n\n* [NOTICE](#notice)  \n* [Features](#features)  \n* [Installation](#installation)  \n* [Usage](#usage)  \n* [Settings](#settings)\n* [TODO](#todo)  \n* [Train your own classifier](#train)  \n* [License](#license)  \n\n\n## Notice\nCHN was tested on Ubuntu in its default terminal. It ONLY works with python3.6.7, and maybe python3.6+. macOS/Windows and other OSes have not been tested, BUT when installing with Docker, see 
[Installation](#installation), many OSes and environments except Windows should work.\n\nCHN is still at an early stage and may have many bugs and performance problems, but it is already useful.\n\nThe classifier has only around 71% accuracy at present, as it is trained to classify just post titles into 34 categories,\nand the dataset is small, only around 150000 samples, and highly imbalanced.\nYou can train your own classifier; for more details about the data, the classifier and the training method, see [Train your own classifier](#train).\n\n\n## Features\n* log in to HN and vote/favorite posts\n* browse all HN list pages, including your submitted/voted/favorite pages\n* use the classifier or search to filter posts on different pages\n* automatically recommend posts matching your interests\n\nCreating comments/posts and viewing details are not implemented yet; there are shortcuts to open them in a web browser. They will be implemented and are listed in [TODO](#todo).\n\n\n## Installation\n\n### By docker\n\nNotice: opening a web browser to show the comment/post detail page does not work from docker yet; this will be fixed.\n\ndocker run -it --rm --name CHN --volume /srv/CHN/data:/app/data ghosthamlet/chn:latest python ui.py\n\nor use proxy: docker run --net host -it --rm --name CHN --volume /srv/CHN/data:/app/data ghosthamlet/chn:latest python ui.py -p 127.0.0.1:19180\n\n(change 127.0.0.1:19180 to your proxy address)\n\n\n### By code\n\ngit clone https://github.com/ghosthamlet/CHN.git\n\ncd CHN\n\npip3 install -r requirements.txt\n\npython3 -m spacy download en_core_web_md\n\n\n## Usage\n\n### By docker\ndocker run -it --rm --name CHN --volume /srv/CHN/data:/app/data ghosthamlet/chn:latest python ui.py\n\nor use proxy: docker run --net host -it --rm --name CHN --volume /srv/CHN/data:/app/data ghosthamlet/chn:latest python ui.py -p 127.0.0.1:19180\n\n(change 127.0.0.1:19180 to your proxy address)\n\n\n### By code\npython3 ui.py\n\nor use proxy: python3 ui.py -p 127.0.0.1:19180\n\n(change 127.0.0.1:19180 to your proxy address)\n\n\n### Shortcuts\n\n    h: show/close help 
screen\n\n    s: go to search keyword, use space to separate multiple keywords\n\n    t: go to select page type, or go back to posts\n\n    v: upvote current post (NOTE: you have to view/load the upvoted page first)\n\n    o: favorite current post (NOTE: you have to view/load the favorite page first)\n\n    r: refresh posts\n\n    c: open comment page\n\n    enter: open link page\n\n    ctrl c: quit\n\n\n### Notice\n\n    * when the ui is frozen, hit t, or ctrl c to quit and restart\n\n    * login is safe: only cookies are saved on your computer, \n       account details are not saved and not sent to any server\n\n    * login may FAIL! If you try a wrong username/password many times, your ip may be locked by HN, \n       and it will use google reCAPTCHA to verify your login; you have to wait for HN to remove reCAPTCHA before you can log in to CHN again\n\n    * use arrows to navigate\n\n    * sometimes after loading a new page the ui may freeze, hit t to activate it\n\n    * loading submitted/upvoted/favorite pages may be very slow the first time if you have a lot of data, \n       but after the first load it will be fast\n \n\n## Train\n### About the dataset\nThe training data has around 150000 samples, collected from reddit.com. 
\nAt the start, there were 110 cats (categories); every cat is a manually selected subreddit, \nand all front page and top page posts in each subreddit form the whole dataset.\nIt currently contains only post titles; the post bodies were not crawled.\nThe samples for some cats are quite imbalanced.\nFor more info, see the *Explore data section* in Train.ipynb.\n\n\n### About the classifier\nI ran the following experiments to select a classifier:\n\n\u003cpre\u003e\nCATS                CLASSIFIER                  ACCURACY           RECALL\n\n110                 LogisticRegression          around 0.55        around 0.55\n\n110                 ComplementNB                around 0.55        around 0.55\n\n110                 SGDClassifier               around 0.55        around 0.55\n\n110                 RandomForestClassifier      not fin            not fin\n\n110                 SVC                         not fin            not fin\n\n110                 SGDClassifier               around 0.55        around 0.55\n\n110                 LinearSVC                   around 0.55        around 0.55\n\n101                 LinearSVC                   0.59               around 0.59\n\n57                  LinearSVC                   0.62               around 0.62\n\n41                  LinearSVC                   0.65               around 0.65\n\n34over_sampling     LinearSVC                   0.73               0.73 (val/test accuracy is 0.63)\n\n34under_sampling    LinearSVC                   under 0.7          under 0.7\n\n34                  LinearSVC                   0.709              0.71\n\u003c/pre\u003e\n\n* I did not record the experiment history, so the ACCURACY/RECALL values marked with 'around' are approximate; \n I will redo these with hyperparameter-hunter when I have time\n* 'not fin' means the classifier was too slow to finish on my laptop\n* random search and cross validation were also too slow to run on my laptop, so I did not do them\n* 101 and 57 cats are the dataset with the cats with the least and most samples removed\n* 41 cats 
are the dataset with many sub cats combined into one cat (highly imbalanced)\n* 34 cats are the dataset with some overly broad cats removed and some related cats merged (most highly imbalanced)\n* 34over_sampling cats is the same dataset as 34 cats, but with imblearn.over_sampling.SMOTE resampling added to the pipeline (most highly imbalanced)\n* for most classifiers the val/test accuracy/recall is near the train accuracy/recall, \n  except the classifier with imblearn.over_sampling.SMOTE resampling\n\nSome experiments were tried in a kaggle GPU kernel:\n\n\u003cpre\u003e\nCATS                CLASSIFIER                                    TRAIN ACCURACY     TEST ACCURACY\n\n34                  LinearSVC with word2vec                       around 0.55        around 0.55\n\n34                  XGBoost                                       not fin            not fin\n\n34                  AWD_LSTM (fine tuning with fastai)            around 0.68        around 0.68\n\n34                  Transformer (fine tuning with fastai)         around 0.60        around 0.60\n\n34                  GPT2+AWD_LSTM (fine tuning with fastai)       not fin            not fin\n\n34                  BERT (fine tuning with pytorch-transformers)  around 0.78        around 0.78\n\u003c/pre\u003e\n\n* GPT2+AWD_LSTM: GPT2 is used to generate the post body from its title. The generated body is good, \n  but not closely enough related to the title subject, so I did not use it.\n  In the future, when the GPT2001 is so good, maybe it can be used to replace crawling post bodies for high accuracy\n* 34 cats are the dataset with some overly broad cats removed and some related cats merged (most highly imbalanced)\n\nOverall, I did very little hyperparameter tuning for any classifier. From the results, deep transfer learning by fine tuning BERT gives the most accurate classifier for this poor dataset, but it is very slow. The simple LinearSVC is fast and its performance is just behind BERT, so I am using it in this project.\n\n\n### Train your own classifier\n1. 
change the reddit crawl settings in config.py and crawl subreddit posts by running crawler.py; you can use the existing data/reddit.csv and skip this step\n2. train in Train.ipynb, sorry the code in the Train notebook is not good\n3. change hn_classifer_model in config.py to the saved model from the previous step \n\n\n## Settings\nsee config.py\n\n\n## TODO\n* package the app with pip or PyInstaller, so users can install it easily;\n  I tried PyInstaller and succeeded in building an executable, but it failed to run: \n  the executable started many CHN processes and killed the OS\n* use hyperparameter-hunter to manage machine learning experiments\n* improve classifier accuracy by crawling and classifying post bodies, not just\n  titles, and use deep transfer learning (maybe fine tuning BERT) to classify\n* optimize recommender performance by comparing post bodies, and also optimize its\n  speed; it is rather slow now, maybe remove spacy and use raw word2vec\n* optimize app ui performance, add more progress indicators\n* updating guest pages currently updates all data of that page;\n  change it to incremental updates like the user-only pages\n* optimize react speed: let render update just its own component, not all components\n* refactor the react api/code to conform more closely to reactjs, and extract it into an independent pip library\n* add vim-like shortcuts\n* add a comment/post detail page, search/sort of post comments, and create comment/post functions\n* add a chart/graph page to show stats of cats/keywords of submitted/upvoted/favorite posts over time\n* make the latest/hot/recommend pages real time\n\n## License\nThis project is distributed under the [MIT](LICENSE) license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fghosthamlet%2FCHN","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fghosthamlet%2FCHN","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fghosthamlet%2FCHN/lists"}