{"id":13857833,"url":"https://github.com/systats/textlearnR","last_synced_at":"2025-07-13T22:31:36.403Z","repository":{"id":108356994,"uuid":"172758480","full_name":"systats/textlearnR","owner":"systats","description":"A simple collection of well working NLP models (Keras, H2O, StarSpace) tuned and benchmarked on a variety of datasets.","archived":false,"fork":false,"pushed_at":"2019-03-08T00:09:08.000Z","size":77062,"stargazers_count":18,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-08-06T03:04:27.831Z","etag":null,"topics":["classification","hyperparameter-optimization","keras","nlp","r","text-mining"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/systats.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-02-26T17:33:15.000Z","updated_at":"2024-06-09T06:30:50.000Z","dependencies_parsed_at":"2023-03-08T02:30:21.398Z","dependency_job_id":null,"html_url":"https://github.com/systats/textlearnR","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systats%2FtextlearnR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systats%2FtextlearnR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systats%2FtextlearnR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systats%2FtextlearnR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/systats","download_url":"https://codeload.github.com/systats/textlearnR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225920503,"owners_count":17545505,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","hyperparameter-optimization","keras","nlp","r","text-mining"],"created_at":"2024-08-05T03:01:48.297Z","updated_at":"2025-07-13T22:31:36.394Z","avatar_url":"https://github.com/systats.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"textlearnR\n================\n\nA simple collection of well-working NLP models (Keras) in R, tuned and benchmarked on a variety of datasets. This is a work in progress; the first version supports only classification tasks.\n\n\u003cimg src=\"Readme_files/figure-markdown_github/unnamed-chunk-1-1.png\" style=\"display: block; margin: auto;\" /\u003e\n\nWhat can this package do for you? (in the future)\n-------------------------------------------------\n\nTraining neural networks can be tedious and time-consuming due to the sheer number of hyperparameters. Hyperparameters are values that are defined before training and provided as additional model input. Tuning them requires either deeper knowledge of the model's behavior or computational resources for random search or optimization over the parameter space. `textlearnR` provides a lightweight framework to train and compare ML models from Keras, H2O, StarSpace and text2vec (coming soon). 
Furthermore, it lets you define parameters for text processing (e.g. maximum number of words and text length), which are also set prior to training.\n\nBesides language models, textlearnR also integrates third-party packages for automatically tuning hyperparameters. The following strategies will be available:\n\n#### Searching\n\n-   Grid search\n-   Random search\n-   Sobol sequence (quasi-random numbers designed to cover the space more evenly than uniform random numbers). Computationally expensive but parallelizable.\n\n#### Optimization\n\n-   [`GA`](https://github.com/luca-scr/GA) Genetic algorithms for stochastic optimization (real values only).\n-   [`mlrMBO`](https://github.com/mlr-org/mlrMBO) Bayesian and model-based optimization.\n-   Others:\n    -   Nelder–Mead simplex (gradient-free)\n    -   Particle swarm (gradient-free)\n\nNew parameter objects are constructed the tidy way with the `dials` package. Each optimized model is saved to a SQLite database in `data/model_dump.db`. The package is, of course, committed to [tidy principles](https://cran.r-project.org/package=tidyverse/vignettes/manifesto.html). Contributions are highly welcome!\n\nSupervised Models\n-----------------\n\n[model overview](https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463)\n\n\u003c!-- * simple generic wrapper/class for models --\u003e\n\u003c!-- * parameter validation using manual and automatic verification dataset or k-Fold cross validation. 
--\u003e\n\u003c!-- * Introduce early stopping to keras during training of model --\u003e\n``` r\nkeras_model \u003c- list(\n  simple_mlp = textlearnR::keras_simple_mlp,\n  deep_mlp = textlearnR::keras_deep_mlp,\n  simple_lstm = textlearnR::keras_simple_lstm,\n  #deep_lstm = textlearnR::keras_deep_lstm,\n  pooled_gru = textlearnR::keras_pooled_gru,\n  cnn_lstm = textlearnR::keras_cnn_lstm,\n  cnn_gru = textlearnR::keras_cnn_gru,\n  gru_cnn = textlearnR::keras_gru_cnn,\n  multi_cnn = textlearnR::keras_multi_cnn\n)\n```\n\nDatasets\n--------\n\n-   [celebrity-faceoff](https://github.com/jlacko/celebrity-faceoff)\n-   [Google Jigsaw Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)\n-   [Hate speech detection](https://github.com/t-davidson/hate-speech-and-offensive-language)\n-   [nlp-datasets](https://github.com/niderhoff/nlp-datasets)\n-   Scopus Classification\n-   party affiliations\n\nUnderstand one model\n--------------------\n\n``` r\ntextlearnR::keras_simple_mlp(\n    input_dim = 10000, \n    embed_dim = 128, \n    seq_len = 50, \n    output_dim = 1\n  ) %\u003e% \n  summary\n```\n\n    ## ___________________________________________________________________________\n    ## Layer (type)                     Output Shape                  Param #     \n    ## ===========================================================================\n    ## embedding_1 (Embedding)          (None, 50, 128)               1280000     \n    ## ___________________________________________________________________________\n    ## flatten_1 (Flatten)              (None, 6400)                  0           \n    ## ___________________________________________________________________________\n    ## dense_1 (Dense)                  (None, 128)                   819328      \n    ## ___________________________________________________________________________\n    ## dropout_1 (Dropout)              (None, 128)                   0           
\n    ## ___________________________________________________________________________\n    ## dense_2 (Dense)                  (None, 1)                     129         \n    ## ===========================================================================\n    ## Total params: 2,099,457\n    ## Trainable params: 2,099,457\n    ## Non-trainable params: 0\n    ## ___________________________________________________________________________\n\n-   rather flowchart or ggalluvial\n\n\u003cimg src=\"Readme_files/figure-markdown_github/unnamed-chunk-4-1.png\" style=\"display: block; margin: auto;\" /\u003e\n\n\u003cimg src=\"Readme_files/figure-markdown_github/unnamed-chunk-5-1.png\" style=\"display: block; margin: auto;\" /\u003e\n\n\u003c!---\n\n### Other NLP Data\n\n* https://www.kaggle.com/mrisdal/fake-news/home\n* [rpanama](https://github.com/dgrtwo/rpanama)\n    + https://www.kaggle.com/zusmani/paradise-papers/home\n* https://www.kaggle.com/shujian/arxiv-nlp-papers-with-github-link\n* [`fulltext`](https://github.com/ropensci/fulltext)\n* [rorcid](https://github.com/ropensci/rorcid)\n* [roadoi](https://github.com/ropensci/roadoi)\n* [manifestoR](https://github.com/ManifestoProject/manifestoR)\n\n\n## Other NLP Resources\n\n* https://www.kaggle.com/rtatman/stopword-lists-for-19-languages\n* https://www.r-craft.org/r-news/regex-tutorial-with-examples/\n* http://ruder.io/optimizing-gradient-descent/\n* [good for explanations](https://beta.rstudioconnect.com/ml-with-tensorflow-and-r/#22)\n* https://github.com/OmaymaS/stringr_explorer\n* [Building a neural network from scratch in R](https://selbydavid.com/2018/01/09/neural-network/)\n\n## Other NLP Packages\n\n* [Rex Friendly Regular Expressions](https://github.com/kevinushey/rex)\n* [handlr](https://ropensci.org/technotes/2019/02/27/handlr-release/)\n* [`decryptr` An extensible API for breaking captchas](https://github.com/decryptr/decryptr)\n* [`textfeatures`](https://github.com/mkearney/textfeatures)\n* [`dbx` A fast, 
easy-to-use database library for R](https://github.com/ankane/dbx)\n* [`textreuse`](https://github.com/ropensci/textreuse)\n* [Chunkwise Text-file Processing for 'dplyr'](https://github.com/edwindj/chunked)\n* [iml: interpretable machine learning](https://github.com/christophM/iml)\n* [ggfittext](https://github.com/wilkox/ggfittext)\n* [loggr](https://github.com/smbache/loggr)\n* [text generation with markov files](https://github.com/abresler/markovifyR)\n* [rBayesianOptimization](https://github.com/yanyachen/rBayesianOptimization)\n* [mlr3: Machine Learning in R - next generation](https://github.com/mlr-org/mlr3)\n* [textclean](https://github.com/trinker/textclean)\n* [quanteda: Multilingual Stopword Lists in R](http://stopwords.quanteda.io)\n* [rematch2](https://github.com/MangoTheCat/rematch2)\n* [telegram](https://github.com/lbraglia/telegram)\n* [speedtest](https://github.com/hrbrmstr/speedtest)\n* [preText](https://github.com/matthewjdenny/preText)\n* [String operations the Python way: pystr](https://github.com/Ironholds/pystr)\n* [A better dictionary class for R.](https://github.com/stefano-meschiari/dict)\n* [book code](https://github.com/IronistM/Modern-Optimization-with-R)\n* [textmineR](https://github.com/TommyJones/textmineR)\n* [SuperLearner](https://github.com/ecpolley/SuperLearner) \n\n---\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsystats%2FtextlearnR","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsystats%2FtextlearnR","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsystats%2FtextlearnR/lists"}