{"id":13482948,"url":"https://github.com/apcode/tensorflow_fasttext","last_synced_at":"2025-03-27T13:33:08.420Z","repository":{"id":146061992,"uuid":"94654088","full_name":"apcode/tensorflow_fasttext","owner":"apcode","description":"Simple embedding based text classifier inspired by fastText, implemented in tensorflow","archived":false,"fork":false,"pushed_at":"2018-07-18T23:12:33.000Z","size":60,"stargazers_count":302,"open_issues_count":9,"forks_count":93,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-10-30T16:41:27.272Z","etag":null,"topics":["fasttext","language-identification","tensorflow","text-classifier"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apcode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-06-17T23:15:26.000Z","updated_at":"2024-09-13T12:21:22.000Z","dependencies_parsed_at":"2023-04-24T04:18:55.320Z","dependency_job_id":null,"html_url":"https://github.com/apcode/tensorflow_fasttext","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apcode%2Ftensorflow_fasttext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apcode%2Ftensorflow_fasttext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apcode%2Ftensorflow_fasttext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apcode%2Ftensorflow_fasttext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/apcode","download_url":"https://codeload.github.com/apcode/tensorflow_fasttext/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245854744,"owners_count":20683409,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fasttext","language-identification","tensorflow","text-classifier"],"created_at":"2024-07-31T17:01:06.947Z","updated_at":"2025-03-27T13:33:08.052Z","avatar_url":"https://github.com/apcode.png","language":"Python","funding_links":[],"categories":["Models/Projects","Python"],"sub_categories":[],"readme":"# FastText in Tensorflow\n\nThis project is based on the ideas in Facebook's [FastText](https://github.com/facebookresearch/fastText) but implemented in\nTensorflow. However, it is not an exact replica of fastText.\n\nClassification is done by embedding each word, taking the mean\nembedding over the full text and classifying that using a linear\nclassifier. The embedding is trained with the classifier. You can\nalso enable character ngrams of length 2 or more. These ngrams get hashed then\nembedded in a similar manner to the original words. Note that ngrams make\ntraining much slower but only make marginal improvements in\nperformance, at least in English.\n\nI may implement skipgram and cbow training later, or preloading of\nembedding tables.\n\n\u003c\u003c Still WIP \u003e\u003e\n\nYou can use [Horovod](https://github.com/uber/horovod) to distribute\ntraining across multiple GPUs, on one or multiple servers. 
See usage\nsection below.\n\n## FastText Language Identification\n\nI have added utilities to train a classifier to detect languages, as\ndescribed in [Fast and Accurate Language Identification using\nFastText](https://fasttext.cc/blog/2017/10/02/blog-post.html).\n\nSee usage below. It basically works in the same way as default usage.\n\n## Implemented:\n- classification of text using word embeddings\n- char ngrams, hashed to n bins\n- training and prediction program\n- serve models on tensorflow serving\n- preprocess facebook format, or text input into tensorflow records\n\n## Not Implemented:\n- separate word vector training (though can export embeddings)\n- hierarchical softmax\n- quantize models (supported by tensorflow, but I haven't tried it yet)\n\n# Usage\n\nThe following are examples of how to use the applications. Get full help with the\n`--help` option on any of the programs.\n\nTo transform input data into tensorflow Example format:\n\n    process_input.py --facebook_input=queries.txt --output_dir=. --ngrams=2,3,4\n\nOr, using a text file with one example per line with an extra file for labels:\n\n    process_input.py --text_input=queries.txt --labels=labels.txt --output_dir=.\n\nTo train a text classifier:\n\n    classifier.py \\\n      --train_records=queries.tfrecords \\\n      --eval_records=queries.tfrecords \\\n      --label_file=labels.txt \\\n      --vocab_file=vocab.txt \\\n      --model_dir=model \\\n      --export_dir=model\n\nTo predict classifications for text, use a saved_model from\nclassifier. `classifier.py --export_dir` stores a saved model in a\nnumbered directory below `export_dir`. Pass this directory to the\nfollowing to use that model for predictions:\n\n    predictor.py \\\n      --saved_model=model/12345678 \\\n      --text=\"some text to classify\" \\\n      --signature_def=proba\n\nTo export the embedding layer, use predictor with the `embedding` signature. 
Note,\nthis will only be the text embedding, not the ngram embeddings.\n\n    predictor.py \\\n      --saved_model=model/12345678 \\\n      --text=\"some text to classify\" \\\n      --signature_def=embedding\n\nUse the provided script to train easily:\n\n    train_classifier.sh path-to-data-directory\n\n# Language Identification\n\nTo implement something similar to the method described in [Fast and\nAccurate Language Identification using\nFastText](https://fasttext.cc/blog/2017/10/02/blog-post.html) you need to download the data:\n\n    lang_dataset.sh [datadir]\n\nYou can then process the training and validation data using\n`process_input.py` and `classifier.py` as described above.\n\nThere is a utility script to do this for you:\n\n    train_langdetect.sh datadir\n\nIt reaches about 96% accuracy using word embeddings and this increases to nearly 99% when\nadding `--ngrams=2,3,4`.\n\n# Distributed Training\n\nYou can run training across multiple GPUs either on one or multiple\nservers. To do so you need to install MPI and\n[Horovod](https://github.com/uber/horovod), then add the `--horovod`\noption. Training speed scales almost linearly with the number of\nGPUs: if you have 2 GPUs on your server, it should run\nclose to 2x the speed.\n\n    NUM_GPUS=2\n    mpirun -np $NUM_GPUS python classifier.py \\\n      --horovod \\\n      --train_records=queries.tfrecords \\\n      --eval_records=queries.tfrecords \\\n      --label_file=labels.txt \\\n      --vocab_file=vocab.txt \\\n      --model_dir=model \\\n      --export_dir=model\n\nThe training script, `train_classifier.sh`, has this option added.\n\n# Tensorflow Serving\n\nAs well as using `predictor.py` to run a saved model to provide\npredictions, it is easy to serve a saved model using Tensorflow\nServing with a client-server setup. A simple RPC client (`predictor_client.py`) is supplied\nthat gets predictions from the Tensorflow Serving server.\n\nFirst make sure you install the tensorflow serving binaries. 
Instructions are [here](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/setup.md#installing-the-modelserver).\n\nYou then serve the latest saved model by supplying the base export\ndirectory where you exported saved models to. This directory will\ncontain the numbered model directories:\n\n    tensorflow_model_server --port=9000 --model_base_path=model\n\nNow you can make requests to the server using gRPC calls. An example\nsimple client is provided in `predictor_client.py`:\n\n    predictor_client.py --text=\"Some text to classify\"\n\n# Facebook Examples\n\n\u003c\u003c NOT IMPLEMENTED YET \u003e\u003e\n\nYou can compare with Facebook's fastText by running similar examples\nto what's provided in their repository.\n\n    ./classification_example.sh\n    ./classification_results.sh\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapcode%2Ftensorflow_fasttext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapcode%2Ftensorflow_fasttext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapcode%2Ftensorflow_fasttext/lists"}