{"id":20807460,"url":"https://github.com/tikquuss/imbd","last_synced_at":"2025-07-25T11:33:32.441Z","repository":{"id":76992704,"uuid":"266147740","full_name":"Tikquuss/imbd","owner":"Tikquuss","description":" ​Sentiment analysis on IMBD ​Large Movie Review Dataset​","archived":false,"fork":false,"pushed_at":"2021-05-07T15:21:48.000Z","size":1450,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-25T01:06:55.239Z","etag":null,"topics":["bert","cnn","gru","imbd","lstm","rnn","sentiment-analysis","sentiment-classification"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Tikquuss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-22T15:44:12.000Z","updated_at":"2022-11-05T01:46:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"4901acef-88b9-4a4a-83ad-f046530e7d75","html_url":"https://github.com/Tikquuss/imbd","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Tikquuss/imbd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Fimbd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Fimbd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Fimbd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Fimbd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/Tikquuss","download_url":"https://codeload.github.com/Tikquuss/imbd/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tikquuss%2Fimbd/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266997941,"owners_count":24018949,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-25T02:00:09.625Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","cnn","gru","imbd","lstm","rnn","sentiment-analysis","sentiment-classification"],"created_at":"2024-11-17T19:37:59.699Z","updated_at":"2025-07-25T11:33:32.414Z","avatar_url":"https://github.com/Tikquuss.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"In this repository, we build machine learning models for sentiment analysis (i.e. detecting whether a sentence is positive or negative) using the IMDB Large Movie Review Dataset. We use three types of models for this purpose: recurrent models, convolutional models and models based entirely on the attention mechanism. 
See this [notebook](notebook.ipynb) for more details.\n\n## Dependencies\n\n- [Python 3](https://www.python.org/downloads/)\n- [NumPy](http://www.numpy.org/)\n- [PyTorch](http://pytorch.org/)\n- [torchtext](https://pypi.org/project/torchtext/)\n- [transformers](https://pypi.org/project/transformers/)\n- [spacy](https://pypi.org/project/spacy/) : after installing spacy, run this in your terminal : `python -m spacy download en` [source](https://github.com/hamelsmu/Seq2Seq_Tutorial/issues/1)\n\n## Pretrained models\n\n```\n\n```\n\n## Train your own models\n\n### 1) Instantiate one of the following models: RNN, LSTM, CNN, CNN1d or BERTGRUSentiment.\n\n```\nfrom src.model import RNN, LSTM, CNN, CNN1d, BERTGRUSentiment, Trainer\n```\n```\n# api means `any positive integer`\n```\n```\nrnn_model = RNN(\n    input_dim = api, # dimension of the one-hot vectors, which is equal to the vocabulary size; will be updated to len(dataset[\"TEXT\"].vocab) during compilation\n    embedding_dim = 100, # size of the dense word vectors\n    hidden_dim = 256, # size of the hidden states\n    output_dim = 1 # usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.\n)\n```\n\n```\nlstm_model = LSTM(\n    vocab_size = api, # vocabulary size, will be updated to len(dataset[\"TEXT\"].vocab) during compilation\n    embedding_dim = 100, # size of the dense word vectors\n    hidden_dim = 256, # size of the hidden states\n    output_dim = 1, # usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.\n    n_layers = 2, # number of layers\n    bidirectional = True, # bidirectional or not\n    dropout = 0.5, # we use a method of regularization called dropout. 
Dropout works by randomly dropping out (setting to 0) neurons in a layer during a forward pass.\n    pad_idx = api # index of the \u003cpad\u003e token in the vocabulary, will be updated to dataset[\"TEXT\"].vocab.stoi[dataset[\"TEXT\"].pad_token] during compilation\n)\n```\n\n```\n# Use CNN1d instead of CNN to run the 1-dimensional convolutional model; both models give almost identical results.\n\ncnn_model = CNN( \n    vocab_size = api, # vocabulary size, will be updated to len(TEXT.vocab) during compilation\n    embedding_dim = 100, # size of the dense word vectors\n    n_filters = 100, # number of filters\n    filter_sizes = [3,4,5], # sizes of the filters (kernels); each filter is [n x emb_dim], where n is the size of the n-grams.\n    output_dim = 1, # usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.\n    dropout = 0.5, # we use a method of regularization called dropout. Dropout works by randomly dropping out (setting to 0) neurons in a layer during a forward pass.\n    pad_idx = api # index of the \u003cpad\u003e token in the vocabulary, will be updated to TEXT.vocab.stoi[TEXT.pad_token] during compilation\n)\n```\n\n```\nfrom transformers import BertModel\n\nbert_model = BERTGRUSentiment(\n    bert = BertModel.from_pretrained('bert-base-uncased'), # load the pre-trained model, making sure to load the same model as we will do for the tokenizer.\n    hidden_dim = 256, # size of the hidden states\n    output_dim = 1, # usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.\n    n_layers = 2, # number of layers\n    bidirectional = True, # bidirectional or not\n    dropout = 0.25 # we use a method of regularization called dropout. 
Dropout works by randomly dropping out (setting to 0) neurons in a layer during a forward pass.\n)\n```\n\n### 2) Create a trainer and pass it the model through the `model` parameter of `Trainer.__init__`. The `dump_path` parameter of the same method sets the folder where the processed data and the trained models will be stored.\n\n```\ntrainer = Trainer(\n    model = \"your model\", \n    dump_path=\"your dump path\"\n)\n```\n\n### 3) Compile the trainer by providing it with the following parameters:\n\n- optimizer (torch.optim, default = Adam) : model optimizer (used to update the model parameters)\n- criterion (function, default = nn.BCEWithLogitsLoss) : loss function \n- seed (int, default = 1234) : random seed for reproducibility\n- train_n_samples (int, default = 25000) : number of training examples to consider (0 \u003c train_n_samples \u003c= 25000)\n- split_ratio (float between 0 and 1, default = 0.8) : ratio of training data to use for training, the rest for validation\n- test_n_samples (int, default = 25000) : number of test examples to consider (0 \u003c test_n_samples \u003c= 25000)\n- batch_size (int, default = 64) : number of examples per batch\n- max_vocab_size (int, default = 25000) : maximum number of tokens in the vocabulary\n\n```\n# load the data, build the optimizer and the loss function, and update the model parameters if necessary.\ntrainer.compile(\n    optimizer = \"Adam\", # or SGD\n    criterion = \"BCEWithLogitsLoss\",\n    train_n_samples = 25000,\n    seed = 1234, \n    split_ratio = 0.8, \n    test_n_samples  = 25000,\n    batch_size = 4, \n    max_vocab_size = 25000 \n)\n```\n\n### 4) Train the model\n\n```\nstats = trainer.train(\n    max_epochs = 50, # maximum number of epochs\n    improving_limit = 2, # if the model does not improve (according to `eval_metric`) during `improving_limit` epochs, we stop training and keep the best model.\n    eval_metric = \"accuracy_score\", # evaluation metric : 'loss', 'binary_accuracy', 
'accuracy_score', 'precision', 'recall', 'f1-score'\n    dump_id = \"\" # identifier used to distinguish models in the serialization folder; defaults to the name of the base model\n)\n```\n\n### 5) Display training and validation statistics: evolution of loss and accuracy.\n\n```\ntrainer.plot_statistics(statistics = stats, figsize=(20,3))\n```\n\n### 6) Test the model\n\n```\ny, y_pred = trainer.test(dump_id = \"\")\n```\n\n### 7) Put the model into production\n\n```\npredict = trainer.get_predict_sentiment()\n```\n```\n# example negative review...\nprint(predict(sentence = \"This film is too scary, too much gunfire and blood spilled inside. I can't watch bad movies like this anymore.\"))\n```\n```\n# example positive review...\nprint(predict(sentence = \"Among these actors, I prefer the most romantic one, he likes what he does, is positive about chess and knows how to celebrate victories.\"))\n```\n\n## References\n\n[1] https://paperswithcode.com/task/sentiment-analysis/latest\n\n[2] https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/comments\n\n[3] https://github.com/bentrevett/pytorch-sentiment-analysis\n\n[4] https://towardsdatascience.com/cnn-sentiment-analysis-9b1771e7cdd6\n\n[5] https://captum.ai/tutorials/IMDB_TorchText_Interpret\n\n## License\nSee the [LICENSE](LICENSE) file for more details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftikquuss%2Fimbd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftikquuss%2Fimbd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftikquuss%2Fimbd/lists"}