{"id":26183191,"url":"https://github.com/tusharsarkar3/tla","last_synced_at":"2025-04-14T23:29:10.098Z","repository":{"id":57475940,"uuid":"388028419","full_name":"tusharsarkar3/TLA","owner":"tusharsarkar3","description":"A comprehensive tool for linguistic analysis of communities","archived":false,"fork":false,"pushed_at":"2021-10-01T06:44:52.000Z","size":6876,"stargazers_count":49,"open_issues_count":1,"forks_count":9,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-11T10:17:53.217Z","etag":null,"topics":["hacktoberfest","machine-learning","nlp","pytorch","sentiment-analysis","text-classification"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/TLAF/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tusharsarkar3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"License.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-21T07:14:02.000Z","updated_at":"2025-02-18T16:06:30.000Z","dependencies_parsed_at":"2022-09-07T17:12:53.365Z","dependency_job_id":null,"html_url":"https://github.com/tusharsarkar3/TLA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tusharsarkar3%2FTLA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tusharsarkar3%2FTLA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tusharsarkar3%2FTLA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tusharsarkar3%2FTLA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tusharsarkar3","download_url":"https:
//codeload.github.com/tusharsarkar3/TLA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248977586,"owners_count":21192622,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hacktoberfest","machine-learning","nlp","pytorch","sentiment-analysis","text-classification"],"created_at":"2025-03-11T22:36:03.102Z","updated_at":"2025-04-14T23:29:10.078Z","avatar_url":"https://github.com/tusharsarkar3.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TLA - Twitter Linguistic Analysis\n## Tool for linguistic analysis of communities \n\n\n[![](https://img.shields.io/badge/Made_with-PyTorch-res?style=for-the-badge\u0026logo=pytorch)](https://pytorch.org/ \"PyTorch\")\n\n\nTLA is built using PyTorch, Transformers and several other State-of-the-Art machine learning\ntechniques and it aims to expedite and structure the cumbersome process of collecting, labeling, and analyzing data\nfrom Twitter for a corpus of languages while providing detailed labeled datasets\nfor all the languages. 
The analysis\nprovided by TLA will also go a long way toward understanding the sentiments of\ndifferent linguistic communities and coming up with new and innovative solutions\nfor their problems based on the analysis.\nThe library supports the following languages:\u003cbr\u003e\n\n| Language | Code | Language | Code |\n| ---------------- | ---------------- | ---------------- | ---------------- |\n| English | en | Hindi | hi |\n| Swedish | sv | Thai | th |\n| Dutch | nl | Japanese | ja |\n| Turkish | tr | Urdu | ur |\n| Indonesian | id | Portuguese | pt |\n| French | fr | Chinese | zn-ch |\n| Spanish | es | Persian | fa |\n| Romanian | ro | Russian | ru |\n\n\n\n## Features\n\n- Provides 16 labeled datasets for different languages for analysis.\n- Implements a BERT-based architecture to identify languages.\n- Provides functionality to extract, process, and label tweets from Twitter.\n- Provides a Random Forest classifier to implement sentiment analysis on any string.\n\n---\n\n\n### Installation :\n```\npip install --upgrade git+https://github.com/tusharsarkar3/TLA.git\n```\n---\n## \u003cdiv align=\"center\"\u003eOverview \u003c/div\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eExtract data\u003c/summary\u003e\n\n\n```\nfrom TLA.Data.get_data import store_data\nstore_data('en',False)\n```\nThis will extract the unlabeled data and store it in a new directory named datasets\ninside data.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eLabel data\u003c/summary\u003e\n\n\n```\nfrom TLA.Datasets.get_lang_data import language_data\ndf = language_data('en')\nprint(df)\n```\nThis will print the labeled data that we have already collected.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eClassify languages\u003c/summary\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eTraining \u003c/summary\u003e\n\nTraining can be done in the following way:\n\n```\nfrom 
TLA.Lang_Classify.train import train_lang\ntrain_lang(path_to_dataset,epochs)\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ePrediction \u003c/summary\u003e\n\nInference is done in the following way:\n\n```\nfrom TLA.Lang_Classify.predict import predict\nmodel = get_model(path_to_weights)\npreds = predict(dataframe_to_be_used,model)\n```\n\u003c/details\u003e\n\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eAnalyse\u003c/summary\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eTraining \u003c/summary\u003e\n\nTraining can be done in the following way:\n\n```\nfrom TLA.Analyse.train_rf import train_rf\ntrain_rf(path_to_dataset)\n```\nThis will store the models and vectorizers in separate directories named\nsaved_rf and saved_vec, both inside the Analysis directory.\nFurther instructions for training on multiple languages are given in the next section, which\nshows how to run the commands from the CLI.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eFinal Analysis \u003c/summary\u003e\n\nAnalysis is done in the following way:\n\n```\nfrom TLA.Analysis.analyse import analyse_data \nanalyse_data(path_to_weights)\n```\n\nThis will store the final analysis as a .csv file inside a new directory named\nanalysis.\n\n\u003c/details\u003e\n\n\n\u003c/details\u003e\n\n\n## \u003cdiv align=\"center\"\u003eOverview with Git\u003c/div\u003e\n\u003cdetails\u003e \n\u003csummary\u003eAlternative installation method\u003c/summary\u003e\n\n```\ngit clone https://github.com/tusharsarkar3/TLA.git\n```\n\u003c/details\u003e\n\u003cdetails\u003e\n\u003csummary\u003eExtract data\u003c/summary\u003e\nNavigate to the required directory:\n\n```\ncd Data\n```\n\nRun the following command:\n```\npython get_data.py --lang en --process True\n```\nThe --lang flag selects the language of the required dataset, and the\n--process flag indicates whether pre-processing should be done before returning the data.\nGive the following codes in the 
lang flag for the required language:\n\n\n\n \u003csummary\u003eLoading Dataset\u003c/summary\u003e\n\nTo load a dataset, run the following command in Python.\n \n```\nimport pandas as pd\ndf= pd.read_csv(\"TLA/TLA/Datasets/get_data_en.csv\")\n \n```\nThe command will return a dataframe consisting of the data for the specific language requested.\n \nIn the phrase get_data_en, en can be substituted by the desired language code to load the dataframe for that language.\n \n  \u003csummary\u003ePre-Processing\u003c/summary\u003e\n \n To preprocess a dataframe of tweets, run the following commands.\n \n In your terminal, run\n \n ```\n cd Data\n ```\n \n then run the following in Python\n \n ```\n from TLA.Data import Pre_Process_Tweets\n \n df=Pre_Process_Tweets.pre_process_tweet(df)\n ```\n \n Here the function pre_process_tweet takes a dataframe of tweets as input and returns a dataframe in which the list of preprocessed words\n for each tweet is placed next to that tweet.\n \n \n \n \n\u003c/details\u003e\n\n\n\n\n\u003cdetails\u003e\n\u003csummary\u003eAnalysis\u003c/summary\u003e\n \n \u003csummary\u003e Training \u003c/summary\u003e\n To train a random forest classifier for sentiment analysis, run the following commands in your terminal.\n \n ```  \n cd Analysis\n ```\n then \n \n ```\n python train_rf.py --path \"path to your datafile\" --train_all_datasets False\n ```\n \n Here the --path flag is the path to the dataset you want to train the Random Forest classifier on, and\n the --train_all_datasets flag is a boolean which can be used to train the model on multiple datasets at once.\n \n The output is a file with a .pkl extension saved at \"TLA\Analysis\saved_rf\{}.pkl\"\n The output of vectorization is stored in a .pkl file at \"TLA\Analysis\saved_vec\{}.pkl\"\n \n \u003csummary\u003e Get Sentiment \u003c/summary\u003e\n \n To get the sentiment of any string use the following 
code.\n \n In your terminal type\n \n ```\n cd Analysis\n ```\n then in your terminal type\n \n ```\n python get_sentiment.py --prediction \"Your string for prediction to be made upon\" --lang \"en\"\n ```\n \n Here the --prediction flag takes the string whose sentiment you want to get.\n The --lang flag is the language code of the language your string is written in.\n \n The output is a sentiment which is either positive or negative depending on your string.\n \n \n \u003csummary\u003eStatistics\u003c/summary\u003e\n \n To get comprehensive statistics on the sentiment of the datasets, run the following commands.\n \n In your terminal type\n \n ```\n cd Analysis\n ```\n \n then\n \n ```\n python analyse.py \n ```\n \n This will write a table1.csv file at 'TLA\Analysis\analysis\table1.csv' containing statistics on the\n percentage of positive and negative tweets for a given language dataset.\n \n It will also write a table2.csv file at 'TLA\Analysis\analysis\table2.csv' containing statistics for all languages combined.\n \n \n \u003c/details\u003e  \n\n\n\n\n\n\n\u003cdetails\u003e\n\u003csummary\u003eLanguage Classification \u003c/summary\u003e\n \u003csummary\u003eTraining\u003c/summary\u003e\n To train a model for language classification on a given dataset, run the following commands.\n \n In your terminal run\n \n ```\ncd Lang_Classify\n ```\n then run\n \n ```\n python train.py --data \"path for your dataset\" --model \"path to weights if pretrained\" --epochs 4\n ```\n \nThe --data flag requires the path to your training dataset.\n \n The --model flag requires the path to the model you want to use.\n \n The --epochs flag sets the number of epochs you want to train your model for.\n \n The output is a file with a .pt extension named saved_wieghts_full.pt where your trained weights are stored.\n \n \n \u003csummary\u003ePrediction\u003c/summary\u003e\n To make a prediction on any given string, use the 
following code.\n \n In your terminal type\n \n ```\n cd Lang_Classify\n ```\n then run the code\n \n ```\n python predict.py --predict \"Text/DataFrame for language to be predicted\" --weights \" Path for the stored weights of your model \" \n ```\n \n The --predict flag requires the string you want to get the language for.\n \n The --weights flag is the path to the stored weights you want your model to use to make predictions.\n \n \nThe output is the language your string is written in.\n\n\n\n\u003c/details\u003e\n \n\n\n\n---\n### Results:\n\n![img](ss/performance.png)\n\nPerformance of TLA (loss vs. epochs)\n\n \n \n \n | Language | Total tweets | Positive Tweets Percentage | Negative Tweets Percentage |\n | ---------------- | ---------------- | ---------------- | ---------------- |\n | English | 500 | 66.8 | 33.2 |\n | Spanish | 500 | 61.4 | 38.6 |\n | Persian | 50 | 52 | 48 |\n | French | 500 | 53 | 47 |\n | Hindi | 500 | 62 | 38 |\n | Indonesian | 500 | 63.4 | 36.6 |\n | Japanese | 500 | 85.6 | 14.4 |\n | Dutch | 500 | 84.2 | 15.8 |\n | Portuguese | 500 | 61.2 | 38.8 |\n | Romanian | 457 | 85.55 | 14.44 |\n | Russian | 213 | 62.91 | 37.08 |\n | Swedish | 420 | 80.23 | 19.76 |\n | Thai | 424 | 71.46 | 28.53 |\n | Turkish | 500 | 67.8 | 32.2 |\n | Urdu | 42 | 69.04 | 30.95 |\n | Chinese | 500 | 80.6 | 19.4 |\n \n---\n#### Reference:\n\n ```\n@misc{sarkar2021tla,\n      title={TLA: Twitter Linguistic Analysis}, \n      author={Tushar Sarkar and Nishant Rajadhyaksha},\n      year={2021},\n      eprint={2107.09710},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n ```\n\n ```\n@misc{640cba8b-35cb-475e-ab04-62d079b74d13,\n  title = {TLA: Twitter Linguistic Analysis},\n  author = {Tushar Sarkar and Nishant Rajadhyaksha},\n   journal = {Software Impacts},\n  doi = {10.24433/CO.6464530.v1}, \n  howpublished = {\\url{https://www.codeocean.com/}},\n  year = 2021,\n  month = {6},\n  version = {v1}\n}\n ```\n\n---\n #### Features to be added :\n- Access to 
more languages\n- Creating a GUI-based system for better accessibility\n- Improving performance of the baseline model\n\n---\n\n\u003ch3 align=\"center\"\u003e\u003cb\u003eDeveloped by \u003ca href=\"https://github.com/tusharsarkar3\"\u003eTushar Sarkar\u003c/a\u003e and \u003ca href=\"https://github.com/nishant42491\"\u003eNishant Rajadhyaksha\u003c/a\u003e\u003c/b\u003e\u003c/h3\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftusharsarkar3%2Ftla","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftusharsarkar3%2Ftla","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftusharsarkar3%2Ftla/lists"}