{"id":15014151,"url":"https://github.com/thepanacealab/smmt","last_synced_at":"2025-08-20T22:32:08.966Z","repository":{"id":40502997,"uuid":"238335239","full_name":"thepanacealab/SMMT","owner":"thepanacealab","description":"Social Media Mining Toolkit (SMMT) main repository","archived":false,"fork":false,"pushed_at":"2022-11-11T14:26:26.000Z","size":534,"stargazers_count":133,"open_issues_count":0,"forks_count":37,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-12-09T11:10:40.580Z","etag":null,"topics":["annotation","data-acquisition","data-annotation","data-preprocessing","gathering","spacy","tweets","twitter-api"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thepanacealab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-05T00:24:00.000Z","updated_at":"2024-11-09T04:41:29.000Z","dependencies_parsed_at":"2022-09-20T07:52:09.203Z","dependency_job_id":null,"html_url":"https://github.com/thepanacealab/SMMT","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thepanacealab%2FSMMT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thepanacealab%2FSMMT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thepanacealab%2FSMMT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thepanacealab%2FSMMT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thepanacealab","download_url":"https://codeload.github.com
/thepanacealab/SMMT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230462907,"owners_count":18229864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation","data-acquisition","data-annotation","data-preprocessing","gathering","spacy","tweets","twitter-api"],"created_at":"2024-09-24T19:45:15.701Z","updated_at":"2024-12-19T16:12:03.749Z","avatar_url":"https://github.com/thepanacealab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Social Media Mining Toolkit (SMMT)\n\n\u003ctable border=\"1\" style='border-collapse:collapse;color:#000'\u003e\n\u003cthead\u003e\n  \u003ctr style='border:1px solid white;'\u003e\n    \u003ctd style='border:0px solid white;' width=\"30%\"\u003e\u003cimg src=\"http://www.jmbanda.com/SMMT_sticker.png\" alt=\"Aphrodite Sticker\"\u003e\u003c/td\u003e\n    \u003ctd style='border:0px solid white;'\u003e The set of tools collected and presented here is designed to facilitate the acquisition, preprocessing, and initial exploration of social media data (mostly Twitter for now). This centralized repository depends on other widely available libraries that need to be installed. \u003cbr/\u003e\u003cbr/\u003e\nWe separated this toolkit into three categories (each in its own folder): \u003cbr/\u003e\u003cbr/\u003e\n1. \u003cb\u003eData Acquisition Tools:\u003c/b\u003e Utilities to gather data from social media sites \u003cbr/\u003e\n2. 
\u003cb\u003eData Preprocessing Tools:\u003c/b\u003e Utilities to parse social media 'raw' data and to separate by terms \u003cbr/\u003e\n      3. \u003cb\u003eData Annotation and Standardization Tools:\u003c/b\u003e Utilities to make automatic NER annotations on preprocessed tweets, plugins to use popular annotation tools and NER systems \u003cbr/\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003c/table\u003e\n\n# Usage\n\n1. Install dependencies (below)\n1. Clone repository\n1. Make sure you have your Twitter API keys handy if you are gathering any Twitter data\n1. Each tool and its usage are described in the README file in the folder for each category of tools.\n\nAll the libraries used in this toolkit can be installed using the following command. \n\n```\nsh requirements.sh\n```\n**Note:**  If you would like to set up headless browsing automation tasks, please install the additional dependencies given below.\n\n# Dependencies and versions used\n\n1. Python 3+\n\n1. [Spacy v2.2](https://spacy.io/usage)\n``` \npip install spacy \npython -m spacy download en\npython -m spacy download en_core_web_sm\n```\n1. [Twarc](https://github.com/DocNow/twarc)\n` pip install twarc `\n\n1. [Tweepy v3.8.0](http://docs.tweepy.org/en/latest/)\n` pip install tweepy `\n\n1. [argparse - v3.2](https://docs.python.org/3/library/argparse.html)\n` pip install argparse `\n\n1. [xtract - v0.1a3](https://pypi.org/project/xtract/)\n` pip install xtract `\n\n**NOTE:** If you are using the scraping utility, install the following dependencies. These dependencies are needed for the headless browsing automation tasks (no need to have a screen open for them). Configuration of these items is very finicky, but there is plenty of documentation online.\n\n1. [Xvfb](https://linux.die.net/man/1/xvfb)\n` sudo yum install Xvfb `\n\n1. [Firefox](https://www.mozilla.org/en-US/firefox/linux/)\n` sudo yum install firefox `\n\n1. [selenium](https://selenium.dev/)\n` pip install -U selenium `\n\n1. 
[pyvirtualdisplay - v0.25](https://pypi.org/project/PyVirtualDisplay/)\n` pip install pyvirtualdisplay `\n\n1. [GeckoDriver - v0.26.0](https://github.com/mozilla/geckodriver/releases)\n` sudo yum install jq `\n\nand then use the provided utility:\n\n` bash SMMT/data_acquisition/geckoDriverInstall.sh `\n\nIf you still have issues or the Firefox window is popping up through your X11, follow this:\nhttps://www.tienle.com/2016/09-20/run-selenium-firefox-browser-centos.html\n\n# Twitter Keys\n**This is a very important step: if you do not have Twitter API keys, none of the software that uses the Twitter API will work.**\n\n# How to cite our work:\nIf you used SMMT and liked it, please cite the following paper:\n\nR Tekumalla and JM Banda. \"Social Media Mining Toolkit (SMMT)\". Genomics \u0026 Informatics, 18, (2), 2020. https://doi.org/10.5808/GI.2020.18.2.e16\n\n# Social Media Mining Toolkit (SMMT) Extra Information\n\n## Data Acquisition Tools:\n1. **Twitter hydration tool** - This script will hydrate tweet IDs provided by others. \n1. **Twitter gathering tool** - This script will allow users to specify hashtags and capture new tweets with the given hashtags from the Twitter faucet.\n\n## Data Preprocessing Tools: \n1. **Twitter JSON extraction tool** - While seemingly trivial, most biomedical researchers do not want to work with JSON objects. This tool will take the fields the researcher wants and output a simple-to-use CSV file created from the provided data. \n\n## Data Annotation and Standardization Tools: \n1. **Spacy dictionary-based annotation pipeline** - This is the tool that will require the most work during the hackathon. This pipeline will be available as a service as well, with the user providing their dictionaries and feeding data directly.  \n1. **Dictionary generation tool** - This tool will transform ontologies or provided dictionary files into spaCy-compliant dictionaries to use with the previous pipeline.\n1. 
**Manual annotation hooks for tools such as the brat annotation tool** \n\n\nThis work was conceptualized for, and (mostly) carried out at, the [Biomedical Linked Annotation Hackathon 6](http://blah6.linkedannotation.org/) in Tokyo, Japan.\n\n![BLAH](http://www.jmbanda.com/blah6.png)\n\nWe are very grateful for the support for this work.\n\n# Proposed functionality of SMMT V1.0\n\n![Architecture](http://www.jmbanda.com/SMMT-v1.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthepanacealab%2Fsmmt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthepanacealab%2Fsmmt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthepanacealab%2Fsmmt/lists"}