{"id":13631494,"url":"https://github.com/msr8/markify","last_synced_at":"2025-04-17T22:31:07.850Z","repository":{"id":47217984,"uuid":"515956271","full_name":"msr8/markify","owner":"msr8","description":"Markify is an open source command line application written in python which scrapes data from your social media accounts and utilises markov chains to generate new sentences based on the scraped data","archived":false,"fork":false,"pushed_at":"2024-08-08T03:15:44.000Z","size":28013,"stargazers_count":12,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-08T12:07:02.104Z","etag":null,"topics":["cli","discord","markov-chain","markov-chains","markov-model","markovify","nltk-python","python","reddit","scraper","twitter"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msr8.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-20T11:36:16.000Z","updated_at":"2024-08-08T03:15:48.000Z","dependencies_parsed_at":"2024-01-01T03:20:13.141Z","dependency_job_id":"09c41e1f-cc1c-4042-aefe-4e378230f7ec","html_url":"https://github.com/msr8/markify","commit_stats":{"total_commits":47,"total_committers":2,"mean_commits":23.5,"dds":"0.021276595744680882","last_synced_commit":"2bcf7a0663fcaece06c93fa05606a1eaf65d5b55"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msr8%2Fmarkify","tags_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/msr8%2Fmarkify/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msr8%2Fmarkify/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msr8%2Fmarkify/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msr8","download_url":"https://codeload.github.com/msr8/markify/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223768487,"owners_count":17199355,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","discord","markov-chain","markov-chains","markov-model","markovify","nltk-python","python","reddit","scraper","twitter"],"created_at":"2024-08-01T22:02:27.798Z","updated_at":"2024-11-08T23:30:54.937Z","avatar_url":"https://github.com/msr8.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003c!--\nCOLORS:\n\n302D41 (label)\n96CDFB (blue)\nDDB6F2 (pink)\nABE9B3 (green)\nF8BD96 (orange)\n--\u003e\n\n\u003cbr\u003e\n\n\u003cdiv align='center'\u003e\n\n   \u003c!-- \u003cimg src=\"https://img.shields.io/github/stars/msr8/markify?color=FFBE0B\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=FB5607\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/github/last-commit/msr8/markify?color=FF006E\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e   \n   \u003cimg 
src=\"https://img.shields.io/github/issues/msr8/markify?color=8338EC\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e\n   \u003cimg src=\"https://img.shields.io/github/license/msr8/markify?color=3A86FF\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n\n   \u003cbr\u003e\u003cbr\u003e\u003cbr\u003e --\u003e\n\n   \u003cimg src=\"https://img.shields.io/github/stars/msr8/markify?color=F72585\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=7209B7\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/github/last-commit/msr8/markify?color=3A0CA3\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e   \n   \u003cimg src=\"https://img.shields.io/github/issues/msr8/markify?color=4361EE\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e\n   \u003cimg src=\"https://img.shields.io/github/license/msr8/markify?color=4CC9F0\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n\n   \u003c!-- \u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n   \u003cimg src=\"https://img.shields.io/github/stars/msr8/markify?color=CDB4DB\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=FFC8DD\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/github/last-commit/msr8/markify?color=FFAFCC\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e   \n   \u003cimg src=\"https://img.shields.io/github/issues/msr8/markify?color=BDE0FE\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e\n   \u003cimg src=\"https://img.shields.io/github/license/msr8/markify?color=A2D2FF\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e --\u003e\n\n   \u003c!-- \u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\n\n   \u003cimg src=\"https://img.shields.io/github/stars/msr8/markify?color=F8BD96\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e\n   \u003cimg 
src=\"https://img.shields.io/pypi/v/markify?color=048A81\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/github/last-commit/msr8/markify?color=DDB6F2\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e   \n   \u003cimg src=\"https://img.shields.io/github/issues/msr8/markify?color=ABE9B3\u0026labelColor=302D41\u0026style=for-the-badge\"\u003e   \n   \u003cimg src=\"https://img.shields.io/github/license/msr8/markify?color=96CDFB\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e --\u003e\n\n   \u003c!-- \u003cbr\u003e\u003cbr\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=69626D\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=AB4E68\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=305252\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=E36588\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=545454\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=048A81\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e\n   \u003cimg src=\"https://img.shields.io/pypi/v/markify?color=9D6381\u0026labelColor=302D41\u0026style=for-the-badge\"/\u003e --\u003e\n\n   \u003cbr\u003e\n\n   \u003c!-- \u003cvideo controls\u003e \n        \u003csource src='https://raw.githubusercontent.com/msr8/markify/main/ass/usagelol.mp4' type=\"video/mp4\"\u003elol\n    \u003c/video\u003e --\u003e\n\u003c/div\u003e\n\n\u003cbr\u003e\n\n\u003e [!NOTE]\n\u003e Reddit scraping does not work anymore because of the new (June 2023) policy changes, due to which pushshift had to shut 
down\n\n\n\u003cbr\u003e\n\nhttps://user-images.githubusercontent.com/79649185/182558272-255becc8-1dcc-45b5-99ef-22e0596cf490.mp4\n\n\u003c!-- \u003cbr\u003e --\u003e\n\n\u003cp align='center'\u003e\n\u003ca href='https://github.com/msr8/markify' \u003eGitHub\u003c/a\u003e |\n\u003ca href='https://pypi.org/project/markify'\u003ePyPI\u003c/a\u003e\n\u003c/p\u003e\n\n\n\n\n\n\n\n\n# Index\n\n* [Introduction](#introduction)\n* [Installation](#installation)\n* [Usage](#usage)\n* [Flags](#flags)\n* [How does this work?](#how-does-this-work)\n* [FAQs](#faqs)\n\n\u003cbr\u003e\u003cbr\u003e\n\u003cbr\u003e\u003cbr\u003e\n\n# Introduction\n\nMarkify is an open source command line application written in Python which scrapes data from your social media accounts and utilises markov chains to generate new sentences based on the scraped data\n\n- Engineered a ***command-line application***, Markify, leveraging Python to extract and analyze data from social media accounts\n- Employed ***NLTK*** for meticulous data sanitization\n- Demonstrated proficiency in ***interfacing with a variety of APIs*** (official and unofficial) to aggregate data\n- Used ***markov chains*** to generate new sentences\n- Packaged the application for widespread use by uploading it to ***PyPI***\n\n\u003cbr\u003e\u003cbr\u003e\n\n# Installation\n\nThere are several ways to install markify on your device:\n\n\u003cbr\u003e\n\n## 1) Install the pip package\n***(Recommended)***\n\n```bash\npython -m pip install markify\n```\n\n## 2) Install it via pip and git\n\n```bash\npython -m pip install git+https://github.com/msr8/markify.git\n```\n\n## 3) Clone the repo and install the package\n\n```bash\ngit clone https://github.com/msr8/markify\ncd markify\npython setup.py install\n```\n\n## 4) Clone the repo and run markify without installing to PATH\n\n```bash\ngit clone https://github.com/msr8/markify\ncd markify\npython -m pip install -r requirements.txt\ncd src\npython 
markify.py\n```\n\n\u003cbr\u003e\u003cbr\u003e\n\n# Usage\n\nTo use markify, just run `markify` on the command line, but you need to set up a config file first. If you're on Windows, the default location for the config file is `%LOCALAPPDATA%\\markify\\config.json`, and on Linux/macOS it is `~/.config/markify/config.json`. Alternatively, you can provide the path to the config file using the `-c --config` flag. If you run the program and the config file doesn't exist, it creates an empty template. An ideal config file should look like:\n```json\n{\n    \"reddit\": {\n        \"username\"     : \"...\"\n    },\n    \"discord\": {\n        \"token\"        : \"...\"\n    },\n    \"twitter\": {\n        \"username\"     : \"...\"\n    }\n}\n```\nwhere the username under the reddit section is your reddit username, the token under discord is your discord token, and the username under twitter is your twitter username. If any of them is not given, the program will skip the collection process for that platform\n\n\u003cbr\u003e\u003cbr\u003e\n\n# Flags\n\nYou can view the available flags by running `markify --help`. It should show the following text:\n```\n  -h, --help            show this help message and exit\n  -c CONFIG, --config CONFIG\n                        The path to config file. By default, its {LOCALAPPDATA}/markify/config.json on\n                        windows, and ~/.config/markify/config.json on other operating systems\n  -d DATA, --data DATA  The path to the json data file. If given, the program will not scrape any data and\n                        will just compile the model and generate sentences\n  -n NUMBER, --number NUMBER\n                        Number of sentences to generate. Default is 50\n  -v, --version         Print out the version number\n```\nMore explanation is given below:\n\n\u003cbr\u003e\n\n## -c --config\n\nThis is the path to the config file (config.json). 
By default, it's `{LOCALAPPDATA}/markify/config.json` on Windows, and `~/.config/markify/config.json` on other operating systems. For example:\n```bash\nmarkify -c /Users/tyrell/Documents/config.json\n```\n\n## -d --data\n\nThis is the path to the data file containing all the scraped content. If it is given, the program doesn't scrape any data and just compiles a model based on the data present in the file. By default, a new data file is generated in the `DATA` folder in the config folder and is named `x.json` where `x` is the current epoch time in seconds. For example:\n```bash\nmarkify -d /Users/tyrell/.config/markify/DATA/1658433988.json\n```\n\n## -n --number\n\nThis is the number of sentences to generate after compiling the model. Default is 50. For example:\n```bash\nmarkify -n 20\n```\n\n## -v --version\n\nThis flag prints out the version of markify you're using. For example:\n```bash\nmarkify -v\n```\n\n\u003cbr\u003e\u003cbr\u003e\n\n# How does this work?\n\nThis program has 4 main parts: scraping reddit comments, scraping discord messages, scraping tweets, and generating sentences using markov chains. More explanation is given below\n\n\u003cbr\u003e\n\n## Scraping reddit comments\n\nThe program uses [Pushshift's API](https://github.com/pushshift/api) to scrape your comments. Since Pushshift can only return 1000 comments at a time, the program gets the timestamp of the oldest comment and then sends a request to the API to get comments before that timestamp. This loop goes on until either all your comments are scraped, or 10000 comments are scraped. I chose to use Pushshift's API since it's faster, yields more results, and doesn't need a client ID or secret\n\n\u003cbr\u003e\n\n## Scraping discord messages\n\nTo scrape discord messages, first the program checks if the token is valid or not by getting basic information (username, discriminator, and account ID) through the `/users/@me` endpoint. 
Then it gets all the DM channels you have participated in through the `/users/@me/channels` endpoint. Then it extracts the channel IDs from the response and gets the 100 most recent messages in each channel using the `/channels/channelid/messages` endpoint, where `channelid` is the channel ID. Then it goes through the response and adds the messages that are text messages, were sent by you, and aren't empty, to the data file\n\n\u003cbr\u003e\n\n## Scraping tweets\n\nThe program uses the [snscrape](https://github.com/JustAnotherArchivist/snscrape) module to scrape your tweets. The program keeps scraping your tweets until either it has scraped all the tweets, or has scraped 10000 tweets\n\n\u003cbr\u003e\n\n## Generating sentences using markov chains\n\nThe program extracts all the useful texts from the data file and builds a markov chain model based on them using the [markovify](https://github.com/jsvine/markovify) module. Then the program generates new sentences (50 by default) and prints them out\n\n\u003cbr\u003e\u003cbr\u003e\n\n# FAQs\n\n\u003cbr\u003e\n\n### Q) How do I get my discord token?\n\nRecently (as of July 2022), discord reworked its system of tokens and the format of the new tokens is a bit different. You can obtain your discord token using this [guide](https://www.androidauthority.com/get-discord-token-3149920/)\n\n\u003cbr\u003e\n\n### Q) The program is throwing an error and is telling me to install \"averaged_perceptron_tagger\" or something. What to do?\n\nRunning the command given below should work\n```bash\npython3 -c \"import nltk; nltk.download('averaged_perceptron_tagger')\"\n```\nYou can visit [this page](https://www.nltk.org/data.html) for more information\n\n\u003cbr\u003e\n\n### Q) The installation is stuck at building lxml. What to do?\n\nSadly, all you can do is wait. 
It is a [known issue with lxml](https://stackoverflow.com/questions/33064433/lxml-will-never-finish-building-on-ubuntu)\n\n\n\n\n\n\n\n\n\n\n\n\u003c!-- \nTODO\n\n-\u003e Convert the video to a video tag in setup.py\n--\u003e\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsr8%2Fmarkify","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsr8%2Fmarkify","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsr8%2Fmarkify/lists"}