{"id":13567057,"url":"https://github.com/PhantomInsights/comments-generator","last_synced_at":"2025-04-04T01:30:59.008Z","repository":{"id":101906998,"uuid":"176202733","full_name":"PhantomInsights/comments-generator","owner":"PhantomInsights","description":"A Reddit bot that generates new context-aware comments using Markov chains trained from a set of given users or subreddits comments history.","archived":false,"fork":false,"pushed_at":"2021-10-21T02:14:56.000Z","size":100,"stargazers_count":73,"open_issues_count":0,"forks_count":3,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-20T02:22:40.409Z","etag":null,"topics":["markov-chain","nlp","praw","python3","reddit-bot","requests"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PhantomInsights.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null},"funding":{"github":"agentphantom","patreon":"agentphantom"}},"created_at":"2019-03-18T04:10:35.000Z","updated_at":"2024-12-30T23:01:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"b0a8092b-edc5-4e41-bf28-51bb8e464f50","html_url":"https://github.com/PhantomInsights/comments-generator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhantomInsights%2Fcomments-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhantomInsights%2Fcomments-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhantomInsi
ghts%2Fcomments-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhantomInsights%2Fcomments-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PhantomInsights","download_url":"https://codeload.github.com/PhantomInsights/comments-generator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247107816,"owners_count":20884793,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["markov-chain","nlp","praw","python3","reddit-bot","requests"],"created_at":"2024-08-01T13:02:22.561Z","updated_at":"2025-04-04T01:30:58.647Z","avatar_url":"https://github.com/PhantomInsights.png","language":"Python","funding_links":["https://github.com/sponsors/agentphantom","https://patreon.com/agentphantom","https://www.patreon.com/bePatron?u=20521425"],"categories":["Python","*education-only* Bots"],"sub_categories":["External Reddit Tools"],"readme":"# Comments Generator\n\nThis project consists of a Reddit bot that replies to users with newly generated context-aware comments using `Markov chains` trained from existing comments of subreddits and users you desire. 
\n\nThe main purpose of this project was to document the extraction, transformation and load process (`ETL`) of Reddit comments and to create the foundation for a very simple chat bot.\n\nThe project is divided into 3 main parts: the `ETL` process, the generation of the training model, and its use to generate new comments and post them on Reddit.\n\nThe most important files are:\n\n* `step1.py` : A Python script that downloads the complete comment history from the given Reddit usernames using the Pushshift API.\n\n* `step1_alt.py` : A Python script that downloads a specified number of comments from the given subreddits using the Pushshift API.\n\n* `step2.py` : A Python script that reads the generated .csv files from step1.py/step1_alt.py, applies some light cleanup and computes their contents into a training model.\n\n* `step2_alt.py` : A Python script that reads specified .txt files and computes their contents into a training model. This script is recommended if your text sources are not Reddit comments. \n\n* `bot.py` : A Reddit bot that checks its inbox for new replies, mentions and private messages and replies to them with newly generated comments using the training model.\n\n* `step3.py` : A Python script that generates new sentences using the training model. This script is recommended if you only want to see the results and don't need a Reddit bot.\n\n## Requirements\n\nThis project uses the following Python libraries:\n\n* `PRAW` : Makes the use of the Reddit API very easy.\n* `Requests` : Used to download comments from the Pushshift API.\n\n## ETL Process\n\nThe `Pushshift` API allows us to download `Reddit` comments in batches of 500, which is really useful when we plan to download tens of thousands of comments.\n\nThis project includes 2 methods to get comments: from users or from subreddits.\n\n### Users Comments\n\nFor this kind of small script I like to use a *global* list that I can manipulate anywhere in the program. 
On bigger projects this can become a problem if the code is not carefully structured.\n\nWe start by iterating over the desired usernames and creating a `csv.writer` object for each one.\n\n```python\nfor username in USERNAMES:\n\n    writer = csv.writer(open(\"./{}.csv\".format(username),\n                                 \"w\", newline=\"\", encoding=\"utf-8\"))\n\n    # Adding the header.\n    writer.writerow([\"datetime\", \"subreddit\", \"body\"])\n\n    load_comments(username=username)\n```\n\nThe script downloads 500 comments at a time in reverse chronological order until it doesn't have any more comments to download.\n\nFrom each comment I extract 3 fields: the timestamp, the subreddit and the comment body.\n\n*Note: Currently the date and time are not used in this project. I added them to verify that I was getting the comments in the order I desired, but they can be useful for future projects.*\n\nThose 3 fields are then packed into a list and added to the *global* list.\n\n```python\nfor item in json_data[\"data\"]:\n\n    latest_timestamp = item[\"created_utc\"]\n\n    iso_date = datetime.fromtimestamp(latest_timestamp)\n\n    subreddit = item[\"subreddit\"]\n\n    body = item[\"body\"]\n\n    COMMENTS_LIST.append([iso_date, subreddit, body])\n```\n\nOnce the script finishes downloading all the comments from the current user it calls the writer's `writerows()` method with the contents of the *global* list, clears the *global* list and moves to the next user.\n\n### Subreddits Comments\n\nThis script is very similar to the previous one. The main difference is that we don't specify which users' comments we want to download; instead we download comments from all the users that participated in the given subreddits.\n\nThe default maximum number of comments has been set to 20,000. 
I found this number to be good enough for creating the training model.\n\nThe script will attempt to download the defined maximum number of comments. If the subreddit has fewer comments, the script saves them as usual and moves to the next subreddit.\n\n## Understanding Markov Chains\n\nBefore moving to more code I need to explain a few very important things about Markov chains.\n\nMarkov chains can have a *variable length memory*, which is very useful for generating natural-looking text.\n\nThe proper name of this memory is *order*. The higher the order, the more realistic the generated text will be, but with the side effect of having fewer possible outcomes for each state.\n\nTo better illustrate the difference we are going to use the following paragraph from Lou Gehrig's farewell to baseball speech to create first-order and second-order Markov chains and models.\n\n\u003e Fans, for the past two weeks you have been reading about a bad break I got. Yet today I consider myself the luckiest man on the face of the earth. 
I have been in ballparks for seventeen years and have never received anything but kindness and encouragement from you fans.\n\n### First-order\n\n```python\n{\n    'Fans,': ['for'],\n    'for': ['the', 'seventeen'],\n    'the': ['past','luckiest', 'face', 'earth.'],\n    'past': ['two'],\n    'two': ['weeks'],\n    'weeks': ['you'],\n    'you': ['have', 'fans.'],\n    'have': ['been', 'been', 'never'],\n    'been': ['reading', 'in'],\n    'reading': ['about'],\n    'about': ['a'],\n    'a': ['bad'],\n    'bad': ['break'],\n    'break': ['I'],\n    'I': ['got.', 'consider', 'have'],\n    'got.': ['Yet'],\n    'Yet': ['today'],\n    'today': ['I'],\n    'consider': ['myself'],\n    'myself': ['the'],\n    'luckiest': ['man'],\n    'man': ['on'],\n    'on': ['the'],\n    'face': ['of'],\n    'of': ['the'],\n    'earth.': ['I'],\n    'in': ['ballparks'],\n    'ballparks': ['for'],\n    'seventeen': ['years'],\n    'years': ['and'],\n    'and': ['have', 'encouragement'],\n    'never': ['received'],\n    'received': ['anything'],\n    'anything': ['but'],\n    'but': ['kindness'],\n    'kindness': ['and'],\n    'encouragement': ['from'],\n    'from': ['you']\n}\n```\n\nWith first-order Markov chains we can have multiple outcomes for each state (word) and we can generate sentences like these ones:\n\n* Yet today I have been in ballparks for the earth. I consider myself the past two weeks you fans.\n\n* Fans, for seventeen years and have been reading about a bad break I have never received anything but kindness and have been in ballparks for the earth.\n\n* Yet today I have been in ballparks for seventeen years and encouragement from you fans. 
Yet today I have been reading about a bad break I got.\n\nThe previous sentences sometimes can make a little bit of sense but if our goal is to generate realistic looking ones we can use a second-order chain.\n\n### Second-order\n\n```python\n{\n    'Fans, for': ['the'],\n    'for the': ['past'],\n    'the past': ['two'],\n    'past two': ['weeks'],\n    'two weeks': ['you'],\n    'weeks you': ['have'],\n    'you have': ['been'],\n    'have been': ['reading', 'in'],\n    'been reading': ['about'],\n    'reading about': ['a'],\n    'about a': ['bad'],\n    'a bad': ['break'],\n    'bad break': ['I'],\n    'break I': ['got.'],\n    'I got.': ['Yet'],\n    'got. Yet': ['today'],\n    'Yet today': ['I'],\n    'today I': ['consider'],\n    'I consider': ['myself'],\n    'consider myself': ['the'],\n    'myself the': ['luckiest'],\n    'the luckiest': ['man'],\n    'luckiest man': ['on'],\n    'man on': ['the'],\n    'on the': ['face'],\n    'the face': ['of'],\n    'face of': ['the'],\n    'of the': ['earth.'],\n    'the earth.': ['I'],\n    'earth. I': ['have'],\n    'I have': ['been'],\n    'been in': ['ballparks'],\n    'in ballparks': ['for'],\n    'ballparks for': ['seventeen'],\n    'for seventeen': ['years'],\n    'seventeen years': ['and'],\n    'years and': ['have'],\n    'and have': ['never'],\n    'have never': ['received'],\n    'never received': ['anything'],\n    'received anything': ['but'],\n    'anything but': ['kindness'],\n    'but kindness': ['and'],\n    'kindness and': ['encouragement'],\n    'and encouragement': ['from'],\n    'encouragement from': ['you'],\n    'from you': ['fans.']\n}\n```\n\nWe can observe that we only have one instance where the outcome can be 50/50: `'have been': ['reading', 'in']`.\n\n* Yet today I consider myself the luckiest man on the face of the earth. 
I have been in ballparks for seventeen years and have never received anything but kindness and encouragement from you fans.\n\n* I consider myself the luckiest man on the face of the earth. I have been reading about a bad break I got.\n\nThe results look more natural but we will soon realize the generated text is nearly identical to the original text.\n\nThis is why it is very important to collect a large amount of data.\n\n## Generating the Model\n\nNow that we have seen the difference between first-order and second-order Markov chains we can continue with the model generation.\n\nThe step2.py/step2_alt.py scripts allow us to define the order. The default one is 2 (second-order).\n\nWe will also have to define which .csv files we want to process. I have implemented a filter mechanism where we can define which subreddits we want to allow. This lets us filter out subreddits with NSFW or undesired content.\n\nWe then start iterating over all .csv files using the `csv.DictReader` class.\n\nSome light cleanup strips the whitespace around each comment and makes sure every comment ends with punctuation.\n\n```python\nword_dictionary = dict()\ncomments_list = list()\n\nfor csv_file in CSV_FILES:\n\n    # We iterate the .csv row by row.\n    for row in csv.DictReader(open(csv_file, \"r\", encoding=\"utf-8\")):\n\n        # Remove unnecessary whitespaces.\n        row[\"body\"] = row[\"body\"].strip()\n\n        # To improve results we ensure all comments end with punctuation.\n        ends_with_punctuation = False\n\n        for char in [\".\", \"?\", \"!\"]:\n            if row[\"body\"][-1] == char:\n                ends_with_punctuation = True\n                break\n\n        if not ends_with_punctuation:\n            row[\"body\"] += \".\"\n```\n\nAfter we have cleaned up the comment we add it to a master list.\n\nThis list is then merged into one big string that will then be split into individual words.\n\nThe purpose of this is to increase the number of 
outcomes.\n\n```python\ncomments_list.append(row[\"body\"])\n\n# We separate each comment into words.\nwords_list = \" \".join(comments_list).split()\n```\n\nCreating the model is actually not hard. We only need a way to know the current index of each word in `words_list`. The `enumerate` built-in function will be perfect for this task.\n\nWe first define our prefix, which is the current word joined with the words that follow it until it contains as many words as the order.\n\nSince Python slicing is exclusive on the end part, we don't have to do anything extra. \n\nThen we define our suffix, which is calculated in a similar way: it is the single word that comes right after the prefix.\n\n```python\nfor index, _ in enumerate(words_list):\n\n    # This will always fail for the last ORDER words since they don't have a suffix to pair with.\n    try:\n\n        prefix = \" \".join(words_list[index:index+ORDER])\n        suffix = words_list[index+ORDER]\n\n        # If the prefix is not in the dictionary, we init it with the next word.\n        if prefix not in word_dictionary.keys():\n            word_dictionary[prefix] = list([suffix])\n        else:\n            # Otherwise we append it to its inner list of outcomes.\n            word_dictionary[prefix].append(suffix)\n\n    except IndexError:\n        pass\n```\n\nIf our prefix is not in the `word_dictionary` we initialize it with a list containing the current suffix.\n\nAlternatively, if the prefix is already in the dictionary we just append the current suffix to its inner list.\n\nFinally we save the dictionary using the `pickle` module. This will save us time when reusing it in other Python scripts.\n\n*Note: If you want to create training models from other text sources such as tweets, books or chat logs you can use step2_alt.py instead. 
The script takes the contents of the specified .txt files, merges them and compiles the model in the same way as in step2.py*\n\n## Reddit Bot\n\nThis bot is simple in nature: it checks its inbox every minute for new unread messages and replies to them.\n\nWe first define a list of users to ignore. This is very important to avoid engaging in infinite conversations with other bots and to avoid errors.\n\nThen we create a `STOP_WORDS` set and add all our desired stop words in uppercase, lowercase and title form. Those will be used later to aid in the context-aware part.\n\nAfter that we load the `pickle` file into memory and remove some prefixes that are known to be used by other bots.\n\n```python\n# Complete our stop words set.\nadd_extra_words()\n\n# Load the model and remove prefixes that are commonly used by other bots.\nmodel = read_model(MODEL_FILE)\n\nfor key in list(model.keys()):\n    if \"^#\" in key or \"|\" in key or \"*****\" in key:\n        del model[key]\n```\n\nWith our model ready we create a `Reddit` object using the `PRAW` library, check our inbox and reply to new messages.\n\n\n```python\nreddit = praw.Reddit(client_id=config.APP_ID, client_secret=config.APP_SECRET,\n                     user_agent=config.USER_AGENT, username=config.REDDIT_USERNAME,\n                     password=config.REDDIT_PASSWORD)\n\nprocessed_comments = load_log()\n\nfor comment in reddit.inbox.all(limit=100):\n\n    if comment.author not in IGNORED_USERS and comment.id not in processed_comments:\n\n        new_comment = generate_comment(model=model, order=2,\n                                       number_of_sentences=2,\n                                       initial_prefix=get_prefix_with_context(model, comment.body))\n\n        # Small clean up when the bot uses Markdown and making sure the first letter is uppercase.\n        new_comment = new_comment.replace(\n            \" \u003e \", \"\\n\\n \u003e \").replace(\" * \", \"\\n\\n* \")\n\n        new_comment = 
new_comment[0].upper() + new_comment[1:]\n\n        if \"[\" not in new_comment and \"]\" in new_comment:\n            new_comment = \"[\" + new_comment\n\n        new_comment = new_comment.replace(\"U/\", \"u/\").replace(\"R/\", \"r/\")\n\n        comment.reply(new_comment)\n        update_log(comment.id)\n        print(\"Replied to:\", comment.id)\n```\n\nThe most important functions to generate the new comment are `get_prefix_with_context()` and `generate_comment()`.\n\nThe `get_prefix_with_context()` function tries to get a prefix that matches the given context which can be a previous comment or an arbitrary string.\n\nTo achieve this we first clean the context by removing stop words, punctuation marks and duplicates.\n\nOnce cleaned we shuffle the model prefixes and sample one prefix for each remaining word in the context.\n\nFinally we choose one of the sampled prefixes and return it. \n\n```python\ndef get_prefix_with_context(model, context):\n\n    model_keys = list(model.keys())\n\n    # Some light cleanup.\n    context = context.replace(\"?\", \"\").replace(\"!\", \"\").replace(\".\", \"\")\n    context_keywords = list(set(context.split()))\n\n    # we remove stopwords from the context.\n    # We use reversed() to remove items from the list without affecting the sequence.\n    for word in reversed(context_keywords):\n\n        if len(word) \u003c= 3 or word in STOP_WORDS:\n            context_keywords.remove(word)\n\n    # If our context has no keywords left we return a random prefix.\n    if len(context_keywords) == 0:\n        return get_prefix(model_keys)\n\n    # We are going to sample one prefix for each available keyword and return only one.\n    random.shuffle(model_keys)\n    sampled_prefixes = list()\n\n    for word in context_keywords:\n\n        for prefix in model_keys:\n\n            if word in prefix or word.lower() in prefix or word.title() in prefix:\n                sampled_prefixes.append(prefix)\n                break\n\n    # If we 
don't get any samples we fall back to the random prefix method.\n    if len(sampled_prefixes) == 0:\n        return get_prefix(model_keys)\n    else:\n        return random.choice(sampled_prefixes)\n```\n\nWhen the previous function fails to get a prefix it falls back to `get_prefix()`. This function tries to get a prefix that meets 2 conditions.\n\n1. The prefix must start with an uppercase letter.\n2. The prefix must not end with a punctuation mark.\n\n```python\ndef get_prefix(model_keys):\n\n    # We give it a maximum of 10,000 tries.\n    for i in range(10000):\n\n        random_prefix = random.choice(model_keys)\n\n        if random_prefix[0].isupper():\n\n            ends_with_punctuation = False\n            stripped_suffix = random_prefix.strip()\n\n            for char in [\".\", \"?\", \"!\"]:\n                if stripped_suffix[-1] == char:\n                    ends_with_punctuation = True\n                    break\n\n            if not ends_with_punctuation:\n                break\n\n    return random_prefix\n```\n\nThis function is mostly a personal preference. I found that if the starting prefix matches both conditions the rest of the chain will look more natural.\n\nYou are free to specify any other prefix as the `initial_prefix`. 
In step3.py I included an example of each of the 3 possible methods.\n\nAnd finally, we have the function that constructs the chain.\n\nWe start the chain with the `initial_prefix` and choose one random suffix from it.\n\nThen we extract the latest suffix from the ongoing chain and request the next suffix. We repeat this until we hit a suffix that ends with a punctuation mark, which counts as one finished sentence.\n\nOnce we have the desired number of sentences we break the loop and return our newly generated string of text.\n\nI added a small fail-safe of 500 max suffixes in case we go infinite.\n\n```python\ndef generate_comment(model, number_of_sentences, initial_prefix, order):\n\n    model_keys = list(model.keys())\n    counter = 0\n    latest_suffix = initial_prefix\n    final_sentence = latest_suffix + \" \"\n\n    # We add a maximum sentence length to avoid going infinite in edge cases.\n    for _ in range(500):\n\n        try:\n            latest_suffix = random.choice(model[latest_suffix])\n        except KeyError:\n            # If we don't get another word we take another one randomly and continue the chain.\n            latest_suffix = get_prefix(model_keys)\n\n        final_sentence += latest_suffix + \" \"\n        latest_suffix = \" \".join(final_sentence.split()[-order:]).strip()\n\n        for char in [\".\", \"?\", \"!\"]:\n            if latest_suffix[-1] == char:\n                counter += 1\n                break\n\n        if counter \u003e= number_of_sentences:\n            break\n\n    return final_sentence\n```\n\nTo extract the latest suffix from the chain we will use the handy negative slicing method `[-order:]`.\n\n*Note: If you don't want to use a Reddit bot and only want to see the results I recommend using step3.py. This script does exactly the same thing as bot.py but without the Reddit-specific code.*\n\n## Conclusion\n\nI hope you have enjoyed the article. This project was something I wanted to do for a long time and I'm glad it worked better than I expected.\n\nIf you plan to deploy the bot on 
Reddit, I strongly suggest that you read the [Bottiquette](https://www.reddit.com/r/Bottiquette/wiki/bottiquette). You are welcome to use [my subreddit](https://www.reddit.com/r/PhantomAppDev) to test your bot.\n\n[![Become a Patron!](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/bePatron?u=20521425)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPhantomInsights%2Fcomments-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPhantomInsights%2Fcomments-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPhantomInsights%2Fcomments-generator/lists"}