{"id":27334894,"url":"https://github.com/dmuth/social-media-article-post-language-analytics","last_synced_at":"2025-04-12T14:46:35.446Z","repository":{"id":37071334,"uuid":"93344141","full_name":"dmuth/social-media-article-post-language-analytics","owner":"dmuth","description":"Some Analysis of articles I've posted to social media over the years","archived":false,"fork":false,"pushed_at":"2022-06-21T21:11:23.000Z","size":46,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-05-02T06:07:39.280Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmuth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-04T22:03:47.000Z","updated_at":"2017-06-04T22:04:10.000Z","dependencies_parsed_at":"2022-06-24T20:35:04.502Z","dependency_job_id":null,"html_url":"https://github.com/dmuth/social-media-article-post-language-analytics","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fsocial-media-article-post-language-analytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fsocial-media-article-post-language-analytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fsocial-media-article-post-language-analytics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Fsocial-media-article-post-language-analytics/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/dmuth","download_url":"https://codeload.github.com/dmuth/social-media-article-post-language-analytics/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248585249,"owners_count":21128974,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-12T14:46:34.705Z","updated_at":"2025-04-12T14:46:35.439Z","avatar_url":"https://github.com/dmuth.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Social Media Article Post Language Analytics\n\nYeah, that's the best title I could come up with to describe this.  I am so sorry.\n\nAlso, let's be clear: this isn't production code.  It's basically a side project that I had been working on\nfor a while, got the code as far as I could take it (for now), and decided to publish it so I could move on to\nother things.\n\nSo a while back, someone suggested to me that I consider taking all of the links I ever posted to Twitter\nand Facebook, downloading the text, and doing some sort of curation or language analytics on them.  
That led\ndown a rabbit hole wherein I got back up to speed on Python, taught myself a few modules such as \n\u003ca href=\"https://twython.readthedocs.io/en/latest/\"\u003eTwython\u003c/a\u003e (Twitter API integration), \n\u003ca href=\"http://docs.python-requests.org/en/master/\"\u003eRequests\u003c/a\u003e (for talking to Facebook),\n\u003ca href=\"https://docs.python.org/3/library/argparse.html\"\u003eArgparse\u003c/a\u003e (a fantastic command-line argument parser),\n\u003ca href=\"https://www.crummy.com/software/BeautifulSoup/\"\u003eBeautiful Soup\u003c/a\u003e (HTML parsing),\nand \u003ca href=\"https://www.sqlite.org/\"\u003eSQLite\u003c/a\u003e (for data storage).\n\n\n## Installation\n\n- `git clone`\n- `virtualenv virtualenv`\n- `source ./virtualenv/bin/activate`\n- `pip install -r ./requirements.txt`\n\n\n## Configuration\n\nYou will need to copy `config.ini.example` to `config.ini` and then obtain an Access Token from\nFacebook, and an API key and secret from Twitter.\n\n\n## Usage\n\nThis app is broken down into multiple Python scripts, and each script starts with a number, which helps \nmake clear which order they should be run in.\n\nAll scripts will create their database table in SQLite on an as-needed basis.\n\nAll scripts are also written to make use of \"INSERT OR REPLACE INTO\" syntax so that if the same\nscript is run multiple times, you will not wind up with duplicate copies of the data.  
For example,\nif the *1-download-facebook.py* script is run multiple times, existing posts will not get duplicated,\nbut rather only new posts will be downloaded.\n\n\n### 1-download-facebook.py\n\nThis script will download as many of your Facebook posts as it can, and write them\nto the `facebook_posts` table.\n\nIf you do not have a Facebook Access Token, you'll need to retrieve one from \n\u003ca href=\"https://developers.facebook.com/tools/explorer\"\u003ehttps://developers.facebook.com/tools/explorer\u003c/a\u003e\n\nThis script employs some basic sanity checking--posts that don't have links, have\nlinks to Twitter, or are photos will be skipped.\n\nA successful run will generate lines like these:\n\n```\n2017-06-03 16:44:07,301 INFO: Querying Facebook Graph for 200 posts...\n2017-06-03 16:44:08,192 INFO: Status Code from Facebook: 200\n2017-06-03 16:44:08,269 INFO: posts=85, skipped_no_message=3, skipped_no_link=112\n2017-06-03 16:44:08,269 INFO: posts_written: 1036, skipped_no_status_type=2, skipped_status_type_photos=1649, skipped_link_twitter=12, skipped_unknown=226, skipped_no_message=284, skipped_application_twitter=140, skipped_no_link=1446\n```\n\nNote that even though we asked Facebook for 200 posts, we don't always get 200 posts.  \nAs far as I can tell, that's normal behavior for their API.\n\n\n### 1-download-twitter.py\n\nThis script will download your Tweets from Twitter.  For reasons only Twitter knows, just the\nlast 3200 tweets are available.\n\nPart of the process of authentication to Twitter involves opening up a web browser to retrieve\na code from Twitter and then pasting it into a prompt the script generates.  
This code is then\nstored in the table `data`.\n\nThis script employs sanity checking to skip Tweets that don't have links or are RTs.\n\nA successful run produces results like these:\n```\n2017-06-03 17:37:49,027 INFO: getTweets(): count=200, last_id=861355973117714432\n2017-06-03 17:37:49,336 INFO: Tweets fetched=65, skipped=79, last_id=852296277933076484\n2017-06-03 17:37:49,336 INFO: Tweets left to fetch: 4865\n2017-06-03 17:37:49,337 INFO: Rate limit left: 898\n```\n\n\n### 2-extract-urls.py\n\nThis script goes through all of the saved Tweets and Facebook posts, extracts the URL(s)\npresent in each post, and writes them out to the `urls` table.  Since we are making \nuse of `INSERT OR REPLACE INTO`, if the same URL is posted to both Twitter and Facebook\n(URL shorteners notwithstanding), only one row will wind up in the table so as to\navoid duplicates.\n\n\n### 3-get-core-urls.py\n\nThis script goes through the process of downloading the contents of each URL \n(if it hasn't already been downloaded) and storing the results in the `urls_data` table.\nOnce the URL is downloaded, the \"final\" de-shortened URL is noted as well, and the\noriginal URL, the final URL, and the contents are written to `urls_data`.\n\nThis script is by **far** the most network-intensive part of this project.\n\nBy default, 100 URLs are downloaded with a 10 second timeout.  Then there is sanity checking\nwhich catches things like Twitter photos (which can't be caught by the Content-Type header)\nand non-2XX responses.  
All results (including non-2XX) are written to the table, to ensure\nthat we don't repeatedly try to call 404 pages, images, etc.\n\nSanity checking will also be applied to filter out Twitter images, links to other Twitter posts, \nand links to Facebook posts.\n\nA successful run will print results at the end in a table similar to this:\n\n```\n                            Content-Type\tCode\tCount\n                            ============\t====\t=====\n                                 (blank)\t 200\t    2\n                         application/pdf\t 200\t    1\n                               image/gif\t 200\t    8\n                              image/jpeg\t 200\t   52\n                          local/facebook\t 200\t   13\n                           local/twitter\t 200\t  195\n                     local/twitter-image\t 200\t  574\n                               text/html\t 200\t  227\n                text/html; charset=utf-8\t 200\t  936\n                text/html;;charset=UTF-8\t 200\t   14\n                 text/html;charset=UTF-8\t 200\t   87\n                              text/plain\t 200\t    2\n                   timed out? not found?\t    \t  129\n                              video/webm\t 200\t    1\n```\n\nAnything that starts with `local/` isn't an actual Content-Type, just a way for my script\nto note that we did not crawl that URL for some reason.\n\n\n### 4-extract-text.py\n\nThis script goes through `urls_data`, pulls the text of every URL that was crawled,\nparses the HTML, and then writes it to the `urls_text` table.\n\nThere are a few key tags that we pay attention to, namely the title, and any h1, h2, and h3 tags.\nThe body is also grabbed, but only the first 10K (so as to keep things to a reasonable size).\n\n\n### 5-analyze-text.py\n\n\nFinally, the text analysis part!  
This is the whole reason why I wrote this project, and why\nI wanted to play around with the \u003ca href=\"http://www.nltk.org/\"\u003eNatural Language ToolKit\u003c/a\u003e.\n\nRunning this script with `-h` will give you a list of options, but in summary, the following \noperations can be performed against all content stored in the `urls_text` table:\n\n- Get a list of unusual words in the titles\n- Get a list of unusual words in the post bodies\n- Display words occurring in post bodies more than a certain number of times\n- Perform stemming on all words before any of the above operations\n\nWhen the script is complete, totals will be printed for unusual words or frequent words \n(if either/both were searched for).\n\nA successful run will display output similar to this:\n```\n\nNumber of posts processed: 1775\n\nTop unusual words that were found in post bodies:\n\nUnusual words that showed up 583 times: years\nUnusual words that showed up 566 times: terms\nUnusual words that showed up 561 times: facebook\nUnusual words that showed up 530 times: things\nUnusual words that showed up 483 times: using\nUnusual words that showed up 426 times: email\nUnusual words that showed up 365 times: features\nUnusual words that showed up 360 times: called\nUnusual words that showed up 350 times: states\nUnusual words that showed up 332 times: makes\nUnusual words that showed up 330 times: developers, comments\n\nTop unusual words that were found in post titles:\n\nUnusual words that showed up 176 times: youtube\nUnusual words that showed up 160 times: comments\nUnusual words that showed up 123 times: posts\nUnusual words that showed up 98 times: stories\nUnusual words that showed up 97 times: facebook\nUnusual words that showed up 85 times: online\nUnusual words that showed up 81 times: categories\nUnusual words that showed up 77 times: things\nUnusual words that showed up 69 times: viewing\nUnusual words that showed up 67 times: email\n\nTop frequent words that were found in 
post bodies:\n\nUnusual words that showed up 259 times: featured\nUnusual words that showed up 145 times: password\nUnusual words that showed up 110 times: facebook\nUnusual words that showed up 98 times: mcdonald’s\nUnusual words that showed up 89 times: bloomberg\nUnusual words that showed up 86 times: hydraulic\nUnusual words that showed up 78 times: pinterest\nUnusual words that showed up 76 times: toothpaste, comments, dynamite\nUnusual words that showed up 73 times: undertale, javascript\n```\n\nLooking at the unusual words above, it's apparent that I like to post links to\narticles which mention Facebook, along with articles about developers, and\nYouTube videos.  Looking at the frequent words, I apparently link to Bloomberg a lot\nand like to talk about \u003ca href=\"http://undertale.com/\"\u003eUndertale\u003c/a\u003e and JavaScript.\n\n\n## A bit about database design\n\nSince I'm dealing with different parts of HTML documents (title, body, h1, h2 tags, etc.),\nI need a way to store all of those in a database, without having to constantly adjust the\nschema.  I wound up getting an idea \n\u003ca href=\"http://blog.wix.engineering/2015/12/10/scaling-to-100m-mysql-is-a-better-nosql/\"\n\t\u003efrom this post in the Wix Engineering Blog\u003c/a\u003e which stated: *\"Fields only exist to \nbe indexed. If a field is not needed for an index, store it in one blob/text field \n(such as JSON or XML)\"* and ran with it.  So all tables in this project have a field called \n`value`, which holds JSONified data which contains whatever data I need.  This approach\nturned out to be quite useful, because as my needs changed and I decided to add more data,\nno schema changes were made.\n\nI'm not sure I would totally advocate this approach for an actual production system, at\nleast not without extensive testing (which is what Wix appears to have done).\n\n\n## TODO / Room For Improvement\n\nI only got up to speed on the argparse module towards the end of this project.  
I should\nreally add support for argument parsing into the scripts that download social media posts\nand the contents of the URLs in them.  That would make future development a little easier\nin that I could limit a run to just 5 posts, rather than having to tweak code to do that.\n\nThe other thing that sticks in the back of my head is that for some sites, such as the\nmore mainstream news sites, downloading the contents of the page doesn't just give me the\narticle content, but all of the stuff that's in the sidebar, related articles, etc.  I suspect\nthat is throwing off my analysis, and likely explains all of the times that the word \"facebook\"\npops up.  I'm not sure what the \"fix\" is other than to write a parser that is unique to each\nof those sites.  And I honestly wouldn't be surprised if Google has done something like this\nin order to improve its search results.\n\n\n## Troubleshooting\n\nDealing with this error:\n\n`[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:645)`\n\nYou're probably using an older version of OpenSSL.  Check your OpenSSL version with:\n\n`python -c 'import ssl; print (ssl.OPENSSL_VERSION)'`\n\nIf it's under 1.0, that's cause for concern.  This is a real problem on MacOS/X.\nThat said, if you're on MacOS/X, \u003ca href=\"https://brew.sh/\"\u003einstall HomeBrew\u003c/a\u003e\nand then install Python 3 and OpenSSL, e.g.: `brew install python3 openssl`.\n\nNow create a new virtualenv folder with the specific path for Python 3:\n\n`virtualenv -p /path/to/homebrew/path/to/python3 virtualenv3`\n\nActivate that virtualenv, and that should use the copy of Python in Homebrew\nalong with its copy of OpenSSL.  
Check again, and you'll be at least version 1.0.2\nof OpenSSL as of this writing.\n\n\n## Final Thoughts\n\nEven though the end product didn't turn out as awesome as I hoped it would, I still learned\na lot about Natural Language Processing and the substantial amount of work that goes into it.\nIt gives me a greater appreciation for what happens \"behind the scenes\" with Google queries\nas well as Siri.\n\nI don't see myself doing much more work on this particular project, as NLP is definitely \nnot my particular brand of vodka, but I figured I'd share the code just so others can take\na look at it and maybe find it useful.\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmuth%2Fsocial-media-article-post-language-analytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmuth%2Fsocial-media-article-post-language-analytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmuth%2Fsocial-media-article-post-language-analytics/lists"}