NOTE: If you like this, you'll also like my next project, which performs sentiment analysis on Tweets by keyword!

# Twitter AWS Comprehend

I recently learned of Amazon Comprehend and wanted
to play around with its sentiment analysis.

So I built this app to download user timelines from Twitter, send them to AWS for analysis, and visualize them in Splunk. The following metrics are reported:

- Start and end dates for tweets
- Number of tweets
- A graph of "Sentiment Over Time"
- Number of F-bombs used
- Net Happiness Index (percent of happy tweets minus percent of unhappy tweets; see the sketch after this list)
- Top Positive and Negative tweets
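
Here's a minimal sketch (in Python) of how the Net Happiness Index is computed; the tweet list and the `sentiment` field are stand-ins for illustration, with labels matching what Comprehend returns:

```python
# Net Happiness Index: percent of POSITIVE tweets minus percent of
# NEGATIVE tweets. Sentiment labels match Comprehend's output.
def net_happiness_index(tweets):
    if not tweets:
        return 0.0
    positive = sum(1 for t in tweets if t["sentiment"] == "POSITIVE")
    negative = sum(1 for t in tweets if t["sentiment"] == "NEGATIVE")
    return 100.0 * (positive - negative) / len(tweets)

tweets = [{"sentiment": "POSITIVE"}, {"sentiment": "NEGATIVE"},
          {"sentiment": "NEUTRAL"}, {"sentiment": "POSITIVE"}]
print(net_happiness_index(tweets))  # 25.0
```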

## Screenshots

Screenshots are available in the img/ directory.

## Requirements

- An AWS Account
- The AWS Command Line Interface installed and configured
- Python 3
- Run the command `pip install -r requirements.txt` to download all required packages
- A Twitter app created at https://apps.twitter.com/. Read-only access is fine.
- A running Splunk instance. A free copy of Splunk can be downloaded from Splunk.com.

## Getting started

### Downloading Tweets

Start off by running **./0-fetch-tweets -u username -n num_tweets_to_download** to download Tweets via Twitter's API.
The first time you run the script, it will notice the lack of credentials and send you over to Twitter's App page,
where you'll need to create an app. Grab the App Key and App Secret and enter them when the script prompts you.
Next, you'll be sent over to Twitter one more time and will receive a PIN to enter into the script. Do so,
and you'll be authenticated to Twitter. **This is a one-time process**; once you've done it, you should not need
to do it again.
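
The script handles authentication for you, but for the curious, the one-time PIN-based OAuth dance looks roughly like this. This is a sketch only: it assumes the tweepy library (which may not be what the script actually uses), and the key/secret values are placeholders.

```python
import tweepy

consumer_key = "YOUR_APP_KEY"        # from apps.twitter.com
consumer_secret = "YOUR_APP_SECRET"

# Send the user to Twitter, which displays a PIN to paste back in.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
print("Visit this URL and grab the PIN:", auth.get_authorization_url())
pin = input("PIN: ")

# Exchange the PIN for long-lived tokens; persist these so the
# OAuth dance never has to be repeated.
access_token, access_token_secret = auth.get_access_token(pin)
api = tweepy.API(auth)
```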

The maximum number of tweets you can download from Twitter's API is **3,200**, but the actual number you get back will
be much lower: RTs are skipped, yet Twitter still counts them against the number of tweets you ask for, so you'll rarely receive the full amount.
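
Fetching the timeline looks roughly like this (again assuming tweepy, and the `api` object from the sketch above; the 200-per-page size and the 3,200 cap are Twitter API limits):

```python
import tweepy

# Page through a user's timeline, skipping retweets. Twitter counts
# retweets against each page's `count` even when include_rts=False,
# which is why fewer tweets come back than requested.
def fetch_tweets(api, username, num_tweets):
    return list(tweepy.Cursor(
        api.user_timeline,
        screen_name=username,
        count=200,              # max page size for user_timeline
        include_rts=False,      # skip retweets
        tweet_mode="extended",  # full tweet text
    ).items(num_tweets))
```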

### Analyzing Tweets

WARNING: **This costs money!** Based on AWS's pricing structure, a tweet is treated as 3 "units", which costs $0.0003 (3 hundredths of a cent) to analyze. So 100 tweets will cost 3 cents, while 1,000 tweets will cost 30 cents.

The syntax for the script to analyze sentiment is **1-analyze-sentiment -u username -n num_tweets [ --fake ]**

I strongly encourage you to run the script with **--fake** on the first few tries so that you can fake calls to AWS and get comfortable running the script.
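
Under the hood, the analysis boils down to one Comprehend API call per tweet. A minimal sketch using boto3 (an assumption on my part; the script wraps this, and **--fake** skips the call entirely):

```python
import boto3

# Requires the AWS CLI to be configured with credentials and a region.
comprehend = boto3.client("comprehend")

def analyze(text):
    resp = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    # resp["Sentiment"] is POSITIVE, NEGATIVE, NEUTRAL, or MIXED;
    # resp["SentimentScore"] holds a confidence value for each label.
    return resp["Sentiment"], resp["SentimentScore"]
```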

### Feeding the analyzed tweets to Splunk

The syntax for the script to feed the tweets into Splunk is **2-ingest-into-splunk -u username [ --splunk-port port ] [ --splunk-host hostname ]**. Defaults are 9997 and localhost, respectively.

The data is sent to Splunk over a raw TCP connection, so you'll want to configure a TCP data input in Splunk accordingly (see the screenshots in the img/ directory for help with that).

You'll want to have this source saving to the **main** Index.
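
There's no special protocol here: events go over a plain TCP socket, one per line. A minimal sketch (the JSON-per-line serialization is an assumption about how the script formats events):

```python
import json
import socket

# Send events to a Splunk raw TCP input, one JSON object per line.
# Defaults match the script's --splunk-host/--splunk-port defaults.
def send_to_splunk(events, host="localhost", port=9997):
    with socket.create_connection((host, port)) as sock:
        for event in events:
            sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
```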

## Visualization

This is the most interesting part. So far, we are making the following assumptions about Splunk:
- Use of the **main** Index
- Use of the Sourcetype **twitter**
- Use of the Splunk app **Search**

If those assumptions hold, you're good to go! Just copy the file **splunk/twitter_activity_sentiment.xml** into **$SPLUNK_HOME/etc/apps/search/local/data/ui/views**, restart Splunk, and you should be all set!

Alternatively, a less convoluted way (which does not require restarting Splunk) would be to create a new dashboard, click **Edit**, click **Source**, and paste in the contents of **twitter_activity_sentiment.xml**.

## A Word on Idempotency

I am a HUGE fan of idempotency, especially because
AWS Comprehend costs money! Once I analyze a tweet, I never want to analyze it again, so I made a conscious
choice to build my code that way. For example, if a tweet has been analyzed and the script **0-fetch-tweets** is
run later, that code will not overwrite the sentiment fields. And once a tweet is analyzed by **1-analyze-sentiment**, it will never be analyzed again!
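
A sketch of what that check could look like; the SQLite storage and a `tweets` table with a `sentiment` column are assumptions for illustration, not necessarily the repo's actual schema:

```python
import sqlite3

# Only hand back tweets that have never been analyzed, so re-running
# the analysis script never re-bills already-analyzed tweets.
def tweets_needing_analysis(db_path):
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT id, text FROM tweets WHERE sentiment IS NULL"
        ).fetchall()
    finally:
        conn.close()
```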

One place where this breaks down is with Splunk: since the data is fed in over raw TCP and Splunk does not seem to give any acknowledgement (I don't know why...), running that script twice will result in duplicate events. The way around that is to run a Splunk query like **index=main sourcetype=twitter username=dmuth | delete** before re-ingesting any data. I'm not thrilled with this particular workflow and am looking at some alternatives.

## Future TODO Items

- ~~Make tweet ingestion idempotent~~
- ~~See about using Twitter's search API to get older tweets~~ Seriously, Twitter. Let us get more than 3,200 Tweets through your API!
- Come up with a metric to measure profanity on an account, not just f-bombs
- Add "username" field to the database schema so we can analyze multiple users at once
- Dockerize this so that it downloads a user's tweets, analyzes them, exports them, and then loads up a Splunk instance to ingest them

## Contact

I had fun writing this, and I hope you enjoy using it. If there are any issues, feel
free to file an issue against this project, hit me up on Twitter
or Facebook, or drop me a line: **dmuth AT dmuth DOT org**.