https://github.com/gatenlp/tweet-rehydrater
Tool to take standoff annotations against a list of Tweets and merge them with the original text from Twitter
- Host: GitHub
- URL: https://github.com/gatenlp/tweet-rehydrater
- Owner: GateNLP
- License: lgpl-3.0
- Created: 2016-05-17T12:21:30.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-11-16T09:31:46.000Z (about 2 years ago)
- Last Synced: 2024-04-16T07:59:28.235Z (9 months ago)
- Language: Java
- Size: 21.5 KB
- Stars: 2
- Watchers: 18
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# tweet-rehydrater
This is a very simple command line tool to take standoff annotations that relate to tweets from Twitter, fetch the corresponding data including the tweet text from the Twitter API, and merge the two.
## Download
The latest release of the rehydrater tool can be downloaded [from GitHub][1].
## Usage
    java -jar tweet-rehydrater.jar <credentials> <input> <output>

- `credentials` - a Java properties file containing Twitter credentials (see below)
- `input` - file containing "dehydrated" Tweets, i.e. a stream of JSON objects with properties "id" (long integer Tweet ID) and "entities" (standoff annotations). Use "-" (a single hyphen) to read from standard input.
- `output` - file into which rehydrated Tweets should be written. Use "-" (a single hyphen) to write to standard output.

API access uses the ["application only"][2] authentication scheme. You must create a Twitter application, and provide its consumer key and secret in a properties file:

    consumerKey=...
    consumerSecret=...

You can create an application at https://apps.twitter.com
By default this tool fetches "extended" format tweets. To fetch them in "compatibility" mode instead, add

    compatibilityMode=true

to your credentials properties file.
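
As an illustration of the properties format described above, the following sketch reads the three keys with `java.util.Properties`. The class `CredentialsSketch` and its `load` method are hypothetical names invented for this example and are not taken from the tool's source; the sketch also assumes that a missing `compatibilityMode` means "extended" mode, as the text above suggests.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of tweet-rehydrater.
class CredentialsSketch {
    final String consumerKey;
    final String consumerSecret;
    final boolean compatibilityMode;

    CredentialsSketch(String consumerKey, String consumerSecret, boolean compatibilityMode) {
        this.consumerKey = consumerKey;
        this.consumerSecret = consumerSecret;
        this.compatibilityMode = compatibilityMode;
    }

    // Parses a properties stream and checks that both required keys are present.
    static CredentialsSketch load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        String key = props.getProperty("consumerKey");
        String secret = props.getProperty("consumerSecret");
        if (key == null || key.isEmpty() || secret == null || secret.isEmpty()) {
            throw new IllegalArgumentException("consumerKey and consumerSecret are both required");
        }
        // Assumption: an absent compatibilityMode key means the default "extended" mode.
        boolean compat = Boolean.parseBoolean(props.getProperty("compatibilityMode", "false"));
        return new CredentialsSketch(key, secret, compat);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values; real keys come from your Twitter application.
        String sample = "consumerKey=abc\nconsumerSecret=xyz\ncompatibilityMode=true\n";
        CredentialsSketch c = load(new StringReader(sample));
        System.out.println("compatibilityMode=" + c.compatibilityMode);
    }
}
```

To check a real file, pass a reader such as `Files.newBufferedReader(Paths.get(path))` instead of the `StringReader`.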
Note that the API used is rate-limited - do not attempt to rehydrate more than 6000 Tweets in any 15 minute window.
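
One way to stay within that limit for larger inputs is to split the Tweet IDs into batches of at most 6000 and run the tool once per batch, waiting a full 15 minutes after each run. The sketch below only shows the batching arithmetic; `RateLimitSketch` and its methods are hypothetical names, and the tool itself is not claimed to do any pacing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of tweet-rehydrater.
class RateLimitSketch {
    static final int MAX_PER_WINDOW = 6000;               // limit quoted in this README
    static final long WINDOW_MILLIS = 15L * 60L * 1000L;  // 15 minutes

    // Splits the IDs into consecutive batches of at most maxPerWindow elements.
    static <T> List<List<T>> batches(List<T> ids, int maxPerWindow) {
        List<List<T>> out = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += maxPerWindow) {
            int end = Math.min(start + maxPerWindow, ids.size());
            out.add(new ArrayList<>(ids.subList(start, end)));
        }
        return out;
    }

    // Total pause time if a full window is waited between consecutive batches.
    static long minimumWaitMillis(int tweetCount) {
        if (tweetCount <= 0) {
            return 0L;
        }
        int batchCount = (tweetCount + MAX_PER_WINDOW - 1) / MAX_PER_WINDOW;
        return (batchCount - 1) * WINDOW_MILLIS;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 14000; i++) {
            ids.add(i);
        }
        for (List<Long> batch : batches(ids, MAX_PER_WINDOW)) {
            System.out.println("batch of " + batch.size());
        }
        System.out.println("pause in total: " + minimumWaitMillis(ids.size()) / 60000 + " minutes");
    }
}
```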
## Data format
The input file is expected to contain a stream of JSON objects concatenated together, one per tweet, which are essentially a subset of the standard Twitter JSON format that will be merged into the full JSON retrieved from Twitter. Each object must have at least the following property:
- `id` - a long integer giving the ID of the tweet
In addition, the object may have a property `entities` giving the standoff annotations in the top-level `full_text` (or `text` in compatibility mode), represented in the normal Twitter format as used for things like hashtags and URLs in the Twitter APIs. If the tweet in question is a retweet, the input object may have `retweeted_status` which in turn contains an `entities` property, giving annotations in the `full_text` of the original retweeted status, and if the tweet is a quote tweet the input object may have `quoted_status` which in turn contains `entities` in the same way. Each set of entities will be merged into the corresponding set in the JSON retrieved from Twitter.
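
The merge of one `entities` object into another can be pictured with plain Java maps, as in the sketch below. It assumes that annotation types missing from Twitter's JSON are added and that types present on both sides have their arrays concatenated; this README does not spell out the exact behaviour, so treat that rule, the class `EntityMergeSketch` and the helper `annotation` as hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the merge for illustration only; not the tool's code.
class EntityMergeSketch {
    // Returns a new map holding Twitter's entities plus the standoff ones.
    static Map<String, List<Map<String, Object>>> merge(
            Map<String, List<Map<String, Object>>> fromTwitter,
            Map<String, List<Map<String, Object>>> standoff) {
        Map<String, List<Map<String, Object>>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<Map<String, Object>>> e : fromTwitter.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue())); // copy, keep inputs intact
        }
        for (Map.Entry<String, List<Map<String, Object>>> e : standoff.entrySet()) {
            // Assumption: same-named types are concatenated rather than replaced.
            merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
        }
        return merged;
    }

    // Builds one annotation object with "indices" and a single feature.
    static Map<String, Object> annotation(int start, int end, String feature, Object value) {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("indices", Arrays.asList(start, end));
        a.put(feature, value);
        return a;
    }

    public static void main(String[] args) {
        Map<String, List<Map<String, Object>>> twitter = new LinkedHashMap<>();
        twitter.put("hashtags", new ArrayList<>(Arrays.asList(annotation(30, 37, "text", "London"))));

        Map<String, List<Map<String, Object>>> standoff = new LinkedHashMap<>();
        standoff.put("Person", new ArrayList<>(Arrays.asList(annotation(1, 5, "gender", "male"))));

        System.out.println(merge(twitter, standoff).keySet());
    }
}
```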
Each `entities` property is an object where each property name is an annotation type and the corresponding value is an array of objects representing the annotations of that type. Each annotation object has a property `indices` giving the annotation offsets, and other properties are treated as annotation features, for example:

    {
      "id": 12345678,
      "entities": {
        "Person": [
          {
            "indices": [1, 5],
            "gender": "male"
          }
        ],
        "Location": [
          {
            "indices": [17, 23],
            "locType": "city"
          },
          {
            "indices": [34, 49],
            "locType": "country"
          }
        ]
      }
    }

(This example has been pretty-printed for clarity, but this is not required for the rehydrater tool.)
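
Once a Tweet has been rehydrated, an `indices` pair can be used to recover the annotated span from the text. The sketch below assumes the usual Twitter entity convention, in which offsets count Unicode code points and the end offset is exclusive; Java strings are indexed by UTF-16 code units, so the offsets are converted first. `IndicesSketch` and `covered` are hypothetical names, and the sample text is made up.

```java
// Hypothetical helper for illustration only; not part of tweet-rehydrater.
class IndicesSketch {
    // Returns the text covered by [start, end), both counted in code points.
    static String covered(String text, int start, int end) {
        int from = text.offsetByCodePoints(0, start);
        int to = text.offsetByCodePoints(from, end - start);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        // The leading emoji is one code point but two UTF-16 code units.
        String text = "\uD83D\uDE00 John lives in London";
        System.out.println(covered(text, 2, 6));   // prints "John"
        System.out.println(covered(text, 16, 22)); // prints "London"
    }
}
```

Using `substring(start, end)` directly would give the wrong span for this text, because the emoji shifts every later UTF-16 offset by one.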
## Deleted Tweets
It is possible for a user to delete any of their tweets at any time after posting them, and deleted tweets will not be returned by the Twitter APIs. Therefore it is possible that there may be annotations in the input file for which the original tweet is no longer available, and such tweets will be omitted from the output.
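
To find out which Tweets were dropped, the IDs in the input can be compared with the IDs in the output. The sketch below shows only that comparison and assumes the IDs have already been extracted from both files by some other means; `DeletedTweetsSketch` and `missingIds` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of tweet-rehydrater.
class DeletedTweetsSketch {
    // Returns the input IDs that have no counterpart in the output, in input order.
    static Set<Long> missingIds(Collection<Long> inputIds, Collection<Long> outputIds) {
        Set<Long> missing = new LinkedHashSet<>(inputIds);
        missing.removeAll(new HashSet<>(outputIds));
        return missing;
    }

    public static void main(String[] args) {
        Set<Long> missing = missingIds(
                Arrays.asList(101L, 102L, 103L),
                Arrays.asList(101L, 103L));
        System.out.println(missing); // prints [102]
    }
}
```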
[1]: https://github.com/GateNLP/tweet-rehydrater/releases/latest
[2]: https://dev.twitter.com/oauth/application-only