https://github.com/dmpe/sawi-hiwi-2017
FAU.de course S(A)WI - My files for creating a twitter dataset for students to work with
- Host: GitHub
- URL: https://github.com/dmpe/sawi-hiwi-2017
- Owner: dmpe
- Created: 2017-09-21T00:52:49.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-09-25T20:51:18.000Z (over 6 years ago)
- Last Synced: 2025-01-05T23:11:50.330Z (4 months ago)
- Topics: debates, elections, fau, presidential-election, python, r, ressources, teaching, tweets, twitter, university
- Language: HTML
- Homepage: http://www.wi2.fau.de/teaching/master/master-courses/sawi/
- Size: 527 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# 2016 USA Presidential debate - Preparation of a sample of tweets
## Goal:
As a student research assistant for the [SAWI](http://www.wi2.fau.de/teaching/master/master-courses/sawi/) course at FAU during autumn 2017/2018, my task was to prepare a `2016 USA Presidential debate` dataset that could be handed to the students for their analysis.
**Data source in question:** specifically, the first presidential debate in the USA, held in 2016. See its `Readme` file as well.

### Note for myself
See that [MEGA folder](https://mega.nz/#F!qxNThASR) for the whole Twitter dataset (~15 GB; log file ~200 MB). You have to ask me for the key to see the files!

## Tasks:
The final idea was to have 5 groups and provide each of them with a slightly different dataset according to when people tweeted about the debate, i.e. whether it was before, during or after it.
In summary, the objective was to gather 5 samples of 5,000 tweets each:
- 1 sample from **before** the debate began
- 3 samples from **during** the debate
- 1 sample from **after** the debate
The `RPubs` folder shows some statistics for the final CSV samples; see it for more.
### 1. Step - Get data
Download (hydrate) the tweets, because the `.txt` file contains only the Tweet IDs, not the content of the tweets themselves.
We used [twarc](https://github.com/DocNow/twarc).
#### 1.1 Configure
First, get Twitter DEV API keys (from Twitter's developer portal).
Then, place them into the twarc configuration:
```
twarc configure
```

After that, run:
```
twarc hydrate first-debate.txt > all_first_tweets.jsonl
```

It takes hours...
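Hydration silently skips tweets that have since been deleted or made private, so it can be worth checking how many of the requested IDs actually came back. A minimal Python sketch (file names as in the commands above):
```
# Rough sanity check: compare the requested IDs with the hydrated tweets.
# Deleted/protected tweets are simply missing from the JSONL output.
with open("first-debate.txt") as f:
    n_ids = sum(1 for line in f if line.strip())

with open("all_first_tweets.jsonl") as f:
    n_tweets = sum(1 for line in f if line.strip())

print(f"hydrated {n_tweets} of {n_ids} tweet IDs")
```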
#### 1.5 Step - Split data into manageable chunks
Why? Well, because the original 13.5 GB `jsonl` file will be hard to read in any program, so don't try R with limited RAM!
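If you do want to look at the full file without splitting it first, streaming it line by line keeps memory usage flat. A minimal Python sketch (assuming the `all_first_tweets.jsonl` produced above):
```
import json

def iter_tweets(path):
    """Yield one parsed tweet at a time instead of loading 13.5 GB into RAM."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: count the tweets without holding them all in memory.
print(sum(1 for _ in iter_tweets("all_first_tweets.jsonl")))
```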
For the splitting itself you could use any file-splitting tool. **Alternatively**, you could split the `.txt` ID file into multiple files and **only** then apply the previous `twarc` command to each of them.
Something like:
```
# -C (rather than -b) splits on line boundaries, so no tweet ID gets cut in half;
# with no prefix given, split names the chunks xaa, xab, ... as used below
split -C 1M first-debate.txt
```

### 2. Step - Convert to CSV
Why? Because jsonl files are hard to work with.
Execute this for each file, with `python` or `python3` depending on your setup; the scripts are based on twarc's [json2csv.py](https://github.com/DocNow/twarc/blob/master/utils/json2csv.py):
```
python 2jsonl.py xaa.jsonl -o xaa.csv
# OR
python3 2jsonl.py xaa.jsonl -o xaa.csv
```

You can also try:
```
python 2csv_original.py xae -o xae.csv
```

The CSV delimiter will be ";".
Overall, this will create very large CSV files of around 250 MB each, and we **still need** samples of those.
Again, **alternatively**, you can take one large JSONL file and convert it to one large CSV file which you can later split as well.
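For orientation, the sketch below shows roughly what such a conversion does; it is deliberately simplified (twarc's real `json2csv.py` writes many more columns), and the field names are standard Twitter API v1.1 attributes:
```
import csv
import json

def jsonl_to_csv(in_path, out_path):
    """Convert one hydrated JSONL chunk into a ';'-delimited CSV with a few columns."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=";")
        writer.writerow(["id", "created_at", "screen_name", "text"])
        for line in src:
            t = json.loads(line)
            writer.writerow([
                t.get("id_str", ""),
                t.get("created_at", ""),
                t.get("user", {}).get("screen_name", ""),
                t.get("full_text") or t.get("text", ""),
            ])

jsonl_to_csv("xaa.jsonl", "xaa.csv")
```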
### 3. Step - Analyse the data
Analyse the data in order to understand when the tweets were published ;)
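A quick way to see which time window a chunk covers is to parse each tweet's `created_at` field (the standard Twitter API v1.1 timestamp, in UTC, while the listing below uses EST). A minimal sketch over one hydrated JSONL chunk:
```
import json
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %z %Y"  # e.g. "Mon Sep 26 21:00:01 +0000 2016"

def time_range(path):
    """Return the earliest and latest tweet timestamps (UTC) in one JSONL chunk."""
    times = [
        datetime.strptime(json.loads(line)["created_at"], FMT)
        for line in open(path, encoding="utf-8")
        if line.strip()
    ]
    return min(times), max(times)

print(time_range("xaa.jsonl"))
```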
E.g. for a quick peek at one of the CSV files:
```
head -n 5 xae.csv
```

The outcome:
```
xaa -> before debate: from 12 PM EST till 18:30 EST
xab -> before debate: from 18:30 EST till 21:00 EST
(first debate was from 21:00 till 22:35)
xac -> during debate: from 20:47 EST till 22:40 EST
xad -> after debate: from 22:40 EST till 01:20 AM EST
xae -> after debate: from 01:20 AM EST till 06:20 AM EST
xaf -> after debate: from 06:20 AM EST till 09:40 AM EST
xag -> .... (rest)
```

### 4. Step - Split large CSVs into smaller samples
Use the R script `process_data.R` to apply proper formatting.
You can also go faster (but less reliably): in the case of the 250 MB `xa{a,b}.csv` files, you could execute via bash:
```
shuf -n 2500 xaa.csv > xaa_sample_2500.csv
shuf -n 2500 xab.csv > xab_sample_2500.csv
```

Having those, only then would you use `process_data.R`, which contains something along these lines:
```
library(data.table)

mt <- fread("xaa.csv") # or xaa_sample_2500.csv directly - depending on previous steps
mt <- mt[sample(.N, 2500)]
fwrite(mt, "xaa_sample_2500.csv", sep = ";")
```
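One caveat with the `shuf` shortcut above: the CSV header line gets shuffled like any other row. A small header-preserving alternative (just a sketch; like `shuf`, it treats each physical line as one row):
```
import random

def sample_csv(in_path, out_path, n):
    """Write the header row plus a random sample of n data rows."""
    with open(in_path, encoding="utf-8") as f:
        header, *rows = f.readlines()
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(header)
        out.writelines(random.sample(rows, n))

sample_csv("xaa.csv", "xaa_sample_2500.csv", 2500)
```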
### 5. Step - In any case, you need to combine the files from the previous R script

On **Windows**, use `type` commands like:
```
type xad_sample_1500.csv xae_sample_2500.csv xag_sample_1000.csv > after_sample_5000.csv
type xaa_sample_2500.csv xab_sample_2500.csv > before_random_sample_5000.csv
type after_sample_1000.csv xae_sample_1000.csv xaf_sample_2500.csv xag_sample_2500.csv xaf_sample_1500.csv > after_random_sample_5000.csv

# combine 2500*2 tweets from the time before the debate into 5000 pieces
type before_sample_2500_a.csv before_sample_2500_b.csv > before_sample_5000.csv
```
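Note that `type` (like `cat`) concatenates the files verbatim, so any header rows of the later files end up in the middle of the combined CSV. A small sketch that merges the samples while keeping only the first header (file names as above):
```
def combine_csvs(inputs, out_path):
    """Concatenate CSV files, writing the header row only once."""
    with open(out_path, "w", encoding="utf-8") as out:
        for i, path in enumerate(inputs):
            with open(path, encoding="utf-8") as f:
                header = f.readline()
                if i == 0:
                    out.write(header)
                for line in f:
                    out.write(line)

combine_csvs(["before_sample_2500_a.csv", "before_sample_2500_b.csv"],
             "before_sample_5000.csv")
```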
## Some research before

Download tweets manually, looking for tweets that include specific `#hashtags`. Then store them in a database from which you can query them. See that folder and the Python notebooks.
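The notebooks in that folder are the reference; purely as an illustration of the store-then-query idea, here is a tiny SQLite sketch (the schema, the input file name and the example hashtag are placeholders of mine, not the notebooks' actual code):
```
import json
import sqlite3

conn = sqlite3.connect("tweets.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, created_at TEXT, text TEXT)"
)

# Assume the tweets were already fetched and saved as JSONL (placeholder file name).
with open("hashtag_tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        t = json.loads(line)
        conn.execute(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
            (t["id_str"], t["created_at"], t.get("full_text") or t.get("text", "")),
        )
conn.commit()

# Query the stored tweets for a specific hashtag (placeholder hashtag).
rows = conn.execute(
    "SELECT created_at, text FROM tweets WHERE text LIKE ?", ("%#debatenight%",)
).fetchall()
print(len(rows))
```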