https://github.com/jsonkao/computational-journalism

Notes on computational journalism.
https://github.com/jsonkao/computational-journalism

Last synced: about 1 year ago
JSON representation

Notes on computational journalism.

Host: GitHub
URL: https://github.com/jsonkao/computational-journalism
Owner: jsonkao
Created: 2018-12-23T12:34:12.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2019-03-15T20:34:38.000Z (about 7 years ago)
Last Synced: 2025-02-06T10:15:24.841Z (over 1 year ago)
Size: 6.84 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          | Table of Contents |

|-------------|

| [Potential Projects](#potential-projects) |

| [General Resources](#general-resources) |

| [High Dimensional Data](#high-dimensional-data)

## Potential Projects

- Debiasing word embeddings (see [Man is to Computer Programmer as Woman is to Homemaker?](https://arxiv.org/pdf/1607.06520.pdf)]

## General Resources

- [Jonathan Stray's Frontiers of Computational Journalism course](http://www.compjournalism.com/)

- [Jun Yang's publications](https://users.cs.duke.edu/~junyang/)

- [Information Retrieval Book](https://nlp.stanford.edu/IR-book/information-retrieval-book.html)

_Reporting on society, using computation, and reporting on computation in society._

## High Dimensional Data

Vector-izing data, then projecting it into K << R (typically K=2 or K=3)

**Text analysis in Journalism**

- Clustering, classification

- Document Vector Space Model: what is this document about?

  - finding important words, topic analysis, key component for filtering

  - features = words works fine. One dimension = vocabulary of a document, value of a dimension = # of times word appears

    - Each entry becomes term frequency: tf(t,d)

  - distance metric for text (how similar? clustering?)

    - Looking for overlapping terms: Cosine similarity. similary(a,b) = (a \dot b) / (mag(a) \times mag(b)) = cos(theta). Cosine distance is just 1 - similarity(a,b).

  - also ignore stopwords and "de-weight" common words (e.g. "car" in car reviews). Document frequency `df(t,D)` = fraction of docs containing term

    - Inverse document frequency idf(t, D) = log(|D| / |d \in D : t \in d|)

  - [TF-IDF](https://planspace.org/20150524-tfidf_is_about_what_matters/) = tf(t,d) * idf(d, D) = term frequency * 1 / document frequency

## Text Analysis

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jsonkao/computational-journalism

Awesome Lists containing this project

README