https://github.com/rosette-api-community/visualize-embeddings
A simple Python script for transforming a corpus of documents into text vectors suitable for visualization
- Host: GitHub
- URL: https://github.com/rosette-api-community/visualize-embeddings
- Owner: rosette-api-community
- Created: 2017-03-16T19:58:20.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-03-16T22:42:25.000Z (almost 9 years ago)
- Last Synced: 2025-01-11T13:27:52.829Z (about 1 year ago)
- Topics: machine-learning, natural-language-processing, nlp, python, text-embedding, text-vectorization, tsv, visualization
- Language: Python
- Homepage:
- Size: 6.84 KB
- Stars: 1
- Watchers: 6
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Rosette API Text Embeddings Visualization Sample Code
A simple Python script for transforming a corpus of documents into text vectors suitable for visualization in .tsv format. It uses the [Rosette API](https://developer.rosette.com/)'s `/text-embedding` endpoint and the [BBC News Corpus](http://mlg.ucd.ie/datasets/bbc.html). Note that the corpus is free for research purposes only.
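For reference, here is a minimal sketch of what a single call to that endpoint looks like with the `rosette_api` Python binding; the `embed` helper and the sample text are illustrative, and the `"embedding"` response field is an assumption about the endpoint's output shape rather than something taken from this repository's code.

```python
# Minimal sketch (not the repository's actual code): request a document-level
# text embedding from the Rosette API via the rosette_api Python binding
# (pip install rosette_api).
from rosette.api import API, DocumentParameters


def embed(text, key):
    """Return the text-embedding vector for a single string."""
    api = API(user_key=key)                 # authenticate with your Rosette API key
    params = DocumentParameters()
    params["content"] = text                # the document text to embed
    response = api.text_embedding(params)   # calls the /text-embedding endpoint
    return response["embedding"]            # assumed response field: a list of floats


if __name__ == "__main__":
    vector = embed("Cats sleep for most of the day.", "ROSAPI_KEY")
    print(len(vector), vector[:5])
```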
## Getting started
1. Clone the repo and open the files in your favorite text editor or Python IDE.
2. Download the [raw text files zip](http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip), `bbc-fulltext.zip`, from http://mlg.ucd.ie/datasets/bbc.html and extract it into the project root folder. You should get a folder called `bbc`.
3. Run `visualize-embeddings.py` via your Python IDE or command line (replace `ROSAPI_KEY` with your [Rosette API key](https://developer.rosette.com/admin/applications)):
$ python visualize-embeddings.py --key ROSAPI_KEY
You'll see that the script parses the raw text files of the corpus into a list of documents. Each document consists of three fields (a rough parsing sketch follows the list):
* category
* headline
* content
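As a rough illustration of that parsing step (not the repository's actual code), the sketch below assumes the extracted corpus layout `bbc/<category>/<nnn>.txt`, with the headline on the first line of each file and the article body on the remaining lines.

```python
# Sketch of parsing the extracted BBC corpus into documents with
# category, headline, and content fields. Assumes bbc/<category>/<nnn>.txt
# with the headline on the first line; adjust if your layout differs.
import os


def parse_corpus(root="bbc"):
    documents = []
    for category in sorted(os.listdir(root)):
        category_dir = os.path.join(root, category)
        if not os.path.isdir(category_dir):
            continue
        for filename in sorted(os.listdir(category_dir)):
            path = os.path.join(category_dir, filename)
            with open(path, encoding="utf-8", errors="replace") as f:
                headline = f.readline().strip()  # first line: headline
                content = f.read().strip()       # rest of the file: article body
            documents.append({"category": category,
                              "headline": headline,
                              "content": content})
    return documents
```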
The script then creates two files (a simplified sketch of this step follows the list):
* `embeddings.tsv`: a TSV file where each line contains the text vector for a document's content field.
* `metadata.tsv`: a TSV file where each line contains a document's metadata (i.e., category and headline).
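The following sketch (again, not the repository's actual code) shows one way to write those two files with Python's `csv` module, pairing each document with its embedding vector; the header row in `metadata.tsv` is what the Embedding Projector expects when the metadata has more than one column.

```python
# Sketch: write one embedding vector per line to embeddings.tsv and the
# matching (category, headline) metadata to metadata.tsv.
import csv


def write_tsvs(documents, vectors,
               embeddings_path="embeddings.tsv", metadata_path="metadata.tsv"):
    with open(embeddings_path, "w", newline="", encoding="utf-8") as emb_file, \
         open(metadata_path, "w", newline="", encoding="utf-8") as meta_file:
        emb_writer = csv.writer(emb_file, delimiter="\t")
        meta_writer = csv.writer(meta_file, delimiter="\t")
        meta_writer.writerow(["category", "headline"])  # header row for the Projector
        for doc, vector in zip(documents, vectors):
            emb_writer.writerow(vector)                              # one vector per line
            meta_writer.writerow([doc["category"], doc["headline"]])
```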
To visualize the embeddings, load them into Google TensorFlow's [Embedding Projector](http://projector.tensorflow.org/). Turn on color coding by category to really see the vectors in action. You can see our projection [at this link](http://projector.tensorflow.org/?config=https://gist.githubusercontent.com/hillelt/bd4fad5280eefba4d2d8875e87f0eabb/raw/0672efa576a6fd5c14ec93ed86a2b9326a35c3bf/projector_config.json).
## Customize for your data
Try replacing the BBC News corpus with your own data. And if you find anything interesting, we'd love to hear about it! Find us at community@rosette.com.