Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/clemsos/topoparser
Command-line parser for topogram
https://github.com/clemsos/topoparser
Last synced: about 1 month ago
JSON representation
Command-line parser for topogram
- Host: GitHub
- URL: https://github.com/clemsos/topoparser
- Owner: clemsos
- Created: 2015-06-23T05:24:30.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-06-23T13:31:36.000Z (over 9 years ago)
- Last Synced: 2024-10-14T01:23:17.911Z (2 months ago)
- Language: Python
- Homepage:
- Size: 148 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Topoparser
Command-line parser for [Topogram](http://github.com/topogram/topogram).
Use a YAML file to describe data source and workflow
## Install
virtualenv venv
. venv/bin/activate
pip install -r requirements.txt## Usage
Use the command line to call a YAML file containing the parsing instructions
python topoparser.py weibo.yaml
You can save the output to json file by specficing and ```--output-dir``` params
python topoparser.py --output-dir test weibo.yaml
## YAML command file structure
The overall workflow is separated in 3 main steps :
1. **corpora** : load and describe the corpus
2. **process** : extract information from the data
3. **viz** : parse the data properly (JSON output)#### Corpora
The corpora description follow a standard model
corpora :
- content:
type : csv # could be a mongo or dict adapter
file : 'examples/sampleweibo.csv' # path of the file
columns: # name of the relevant columns or fields
source : uid
text : text
timestamp : created_at
time_pattern : '%Y-%m-%d %H:%M:%S'
data : [ permission_denied, deleted_last_seen ]Multiple corpora can be used together (not implemented yet)
#### Process
You first select the column you want to process from the data set -- ex . ```content.text```
The process remains on several data processors :
* **regexp** : compile a regexp -- ex. ```regexp : '@([^::,,\)\(()|\\\s]+)' ``` extract @ mentions from Twitter-like corpus
* **nlp** : extract keywords from a specific languages -- ex. ``` nlp : zh ``` extract words from Chinese language
* **graph** : will add a list of elements into a graph -- ex. ``` graph : add_edges_from_nodes_list ```
* **timeseries** : will format time information following specific time scales (second, minute, hour, day, month, year) -- ex. ``` timeseries : minute ```You can use 2 different operators to link them together :
**save** : will store the results of the operation
-- ex. extract all mentions and keep the results- mentions :
regexp : '@([^::,,\)\(()|\\\s]+)'
type : save**pipe** : will pass the data to the next operations ( like unix ```|``` symbol )
-- ex. extract all keywords from a Chinese sentence and add words into a network- words :
nlp : "zh"
type : pipe
- words_relationships :
graph: add_edges_from_nodes_list
type : save**Complete example**
This will extract hashtags, mentions and urls + create a network of words + compute a daily timeseries from the quantity of messages
process:
content.text :
- hashtags :
regexp: '#([^#\s]+)#'
type : save
- mentions :
regexp : '@([^::,,\)\(()|\\\s]+)'
type : save
- urls:
regexp : '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^\p{P}\s]|/)))'
type : save
- words :
nlp : "zh"
type : save
- words :
nlp : "zh"
type : pipe
- words_relationships :
graph: add_edges_from_nodes_list
type : save
content.time :
- timecount:
timeseries: day
type : save## Visualization parsers
The final step is to parse into JSON formatted properly for visualization library (like d3js)
Currently available : timeseries and network.
viz :
timeseries:
data : content.time.timecount
network :
nodes : content.text.words
edges : content.text.words_relationships## TODO
* Use multiple datasets
* Additional visualization models (map & network+map)
* New data operators like ```fork``` for parrallel processing
* Support for custom scripts and operationsProject inspired by [Datscript](https://github.com/datproject/datscript/)