https://github.com/stef/ec-experts

expert groups mining

Host: GitHub
URL: https://github.com/stef/ec-experts
Owner: stef
License: other
Created: 2013-08-15T21:40:55.000Z (almost 13 years ago)
Default Branch: master
Last Pushed: 2013-11-05T15:49:03.000Z (over 12 years ago)
Last Synced: 2025-03-27T20:46:07.790Z (over 1 year ago)
Language: Python
Size: 180 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: readme.txt
- License: COPYING

README

# you can automatically install and run ec-experts if you are running
# debian or ubuntu by issuing the following command
# wget -O - https://raw.github.com/stef/ec-experts/master/readme.txt | sh -

# install required dependencies (debian/ubuntu)
sudo apt-get install python-lxml python-dateutil git python-pip
# or, if not on debian/ubuntu, use pip instead:
# sudo pip install -r requirements.txt

# git clone this project
git clone https://github.com/stef/ec-experts.git
cd ec-experts

# create data directory
mkdir data

# run an update
./update.sh

# update.sh performs the following steps: (you can and should use the
# commands below to achieve manual improvements when deduplicating)

# 0. for all this to work, you have to be in the ec-experts directory
# where you have cloned it while installing.

# 1. you download the newest expert register dump from:
# "http://ec.europa.eu/transparency/regexpert/view/transparency/openXML.cfm?file=RegExp_xml_{today}.xml"
# where you have to replace {today} with the date in the following format:
# YYYYMMDD
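
# a minimal sketch of how the URL in step 1 could be built for a given
# day. The helper names dump_url and local_path are hypothetical and not
# part of ec-experts; the URL template is copied from step 1 and the
# local file name from step 2. The URL may no longer be served.

```python
from datetime import date

# URL template copied from step 1; {today} is the only placeholder.
DUMP_URL = (
    "http://ec.europa.eu/transparency/regexpert/view/transparency/"
    "openXML.cfm?file=RegExp_xml_{today}.xml"
)


def dump_url(day=None):
    """Return the register dump URL for `day` (defaults to today)."""
    day = day or date.today()
    return DUMP_URL.format(today=day.strftime("%Y%m%d"))


def local_path(day=None):
    """Return the local file name that step 2 expects for `day`."""
    day = day or date.today()
    return "data/regexp_{}.xml".format(day.strftime("%Y%m%d"))


# Example with a fixed date, so the result does not depend on when it runs.
print(dump_url(date(2013, 11, 5)))
print(local_path(date(2013, 11, 5)))
```

# the sketch only builds the names, it does not download anything.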

# 2. extract register from xml dump to json
# this is needed for all the following steps, but only once
# after downloading the dump in step 1.
# python extract.py data/regexp_{today}.xml >data/regexp_{today}.json
# again replace {today} with the date in the format from step 1.

# 3. transform and dump expert register
# this step deduplicates the names from the intermediary format in step 2.
# using the contents of dedup.txt found in this directory.
# you can add more deduplication blocks or edit the existing ones to
# achieve better results.
# this step generates a csv file called data/entities-{today}.csv
# which you can use for further datamining.
# python experts.py data/regexp_{today}.json dedup.txt >data/entities-{today}.csv
# don't forget to replace {today} with YYYYMMDD

# 4. optionally (update.sh does this automatically) find new
# deduplication candidates for expert and rep names.
# python dedup.py data/entities-{today}.csv org_name >data/dedup-{today}.txt
# notice the "org_name" in the line above: this command searches
# all the organisation names for possible duplicates and writes
# them to data/dedup-{today}.txt

# alternatively you can run a similar command for the names of the
# experts:
# python dedup.py data/entities-{today}.csv name >>data/dedup-{today}.txt
# notice the >>, which appends to the results for the organisations
# from the previous example instead of overwriting them. Also note
# the change from "org_name" to "name", which is necessary for
# selecting the names of the experts.
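
# to illustrate what "possible duplicate names" can mean, here is a
# minimal sketch that pairs up similar-looking names with difflib from
# the python standard library. This is an assumption for illustration
# only: it is not necessarily how dedup.py works, and the function
# name similar_pairs, the threshold and the sample names are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.85):
    """Yield (a, b, ratio) for distinct names whose similarity >= threshold."""
    unique = sorted(set(names))
    for a, b in combinations(unique, 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            yield a, b, ratio


# Made-up sample names, not taken from the expert register.
names = ["European Commission", "Europeen Commission", "Greenpeace"]
pairs = list(similar_pairs(names))
for a, b, ratio in pairs:
    print("{!r} ~ {!r} ({:.2f})".format(a, b, ratio))
```

# comparing every pair is quadratic in the number of names, so this
# is only practical for small lists.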

# That's about it. You should perform steps 3. and 4. iteratively,
# while editing dedup.txt and merging dedup candidate blocks from
# data/dedup-{today}.txt into it, until you have a
# data/entities-{today}.csv file that is clean enough for you.

# you can also redo steps 1. and 2. daily, to regenerate the csv based
# on the newest data from the commission.

# When you're done, load the generated data/entities-{today}.csv
# file in your favourite spreadsheet editor for further analysis.
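
# if you prefer python over a spreadsheet, a first look at the csv
# could be sketched as below. It assumes a comma-separated file with a
# header row containing the "name" and "org_name" columns used in step
# 4; the real layout of data/entities-{today}.csv may differ. The
# function name count_by_column and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how often each non-empty value of `column` occurs."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader if row.get(column))


# Made-up sample standing in for data/entities-{today}.csv.
sample = io.StringIO(
    "name,org_name\n"
    "Alice Example,Example Org\n"
    "Bob Example,Example Org\n"
    "Carol Example,Another Org\n"
)
counts = count_by_column(sample, "org_name")
for org, n in counts.most_common():
    print(n, org)
```

# for a real file, pass open("data/entities-{today}.csv", newline="")
# with {today} replaced, instead of the in-memory sample.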
