https://github.com/stef/ec-experts
expert groups mining
- Host: GitHub
- URL: https://github.com/stef/ec-experts
- Owner: stef
- License: other
- Created: 2013-08-15T21:40:55.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2013-11-05T15:49:03.000Z (over 12 years ago)
- Last Synced: 2025-03-27T20:46:07.790Z (over 1 year ago)
- Language: Python
- Size: 180 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: readme.txt
- License: COPYING
README
# you can automatically install and run ec-experts if you are running
# debian or ubuntu by issuing the following command
# wget -O - https://raw.github.com/stef/ec-experts/master/readme.txt | sh -
# install required dependencies (on debian/ubuntu)
sudo apt-get install python-lxml python-dateutil git python-pip
# or, if not on debian/ubuntu:
# sudo pip install -r requirements.txt
# git clone this project
git clone https://github.com/stef/ec-experts.git
cd ec-experts
# create data directory
mkdir data
# run an update
./update.sh
# update.sh performs the following steps. You can, and should, run the
# commands below by hand to improve the deduplication manually.
# 0. for all this to work, you have to be in the ec-experts directory
# that you cloned during installation.
# 1. download the newest expert register dump from:
# "http://ec.europa.eu/transparency/regexpert/view/transparency/openXML.cfm?file=RegExp_xml_{today}.xml"
# replacing {today} with the date in the following format:
# YYYYMMDD
# 2. extract register from xml dump to json
# this is needed for all the following steps, but only once
# after downloading the dump in step 1.
# python extract.py data/regexp_{today}.xml >data/regexp_{today}.json
# again replace {today} with the date in the format from step 1.
# 3. transform and dump expert register
# this step deduplicates the names from the intermediary format in step 2.
# using the contents of dedup.txt found in this directory.
# you can add more deduplication blocks or edit the existing ones to
# achieve better results.
# this step generates a csv file called data/entities-{today}.csv
# which you can use for further datamining.
# python experts.py data/regexp_{today}.json dedup.txt >data/entities-{today}.csv
# don't forget to replace {today} with YYYYMMDD
# 4. optionally (update.sh does this automatically) find new
# candidates for dedup expert and rep names.
# python dedup.py data/entities-{today}.csv org_name >data/dedup-{today}.txt
# notice the "org_name" in the line above: this command searches for
# possible duplicate names among all the organisation names and writes
# them to data/dedup-{today}.txt
# alternatively you can run a similar command for the names of the
# experts:
# python dedup.py data/entities-{today}.csv name >>data/dedup-{today}.txt
# notice the >> which appends to, rather than overwrites, the results
# for the organisations from the previous example. Also note the change
# from "org_name" to "name", which is necessary for selecting the
# names of the experts.
# That's about it. You should perform steps 3. and 4. iteratively,
# while editing dedup.txt and merging dedup candidate blocks from
# data/dedup-{today}.txt into it, until you have a
# data/entities-{today}.csv file that is clean enough for you.
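# A minimal sketch, in Python, of one round of steps 3. and 4.: it only
# builds the command lines quoted above for a given {today}, so that they
# can be printed and pasted into a shell. The helper name dedup_round is
# made up for this illustration and is not part of ec-experts.

```python
def dedup_round(today, dedup_file="dedup.txt"):
    """Build the shell commands for one round of steps 3. and 4.

    `today` is the date as a YYYYMMDD string. Nothing is executed here;
    the function only returns the command lines as strings.
    """
    json_dump = "data/regexp_{0}.json".format(today)
    entities = "data/entities-{0}.csv".format(today)
    candidates = "data/dedup-{0}.txt".format(today)
    return [
        # step 3: regenerate the csv using the current dedup.txt
        "python experts.py {0} {1} >{2}".format(json_dump, dedup_file, entities),
        # step 4: organisation names first; ">" starts a fresh candidate file
        "python dedup.py {0} org_name >{1}".format(entities, candidates),
        # step 4: expert names; ">>" appends to the organisation results
        "python dedup.py {0} name >>{1}".format(entities, candidates),
    ]


if __name__ == "__main__":
    for command in dedup_round("20131105"):
        print(command)
```

# After each round, merge the useful candidate blocks into dedup.txt by
# hand and run the three commands again.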
# you can also redo steps 1. and 2. daily, to regenerate the csv based
# on the newest data from the commission.
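# For a daily run, the {today} placeholder has to be filled in the same
# way everywhere. The sketch below shows one way to do that in Python.
# The URL template and the file names are the ones quoted in the steps
# above; the helper name dump_paths is hypothetical, and how update.sh
# itself names the downloaded file is not shown here.

```python
from datetime import date

# URL template copied from step 1.; {today} is the YYYYMMDD placeholder.
DUMP_URL = ("http://ec.europa.eu/transparency/regexpert/view/transparency/"
            "openXML.cfm?file=RegExp_xml_{today}.xml")


def dump_paths(day):
    """Return the download URL and the data/ file names for one day."""
    today = day.strftime("%Y%m%d")  # YYYYMMDD, as required in step 1.
    return {
        "url": DUMP_URL.format(today=today),
        "xml": "data/regexp_{0}.xml".format(today),
        "json": "data/regexp_{0}.json".format(today),
        "csv": "data/entities-{0}.csv".format(today),
    }


if __name__ == "__main__":
    for key, value in sorted(dump_paths(date.today()).items()):
        print("{0}: {1}".format(key, value))
```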
# When you're done, load the generated data/entities-{today}.csv
# file in your favourite spreadsheet editor for further analysis.
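# If no spreadsheet editor is at hand, a few lines of Python can give a
# first overview. This sketch assumes that the csv is comma-separated and
# has a header row containing an "org_name" column (the column name passed
# to dedup.py above); the actual layout may differ. The sample rows are
# made up for the illustration.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how often each non-empty value of `column` occurs."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader if row.get(column))


# Made-up sample standing in for data/entities-{today}.csv
SAMPLE = (u"name,org_name\n"
          u"A. Expert,Example Org\n"
          u"B. Expert,Example Org\n"
          u"C. Expert,Other Org\n")

if __name__ == "__main__":
    counts = count_by_column(io.StringIO(SAMPLE), "org_name")
    for org, number in counts.most_common():
        print("{0}: {1}".format(org, number))
```

# To use it on real data, open the generated csv file and pass the file
# object to count_by_column instead of the sample.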