https://github.com/26hzhang/dblpparser
A python parser for DBLP dataset
- Host: GitHub
- URL: https://github.com/26hzhang/dblpparser
- Owner: 26hzhang
- License: mit
- Created: 2018-04-25T12:36:38.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2019-03-20T06:48:02.000Z (over 7 years ago)
- Last Synced: 2025-03-29T09:21:48.555Z (about 1 year ago)
- Topics: dblp-dataset, python3
- Language: Python
- Size: 543 KB
- Stars: 45
- Watchers: 1
- Forks: 17
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DBLP Dataset Parser
This is a Python parser for the [DBLP dataset](https://dblp.uni-trier.de/). The XML dump can be downloaded [here](http://dblp.org/xml/) from the [DBLP homepage](https://dblp.org/).
The parser requires the `dtd` file, so make sure you have both `dblp-XXX.xml` (the dataset) and `dblp-XXX.dtd`. Both files must be in the same directory, and the name of the `dtd` file must match the name given in the `<!DOCTYPE>` declaration of the `xml` file. You can check this with the `head dblp-XXX.xml` command, as shown below:
```xml
<!DOCTYPE dblp SYSTEM "dblp-XXX.dtd">
<dblp>
...
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
...
```
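
The same check can be scripted. The following is a minimal sketch, not part of the parser itself: the helper names `find_dtd_name` and `dtd_is_present` are hypothetical, and the sketch assumes the DTD is declared near the top of the file in the form `<!DOCTYPE dblp SYSTEM "...">`.

```python
import os
import re

# Matches a declaration such as <!DOCTYPE dblp SYSTEM "dblp-XXX.dtd">
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+\w+\s+SYSTEM\s+"([^"]+)"')


def find_dtd_name(xml_path, max_bytes=4096):
    """Return the DTD file name declared near the top of the XML file, or None."""
    with open(xml_path, 'rb') as f:
        # Only the first few KB are read, so this is cheap even for a huge dump.
        # latin-1 can decode any byte sequence, so this never raises.
        head = f.read(max_bytes).decode('latin-1')
    match = DOCTYPE_RE.search(head)
    return match.group(1) if match else None


def dtd_is_present(xml_path):
    """Check that the declared DTD sits in the same directory as the XML file."""
    dtd_name = find_dtd_name(xml_path)
    if dtd_name is None:
        return False
    xml_dir = os.path.dirname(os.path.abspath(xml_path))
    return os.path.isfile(os.path.join(xml_dir, dtd_name))
```

Calling `dtd_is_present('dataset/dblp.xml')` before parsing gives an early, explicit failure if the DTD is missing or misnamed.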
A sample showing how to use the parser:
```python
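# context_iter, log_msg, and parse_article are defined in this repository's parser script.
def main():
    dblp_path = 'dataset/dblp.xml'
    save_path = 'article.json'
    try:
        # Check that the XML file (and its DTD) can be loaded before parsing.
        context_iter(dblp_path)
        log_msg("LOG: Successfully loaded \"{}\".".format(dblp_path))
    except IOError:
        log_msg("ERROR: Failed to load file \"{}\". Please check your XML and DTD files.".format(dblp_path))
        exit()
    # Extract all articles; the output is JSON by default.
    parse_article(dblp_path, save_path, save_to_csv=False)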
```
Some extracted results:
**Number of publications of each type**:

**Number of occurrences of each attribute across all publications**:

**Counts of five different features of articles**:

**Distribution of articles by publication year**:
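
Tallies like the ones listed above can be computed in a single streaming pass over the XML. The sketch below is illustrative only and is not the parser's own code: `count_records` and `RECORD_TAGS` are hypothetical names, the record element names are assumed from the DBLP DTD, and the sample data is synthetic. It uses the standard library's `xml.etree.ElementTree`, which cannot resolve the character entities defined in the DBLP DTD, so it only works on entity-free input. For the real dump a DTD-aware parser is needed, which is why the `dtd` file is required.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Record element names assumed from the DBLP DTD; adjust to match your dump.
RECORD_TAGS = {'article', 'inproceedings', 'proceedings', 'book',
               'incollection', 'phdthesis', 'mastersthesis', 'www'}


def count_records(source):
    """Stream through the XML, tallying record types and article years."""
    type_counts, year_counts = Counter(), Counter()
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag in RECORD_TAGS:
            type_counts[elem.tag] += 1
            if elem.tag == 'article':
                year = elem.findtext('year')
                if year:
                    year_counts[year] += 1
            elem.clear()  # drop the record's children once it has been counted
    return type_counts, year_counts


# Synthetic, entity-free sample standing in for a real dump.
SAMPLE = b"""<dblp>
<article key="a1"><author>A. Author</author><title>T1</title><year>2017</year></article>
<article key="a2"><author>B. Author</author><title>T2</title><year>2018</year></article>
<phdthesis key="p1"><author>C. Author</author><title>T3</title><year>2010</year></phdthesis>
</dblp>"""

types, years = count_records(io.BytesIO(SAMPLE))
print(types)
print(years)
```

Processing elements on their `end` event and clearing each finished record keeps the per-record memory small, which matters for a dump of this size.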
