{"id":13767509,"url":"https://github.com/benjaminvdb/DBRD","last_synced_at":"2025-05-10T22:32:10.943Z","repository":{"id":41483800,"uuid":"168819565","full_name":"benjaminvdb/DBRD","owner":"benjaminvdb","description":"110k Dutch Book Reviews Dataset for Sentiment Analysis","archived":false,"fork":false,"pushed_at":"2023-10-06T08:00:10.000Z","size":35,"stargazers_count":30,"open_issues_count":4,"forks_count":3,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-11-17T03:30:20.812Z","etag":null,"topics":["dataset","dataset-creation","dutch","nlp","nlp-machine-learning","python","python3","scraped-data","scraper"],"latest_commit_sha":null,"homepage":"https://benjaminvdb.github.io/DBRD/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/benjaminvdb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-02-02T10:17:31.000Z","updated_at":"2024-06-09T20:04:08.000Z","dependencies_parsed_at":"2024-01-12T00:27:21.264Z","dependency_job_id":"bca4e710-7a5e-482d-8922-51b1cf3d577b","html_url":"https://github.com/benjaminvdb/DBRD","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjaminvdb%2FDBRD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjaminvdb%2FDBRD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjaminvdb%2FDBRD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjaminvdb%2FDBRD/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/benjaminvdb","download_url":"https://codeload.github.com/benjaminvdb/DBRD/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253492529,"owners_count":21916959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","dataset-creation","dutch","nlp","nlp-machine-learning","python","python3","scraped-data","scraper"],"created_at":"2024-08-03T16:01:09.253Z","updated_at":"2025-05-10T22:32:09.380Z","avatar_url":"https://github.com/benjaminvdb.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# DBRD: Dutch Book Reviews Dataset\n\n![GitHub release (with filter)](https://img.shields.io/github/v/release/benjaminvdb/DBRD) ![GitHub](https://img.shields.io/github/license/benjaminvdb/DBRD) ![GitHub all releases](https://img.shields.io/github/downloads/benjaminvdb/DBRD/total) ![GitHub Sponsors](https://img.shields.io/github/sponsors/benjaminvdb)\n\nThe DBRD (pronounced *dee-bird*) dataset contains over 110k book reviews along with associated binary sentiment polarity labels. It is greatly influenced by the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) and intended as a benchmark for sentiment classification in Dutch. 
The scripts that were used to scrape the reviews from [Hebban](https://www.hebban.nl) can be found in the [DBRD GitHub repository](https://github.com/benjaminvdb/DBRD).\n\n# Dataset\n\n## Downloads\n\nThe dataset is ~79MB compressed and can be downloaded from here:\n\n**[Dutch Book Reviews Dataset](https://github.com/benjaminvdb/DBRD/releases/download/v3.0/DBRD_v3.tgz)**\n\n\nA language model trained with [FastAI](https://github.com/fastai/fastai) on Dutch Wikipedia can be downloaded from here:\n\n**[Dutch language model trained on Wikipedia](http://bit.ly/2trOhzq)**\n\n\n## Overview\n\n### Directory structure\n\nThe dataset includes three folders with data: `test` (test split), `train` (train split) and `unsup` (remaining reviews).\nEach review is assigned a unique identifier, which can be deduced from the filename along with the rating: `[ID]_[RATING].txt`. *This is different from the Large Movie Review Dataset, where each file in a directory has a unique ID, but IDs are reused between folders.*\n\nLine `L` of the `urls.txt` file contains the Hebban URL of the book review with ID `L`, e.g., the URL of the book review in `48091_5.txt` can be found on line 48091 of `urls.txt`. 
It cannot be guaranteed that these pages still exist.\n\n````\n.\n├── README.md     // the file you're reading\n├── test          // balanced 10% test split\n│   ├── neg\n│   └── pos\n├── train         // balanced 90% train split\n│   ├── neg\n│   └── pos\n├── unsup         // unbalanced positive and neutral\n└── urls.txt      // urls to reviews on Hebban\n````\n\n### Size\n````\n  #all:           118516 (= #supervised + #unsupervised)\n  #supervised:     22252 (= #training + #testing)\n  #unsupervised:   96264\n  #training:       20028\n  #testing:         2224\n````\n\n### Labels\n\nDistribution of labels `positive/negative/neutral` in rounded percentages.\n````\n  training: 50/50/ 0\n  test:     50/50/ 0\n  unsup:    72/ 0/28\n````\n\nTrain and test sets are balanced and contain no neutral reviews (for which `rating==3`).\n\n# Reproduce data\n\nSince scraping Hebban induces a load on their servers, it's best to download the prepared dataset instead. This also makes sure your results can be compared to those of others. The scripts and instructions should be used mostly as a starting point for building a scraper for another website.\n\n## Install dependencies\n\n### ChromeDriver\nI'm using [Selenium](https://www.seleniumhq.org) for automating user actions such as clicks. This library requires a browser driver that provides the rendering backend. I've made use of [ChromeDriver](http://chromedriver.chromium.org/).\n\n#### macOS\nIf you're on macOS and you have Homebrew installed, you can install ChromeDriver by running:\n\n    brew install chromedriver\n\n#### Other OSes\nYou can download ChromeDriver from the official [download page](http://chromedriver.chromium.org/downloads).\n\n### Python\nThe scripts are written for **Python 3**. To install the Python dependencies, run:\n\n    pip3 install -r ./requirements.txt\n\n\n## Run\nThree scripts are provided that can be run in sequence. 
You can also run `run.sh` to run all scripts with defaults.\n\n### Gather URLs\nThe first step is to gather all review URLs from [Hebban](https://www.hebban.nl). Run `gather_urls.py` to fetch them and save them to a text file.\n\n```\nUsage: gather_urls.py [OPTIONS] OUTFILE\n\n  This script gathers review urls from Hebban and writes them to OUTFILE.\n\nOptions:\n  --offset INTEGER  Review offset.\n  --step INTEGER    Number of review urls to fetch per request.\n  --help            Show this message and exit.\n```\n\n### Scrape URLs\nThe second step is to scrape the URLs for review data. Run `scrape_reviews.py` to iterate over the review URLs and save the scraped data to a JSON file.\n\n```\nUsage: scrape_reviews.py [OPTIONS] INFILE OUTFILE\n\n  Iterate over review urls in INFILE text file, scrape review data and\n  output to OUTFILE.\n\nOptions:\n  --encoding TEXT   Output file encoding.\n  --indent INTEGER  Output JSON file with scraped data.\n  --help            Show this message and exit.\n```\n\n### Post-process\n\nThe third and final step is to prepare the dataset using the scraped reviews. By default, we limit the number of reviews to 110k, filter out some reviews and prepare train and test sets containing 0.9 and 0.1 of the total, respectively.\n\n```\nUsage: post_process.py [OPTIONS] INFILE OUTDIR\n\nOptions:\n  --encoding TEXT              Input file encoding\n  --keep-incorrect-date TEXT   Whether to keep reviews with invalid dates.\n  --sort TEXT                  Whether to sort reviews by date.\n  --maximum INTEGER            Maximum number of reviews in output\n  --valid-size-fraction FLOAT  Fraction of total to set aside as validation.\n  --shuffle TEXT               Shuffle data before saving.\n  --help                       Show this message and exit.\n```\n\n## Changelog\n\nv3: Changed name of the dataset from 110kDBRD to DBRD. 
The dataset itself remains unchanged.\n\nv2: Removed advertisements from reviews and increased dataset size to 118,516.\n\nv1: Initial release\n\n## Citation\n\nPlease use the following citation when making use of this dataset in your work.\n\n```\n@article{DBLP:journals/corr/abs-1910-00896,\n  author    = {Benjamin van der Burgh and\n               Suzan Verberne},\n  title     = {The merits of Universal Language Model Fine-tuning for Small Datasets\n               - a case with Dutch book reviews},\n  journal   = {CoRR},\n  volume    = {abs/1910.00896},\n  year      = {2019},\n  url       = {http://arxiv.org/abs/1910.00896},\n  archivePrefix = {arXiv},\n  eprint    = {1910.00896},\n  timestamp = {Fri, 04 Oct 2019 12:28:06 +0200},\n  biburl    = {https://dblp.org/rec/journals/corr/abs-1910-00896.bib},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\n## Acknowledgements\n\nThis dataset was created for testing out the [ULMFiT](https://arxiv.org/abs/1801.06146) (by Jeremy Howard and Sebastian Ruder) deep learning algorithm for text classification. It is implemented in the [FastAI](https://github.com/fastai/fastai) Python library that has taught me a lot. I'd also like to thank [Timo Block](https://github.com/tblock) for making his [10kGNAD](https://github.com/tblock/10kGNAD) dataset publicly available and giving me a starting point for this dataset. The dataset structure is based on the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) by Andrew L. Maas et al. Thanks to [Andreas van Cranenburg](https://github.com/andreasvc) for pointing out a problem with the dataset.\n\nAnd of course I'd like to thank all the reviewers on [Hebban](https://www.hebban.nl) for having taken the time to write all these reviews. 
You've made both book enthusiasts and NLP researchers very happy :)\n\n## License\n\nAll code in this repository is licensed under the MIT License.\n\nThe dataset is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenjaminvdb%2FDBRD","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbenjaminvdb%2FDBRD","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenjaminvdb%2FDBRD/lists"}