https://github.com/pierlauro/mdbubing
From WARC records to MongoDB documents
- Host: GitHub
- URL: https://github.com/pierlauro/mdbubing
- Owner: pierlauro
- License: agpl-3.0
- Created: 2020-10-26T19:01:04.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-03T18:55:45.000Z (over 5 years ago)
- Last Synced: 2025-02-26T03:35:34.420Z (over 1 year ago)
- Topics: bubing, crawler, crawling, warc, warc-files, warc-format, warc-record, webarchive, webarchiving
- Language: Java
- Size: 145 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
## MDBubing: from WARC records to MongoDB documents


**MDBubing** (a ridiculous wordplay merging the words MongoDB and [BUbiNG](https://github.com/LAW-Unimi/BUbiNG)) is a library that aims to:
- Make it simple to migrate existing WARC files into MongoDB, exporting each record as a separate document.
- Save MongoDB documents at BUbiNG crawl time, bypassing the creation of WARC files.
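To make the "one record, one document" idea concrete, here is a minimal sketch using only the Java standard library. It is not MDBubing's code and does not reflect its actual document schema: the class name `WarcRecordSketch`, the `version` and `content` keys, and the choice to reuse WARC header names as field names are all assumptions made for illustration. It also treats the content block as plain text and ignores `Content-Length`, which a real reader would have to honour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a hypothetical mapping, not MDBubing's actual schema or code.
class WarcRecordSketch {

  // Turns the text of a single WARC record into a flat, document-like map.
  static Map<String, Object> toDocument(String record) {
    // A blank line (CRLF CRLF) separates the header block from the content block.
    int split = record.indexOf("\r\n\r\n");
    if (split < 0) {
      throw new IllegalArgumentException("Missing header/content separator");
    }
    String[] headerLines = record.substring(0, split).split("\r\n");

    Map<String, Object> document = new LinkedHashMap<>();
    document.put("version", headerLines[0]); // first line, e.g. "WARC/1.0"
    for (int i = 1; i < headerLines.length; i++) {
      // Split on the first colon only, since values such as URIs contain colons too.
      int colon = headerLines[i].indexOf(':');
      if (colon < 0) {
        continue; // skip lines that are not "name: value" pairs
      }
      String name = headerLines[i].substring(0, colon).trim();
      String value = headerLines[i].substring(colon + 1).trim();
      document.put(name, value);
    }
    document.put("content", record.substring(split + 4));
    return document;
  }

  public static void main(String[] args) {
    // Hand-written sample record; the identifier and URI are placeholders.
    String record =
        "WARC/1.0\r\n"
            + "WARC-Type: response\r\n"
            + "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
            + "WARC-Target-URI: http://example.com/\r\n"
            + "Content-Length: 5\r\n"
            + "\r\n"
            + "hello";
    System.out.println(toDocument(record));
  }
}
```

A map of this shape is the kind of structure that could then be handed to a MongoDB driver as one document per record.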
### Usage
#### Migrate WARC records to MongoDB documents
1) Create a properties file defining the values of the following fields:
- *connectionString*: the [connection string](https://docs.mongodb.com/manual/reference/connection-string/) of a MongoDB instance (it can also carry options such as a `majority` write concern)
- *database*: the name of the database to write to
- *collection*: the collection to save documents in
- *warcFilePath*: path to a WARC file (supported formats: `.warc` and `.warc.gz`)
You can refer to the following sample configuration: [WarcToMongo-sample-configuration.properties](https://github.com/pierlauro/MDBubing/blob/master/src/test/resources/WarcToMongo-sample-configuration.properties).
2) Execute the following command:
```
$ java dev.pstux.mdbubing.WarcToMongo -P <path-to-properties-file>
```
3) Wait for the records to be exported and enjoy!
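Before launching the migration, it can help to sanity-check the properties file. The sketch below is a hypothetical helper, not part of MDBubing: the class `WarcToMongoConfigCheck` and its methods are invented for illustration, and every value in the sample (host, database, collection, path) is a placeholder. It only verifies that the four fields from step 1 are present and that the WARC path ends with a supported extension.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of MDBubing: checks a configuration before a migration run.
class WarcToMongoConfigCheck {

  // The four field names listed in step 1.
  static final List<String> REQUIRED_KEYS =
      List.of("connectionString", "database", "collection", "warcFilePath");

  // Parses the text of a properties file.
  static Properties parse(String text) {
    Properties properties = new Properties();
    try {
      properties.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return properties;
  }

  // Returns the required keys that are absent or blank.
  static List<String> missingKeys(Properties properties) {
    List<String> missing = new ArrayList<>();
    for (String key : REQUIRED_KEYS) {
      String value = properties.getProperty(key);
      if (value == null || value.trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }

  // Returns true when the path ends with one of the supported extensions.
  static boolean hasSupportedExtension(String warcFilePath) {
    return warcFilePath.endsWith(".warc") || warcFilePath.endsWith(".warc.gz");
  }

  public static void main(String[] args) {
    // Placeholder values chosen for illustration only.
    String sample =
        String.join(
            "\n",
            "connectionString=mongodb://localhost:27017/?w=majority",
            "database=crawl",
            "collection=records",
            "warcFilePath=/tmp/sample.warc.gz");
    Properties properties = parse(sample);
    System.out.println("Missing keys: " + missingKeys(properties));
    System.out.println(
        "Supported extension: " + hasSupportedExtension(properties.getProperty("warcFilePath")));
  }
}
```

For the authoritative set of fields and their format, refer to the sample configuration linked in step 1.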
#### Save MongoDB documents at crawl time
TODO: document this section
### Development
This project expects source files to be formatted following the [Google Java style](https://google.github.io/styleguide/javaguide.html). Run `./gradlew goJF` in order to automatically format all `.java` files under `src/`.
Execute unit tests (mocking MongoDB entities): `./gradlew test`.
Execute integration tests (automatically running a MongoDB docker container, testing, and shutting the instance down): `./gradlew integrationTestWithDocker`.
CI task definitions can be found in the [workflows directory](https://github.com/pierlauro/MDBubing/tree/master/.github/workflows).
#### TODO
- Javadoc
- Upload the artifact to Sonatype
- Add ability to write into multiple collections
- Map `WARC-Record-ID` to the `_id` field by default
- Add ability to specify the desired `` mapping
- Performance benchmarks