{"id":17541794,"url":"https://github.com/pierlauro/mdbubing","last_synced_at":"2025-03-29T05:14:03.884Z","repository":{"id":89356526,"uuid":"307476479","full_name":"pierlauro/MDBubing","owner":"pierlauro","description":"From WARC records to MongoDB documents","archived":false,"fork":false,"pushed_at":"2020-11-03T18:55:45.000Z","size":148,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-26T03:35:34.420Z","etag":null,"topics":["bubing","crawler","crawling","warc","warc-files","warc-format","warc-record","webarchive","webarchiving"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pierlauro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-26T19:01:04.000Z","updated_at":"2020-11-03T18:55:48.000Z","dependencies_parsed_at":null,"dependency_job_id":"ac04d37d-3cfe-433e-92a5-84a6975e136d","html_url":"https://github.com/pierlauro/MDBubing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pierlauro%2FMDBubing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pierlauro%2FMDBubing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pierlauro%2FMDBubing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pierlauro%2FMDBubing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/pierlauro","download_url":"https://codeload.github.com/pierlauro/MDBubing/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246140592,"owners_count":20729802,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bubing","crawler","crawling","warc","warc-files","warc-format","warc-record","webarchive","webarchiving"],"created_at":"2024-10-20T23:43:15.008Z","updated_at":"2025-03-29T05:14:03.867Z","avatar_url":"https://github.com/pierlauro.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"## MDBubing: from WARC records to MongoDB documents\n\n![unit-tests+lint](https://github.com/pierlauro/MDBubing/workflows/unit-tests+linting/badge.svg)\n![integration-tests](https://github.com/pierlauro/MDBubing/workflows/integration-tests/badge.svg)\n\n**MDBubing** - a ridiculous wordplay merging the words MongoDB and [BUbiNG](https://github.com/LAW-Unimi/BUbiNG) - is a library that aims to:\n- Make it simple to migrate existing WARC files into MongoDB, exporting each record as a separate document.\n- Save MongoDB documents at BUbiNG crawl time, bypassing WARC file creation.\n\n### Usage\n\n#### Migrate WARC records to MongoDB documents\n1) Create a properties file defining the values of the following fields:\n- *connectionString*: the [connection string](https://docs.mongodb.com/manual/reference/connection-string/) of a MongoDB instance (you can also use it to set options such as a write concern, e.g. `w=majority`)\n- *database*: the target database\n- *collection*: the collection to save 
documents in\n- *warcFilePath*: path to a WARC file (supported formats: `.warc` and `.warc.gz`)\n\nYou can refer to the following sample configuration: [WarcToMongo-sample-configuration.properties](https://github.com/pierlauro/MDBubing/blob/master/src/test/resources/WarcToMongo-sample-configuration.properties).\n\n2) Execute the following command:\n```\n$ java dev.pstux.mdbubing.WarcToMongo -P \u003cproperties_file_path\u003e\n```\n3) Wait for the records to be exported and enjoy!\n\n#### Save MongoDB documents at crawl time\nTODO: document this section\n\n\n### Development\n\nThis project expects source files to be formatted following the [Google Java style](https://google.github.io/styleguide/javaguide.html). Run `./gradlew goJF` to automatically format all `.java` files under `src/`.\n\nExecute unit tests (mocking MongoDB entities): `./gradlew test`.\n\nExecute integration tests (automatically running a MongoDB Docker container, testing, and shutting the instance down): `./gradlew integrationTestWithDocker`.\n\nCI task definitions can be found in the [workflows directory](https://github.com/pierlauro/MDBubing/tree/master/.github/workflows).\n\n\n#### TODO\n- Javadoc\n- Upload artifact to Sonatype\n- Add ability to write into multiple collections\n- Map `WARC-Record-ID` to the `_id` field by default\n- Add ability to specify the desired `\u003cWARC header, document field\u003e` mapping\n- Performance benchmarks\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpierlauro%2Fmdbubing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpierlauro%2Fmdbubing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpierlauro%2Fmdbubing/lists"}