{"id":13414798,"url":"https://github.com/sebdah/scrapy-mongodb","last_synced_at":"2025-04-04T09:10:04.138Z","repository":{"id":6235766,"uuid":"7467486","full_name":"sebdah/scrapy-mongodb","owner":"sebdah","description":"MongoDB pipeline for Scrapy. This module supports both MongoDB in standalone setups and replica sets. scrapy-mongodb will insert the items to MongoDB as soon as your spider finds data to extract.","archived":false,"fork":false,"pushed_at":"2021-04-06T19:13:24.000Z","size":532,"stargazers_count":357,"open_issues_count":7,"forks_count":99,"subscribers_count":26,"default_branch":"master","last_synced_at":"2024-10-13T23:11:45.100Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://sebdah.github.com/scrapy-mongodb/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sebdah.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-01-06T12:26:40.000Z","updated_at":"2024-08-24T21:50:42.000Z","dependencies_parsed_at":"2022-08-26T15:41:26.093Z","dependency_job_id":null,"html_url":"https://github.com/sebdah/scrapy-mongodb","commit_stats":null,"previous_names":[],"tags_count":24,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebdah%2Fscrapy-mongodb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebdah%2Fscrapy-mongodb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebdah%2Fscrapy-mongodb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebdah%2Fscrapy-mongodb/manifests","owner_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/owners/sebdah","download_url":"https://codeload.github.com/sebdah/scrapy-mongodb/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247149505,"owners_count":20891954,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T21:00:36.921Z","updated_at":"2025-04-04T09:10:04.106Z","avatar_url":"https://github.com/sebdah.png","language":"Python","funding_links":[],"categories":["Python","Libraries","Apps","Scrapy Middleware"],"sub_categories":["Python","Data Processing"],"readme":"[![PyPI version](https://badge.fury.io/py/scrapy-mongodb.svg)](https://badge.fury.io/py/scrapy-mongodb)\n[![Build Status](https://travis-ci.org/sebdah/scrapy-mongodb.svg?branch=master)](https://travis-ci.org/sebdah/scrapy-mongodb)\n\n# scrapy-mongodb\n\u003e MongoDB pipeline for Scrapy. This library supports both MongoDB in standalone setups and replica sets. 
It will insert items into MongoDB as soon as your spider finds data to extract.\n`scrapy-mongodb` can also buffer objects if you prefer to write chunks of data to MongoDB rather than one write per document *(see the `MONGODB_BUFFER_DATA` option for details)*.\n\n## INSTALLATION\n### Dependencies\n[Read more here](./requirements.txt).\n\n### Instructions\nInstall via `pip`:\n```\npip install -r requirements.txt\npip install scrapy-mongodb\n```\n\n## CONFIGURATION\n### Basic configuration\nAdd these options to your `settings.py` file:\n```\nITEM_PIPELINES = {\n    ...\n    'scrapy_mongodb.MongoDBPipeline': 300,\n    ...\n}\n\nMONGODB_URI = 'mongodb://localhost:27017'\nMONGODB_DATABASE = 'scrapy'\nMONGODB_COLLECTION = 'my_items'\n```\n\nIf you want a unique key in your database, name the key with this option:\n```\nMONGODB_UNIQUE_KEY = 'url'\n```\n\n### Replica sets\nYou can configure `scrapy-mongodb` to support MongoDB replica sets by adding the `MONGODB_REPLICA_SET` option and specifying the additional replica set hosts in `MONGODB_URI`:\n```\nMONGODB_REPLICA_SET = 'myReplicaSetName'\nMONGODB_URI = 'mongodb://host1.example.com:27017,host2.example.com:27017,host3.example.com:27017'\n```\n\nIf you need to ensure that your data has been replicated, use the `MONGODB_REPLICA_SET_W` option. It is an implementation of the `w` parameter in `pymongo`. Details from the `pymongo` documentation:\n\u003e Write operations will block until they have been replicated to the specified number or tagged set of servers. `w=\u003cint\u003e` always includes the replica set primary (e.g. `w=3` means write to the primary and wait until replicated to two secondaries). Passing `w=0` disables write acknowledgement and all other write concern options.\n\n### Data buffering\nTo ease the load on MongoDB, `scrapy-mongodb` has a buffering feature. You can enable it by setting `MONGODB_BUFFER_DATA` to the buffer size you want. 
For example, `scrapy-mongodb` will write 10 documents at a time to the database if you set:\n```\nMONGODB_BUFFER_DATA = 10\n```\n\n*It is not possible to combine this feature with `MONGODB_UNIQUE_KEY`, because the `update` method in `pymongo` doesn't support multi-document updates.*\n\n### Timestamps\n`scrapy-mongodb` can append a timestamp to your item when inserting it into the database. Enable this feature with:\n```\nMONGODB_ADD_TIMESTAMP = True\n```\n\nThis will modify the document to something like this:\n```\n{\n    ...\n    'scrapy-mongodb': {\n        'ts': ISODate(\"2013-01-10T07:43:56.797Z\")\n    }\n    ...\n}\n```\n\n*The timestamp is in UTC.*\n\n### One collection per spider\nIt's possible to write data to one collection per spider. To enable that\nfeature, add this setting:\n```\nMONGODB_SEPARATE_COLLECTIONS = True\n```\n\n### Full list of available options\n\n| **Parameter** | **Default** | **Required?** | **Description** |\n| --- | --- | --- | --- |\n| `MONGODB_DATABASE` | scrapy-mongodb | No | Database to use. Does not need to exist. |\n| `MONGODB_COLLECTION` | items | No | Collection within the database to use. Does not need to exist. |\n| `MONGODB_URI` | mongodb://localhost:27017 | No | URI to the MongoDB instance or replica sets you want to connect to. It must start with `mongodb://` (see more in the [MongoDB docs][1]). E.g.: `mongodb://user:pass@host:port`, `mongodb://user:pass@host:port,host2:port2` |\n| `MONGODB_UNIQUE_KEY` | None | No | If you want to have a unique key in the database, enter the key name here. `scrapy-mongodb` will ensure the key is properly indexed. |\n| `MONGODB_BUFFER_DATA` | None | No | To ease the load on MongoDB, set this option to the number of items you want to buffer in the client before sending them to the database. Setting a `MONGODB_UNIQUE_KEY` together with `MONGODB_BUFFER_DATA` is not supported. 
|\n| `MONGODB_ADD_TIMESTAMP` | False | No | If set to True, `scrapy-mongodb` will add a timestamp key to the documents. |\n| `MONGODB_FSYNC` | False | No | If set to True, it forces MongoDB to wait for all files to be synced before returning. |\n| `MONGODB_REPLICA_SET` | None | Yes, for replica sets | Set this if you want to enable replica set support. The option should be given the name of the replica set you want to connect to. `MONGODB_URI` should point at your config servers. |\n| `MONGODB_REPLICA_SET_W` | 0 | No | Best described in the [pymongo docs][2]. Write operations will block until they have been replicated to the specified number or tagged set of servers. `w=\u003cint\u003e` always includes the replica set primary (e.g. `w=3` means write to the primary and wait until replicated to two secondaries). Passing `w=0` disables write acknowledgement and all other write concern options. |\n| `MONGODB_STOP_ON_DUPLICATE` | 0 | No | Set this to a value greater than 0 to close the spider when that number of duplicated insertions in MongoDB is detected. If set to 0, this option has no effect. |\n\n[1]: http://docs.mongodb.org/manual/reference/connection-string\n[2]: http://api.mongodb.org/python/current/api/pymongo/mongo_replica_set_client.html#pymongo.mongo_replica_set_client.MongoReplicaSetClient\n\n### Deprecated options\n*Since scrapy-mongodb 0.5.0*\n\n| **Parameter** | **Default** | **Required?** | **Description** |\n| --- | --- | --- | --- |\n| `MONGODB_HOST` | localhost | No | MongoDB host name to connect to. Use `MONGODB_URI` instead. |\n| `MONGODB_PORT` | 27017 | No | MongoDB port number to connect to. Use `MONGODB_URI` instead. |\n| `MONGODB_REPLICA_SET_HOSTS` | None | No | Host string to use to connect to the replica set. See the `hosts_or_uri` option in the pymongo docs. Use `MONGODB_URI` instead. 
|\n\n## PUBLISHING TO PYPI\n```\nmake release\n```\n\n## CHANGELOG\n[Read more here](./CHANGELOG.md).\n\n## AUTHOR\nThis project is maintained by: [Sebastian Dahlgren](http://www.sebastiandahlgren.se) ([GitHub](https://github.com/sebdah) | [Twitter](https://twitter.com/sebdah) | [LinkedIn](http://www.linkedin.com/in/sebastiandahlgren)).\n\n## LICENSE\n[Read more here](./LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebdah%2Fscrapy-mongodb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsebdah%2Fscrapy-mongodb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebdah%2Fscrapy-mongodb/lists"}