{"id":23291002,"url":"https://github.com/jplusplus/rentswatch-scraper","last_synced_at":"2025-08-21T22:31:58.546Z","repository":{"id":57461156,"uuid":"45620170","full_name":"jplusplus/rentswatch-scraper","owner":"jplusplus","description":"A basic framework to scrape renting ads.","archived":false,"fork":false,"pushed_at":"2019-10-22T18:05:20.000Z","size":58,"stargazers_count":17,"open_issues_count":2,"forks_count":3,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-11-09T08:18:30.264Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://pypi.python.org/pypi/rentswatch-scraper","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jplusplus.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-11-05T15:28:15.000Z","updated_at":"2020-05-12T20:40:22.000Z","dependencies_parsed_at":"2022-09-17T08:11:11.568Z","dependency_job_id":null,"html_url":"https://github.com/jplusplus/rentswatch-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2Frentswatch-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2Frentswatch-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2Frentswatch-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2Frentswatch-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jplusplus","download_url":"https://codeload.github.com/jplusplus/rentswatch-
scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230537085,"owners_count":18241519,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-20T05:13:49.893Z","updated_at":"2024-12-20T05:13:50.484Z","avatar_url":"https://github.com/jplusplus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Rentswatch Scraper Framework\n============================\n\nThis package provides an easy and maintainable way to build a\nRentswatch scraper. Rentswatch is a cross-border investigation that collects data on flat rents in Europe. Its scrapers mainly focus on classified ads.\n\nHow to install\n--------------\n\nInstall using ``pip``:\n\n::\n\n    pip install rentswatch-scraper\n\nHow to use\n----------\n\nLet's take a look at a quick example of using Rentswatch Scraper to\nbuild a simple model-backed scraper to collect data from a website.\n\nFirst, import the package components to build your scraper:\n\n.. code:: python\n\n    #!/usr/bin/env python\n    from rentswatch_scraper.scraper import Scraper\n    from rentswatch_scraper.browser import geocode, convert\n    from rentswatch_scraper.fields import RegexField, ComputedField\n    from rentswatch_scraper import reporting\n\nTo share as much code as possible, we created an abstract class that\nevery scraper will implement. For the sake of simplicity we'll use a\n*dummy website* as follows:\n\n.. 
code:: python\n\n    class DummyScraper(Scraper):\n        # These are the basic meta-properties that define the scraper's behavior\n        class Meta:\n            country         = 'FR'\n            site            = \"dummy\"\n            baseUrl         = 'http://dummy.io'\n            listUrl         = baseUrl + '/rent/city/paris/list.php'\n            adBlockSelector = '.ad-page-link'\n\nWithout any further configuration, this scraper will start to collect\nads from the list page of ``dummy.io``. To find links to the ads, it\nwill use the CSS selector ``.ad-page-link`` to get ``\u003ca\u003e`` elements and\nfollow their ``href`` attributes.\n\nWe now have to teach the scraper how to extract key figures from the ad\npage.\n\n.. code:: python\n\n    class DummyScraper(Scraper):\n        # HEADS UP: Meta declarations are hidden here\n        # ...\n        # ...\n\n        # Extract data using a CSS Selector.\n        realtorName = RegexField('.realtor-title')\n        # Extract data using a CSS Selector and a Regex.\n        serviceCharge = RegexField('.description-list', r'charges : (.*)\\s€')\n        # Extract data using a CSS Selector and a Regex.\n        # This will raise a custom exception if the field is missing.\n        livingSpace = RegexField('.description-list', r'surface :(\\d*)', required=True, exception=reporting.SpaceMissingError)\n        # Extract the value directly, without using a Regex\n        totalRent = RegexField('.description-price', required=True, exception=reporting.RentMissingError)\n        # Store this value as a private property (beginning with an underscore).\n        # It won't be saved in the database but it can be helpful, as you'll see.\n        _address = RegexField('.description-address')\n\nEvery attribute will be saved as an Ad's property, according to the Ad\nmodel.\n\nSome properties may not be extractable from the HTML. You may need to\nuse a custom function that receives the existing properties. 
For this reason\nwe created a second field type named ``ComputedField``. Since the\norder in which properties are declared is recorded, we can use previously\ndeclared (and extracted) values to compute new ones.\n\n.. code:: python\n\n    class DummyScraper(Scraper):\n        # ...\n        # ...\n\n        # Use existing properties `totalRent` and `livingSpace` as they were\n        # extracted before this one.\n        pricePerSqm = ComputedField(fn=lambda s, values: values[\"totalRent\"] / values[\"livingSpace\"])\n        # This full example uses private properties to find latitude and longitude.\n        # To do so we use a built-in function named `geocode` that transforms an\n        # address into a dictionary of coordinates.\n        _latLng = ComputedField(fn=lambda s, values: geocode(values['_address'], 'FRA'))\n        # Get the dictionary fields we want.\n        latitude = ComputedField(fn=lambda s, values: values['_latLng']['lat'])\n        longitude = ComputedField(fn=lambda s, values: values['_latLng']['lng'])\n\nAll you need to do now is to create an instance of your class and run\nthe scraper.\n\n.. 
code:: python\n\n    # When your script is executed directly\n    if __name__ == \"__main__\":\n        dummyScraper = DummyScraper()\n        dummyScraper.run()\n\nAPI Doc\n-------\n\n``class`` Ad\n~~~~~~~~~~~~\n\nAttributes\n^^^^^^^^^^\n\nAs seen above, every Ad attribute can be used as a Scraper attribute to declare which values to extract.\n\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| Name                 | Type                     | Description                                                               |\n+======================+==========================+===========================================================================+\n| ``status``           | *String*                 | \"listed\" if it needs more scraping, \"scraped\" if it's done                |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``site``             | *String*                 | Name of the website                                                       |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``createdAt``        | *DateTime*               | Date the ad was first scraped                                             |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``siteId``           | *String*                 | The ad's unique ID on the site it was scraped from                        |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``serviceCharge``    | *Float*                  | Extra costs (heating mostly)                                              
|\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``baseRent``         | *Float*                  | Base costs (without heating)                                              |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``totalRent``        | *Float*                  | Total cost                                                                |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``livingSpace``      | *Float*                  | Surface in square meters                                                  |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``pricePerSqm``      | *Float*                  | Price per square meter                                                    |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``furnished``        | *Bool*                   | True if the flat or house is furnished                                    |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``realtor``          | *Bool*                   | True if realtor, False if rented by a private individual                  |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``realtorName``      | *Unicode*                | The name of the realtor or person offering the flat                       |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``latitude``         | *Float*                  | Latitude  
                                                                |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``longitude``        | *Float*                  | Longitude                                                                 |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``balcony``          | *Bool*                   | True if there is a balcony/terrasse                                       |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``yearConstructed``  | *String*                 | The year the building was built                                           |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``cellar``           | *Bool*                   | True if the flat comes with a cellar                                      |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``parking``          | *Bool*                   | True if the flat comes with a parking or a garage                         |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``houseNumber``      | *String*                 | House Number in the street                                                |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``street``           | *String*                 | Street name (incl. 
\"street\")                                              |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``zipCode``          | *String*                 | ZIP code                                                                  |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``city``             | *Unicode*                | City                                                                      |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``lift``             | *Bool*                   | True if a lift is present                                                 |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``typeOfFlat``       | *String*                 | Type of flat (no typology)                                                |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``noRooms``          | *String*                 | Number of rooms                                                           |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``floor``            | *String*                 | Floor the flat is at                                                      |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``garden``           | *Bool*                   | True if there is a garden                                                 |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| 
``barrierFree``      | *Bool*                   | True if the flat is wheelchair accessible                                 |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``country``          | *String*                 | Country, two-letter code                                                  |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n| ``sourceUrl``        | *String*                 | URL of the page                                                           |\n+----------------------+--------------------------+---------------------------------------------------------------------------+\n\n\n``class`` Scraper\n~~~~~~~~~~~~~~~~~\n\nMethods\n^^^^^^^\n\nThe Scraper class defines many methods that we encourage you to\noverride in order to have full control over your scraper's behavior.\n\n+----------------------+------------------------------------------------------------------------------------------------------+\n| Name                 | Description                                                                                          |\n+======================+======================================================================================================+\n| ``extract_ad``       | Extract ads list from a page's soup.                                                                 |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``fail``             | Print out an error message.                                                                          
|\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``fetch_ad``         | Fetch a single ad page from the target website then create Ad instances by calling ``extract_ad``.   |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``fetch_series``     | Fetch a single list page from the target website then fetch an ad by calling ``fetch_ad``.           |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``find_ad_blocks``   | Extract ad blocks from a list page. Called within ``fetch_series``.                                  |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``get_ad_href``      | Extract the href attribute from an ad block. Called within ``fetch_series``.                         |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``get_ad_id``        | Extract a siteId from an ad block. Called within ``fetch_series``.                                   |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``get_fields``       | Used internally to generate a list of properties to extract from the ad.                             |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``get_series``       | Fetch a list page from the target website.                                                           
|\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``has_issue``        | True if we had issues with this ad before.                                                           |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``is_scraped``       | True if we already scraped this ad before.                                                           |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``ok``               | Print out a success message.                                                                         |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``prepare``          | Called just before saving the values.                                                                |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``run``              | Run the scraper.                                                                                     |\n+----------------------+------------------------------------------------------------------------------------------------------+\n| ``transform_page``   | Transform HTML content of the series page before parsing it.                                         |\n+----------------------+------------------------------------------------------------------------------------------------------+\n\n\nStart a migration\n-----------------\n\nUse Yoyo_ to create a new migration:\n\n::\n\n    yoyo new ./migrations -m \"Your migration's description\"\n\n\nAnd apply it:\n\n::\n\n    yoyo apply --database mysql://user:password@host/db ./migrations\n\n\n\n.. 
_Yoyo: https://pypi.python.org/pypi/yoyo-migrations\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjplusplus%2Frentswatch-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjplusplus%2Frentswatch-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjplusplus%2Frentswatch-scraper/lists"}