{"id":17345116,"url":"https://github.com/sveetch/py-website-capture","last_synced_at":"2026-04-21T05:32:42.188Z","repository":{"id":48448589,"uuid":"182283859","full_name":"sveetch/py-website-capture","owner":"sveetch","description":"A tool to make website page captures","archived":false,"fork":false,"pushed_at":"2021-07-26T00:52:32.000Z","size":414,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-01T00:59:53.001Z","etag":null,"topics":["python","screenshot","selenium","webdriver","website"],"latest_commit_sha":null,"homepage":null,"language":"CSS","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sveetch.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-19T15:15:17.000Z","updated_at":"2021-07-26T00:53:46.000Z","dependencies_parsed_at":"2022-08-24T06:10:08.635Z","dependency_job_id":null,"html_url":"https://github.com/sveetch/py-website-capture","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sveetch/py-website-capture","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sveetch%2Fpy-website-capture","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sveetch%2Fpy-website-capture/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sveetch%2Fpy-website-capture/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sveetch%2Fpy-website-capture/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sveetch","download_url":"https://codeload.github.com/sve
etch/py-website-capture/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sveetch%2Fpy-website-capture/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32078833,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-21T02:38:07.213Z","status":"ssl_error","status_checked_at":"2026-04-21T02:38:06.559Z","response_time":128,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","screenshot","selenium","webdriver","website"],"created_at":"2024-10-15T16:29:26.529Z","updated_at":"2026-04-21T05:32:42.170Z","avatar_url":"https://github.com/sveetch.png","language":"CSS","funding_links":[],"categories":[],"sub_categories":[],"readme":"Website capture\n===============\n\nA tool able to capture content from web pages.\n\nIt implements a high level interface to capture content (like screenshot,\nlogs, etc..) 
from a page using the famous Selenium library.\n\nRequires\n********\n\n* Python\u003e=3.4;\n* Virtualenv;\n* Pip;\n* `Selenium \u003chttps://pypi.org/project/selenium/\u003e`_;\n* A browser and its `WebDriver \u003chttps://developer.mozilla.org/en-US/docs/Web/WebDriver\u003e`_;\n\nInstall\n*******\n\nClone the repository and install it as a project: ::\n\n    git clone https://github.com/sveetch/py-website-capture\n    cd py-website-capture\n    make install\n\nThe ``py-website-capture`` package is not yet released on PyPI, so to\ninstall it with pip you will need to do something like: ::\n\n    pip install git+https://github.com/sveetch/py-website-capture.git#egg=py_website_capture\n\nHowever, installed this way it is only usable as a Python module; you won't\nhave the command line requirements.\n\nTo get a working command line you will need to do this instead: ::\n\n    pip install git+https://github.com/sveetch/py-website-capture.git#egg=py_website_capture[cli]\n\nOnce done, see below to install a working driver for the required browsers.\n\nInstall drivers\n***************\n\nYou will need to install a driver for each browser you want to use.\n\nDepending on your browser version you may need to install a different driver\nversion; refer to the driver documentation to find information about\nreleases and compatibility.\n\nCommonly, a driver has to be installed at system level in a common binaries\npath so it can be found automatically without setting an environment variable\nor option.\n\nOnce installed on your system, you won't need to reinstall it unless\nyour browser updates to a version incompatible with the installed driver.\n\ngeckodriver\n-----------\n\nYou need to have the Firefox browser installed.\n\nHere are sample commands to quickly download and deploy the driver on your system: ::\n\n    wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz\n    tar xvzf geckodriver-v0.24.0-linux64.tar.gz\n    chmod +x geckodriver\n    sudo mv 
geckodriver /usr/local/bin\n    rm -f geckodriver-v0.24.0-linux64.tar.gz\n\nLinks:\n\n* `\u003chttps://firefox-source-docs.mozilla.org/testing/geckodriver/\u003e`_;\n* `\u003chttps://github.com/mozilla/geckodriver/releases\u003e`_;\n* `\u003chttps://askubuntu.com/questions/870530/how-to-install-geckodriver-in-ubuntu\u003e`_;\n* `\u003chttps://pypi.org/project/selenium/#drivers\u003e`_;\n\nchromedriver\n------------\n\nYou need to have the Chrome (or Chromium) browser installed.\n\nHere are sample commands to quickly download and deploy the driver on your system: ::\n\n    wget https://chromedriver.storage.googleapis.com/74.0.3729.6/chromedriver_linux64.zip\n    unzip chromedriver_linux64.zip\n    chmod +x chromedriver\n    sudo mv chromedriver /usr/local/bin\n    rm -f chromedriver_linux64.zip\n\nLinks:\n\n* `\u003chttp://chromedriver.chromium.org/\u003e`_;\n* `\u003chttps://pypi.org/project/selenium/#drivers\u003e`_;\n\nGoing full headless\n-------------------\n\nEven if drivers have a ``headless`` mode, it only implies that browsers are not\ndisplayed when a WebDriver is performing requests. 
You will still need to have\na desktop environment to run a browser, which is not desirable on a server.\n\nTo be able to use this project on a server you may look at the ``Xvfb`` tool.\n\n* `\u003chttps://en.wikipedia.org/wiki/Xvfb\u003e`_;\n* `\u003chttp://elementalselenium.com/tips/38-headless\u003e`_;\n* `\u003chttp://tobyho.com/2015/01/09/headless-browser-testing-xvfb/\u003e`_;\n* `\u003chttps://github.com/ponty/pyvirtualdisplay\u003e`_;\n\nUsage\n*****\n\nCommand line interface\n----------------------\n\nActivate the virtual environment: ::\n\n    source .venv/bin/activate\n\nThen you can call the command line interface, for example to get the program\nversion: ::\n\n    website-capture version\n\nYou may also directly reach the command line interface without activating\nthe virtual environment: ::\n\n    .venv/bin/website-capture version\n\nTo read help about the program and available commands: ::\n\n    website-capture -h\n\nTo read full help about a command, here the ``version`` command: ::\n\n    website-capture version -h\n\nTo launch captures from a job configuration file ``sample.json``: ::\n\n    website-capture capture --interface firefox --config sample.json\n\nThe ``--interface`` argument is not required, but by default it uses the dummy\ninterface which does nothing; this is just for development debugging.\nSee the ``capture`` command help for available interfaces.\n\nThe ``--config`` argument is required and must be a path to an existing and valid\nJSON configuration file.\n\nConfiguration file\n------------------\n\nA configuration file in JSON is required to perform tasks; it contains\nthe interface settings to use and the pages to capture.\n\nHere is a sample: ::\n\n    {\n        \"output_dir\": \"/home/foo/outputs/\",\n        \"size_dir\": true,\n        \"headless\": true,\n        \"pages\": [\n            {\n                \"name\": \"perdu.com\",\n                \"url\": \"http://perdu.com/\",\n                \"screenshot_method\": \"body\",\n                
\"processors\": [\n                    \"website_capture.processors.DummyProcessor\",\n                    \"website_capture.processors.ProcessorBase\"\n                ],\n                \"tasks\": [\n                    \"processing\",\n                    \"screenshot\",\n                    \"report\"\n                ]\n            },\n            {\n                \"name\": \"google.com\",\n                \"url\": \"https://www.google.com/\",\n                \"sizes\": [\n                    [330, 768],\n                    [1440, 768]\n                ],\n                \"tasks\": [\n                    \"screenshot\"\n                ]\n            }\n        ]\n    }\n\noutput_dir\n    Required path where files will be saved.\nsize_dir\n    Optional boolean to enable or disable adding the size name as a\n    subdirectory of ``output_dir`` when saving files, according to the size at\n    which they are captured. Default behavior is to enable it.\nheadless\n    Optional boolean to enable or disable headless mode for the interface.\n    When enabled, the browser won't display on your screen; when disabled, the\n    browser will show while the capture is performed, then it will\n    automatically close once finished. Default behavior is to enable it.\npages\n    List of page items to capture, see next section for details.\n\nPage item\n.........\n\nEach item must have ``name`` and ``url``\nvalues. Optionally you can define a ``sizes`` value which is a list of\nwindow sizes to use during capture; every size will create a new file. This\nis recommended since the default size depends on the interface and is often too\nsmall.\n\nEach item may have the following options:\n\nname\n    Required name used to display the page in logs and possibly used in the\n    destination filename.\nurl\n    Required URL to load to perform the capture.\nsizes\n    Optional list of sizes which the browser will adopt, each one will perform a\n    new capture for the given size. 
Each size is a list of two items, respectively\n    for width and height. If no sizes are defined the default size from the driver\n    is used; this is not recommended since each driver has its own size which\n    is often odd. If needed you can add the default size with value ``[0, 0]``.\nfilename\n    Optional filename to be used as the base filepath for resulting files from\n    tasks. Each task will then suffix this base filepath with its extension(s).\n\n    When undefined, default behavior is to use the filename\n    format from the interface class, which commonly contains size, page name and\n    interface name. The filename can be formatted with some patterns according to\n    page configuration, like ``{name}``, ``{size}``, ``{url}``.\ntasks\n    A list of tasks to perform for this page. Available tasks are:\n\n    * ``screenshot``: will create an image file of the page screenshot;\n    * ``processing``: will perform some tasks on the page from additional modules;\n    * ``report``: will create a JSON file to report captured logs from the page;\n\n    Although it's an optional argument, it is not really useful to define a\n    page job without it since it won't do anything except initialize the driver.\n\n    Also the order does matter: ``report`` should always be the last item so it\n    can get every log from possible previous tasks. ``screenshot``\n    should be the first one if you don't use ``processing`` or if your\n    processors don't alter the page.\nscreenshot_method\n    Optional method to perform the screenshot. 
It can be either ``body`` or\n    ``window``; default when not defined is ``body``.\n\n    * ``body`` method will capture content from the ``\u003cbody\u003e`` element; it means\n      content is rendered at browser size but the screenshot image will\n      probably be smaller or bigger than window size depending on content size;\n    * ``window`` method will strictly respect browser size; if content is\n      bigger it will be cut off from the screenshot and if smaller you will have\n      empty space in the resulting image. You may also have the window scrollbar added\n      or removed from the image depending on content and browser.\nprocessors\n    A list of Python paths to processor objects; they will be executed one after\n    another given the page content (which could be altered by possible\n    previous processors). The last part of a path must be the processor object to run\n    and everything before is the module path to reach the object.\n\n\nProcessors\n..........\n\n**TODO: Currently in development**\n\nProcessors are objects that perform custom jobs you can code on your own.\n\nAvailable ``processors`` are defined in the page options as a list of Python paths\nand their execution is enabled when ``processing`` is in the tasks list.\n\nFor example, this is the base processor: ::\n\n    class ProcessorBase(object):\n        \"\"\"\n        Basic processor don't do anything except exposing required methods\n        signatures.\n\n        Attributes:\n            name (string): Processor name used to store its report datas or\n                logging possible events. 
Each processor must have an unique name.\n        \"\"\"\n        name = \"basic\"\n\n        def __init__(self, *args, **kwargs):\n            pass\n\n        def run(self, driver, config, response):\n            \"\"\"\n            This is where your processor should perform its work and possibly\n            returns datas to append to processor reports which will be stored with\n            processor name.\n            \"\"\"\n            return None\n\n\nKnown issues\n************\n\n* The Firefox report task is not able to get console logs, only JavaScript errors;\n* When doing a screenshot with the ``body`` method with the Chrome browser, if content\n  width and height are bigger than the browser size, the horizontal scrollbar will be\n  included at the bottom of the browser size. This seems to be a bug in the Chrome driver.\n\nDevelopment\n***********\n\nThe project is developed with tests; for convenience they are split into two\ndistinct directories.\n\nOne covers the core interface and can be run once the project is installed,\nand the other is dedicated to covering the webdriver interfaces.\n\nThe latter requires that you have installed every implemented driver (and\nits related browser) and that you are running the demo server which you can find in\nthe ``page_tests`` directory; it has its own Makefile to install its requirements.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsveetch%2Fpy-website-capture","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsveetch%2Fpy-website-capture","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsveetch%2Fpy-website-capture/lists"}