{"id":13697280,"url":"https://github.com/evolvingweb/sitediff","last_synced_at":"2025-05-03T19:33:07.402Z","repository":{"id":21282961,"uuid":"24598926","full_name":"evolvingweb/sitediff","owner":"evolvingweb","description":"SiteDiff makes it easy to see differences between two versions of a website.","archived":false,"fork":false,"pushed_at":"2024-08-22T23:18:32.000Z","size":1187,"stargazers_count":232,"open_issues_count":14,"forks_count":50,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-03T16:19:38.509Z","etag":null,"topics":["comparison","diff","html","sanitization"],"latest_commit_sha":null,"homepage":"http://sitediff.io","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/evolvingweb.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-09-29T14:39:19.000Z","updated_at":"2025-02-26T17:52:26.000Z","dependencies_parsed_at":"2024-06-17T15:58:36.185Z","dependency_job_id":"9f1ce609-2cb9-4737-9f46-f45dc275e9b0","html_url":"https://github.com/evolvingweb/sitediff","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evolvingweb%2Fsitediff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evolvingweb%2Fsitediff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evolvingweb%2Fsitediff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evolvingweb%2Fsitediff/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/evolvingweb","download_url":"https://codeload.github.com/evolvingweb/sitediff/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252242281,"owners_count":21717135,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["comparison","diff","html","sanitization"],"created_at":"2024-08-02T18:00:55.012Z","updated_at":"2025-05-03T19:33:07.022Z","avatar_url":"https://github.com/evolvingweb.png","language":"HTML","readme":"# SiteDiff CLI\n\n**Warning:** SiteDiff 1.2.0 requires at least Ruby 3.1.2.\n\n**Warning:** SiteDiff 1.0.0 introduces some backwards incompatible changes.\n\n[![Build Status](https://travis-ci.org/evolvingweb/sitediff.svg?branch=master)](https://travis-ci.org/evolvingweb/sitediff)\n\n## Table of contents\n\n- [Introduction](#introduction)\n- [Installation](#installation)\n- [Demo](#demo)\n- [Usage](#usage)\n    - [Getting Started](#getting-started)\n    - [Comparing 2 Sites](#comparing-2-sites)\n    - [Spurious Diffs](#spurious-diffs)\n- [Command Line Options](#command-line-options)\n    - [Finding Configuration Files](#finding-configuration-files)\n    - [Specifying Paths](#specifying-paths)\n    - [Debugging Rules](#debugging-rules)\n    - [Including and Excluding URLs](#including-and-excluding-urls)\n    - [Paths and Paths-file](#paths--paths-file)\n    - [Report Export](#export)\n    - [Running inside containers](#running-inside-containers)\n- [Configuration](#configuration)\n    - [before_url / after_url](#before_url--after_url)\n    - 
[selector](#selector)\n    - [sanitization](#sanitization)\n    - [ignore_whitespace](#ignore_whitespace)\n    - [remove_html_comments](#remove_html_comments)\n    - [before / after](#before--after)\n    - [includes](#includes)\n    - [dom_transform](#dom_transform)\n        - [remove](#remove)\n        - [strip](#strip)\n        - [unwrap](#unwrap)\n        - [remove_class](#remove_class)\n        - [unwrap_root](#unwrap_root)\n    - [Organizing configuration files](#organizing-configuration-files)\n    - [Named regions](#named-regions)\n    - [report](#report)\n        - [title](#title)\n        - [details](#details)\n        - [before_note](#before_note)\n        - [after_note](#after_note)\n        - [before_url_report / after_url_report](#before_url_report--after_url_report)\n    - [Miscellaneous](#miscellaneous)\n        - [preset](#preset)\n        - [Include / Exclude Paths](#includeexclude-paths)\n    - [Curl Options](#curl-options)\n        - [Throttling](#throttling)\n        - [Timeouts](#timeouts)\n        - [Handling security](#handling-security)\n        - [interval](#interval)\n        - [concurrency](#concurrency)\n        - [depth](#depth)\n        - [curl_opts](#curl_opts)\n- [Tips and Tricks](#tips-and-tricks)\n    - [Removing empty elements](#removing-empty-elements)\n    - [HTML Tag Formatting](#html-tag-formatting)\n    - [Empty Attributes](#empty-attributes)\n- [Acknowledgements](#acknowledgements)\n\n## Introduction\nSiteDiff makes it easy to see how a website changes. It can compare two similar\nsites or it can show how a single site changed over time. It helps identify\nundesirable changes to the site's HTML and it's a useful tool for conducting QA\non re-deployments, site upgrades, and more!\n\nWhen you run SiteDiff, it produces an HTML report showing whether pages on\nyour site have changed or not. 
For pages that have changed, you can see a\ncolorized diff of exactly what changed, or compare the visual differences\nside-by-side in a browser.\n\nSiteDiff supports a range of normalization / sanitization rules. These allow\nyou to eliminate spurious differences, narrowing down differences to the ones\nthat materially affect the site.\n\n## Installation\n\nSiteDiff is fairly easy to install. Please refer to the\n[installation docs](INSTALLATION.md).\n\n## Demo\n\nAfter installing all dependencies including the `bundle` version 2 gem, you can quickly\nsee what SiteDiff can do. Simply use the following commands:\n\n```sh\ngit clone https://github.com/evolvingweb/sitediff\ncd sitediff\nbundle install\nbundle exec thor fixture:serve\n```\n\nThen visit `http://localhost:13080/` to view the report.\n\nSiteDiff shows you an overview of all the pages and clearly indicates which\npages have changed and which have not.\n![page report preview](misc/sitediff%20-%20overview%20report.png?raw=true)\n\nWhen you click on a changed page, you see a colorized diff of the page's markup\nshowing exactly what changed on the page.\n![page report preview](misc/sitediff%20-%20page%20report.png?raw=true)\n\n## Usage\n\nHere are some instructions on getting started with SiteDiff. 
To see a list of\ncommands that SiteDiff offers, you can run:\n\n```sitediff help```\n\nTo get help for a particular command, say, `diff`, you can run:\n\n```sitediff help diff```\n\n### Getting started\n\nTo use SiteDiff on your site, create a configuration for your site:\n\n```sitediff init http://mysite.example.com```\n\nSiteDiff will generate a configuration file named `sitediff.yaml` by default.\n\nYou can open the configuration file ```sitediff/sitediff.yaml``` to see the\ndefault configuration generated by SiteDiff.\nThe [configuration reference](#configuration) section explains the contents\nof this file and helps you customize it as per your requirements.\n\nThen get SiteDiff to crawl your site by using:\n\n```sitediff crawl```\n\nSiteDiff will then crawl your site, finding pages and caching their\ncontents. A list of discovered paths will be saved to a `paths.txt` file.\n\nNow, you can make alterations to your site. For example, change a word on your\nsite's front page. After you're done, you can check what actually changed:\n\n```sitediff diff```\n\nFor each page, SiteDiff will report whether it did or did not change. For pages\nthat changed, it will display a diff. You can also see an HTML version of the\nreport using the following command:\n\n```sitediff serve```\n\nSiteDiff will start an internal web server and open a report page in your\nbrowser. For each page, you can see the diff and a side-by-side view of the\nold and new versions.\n\nYou can now see if the changes were as you expected, or if some things didn't\nquite work out as you hoped. If you notice unexpected changes, congratulations:\nSiteDiff just helped you find an issue you would have otherwise missed!\n\nAs you fix any issues, you can continue to alter your site and run\n```sitediff diff``` to check the changes against the old version. 
Once you're\nsatisfied with the state of your site, you can inform SiteDiff that it should\nre-cache your site:\n\n```sitediff store```\n\nThis takes a snapshot of your website and the next time you run\n```sitediff diff```, it will use this new version as the reference for\ncomparison.\n\nHappy diffing!\n\n### Comparing 2 sites\n\nSometimes you have two sites that you want to compare, for example a production\nsite hosted on a public server and a development site hosted on your computer.\nSiteDiff can handle this situation, too! Just inform SiteDiff that there are\ntwo sites to compare:\n\n```sitediff init http://mysite.example.com http://localhost/mysite```\n\nThen when you run `sitediff diff`, it will compare the cached version of the\nfirst site with the current version of the second site.\n\nIf both the first and second sites may be changing, you should tell SiteDiff\nnot to cache either site:\n\n```sitediff diff --cached=none```\n\n### Spurious diffs\n\nSometimes sites have spurious differences, that you don't want to show up in a\ncomparison. For example, many sites protect against Cross-Site Request Forgery\nusing a [semi-random token](http://en.wikipedia.org/wiki/Cross-site_request_forgery#Synchronizer_token_pattern).\nSince this token changes on each HTTP GET, you probably don't care about such\na change.\n\nTo help with issues such as this, SiteDiff allows you to normalize the HTML it\nfetches as it compares pages. 
In the ```sitediff.yaml``` configuration file,\nyou can add \"sanitization rules\", which specify either DOM transformations or\nregular expression substitutions.\n\nHere's an example of a rule you might add to remove CSRF-protection tokens\ngenerated by Django:\n\n```yaml\ndom_transform:\n  - title: Remove CSRF tokens\n    type: remove\n    selector: input[name=csrfmiddlewaretoken]\n```\n\nYou can use one of the presets to apply framework-specific sanitization.\nCurrently, SiteDiff only comes with Drupal-specific presets.\n\nSee the [preset](#preset) section for more details.\n\n## Command Line Options\n\n### Finding configuration files\n\nBy default SiteDiff will put everything in the `sitediff` folder. You can use\nthe `--directory` flag (`-C` in the examples below) to specify a different directory.\n\n```bash\nsitediff init -C my_project_folder https://example.com\nsitediff diff -C my_project_folder\nsitediff serve -C my_project_folder\n```\n\n### Specifying paths\n\nWhen you run ```sitediff diff```, you can specify which pages to look at in\n2 ways:\n\n1. The option ```--paths /foo /bar ...```.\n\n   If you're trying to fix one page in particular, specifying just that one\n   path will make ```sitediff diff``` run quickly!\n\n2. The option ```--paths-file FILE``` with a newline-delimited text file.\n\nThis is particularly useful when you're trying to eliminate all diffs.\nSiteDiff creates a file ```sitediff/failures.txt``` containing all paths\nwhich had differences, so as you try to fix differences, you can run:\n\n```sitediff diff --paths-file sitediff/failures.txt```\n\n### Debugging rules\n\nWhen a sanitization rule isn't working quite right for you, you might run\n`sitediff diff` many times over. If fetching all the pages is taking too long,\ntry adding the option ```--cached=all```. 
This tells SiteDiff not to re-fetch\nthe content, but just compare previously cached versions — it's a lot faster!\n\n### Including and Excluding URLs\n\nBy default sitediff crawls pages that are indicated with an HTML anchor using\nthe `\u003cA HREF` syntax. Most pages linked will be HTML pages, but some links\nwill contain binaries such as PDF documents and images.\n\nUsing the option `--exclude='.*\\.pdf'` ensures the crawler skips links\nfor document with a `.pdf` extension. Note that the regular expression is\napplied to the path of the URL, not the base of the URL.\n\nFor example `--include='.*\\.com'` will not match `http://www.google.com/`,\nbecause the path of that URL is `/` while the base is `www.google.com`.\n\n### paths / paths-file\n\nSiteDiff allows you to specify a list of paths that you want it to work with.\nAlternatively, it can crawl the entire site and detect all paths.\n\n* Running `sitediff init` configures SiteDiff for crawling and seeing differences.\n\n* Running `sitediff crawl` makes sitediff crawl your site and detect\n  available paths. These paths are written to a `paths.txt` file which you\n  can modify according to your needs.\n\n* You can also compute diffs only for paths specified in a custom paths file\n  using the `--paths-file` parameter. This file should contain paths starting\n  with a `/`, having one path per line.\n\n  ```\n  sitediff diff --paths-file=/path/to/paths.txt\n  ```\n\n* You can also compute diffs for a handful of specific paths by specifying\n  them directly on the command line using the `--paths` parameter. 
Each path\n  should be separated by a space.\n\n  ```\n  sitediff diff --paths=/home /about /contact\n  ```\n\n### export\nGenerate a gzipped tar file containing the HTML report instead of generating\nand serving live web pages. This option overrides `--report-format`, forcing\nHTML.\n\n```\nsitediff diff --export\nsitediff diff -e\n```\n\nThis will perform the diff and export the results in a gzipped tar file.\n\n### Running inside containers\n\nIf you run SiteDiff inside a container or virtual machine, the URLs in its\nreport, such as ```localhost```, might not work from your host. You can fix\nthis by using the ```--before-url-report``` and ```--after-url-report```\noptions to tell SiteDiff to use a different URL in the report than the one\nit uses for fetching.\n\nFor example, if you ran `sitediff init http://mysite.com http://localhost`\ninside a [Vagrant](https://www.vagrantup.com/) VM, you might then run\nsomething like:\n\n```sitediff diff --after-url-report=http://vagrant:8080```\n\n## Configuration\n\nSiteDiff relies on a [YAML](http://yaml.org/) configuration file, usually\ncalled `sitediff.yaml`. You can create a reasonable one using `sitediff init`,\nbut there are many useful things you may want to add or change manually.\n\nIn the `sitediff.yaml`, SiteDiff recognizes the keys described below. The\n`config` directory contains some example `sitediff.yaml` files, for example\n[sitediff.example.yaml](config/sitediff.example.yaml).\n\n### before_url / after_url\n\n```yaml\nbefore_url: http://example.com/subsite\nafter_url: http://localhost:8080/subsite\n```\n\nThey can also be paths to directories on the local filesystem.\n\nThe `after_url` MUST be provided either on the command line or in the\n`sitediff.yaml`. If the `before_url` is provided, SiteDiff will compare the\ntwo sites. 
Otherwise, it will compare the current version of the `after` site\nwith the stored version of that site, as created by `sitediff init` or\n`sitediff store`.\n\n### selector\n\nChooses the sections of HTML we wish to compare, if you don't\nwant to compare the entire page. For example if you only want to compare\nbreadcrumbs between your two sites, you might specify:\n\n```yaml\nselector: '#breadcrumb'\n```\n\n### sanitization\n\nA list of regular expression rules to normalize your HTML for comparison.\n\nEach rule should have a **pattern** regex, which is used to search the HTML.\nEach found instance is replaced with the provided **substitute** or deleted\nif no substitute is provided.  A rule may also have a **selector**, which\nconstrains it to operate only on HTML fragments which match that CSS selector.\n\nFor example, forms on Drupal sites have a randomly generated `form_build_id`\non form pages:\n\n```html\n\u003cinput type=\"hidden\" name=\"form_build_id\" value=\"form-1cac6b5b6141a72b2382928249605fb1\"/\u003e\n```\n\nWe're not interested in comparing random content, so we could use the\nfollowing rule to fix this:\n\n```yaml\nsanitization:\n# Remove form build IDs\n  - pattern: '\u003cinput type=\"hidden\" name=\"form_build_id\" value=\"form-[a-zA-Z0-9_-]+\" *\\/?\u003e'\n    selector: 'input'\n    substitute: '\u003cinput type=\"hidden\" name=\"form_build_id\" value=\"__form_build_id__\"\u003e'\n```\n\nSanitization rules may also have a **path** attribute, whose value is a\nregular expression. If present, the rule will only apply to matching paths.\n\n### ignore_whitespace\nIgnore whitespace when doing the diff. This passes the `-w` option to the native OS `diff` command.\n\n```yaml\nignore_whitespace: true\n```\n\nOn the command line, use `-w` or `--ignore-whitespace`.\n\n```bash\nsitediff diff -w\n```\n\n### remove_html_comments\nRemove HTML comments from a page.  
Useful for Drupal sites with the theme debugging enabled.\n\nOn the command line, use `-c` or `--remove-html-comments`.\n\n```bash\nsitediff diff -c\n```\n\n### before / after\n\nApplies rules to just one side of the comparison.\n\nThese blocks can contain any of the following sections: `selector`,\n`sanitization`,  `dom_transform`. Such a section placed in `before` will be\napplied just to the `before` side of the comparison and similarly for `after`.\n\nFor example, if you wanted to let different date formatting not create diff\nfailures, you might use the following:\n\n```yaml\nbefore:\n  sanitization:\n    - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'\n      substitute: '__date__'\nafter:\n  sanitization:\n    - pattern:  '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'\n      substitute: '__date__'\n```\n\nThe above rule will replace dates of the form `2004/12/05` in `before` and\ndates of the form `May 12th 2004` in `after` with `__date__`.\n\n### includes\n\nThe names of other configuration YAML files to merge with this one.\n\n```yaml\nincludes:\n  - config/sanitize_domains.yaml\n  - config/strip_css_js.yaml\n```\n\n### dom_transform\n\nA list of transformations to apply to the HTML before comparing.\n\nThis is similar to _sanitization_, but it applies transformations to the\nstructure of the HTML, instead of to the text. Each transformation has a\n**type**, and potentially other attributes. The following types are available:\n\n#### remove\n\nGiven a **selector**, removes all elements that match it.\n\nFor example, say we have a block containing the current time, which is\nexpected to change. To ignore that, we might choose to delete the block\nbefore comparison:\n\n```yaml\ndom_transform:\n# Remove current time block\n  - type: remove\n    selector: div#block-time\n```\n\n#### strip\n\nStrip leading and trailing whitespace from the contents of a tag.\n\nUses the Ruby string `strip()` method. 
Whitespace is defined as any of the\nfollowing characters: null, horizontal tab, line feed, vertical tab, form\nfeed, carriage return, space.\n\nTo transform `\u003ch1\u003e  Foo and Bar\\n  \u003c/h1\u003e` to `\u003ch1\u003eFoo and Bar\u003c/h1\u003e`:\n\n```yaml\ndom_transform:\n# Strip whitespace inside H1 tags\n  - type: strip\n    selector: h1\n```\n\n#### unwrap\n\nGiven a **selector**, replaces all matching elements with\ntheir children. For example, your content on one side of the comparison might\nlook like this:\n\n```html\n\u003cp\u003eThis is some text\u003c/p\u003e\n\u003cimg src=\"lola.png\" alt=\"Lola is a cute kitten.\" /\u003e\n```\n\nBut on the other side, it might be wrapped in an `article` tag:\n```html\n\u003carticle\u003e\n  \u003cp\u003eThis is some text\u003c/p\u003e\n  \u003cimg src=\"lola.png\" alt=\"Lola is a cute kitten.\" /\u003e\n\u003c/article\u003e\n```\n\nYou could fix it with the following configuration:\n\n```yaml\ndom_transform:\n  - type: unwrap\n    selector: article\n```\n\n#### remove_class\n\nGiven a **selector** and a **class**, removes that class\nfrom each element that matches the selector. 
It can also take a list of\nclasses, instead of just one.\n\nFor example, here are two sample rules for removing a single class and\nremoving multiple classes from all `div` elements:\n\n```yaml\ndom_transform:\n  # Remove class-foo from div elements\n  - type: remove_class\n    selector: div\n    class: class-foo\n  # Remove class-bar and class-baz from div elements\n  - type: remove_class\n    selector: div\n    class:\n      - class-bar\n      - class-baz\n```\n\n#### unwrap_root\n\nReplaces the entire root element with its children.\n\n### report\n\nThe settings under the `report` key allow you to display helpful details on the report.\n\n```yaml\nreport:\n  title: \"Updates to example.com\"\n  details: \"This report verifies updates to example.com.\"\n  before_note: \"The old site\"\n  after_note: \"The new site\"\n  before_url_report: http://example.com\n  after_url_report: http://staging.example.com\n```\n\n#### title\n\nDisplay a title string at the top of the report.\n\n#### details\n\nText displayed as a paragraph at the top of the report, below the title.\n\n#### before_note\n\nDisplay a brief explanatory note next to the `before` URL.\n\n#### after_note\n\nDisplay a brief explanatory note next to the `after` URL.\n\n#### before_url_report / after_url_report\n\nChange how SiteDiff reports which URLs it is comparing, without changing what\nit actually compares.\n\nSuppose you are serving your 'after' website on a virtual machine with\nIP 192.168.2.3, and you are also running SiteDiff inside that VM. To make links\nin the report accessible from outside the VM, you might provide:\n\n```yaml\nafter_url: http://localhost\nreport:\n  after_url_report: http://192.168.2.3\n```\n\nIf you don't wish to have the \"Before\" or \"After\" links in the report, set the corresponding option to false:\n\n```yaml\nreport:\n  after_url_report: false\n```\n\n### Miscellaneous\n\n#### preset\n\nPresets are stored in the `/lib/sitediff/presets` directory of this gem. 
You\ncan select a preset as follows:\n\n```yaml\nsettings:\n  preset: drupal\n```\n\n#### Include/Exclude Paths\n\n##### exclude paths\n\nA RegEx indicating the paths that should not be crawled.\n\n##### include paths\n\nA RegEx indicating the paths that should be crawled.\n\n### Organizing configuration files\n\nIf your configuration file starts getting really big, SiteDiff lets you\nseparate it out into multiple files. Just have one base file that includes\nother files:\n\n```yaml\nincludes:\n  - sanitization.yaml\n  - paths.yaml\n```\n\nThis allows you to separate your configuration into logical groups.\nFor example, generic rules for your site could live in a `generic.yaml` file,\nwhile rules pertaining to a particular update you're conducting could\nlive in `update-8.2.yaml`.\n\n### Named regions\n\nIn major upgrades and migrations where there are significant changes to the markup,\nsimple diffs will not be of much value. To assist in these cases, `named\nregions` let you define regions in the page markup and specify the order in which\nthey should be compared. 
Specifying the order helps in cases where the fields are\nnot in the same order on the new site.\n\nFor example, if you have a CMS displaying `title`, `author`, and `body` fields, you\ncould define the named regions and the selectors for the three fields as follows:\n\n```yaml\n  regions:\n    - name: title\n      selector: h1.title\n    - name: author\n      selector: .field-name-attribution\n    - name: body\n      selector: .field-name-body\n```\n\n(You need to define `regions` for both the `before` and `after` sections.)\n\nYou must then define the order in which the fields should be compared, using the\n`output` key.\n\n```yaml\noutput:\n  - title\n  - author\n  - body\n```\n\nBefore the two versions are compared, SiteDiff generates markup with\n`\u003cregion\u003e` tags and each `region` contains the markup matching the\ncorresponding selector.\n\nEG:\n\n```html\n\u003cregion id=\"title\"\u003e\n  \u003ch1 class=\"title\"\u003eMy Blog Post\u003c/h1\u003e\n\u003c/region\u003e\n\u003cregion id=\"author\"\u003e\n  \u003cdiv class=\"field-name-attribution\"\u003e\n    \u003cspan class=\"label\"\u003eBy:\u003c/span\u003e Alfred E. Neuman\n  \u003c/div\u003e\n\u003c/region\u003e\n\u003cregion id=\"body\"\u003e\n  \u003cdiv class=\"field-name-body\"\u003e\n    \u003cp\u003eLorem ipsum...\u003c/p\u003e\n  \u003c/div\u003e\n\u003c/region\u003e\n```\n\nThe regions are processed first, so you can reference the `\u003cregion\u003e` tags to\nbe more specific in your selectors for `dom_transform` and `sanitization`\nsections.\n\nEG:\n\n```yaml\ndom_transform:\n  - name: Remove body div wrapper\n    type: unwrap\n    selector: region#body .field-name-body\n```\n\n### Curl Options\n\n[Many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html) can be\npassed to the underlying curl library. 
Add `--curl_options=name1:value1 name2:value2`\nto the command line, such as `--curl_options=max_recv_speed_large:100000`\n(remove the `CURLOPT_` prefix and write the name in lowercase), or add them to\nyour configuration file.\n\n```yaml\nsettings:\n  curl_opts:\n    max_recv_speed_large: 10000\n    ssl_verifypeer: false\n```\n\nThese CURL options can be put under the `settings` section of `sitediff.yaml`\nas demonstrated above.\n\n#### Throttling\n\nA few options are also available to control how aggressively SiteDiff crawls.\n\n- There's a command line option `--concurrency=N` for `sitediff init`\n  which controls the maximum number of simultaneous connections made.\n  A lower N means less aggressive crawling. The default is 3. You can specify this in the\n  `sitediff.yaml` file under the `settings` key.\n\n- The underlying curl library has [many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)\n  such as `max_recv_speed_large` which can be helpful.\n\n- There is a special command line option `--interval=T` for `sitediff init`.\n  This option allows the fetcher to delay for T milliseconds between\n  fetching pages. You can specify this in the `sitediff.yaml` file under the\n  `settings` key.\n\n#### Timeouts\n\nBy default, no timeout is set, but one can be added with `--curl_options=timeout:60`\nor in your configuration file.\n\n  ```yaml\n  settings:\n    curl_opts:\n      timeout: 60 # In seconds; or...\n      timeout_ms: 60000 # In milliseconds.\n  ```\n\n#### Handling security\n\nOften development or staging sites are protected by [HTTP Authentication](http://en.wikipedia.org/wiki/Basic_access_authentication).\nSiteDiff allows you to specify a username and password, by using a URL like\n`http://user:pass@example.com` or by adding a `userpwd` setting to your file.\n\nSiteDiff ignores untrusted certificates by default. 
This is equivalent to the following settings (the `userpwd` line additionally shows how to supply credentials):\n\n```yaml\nsettings:\n  curl_opts:\n    ssl_verifypeer: false\n    ssl_verifyhost: 0\n    userpwd: \"username:password\"\n```\n\nThe `settings` key contains various parameters which affect the way SiteDiff works. You can\nhave the following keys under `settings`.\n\n#### interval\nAn integer indicating the number of milliseconds SiteDiff should wait\nbetween requests.\n\n#### concurrency\nThe maximum number of simultaneous requests that SiteDiff should make.\n\n#### depth\n\nThe depth to which SiteDiff should crawl the website. Defaults to 3,\nwhich means 3 levels deep.\n\n#### curl_opts\n\nOptions to pass to the underlying curl library. Remove the `CURLOPT_` prefix in\nthis [full list of options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)\nand write in lowercase. Useful for throttling.\n\n```yaml\nsettings:\n  curl_opts:\n    connecttimeout: 3\n    followlocation: true\n    max_recv_speed_large: 10000\n```\n\n## Tips and Tricks\n\nHere are some tips and tricks that we've learned using SiteDiff:\n\n- Use single quotes or double quotes around selectors.  Remember that the `#` is a comment in YAML.\n- Be specific enough with selectors to not affect elements on other pages.\n\n### Removing Empty Elements\n\nIf you have an empty `\u003cp/\u003e` tag appearing in the diff, you can write the following in your sanitization lists:\n```yaml\n  - name: remove_empty_p\n    pattern: '\u003cp/\u003e'\n    substitute: ''\n```\n\n### HTML Tag Formatting\n\nThere are times when the HTML tags do not have newlines between them on one of the sites you wish to compare.  In this\ncase, these sanitization rules are useful:\n```yaml\n  - name: remove_space_before\n    pattern: '\\s*(\\n)\u003c'\n    substitute: '\\1\u003c'\n\n  - name: remove_space_after\n    pattern: '\u003e(\\n)\\s*'\n    substitute: '\u003e\\1'\n```\n\n### Empty Attributes\n\nAfter writing rules, you may end up with empty attributes, like `class=\"\"`.  
Here's a sanitization rule:\n```yaml\n  - name: remove_empty_class\n    pattern: ' class=\"\"'\n    substitute: ''\n```\n\n## Acknowledgements\n\nSiteDiff is brought to you by [Evolving Web](https://evolvingweb.ca/).\n","funding_links":[],"categories":["HTML"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvingweb%2Fsitediff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevolvingweb%2Fsitediff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvingweb%2Fsitediff/lists"}