https://github.com/evolvingweb/sitediff

SiteDiff makes it easy to see differences between two versions of a website.

# SiteDiff CLI

**Warning:** SiteDiff 1.2.0 requires at least Ruby 3.1.2.

**Warning:** SiteDiff 1.0.0 introduces some backwards incompatible changes.

[![Build Status](https://travis-ci.org/evolvingweb/sitediff.svg?branch=master)](https://travis-ci.org/evolvingweb/sitediff)

## Table of contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Demo](#demo)
- [Usage](#usage)
- [Getting Started](#getting-started)
- [Comparing 2 Sites](#comparing-2-sites)
- [Spurious Diffs](#spurious-diffs)
- [Command Line Options](#command-line-options)
- [Finding Configuration Files](#finding-configuration-files)
- [Specifying Paths](#specifying-paths)
- [Debugging Rules](#debugging-rules)
- [Including and Excluding URLs](#including-and-excluding-urls)
- [Paths and Paths-file](#paths--paths-file)
- [Report Export](#export)
- [Running inside containers](#running-inside-containers)
- [Configuration](#configuration)
- [before_url / after_url](#before_url--after_url)
- [selector](#selector)
- [sanitization](#sanitization)
- [ignore_whitespace](#ignore_whitespace)
- [before / after](#before--after)
- [includes](#includes)
- [dom_transform](#dom_transform)
- [remove](#remove)
- [strip](#strip)
- [unwrap](#unwrap)
- [remove_class](#remove_class)
- [unwrap_root](#unwrap_root)
- [Organizing configuration files](#organizing-configuration-files)
- [Named regions](#named-regions)
- [report](#report)
- [title](#title)
- [details](#details)
- [before_note](#before_note)
- [after_note](#after_note)
- [before_url_report / after_url_report](#before_url_report--after_url_report)
- [Miscellaneous](#miscellaneous)
- [preset](#preset)
- [Include / Exclude Paths](#includeexclude-paths)
- [Curl Options](#curl-options)
- [Throttling](#throttling)
- [Timeouts](#timeouts)
- [Handling security](#handling-security)
- [interval](#interval)
- [concurrency](#concurrency)
- [depth](#depth)
- [curl_opts](#curl_opts)
- [Tips and Tricks](#tips-and-tricks)
- [Removing empty elements](#removing-empty-elements)
- [HTML Tag Formatting](#html-tag-formatting)
- [Empty Attributes](#empty-attributes)
- [Acknowledgements](#acknowledgements)

## Introduction
SiteDiff makes it easy to see how a website changes. It can compare two similar
sites or it can show how a single site changed over time. It helps identify
undesirable changes to the site's HTML and it's a useful tool for conducting QA
on re-deployments, site upgrades, and more!

When you run SiteDiff, it produces an HTML report showing whether pages on
your site have changed or not. For pages that have changed, you can see a
colorized diff showing exactly what changed, or compare the visual differences
side-by-side in a browser.

SiteDiff supports a range of normalization / sanitization rules. These allow
you to eliminate spurious differences, narrowing down differences to the ones
that materially affect the site.

## Installation

SiteDiff is fairly easy to install. Please refer to the
[installation docs](INSTALLATION.md).

## Demo

After installing all dependencies, including version 2 of the `bundler` gem, you can quickly
see what SiteDiff can do. Simply use the following commands:

```sh
git clone https://github.com/evolvingweb/sitediff
cd sitediff
bundle install
bundle exec thor fixture:serve
```

Then visit `http://localhost:13080/` to view the report.

SiteDiff shows you an overview of all the pages and clearly indicates which
pages have changed and not changed.
![page report preview](misc/sitediff%20-%20overview%20report.png?raw=true)

When you click on a changed page, you see a colorized diff of the page's markup
showing exactly what changed on the page.
![page report preview](misc/sitediff%20-%20page%20report.png?raw=true)

## Usage

Here are some instructions on getting started with SiteDiff. To see a list of
commands that SiteDiff offers, you can run:

```sitediff help```

To get help for a particular command, say, `diff`, you can run:

```sitediff help diff```

### Getting started

To use SiteDiff on your site, create a configuration for your site:

```sitediff init http://mysite.example.com```

SiteDiff will generate a configuration file named `sitediff.yaml` by default.

You can open the configuration file ```sitediff/sitediff.yaml``` to see the
default configuration generated by SiteDiff.
The [configuration reference](#configuration) section explains the contents
of this file and helps you customize it to your requirements.

Then get SiteDiff to crawl your site by using:

```sitediff crawl```

SiteDiff will then crawl your site, finding pages and caching their
contents. A list of discovered paths will be saved to a `paths.txt` file.

Now, you can make alterations to your site. For example, change a word on your
site's front page. After you're done, you can check what actually changed:

```sitediff diff```

For each page, SiteDiff will report whether it did or did not change. For pages
that changed, it will display a diff. You can also see an HTML version of the
report using the following command:

```sitediff serve```

SiteDiff will start an internal web server and open a report page in your
browser. For each page, you can see the diff and a side-by-side view of the
old and new versions.

You can now see if the changes were as you expected, or if some things didn't
quite work out as you hoped. If you noticed unexpected changes, congratulations:
SiteDiff just helped you find an issue you would have otherwise missed!

As you fix any issues, you can continue to alter your site and run
```sitediff diff``` to check the changes against the old version. Once you're
satisfied with the state of your site, you can inform SiteDiff that it should
re-cache your site:

```sitediff store```

This takes a snapshot of your website and the next time you run
```sitediff diff```, it will use this new version as the reference for
comparison.

Happy diffing!

### Comparing 2 sites

Sometimes you have two sites that you want to compare, for example a production
site hosted on a public server and a development site hosted on your computer.
SiteDiff can handle this situation, too! Just inform SiteDiff that there are
two sites to compare:

```sitediff init http://mysite.example.com http://localhost/mysite```

Then when you run `sitediff diff`, it will compare the cached version of the
first site with the current version of the second site.

If both the first and second sites may be changing, you should tell SiteDiff
not to cache either site:

```sitediff diff --cached=none```

### Spurious diffs

Sometimes sites have spurious differences that you don't want to show up in a
comparison. For example, many sites protect against Cross-Site Request Forgery
using a [semi-random token](http://en.wikipedia.org/wiki/Cross-site_request_forgery#Synchronizer_token_pattern).
Since this token changes on each HTTP GET, you probably don't care about such
a change.

To help with issues such as this, SiteDiff allows you to normalize the HTML it
fetches as it compares pages. In the ```sitediff.yaml``` configuration file,
you can add "sanitization rules", which specify either DOM transformations or
regular expression substitutions.

Here's an example of a rule you might add to remove CSRF-protection tokens
generated by Django:

```yaml
dom_transform:
  - title: Remove CSRF tokens
    type: remove
    selector: input[name=csrfmiddlewaretoken]
```

You can use one of the presets to apply framework-specific sanitization.
Currently, SiteDiff only comes with Drupal-specific presets.

See the [preset](#preset) section for more details.

## Command Line Options

### Finding configuration files

By default, SiteDiff will put everything in the `sitediff` folder. You can use
the `--directory` flag (short form `-C`) to specify a different directory.

```bash
sitediff init -C my_project_folder https://example.com
sitediff diff -C my_project_folder
sitediff serve -C my_project_folder
```

### Specifying paths

When you run ```sitediff diff```, you can specify which pages to look at in
2 ways:

1. The option ```--paths /foo /bar ...```.

   If you're trying to fix one page in particular, specifying just that one
   path will make ```sitediff diff``` run quickly!

2. The option ```--paths-file FILE``` with a newline-delimited text file.

   This is particularly useful when you're trying to eliminate all diffs.
   SiteDiff creates a file ```sitediff/failures.txt``` containing all paths
   which had differences, so as you try to fix differences, you can run:

   ```sitediff diff --paths-file sitediff/failures.txt```

### Debugging rules

When a sanitization rule isn't working quite right for you, you might run
`sitediff diff` many times over. If fetching all the pages is taking too long,
try adding the option ```--cached=all```. This tells SiteDiff not to re-fetch
the content, but just compare previously cached versions, which is a lot faster!

### Including and Excluding URLs

By default, SiteDiff crawls pages that are linked with an HTML anchor using
the `<a>` tag.

We're not interested in comparing randomly generated content such as form
build IDs, so we could use a sanitization rule like the following to fix this:

```yaml
sanitization:
  # Remove form build IDs
  - pattern: ''
    selector: 'input'
    substitute: ''
```

Sanitization rules may also have a **path** attribute, whose value is a
regular expression. If present, the rule will only apply to matching paths.
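
To make the `pattern` / `substitute` / `path` semantics concrete, here is a minimal Ruby sketch. It is not SiteDiff's internal code: the `Rule` struct, the `sanitize` helper, and the sample patterns are all hypothetical, and the sketch assumes rules are applied in order as plain regular expression substitutions.

```ruby
# Hypothetical model of a sanitization rule; not SiteDiff's own classes.
Rule = Struct.new(:pattern, :substitute, :path, keyword_init: true)

# Apply each rule in order. A rule that has a `path` is skipped unless
# the request path matches that regular expression.
def sanitize(html, request_path, rules)
  rules.reduce(html) do |text, rule|
    next text if rule.path && !request_path.match?(Regexp.new(rule.path))

    text.gsub(Regexp.new(rule.pattern), rule.substitute)
  end
end

rules = [
  # Applies to every page.
  Rule.new(pattern: 'token-[a-f0-9]+', substitute: '__token__'),
  # Applies only to paths starting with /news.
  Rule.new(pattern: 'Updated: \d{4}', substitute: 'Updated: __year__', path: '^/news')
]

puts sanitize('id token-1a2b Updated: 2024', '/news/1', rules)
puts sanitize('id token-1a2b Updated: 2024', '/about', rules)
```

The second call leaves the year untouched because `/about` does not match the rule's `path`.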

### ignore_whitespace
Ignore whitespace when doing the diff. This passes the `-w` option to the native OS `diff` command.

```yaml
ignore_whitespace: true
```

On the command line, use `-w` or `--ignore-whitespace`.

```bash
sitediff diff -w
```
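
As a rough illustration of what ignoring whitespace means for a comparison, the Ruby sketch below removes all whitespace from each line before comparing. It only approximates `diff -w` and is not how SiteDiff performs the diff; the `same_ignoring_whitespace?` helper is hypothetical.

```ruby
# Approximates `diff -w`: compare line by line with all whitespace removed.
def same_ignoring_whitespace?(before, after)
  strip_all = ->(text) { text.lines.map { |line| line.gsub(/\s+/, '') } }
  strip_all.call(before) == strip_all.call(after)
end

before_html = "<p>Hello world</p>\n"
after_html  = "<p>  Hello   world </p>\n"

puts same_ignoring_whitespace?(before_html, after_html)
puts same_ignoring_whitespace?(before_html, "<p>Goodbye world</p>\n")
```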

### before / after

Applies rules to just one side of the comparison.

These blocks can contain any of the following sections: `selector`,
`sanitization`, `dom_transform`. Such a section placed in `before` will be
applied just to the `before` side of the comparison and similarly for `after`.

For example, if the two sides format dates differently and you don't want that
to show up as a diff, you might use the following:

```yaml
before:
  sanitization:
    - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
      substitute: '__date__'
after:
  sanitization:
    - pattern: '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'
      substitute: '__date__'
```

The above rule will replace dates of the form `2004/12/05` in `before` and
dates of the form `May 12th 2004` in `after` with `__date__`.
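
The effect of these two patterns can be checked outside SiteDiff with plain Ruby regular expressions. This is only a sketch: the sample HTML strings are made up, and the substitution is done with `gsub` rather than through SiteDiff.

```ruby
# The two patterns from the configuration above.
BEFORE_DATE = %r{[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}}
AFTER_DATE  = /[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}/

# Made-up sample markup for each side of the comparison.
before_html = '<span class="date">2004/12/05</span>'
after_html  = '<span class="date">Dec 5th 2004</span>'

before_clean = before_html.gsub(BEFORE_DATE, '__date__')
after_clean  = after_html.gsub(AFTER_DATE, '__date__')

puts before_clean
puts after_clean
puts before_clean == after_clean
```

Once both sides are reduced to the same `__date__` placeholder, the date format no longer produces a diff.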

### includes

The names of other configuration YAML files to merge with this one.

```yaml
includes:
- config/sanitize_domains.yaml
- config/strip_css_js.yaml
```

### dom_transform

A list of transformations to apply to the HTML before comparing.

This is similar to _sanitization_, but it applies transformations to the
structure of the HTML, instead of to the text. Each transformation has a
**type**, and potentially other attributes. The following types are available:

#### remove

Given a **selector**, removes all elements that match it.

For example, say we have a block containing the current time, which is
expected to change. To ignore that, we might choose to delete the block
before comparison:

```yaml
dom_transform:
  # Remove current time block
  - type: remove
    selector: div#block-time
```

#### strip

Strip leading and trailing whitespace from the contents of a tag.

Uses the Ruby string `strip()` method. Whitespace is defined as any of the
following characters: null, horizontal tab, line feed, vertical tab, form
feed, carriage return, space.

To transform `<h1> Foo and Bar\n </h1>` to `<h1>Foo and Bar</h1>`:

```yaml
dom_transform:
  # Strip H1 tags
  - type: strip
    selector: h1
```
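
For reference, this is what Ruby's `String#strip` does to text like the heading content above. The sample string is made up; only the string method is shown, not the DOM handling.

```ruby
# Leading and trailing whitespace is removed; inner spaces are kept.
heading_text = "  Foo and Bar\n  "
stripped = heading_text.strip

puts stripped.inspect
```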

#### unwrap

Given a **selector**, replaces all matching elements with
their children. For example, your content on one side of the comparison might
look like this:

```html
<p>This is some text</p>
<p>Lola is a cute kitten.</p>
```

But on the other side, it might be wrapped in an `article` tag:
```html
<article>
  <p>This is some text</p>
  <p>Lola is a cute kitten.</p>
</article>
```

You could fix it with the following configuration:

```yaml
dom_transform:
  - type: unwrap
    selector: article
```

#### remove_class

Given a **selector** and a **class**, removes that class
from each element that matches the selector. It can also take a list of
classes, instead of just one.

For example, here are two sample rules for removing a single class and
removing multiple classes from all `div` elements:

```yaml
dom_transform:
  # Remove class foo from div elements
  - type: remove_class
    selector: div
    class: class-foo
  # Remove class bar and class baz from div elements
  - type: remove_class
    selector: div
    class:
      - class-bar
      - class-baz
```

#### unwrap_root

Replaces the entire root element with its children.

### report

The settings under the `report` key allow you to display helpful details on the report.

```yaml
report:
  title: "Updates to example.com"
  details: "This report verifies updates to example.com."
  before_note: "The old site"
  after_note: "The new site"
  before_url_report: http://example.com
  after_url_report: http://staging.example.com
```

#### title

Display a title string at the top of the report.

#### details

Display text as a paragraph at the top of the report, below the title.

#### before_note

Display a brief explanatory note next to the `before` URL.

#### after_note

Display a brief explanatory note next to the `after` URL.

#### before_url_report / after_url_report

Changes how SiteDiff reports which URLs it is comparing, but doesn't change what
it actually compares.

Suppose you are serving your 'after' website on a virtual machine with
IP 192.168.2.3, and you are also running SiteDiff inside that VM. To make links
in the report accessible from outside the VM, you might provide:

```yaml
after_url: http://localhost
report:
  after_url_report: http://192.168.2.3
```

If you don't wish to have the "Before" or "After" links in the report, set the corresponding option to `false`:

```yaml
report:
  after_url_report: false
```

### Miscellaneous

#### preset

Presets are stored in the `/lib/sitediff/presets` directory of this gem. You
can select a preset as follows:

```yaml
settings:
  preset: drupal
```

#### Include/Exclude Paths

##### exclude paths

A RegEx indicating the paths that should not be crawled.

##### include paths

A RegEx indicating the paths that should be crawled.
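
The Ruby sketch below shows which paths a given regular expression would match. The paths and both expressions are made up, and the two filters are shown independently: how SiteDiff combines an include and an exclude expression is not covered here.

```ruby
# Made-up list of discovered paths.
paths = ['/about', '/files/report.pdf', '/blog/2024/post', '/files/logo.png']

# Hypothetical expressions for illustration.
exclude_pattern = Regexp.new('\.(pdf|png)$')
include_pattern = Regexp.new('^/blog')

# Paths that an exclude expression would leave out of the crawl.
not_excluded = paths.reject { |path| path.match?(exclude_pattern) }

# Paths that an include expression matches.
included = paths.select { |path| path.match?(include_pattern) }

puts not_excluded.inspect
puts included.inspect
```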

### Organizing configuration files

If your configuration file starts getting really big, SiteDiff lets you
separate it out into multiple files. Just have one base file that includes
other files:

```yaml
includes:
- sanitization.yaml
- paths.yaml
```

This allows you to separate your configuration into logical groups.
For example, generic rules for your site could live in a `generic.yaml` file,
while rules pertaining to a particular update you're conducting could
live in `update-8.2.yaml`.

### Named regions

In major upgrades and migrations where there are significant changes to the markup,
simple diffs will not be of much value. To assist in these cases, `named
regions` let you define regions in the page markup and specify the order in which
they should be compared. Specifying the order helps in cases where the fields are
not in the same order on the new site.

For example, if you have a CMS displaying `title`, `author`, and `body` fields, you
could define the named regions and the selectors for the three fields as follows:

```yaml
regions:
  - name: title
    selector: h1.title
  - name: author
    selector: .field-name-attribution
  - name: body
    selector: .field-name-body
```

(You need to define `regions` for both the `before` and `after` sections.)

You must then define the order that the fields should be compared, using the
`output` key.

```yaml
output:
- title
- author
- body
```

Before the two versions are compared, SiteDiff generates markup with
`<region>` tags, and each `<region>` contains the markup matching the
corresponding selector.

For example:

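```html
<region id="title">
  <h1 class="title">My Blog Post</h1>
</region>
<region id="author">
  <div class="field-name-attribution">By: Alfred E. Neuman</div>
</region>
<region id="body">
  <div class="field-name-body">Lorem ipsum...</div>
</region>
```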
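
The way the `output` order drives the generated markup can be sketched in Ruby as follows. This is an illustration only, not SiteDiff's implementation: `build_regions` is a hypothetical helper, and the fragments are assumed to have already been extracted with the region selectors.

```ruby
# Wrap each already-extracted fragment in a <region> tag, in `output` order.
def build_regions(fragments, output)
  output.map do |name|
    %(<region id="#{name}">\n#{fragments.fetch(name)}\n</region>)
  end.join("\n")
end

# Fragments as they might appear on a site where the field order differs.
fragments = {
  'body' => '<div class="field-name-body">Lorem ipsum...</div>',
  'title' => '<h1 class="title">My Blog Post</h1>',
  'author' => '<div class="field-name-attribution">By: Alfred E. Neuman</div>'
}

markup = build_regions(fragments, %w[title author body])
puts markup
```

Because both sides are rebuilt in the same order, a field that merely moved on the page does not show up as a difference.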

The regions are processed first, so you can reference the `<region>` tags to
be more specific in your selectors for `dom_transform` and `sanitization`
sections.

For example:

```yaml
dom_transform:
  - name: Remove body div wrapper
    type: unwrap
    selector: region#body .field-name-body
```

### Curl Options

[Many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html) can be
passed to the underlying curl library. To name an option, remove the
`CURLOPT_` prefix and write the name in lowercase. Add
`--curl_options=name1:value1 name2:value2` to the command line (such as
`--curl_options=max_recv_speed_large:100000`) or add the options to
your configuration file.

```yaml
settings:
  curl_opts:
    max_recv_speed_large: 10000
    ssl_verifypeer: false
```

These CURL options can be put under the `settings` section of `sitediff.yaml`
as demonstrated above.
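
The naming convention is mechanical, so it can be sketched in a few lines of Ruby. Both helpers below are hypothetical and only build strings; they do not call SiteDiff or curl.

```ruby
# Turn a libcurl constant name into the key used by SiteDiff:
# drop the CURLOPT_ prefix and lowercase the rest.
def curl_opt_key(constant_name)
  constant_name.delete_prefix('CURLOPT_').downcase
end

# Build the name:value pairs in the format shown for --curl_options.
def curl_options_flag(options)
  pairs = options.map { |name, value| "#{curl_opt_key(name)}:#{value}" }
  "--curl_options=#{pairs.join(' ')}"
end

flag = curl_options_flag({ 'CURLOPT_MAX_RECV_SPEED_LARGE' => 100_000, 'CURLOPT_TIMEOUT' => 60 })
puts flag
```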

#### Throttling

A few options are also available to control how aggressively SiteDiff crawls.

- There's a command line option `--concurrency=N` for `sitediff init`
  which controls the maximum number of simultaneous connections made.
  A lower N means less aggressive crawling. The default is 3. You can also
  specify this in the `sitediff.yaml` file under the `settings` key.

- The underlying curl library has [many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
  such as `max_recv_speed_large` which can be helpful.

- There is a special command line option `--interval=T` for `sitediff init`.
  This option makes the fetcher wait T milliseconds between
  fetching pages. You can also specify this in the `sitediff.yaml` file under
  the `settings` key.

#### Timeouts

By default, no timeout is set, but you can add one with
`--curl_options=timeout:60` on the command line or in your configuration file.

```yaml
settings:
  curl_opts:
    timeout: 60 # In seconds; or...
    timeout_ms: 60000 # In milliseconds.
```

#### Handling security

Often development or staging sites are protected by [HTTP Authentication](http://en.wikipedia.org/wiki/Basic_access_authentication).
SiteDiff allows you to specify a username and password, by using a URL like
`http://user:pass@example.com` or by adding a `userpwd` setting to your file.
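
The two forms carry the same information. As a small illustration using Ruby's standard `uri` library, the credentials embedded in a URL are exactly the `username:password` string that a `userpwd` setting expects. The URL and credentials here are placeholders.

```ruby
require 'uri'

# Placeholder URL with embedded credentials.
site = URI.parse('http://user:pass@example.com')

# The "user:password" part, in the same shape as a userpwd value.
userpwd = site.userinfo

puts userpwd
puts site.host
```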

SiteDiff ignores untrusted certificates by default. This is equivalent to the following settings:

```yaml
settings:
  curl_opts:
    ssl_verifypeer: false
    ssl_verifyhost: 0
    userpwd: "username:password"
```

The `settings` section contains various parameters which affect the way
SiteDiff works. You can have the following keys under `settings`.

#### interval
An integer indicating the number of milliseconds SiteDiff should wait
between requests.

#### concurrency
The maximum number of simultaneous requests that SiteDiff should make.

#### depth

The depth to which SiteDiff should crawl the website. Defaults to 3,
which means, 3 levels deep.

#### curl_opts

Options to pass to the underlying curl library. Remove the `CURLOPT_` prefix in
this [full list of options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
and write in lowercase. Useful for throttling.

```yaml
settings:
  curl_opts:
    connecttimeout: 3
    followlocation: true
    max_recv_speed_large: 10000
```

## Tips and Tricks

Here are some tips and tricks that we've learned using SiteDiff:

- Use single quotes or double quotes around selectors. Remember that `#` starts a comment in YAML.
- Be specific enough with selectors to not affect elements on other pages.
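
The quoting tip is easy to verify with Ruby's standard `yaml` library. The selectors below are made-up examples.

```ruby
require 'yaml'

# Unquoted: everything from " #" onward is a comment, so the value is empty.
unquoted = YAML.safe_load('selector: #main-content')

# Quoted: the selector is kept as written.
quoted = YAML.safe_load("selector: '#main-content'")

# A "#" with no whitespace before it is part of the value.
inline = YAML.safe_load('selector: div#block-time')

puts unquoted.inspect
puts quoted.inspect
puts inline.inspect
```

Quoting every selector avoids having to remember which of these cases applies.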

### Removing Empty Elements

If you have an empty `<p></p>` tag appearing in the diff, you can write the following in your sanitization lists:
```yaml
- name: remove_empty_p
  pattern: '<p></p>'
  substitute: ''
```

### HTML Tag Formatting

There are times when the HTML tags do not have newlines between them on one of the sites you wish to compare. In this
case, these sanitization rules are useful:
```yaml
- name: remove_space_before
  pattern: '\s*(\n)<'
  substitute: '\1<'

- name: remove_space_after
  pattern: '>(\n)\s*'
  substitute: '>\1'
```
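
To see what these two rules do, here is a Ruby sketch that applies the same patterns with `gsub`. The sample markup is made up, and the substitutions are run directly rather than through SiteDiff.

```ruby
# The same patterns and substitutions as the two rules above.
REMOVE_SPACE_BEFORE = [/\s*(\n)</, '\1<'].freeze
REMOVE_SPACE_AFTER  = [/>(\n)\s*/, '>\1'].freeze

def normalize_tags(html)
  [REMOVE_SPACE_BEFORE, REMOVE_SPACE_AFTER].reduce(html) do |text, (pattern, substitute)|
    text.gsub(pattern, substitute)
  end
end

indented = "<div>\n    <p>Hello</p>   \n</div>"
compact  = "<div>\n<p>Hello</p>\n</div>"

puts normalize_tags(indented).inspect
puts normalize_tags(indented) == normalize_tags(compact)
```

The first rule drops whitespace in front of a newline that is directly followed by a tag, and the second drops indentation after a newline that directly follows a tag, so both inputs end up identical.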

### Empty Attributes

After writing rules, you may end up with empty attributes, like `class=""` or `width=""`. Here's a sanitization rule that removes empty `class` attributes:
```yaml
- name: remove_empty_class
  pattern: ' class=""'
  substitute: ''
```
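
If several attributes end up empty, one pattern with an alternation can cover them. The Ruby sketch below tests such a pattern; the pattern itself and the sample markup are assumptions for illustration, not rules shipped with SiteDiff.

```ruby
# Hypothetical pattern: matches an empty class or width attribute.
EMPTY_ATTRIBUTE = / (?:class|width)=""/

html = '<div class="" width=""><p class="lead">Hi</p></div>'
cleaned = html.gsub(EMPTY_ATTRIBUTE, '')

puts cleaned
```

Attributes that still have a value, such as `class="lead"`, are left alone.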

## Acknowledgements

SiteDiff is brought to you by [Evolving Web](https://evolvingweb.ca/).