https://github.com/joshdata/xml_diff
Compares two XML documents by diffing their text.
- Host: GitHub
- URL: https://github.com/joshdata/xml_diff
- Owner: JoshData
- License: cc0-1.0
- Created: 2014-06-24T17:39:52.000Z (almost 12 years ago)
- Default Branch: primary
- Last Pushed: 2024-06-13T03:52:16.000Z (about 2 years ago)
- Last Synced: 2024-12-08T05:12:57.556Z (over 1 year ago)
- Language: Python
- Size: 29.3 KB
- Stars: 39
- Watchers: 8
- Forks: 11
- Open Issues: 1
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE
xml_diff
========
Compares the text inside two XML documents and marks up the differences with ``<del>`` and ``<ins>`` tags.
This is the result of about 7 years of trying to get this right and coded simply. I've used code like this in one form or another to compare bill text on GovTrack.us.
The comparison is completely blind to the structure of the two XML documents. It does a word-by-word comparison on the text content only, and then it goes back into the original documents and wraps changed text in new ``<del>`` and ``<ins>`` wrapper elements.
The documents are then concatenated to form a new document and the new document is printed on standard output. Or use this as a library and call ``compare`` yourself with two ``lxml.etree.Element`` nodes (the roots of your documents).
The script is written in Python 3.
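
To make the "text only, word by word" idea concrete, here is a minimal sketch that uses only the standard library (``xml.etree.ElementTree`` and ``difflib``) rather than xml_diff itself. The two sample documents, the ``text_words`` helper, and the simple tokenizer are assumptions made for illustration. The sketch reports which words changed and does not attempt the harder step of wrapping them back into the original trees.

```python
import difflib
import re
import xml.etree.ElementTree as ET

# Hypothetical sample documents with different markup around the text.
doc1 = ET.fromstring("<html>Here is <b>some bold</b> text.</html>")
doc2 = ET.fromstring("<html>Here is <i>some italic</i> content.</html>")


def text_words(root):
    """Flatten a tree to its text, ignoring element boundaries, and split into words."""
    text = "".join(root.itertext())
    # Simplified tokenizer for this sketch: runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


words1 = text_words(doc1)
words2 = text_words(doc2)

deleted, inserted = [], []
matcher = difflib.SequenceMatcher(None, words1, words2, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        deleted.extend(words1[i1:i2])   # words present only in the first document
        inserted.extend(words2[j1:j2])  # words present only in the second document

print("deleted:", deleted)
print("inserted:", inserted)
```

Because the element structure is discarded before diffing, the ``<b>`` versus ``<i>`` difference is never reported; only the changed words are.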
Example
-------
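Comparing these two documents::

    <html>
        Here is <b>some bold</b> text.
    </html>

and::

    <html>
        Here is <i>some italic</i> content that shows how <tt>xml_diff</tt> works.
    </html>

Yields::

    <documents>
        <html>
            Here is <b>some <del>bold</del></b><del> text</del>.
        </html>
        <html>
            Here is <i>some <ins>italic</ins></i><ins> content that shows how </ins><tt><ins>xml_diff</ins></tt><ins> works</ins>.
        </html>
    </documents>
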
On Ubuntu, get dependencies with::

    apt-get install python3-lxml libxml2-dev libxslt1-dev

For really fast comparisons, get Google's Diff Match Patch library, as rewritten and sped up by @leutloff and then turned into a Python extension module by me::

    pip3 install diff_match_patch_python

Or if you can't install that for any reason, use the pure-Python library::

    pip3 install diff-match-patch

xml_diff will use whichever is installed.
Finally, install this module::

    pip3 install xml_diff

Then call the module from the command line::

    python3 -m xml_diff --tags del,ins doc1.xml doc2.xml > changes.xml

Or use the module from Python::

    import lxml.etree
    from xml_diff import compare

    # Parse both documents and take their root elements.
    dom1 = lxml.etree.parse("doc1.xml").getroot()
    dom2 = lxml.etree.parse("doc2.xml").getroot()

    # Wrap changed text in both trees.
    comparison = compare(dom1, dom2)

The two DOMs are modified in place.
Optional Arguments
------------------
The ``compare`` function takes other optional keyword arguments:
``merge`` is a boolean (default false) that indicates whether the comparison function should perform a merge. If true, ``dom1`` will contain not just ``<del>`` nodes but also ``<ins>`` nodes and, similarly, ``dom2`` will contain not just ``<ins>`` nodes but also ``<del>`` nodes. Although the two DOMs will then contain the same semantic information about changes, and the same text content, each preserves its original structure, since the comparison is only over text and not structure. The new ``ins``/``del`` nodes contain content from the other document (including whole subtrees), so there is no guarantee that the final documents will conform to any particular structural schema after this operation.
``word_separator_regex`` (default ``r"\s+|[^\s\w]"``) is a regular expression for how to separate words. The default splits on runs of one or more whitespace characters and on single non-word characters.
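
As a quick way to see what the default pattern matches, the sketch below applies it to a sample string with the standard ``re`` module. The sample string is made up for illustration, and how xml_diff applies the pattern internally is not shown here; this only demonstrates which substrings count as separators.

```python
import re

# Default pattern quoted in the text above.
WORD_SEPARATOR_REGEX = r"\s+|[^\s\w]"

# Hypothetical sample: note the double space and the hyphen.
sample = "Here is  some bold-text."

# Substrings the pattern treats as separators.
separators = re.findall(WORD_SEPARATOR_REGEX, sample)

# Pieces left between separators (empty strings dropped).
words = [w for w in re.split(WORD_SEPARATOR_REGEX, sample) if w]

print(separators)
print(words)
```

The run of two spaces is matched as one separator, while the hyphen and the full stop are each matched on their own.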
``differ`` is a function that takes two arguments ``(text1, text2)`` and returns an iterator over difference operations given as tuples of the form ``(operation, text_length)``, where ``operation`` is one of ``"="`` (no change in text), ``"+"`` (text inserted into ``text2``), or ``"-"`` (text deleted from ``text1``). (See xml_diff/__init__.py's ``default_differ`` function for how the default differ works.)
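
For illustration, a replacement differ with the shape described above could be built on the standard library's ``difflib``. This is a sketch under assumptions, not the library's default: the name ``difflib_differ`` is hypothetical, lengths are assumed to be counted in characters of the strings passed in, and a replaced span is reported as a deletion followed by an insertion.

```python
import difflib


def difflib_differ(text1, text2):
    """Yield (operation, text_length) tuples using difflib (assumed interface)."""
    matcher = difflib.SequenceMatcher(None, text1, text2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            yield ("=", i2 - i1)
        else:
            # "delete", "insert" and "replace" opcodes: emit whichever sides are non-empty.
            if i2 > i1:
                yield ("-", i2 - i1)  # characters removed from text1
            if j2 > j1:
                yield ("+", j2 - j1)  # characters added in text2


old = "Here is some bold text."
new = "Here is some italic content."
ops = list(difflib_differ(old, new))
print(ops)

# It could then be passed as the keyword argument described above:
#   compare(dom1, dom2, differ=difflib_differ)
```

A useful sanity check for any custom differ is that the ``"="`` and ``"-"`` lengths add up to the length of the first text, and the ``"="`` and ``"+"`` lengths add up to the length of the second.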
``tags`` is a two-tuple of tag names to use for deleted and inserted content. The default is ``('del', 'ins')``.
``make_tag_func`` is a function that takes one argument, which is either ``"ins"`` or ``"del"``, and returns a new ``lxml.etree.Element`` to be inserted into the DOM to wrap changed content. If given, the ``tags`` argument is ignored.