https://github.com/adamniederer/elquery


#+TITLE: elquery: parse, query, and format HTML with Emacs Lisp
#+AUTHOR: Adam Niederer


#+BEGIN_QUOTE
Write things. Do things.
#+END_QUOTE

elquery is a library that lets you parse, query, set, and format HTML using
Emacs Lisp. It implements (most of) the [[https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector][querySelector]] API with ~elquery-$~, and
can get and set HTML attributes.

* Usage
Given the below HTML file, some usage examples include:
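# NOTE: The markup below is reconstructed from the examples that follow; the <span> wrapping "example" is an assumption.
#+BEGIN_SRC html
<html>
  <head class="baz"><title class="baz">Complex HTML Page</title></head>
  <body class="baz">
    <h1 class="baz">Wow this is an <span>example</span></h1>
    <input id="quux" class="baz foo"/>
    <iframe sandbox="allow-same-origin allow-scripts allow-popups allow-forms"></iframe>
  </body>
</html>
#+END_SRC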
#+BEGIN_SRC elisp
(require 'elquery)

(let ((html (elquery-read-file "~/baz.html")))
  ;;; Query elements
  (elquery-el (car (elquery-$ ".baz#quux" html)))
  ;; => "input"
  (mapcar #'elquery-el (elquery-$ ".baz" html))
  ;; => ("input" "h1" "body" "title" "head")
  (mapcar (lambda (el) (elquery-el (elquery-parent el)))
          (elquery-$ ".baz" html))
  ;; => ("body" "body" "html" "head" "html")
  (mapcar (lambda (el) (mapcar #'elquery-el (elquery-siblings el)))
          (elquery-$ ".baz" html))
  ;; => (("h1" "input" "iframe") ("h1" "input" "iframe") ("head" "body") ("title") ("head" "body"))

  ;;; Read properties of elements
  (elquery-classes (car (elquery-$ ".baz#quux" html)))
  ;; => ("baz" "foo")
  (elquery-prop (car (elquery-$ "iframe" html)) "sandbox")
  ;; => "allow-same-origin allow-scripts allow-popups allow-forms"
  (elquery-text (car (elquery-$ "h1" html)))
  ;; => "Wow this is an"
  (elquery-full-text (car (elquery-$ "h1" html)) " ")
  ;; => "Wow this is an example"

  ;;; Write parsed HTML
  (elquery-write html nil)
  ;; => " ... "
  )
#+END_SRC
* Functions
** I/O and Parsing
- ~(elquery-read-url URL)~ - Return a parsed tree of the HTML at URL
- ~(elquery-read-file FILE)~ - Return a parsed tree of the HTML in FILE
- ~(elquery-read-buffer BUFFER)~ - Return a parsed tree of the HTML in BUFFER
- ~(elquery-read-string STRING)~ - Return a parsed tree of the HTML in STRING
- ~(elquery-write TREE)~ - Return a string of the HTML representation of TREE
  and its children.
- ~(elquery-fmt TREE)~ - Return a string of the HTML representation of the
  topmost node of TREE, without its children or closing tag.
- ~(elquery-pprint TREE)~ - Return a pretty, CSS-selector-like representation of
  the topmost node of TREE.
** DOM Traversal
- ~(elquery-$ SELECTOR TREE)~ - Return a list of subtrees of TREE which match
  the query string. Currently, the element name, id, class, and property values
  are supported, along with some set operations and hierarchy relationships. See
  Selector Syntax for a more detailed overview of valid selectors.
- ~(elquery-parent NODE)~ - Return the parent of NODE
- ~(elquery-child NODE)~ - Return the first child of NODE
- ~(elquery-children NODE)~ - Return a list of the children of NODE
- ~(elquery-siblings NODE)~ - Return a list of the siblings of NODE
- ~(elquery-next-children NODE)~ - Return a list of children of NODE with the
  children's children removed.
- ~(elquery-next-sibling NODE)~ - Return the sibling immediately following NODE
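
A minimal sketch of the traversal functions above, assuming elquery is installed and loadable with ~(require 'elquery)~. The markup and the function name ~my-traversal-demo~ are made up for illustration and are not part of elquery.

#+BEGIN_SRC elisp
(require 'elquery)

(defun my-traversal-demo ()
  "Walk a small, made-up document with elquery's traversal functions.
Hypothetical helper for illustration only."
  (let* ((tree (elquery-read-string
                "<ul id=\"menu\"><li class=\"first\">One</li><li>Two</li></ul>"))
         (menu (car (elquery-$ "#menu" tree)))
         (item (car (elquery-$ ".first" tree))))
    (list
     ;; Element name of the parent of the first <li>
     (elquery-el (elquery-parent item))
     ;; Element names of the direct children of the <ul>
     (mapcar #'elquery-el (elquery-children menu))
     ;; Element name of the sibling immediately following the first <li>
     (elquery-el (elquery-next-sibling item)))))
#+END_SRC

Calling ~(my-traversal-demo)~ returns a three-element list: the parent's element name, the element names of the ~ul~'s children, and the element name of the next sibling.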
** Common Trait Selection
- ~(elquery-class? NODE CLASS)~ - Return whether NODE has the class CLASS
- ~(elquery-id NODE)~ - Return the id of NODE
- ~(elquery-classes NODE)~ - Return a list of the classes of NODE
- ~(elquery-text NODE)~ - Return the text content of NODE. If there are multiple
  text elements, for example, ~<span>Hello <span>cruel</span> world</span>~,
  return the concatenation of these nodes, ~Hello  world~, with two spaces
  between ~Hello~ and ~world~
- ~(elquery-full-text NODE)~ - Return the text content of NODE and its
  children. If there are multiple text elements, for example,
  ~<span>Hello <span>cruel</span> world</span>~, return the concatenation of
  these nodes, ~Hello cruel world~.
** Property Modification
- ~(elquery-props NODE)~ - Return a plist of this node's properties, including
  its id, class, and data attributes.
- ~(elquery-data NODE KEY &optional VAL)~ - Return the value of NODE's data-KEY
  property. If VAL is supplied, destructively set NODE's data-KEY property to
  VAL. For example, on the node ~<span data-name="adam">~,
  ~(elquery-data node "name")~ would return ~adam~
- ~(elquery-prop NODE PROP &optional VAL)~ - Return the value of PROP in
  NODE. If VAL is supplied, destructively set PROP to VAL.
- ~(elquery-rm-prop NODE PROP)~ - Destructively remove PROP from NODE
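
The following sketch reads, sets, and removes attributes using the functions listed above. It assumes elquery is installed; the markup and the name ~my-props-demo~ are invented for illustration, and passing the property name as a second argument to ~elquery-rm-prop~ is an assumption drawn from its description.

#+BEGIN_SRC elisp
(require 'elquery)

(defun my-props-demo ()
  "Read, set, and remove attributes on a made-up node.
Hypothetical helper for illustration only."
  (let* ((tree (elquery-read-string
                "<span id=\"user\" data-name=\"adam\"></span>"))
         (node (car (elquery-$ "#user" tree)))
         ;; Read the data-name attribute through `elquery-data'.
         (name (elquery-data node "name"))
         title)
    ;; Destructively set the title attribute, then read it back.
    (elquery-prop node "title" "Site owner")
    (setq title (elquery-prop node "title"))
    ;; Destructively remove it again. The second argument is an
    ;; assumption based on the description of `elquery-rm-prop'.
    (elquery-rm-prop node "title")
    ;; Return what was read at each step.
    (list name title (elquery-prop node "title"))))
#+END_SRC

The setter calls are made for their side effect on the node; the sketch reads the value back with a separate call rather than relying on what the setter returns.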
** Predicates
- ~(elquery-nodep OBJ)~ - Return whether OBJ is a DOM node
- ~(elquery-elp OBJ)~ - Return whether OBJ is not a text node
- ~(elquery-textp OBJ)~ - Return whether OBJ is a text node
** General Tree Functions
Because HTML is a large tree representation, elquery includes some general tree
manipulation functions which it uses internally and which may be useful to you
when dealing with the DOM.

- ~(elquery-tree-remove-if PRED TREE)~ - Remove all elements from TREE if they
  satisfy PRED. Preserves the structure and order of the tree.
- ~(elquery-tree-remove-if-not PRED TREE)~ - Remove all elements from TREE if
  they do not satisfy PRED. Preserves the structure and order of the tree.
- ~(elquery-tree-mapcar FN TREE)~ - Apply FN to all elements in TREE
- ~(elquery-tree-reduce FN TREE)~ - Perform an in-order reduction of TREE with
  FN. Equivalent to a reduction on a flattened tree.
- ~(elquery-tree-flatten TREE)~ - Flatten the tree, removing all list nesting
  and leaving a list of only atomic elements. This does not preserve the order
  of the elements.
- ~(elquery-tree-flatten-until PRED TREE)~ - Flatten the tree, but treat
  elements matching PRED as atomic elements, not preserving order.
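
To make "flatten" and "reduce over a flattened tree" concrete, here is a self-contained sketch in plain Emacs Lisp. The ~my-tree-flatten~ and ~my-tree-reduce~ helpers are hypothetical stand-ins written for this example, not elquery's implementation; unlike the library function described above, this flatten happens to keep depth-first order.

#+BEGIN_SRC elisp
(require 'cl-lib)

(defun my-tree-flatten (tree)
  "Return the non-nil atoms of TREE as a flat list, in depth-first order.
Hypothetical helper for illustration; not elquery's implementation."
  (cond ((null tree) nil)                 ; empty list contributes nothing
        ((atom tree) (list tree))         ; a leaf becomes a one-element list
        (t (append (my-tree-flatten (car tree))
                   (my-tree-flatten (cdr tree))))))

(defun my-tree-reduce (fn tree)
  "Reduce TREE with FN by reducing over its flattened atoms."
  (cl-reduce fn (my-tree-flatten tree)))

(my-tree-flatten '(1 (2 (3 4)) 5))     ; => (1 2 3 4 5)
(my-tree-reduce #'+ '(1 (2 (3 4)) 5))  ; => 15
#+END_SRC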
* Selector Syntax
We support a significant subset of jQuery's selector syntax. If I ever decide to
make this project even more web-scale, I'll add colon selectors and more
property equality tests.

- ~#foo~ - Select all elements with the id "foo"
- ~.bar~ - Select all elements with the class "bar"
- ~[name=user]~ - Select all elements whose "name" property is "user"
- ~#foo.bar[name=user]~ - Logical intersection of the above three selectors.
  Select all elements whose id is "foo", class is "bar", and "name" is "user"
- ~#foo .bar, [name=user]~ - Select all elements with the class "bar" in the
  subtrees of all elements with the id "foo", along with all elements whose
  "name" is "user"
- ~#foo > .bar~ - Select all elements with class "bar" whose immediate parent
  has id "foo"
- ~#foo ~ .bar~ - Select all elements with class "bar" which are siblings of
  elements with id "foo"
- ~#foo + .bar~ - Select all elements with class "bar" which immediately follow
  elements with id "foo"

All permutations of union, intersection, child, next-child, and sibling
relationships are supported.
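
As a rough illustration of how these selectors differ, the sketch below counts the matches for a descendant, a child, a property, and a union selector against a small made-up document. It assumes elquery is installed; ~my-selector-demo~ and the markup are hypothetical.

#+BEGIN_SRC elisp
(require 'elquery)

(defun my-selector-demo ()
  "Count matches for several selectors against a made-up document.
Hypothetical helper for illustration only."
  (let ((tree (elquery-read-string
               (concat "<div id=\"foo\">"
                       "<p class=\"bar\">a</p>"
                       "<div><p class=\"bar\">b</p></div>"
                       "</div>"
                       "<input name=\"user\"/>"))))
    ;; One count per selector, in the order the selectors are listed.
    (mapcar (lambda (selector) (length (elquery-$ selector tree)))
            '("#foo .bar"                   ; any depth below #foo
              "#foo > .bar"                 ; direct children of #foo only
              "[name=user]"                 ; property equality
              "#foo .bar, [name=user]"))))  ; union of two selectors
#+END_SRC

Comparing the first two counts shows the difference between the descendant and child relationships: one ~.bar~ paragraph sits directly under ~#foo~, the other is nested one level deeper.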
* Internal Data Structure
Each element is a plist, which is guaranteed to have at least one key-value
pair, and an ~:el~ key. All elements of this plist are accessible with the above
functions, but the internal representation of a document node is below for
anybody brave enough to hack on this:

- ~:el~ - A string containing the name of the element. If the node is a "text
  node", ~:el~ is nil
- ~:text~ - A string containing the concatenation of all text elements
  immediately below this one on the tree. For example, for the node representing
  ~<span>Hello <span>cruel</span> world</span>~, ~:text~ would be
  ~Hello  world~.
- ~:props~ - A plist of HTML properties for each element, including but not
  limited to its ~id~, ~class~, ~data-*~, and ~name~ attributes.
- ~:parent~ - A pointer to the parent element. Emacs thinks this is a list.
- ~:children~ - A list of elements immediately below this one on the tree,
  including text nodes.
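
For anyone poking at the plist directly, a small sketch of reading these keys with ~plist-get~ follows. It assumes elquery is installed and that parsed nodes carry the keys listed above; ~my-node-keys-demo~ and the markup are hypothetical.

#+BEGIN_SRC elisp
(require 'elquery)

(defun my-node-keys-demo ()
  "Read a few raw plist keys from a parsed node.
Hypothetical helper for illustration only."
  (let* ((tree (elquery-read-string
                "<p id=\"greeting\">Hello <b>cruel</b> world</p>"))
         (node (car (elquery-$ "#greeting" tree))))
    ;; Read individual keys rather than printing NODE: :parent points
    ;; back up the tree, so printing a whole node can recurse.
    (list (plist-get node :el)
          (plist-get (plist-get node :parent) :el)
          (mapcar (lambda (child) (plist-get child :el))
                  (plist-get node :children)))))
#+END_SRC

The accessor functions documented earlier remain the safer interface; direct ~plist-get~ calls like these depend on the internal layout staying as described.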

The data structure used in queries via ~(elquery-$)~ is very similar, although
it doesn't have a ~:text~ keyword (PRs welcome!) and has an extra ~:rel~ keyword,
which specifies the relationship between the query and its ~:children~. ~:rel~
may be one of ~:next-child~, ~:child~, ~:next-sibling~, and ~:sibling~. This is
used by the internal function ~(elquery--$)~, which must determine whether it can
continue recursion down the tree based on the relationship of two intersections
in a selector.