Boolean query optimizer in Clojure.
https://github.com/sritchie/optimizer

#+STARTUP: showall indent
#+STARTUP: hidestars
#+PROPERTY: header-args :noweb yes :cache yes :padline yes :tangle no :mkdirp yes

This post is based on a challenge that [[https://twitter.com/peterseibel][Peter Seibel]] gave our team back in 2013 when I worked at Twitter. He'd been screwing around with Pig and Scalding and was thinking about how some optimizer could rewrite a Pig query to save time on some of the tremendous joins our "customers" were running. Here's the original challenge:

* The Challenge

Write a program whose input is a boolean expression (some combination of ANDs, ORs, NOTs, literal true and false values, and boolean variables) some of whose variables are cheap to access and some of which are expensive to access. (We'll in fact use the simplest cost model: cheap variables are free and expensive variables are infinitely expensive.) The output of this program is a new boolean expression that uses only the cheap variables and which returns true whenever the original would (i.e. for any specific set of values for the variables) and which returns false as often as it can when the original would.

(The motivation for this challenge is things like this: imagine a query that joins data sets and then filters the result. The filter predicate may access variables from both sides of the join but it may be a win to perform a pre-filter on each side of the join first to weed out rows before the join is performed. The pre-filter predicates obviously can only use the variables that are present on one side of the join.)

In slightly more formal terms, given a function f of n variables, the first k < n of which are cheap, you need to produce g such that:

#+BEGIN_EXAMPLE
f(v1,v2,..vn) = g(v1,v2,...vk) && f(v1,v2,..vn)
#+END_EXAMPLE

Or, to put it another way:

#+BEGIN_EXAMPLE
!g(v1,v2,...vk) implies !f(v1,v2,...vn)
#+END_EXAMPLE
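To make the condition concrete, here is a small sketch (not part of the original challenge) that brute-forces the first formulation for one hand-picked pair. Everything in it is an illustrative assumption: =f= stands for ~(and v1 (or v2 w1))~, =g= is the candidate pre-filter ~v1~, and =assignments= is a helper that enumerates every input.

#+BEGIN_SRC clojure
(defn assignments
  "All 2^n vectors of booleans of length n."
  [n]
  (if (zero? n)
    [[]]
    (for [b  [true false]
          bs (assignments (dec n))]
      (vec (cons b bs)))))

;; Hypothetical example: f = (and v1 (or v2 w1)), g = v1.
(defn f [v1 v2 w1] (and v1 (or v2 w1)))
(defn g [v1 v2] v1)

;; Correctness: f(x) == g(x) && f(x) for every input x.
(def valid-prefilter?
  (every? (fn [[v1 v2 w1]]
            (= (f v1 v2 w1)
               (and (g v1 v2) (f v1 v2 w1))))
          (assignments 3)))

;; Quality: how many inputs does g weed out before the join?
(def rejected
  (count (remove (fn [[v1 v2 _]] (g v1 v2))
                 (assignments 3))))
#+END_SRC

Here =valid-prefilter?= is true and =rejected= is 4: the pre-filter throws away half of the inputs without ever reading =w1=.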

For purposes of this challenge, you need to write a program that can parse the following grammar:

#+BEGIN_EXAMPLE
formula := variable | literal | expression
variable := cheap | expensive
cheap := v[0-9]+
expensive := w[0-9]+
literal := "T" | "F"
expression := conjunction | disjunction | negation
conjunction := "(and" ws formula ws formula ")"
disjunction := "(or" ws formula ws formula ")"
negation := "(not" ws formula ")"
ws := " "+
#+END_EXAMPLE

Then write a program that takes a file containing expressions in this form, one per line, and outputs a file containing, for each input expression, a pre-filter that uses only the cheap variables.

Your entry is disqualified if, for any of your pre-filters g:

#+BEGIN_EXAMPLE
input.exists { x => f(x) != (g(x) && f(x)) } // i.e. must be correct
#+END_EXAMPLE

And correct pre-filters should minimize:

#+BEGIN_EXAMPLE
input.count { x => !g(x) }
#+END_EXAMPLE

All other things being equal, smaller pre-filters (measured by tree size of the expression) beat bigger ones. Style points are awarded for efficient computation and general cleverness. The grand prize is bragging rights and the satisfaction of a job well done.

* The Solution

Simple enough. I'll start by saying that at the time Peter issued the challenge I had NO idea how to solve a problem like this. How does one rip a boolean expression apart into TWO sub-expressions? In true [[http://www.amazon.com/gp/product/069111966X/ref%3Das_li_tl?ie%3DUTF8&camp%3D1789&creative%3D390957&creativeASIN%3D069111966X&linkCode%3Das2&tag%3Dtheroato201-20&linkId%3D4676I2A4I5RWW7U4][Polya]] style, I deferred that question. The first step was to write a parser. That was easy. Next I'd have to apply boolean transformations like ~(not (not a)) => a~ to try and simplify the original equation, hoping that might kill expensive variables. Beyond that, not sure.

I knew Oscar would be using Scala, and Peter'd code up some Common Lisp monstrosity, so I dusted off Leiningen and got back to the Clojure.

To be extra sexy about it, I wrote the exercise up in an org-mode file. Literate programming, baby. Code Is Data! Documentation Is Code!

Here's the =project.clj= file. For testing I'm using [[https://github.com/clojure/test.check][test.check]], a Clojure port of Haskell's [[https://hackage.haskell.org/package/QuickCheck][QuickCheck]]. This let me run my optimizer on randomly generated boolean expressions and saved me the pain in the ass of trying to construct edge cases. Because I'd been writing a bunch of Scala at the time and couldn't bear to lose my pattern matching, [[https://twitter.com/swannodette][David Nolen's]] [[https://github.com/clojure/core.match][core.match]] makes an appearance as well.

#+BEGIN_SRC clojure :tangle ./project.clj
(defproject io.samritchie/optimizer "0.1.0-SNAPSHOT"
:description "Boolean Optimizer in Clojure."
:url "https://github.com/sritchie/optimizer"
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"}
:dependencies [[org.clojure/core.match "0.3.0-alpha4"]]
:profiles {:provided
{:dependencies [[org.clojure/clojure "1.6.0"]]}
:dev
{:dependencies [[org.clojure/test.check "0.7.0"]]}})
#+END_SRC

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj :exports none
(ns optimizer.core
(:require [clojure.core.match :refer [match]]
[clojure.set :refer [subset? difference]]))
#+END_SRC

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj :exports none
(ns optimizer.core-test
(:use optimizer.core)
(:require [clojure.core.match :refer [match]]
[clojure.test :refer [deftest is]]
[clojure.test.check :as tc]
[clojure.test.check.clojure-test :refer [defspec]]
[clojure.test.check.generators :as gen]
[clojure.test.check.properties :as prop]))
#+END_SRC

** Parsing

The first step is to parse that grammar. Surprise surprise, Peter's grammar looks like Lisp! This means that we get to use Clojure's reader as our parser, saving us a few lines of code over the solutions those Strongly Typed fellows will have to implement. I used symbols to represent variables and literals, and lists prefixed with the symbols =and=, =or= and =not= to represent the compound expressions.
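As a quick illustration (a sketch, not code from the project), =clojure.edn/read-string= turns one line of the challenge grammar into nested lists of symbols. The helper name =parse-line= is made up for this example.

#+BEGIN_SRC clojure
(require '[clojure.edn :as edn])

(defn parse-line
  "Reads one formula from a string using the EDN reader."
  [s]
  (edn/read-string s))

(def parsed (parse-line "(and v1 (or w2 (not T)))"))

(first parsed)            ; the operator, read as the symbol `and`
(symbol? (second parsed)) ; variables and literals are read as symbols
(seq? (nth parsed 2))     ; nested expressions are read as lists
#+END_SRC

The parsed value is equal to the quoted form ~'(and v1 (or w2 (not T)))~, so the rest of the code can work on plain quoted lists.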

Alongside the grammar, I'll be writing generators for each expression type using [[https://github.com/clojure/test.check][test.check]]. Then, instead of unit tests, we can write "laws" for each function in the optimizer. test.check will use the generators to create hundreds of random boolean expressions and make sure that these laws hold for every huge expression it can come up with. This was enormously helpful.

*** Variables

Cheap variables start with =v=, expensive variables start with =w=.

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(def prefixes
"The set of valid variable prefixes."
#{\v \w})

(def prefix
"Returns the supplied symbol's first character."
(comp first name))

(defn cheap
"Generates a cheap variable using the supplied number."
[n]
(symbol (str \v n)))

(defn expensive
"Generates an expensive variable using the supplied number."
[n]
(symbol (str \w n)))
#+END_SRC

Now, as promised, we get to the test.check generators.

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
(def cheap-v (gen/fmap cheap gen/nat))
(def expensive-v (gen/fmap expensive gen/nat))
#+END_SRC

=gen/nat= is a generator that produces natural numbers. =gen/fmap= takes a function and an existing generator and produces a NEW generator by applying the function to all generated values. For example:

#+BEGIN_SRC clojure
optimizer.core-test> (gen/sample gen/nat 10)
(0 1 1 1 4 4 5 6 4 3)
optimizer.core-test> (gen/sample cheap-v 10)
(v0 v0 v1 v1 v1 v1 v1 v7 v3 v1)
optimizer.core-test> (gen/sample expensive-v 10)
(w0 w1 w1 w1 w0 w5 w5 w1 w4 w7)
#+END_SRC

=gen/one-of= samples randomly between a list of supplied generators:

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
(def variable (gen/one-of [cheap-v expensive-v]))

;; optimizer.core-test> (gen/sample variable 10)
;; (v0 w1 w0 v2 v4 w4 w5 w1 v3 v2)
#+END_SRC

*** Literals

=true= and =false= are represented by the literal symbols =T= and =F=:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(def literals #{'T 'F})
#+END_SRC

=gen/elements= creates a generator that chooses elements from some collection:

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
(def literal-gen (gen/elements literals))
#+END_SRC

We can use =gen/frequency= to build up a generator that spits out variables and literals, preferring variables with a 3:1 ratio.

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
(def non-compound
(gen/frequency
[[3 variable]
[1 literal-gen]]))
#+END_SRC

Let's round out variables and literals with a couple of validators, since we don't have a type system to help us out:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn variable?
"Returns true if the argument is a valid cheap or expensive
variable, false otherwise."
[x]
(and (symbol? x)
(contains? prefixes (prefix x))))

(def literal?
"Returns true if passed a literal, false otherwise."
(comp boolean literals))
#+END_SRC

*** Compound Expressions

A formula is a variable, a literal or an expression. Let's implement expression parsing. Conjunctions and disjunctions, or =AND=s and =OR=s, are both binary expressions. Negation, or =NOT=, is unary. These validators help us distinguish those cases and peel apart lists:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn unary? [exp]
(and (coll? exp)
(= 2 (count exp))))

(defn binary? [exp]
(and (coll? exp)
(= 3 (count exp))))

(def func
"Returns the function of the supplied boolean expression."
first)

(def args
"Returns the arguments of the supplied boolean expression."
rest)
#+END_SRC

Next, some functions to build and validate the various compound expressions. Conjunctions are lists of the form ~(and p q)~:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn AND?
"Returns true if the supplied expression is of the form
(and p q), false otherwise."
[exp]
(and (binary? exp)
(= 'and (func exp))))

(defn AND [a b] (list 'and a b))
#+END_SRC

Similarly, disjunctions are lists of the form ~(or p q)~:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn OR?
"Returns true if the supplied expression is of the form
(or p q), false otherwise."
[exp]
(and (binary? exp)
(= 'or (func exp))))

(defn OR [a b] (list 'or a b))
#+END_SRC

And negations are one-arg lists starting with the ~not~ symbol:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn NOT?
"Returns true if the supplied expression is of the form
(not p), false otherwise."
[exp]
(and (unary? exp)
(= 'not (func exp))))

(defn NOT
"If x is a negation, returns its argument, else returns the negation
of x."
[x]
(if (NOT? x)
(first (args x))
(list 'not x)))
#+END_SRC

The =NOT= constructor gets ahead of the game a little by implementing a simplification using the involution law:

#+BEGIN_EXAMPLE
(NOT (NOT p)) => p
#+END_EXAMPLE

If =NOT= is passed a form that's already a negation, it plucks that argument out rather than wrapping it up in a further negation.

Finally, a compound validator for expressions:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(def expr?
"Returns true if the supplied expression is a valid boolean
expression, false otherwise."
(some-fn AND? OR? NOT?))
#+END_SRC

*** Compound Generators

The compound expression generator is tricky because to generate anything interesting, it has to use itself. Luckily test.check has great documentation on [[https://github.com/clojure/test.check/blob/master/doc/intro.md#recursive-generators][writing recursive generators]] using =gen/recursive-gen=.

We'll write two different compound generators. The first one, =nested-binary=, will take one of the binary expression constructors (either =AND= or =OR=) and return a generator. This allows us to generate compound expressions of a single type.

The next, =expr=, will generate arbitrary expressions that conform to our grammar. We'll need both at various stages.

First, =nested-binary= and a helper function:

#+NAME: nested-binary
#+BEGIN_SRC clojure
(defn tuplefn
  "Takes a generator that spits out lists where the first item is a
  function. Returns a new generator that applies that function to the
  other items in the coll."
  [g]
  (letfn [(apply-tuple [[op & xs]] (apply op xs))]
    (gen/fmap apply-tuple g)))

(defn nested-binary
  "Takes a binary constructor (AND or OR) and returns a generator of
  those types of expressions."
  [f]
  (-> (fn [g]
        (tuplefn
         (gen/tuple (gen/return f) g g)))
      (gen/recursive-gen non-compound)))
#+END_SRC

=gen/recursive-gen= takes two arguments. The second argument is a seed generator for the leaves; every boolean expression has a literal or a variable at its leaves, so we use =non-compound=. The first argument is a function that takes an "inner generator" and returns a new overall generator. It's structured this way so that test.check can feed the generator back into itself.

=gen/return= just spits the supplied argument back out, and =gen/tuple= takes n generators and returns a generator of n-tuples, with each entry pulled from the corresponding generator. This in combination with =tuplefn= was the cleanest way I could find to build up a sort-of multi-argument =gen/fmap=.

Here's what =(nested-binary AND)= generates:

#+BEGIN_SRC clojure
optimizer.core-test> (last (gen/sample (nested-binary AND) 10))
(and (and (and v7 T) (and v6 F)) (and (and v3 w7) (and w5 w2)))
#+END_SRC

So much better than writing out examples by hand. I generated 10 samples and chose the last one because test.check generates bigger expressions as the sample size increases. This is so you don't get clobbered with huge examples if smaller ones will suffice to point out your error.

Writing a generator for any arbitrary expression is just as easy. The only difference is that instead of =(gen/return f)= we choose from =AND= and =OR= with =gen/elements=, and use =gen/one-of= to include negations of one argument in the mix as well.

#+NAME: compound-gen
#+BEGIN_SRC clojure
(def compound
  (fn [g]
    (tuplefn
     (gen/one-of
      [(gen/tuple (gen/elements [AND OR]) g g)
       (gen/tuple (gen/return NOT) g)]))))

(def expr
  "test.check generator for expressions."
  (gen/recursive-gen compound non-compound))
#+END_SRC

Sampling =expr= looks like this:

#+BEGIN_SRC clojure
optimizer.core-test> (last (gen/sample expr 100))
(or (and T v90) (not (or (or v90 v6) (not v98))))
#+END_SRC

** Solving

TODO: here's a solver.

#+NAME: solver
#+BEGIN_SRC clojure
(defn solve
  "Takes an expression and a map of variables -> boolean value."
  [e m]
  (letfn [(solve* [e]
            (match (if (expr? e) (vec e) e)
              'T true
              'F false
              ['and p q] (and (solve* p) (solve* q))
              ['or p q] (or (solve* p) (solve* q))
              ['not p] (not (solve* p))
              :else (m e)))]
    (solve* e)))
#+END_SRC

TODO: Some tests. Here's a solution of the expression generated above. I solved it myself in steps. The test makes sure that I got it right at every step. As you can see, =solve= is a lot more useful for comparing two expressions.

#+NAME: solver-test
#+BEGIN_SRC clojure
(deftest solve-test
(let [solve* #(solve % {'v90 false 'v6 false 'v98 true})]
(is (= true
(solve* '(or (and T v90) (not (or (or v90 v6)
(not v98)))))
(solve* '(or (and T v90) (not (or F F))))
(solve* '(or (and T v90) T))
(solve* 'T)))))
#+END_SRC
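Since comparing expressions is where an evaluator earns its keep, here is a self-contained sketch of a brute-force equivalence check. It does not use the =solve= function above: =evaluate=, =variables= and =equivalent?= are hypothetical helpers, written with plain destructuring so the block runs without core.match.

#+BEGIN_SRC clojure
(defn evaluate
  "Evaluates expression e against env, a map of variable -> boolean."
  [e env]
  (cond (= 'T e)    true
        (= 'F e)    false
        (symbol? e) (boolean (env e))
        :else (let [[op a b] e]
                (case op
                  and (and (evaluate a env) (evaluate b env))
                  or  (or (evaluate a env) (evaluate b env))
                  not (not (evaluate a env))))))

(defn variables
  "Returns the set of variables (non-literal symbols) in e."
  [e]
  (if (symbol? e)
    (if (#{'T 'F} e) #{} #{e})
    (into #{} (mapcat variables (rest e)))))

(defn equivalent?
  "True if a and b agree under every assignment of their variables."
  [a b]
  (let [vs (vec (into (variables a) (variables b)))]
    (every? (fn [bits]
              ;; Bit i of `bits` is the value of the i-th variable.
              (let [env (zipmap vs (map #(bit-test bits %)
                                        (range (count vs))))]
                (= (evaluate a env) (evaluate b env))))
            (range (bit-shift-left 1 (count vs))))))

;; De Morgan's law holds for every assignment:
(equivalent? '(not (and v1 w1)) '(or (not v1) (not w1)))
#+END_SRC

This enumerates all 2^n assignments, so it is only practical for small expressions, but that is enough to sanity-check a rewrite rule before trusting it.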

** Splitting Expressions

The original challenge was to pull a boolean expression out into two expressions, such that

#+BEGIN_EXAMPLE
f(v1,v2,..vn) = g(v1,v2,...vk) && f(v1,v2,..vn)
#+END_EXAMPLE

This restriction makes a lot more sense here in 2015, now that I've heard of "[[http://en.wikipedia.org/wiki/Conjunctive_normal_form][Conjunctive Normal Form]]", or CNF.

*** Conjunctive Normal Form

A CNF expression =AND=s together a bunch of "clauses"; a clause can be a disjunction or a negation (an =OR= or a =NOT=), a literal or a variable. Clauses can nest inside each other, but =AND=s only exist at the top level. This makes it easy to tear apart a boolean expression into two! Just filter the list of top-level conjunctions and remove every conjunct that has expensive variables. The entire expression is =f= and this filtered list is =g=.

Here are some examples of CNF from the [[http://en.wikipedia.org/wiki/Conjunctive_normal_form][wiki page]]:

#+BEGIN_SRC clojure
(and (not a) (or b c))
(and (and (or a b)
(or (or (not b) c)
(not d)))
(or d (not e)))
(and a b)

;; Because there's only one clause, this is like (and T (or a b))
(or a b)
#+END_SRC

These expressions break the CNF rules:

#+BEGIN_SRC clojure
(not (and b c)) ;; top level negation
(or c (and a b)) ;; and inside or
#+END_SRC

We can translate these requirements into code. A leaf in CNF is either a variable, a literal, or a negation of another leaf:

#+NAME: is-cnf-literal
#+BEGIN_SRC clojure
(defn cnf-literal? [p]
(boolean
(or (variable? p)
(literal? p)
(if (NOT? p)
(cnf-literal?
(second p))))))
#+END_SRC

Nice. A clause in CNF is either a literal (since we might have an expression like =(and T ,,,)=) or a disjunction of other clauses:

#+NAME: is-cnf-clause
#+BEGIN_SRC clojure
(defn cnf-clause? [p]
(or (cnf-literal? p)
(and (OR? p) (every? cnf-clause? (args p)))))
#+END_SRC

This is going to let through anything but an =AND=, as expected. With these functions we can write a top-level =cnf?= checker. An expression is in CNF if it's either a literal, a clause, or a conjunction of clauses.

#+NAME: is-cnf
#+BEGIN_SRC clojure
(defn cnf? [p]
(or (cnf-literal? p)
(cnf-clause? p)
(and (AND? p) (every? cnf-clause? (flatten-and p)))))
#+END_SRC

These functions will come in handy for writing =test.check= laws for verifying that our boolean transformations actually pop out CNF forms.

#+NAME: cnf-statement
#+BEGIN_SRC clojure :exports none
<<is-cnf-literal>>
<<is-cnf-clause>>
<<is-cnf>>
#+END_SRC

Every boolean expression can be converted to CNF through the mechanical transformations we'll implement below. [[http://www.cs.jhu.edu/~jason/tutorials/convert-to-CNF.html][This page]] does a nice job of describing the algorithm.

*** Simplifying

Along the way to CNF the optimizer will also try to simplify the incoming boolean expressions. If some simplification kills an expensive variable, great.

There are a bunch of boolean simplification laws (see [[http://www.nayuki.io/page/boolean-algebra-laws][this page]] for a nice summary) that will lead toward CNF and potentially kill terms:

- Involution Law: ~(not (not a)) == a~
- Identity Laws: ~(and a T) == a~, ~(or a F) == a~
- Idempotent Laws: ~(or a a) == a~, ~(and a a) == a~
- Complement Laws: ~(and a (not a)) == F~, ~(or a (not a)) == T~,
  ~(not F) == T~, ~(not T) == F~
- Annihilation: ~(or a T) == T~, ~(and a F) == F~
- Absorption Law: ~(and p (or p q)) == p~, ~(or p (and p q)) == p~

We'll also want to apply [[http://en.wikipedia.org/wiki/De_Morgan%27s_laws][DeMorgan's Law]] in one direction to move negations deeper into the expression.

- ~(not (and p q)) == (or (not p) (not q))~
- ~(not (or p q)) == (and (not p) (not q))~

The =simplify= function we want will take a valid boolean expression and return a valid boolean expression. Here's a first try, using [[https://github.com/clojure/core.match][core.match]]'s pattern matching to destructure our boolean expressions. Take a look at the whole thing before we break it down.

#+NAME: simplify
#+BEGIN_SRC clojure
(defn simplify
  "Returns a simplified expression in Conjunctive Normal
  Form."
  [exp]
  (match (if (expr? exp) (vec exp) exp)
    ;; AND and OR simplification
    <<binary-simple>>

    ;; NOT complement laws:
    <<not-simple>>

    ;; (NOT (NOT p)) => p (involution law)
    <<involution>>

    ;; DeMorgan's Laws
    <<demorgan>>

    <<simplify-negation>>

    ;; Returns variables and literals unchanged.
    :else exp))
#+END_SRC

Make sense? If the argument's a valid expression via ~(expr? exp)~, turn it into a vector to make pattern matching look cleaner. Otherwise leave it alone.

If we have a conjunction or disjunction, we'll use the helper functions =simplify-and= and =simplify-or= to apply the simplification laws from above to the recursively-simplified expression arguments. (If you don't know how to code something, functional programming is brilliant at letting you kick the problem down the road into another function.)

#+NAME: binary-simple
#+BEGIN_SRC clojure
['and p q] (simplify-and (simplify p) (simplify q))
['or p q] (simplify-or (simplify p) (simplify q))
#+END_SRC

If the expression is a negation, we can apply a few laws directly inside the pattern match. Negating a literal gives back a literal:

#+NAME: not-simple
#+BEGIN_SRC clojure
['not 'T] 'F
['not 'F] 'T
#+END_SRC

If the negation has another negation inside of it, we can un-nest the =p= of =(not (not p))= and recursively simplify it. (Note that core.match needs that internal =(,,, :seq)= wrapper to match a list).

#+NAME: involution
#+BEGIN_SRC clojure
['not (['not p] :seq)] (simplify p)
#+END_SRC

Otherwise we just simplify the argument and return the negation of that:

#+NAME: simplify-negation
#+BEGIN_SRC clojure
['not x] (NOT (simplify x))
#+END_SRC

DeMorgan's laws are easy to match as well. If we see ~(not (and p q))~ or ~(not (or p q))~, we apply the law and simplify the resulting form.

#+NAME: demorgan
#+BEGIN_SRC clojure
['not (['and p q] :seq)] (simplify (OR (NOT p) (NOT q)))
['not (['or p q] :seq)] (simplify (AND (NOT p) (NOT q)))
#+END_SRC

The =:else= clause bounces literals and variables back out without any transformation.

We can use =test.check= to write a law for this function right away:

#+NAME: cnf-law
#+BEGIN_SRC clojure
(defspec cnf-law 100
(prop/for-all [e expr] (cnf? (simplify e))))
#+END_SRC

This law states that for every =e= generated by the =expr= generator, the result of =(simplify e)= is in Conjunctive Normal Form. Every time the test suite runs, this law will generate 100 random expressions and throw them at our function. Every run makes us feel a little more secure that the implementation is correct. How amazing is that?

To get =simplify= working and passing this law we have to write =simplify-and= and =simplify-or=.

*** Flattening

After thinking about this for a while, it became clear that simplifying binary expressions was a major pain in the ass. Take the complement law:

#+BEGIN_EXAMPLE
(and a (not a)) => F
#+END_EXAMPLE

It's really hard to find this pattern with deep nesting of =AND= expressions:

#+BEGIN_SRC clojure
(and (and a b) (and c (not a)))
#+END_SRC

It's much easier to deal with the simplification laws with some way of flattening out those binary expressions. We need a way of transforming the above expression into

#+BEGIN_SRC clojure
(and a b c (not a))
#+END_SRC

Then it becomes easy to perform operations on the set of all conjunctions. Because we'll need to flatten =AND= and =OR= trees, I wrote a =flatten-binary= function that takes a predicate to see if some expression can be flattened. I can't express it without a type system, but pred has to be =AND?= or =OR?= from above.

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn flatten-binary
  "Returns a function that takes a binary expression and flattens it
  down into a variadic version. Returns the arguments to the variadic
  version.

  If the initial expression doesn't pass the checker, returns a
  singleton list with only that element."
  [pred]
  (fn flatten* [e]
    (if-not (pred e)
      [e]
      (mapcat (fn [x]
                (if (pred x)
                  (flatten* x)
                  [x]))
              (rest e)))))
#+END_SRC

The returned function takes an expression. If that expression does NOT pass the predicate - say the predicate is =AND?= and you pass in =(or a b)= - it returns a singleton list with that argument.

If it does pass the predicate, every argument to the expression gets flattened recursively using that same predicate and concatenated together. Now we can make specific versions for =AND?= and =OR?=:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(def flatten-and (flatten-binary AND?))
(def flatten-or (flatten-binary OR?))
#+END_SRC
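Here is the earlier example run through a flattener. To keep the snippet self-contained it uses =flatten-op=, a simplified stand-in written for this illustration that takes the operator symbol directly instead of a predicate; it is not the project's =flatten-binary=.

#+BEGIN_SRC clojure
(defn flatten-op
  "Simplified stand-in for flatten-binary: flattens nested lists
  headed by the symbol op into a flat sequence of arguments."
  [op e]
  (if (and (seq? e) (= op (first e)))
    (mapcat #(flatten-op op %) (rest e))
    [e]))

(flatten-op 'and '(and (and a b) (and c (not a))))
;; => (a b c (not a))

;; Anything that is not an `and` comes back as a singleton:
(flatten-op 'and '(or a b))
;; => [(or a b)]
#+END_SRC

With the arguments side by side, spotting =a= next to =(not a)= becomes a simple scan rather than a tree search.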

TODO describe these damned laws.

#+NAME: flatten-laws
#+BEGIN_SRC clojure
;; Make sure that flatten-and kills all the nested ands, leaving only
;; leaves. nested-binary seeds from non-compound, so a leaf can be a
;; literal as well as a variable.
(defspec flatten-and-laws 100
  (prop/for-all
   [e (nested-binary AND)]
   (let [flattened (flatten-and e)]
     (and (AND? e)
          (every? (some-fn variable? literal?) flattened)))))

;; Same thing for or:
(defspec flatten-or-laws 100
  (prop/for-all
   [e (nested-binary OR)]
   (let [flattened (flatten-or e)]
     (and (OR? e)
          (every? (some-fn variable? literal?) flattened)))))
#+END_SRC

*** Expansion

Flattening is great for simplification, but to stick to the grammar we'll need to convert a flattened expression back into a nested form. The beautifully-named =op->binary= does this by folding all the expression arguments together using =AND= or =OR=. If the argument list is empty, you get the literal ='T= back out.

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn op->binary
"Moves the `op` instances back into binary form. If no ops are
provided, returns 'T."
[op]
(fn [[x & xs]]
(reduce op (or x 'T) xs)))
#+END_SRC

Specialized versions, like before:

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(def and->binary (op->binary AND))
(def or->binary (op->binary OR))
#+END_SRC
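A quick look at what the fold produces. The snippet redefines =AND= and uses =and->binary*=, a standalone copy of the =AND= specialization made for this illustration, so that it runs on its own.

#+BEGIN_SRC clojure
(defn AND [a b] (list 'and a b))

(defn and->binary*
  "Folds a flat argument list back into nested binary ANDs.
  Mirrors op->binary specialised to AND."
  [[x & xs]]
  (reduce AND (or x 'T) xs))

(and->binary* '(a b c (not a)))
;; => (and (and (and a b) c) (not a))

(and->binary* '())
;; => T
#+END_SRC

Note that the result is always left-nested, so it need not have the same shape as the expression that was flattened. It does flatten back to the same argument list, which is the property worth testing.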

TODO: that and->binary reverses flatten-and.

#+NAME: expansion-laws
#+BEGIN_SRC clojure
(defspec and->binary-laws 100
(prop/for-all
[e (nested-binary AND)]
(let [flattened (flatten-and e)]
(= flattened (flatten-and (and->binary flattened))))))

(defspec or->binary-laws 100
(prop/for-all
[e (nested-binary OR)]
(let [flattened (flatten-or e)]
(= flattened (flatten-or (or->binary flattened))))))
#+END_SRC

*** Absorption Law

Flattening binary expressions makes it MUCH easier to handle the remaining simplification laws:

#+BEGIN_EXAMPLE
- Identity Laws: ~(and a T) == a~, ~(or a F) == a~
- Idempotent Laws: ~(or a a) == a~, ~(and a a) == a~
- Complement Laws: ~(and a (not a)) == F~, ~(or a (not a)) == T~
- Annihilation: ~(or a T) == T~, ~(and a F) == F~
- Absorption Law: ~(and p (or p q)) == p~, ~(or p (and p q)) == p~
#+END_EXAMPLE

We can handle all of these except the absorption law by scanning across the flattened expression arguments. The absorption law is trickier. To collapse conjunctions, for example, we need to

- compare every =AND= argument against every other argument.
- flatten those args into disjuncts using =flatten-or=, and
- check if either flattened set of disjuncts is a subset of the other.

If it is, that means we have a situation like =(and p (or p q))=. Whenever we find a clash like that, we need to remove the clause that contained the clash (=(or p q)= in this example). If not, we move on.
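The subset check is easy to try in isolation. In this sketch each conjunct is already written as its set of disjuncts, and =redundant= is a hypothetical helper (not the implementation that follows) that reports which conjuncts can be dropped.

#+BEGIN_SRC clojure
(require '[clojure.set :refer [subset?]])

(defn redundant
  "Given disjunct sets (one per conjunct), returns the sets that are
  supersets of some other, different set and can therefore be dropped."
  [clauses]
  (set (for [a clauses
             b clauses
             :when (and (not= a b) (subset? a b))]
         b)))

;; (and p (or p q) (or r s)): p makes (or p q) redundant.
(redundant [#{'p} #{'p 'q} #{'r 's}])
;; => #{#{p q}}
#+END_SRC

This version compares every ordered pair, which is wasteful but keeps the idea visible: a conjunct whose disjuncts include all the disjuncts of another conjunct adds no information.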

Here's the implementation:

#+NAME: absorption
#+BEGIN_SRC clojure
(defn absorption-law
  "let lawHandled = case `flatten-fn` of
     `flatten-or` -> p AND (p OR q) == p
     `flatten-and` -> p OR (p AND q) == p

  Absorption law, from: http://www.nayuki.io/page/boolean-algebra-laws

  The input exprs must all be conjuncts (arguments of an AND) if you
  pass `flatten-or` and all disjuncts (arguments of an OR) if you pass
  `flatten-and`.

  Returns a sequence of the conjuncts (or disjuncts) that survive."
  [flatten-fn exprs]
  (let [exprs (set exprs)
        args* (comp set flatten-fn)]
    (->> (for [[l r] (combinations 2 exprs)
               :let [ls (args* l)
                     rs (args* r)]]
           (cond (subset? ls rs) #{r}
                 (subset? rs ls) #{l}
                 :else #{}))
         (reduce into #{})
         (difference exprs)
         (seq))))
#+END_SRC

The implementation uses [[https://github.com/amalloy][amalloy's]] combinations function to generate every pair of arguments from the input set. Here's that code:

#+NAME: combinations
#+BEGIN_SRC clojure
(defn combinations
  "Thanks to amalloy: https://gist.github.com/amalloy/1042047"
  [n coll]
  (if (= 1 n)
    (map list coll)
    (lazy-seq
     (when-let [[head & tail] (seq coll)]
       (concat (for [x (combinations (dec n) tail)]
                 (cons head x))
               (combinations n tail))))))
#+END_SRC

(=lazy-seq= can be confusing, so I'd recommend skipping right on over that unless you're in the mood for puzzle code.)
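If the recursion is hard to follow, it may help to know that only the n = 2 case is used here, and those pairs can also be produced with a nested =for=. The function =pairs= below is an alternative written for this illustration, not amalloy's code.

#+BEGIN_SRC clojure
(defn pairs
  "All unordered pairs of elements at distinct positions in coll."
  [coll]
  (let [v (vec coll)]
    (for [i (range (count v))
          j (range (inc i) (count v))]
      (list (v i) (v j)))))

(pairs '(a b c))
;; => ((a b) (a c) (b c))
#+END_SRC

Three elements give three pairs, and in general n elements give n(n-1)/2, which is why the absorption check is the expensive part of simplification.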

The =absorption-law= function implements the algorithm we discussed above. For every combination, we generate a set of clauses to remove from the supplied expression arguments. If the function sees some =p= and some =(or p q)=, that'll generate =#{'(or p q)}=. If not, it returns an empty set.

After generating all of these exclusions, =(reduce into #{})= merges all the sets into one big exclusion set. =(difference exprs)=, with that exclusion set threaded in as its last argument, removes all exclusions from the original set of expressions.

That's it! Bouncing the arguments in and out of set form takes care of the idempotent laws too, since ~(set '(p p)) => #{'p}~.

#+NAME: absorption-law
#+BEGIN_SRC clojure :exports none
<<combinations>>

<<absorption>>
#+END_SRC

*** Final Simplifications

The remaining simplification laws don't need that pairwise comparison craziness required by the absorption law. Here's what's left:

#+BEGIN_EXAMPLE
- Identity Laws: ~(and a T) == a~, ~(or a F) == a~
- Complement Laws: ~(and a (not a)) == F~, ~(or a (not a)) == T~
- Annihilation: ~(or a T) == T~, ~(and a F) == F~
#+END_EXAMPLE

We can deal with each of these rules by scanning across a list of flattened arguments, building up a set of all arguments we've seen and comparing each new argument to that set. If we see

- an identity (=F= for =OR=, =T= for =AND=), ignore it
- an annihilator (=T= for =OR=, =F= for =AND=), short circuit and return the annihilator.
- =p= and the set contains =(not p)=, short circuit and return the annihilator

Otherwise just keep scanning. At the end, pass the arguments into =absorption-law= for a final simplification.

Here's the implementation:

#+NAME: simplify-binary
#+BEGIN_SRC clojure
(defn simplify-binary
  "Returns a function that simplifies binary expressions.

  Rules handled:

  Annihilator: (p OR T) = T, (p AND F) = F
  Identity: (p AND T) = p, (p OR F) = p
  Idempotence: (p AND p) = (p OR p) = p (accumulating into a set)
  Complement: (p AND (NOT p)) = F, (p OR (NOT p)) = T

  The flattening implementation depends on associativity and
  commutativity."
  [{:keys [ctor annihilator id flatten-fn tear-fn]}]
  (let [zip-fn (op->binary ctor)]
    (fn attack
      ([l r] (attack (flatten-fn (ctor l r))))
      ([xs]
       (letfn [(absorb [acc p]
                 (cond (= p id) acc
                       (or (= p annihilator)
                           (acc (NOT p)))
                       (reduced [annihilator])
                       :else (conj acc p)))]
         (let [args (->> (reduce absorb #{} xs)
                         (absorption-law tear-fn))]
           ;; If every argument was the identity, nothing is left.
           ;; Return the identity itself: zip-fn's default of 'T
           ;; would be wrong for OR, e.g. (or F F).
           (if (seq args) (zip-fn args) id)))))))
#+END_SRC

Boom! Like most of our functions so far, we'll have to customize =simplify-binary= to use it for =AND= and =OR= simplification.

The function returned by =simplify-binary= can take a flattened list of expressions or a binary expression, to stick to our grammar. If passed two arguments, it sews them into a binary expression, flattens that and calls itself recursively.

The inner =absorb= function takes a set that accumulates previous expressions and a new expression and performs the comparisons we decided on above.

To make an =AND= simplifier we configure =simplify-binary= with the proper annihilator, identity, constructor and flattening functions. You can tell that I got a little lazy with naming here. =flatten-fn= flattens =AND= expressions, of course, and =tear-fn= is for "tearing" subexpressions apart inside of =absorption-law=. All these nested functions get confusing.

#+NAME: simplify-and
#+BEGIN_SRC clojure
(def simplify-and
"Returns a function that simplifies an AND expression. Returns an
expression in conjunctive normal form."
(simplify-binary
{:ctor AND
:annihilator 'F
:id 'T
:flatten-fn flatten-and
:tear-fn flatten-or}))
#+END_SRC

Here's =simplify-binary= configured for =OR= statements:

#+NAME: simplify-or-star
#+BEGIN_SRC clojure
(def simplify-or*
"Returns a function that simplifies an OR expression."
(simplify-binary
{:ctor OR
:id 'F
:annihilator 'T
:flatten-fn flatten-or
:tear-fn flatten-and}))
#+END_SRC

I stuck a star on the end of this binding because the =OR= case is actually a little more complicated. Remember, =simplify= has to return expressions in conjunctive normal form. As written, =simplify-or*= would bounce disjunctions back out, leaving an =OR= at the top level.

The solution is to turn the =OR= into CNF by applying the Distributive Law:

- ~(or (and a b) (and c d)) == (and (or a c) (or a d) (or b c) (or b d))~

passing each new sub =OR= expression into =simplify-or*=, and then passing the whole returned monster back into =simplify-and=. (I allowed the =simplify-binary= return function to take a set of arguments directly so we could do this instead of sewing the new =OR= statements back up into a big binary =AND=.)

Clojure's =for= comprehension makes this easy to express:

#+NAME: simplify-or
#+BEGIN_SRC clojure
(defn simplify-or
"Applies the distributive law to convert the OR into CNF, then
applies the AND simplifications."
[l r]
(simplify-and
(for [l (flatten-and l)
r (flatten-and r)]
(simplify-or* l r))))
#+END_SRC

Flatten the left and right expressions into a list of conjuncts, combine each pairwise, simplify the new =OR=s then simplify the whole thing again. Distributing like this is going to blow up the size of the expression, but that's okay for now. Once we strip out expensive variables we can reverse the distributive law and shrink our expression down again.
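
The pairwise step can be sketched on its own. This version assumes the conjuncts are already flattened into plain vectors and skips the simplification passes. =distribute= is a hypothetical name for this illustration.

#+BEGIN_SRC clojure
(defn distribute
  "Hypothetical helper: pairs every left conjunct with every right
  conjunct, as the distributive law requires."
  [left-conjuncts right-conjuncts]
  (for [l left-conjuncts
        r right-conjuncts]
    (list 'or l r)))

(distribute '[a b] '[c d])
;; => ((or a c) (or a d) (or b c) (or b d))

;; m conjuncts on the left and n on the right yield m * n clauses:
(count (distribute '[a b c] '[d e f g]))
;; => 12
#+END_SRC

The multiplicative growth in the second call is the blow-up mentioned above.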

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj :exports none
<>
<>
<>
<>
<>
<>
<>
#+END_SRC

That completes the implementation of the original =simplify= function:

#+BEGIN_SRC clojure
<>
#+END_SRC

Now we have a way of transforming every expression into a simplified expression in Conjunctive Normal Form. Winning so hard.

*** Killing expensive variables

The final step of the puzzle is to filter our initial expression =f= down to the subexpression =g= such that:

#+BEGIN_EXAMPLE
f(v1,v2,...,vn) = g(v1,v2,...,vk) && f(v1,v2,...,vn)
#+END_EXAMPLE

Now that we can wrangle =f= into conjunctive normal form, all we need is a way to check if a conjunct has only cheap variables. Then we can flatten our simplified expression, filter using this proposed =cheap?= function and sew the conjuncts back up into a binary expression. The resulting pre-filter =g= can be pushed down onto one side of a database query. Here's our template:

#+NAME: pushdown
#+BEGIN_SRC clojure
(defn pushdown-only [exp]
(and->binary
(filter cheap? (flatten-and (simplify exp)))))
#+END_SRC
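
Why is dropping conjuncts safe? An =AND= with fewer clauses can only be true more often, so whenever =f= holds, the filtered =g= holds too. Here is a rough standalone sketch of the filtering step. It assumes cheap variables start with =v=, and it uses a hypothetical stand-in predicate, =only-v-variables?=, in place of the real cheapness check.

#+BEGIN_SRC clojure
(def operators
  "Symbols that are not variables in this sketch."
  '#{and or not T F})

(defn only-v-variables?
  "Hypothetical stand-in for a cheapness check: true when every
  variable in the clause has a name starting with v."
  [clause]
  (->> (flatten [clause])
       (remove operators)
       (every? #(= \v (first (name %))))))

;; The conjuncts of (and (or v1 v2) (or w1 v3) v4 (not w2)):
(filter only-v-variables? '[(or v1 v2) (or w1 v3) v4 (not w2)])
;; => ((or v1 v2) v4)
#+END_SRC

Note that =(or w1 v3)= is thrown away whole. A clause that mentions even one expensive variable can't be evaluated on the cheap side.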

Let's implement =cheap?=. To do this, we need to check if all of the leaves of the expression are either literals or cheap variables. The following function takes a predicate and returns a "checker" function that can be applied to a boolean expression. The checker does a depth-first walk on the nodes of the tree, calling =pred= at every step. If =pred= returns true for a node (or the node is a literal), that branch short-circuits; otherwise the node has to be an expression whose arguments all pass the checker.

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(defn make-checker
"Takes a predicate that checks the leaves. Optionally takes an
`else` function called if an invalid expression is passed in."
([pred] (make-checker pred (fn [_] false)))
([pred else]
(fn recurse [exp]
(boolean
(cond (or (pred exp) (literal? exp)) true
(expr? exp) (every? recurse (args exp))
:else (else exp))))))
#+END_SRC
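
Here is a small usage sketch showing both arities. So that it runs on its own, it repeats =make-checker= and supplies hypothetical stand-ins for =literal?=, =expr?= and =args=. These stand-ins assume expressions are quoted lists headed by =and=, =or= or =not=, which may differ from the real grammar helpers.

#+BEGIN_SRC clojure
;; Hypothetical stand-ins for the grammar helpers.
(defn literal? [x] (contains? '#{T F} x))
(defn expr? [x] (and (seq? x) (contains? '#{and or not} (first x))))
(def args rest)

;; Same definition as above, repeated so the block is self-contained.
(defn make-checker
  ([pred] (make-checker pred (fn [_] false)))
  ([pred else]
   (fn recurse [exp]
     (boolean
      (cond (or (pred exp) (literal? exp)) true
            (expr? exp) (every? recurse (args exp))
            :else (else exp))))))

;; One-argument arity: anything that is not v1 or a literal fails.
(def only-v1? (make-checker #(= 'v1 %)))

(only-v1? '(and v1 (or T v1))) ;; => true
(only-v1? '(and v1 (not w1)))  ;; => false

;; Two-argument arity: record the leaf that failed.
(def rejected (atom []))
(def logging-v1?
  (make-checker #(= 'v1 %)
                (fn [leaf] (swap! rejected conj leaf) false)))

(logging-v1? '(and v1 (not w1))) ;; => false
@rejected                        ;; => [w1]
#+END_SRC

Because =every?= stops at the first failure, the =else= hook only ever sees the first offending leaf on a given walk.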

Now =cheap?= is easy. If the node is a variable, check that it starts with the cheap prefix. (An =expensive?= expression is any expression that's not cheap, such as a fully-expensive expr or an expr with mixed variables.)

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(def cheap?
"Returns true if the supplied expression contains only cheap
variables, false otherwise."
(make-checker
(fn [x]
(if (variable? x)
(= \v (prefix x))))))

(def expensive?
"Returns true if the supplied expression is not cheap, i.e. it
contains at least one expensive variable, false otherwise."
(complement cheap?))
#+END_SRC

TODO: And that's it! Here are some examples:

#+NAME: cheap-expression-test
#+BEGIN_SRC clojure
(def valid?
"Returns true if the supplied expression is a valid boolean
expression, false otherwise. The test is applied recursively down to
all subforms."
(make-checker
variable?
#(println "Subexpression is invalid: " %)))

(deftest cheap-expression-test
(let [mixed-exp '(and (or w1 v1) v2)
cheap-exp '(and (or v1 v2) v3)]
(is (= mixed-exp
(AND (OR (expensive 1)
(cheap 1))
(cheap 2))))
(is (cheap? cheap-exp))
(is (expensive? mixed-exp))
(is (and (valid? cheap-exp)
(valid? mixed-exp)))))
#+END_SRC

** Factoring

TODO: Factoring reverses the explosion we got from distributing the ORs to get into CNF.

#+BEGIN_SRC clojure :tangle src/optimizer/core.clj
(def separate (juxt filter remove))

(defn factor
"Reverse of the distributive property:

(and (or p q) (or p z)) = (or p (and q z))"
[cnf-exp]
(letfn [(max-factor [ors]
(->> (apply concat ors)
(frequencies)
(sort-by (comp - val))
(first)))
(factor* [clauses]
(let [flat-clauses (map flatten-or clauses)
[shared-exp n] (max-factor flat-clauses)]
(and->binary
(if (= n 1)
clauses
(let [factorable? (partial some #{shared-exp})
[haves have-nots] (separate factorable? flat-clauses)
conjuncts (for [clause haves :when (not= clause [shared-exp])]
(or->binary (remove #{shared-exp} clause)))]
;; If you can't pull the shared expression out of 2
;; or more subexpressions, abort.
(if (< (count conjuncts) 2)
clauses
(let [factored (OR shared-exp (factor* conjuncts))]
(if-let [remaining (not-empty (map or->binary have-nots))]
[(factor* remaining) factored]
[factored]))))))))]
(factor*
(flatten-and cnf-exp))))

(def pushdown
(comp factor pushdown-only))
#+END_SRC
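
A single factoring step is easier to follow on a simplified data model. The sketch below is not the implementation above: it assumes a non-empty sequence of clauses, each one a set of literals read as an =OR=, and it performs one extraction without recursing. =factor-once= is a hypothetical name.

#+BEGIN_SRC clojure
(defn factor-once
  "clauses is a non-empty seq of sets of literals, read as an AND of
  ORs. Pulls the most frequent literal out of the clauses that
  contain it."
  [clauses]
  (let [[shared n] (apply max-key val (frequencies (mapcat seq clauses)))]
    (if (< n 2)
      ;; Nothing appears twice, so there is nothing to factor.
      {:factored nil :rest (vec clauses)}
      (let [{haves true have-nots false}
            (group-by #(contains? % shared) clauses)]
        {:factored [shared (map #(disj % shared) haves)]
         :rest     (vec have-nots)}))))

;; (and (or p q) (or p z) r)
(factor-once '[#{p q} #{p z} #{r}])
;; => {:factored [p (#{q} #{z})], :rest [#{r}]}
;; read as (and (or p (and q z)) r)

(factor-once '[#{a} #{b}])
;; => {:factored nil, :rest [#{a} #{b}]}
#+END_SRC

Picking the most frequent literal first is a greedy choice. It shrinks the expression quickly but isn't guaranteed to find the smallest possible result.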

** Tests

To meaningfully compare this stuff we need a way of checking if two expressions are equal.

TODO: Talk about brute force, how we get an explosion, why we need =sized-expr=.

#+NAME: expr-equal
#+BEGIN_SRC clojure
(defn variables
"Returns a set of all unique variables in the supplied expression."
[e]
(let [e (if (expr? e) (flatten e) [e])]
(set (filter variable? e))))

(defn sized-expr
"Takes a limit on the number of distinct variables in the generated
expression and returns a generator whose expressions stay strictly
below that limit."
[variable-limit]
(gen/such-that #(< (count (variables %))
variable-limit)
expr))

(defn cartesian-prod
"Generates the cartesian product of all the input sequences."
[colls]
(if (empty? colls)
'(())
(for [x (first colls)
more (cartesian-prod (rest colls))]
(cons x more))))

(defn variable-map
"Returns a sequence of maps of variable -> Boolean assignment. The
returned number of maps is equal to 2^n, where n is the number of
variables."
[vs]
(let [vs (vec vs)
c (count vs)]
(map (partial zipmap vs)
(cartesian-prod
(repeat c [true false])))))

(defn expr-variables
"Returns a sequence of maps of the variables that appear in any of
the exprs -> boolean combinations."
[& exprs]
(variable-map (distinct (mapcat variables exprs))))

(defn equal?
"Are the two expressions equal for every possible input?"
[e1 e2]
(every? (fn [m]
(= (solve e1 m)
(solve e2 m)))
(expr-variables e1 e2)))
#+END_SRC
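
The brute-force idea is just a truth table. As a standalone sketch, here it is with plain Clojure functions standing in for expressions, so it doesn't depend on =solve=. =assignments= and =equal-fns?= are hypothetical names for this illustration.

#+BEGIN_SRC clojure
(defn assignments
  "All 2^n maps from the distinct variables in vs to true/false."
  [vs]
  (reduce (fn [acc v]
            (for [m acc
                  b [true false]]
              (assoc m v b)))
          [{}]
          (distinct vs)))

(defn equal-fns?
  "True when f and g agree on every assignment of vs."
  [f g vs]
  (every? #(= (f %) (g %)) (assignments vs)))

(count (assignments '[v1 v2 w1])) ;; => 8
(count (assignments '[v1 v2 v1])) ;; => 4, duplicates collapse

;; De Morgan: (not (and a b)) == (or (not a) (not b))
(equal-fns? (fn [{:syms [a b]}] (not (and a b)))
            (fn [{:syms [a b]}] (or (not a) (not b)))
            '[a b])
;; => true
#+END_SRC

Each extra variable doubles the number of rows to check, which is why the generated expressions need a cap on their variable count.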

Finally we can write a law for =simplify=. Simplifying an expression yields an expression equal to the original expression:

#+NAME: simplify-laws
#+BEGIN_SRC clojure
(defspec simplify-laws 100
(prop/for-all
[e (sized-expr 7)]
(let [s (simplify e)]
(equal? e s))))
#+END_SRC

Here are some more basic tests:

#+NAME: simplify-tests
#+BEGIN_SRC clojure
(deftest simplify-tests
(let [example-expression '(or (and (and v1 (or v2 v3)) (not w1)) F)]
"Reduce away the or F:"
(is (equal? example-expression (simplify example-expression)))

"and F == F"
(is (equal? 'F '(and (and (and v1 (or v2 v3)) (not w1)) F)))

"No reduction..."
(is (equal? '(and (or w1 v1) v2)
(simplify '(and (or w1 v1) v2))))

"(or a a) => a"
(is (equal? '(and w1 v2)
(simplify '(and (or w1 w1) v2))))))
#+END_SRC

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
<>
<>
<>
<>
<>
<>
<>
<>
<>
#+END_SRC

Factoring tests. Simplifying then factoring shouldn't mess with the equality of the boolean expressions.

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
(defspec factor-laws 100
(prop/for-all
[e (sized-expr 7)]
(let [s (simplify e)
f (factor s)]
(equal? s f))))

;; pushing
(defspec cheap-laws 100
(prop/for-all
[e (gen/such-that expensive? expr)]
(let [p (pushdown-only e)
f (factor p)]
(and (cheap? p)
(cheap? f)))))
#+END_SRC

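And the final law! The pushed-down pre-filter returns true whenever the original would, and false as often as it can. The property below checks the first half: wherever the pre-filter is false, the original is false too.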

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
(defspec prefilter-correctness-law 100
(prop/for-all
[e (sized-expr 8)]
(let [simplified (pushdown e)]
(every? (fn [m]
;; !simplified => !e
;; !(!simplified) OR !e
;; simplified OR !e
(or (solve simplified m)
(not (solve e m))))
(expr-variables e simplified)))))
#+END_SRC
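
The check inside that property is logical implication, =e => simplified=, written as =(or simplified (not e))=. Here is a hand-worked standalone sketch with ordinary functions standing in for expressions. The names =implies?=, =f=, =g= and =always-true= are hypothetical, and =g= was picked by hand rather than produced by =pushdown=.

#+BEGIN_SRC clojure
(defn implies? [p q] (or (not p) q))

(def all-assignments
  (for [v1 [true false]
        v2 [true false]
        w1 [true false]]
    {:v1 v1 :v2 v2 :w1 w1}))

(defn f [{:keys [v1 v2 w1]}] (and (or v1 w1) v2)) ; original predicate
(defn g [{:keys [v2]}] v2)                         ; candidate pre-filter
(defn always-true [_] true)                        ; trivial pre-filter

;; Both candidates satisfy the law:
(every? #(implies? (f %) (g %)) all-assignments)           ;; => true
(every? #(implies? (f %) (always-true %)) all-assignments) ;; => true

;; ...but only g actually rejects anything:
(count (remove g all-assignments))           ;; => 4
(count (remove always-true all-assignments)) ;; => 0
#+END_SRC

A filter that always returns true passes this law, so the law guards correctness only. How much a pre-filter actually prunes would have to be measured separately.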

#+BEGIN_SRC clojure :tangle test/optimizer/core_test.clj
<>
<
<>
#+END_SRC