The Filter Expression Language

2026-04-24

We’re all at least somewhat familiar with databases (DBs). You issue an arbitrary query against a trove of messages stored on disk (or memory for certain scenarios) and get back some subset of those messages that match your query. By matching, we mean that the boolean expression comprising the query returns true when input a message. Consider a toy example:

SELECT * FROM guitarist
  WHERE
    first_name = "Jimi" AND last_name = "Hendrix"

We expect first_name and last_name to each be keys in the message of type string. We haven’t seen the schema of this message yet, but let’s assume those keys exist.

The real weight of the query lays in the where clause, which is just the expression following the keyword WHERE. Above, it’s first_name = "Jimi" AND last_name = "Hendrix". We can think of this expression like a predicate (a function returning a boolean) that accepts a message. For the above, we would expect a set messages of input to this predicate to behave like:

[
  { "first_name": "Jimi", "last_name": "Hendrix" } // match
  { "first_name": "Eric", "last_name": "Johnson" } // no match
  { "first_name": "Eric", "last_name": "Clapton" } // no match
]

The messages stored in a DB can be indexed according to a set of fields; let’s examine the case of indexing a single field. If our users were primarily concerned with searching for guitarists by last_name, we can do better than an table scan and get down to even lookup time if we use a hash index.

In that case, the DB keeps around a hash map pointing field values of the index to rows, which we’ll use at query time:

// {
//   "Hendrix": [0]
//   "Johnson": [1]
//   "Clapton": [2]
// }
[
  { "first_name": "Jimi", "last_name": "Hendrix" } // row 0
  { "first_name": "Eric", "last_name": "Johnson" } // row 1
  { "first_name": "Eric", "last_name": "Clapton" } // row 2
]

This is all fairly straightforward, but what if we wanted to solve this exact problem, but backwards? Suppose that instead of having a trove of messages and executing arbitrary queries against those messages, we had a trove of queries and fed them arbitrary messages. In that case, we’d be asking for the set of matching queries given an input message—that’s a quite different problem.

Imagine that we had a few where clauses somewhere on disk:

first_name = "Eric" AND last_name = "Johnson"    // row 0
first_name = "Jimi" OR first_name = "Eric"       // row 1
first_name = "Jimi"                              // row 2

Given an input like { "first_name": "Eric", "last_name": "Johnson" }, it’s clear that this should match the first 2 rows. The tricky part is figuring out how to do this quickly. Like before, we’d like to use an index, but how does one index a boolean expression?

That depends on the structure of the expression. The fields expressed by our set of queries have to be part of the input message. Thus, the set of fields contained in our message’s schema have to be at least as big as the outer union of the set of fields expressed by each query in our set of queries .

We can think of as a high-dimensional space—it’s often the case that our message type will have many fields. Of course, each of those fields in can take on a number of values according to the field’s type; let’s call this set where .

In that case, the set of all possible message values (assuming we give every field a value) is just a high-dimensional cartesian product over each field’s set of values:

A particular message value can also be called an assignment, since we’re assigning some set of the fields to a value. Note that we don’t have to assign every field to a value; some can be left free. This means that also needs to account for the case where the field isn’t set. We’ll just tack on a special null value to to handle this:

Now, let’s get back to indexing these expressions. If we consider the “empty” query expression—one that specifies no fields and leaves all of them free—what messages would it match? When a field is left free, our query is effectively saying “I don’t care about the value of this field”. If it doesnt’ care about the value of any field, then by extension it has to match every input message. Its solution space, or the subset of which match the query, is just .

If we expand this expression by introducing some assignment to a field like first_name = "Eric", we are refining the solution space of the query by restricting a free dimension to an exact value. In other words, we’re cutting through with a hyperplane which represents our new solution space for the query.

Consider if we expanded again to first_name = "Eric" AND last_name = "Johnson". The solution space for each leg of the AND is a hyperplane— and respectively. The AND operator dictates that the solution space for the query is the intersection of the spaces for its operands. Thus, the solution space for such a query is . As one might suspect, the OR operator is effectively the inverse of AND; it returns a union of the spaces for its operands.

If we considered indexing this expression, we might consider an approach analogous to a typical DB—we could index first_name, last_name, or the pair jointly. Let’s look at a basic hash index on each field (not the pair) using some additional queries from above:

// {
//   first_name: {
//     null      : []
//     "Eric"    : [0]
//     "Jimi"    : [1, 2]
//   }
//   last_name: {
//     null      : [1, 2]
//     "Johnson" : [0]
//   }
// }
first_name = "Eric" AND last_name = "Johnson"    // row 0
first_name = "Jimi" OR first_name = "Eric"       // row 1
first_name = "Jimi"                              // row 2

While a hash index works well for the above query structures, it can’t (easily) account for range operations, like favorite_number > 42. A hash index would have to insert an entry for every number above 42, which is too much space. In other words, the space implied by this syntax is just too large for a hash index to express. Similarly, an inequality like first_name != "Jimi" or NOT first_name = "Jimi" also specifies a rather large space.

As one might conclude, hash indexability of an expression is closely tied to its structure. If we want to ensure that our queries are hash indexable, then we need to control the syntactic structure of the queries—this is the job of a language’s grammar.

Let’s take a walk through the design and implementation of a language called Filter that’s suited for hash indexing arbitrary boolean expressions. We’ll focus on hash indexing because it’s rather easy to understand, but this experiment could be repeated targeting other indexing styles. This would naturally change the what the grammar could support.