The Filter Expression Language
2026-04-24We’re all at least somewhat familiar with databases (DBs). You issue an
arbitrary query against a trove of messages stored on disk (or memory for
certain scenarios) and get back some subset of those messages that match your
query. By matching, we mean that the boolean expression comprising the query
returns true when input a message. Consider a toy example:
SELECT * FROM guitarist
WHERE
first_name = "Jimi" AND last_name = "Hendrix"
We expect first_name and last_name to each be keys in the message of type
string. We haven’t seen the schema of this message yet, but let’s assume those
keys exist.
The real weight of the query lays in the where clause, which is just the
expression following the keyword WHERE. Above, it’s
first_name = "Jimi" AND last_name = "Hendrix". We can think of this expression
like a predicate (a function returning a boolean) that accepts a message. For
the above, we would expect a set messages of input to this predicate to behave
like:
[
{ "first_name": "Jimi", "last_name": "Hendrix" } // match
{ "first_name": "Eric", "last_name": "Johnson" } // no match
{ "first_name": "Eric", "last_name": "Clapton" } // no match
]
The messages stored in a DB can be indexed according to a set of fields; let’s
examine the case of indexing a single field. If our users were primarily
concerned with searching for guitarists by last_name, we can do better than an
In that case, the DB keeps around a hash map pointing field values of the index to rows, which we’ll use at query time:
// {
// "Hendrix": [0]
// "Johnson": [1]
// "Clapton": [2]
// }
[
{ "first_name": "Jimi", "last_name": "Hendrix" } // row 0
{ "first_name": "Eric", "last_name": "Johnson" } // row 1
{ "first_name": "Eric", "last_name": "Clapton" } // row 2
]
This is all fairly straightforward, but what if we wanted to solve this exact problem, but backwards? Suppose that instead of having a trove of messages and executing arbitrary queries against those messages, we had a trove of queries and fed them arbitrary messages. In that case, we’d be asking for the set of matching queries given an input message—that’s a quite different problem.
Imagine that we had a few where clauses somewhere on disk:
first_name = "Eric" AND last_name = "Johnson" // row 0
first_name = "Jimi" OR first_name = "Eric" // row 1
first_name = "Jimi" // row 2
Given an input like { "first_name": "Eric", "last_name": "Johnson" }, it’s
clear that this should match the first 2 rows. The tricky part is figuring out
how to do this quickly. Like before, we’d like to use an index, but how does one
index a boolean expression?
That depends on the structure of the expression. The fields expressed by our set
of queries have to be part of the input message. Thus, the set of fields
contained in our message’s schema
We can think of
In that case, the set of all possible message values
A particular message value can also be called an assignment, since we’re
assigning some set of the fields to a value. Note that we don’t have to assign
every field to a value; some can be left free. This means that null value
Now, let’s get back to indexing these expressions. If we consider the “empty”
query expression—one that specifies no fields and leaves all of them free—what
messages would it match? When a field is left free, our query is effectively
saying “I don’t care about the value of this field”. If it doesnt’ care about
the value of any field, then by extension it has to match every input message.
Its solution space, or the subset of
If we expand this expression by introducing some assignment to a field like
first_name = "Eric", we are refining the solution space of the query by
restricting a free dimension to an exact value. In other words, we’re cutting
through
Consider if we expanded again to
first_name = "Eric" AND last_name = "Johnson". The solution space for each leg
of the AND is a hyperplane—AND operator dictates that the solution space for the query
is the intersection of the spaces for its operands. Thus, the solution space
for such a query is OR operator is effectively the inverse of AND; it returns a union of
the spaces for its operands.
If we considered indexing this expression, we might consider an approach
analogous to a typical DB—we could index first_name, last_name, or the pair
jointly. Let’s look at a basic hash index on each field (not the pair) using
some additional queries from above:
// {
// first_name: {
// null : []
// "Eric" : [0]
// "Jimi" : [1, 2]
// }
// last_name: {
// null : [1, 2]
// "Johnson" : [0]
// }
// }
first_name = "Eric" AND last_name = "Johnson" // row 0
first_name = "Jimi" OR first_name = "Eric" // row 1
first_name = "Jimi" // row 2
While a hash index works well for the above query structures, it can’t (easily)
account for range operations, like favorite_number > 42. A hash index would
have to insert an entry for every number above 42, which is too much space. In
other words, the space implied by this syntax is just too large for a hash index
to express. Similarly, an inequality like first_name != "Jimi" or
NOT first_name = "Jimi" also specifies a rather large space.
As one might conclude, hash indexability of an expression is closely tied to its structure. If we want to ensure that our queries are hash indexable, then we need to control the syntactic structure of the queries—this is the job of a language’s grammar.
Let’s take a walk through the design and implementation of a language called Filter that’s suited for hash indexing arbitrary boolean expressions. We’ll focus on hash indexing because it’s rather easy to understand, but this experiment could be repeated targeting other indexing styles. This would naturally change the what the grammar could support.