What Is Kaskada?
Understanding and reacting to the world in real-time requires understanding what is happening now in the context of what happened in the past. You need the ability to understand if what just happened is unusual, how it relates to what happened previously, and how it relates to other things that are happening at the same time.
Getting to this type of contextual real-time insight has historically been difficult, as it required bringing together incompatible tools designed for either bulk or streaming applications. Recent stream-processing frameworks make it easier to work across streams and bulk data sources, but force you to pick one: either you get the power of a low-level API or the convenience of a high-level query language.
Kaskada provides a single, high-level, declarative query language. The power and convenience of Kaskada’s query language come from the fact that it’s built from a new abstraction: the timeline. Timelines give you the declarative transformations and aggregations of SQL without losing the ability to reason about temporal context, time travel, sequencing, timeseries, etc. Any query can be used, unchanged, in either batch or streaming mode.
What are Timelines?
Having the right tool makes every job easier, and different data-processing jobs benefit from different ways of thinking. Tables are useful for inter-related records, and graphs are useful for thinking about networks. Kaskada was designed for thinking about changes over time, and is built on the idea of a timeline.
A timeline describes how a value changes over time. In the same way that SQL queries transform tables and graph queries transform nodes and edges, Kaskada queries transform timelines. In comparison to a timeseries, which is defined at fixed, periodic times (e.g., every minute), a timeline is defined at arbitrary times.
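As a rough mental model (not Kaskada's implementation or API), a timeline can be pictured as a time-ordered list of points that fall at arbitrary times. The sketch below is plain Python; the `cart_size` data and the `value_at` helper are hypothetical names introduced only for illustration, and treating the value as "the most recent point at or before t" is an assumption about one way to read a timeline.

```python
from bisect import bisect_right

# A timeline sketched as time-ordered (minutes_since_midnight, value) points.
# The points fall at arbitrary times rather than on a fixed, periodic grid.
# These values are made up for illustration.
cart_size = [(8 * 60 + 20, 0), (9 * 60 + 45, 2), (14 * 60 + 10, 3)]


def value_at(timeline, t):
    """Return the most recent value at or before time t, or None if there is none."""
    times = [time for time, _ in timeline]
    i = bisect_right(times, t)  # number of points with time <= t
    return timeline[i - 1][1] if i > 0 else None
```

For example, `value_at(cart_size, 10 * 60)` looks up the value in effect at 10:00, even though no point was recorded at exactly that time.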
Timelines simplify reasoning about time, change, and behavior. Timelines support SQL’s aggregations but extend them with sequential operations typically provided by complex event processing (CEP) systems.
Current Challenges
Let’s imagine you have two tables containing different kinds of events. One table contains an event for each time a user visits your web page:
time | user | path |
---|---|---|
8:20 | Alice | /index.html |
9:45 | Alice | /products.html |
2:10 | Alice | /product/1923.html |
8:52 | Alice | /checkout.html |
11:03 | Bob | index.html |
Another table contains an event for each time a user makes a purchase:
time | user | amount |
---|---|---|
7:02 | Alice | $3.99 |
10:47 | Alice | $5.00 |
We can do a lot with these tables with SQL - we can transform, filter, and aggregate them. But what if we want to answer questions that depend on time and order? We might want to know how a result changes over time, or aggregate over time-based groups, or detect when a condition is first met.
Many of these things are possible with SQL, but they’re rarely simple. Reasoning about time using abstractions designed for tables forces you to carefully reflect temporal logic in your query code. For example, to see how a simple sum changes over time you might write:
SELECT sum(amount) OVER (PARTITION BY customer_id ORDER BY timestamp)
FROM Purchase
This uses a slightly arcane (and inconsistently supported) feature of SQL: windowed aggregations. Queries like this quickly become complex and unreadable, because SQL’s table-oriented abstractions were not designed for these types of queries.
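To see what the windowed aggregation above returns, it can be run against an in-memory SQLite database from Python. This is only a sketch: the table schema is an assumption, amounts are stored as integer cents to avoid floating-point noise, times are zero-padded strings so that text ordering matches chronological order, and two extra columns are selected so the output is readable. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; the column names follow the query in the text.
conn.execute(
    "CREATE TABLE Purchase (customer_id TEXT, timestamp TEXT, amount_cents INTEGER)"
)
# The two purchases from the sample table, with amounts in cents.
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?)",
    [("Alice", "07:02", 399), ("Alice", "10:47", 500)],
)

# Running sum per customer, ordered by time within each partition.
rows = conn.execute(
    """
    SELECT customer_id,
           timestamp,
           sum(amount_cents) OVER (PARTITION BY customer_id ORDER BY timestamp)
    FROM Purchase
    ORDER BY customer_id, timestamp
    """
).fetchall()

for row in rows:
    print(row)
```

Each output row carries the sum of all purchases up to and including that row, which is the "how the sum changes over time" view the window clause exists to provide.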
For example, if you wanted to know how many times each user had visited your website since the last time they purchased something, you’d need to write something like this:
-- Find each user's most recent purchase, then count the pageviews after it.
WITH last_purchase AS (
  SELECT user_id, max(timestamp) AS last_timestamp FROM purchase GROUP BY user_id
)
SELECT pageview.user_id, count(*)
FROM pageview
JOIN last_purchase ON pageview.user_id = last_purchase.user_id
WHERE pageview.timestamp > last_purchase.last_timestamp
GROUP BY pageview.user_id
If you wanted to know how that value has changed over time, you’d need to rewrite the query from scratch, and the result would be too long to show in this quick introduction.
Many time and sequence related questions end up being surprisingly hard to answer with SQL. This is where the notion of timelines can make your life much easier.
The solution offered by timelines
Rather than thinking of each event as a row in a table, we can think of it as a point along a timeline.
Kaskada provides many ways of transforming timelines. For example, we can compute the simple sum we saw earlier:
Purchase.amount | sum()
Aggregating a timeline produces a new timeline - rather than computing a single answer, the timeline describes how the result of the aggregation changes over time.
Since the value of a timeline is specific to a point in time, we can easily describe aggregations in a temporal context. See how easy it is to describe the earlier example of counting page views since the last purchase:
Pageview
| count(since(Purchase))
This timeline describes the result of a query at every point in time, so we can easily observe its value at specific points in time without making any changes to the query:
Pageview
| count(since(Purchase))
| when(daily())
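The idea behind counting page views since the last purchase can be sketched in plain Python as a counter that resets whenever a purchase arrives. This is a conceptual illustration only, not Kaskada's implementation: the `count_since` function, the sample events, and the tie-breaking rule (a purchase at the same time as a pageview is applied first) are all assumptions made for the example.

```python
def count_since(pageviews, purchases):
    """Count pageviews per user, resetting a user's count at each of their purchases.

    Both inputs are lists of (time, user) tuples.
    Returns a list of (time, user, count) points, one per input event.
    """
    # Tag each event so that, at equal times, a purchase (tag 0) sorts
    # before a pageview (tag 1). This ordering is an assumption.
    events = [(t, 0, u) for t, u in purchases] + [(t, 1, u) for t, u in pageviews]
    counts = {}
    timeline = []
    for time, tag, user in sorted(events):
        if tag == 0:
            counts[user] = 0  # a purchase resets the user's count
        else:
            counts[user] = counts.get(user, 0) + 1
        timeline.append((time, user, counts[user]))
    return timeline


# Hypothetical events, with times given as minutes since midnight.
pageviews = [(500, "Alice"), (585, "Alice"), (700, "Alice"), (663, "Bob")]
purchases = [(647, "Alice")]

since_purchase = count_since(pageviews, purchases)
```

Because the output is itself a list of points over time, it can be sampled at whatever times are of interest without changing how the count is computed.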
Taking this a step further, we can re-aggregate the previous result. Here we compute the average of each day’s pageview-since-purchase count:
Pageview
| count(since(Purchase))
| when(daily())
| mean()
Finally, we’re not limited to thinking about a single point in time. By shifting timelines relative to each other, we can easily describe how values change over time. For example, here is how the previous result has changed hour-over-hour:
let daily_average = Pageview
| count(since(Purchase))
| when(daily())
| mean()
in daily_average - (daily_average | shift_by(hours(1)))
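Shifting a timeline and subtracting it from the original can be pictured with a small plain-Python sketch. It is not Kaskada's implementation: `value_at`, `shift_by`, and `difference` are hypothetical helpers, the data is made up, and the choice to evaluate the difference wherever either timeline has a point is an assumption.

```python
from bisect import bisect_right


def value_at(timeline, t):
    """Most recent value at or before time t, or None if the timeline has not started."""
    times = [time for time, _ in timeline]
    i = bisect_right(times, t)
    return timeline[i - 1][1] if i > 0 else None


def shift_by(timeline, delta):
    """Move every point forward in time by delta."""
    return [(time + delta, value) for time, value in timeline]


def difference(a, b):
    """Compute a - b at every time where either timeline has a point.

    Times at which either timeline is still undefined are skipped.
    """
    times = sorted({t for t, _ in a} | {t for t, _ in b})
    result = []
    for t in times:
        x, y = value_at(a, t), value_at(b, t)
        if x is not None and y is not None:
            result.append((t, x - y))
    return result


# Hypothetical timeline of averages; times are in minutes.
average = [(0, 2.0), (60, 3.0), (120, 5.0)]

# Each point compares the current value with the value one hour (60 minutes) earlier.
change = difference(average, shift_by(average, 60))
```

Here `change` holds the hour-over-hour difference at each time where both the original and the shifted timeline have a value.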
Expressing these simple-seeming timeline queries as SQL queries over tables would have been much harder, more verbose, and less maintainable, because the abstractions do not align with the problem. Aligning our mental model with the problem being solved makes reasoning about time and behavior much easier.
The shift away from technology-specific solutions
A big reason for the power and persistence of SQL is that it’s a declarative language - you write what you want, not how to compute it. This allows you to focus on understanding your data, without having to think about query implementation details.
Unfortunately, the rise of stream-based data processing has forced developers to spend a lot of time thinking about implementation details. SQL queries written against OLAP offline data stores often aren’t supported by streaming data processors. While some real-time systems support "streaming SQL", streams and tables are very different things and much of the power of stream processing is lost in translation.
How a computation is described shouldn’t depend on where events are stored: streaming vs. batch is an implementation detail. Because Kaskada’s query language is built on timelines, it brings the abstractions of streaming to bulk storage, rather than the other way around.
Kaskada allows developers to focus on solving problems with event data by raising the abstraction level used to describe queries.
Why Kaskada?
Kaskada was built to be performant and easy to use and operate.
We chose to build Kaskada in Rust because of its performance, safety, lack of garbage collection, and support for columnar data formats. The implementation leverages Apache Arrow for event processing and takes advantage of modern CPU optimizations like SIMD, branch prediction, and caching.
Computation is implemented as a single, chronological pass over the input events, so you can compute over datasets that are significantly larger than available memory. Internally, events are stored on disk as Parquet files. We find that most computations are bottlenecked on I/O, so using an efficient columnar file format lets us selectively read the columns and row ranges needed to produce a result.
The result is a modern event processing engine that installs in seconds without any external dependencies and computes quickly and efficiently.
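The single chronological pass described above can be illustrated with a small standard-library Python sketch. It is not how Kaskada is implemented (Kaskada is written in Rust on top of Arrow and Parquet); the function names and sample events are hypothetical. The point is that merging time-sorted inputs lazily and aggregating incrementally keeps only the per-user state in memory, never the full dataset.

```python
import heapq


def chronological(*sources):
    """Lazily merge time-sorted event iterables into one time-ordered stream."""
    return heapq.merge(*sources, key=lambda event: event[0])


def running_totals(events):
    """Yield (time, user, total) in a single pass over time-ordered events.

    Only one running total per user is held in memory.
    """
    totals = {}
    for time, user, amount in events:
        totals[user] = totals.get(user, 0) + amount
        yield time, user, totals[user]


# Hypothetical inputs as (time, user, amount_cents). In practice these could be
# generators that read files chunk by chunk instead of in-memory lists.
source_a = iter([(1, "Alice", 399), (4, "Alice", 500)])
source_b = iter([(2, "Bob", 250), (3, "Alice", 100)])

results = list(running_totals(chronological(source_a, source_b)))
```

The same `running_totals` function works whether its input is a finished batch or an unbounded stream, since it only ever looks at the next event in time order.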
Next Steps
To get started, you can follow one of our "Hello World" examples. These examples will guide you through installing Kaskada and making your first query.
- Hello world using Python Jupyter
- Hello world using the command line