Modern, open-source event processing

Kaskada is a unified event processing engine that provides all the power of stateful stream processing in a high-level, declarative query language designed specifically for reasoning about events in bulk and in real time.

Get Started Read the docs

Why Kaskada?

Kaskada's query language builds on the best features of SQL to provide a more expressive way to compute over events. Queries are simple and declarative. Unlike SQL, they are also concise, composable, and designed for processing events. By focusing on the event-processing use case, Kaskada's query language makes it easier to reason about when things happen, state at specific points in time, and how results change over time.

Kaskada is implemented as a modern compute engine designed for processing events in bulk or real-time. Written in Rust and built on Apache Arrow, Kaskada can compute most workloads without the complexity and overhead of distributed execution.

What can you do with Kaskada?

Kaskada's query language builds on the lessons of 50+ years of query language design to provide a declarative, composable, easy-to-read, and type-safe way of describing computations related to time.

Stateful aggregations Automatic joins Event-based windowing Pipelined operations Row generators Continuous expressions Native time travel Simple, composable syntax

Aggregate events to produce a continuous timeline whose value can be observed at arbitrary points in time.

          # What is the highest review to date?
max(review.stars)

Every expression is associated with an “entity”, allowing tables and expressions to be automatically joined. Entities eliminate redundant boilerplate code.

          # How many purchases have been made per page view
# (combines tables)?
count(purchase) / count(pageview)

Collect events as you move through time, and aggregate them with respect to other events. Ordered aggregation makes it easy to describe temporal interactions.

          # How many pageviews have occurred since
# the last purchase?
count(pageview, window=since(purchase))

# Auto-join purchases
# How much has been spent since the last review?
count(purchase, window=since(review))

Pipe syntax allows multiple operations to be chained together. Write your operations in the same order you think about them. It's timelines all the way down, making it easy to aggregate the results of aggregations.

          # What's the largest spend on food over 2 purchases?
purchase.amount 
| when(purchase.category == "food") 
| sum(window=sliding(2, $input)) # Inner aggregation
| max()                          # Outer aggregation

Pivot from events to time-series. Unlike grouped aggregates, generators produce rows even when there's no input, allowing you to react when something doesn't happen.

          # How many signups occurred every hour
# (even if there were none)?
count(signups, window=since(daily()))
| when(daily())
| mean()

Observe the value of aggregations at arbitrary points in time. Timelines are either “discrete” (instantaneous values or events) or “continuous” (values produced by a stateful aggregations). Continuous timelines let you combine aggregates computed from different event sources.

          # Compute the average review for each product?
let product_average = review
| with_key(review.product_id)
| mean()
# Each purchase joined with average
in product_average | lookup(purchase.product_id))

Shift values forward (but not backward) in time, allowing you to combine different temporal contexts without the risk of temporal leakage. Shifted values make it easy to compare a value “now” to a value from the past.

          # How many purchases have occurred in the last day?
let purchases_now = count(purchase)
let purchases_yesterday =
   purchases_now | shift_by(days(1))
in purchases_now - purchases_yesterday

It is functions all the way down. No global state, no dependencies to manage, and no spooky action at a distance. Quickly understand what a query is doing, and painlessly refactor to make it DRY.

          # How many big purchases happen each hour and where?
let cadence = hourly()
# Anything can be named and re-used
let hourly_big_purchases = purchase
| when(purchase.amount > 10)
# Filter anywhere 
| count(window=since(cadence))
# Aggregate anything
| when(cadence)
# No choosing between “when” & “having”

in {hourly_big_purchases}
# Records are just another type
| extend({
  # …modify them sequentially
  last_visit_region: last(pageview.region)
})

Aggregate events to produce a continuous timeline whose value can be observed at arbitrary points in time.

              # What is the highest review to date?
max(review.stars)

Every expression is associated with an “entity”, allowing tables and expressions to be automatically joined. Entities eliminate redundant boilerplate code.

              # How many purchases have been made per page view
# (combines tables)?
count(purchase) / count(pageview)

Collect events as you move through time, and aggregate them with respect to other events. Ordered aggregation makes it easy to describe temporal interactions.

              # How many pageviews have occurred since
# the last purchase?
count(pageview, window=since(purchase))

# Auto-join purchases
# How much has been spent since the last review?
count(purchase, window=since(review))

              # What's the largest spend on food over 2 purchases?
purchase.amount 
| when(purchase.category == "food") 
| sum(window=sliding(2, $input)) # Inner aggregation
| max()                          # Outer aggregation

Pivot from events to time-series. Unlike grouped aggregates, generators produce rows even when there's no input, allowing you to react when something doesn't happen.

              # How many signups occurred every hour
# (even if there were none)?
count(signups, window=since(daily()))
| when(daily())
| mean()

              # Compute the average review for each product?
let product_average = review
| with_key(review.product_id)
| mean()
# Each purchase joined with average
in product_average | lookup(purchase.product_id))

              # How many purchases have occurred in the last day?
let purchases_now = count(purchase)
let purchases_yesterday =
   purchases_now | shift_by(days(1))
in purchases_now - purchases_yesterday

It is functions all the way down. No global state, no dependencies to manage, and no spooky action at a distance. Quickly understand what a query is doing, and painlessly refactor to make it DRY.

              # How many big purchases happen each hour and where?
let cadence = hourly()
# Anything can be named and re-used
let hourly_big_purchases = purchase
| when(purchase.amount > 10)
# Filter anywhere 
| count(window=since(cadence))
# Aggregate anything
| when(cadence)
# No choosing between “when” & “having”

in {hourly_big_purchases}
# Records are just another type
| extend({
  # …modify them sequentially
  last_visit_region: last(pageview.region)
})

Benefits

Modern

Built on the latest in efficient, GC-free, columnar computation and packaged up to easily install and run locally on your existing hardware. High-efficiency compute means most workloads fit on a single instance, but Kaskada is cloud-native so you can scale when needed.

Streaming-native

Declaratively express queries over partitioned, ordered streams without lossy mappings from streams to relational models. Queries freely combine rich analytic transformations and aggregations with order-dependent temporal and sequential operations.

Unified batch and streaming

Columnar compute allows you to execute analytic queries over large historical event datasets in seconds. End-to-end incremental execution allows you to maintain real-time query results computed over event streams efficiently. Kaskada's streaming-native query language means any query can be used, unchanged, for both purposes.

Get Started

What can you use Kaskada for?

Machine Learning

Compute event-based features at arbitrary, data-dependent, points in time in historical feature computation. Prevent data leakage, or accidental computation of future events that contaminate ML models.

Analytics

Compute events in batch or real time for marketing, sales, or business analytics applications.

Dashboards & Monitoring

Analyze logs and events across multiple real time and batch sources for monitoring, troubleshooting, and threat detection. Visualize aggregations over the full history of your events.

Supply Chain Management

Manage supply chain events and processes across locations providing a better end user experience across channels. Dynamically adapt to changing resource availability in real-time.

Reactive applications

Provide a differentiated user experience with applications that respond dynamically to user actions and behavior. Easily write sophisticated trigger conditions to implement real-time business logic.

Modern, open-source event processing

Why Kaskada?

What can you do with Kaskada?

Benefits

Modern

Streaming-native

Unified batch and streaming

What can you use Kaskada for?

Machine Learning

Analytics

Dashboards & Monitoring

Supply Chain Management

Reactive applications

Latest From Kaskada

Introducing The New Kaskada: Embedded in Python for accessible Real-Time AI

by Ryan Michael · Aug 25, 2023

It’s Time for a Time-centric Stack: Timeline Abstractions Save Time and Money

by Brian Godsey · Jun 5, 2023

Introducing Timelines (Part 2): Declarative, Temporal Queries

by Ben Chambers and Therapon Skoteiniotis · May 25, 2023