Modern, open-source event processing

Kaskada is a unified event processing engine that provides all the power of stateful stream processing in a high-level, declarative query language designed specifically for reasoning about events in bulk and in real time.

Why Kaskada?

Kaskada's query language builds on the best features of SQL to provide a more expressive way to compute over events. Queries are simple and declarative. Unlike SQL, they are also concise, composable, and designed for processing events. By focusing on the event-processing use case, Kaskada's query language makes it easier to reason about when things happen, what the state was at specific points in time, and how results change over time.

Kaskada is implemented as a modern compute engine designed for processing events in bulk or in real time. Written in Rust and built on Apache Arrow, Kaskada can handle most workloads without the complexity and overhead of distributed execution.

What can you do with Kaskada?

Kaskada's query language builds on the lessons of 50+ years of query language design to provide a declarative, composable, easy-to-read, and type-safe way of describing computations related to time.

Aggregate events to produce a continuous timeline whose value can be observed at arbitrary points in time.

# What is the highest review to date?
max(review.stars)
        
Stateful aggregations
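
To make the idea of a continuous timeline concrete, here is a minimal Python sketch of what an expression like `max(review.stars)` could compute for a single entity. It is plain Python, not Kaskada's API: the `Review` tuple, the helper names, and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Review(NamedTuple):
    time: int   # event time (hypothetical integer timestamps)
    stars: int


def running_max(reviews):
    """Return (time, max_so_far) pairs, one timeline point per input event."""
    timeline = []
    best = None
    for r in sorted(reviews, key=lambda r: r.time):
        best = r.stars if best is None else max(best, r.stars)
        timeline.append((r.time, best))
    return timeline


def value_at(timeline, t):
    """Observe the timeline at an arbitrary time t (None before the first event)."""
    current = None
    for time, value in timeline:
        if time > t:
            break
        current = value
    return current


reviews = [Review(10, 3), Review(20, 5), Review(30, 4)]
timeline = running_max(reviews)
```

The aggregation only changes when an event arrives, but `value_at` can be asked about any time in between, which is what makes the result continuous rather than a list of events.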

Every expression is associated with an “entity”, allowing tables and expressions to be automatically joined. Entities eliminate redundant boilerplate code.

# How many purchases have been made per page view
# (combines tables)?
count(purchase) / count(pageview)
        
Automatic joins
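
The effect of entity-based joining can be sketched in plain Python as grouping every event by its entity key before aggregating. This is only an illustration of the idea, not how the engine is implemented; the event tuples and the `purchases_per_pageview` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical events from two tables: (time, entity_key, table_name)
events = [
    (1, "alice", "pageview"),
    (2, "alice", "pageview"),
    (3, "bob", "pageview"),
    (4, "alice", "purchase"),
    (5, "bob", "pageview"),
    (6, "bob", "purchase"),
    (7, "alice", "pageview"),
]


def purchases_per_pageview(events):
    """Return {entity: latest ratio of purchase count to pageview count}."""
    counts = defaultdict(lambda: {"purchase": 0, "pageview": 0})
    ratios = {}
    for _, entity, table in sorted(events):
        counts[entity][table] += 1
        c = counts[entity]
        # The ratio is undefined until the entity has at least one pageview.
        ratios[entity] = c["purchase"] / c["pageview"] if c["pageview"] else None
    return ratios


ratios = purchases_per_pageview(events)
```

Because both counts are keyed by the same entity, no explicit join condition is needed to combine them.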

Collect events as you move through time, and aggregate them with respect to other events. Ordered aggregation makes it easy to describe temporal interactions.

# How many pageviews have occurred since
# the last purchase?
count(pageview, window=since(purchase))

# Auto-join purchases
# How much has been spent since the last review?
sum(purchase.amount, window=since(review))
        
Event-based windowing
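
A window that resets on another event can be sketched as a counter that is cleared whenever the resetting event occurs. The Python below is an illustration for a single entity, with one assumption worth flagging: the value emitted at the resetting event still includes everything counted so far, and the reset applies afterwards. The `count_since` helper and the sample events are hypothetical.

```python
def count_since(events, counted, reset):
    """Return (time, count) after every event, where count is the number of
    `counted` events seen since the most recent `reset` event."""
    out = []
    n = 0
    for time, kind in sorted(events):
        if kind == counted:
            n += 1
        out.append((time, n))
        if kind == reset:
            n = 0  # assumption: the window restarts after the reset event
    return out


events = [(1, "pageview"), (2, "pageview"), (3, "purchase"), (4, "pageview")]
pageviews_since_purchase = count_since(events, "pageview", "purchase")
```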

Pipe syntax allows multiple operations to be chained together. Write your operations in the same order you think about them. It's timelines all the way down, making it easy to aggregate the results of aggregations.

# What's the largest spend on food over 2 purchases?
purchase.amount
| when(purchase.category == "food")
| sum(window=sliding(2, $input)) # Inner aggregation
| max()                          # Outer aggregation
        
Pipelined operations
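
The pipeline above reads as filter, inner aggregation, outer aggregation. A rough Python equivalent, assuming the sliding window covers the two most recent food purchases (an interpretation of the comment in the query, not a statement about the engine's windowing rules), might look like this. The sample purchases are hypothetical.

```python
from collections import deque

purchases = [
    {"amount": 12.0, "category": "food"},
    {"amount": 40.0, "category": "books"},
    {"amount": 8.0, "category": "food"},
    {"amount": 15.0, "category": "food"},
    {"amount": 3.0, "category": "food"},
]


def largest_two_purchase_food_spend(purchases):
    """Return the running maximum of the sum over the last two food purchases."""
    window = deque(maxlen=2)  # keeps only the two most recent food amounts
    best = None
    timeline = []
    for p in purchases:
        if p["category"] != "food":  # the `when` step
            continue
        window.append(p["amount"])
        inner = sum(window)          # inner aggregation
        best = inner if best is None else max(best, inner)  # outer aggregation
        timeline.append(best)
    return timeline


result = largest_two_purchase_food_spend(purchases)
```

Each step consumes the output of the previous one, so the order of the code matches the order of the pipe stages.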

Pivot from events to time-series. Unlike grouped aggregates, generators produce rows even when there's no input, allowing you to react when something doesn't happen.

# What is the average number of signups per day
# (counting days with none)?
count(signups, window=since(daily()))
| when(daily())
| mean()
        
Row generators
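
The difference from a grouped aggregate is that the ticks, not the events, decide which rows exist. Here is a minimal Python sketch of that idea, assuming evenly spaced ticks and events that all fall after `start`; the `counts_per_tick` helper and its parameters are hypothetical.

```python
def counts_per_tick(event_times, start, end, period):
    """Return (tick, count) for every tick in (start, end], counting the
    events that arrived since the previous tick. Empty periods yield 0."""
    rows = []
    times = sorted(event_times)
    i = 0
    tick = start + period
    while tick <= end:
        n = 0
        while i < len(times) and times[i] <= tick:
            n += 1
            i += 1
        rows.append((tick, n))  # a row is produced even when n == 0
        tick += period
    return rows


signup_times = [10, 20, 130]
rows = counts_per_tick(signup_times, start=0, end=180, period=60)
average = sum(n for _, n in rows) / len(rows)
```

A group-by over the same events would only produce the first and third periods, so the average would be computed over two rows instead of three.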

Observe the value of aggregations at arbitrary points in time. Timelines are either “discrete” (instantaneous values or events) or “continuous” (values produced by stateful aggregations). Continuous timelines let you combine aggregates computed from different event sources.

# Compute the average review for each product
let product_average = review
| with_key(review.product_id)
| mean()
# Each purchase joined with average
in product_average | lookup(purchase.product_id)
        
Continuous expressions
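
Looking up a continuous value at the time of another event can be sketched as keeping a running aggregate per key and reading it whenever the other event arrives. The Python below is an illustration only; the event layout and the `purchases_with_average` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: (time, kind, product_id, stars or None)
events = [
    (1, "review", "p1", 4),
    (2, "review", "p1", 2),
    (3, "purchase", "p1", None),
    (4, "review", "p1", 5),
    (5, "purchase", "p2", None),
    (6, "purchase", "p1", None),
]


def purchases_with_average(events):
    """Pair each purchase with the product's average review as of that time."""
    totals = defaultdict(lambda: [0, 0])  # product_id -> [sum of stars, count]
    out = []
    for time, kind, product, stars in sorted(events, key=lambda e: e[0]):
        if kind == "review":
            totals[product][0] += stars
            totals[product][1] += 1
        else:
            s, n = totals[product]
            # No reviews yet means there is no average to observe.
            out.append((time, product, s / n if n else None))
    return out


joined = purchases_with_average(events)
```

The purchase at time 3 sees only the first two reviews, while the purchase at time 6 also sees the review from time 4.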

Shift values forward (but not backward) in time, allowing you to combine different temporal contexts without the risk of temporal leakage. Shifted values make it easy to compare a value “now” to a value from the past.

# How many purchases have occurred in the last day?
let purchases_now = count(purchase)
let purchases_yesterday =
   purchases_now | shift_by(days(1))
in purchases_now - purchases_yesterday
        
Native time travel
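
Subtracting a shifted count from the current count gives the number of events in the trailing period. A small Python sketch of that arithmetic, assuming integer timestamps in seconds (the helper names and sample data are hypothetical):

```python
import bisect

DAY = 24 * 60 * 60  # seconds


def count_at(sorted_times, t):
    """Cumulative count of events with time <= t."""
    return bisect.bisect_right(sorted_times, t)


def purchases_in_last_day(purchase_times, now):
    times = sorted(purchase_times)
    purchases_now = count_at(times, now)
    # The value from one day ago, shifted forward so it is read at `now`.
    purchases_yesterday = count_at(times, now - DAY)
    return purchases_now - purchases_yesterday


purchase_times = [0, 1000, DAY + 500, DAY + 2000]
recent = purchases_in_last_day(purchase_times, now=DAY + 3000)
```

Only past values are ever read: the computation at `now` uses the count as it stood a day earlier, never a count from the future.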

It is functions all the way down. No global state, no dependencies to manage, and no spooky action at a distance. Quickly understand what a query is doing, and painlessly refactor to make it DRY.

# How many big purchases happen each hour and where?
let cadence = hourly()
# Anything can be named and re-used
let hourly_big_purchases = purchase
| when(purchase.amount > 10)
# Filter anywhere 
| count(window=since(cadence))
# Aggregate anything
| when(cadence)
# No choosing between “when” & “having”

in {hourly_big_purchases}
# Records are just another type
| extend({
  # …modify them sequentially
  last_visit_region: last(pageview.region)
})
        

Benefits

Modern

Built on the latest in efficient, GC-free, columnar computation and packaged up to easily install and run locally on your existing hardware. High-efficiency compute means most workloads fit on a single instance, but Kaskada is cloud-native so you can scale when needed.

Streaming-native

Declaratively express queries over partitioned, ordered streams without lossy mappings from streams to relational models. Queries freely combine rich analytic transformations and aggregations with order-dependent temporal and sequential operations.

Unified batch and streaming

Columnar compute allows you to execute analytic queries over large historical event datasets in seconds. End-to-end incremental execution allows you to maintain real-time query results computed over event streams efficiently. Kaskada's streaming-native query language means any query can be used, unchanged, for both purposes.
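
One way to picture why the same query can serve both purposes: if an aggregation is written as an incremental update, replaying history and consuming a live stream are the same loop. The Python sketch below illustrates that idea with a hypothetical `RunningMean` class; it says nothing about how Kaskada itself is implemented.

```python
class RunningMean:
    """Incremental mean: each update folds one event into the state."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count


def replay(aggregate, values):
    """Feed events to the aggregate in order and return the latest result."""
    result = None
    for v in values:
        result = aggregate.update(v)
    return result


# Batch: process the full history in one pass.
batch_result = replay(RunningMean(), [4, 2, 5, 1])

# Streaming: process history first, then keep the state and add new events.
live = RunningMean()
replay(live, [4, 2])
stream_result = replay(live, [5, 1])
```

Both paths end in the same state, so the result does not depend on whether the events arrived as a historical dataset or one at a time.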

What can you use Kaskada for?

Machine Learning

Compute event-based features at arbitrary, data-dependent points in time when computing features over historical data. Prevent data leakage, where information from future events accidentally enters a feature value and contaminates ML models.

Analytics

Compute over events in batch or in real time for marketing, sales, or business analytics applications.

Dashboards & Monitoring

Analyze logs and events across multiple real-time and batch sources for monitoring, troubleshooting, and threat detection. Visualize aggregations over the full history of your events.

Supply Chain Management

Manage supply chain events and processes across locations to provide a better end-user experience across channels. Dynamically adapt to changing resource availability in real time.

Reactive applications

Provide a differentiated user experience with applications that respond dynamically to user actions and behavior. Easily write sophisticated trigger conditions to implement real-time business logic.