Hello World (CLI)

Installation

To use Kaskada on the command line, you’ll need to install three components:

  • The Kaskada command-line executable

  • The Kaskada manager, which serves the Kaskada API

  • The Kaskada engine, which executes queries

Each Kaskada release includes pre-compiled binaries for each component. You can visit the Releases page on GitHub to obtain the latest Kaskada release binaries for your platform. The example command below downloads the latest Kaskada binaries and applies to Linux and OSX.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/kaskada-ai/kaskada/main/install/install.sh)"

Update PATH (optional)

Update your environment’s PATH to include the downloaded binaries. This step is optional; if you prefer not to modify your PATH, skip this section.

Print a colon-separated list of the directories in your PATH.

echo $PATH

Move the Kaskada binaries to one of the listed locations. This command assumes that the binaries are currently in your working directory and that your PATH includes /usr/local/bin, but you can customize it if your locations are different.

mv kaskada-* /usr/local/bin/
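
Alternatively, if you’d prefer not to move the binaries, you can add their directory to your PATH for the current shell session (this assumes the binaries are in your working directory):

export PATH="$PWD:$PATH"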

Authorizing applications on OSX

If you’re using OSX, you may need to unblock the applications: as a security feature, OSX prevents applications you’ve downloaded from running. You can remove the quarantine attribute that was placed on each file when it was downloaded with the following commands:

xattr -d com.apple.quarantine kaskada-cli
xattr -d com.apple.quarantine kaskada-manager
xattr -d com.apple.quarantine kaskada-engine
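
To confirm the attribute was removed, list each file’s remaining extended attributes; com.apple.quarantine should no longer appear in the output:

xattr kaskada-cli kaskada-manager kaskada-engine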

Start the services

You can start a local instance of the Kaskada service by running the manager and engine:

./kaskada-manager > manager.log 2>&1 &
./kaskada-engine serve > engine.log 2>&1 &

Allowing services to listen on OSX

When using OSX, you may need to allow these services to create an API listener the first time you run these commands. This is normal and indicates the services are working as expected: the API allows the services to communicate with each other.
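
Both services run in the background and write to log files in the current directory. A quick way to confirm they started cleanly is to inspect the logs:

tail manager.log engine.log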

To verify everything is installed correctly and the services are reachable, try running the following command (which lists any resources you’ve created):

./kaskada-cli sync export --all

You should see output similar to the following:

10:18AM INF starting export
{}
10:18AM INF Success!

Loading Data into a Table

Kaskada stores data in tables. Tables consist of multiple rows, and each row is a value of the same type. When querying Kaskada, the contents of a table are interpreted as a discrete timeline: each row becomes an event in the timeline at the time it occurred.

Creating a Table

Every table is associated with a schema, which defines the structure of each event in the table. Schemas are inferred from the data you load into a table; however, some columns are required by Kaskada’s data model: every table must include a column identifying the time and a column identifying the entity associated with each row.

When creating a table, you must tell Kaskada which columns contain the time and entity of each row:

  • The time column is specified using the time_column_name parameter. This parameter must identify a column name in the table’s data which contains time values. The time should refer to when the event occurred.

  • The entity key is specified using the entity_key_column_name parameter. This parameter must identify a column name in the table’s data which contains the entity key value. The entity key should identify a thing in the world that each event is associated with. Don’t worry too much about picking the "right" value; it’s easy to change the entity later using the with_key() function, as sketched below.
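
For example, once the Purchase table created below exists, a query along these lines would re-key its events by vendor rather than customer (a sketch; with_key() takes the new key expression, but check the Fenl function reference for the exact signature):

query.fenl
Purchase | with_key(Purchase.vendor_id)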

⭢ Read more about configuring tables in Tables

⭢ Read more about the expected structure of input files in Expected File Format

Start by running the following command:

./kaskada-cli table create Purchase --timeColumn purchase_time --entityKeyColumn customer_id
Result:
> tableId: 1ba8ed9a-76bd-4302-b9fa-3c8655535f4a
> tableName: Purchase
> timeColumnName: purchase_time
> entityKeyColumnName: customer_id
> createTime: 2023-05-08T13:16:00.237166Z
> updateTime: 2023-05-08T13:16:00.237167Z

This creates a table named Purchase. Any data loaded into this table must include a timestamp field named purchase_time and an entity key field named customer_id.
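
To confirm the table was created, you can re-run the export command from earlier; its output should now include the Purchase table definition:

./kaskada-cli sync export --all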

Idiomatic Kaskada

We like to use CamelCase to name tables because it helps distinguish data sources from transformed values and function names.

Loading data

Now that we’ve created a table, we’re ready to load some data into it.

A table must be created before data can be loaded into it.

Data can be loaded into a table in multiple ways. In this example we’ll load the contents of a Parquet file into the table.

⭢ Read more about the different ways to load data in Loading Data

# Download a file to load and save it to path 'purchase.parquet'
curl -L "https://drive.google.com/uc?export=download&id=1SLdIw9uc0RGHY-eKzS30UBhN0NJtslkk" -o purchase.parquet

# Load the file into the Purchase table (which was created in the previous step)
./kaskada-cli table load Purchase file://${PWD}/purchase.parquet
Result:
> Successfully loaded "purchase.parquet" into "Purchase" table

The file’s content is added to the table.
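
Loading is additive: repeating the command with another file appends its rows to the same table. For example (the file name here is hypothetical):

./kaskada-cli table load Purchase file://${PWD}/more_purchases.parquet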

Querying data

Data loaded into Kaskada is accessed by performing Fenl Queries.

Identity query

Let’s start by looking at the Purchase table without any filters. Begin by creating a text file with the following query:

query.fenl
Purchase
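
If you’d rather create the file from the shell, a simple redirect works:

echo 'Purchase' > query.fenl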

This query will return all of the columns and rows contained in a table. Run it by sending the query to kaskada-cli query run:

cat query.fenl | ./kaskada-cli query run --stdout
Result:
Enter the expression to run and then press CTRL+D to execute it, or CTRL+C to cancel:



Executing query...

_time,_subsort,_key_hash,_key,id,purchase_time,customer_id,vendor_id,amount,subsort_id
2020-01-01T00:00:00.000000000,12232903146196084293,10966214875107816766,karen,cb_001,2020-01-01T00:00:00.000000000,karen,chum_bucket,9,0
2020-01-01T00:00:00.000000000,12232903146196084294,15119067519137142314,patrick,kk_001,2020-01-01T00:00:00.000000000,patrick,krusty_krab,3,1
2020-01-02T00:00:00.000000000,12232903146196084295,10966214875107816766,karen,cb_002,2020-01-02T00:00:00.000000000,karen,chum_bucket,2,2
2020-01-02T00:00:00.000000000,12232903146196084296,15119067519137142314,patrick,kk_002,2020-01-02T00:00:00.000000000,patrick,krusty_krab,5,3
2020-01-03T00:00:00.000000000,12232903146196084297,10966214875107816766,karen,cb_003,2020-01-03T00:00:00.000000000,karen,chum_bucket,4,4
2020-01-03T00:00:00.000000000,12232903146196084298,15119067519137142314,patrick,kk_003,2020-01-03T00:00:00.000000000,patrick,krusty_krab,12,5
2020-01-04T00:00:00.000000000,12232903146196084299,15119067519137142314,patrick,cb_004,2020-01-04T00:00:00.000000000,patrick,chum_bucket,5000,6
2020-01-04T00:00:00.000000000,12232903146196084300,10966214875107816766,karen,cb_005,2020-01-04T00:00:00.000000000,karen,chum_bucket,3,7
2020-01-05T00:00:00.000000000,12232903146196084301,10966214875107816766,karen,cb_006,2020-01-05T00:00:00.000000000,karen,chum_bucket,5,8
2020-01-05T00:00:00.000000000,12232903146196084302,15119067519137142314,patrick,kk_004,2020-01-05T00:00:00.000000000,patrick,krusty_krab,9,9

Filtering by a single Entity

It can be helpful to limit your results to a single entity. This makes it easier to see how a single entity changes over time.

query.fenl
Purchase | when(Purchase.customer_id == "patrick")

cat query.fenl | ./kaskada-cli query run --stdout
Result:
Enter the expression to run and then press CTRL+D to execute it, or CTRL+C to cancel:



Executing query...

_time,_subsort,_key_hash,_key,id,purchase_time,customer_id,vendor_id,amount,subsort_id
2020-01-01T00:00:00.000000000,12232903146196084294,15119067519137142314,patrick,kk_001,2020-01-01T00:00:00.000000000,patrick,krusty_krab,3,1
2020-01-02T00:00:00.000000000,12232903146196084296,15119067519137142314,patrick,kk_002,2020-01-02T00:00:00.000000000,patrick,krusty_krab,5,3
2020-01-03T00:00:00.000000000,12232903146196084298,15119067519137142314,patrick,kk_003,2020-01-03T00:00:00.000000000,patrick,krusty_krab,12,5
2020-01-04T00:00:00.000000000,12232903146196084299,15119067519137142314,patrick,cb_004,2020-01-04T00:00:00.000000000,patrick,chum_bucket,5000,6
2020-01-05T00:00:00.000000000,12232903146196084302,15119067519137142314,patrick,kk_004,2020-01-05T00:00:00.000000000,patrick,krusty_krab,9,9

Complex Examples with Fenl functions

In the previous example, we built a pipeline of functions using the | character: we began with the timeline produced by the table Purchase, then filtered it down to the set of times where the purchase’s customer is "patrick" using the when() function.

Kaskada’s query language provides a rich set of operations for reasoning about time. Here’s a more sophisticated example that touches on many of the unique features of Kaskada queries:

query.fenl
# How many big purchases happen each hour and where?

# Anything can be named and re-used
let hourly_big_purchases = Purchase

# Filter anywhere
| when(Purchase.amount > 10)

# Aggregate anything
| count(window=since(hourly()))
| when(hourly())

# Shift timelines relative to each other
let purchases_now = count(Purchase)
let purchases_yesterday =
   purchases_now | shift_by(days(1))

# Records are just another type
in { hourly_big_purchases, purchases_in_last_day: purchases_now - purchases_yesterday }
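
You can run this query the same way as before:

cat query.fenl | ./kaskada-cli query run --stdout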

⭢ Read more about writing queries in Queries

Configuring query execution

A given query can be computed in different ways. You can configure how a query is executed by providing arguments to the CLI command.

Changing how the result timeline is output

When you make a query, the resulting timeline is interpreted in one of two ways: as a history or as a snapshot.

  • A timeline history generates a value each time an entity’s value changes; each row is associated with a particular entity and point in time.

  • A timeline snapshot generates a value for every entity at the same point in time; each row is associated with a different entity, but all rows share the same time.

By default, timelines are output as histories. You can output a timeline as a snapshot by setting the --result-behavior argument to final-results.

cat query.fenl | ./kaskada-cli query run --result-behavior final-results
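
Histories are the default; to request that behavior explicitly, the corresponding value is all-results (assumed here from the defaults described above; ./kaskada-cli query run --help lists the accepted values):

cat query.fenl | ./kaskada-cli query run --result-behavior all-results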

Limiting how many rows are returned

You can limit the number of rows returned from a query:

cat query.fenl | ./kaskada-cli query run --preview-rows 10

This may return more rows than you asked for. Kaskada computes data in batches; when you configure --preview-rows, Kaskada stops processing at the end of the batch in which the requested number of rows was reached, and returns all of the rows computed up to that point.
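
These flags can be combined with the others shown above; for example, to preview a snapshot limited to roughly ten rows:

cat query.fenl | ./kaskada-cli query run --result-behavior final-results --preview-rows 10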

⭢ Read more about configuring queries in Configuring Queries

Cleaning Up

When you’re done with this tutorial, you can delete the table you created in order to free up resources. Note that this also deletes all of the data loaded into the table.

# Delete the Purchase table
./kaskada-cli table delete --table Purchase
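
If you’re finished with the local services as well, you can stop the background manager and engine processes (pkill matches on the process command line):

# Stop the manager and engine started earlier
pkill -f kaskada-manager
pkill -f kaskada-engine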

Conclusion

Congratulations, you’ve begun processing events with Kaskada!

Where you go now is up to you:

⭢ Read about Kaskada’s query language in Query Syntax > Introduction

⭢ Read about real-time ML in Training real-time ML models

⭢ Explore some code samples in the examples directory (GitHub)

⭢ Check out the source code on GitHub