Slices

Kaskada offers the ability to interact with slices of datasets. A slice represents a way to filter a large dataset to create a smaller dataset. By slicing a large dataset, queries may access a subset of the data and thus run significantly faster.

Slices preserve the statistical properties of the entire dataset. In general, slices either include all of a given entity’s data or none of it - sampling occurs at the granularity of individual entities.

Slicing only ever affects the produced set of results - never the data used to produce a given result. As a result, some expressions like lookup cannot slice as efficiently as others because the entire dataset may be required to produce any result.

Entity Key Percent Filter

This filter type slices the input dataset down to a percentage of the entity keys. Recall that Kaskada manages data with tables, and creating a table requires an entity_key_column_name. An entity key is a key associated with each row. The entity should identify a thing in the world related to each event.

This filter will read every row and remove rows based on the entity key column in a deterministic and scalable fashion. The filter only runs on the new data when adding additional data to the table. Re-computation of previous sliced data is not required.

Here is an example of creating an entity key filter in Python:

from kaskada.slice_filters import EntityPercentFilter

filter_percentage = 12.34
entity_filter = EntityPercentFilter(filter_percentage)

The example above creates a new EntityPercentFilter from the Kaskada Slicing module with a filtering percentage of 12.34%.

Entity Key Percent Filter Additional Details

  • The provided filter percentage must be between 0.1% and 100% inclusive.

  • Slices with larger percentages include entities from smaller percentages. For example:

    • Given a slice with 10% included results from entities: A, B, and C.

    • A slice with 20% would at least include A, B, and C plus additional entities D, E, and F.

  • If new data is added to the table, the previously sliced entity keys will also be included in the latest data with an additional probability of new entity keys. For example:

    • Given a slice with 10% included results from entities: A and B.

    • New data is uploaded to the table with entities: A, C, D, and E. A will automatically be included in the slice. C, D, and E all have a 10% chance of being included in the slice.

Usage

The usage of a slice is only applicable at query time, and only one filter can be applied per query. To apply a filter to a query, use the Kaskada module method: set_default_slice.

After setting the slice, all subsequent queries will utilize the slice filter on the same session.

IPython (Jupyter) Extension

from kaskada.slice_filters import EntityPercentFilter
from kaskada.client import set_default_slice

filter_percentage = 12.34
entity_filter = EntityPercentFilter(filter_percentage)
set_default_slice(entity_filter)
%%fenl
{
    time: Purchase.purchase_time,
    entity: Purchase.customer_id,
    max_amount: Purchase.amount | max(),
    min_amount: Purchase.amount | min(),
}

Python

from kaskada import query
from kaskada.slice_filters import EntityPercentFilter
from kaskada.client import set_default_slice

filter_percentage = 12.34
entity_filter = EntityPercentFilter(filter_percentage)
set_default_slice(entity_filter)

query = '''
{
    time: Purchase.purchase_time,
    entity: Purchase.customer_id,
    max_amount: last(Purchase.amount) | max(), min_amount: Purchase.amount | min()
}
'''

query.create_query(expression=query)

Entity Keys Filter

The entity keys filter slices the input dataset to the provided entity keys. Once the filter is applied, only data with an entity key in the provided entity keys will be queryable. Currently, we only support numeric and string entity key filtering.

Here is an example of creating an entity key filter:

from kaskada.slice_filters import EntityFilter
from kaskada.client import set_default_slice

entity_keys = ["customer_01", "customer_03"]
entity_filter = EntityFilter(entity_keys)
set_default_slice(entity_filter)

The example above creates a new EntityFilter from the Kaskada Slicing module with entity key filters for "customer_01" and "customer_03".

Entity Key Filter Additional Details

  • The provided keys must match the table’s entity key type, and only numeric/string types are supported.