Loading Data into a Table

Kaskada stores data in tables. Tables consist of multiple rows, and each row is a value of the same type. The Tables section of the reference docs has more information about managing tables.

A table must be created before data can be loaded into it.
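
A table can be created with the Python client's create_table. The sketch below is minimal; the time_column_name and entity_key_column_name values are hypothetical and should match columns in your data.

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

# Create the table before loading data into it.
# The column names below are examples; use the names from your own data.
table.create_table(
  table_name = "Purchase",
  time_column_name = "purchase_time",
  entity_key_column_name = "customer_id",
)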

Any data loaded into a table must include all columns used in the table definition. The full schema of a table is inferred from the data loaded into it. At the moment, all data loaded into a table must have the same schema.

Additionally, Kaskada expects the following:

  • The time column (and any other date-time columns in your dataset) should be a type that can be cast to a timestamp, for example an integer or an RFC3339-formatted string.

  • If a subsort column is defined, the combination of the entity key column, time column, and subsort column should guarantee that each row is unique (see the sample data below).
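
For illustration, a small CSV file meeting these expectations might look like the following (the column names and values are hypothetical):

purchase_time,subsort_id,customer_id,amount
2020-01-01T00:00:00Z,0,karen,9
2020-01-01T00:00:00Z,1,karen,2
2020-01-02T00:00:00Z,0,patrick,3

Here purchase_time is an RFC3339-formatted string, and each combination of customer_id, purchase_time, and subsort_id is unique.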

Python:

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

table.load(
  table_name = "Purchase",
  file = "/path/to/file.parquet",
)

The Python library infers a file’s format from the extension of the file’s path. Parquet file names must end in .parquet. CSV file names must end in .csv.

CLI:

kaskada-cli table load Purchase file:///path/to/file.parquet

The CLI client infers a file’s format from the extension of the file’s path. Parquet file names must end in .parquet. CSV file names must end in .csv. The --type flag can also be used to force a file to be treated as Parquet or CSV.

The result of loading data is a data_token_id. The data token ID is a unique reference to the data currently stored in Kaskada. Data tokens enable repeatable queries: queries performed against the same data token always run on the same input data.
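
With the Python client, the response from a load request can be captured and inspected. A sketch, assuming table.load returns the load response:

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

# The load response carries the data token for the newly loaded data.
response = table.load(
  table_name = "Purchase",
  file = "/path/to/file.parquet",
)
print(response)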

data_token_id: "aa2***a6b9"

Supported Formats

Parquet

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. Parquet is a good choice for large numbers of events.
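
If your events are currently in a pandas DataFrame, one way to produce a Parquet file for loading is pandas' to_parquet (a sketch outside the Kaskada client; it requires a Parquet engine such as pyarrow, and the column names are hypothetical):

import pandas as pd

# Write events to a Parquet file that can then be loaded into a Kaskada table.
df = pd.DataFrame({
    "purchase_time": ["2020-01-01T00:00:00Z", "2020-01-02T00:00:00Z"],
    "customer_id": ["karen", "patrick"],
    "amount": [9, 3],
})
df.to_parquet("/path/to/file.parquet")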

Python:

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

table.load(
  table_name = "Purchase",
  file = "/path/to/file.parquet",
)

The Python library infers a file’s format from the extension of the file’s path. Parquet file names must end in .parquet.

CLI:

kaskada-cli table load Purchase file:///path/to/file.parquet

The CLI client infers a file’s format from the extension of the file’s path. The --type parquet flag can also be used to force a file to be treated as Parquet.
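
For example, a Parquet file whose name does not end in .parquet could be loaded by forcing the type (the flag placement shown here is an assumption; check kaskada-cli table load --help for the exact usage):

kaskada-cli table load Purchase file:///path/to/file --type parquet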

CSV

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record.

CSV files must include a header row, which is used to infer the file's schema. Type inference occurs when the data is loaded, and the schema is inferred by reading the first 1000 rows.

Python:

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

table.load(
  table_name = "Purchase",
  file = "/path/to/file.csv",
)

The Python library infers a file’s format from the extension of the file’s path. CSV file names must end in .csv.

CLI:

kaskada-cli table load Purchase file:///path/to/file.csv

The CLI client infers a file’s format from the extension of the file’s path. The --type csv flag can also be used to force a file to be treated as CSV.

File Location

Kaskada supports loading files from different kinds of file storage. The file storage system is specified with the file path's protocol prefix, for example file:, s3:, or gs:.

Local Storage

Data can be loaded from the local disk using the file: protocol. When loading from disk, the file path must identify a file accessible to the Kaskada service. Local storage is not recommended when using a remote Kaskada service.

AWS S3 Storage

Data can be loaded from AWS S3 (or compatible stores such as MinIO) using the s3: protocol. When loading non-public objects from S3, the Kaskada service must be configured with credentials. Credentials are configured using environment variables.

The following environment variables are used to configure credentials:

  • AWS_ACCESS_KEY_ID: AWS credential key.

  • AWS_SECRET_ACCESS_KEY: AWS credential secret.

  • AWS_DEFAULT_REGION: The AWS region to use when a region is not otherwise specified.

  • AWS_ENDPOINT: The S3 endpoint to connect to.

  • AWS_SESSION_TOKEN: The session token. Session tokens are required for credentials created by assuming an IAM role.

  • AWS_CONTAINER_CREDENTIALS_RELATIVE_URI: The relative URI used to obtain credentials from an ECS task IAM role; see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html.

  • AWS_ALLOW_HTTP: Set to “true” to permit HTTP connections without TLS.
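
For example, a MinIO instance running locally might be configured with the following environment variables before starting the Kaskada service, after which objects can be loaded using s3: paths (the credentials, endpoint, bucket, and object names are hypothetical):

export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123
export AWS_ENDPOINT=http://127.0.0.1:9000
export AWS_ALLOW_HTTP=true
export AWS_DEFAULT_REGION=us-east-1

kaskada-cli table load Purchase s3://my-bucket/path/to/file.parquet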

GCP GS Storage

Data can be loaded from GCP GS using the gs: protocol. As with AWS S3 storage, the Kaskada service must be configured with credentials.

The following environment variables are used to configure credentials:

  • GOOGLE_SERVICE_ACCOUNT: Location of the service account file.

  • GOOGLE_SERVICE_ACCOUNT_PATH: (alias) Location of the service account file.

  • SERVICE_ACCOUNT: (alias) Location of the service account file.

  • GOOGLE_SERVICE_ACCOUNT_KEY: JSON-serialized service account key.
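
For example, pointing the Kaskada service at a service account file and then loading an object with a gs: path (the file path, bucket, and object names are hypothetical):

export GOOGLE_SERVICE_ACCOUNT=/path/to/service-account.json

kaskada-cli table load Purchase gs://my-bucket/path/to/file.parquet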