Loading Data into a Table
Kaskada stores data in tables. Tables consist of multiple rows, and each row is a value of the same type. The Tables section of the reference docs has more information about managing tables.
A table must be created before data can be loaded into it.
Any data loaded into a table must include all columns used in the table definition. The full schema of a table is inferred from the data loaded into it. At the moment, all data loaded into a table must have the same schema.
Additionally, Kaskada expects the following:
- The time column (and all other date-time columns in your dataset) should be a type that can be cast to a timestamp, for example an integer or an RFC3339-formatted string.
- If a subsort column is defined, the combination of the entity key column, time column, and subsort column should guarantee that each row is unique.
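Creating tables is covered in the Tables reference; as a reminder, a minimal sketch with the Python client looks like the following (the column names purchase_time and customer_id are illustrative):

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

# Create the table before loading data into it.
# The time and entity key column names are illustrative.
table.create_table(
    table_name = "Purchase",
    time_column_name = "purchase_time",
    entity_key_column_name = "customer_id",
)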
Python:
from kaskada import table
from kaskada.api.session import LocalBuilder
session = LocalBuilder().build()
table.load(
    table_name = "Purchase",
    file = "/path/to/file.parquet",
)
The Python library infers a file’s format from the extension of the file’s path.
Parquet file names must end in .parquet.
CLI:
kaskada-cli table load Purchase file:///path/to/file.parquet
The CLI client infers a file’s format from the extension of the file’s path.
Parquet file names must end in .parquet.
The result of loading data is a data_token_id.
The data token ID is a unique reference to the data currently stored in Kaskada.
Data tokens enable repeatable queries: queries performed against the same data token always run on the same input data.
For example, a successful load returns something like:
data_token_id: "aa2***a6b9"
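From the Python client, the token can be captured from the load call's return value. A sketch, assuming the load response exposes a data_token_id field as the generated client API does:

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

# Capture the data token returned by the load call.
response = table.load(
    table_name = "Purchase",
    file = "/path/to/file.parquet",
)
print(response.data_token_id)  # e.g. "aa2***a6b9"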
Supported Formats
Parquet
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. Parquet is a good choice for large numbers of events.
Python:
from kaskada import table
from kaskada.api.session import LocalBuilder
session = LocalBuilder().build()
table.load(
    table_name = "Purchase",
    file = "/path/to/file.parquet",
)
The Python library infers a file’s format from the extension of the file’s path.
Parquet file names must end in .parquet.
CLI:
kaskada-cli table load Purchase file:///path/to/file.parquet
The CLI client infers a file’s format from the extension of the file’s path.
The file name must end in .parquet.
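To make the schema expectations concrete, here is a sketch of producing a loadable Parquet file with pandas; the column names and values are hypothetical:

import pandas as pd

# Hypothetical purchase events. purchase_time holds RFC3339 strings,
# which can be cast to timestamps when the data is loaded.
df = pd.DataFrame({
    "purchase_time": ["2020-01-01T00:00:00Z", "2020-01-02T00:00:00Z"],
    "customer_id": ["karen", "patrick"],
    "amount": [9, 3],
})
df.to_parquet("/path/to/file.parquet")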
CSV
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record.
CSV files must include a header row, which is used to infer the file’s schema. CSV type inference occurs when data is loaded. The schema is inferred by reading the first 1000 rows.
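For illustration, a sketch of writing such a file with Python's csv module; the column names and values are hypothetical, and the header row is what the schema is inferred from:

import csv

# The header row is required; the schema is inferred from it
# (and from the first 1000 rows of values).
with open("/path/to/file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["purchase_time", "customer_id", "amount"])
    writer.writerow(["2020-01-01T00:00:00Z", "karen", "9"])
    writer.writerow(["2020-01-02T00:00:00Z", "patrick", "3"])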
Python:
from kaskada import table
from kaskada.api.session import LocalBuilder
session = LocalBuilder().build()
table.load(
    table_name = "Purchase",
    file = "/path/to/file.csv",
)
The Python library infers a file’s format from the extension of the file’s path.
CSV file names must end in .csv.
CLI:
kaskada-cli table load Purchase file:///path/to/file.csv
The CLI client infers a file’s format from the extension of the file’s path.
The file name must end in .csv.
File Location
Kaskada supports loading files from different kinds of file storage.
The file storage system is specified with the file path's protocol prefix, for example file: or s3:.
Local Storage
Data can be loaded from the local disk using the file: protocol.
When loading from disk, the file path must identify a file accessible to the Kaskada service.
Local storage is not recommended when using a remote Kaskada service.
AWS S3 Storage
Data can be loaded from AWS S3 (or compatible stores such as Minio) using the s3: protocol.
When loading non-public objects from S3, the Kaskada service must be configured with credentials.
Credentials are configured using environment variables.
The following environment variables are used to configure credentials:
- AWS_ACCESS_KEY_ID: AWS credential key.
- AWS_SECRET_ACCESS_KEY: AWS credential secret.
- AWS_DEFAULT_REGION: The AWS S3 region to use if not specified.
- AWS_ENDPOINT: The S3 endpoint to connect to.
- AWS_SESSION_TOKEN: The session token. Session tokens are required for credentials created by assuming an IAM role.
- AWS_CONTAINER_CREDENTIALS_RELATIVE_URI: See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html
- AWS_ALLOW_HTTP: Set to "true" to permit HTTP connections without TLS.
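With those variables set in the service's environment, loading from S3 is the same load call with an s3: path. A sketch; the bucket and key below are hypothetical:

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

# Load directly from S3. The bucket name and object key are
# hypothetical; credentials come from the environment variables above.
table.load(
    table_name = "Purchase",
    file = "s3://my-bucket/path/to/file.parquet",
)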
GCP GS Storage
Data can be loaded from GCP GS using the gs: protocol. Like the AWS S3 storage configuration, the Kaskada service must be configured with credentials.
The following environment variables are used to configure credentials:
- GOOGLE_SERVICE_ACCOUNT: location of the service account file
- GOOGLE_SERVICE_ACCOUNT_PATH: (alias) location of the service account file
- SERVICE_ACCOUNT: (alias) location of the service account file
- GOOGLE_SERVICE_ACCOUNT_KEY: JSON-serialized service account key
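Once credentials are configured, loading from GS likewise only changes the path's protocol. A sketch; the bucket below is hypothetical:

from kaskada import table
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()

# Load directly from Google Cloud Storage. The bucket is hypothetical;
# credentials come from the environment variables above.
table.load(
    table_name = "Purchase",
    file = "gs://my-bucket/path/to/file.parquet",
)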