Amazon S3 Parquet

The Amazon S3 Parquet Data Pool lets you synchronize Parquet files stored in your Amazon S3 bucket to Propel, providing an easy way to power your analytic dashboards, reports, and workflows with a low-latency data API on top of your data lake.

Consider using the Amazon S3 Parquet Data Pool when:

  • You require sub-second query performance for dashboards or reports on your data lake.
  • You need to support high-concurrency and high-availability data workloads, such as customer-facing or mission-critical applications.
  • You require fast data access through an API for web and mobile apps.
  • You are building B2B SaaS or consumer applications that require multi-tenant access controls.

Get started

Set up guide

Follow our step-by-step Amazon S3 Parquet setup guide to connect Parquet files stored in an Amazon S3 bucket to Propel.

Architecture Overview

Amazon S3 Parquet Data Pools connect to a specified Amazon S3 bucket and automatically synchronize Parquet files from the bucket into your Data Pool in Propel.

The architectural overview when connecting an Amazon S3 bucket with Parquet files to Propel.

Features

Amazon S3 Parquet Data Pools support the following features:

| Feature name | Supported | Notes |
| --- | --- | --- |
| Syncs new records | ✅ | |
| Real-time updates | ❌ | Real-time updates are not yet supported. |
| Real-time deletes | ❌ | Real-time deletes are not yet supported. See the Delete Job API for batch deletes. |
| Re-sync | ❌ | Re-syncing from the source Parquet files is not yet supported. |
| Configurable sync interval | ✅ | See the How Propel syncs section below. It can be configured to occur at intervals ranging from every minute to every 24 hours. |
| Sync Pausing / Resuming | ✅ | |
| Delete Job API | ✅ | See Delete Job API. |
| API configurable | ✅ | See API reference docs. |
| Terraform configurable | ✅ | See Propel Terraform docs. |

How Propel syncs Parquet files in Amazon S3

The Amazon S3-based Data Pool synchronizes Parquet files from your S3 bucket into the Data Pool. To do this, you need to specify the bucket name, the path to the files, and a sync interval. The sync interval determines how frequently files are synchronized.

The sync interval can range from 1 minute to 24 hours. During each sync, Propel retrieves all new files in the S3 bucket and synchronizes them with the Data Pool.
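
To make the three inputs concrete, here is a minimal sketch that groups them and enforces the interval bounds stated above. The class name `S3ParquetSyncSettings` and its fields are hypothetical and are not part of Propel's API; the bucket and path are example values only.

```python
from dataclasses import dataclass
from datetime import timedelta

# Bounds taken from the text above: 1 minute to 24 hours.
MIN_SYNC_INTERVAL = timedelta(minutes=1)
MAX_SYNC_INTERVAL = timedelta(hours=24)


@dataclass(frozen=True)
class S3ParquetSyncSettings:
    """Hypothetical container for bucket name, path, and sync interval."""

    bucket: str
    path: str
    sync_interval: timedelta

    def __post_init__(self) -> None:
        if not self.bucket:
            raise ValueError("bucket name is required")
        if not self.path:
            raise ValueError("path is required")
        if not MIN_SYNC_INTERVAL <= self.sync_interval <= MAX_SYNC_INTERVAL:
            raise ValueError("sync interval must be between 1 minute and 24 hours")


# Example values only: sync every 15 minutes.
settings = S3ParquetSyncSettings(
    bucket="tacosoft",
    path="**/*.parquet",
    sync_interval=timedelta(minutes=15),
)
```

Validating the interval on the client side like this only catches mistakes early; the actual limits are whatever Propel enforces when the Data Pool is created.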

Syncing all files in the Amazon S3 bucket

To sync all Parquet files in your S3 bucket across all paths, use the path value provided below:

**/*.parquet

Tip: Notice that the S3 paths only match Parquet files using the *.parquet wildcard pattern. This is important because we don't want to attempt to sync non-Parquet files.

Syncing files in a specific path

To sync all Parquet files in a specific path of your S3 bucket, use the path value for that specific directory.

For instance, consider an S3 bucket with “sales” and “maintenance” directories as shown below:

s3://tacosoft
├── sales
│   ├── metadata.txt
│   ├── orders_1.parquet
│   ├── orders_2.parquet
│   └── orders_3.parquet
└── maintenance
    ├── metadata.txt
    ├── schedule_1.parquet
    ├── schedule_2.parquet
    └── schedule_3.parquet

If you only want to sync the data in the “sales” directory to Propel, use the path value provided below:

sales/**/*.parquet

Tip: Notice that the S3 paths only match Parquet files using the *.parquet wildcard pattern. This is important because we don't want to attempt to sync non-Parquet files, like metadata.txt.
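
To preview which objects a path value would select, the sketch below applies a glob pattern to the example bucket's keys. It assumes common glob semantics (`**/` matches zero or more directories, `*` matches within a single path segment); Propel's actual matching rules may differ, and the helper names `glob_to_regex` and `matching_keys` are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a path glob into a regex (assumed semantics, not Propel's code)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern[i] == "*":
            parts.append("[^/]*")  # any characters within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matching_keys(pattern: str, keys: list[str]) -> list[str]:
    """Return the object keys that the whole pattern matches."""
    regex = glob_to_regex(pattern)
    return [key for key in keys if regex.fullmatch(key)]


# Object keys from the example bucket above.
keys = [
    "sales/metadata.txt",
    "sales/orders_1.parquet",
    "sales/orders_2.parquet",
    "sales/orders_3.parquet",
    "maintenance/metadata.txt",
    "maintenance/schedule_1.parquet",
    "maintenance/schedule_2.parquet",
    "maintenance/schedule_3.parquet",
]

for key in matching_keys("sales/**/*.parquet", keys):
    print(key)
```

Under these assumptions, `sales/**/*.parquet` selects only the three `orders_*.parquet` files, while `**/*.parquet` selects all six Parquet files and skips both `metadata.txt` files.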

Data requirements

The Parquet files you sync to Propel must meet the following requirements:

  • Must have at least one DATE or TIMESTAMP column to use as the primary timestamp. Propel uses the primary timestamp to order and partition your data in Data Pools, and it serves as the time dimension on your Metrics. The column is required, cannot be nullable, and cannot be changed after the Data Pool is created. Timestamps without a timezone will be synced as UTC. See our guide “Selecting the right primary timestamp column for your Data Pool” to learn more.
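
The requirement above can be checked before creating a Data Pool. The following is a minimal sketch that works on a simplified, hypothetical schema description (`Column` is not a Propel or Parquet library type); in practice you would read the column names, types, and nullability from your Parquet files with a Parquet library of your choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical, simplified description of one Parquet column."""

    name: str
    type: str  # for example "TIMESTAMP", "DATE", "STRING"
    nullable: bool


def primary_timestamp_candidates(columns: list[Column]) -> list[str]:
    """Names of columns that are DATE or TIMESTAMP and not nullable."""
    return [
        column.name
        for column in columns
        if column.type in {"DATE", "TIMESTAMP"} and not column.nullable
    ]


# Illustrative schema; the column names are made up for this example.
schema = [
    Column("order_id", "STRING", nullable=False),
    Column("created_at", "TIMESTAMP", nullable=False),
    Column("shipped_at", "TIMESTAMP", nullable=True),
]

candidates = primary_timestamp_candidates(schema)
if not candidates:
    raise ValueError("no non-nullable DATE or TIMESTAMP column found")
print(candidates)
```

Here only `created_at` qualifies: `shipped_at` has the right type but is nullable, so it could not serve as the primary timestamp.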

Data Types

The table below describes default data type mappings from Parquet types to Propel types. When creating an Amazon S3 Parquet Data Pool, you can modify these default mappings. For instance, if you know that a column typed as an integer, such as INT64, contains a UNIX timestamp, you can convert it to a TIMESTAMP by changing the default mapping.

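| Parquet Type | Propel Type | Notes |
| --- | --- | --- |
| BOOLEAN | BOOLEAN | |
| INT8 | INT8 | |
| UINT8 | INT16 | |
| INT16 | INT16 | |
| UINT16 | INT32 | |
| INT32 | INT32 | |
| UINT32 | INT64 | |
| INT64 | INT64 | |
| UINT64 | INT64 | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DECIMAL(p ≤ 9, s=0) | INT32 | |
| DECIMAL(p ≤ 9, s>0) | FLOAT | |
| DECIMAL(p ≤ 18, s=0) | INT64 | |
| DECIMAL(p ≤ 18, s>0) | DOUBLE | |
| DECIMAL(p ≤ 76, s) | DOUBLE | |
| DATE | DATE | |
| TIME (ms) | INT32 | |
| TIME (µs, ns) | INT64 | |
| TIMESTAMP | TIMESTAMP | |
| INT96 | TIMESTAMP | |
| BINARY | STRING | |
| STRING | STRING | |
| ENUM | STRING | |
| FIXED_LENGTH_BYTE_ARRAY | STRING | |
| MAP | JSON | |
| LIST | JSON | |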
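
The DECIMAL and TIME rows depend on precision, scale, or unit, which can be easier to follow as code. The sketch below encodes the default mappings from the table above as a lookup plus two small functions. The names `DEFAULT_TYPE_MAP`, `default_decimal_type`, and `default_time_type` are hypothetical and only illustrate the table; they are not part of Propel's API.

```python
# Default mappings for the non-parameterized Parquet types in the table above.
DEFAULT_TYPE_MAP = {
    "BOOLEAN": "BOOLEAN",
    "INT8": "INT8", "UINT8": "INT16",
    "INT16": "INT16", "UINT16": "INT32",
    "INT32": "INT32", "UINT32": "INT64",
    "INT64": "INT64", "UINT64": "INT64",
    "FLOAT": "FLOAT", "DOUBLE": "DOUBLE",
    "DATE": "DATE", "TIMESTAMP": "TIMESTAMP", "INT96": "TIMESTAMP",
    "BINARY": "STRING", "STRING": "STRING", "ENUM": "STRING",
    "FIXED_LENGTH_BYTE_ARRAY": "STRING",
    "MAP": "JSON", "LIST": "JSON",
}


def default_decimal_type(precision: int, scale: int) -> str:
    """Default Propel type for DECIMAL(precision, scale), per the table."""
    if precision < 1 or precision > 76:
        raise ValueError("precision must be between 1 and 76")
    if precision <= 9:
        return "INT32" if scale == 0 else "FLOAT"
    if precision <= 18:
        return "INT64" if scale == 0 else "DOUBLE"
    return "DOUBLE"  # 18 < precision <= 76, any scale


def default_time_type(unit: str) -> str:
    """Default Propel type for TIME with unit 'ms', 'us', or 'ns'."""
    if unit == "ms":
        return "INT32"
    if unit in {"us", "ns"}:
        return "INT64"
    raise ValueError(f"unknown TIME unit: {unit}")


print(DEFAULT_TYPE_MAP["UINT16"])    # unsigned types widen to the next signed size
print(default_decimal_type(12, 2))   # fractional decimal with precision <= 18
print(default_time_type("us"))
```

Note that every unsigned integer type except UINT64 maps to the next wider signed type, and that decimals with a non-zero scale map to floating-point types by default.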

API reference documentation

Below is the relevant API documentation for the Amazon S3 Parquet Data Pool.

Queries

Mutations

Limits

No limits at this point.