How to set up an Amazon S3 Parquet Data Pool
This step-by-step guide explains how to connect Propel to your Amazon S3 bucket to sync Parquet files.
It covers how to configure your Amazon S3 bucket for Propel access, how to create the Data Pool in the Propel Console, and how to verify the setup.
Requirements
- You have a Propel account.
- You have an AWS account.
- You have an Amazon S3 bucket configured with at least one Parquet file at a specific path. The Parquet file(s) contain data you intend to operationalize for use in Propel.
Step 1: Create an IAM policy and user to allow Propel to read from the Amazon S3 bucket
1. Create a new IAM policy
First, create an AWS IAM policy that allows two actions: listing the bucket contents and retrieving individual files.
Create an IAM policy in your AWS account using the AWS Console or CLI.
In the policy, replace <YOUR_BUCKET_NAME> with the name of your bucket. The policy statement allows the following actions:
"Action": ["s3:List*", "s3:Get*"],
We recommend that you configure the policy to be as specific as possible. For example, if all of your Parquet files are written to a specific location within the bucket, then grant access to that path only. The above example policy grants access to all files in your named Amazon S3 bucket.
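As a rough illustration of the policy described above, the following Python sketch builds a policy document with the same two actions. The function name `build_policy` and the optional `prefix` argument are hypothetical, introduced only for this example. The prefix-restricted variant (using the `s3:prefix` condition key) is an assumption about how you might narrow access to one path; the guide does not say which bucket-level actions Propel needs, so check any narrowed policy against the connection test in Step 2.

```python
import json


def build_policy(bucket, prefix=None):
    """Return an IAM policy document (as a dict) granting read-only S3 access."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    if prefix is None:
        # Whole-bucket access: the bucket ARN covers listing,
        # the "/*" ARN covers the objects inside the bucket.
        statements = [{
            "Effect": "Allow",
            "Action": ["s3:List*", "s3:Get*"],
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
        }]
    else:
        # Assumption: restrict listing and reads to a single path in the bucket.
        clean = prefix.strip("/")
        statements = [
            {
                "Effect": "Allow",
                "Action": ["s3:List*"],
                "Resource": [bucket_arn],
                "Condition": {"StringLike": {"s3:prefix": [f"{clean}/*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:Get*"],
                "Resource": [f"{bucket_arn}/{clean}/*"],
            },
        ]
    return {"Version": "2012-10-17", "Statement": statements}


# Print the whole-bucket policy as JSON, ready to paste into the AWS Console.
print(json.dumps(build_policy("<YOUR_BUCKET_NAME>"), indent=2))
```

Calling `build_policy("<YOUR_BUCKET_NAME>", "exports/parquet")` would produce the narrower variant, where `exports/parquet` is a placeholder path.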
With the IAM policy created, you can create a specific AWS IAM user to access your Amazon S3 bucket based on the above policy.
2. Create a new IAM user
Create an IAM user in your AWS account using the AWS Console or CLI.
3. Attach the IAM policy created above directly to the IAM user
Attach the policy you created to the user.
4. Create an access key for your IAM user
Once the user is created, create a new access key. Note the secret access key before closing the dialog, because the secret is not displayed again once the dialog is closed.
You will use the Access Key and the Secret Access Key in Step 2.
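If you prefer to script the four sub-steps above rather than click through the AWS Console, they could be sketched with the boto3 IAM client roughly as follows. The function name `provision_reader` and the user and policy names in the usage comment are hypothetical. The sketch assumes boto3 is installed and that your own AWS credentials are allowed to manage IAM.

```python
import json


def provision_reader(iam, user_name, policy_name, policy_document):
    """Create the policy and user, attach them, and return a new access key pair.

    `iam` is expected to be a boto3 IAM client, e.g. boto3.client("iam").
    """
    # 1. Create the IAM policy from a policy document (a dict).
    policy = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_document),
    )
    policy_arn = policy["Policy"]["Arn"]

    # 2. Create the IAM user.
    iam.create_user(UserName=user_name)

    # 3. Attach the policy directly to the user.
    iam.attach_user_policy(UserName=user_name, PolicyArn=policy_arn)

    # 4. Create an access key. The secret is only returned by this call,
    #    so store it somewhere safe right away.
    key = iam.create_access_key(UserName=user_name)["AccessKey"]
    return key["AccessKeyId"], key["SecretAccessKey"]


# Usage (needs boto3 and AWS credentials; names below are placeholders):
#   import boto3
#   access_key_id, secret_access_key = provision_reader(
#       boto3.client("iam"), "propel-s3-reader", "propel-s3-read", policy_document
#   )
```

Passing the client in as an argument keeps the function easy to try against a stub before running it against a real account.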
Read the AWS documentation to learn more about controlling access to S3 buckets.
Step 2: Create the Data Pool in the Propel Console
Now that you have configured your AWS IAM user for access, you can create an Amazon S3 Data Pool in the Propel Console.
1. Go to the Data Pools section and click "Create Data Pool"
In the left-hand menu, click "Data Pools", then "Create Data Pool". Then select "Amazon S3" as the data source for your Data Pool.
2. Select the appropriate set of credentials for your Data Pool
If you haven't added any credentials, click on "Add new credentials".
- Enter a unique name for the credentials.
- Provide your AWS Access Key ID and AWS Access Key Secret.
- Add a bucket name to test the connection.
- Specify the path to the files in the S3 bucket.
- Finalize by clicking "Create and test Credentials."
3. Confirm the connection status
Ensure the status displays "CONNECTED."
- Verify that you have permissions for "List Bucket" and "Get Object."
- If needed, use the "Reconnect" button to refresh the connection.
- Proceed by clicking the "Next" button.
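If the status does not reach "CONNECTED", one way to check the two permissions yourself is to try a list and a get with the same access key. The sketch below is not part of Propel; the function name `check_read_access` is hypothetical, and it assumes a boto3 S3 client and that your files use a `.parquet` extension. It relies on the fact that Parquet files begin with the 4-byte marker `PAR1`.

```python
def check_read_access(s3, bucket, prefix):
    """Try the two operations the policy should allow: list, then get.

    `s3` is expected to be a boto3 S3 client created with the IAM user's
    access key. Permission problems surface as exceptions from the client.
    """
    # "List Bucket": list a handful of keys under the configured path.
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=50)
    keys = [obj["Key"] for obj in listing.get("Contents", [])]

    # Assumption: Parquet files are named with a .parquet extension.
    parquet_keys = [k for k in keys if k.endswith(".parquet")]
    if not parquet_keys:
        return {"listed": len(keys), "sample_key": None, "looks_like_parquet": False}

    # "Get Object": fetch only the first four bytes of one file.
    sample = parquet_keys[0]
    head = s3.get_object(Bucket=bucket, Key=sample, Range="bytes=0-3")["Body"].read()
    return {"listed": len(keys), "sample_key": sample, "looks_like_parquet": head == b"PAR1"}


# Usage (needs boto3; credentials and names below are placeholders):
#   import boto3
#   s3 = boto3.client("s3", aws_access_key_id="...", aws_secret_access_key="...")
#   print(check_read_access(s3, "<YOUR_BUCKET_NAME>", "path/to/files/"))
```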
4. Define the schema
There are two ways to define the schema for the table.
Drag and drop a Parquet file from your filesystem. The Parquet file's schema should match the Parquet files located in your S3 bucket. Propel will scan the file and generate the schema based on the data.
Note: Propel does not upload the file. The file is scanned locally in the browser for discovering the schema only.
Alternatively, you can define the schema manually by specifying the name, type, and nullability for each column you want to import.
Once the schema is defined, click "Next".
5. Set primary timestamp, sync interval, and Tenant ID
After defining your table schema, the next step is configuring the Data Pool's time-related settings and access controls.
- Primary timestamp: This is mandatory. Propel utilizes the primary timestamp to sequence and partition the data within Data Pools. It also represents the time dimension for your Metrics. The primary timestamp column cannot be null in your Parquet files.
- Sync interval: This sets how often Propel attempts to sync this Data Pool. For example, you can set it to "EVERY_1_HOUR" to synchronize the data every hour.
- Tenant ID (Optional): If you wish to control access to your data with access policies, specify the Tenant ID. This can only be set during the creation of the Data Pool.
Then, give your Data Pool a unique name and description.
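To make the constraints above concrete, here is a small self-contained sketch that models the settings and checks them before you enter them in the Console. It does not call Propel. The class names, field names, and column type labels are hypothetical; only the "EVERY_1_HOUR" value and the rule that the primary timestamp cannot be null come from this guide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    name: str
    type: str       # illustrative label only, e.g. "TIMESTAMP"
    nullable: bool


@dataclass
class DataPoolSettings:
    name: str
    columns: List[Column]
    primary_timestamp: str                # name of the timestamp column
    sync_interval: str = "EVERY_1_HOUR"
    tenant_id: Optional[str] = None       # optional; only settable at creation


def validate(settings):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    names = [c.name for c in settings.columns]
    if len(names) != len(set(names)):
        problems.append("column names must be unique")

    by_name = {c.name: c for c in settings.columns}
    ts = by_name.get(settings.primary_timestamp)
    if ts is None:
        problems.append("primary timestamp column is not in the schema")
    elif ts.nullable:
        problems.append("primary timestamp column must not be nullable")
    return problems


settings = DataPoolSettings(
    name="orders",
    columns=[
        Column("order_id", "STRING", nullable=False),
        Column("created_at", "TIMESTAMP", nullable=False),
    ],
    primary_timestamp="created_at",
)
print(validate(settings))
```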
6. Confirm setup and preview data
Ensure your Data Pool setup is accurate by checking the following:
- Status: Ensure it's set to "LIVE".
- Records: Validate the number of records matches your dataset.
Lastly, click on the "Preview Data" tab to view a sample of your data and ensure it looks as expected.
That's it! To recap, you created an Amazon S3 Parquet Data Pool in Propel that syncs Parquet files from your S3 bucket, and verified that the data arrived in the Data Pool successfully.
You can learn more about using the GraphQL API you set up and check the examples.