How to set up an AWS S3 Data Source
This guide explains how to connect Propel to your AWS S3 bucket.
It covers how to allow Propel to connect to your AWS S3 bucket, how to create the Data Source in the Propel Console, and how to know if it is working.
Requirements
- You have a Propel account.
- You have an AWS account.
- You have an AWS S3 bucket with at least one Parquet file at a specific path. The Parquet files contain the data you intend to use in Propel.
Step 1: Create an IAM Policy to allow the Propel Data Source access to read from the AWS S3 bucket
As a first step, create an AWS IAM Policy that allows two actions: listing the bucket contents and retrieving individual files.
We recommend that you configure the policy to be as specific as possible. For example, if all of your Parquet files are written to a specific location within the bucket, then grant access to that path only. A policy that is not restricted to a path grants access to all files in your named AWS S3 bucket.
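As an illustration of the two permissions described in this step, the following Python sketch builds a policy document that allows listing (s3:ListBucket) and retrieval (s3:GetObject). It is not taken from Propel's documentation: the function name, the bucket name "your-bucket-name", and the prefix are placeholders, and the prefix restriction assumes that list requests are made with a matching prefix, so test a narrowed policy before relying on it.

```python
import json


def build_read_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document that allows list and get on one bucket.

    bucket: bucket name (placeholder in this sketch).
    prefix: optional key prefix such as "path/to/your/files/" to narrow access.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"

    # Listing is granted on the bucket itself.
    list_statement = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [bucket_arn],
    }
    if prefix:
        # Assumption: callers list with a prefix that matches this condition.
        list_statement["Condition"] = {
            "StringLike": {"s3:prefix": [f"{prefix}*"]}
        }

    # Retrieval is granted on the objects, optionally below the prefix.
    get_statement = {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": [f"{bucket_arn}/{prefix}*"],
    }

    return {
        "Version": "2012-10-17",
        "Statement": [list_statement, get_statement],
    }


if __name__ == "__main__":
    policy = build_read_policy("your-bucket-name", "path/to/your/files/")
    print(json.dumps(policy, indent=2))
```

Calling build_read_policy with no prefix yields the bucket-wide variant. The printed JSON can be pasted into the policy editor of the AWS console.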
With the IAM policy created, you can create a specific AWS IAM user to access your AWS S3 bucket based on the above policy. Credentials for this user will be used by Propel.
- Create a new IAM user.
- Connect the IAM policy created above directly to the IAM user.
- Create an access key for your IAM user. Be careful to note the secret access key before closing the dialog, because the secret is not displayed again once the dialog is closed.
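The three bullets above can also be scripted. The sketch below is one possible approach, not a procedure prescribed by Propel: it assumes you pass in an object that behaves like a boto3 IAM client (for example boto3.client("iam")) and the ARN of the policy created earlier. The function name and the user name in the usage note are hypothetical.

```python
def provision_reader(iam, user_name: str, policy_arn: str) -> dict:
    """Create an IAM user, attach an existing policy, and issue an access key.

    iam: an object with the boto3 IAM client methods used below,
         e.g. boto3.client("iam") (boto3 is not imported in this sketch).
    policy_arn: ARN of the read-only policy created beforehand.
    """
    # Create the dedicated user.
    iam.create_user(UserName=user_name)

    # Attach the policy directly to the user.
    iam.attach_user_policy(UserName=user_name, PolicyArn=policy_arn)

    # Issue the key pair. The secret is only returned by this call,
    # so store it somewhere safe straight away.
    key = iam.create_access_key(UserName=user_name)["AccessKey"]
    return {
        "access_key_id": key["AccessKeyId"],
        "secret_access_key": key["SecretAccessKey"],
    }
```

A call such as provision_reader(boto3.client("iam"), "propel-s3-reader", policy_arn) returns the two values that are entered later in the Propel Console.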
You will use the Bucket Name, Access Key and the Secret Access Key in Step 2.
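Before entering these values in the Propel Console, it can help to confirm that the new credentials can list the bucket and read a Parquet file. The sketch below is an optional check, not part of Propel's setup. It assumes an object that behaves like a boto3 S3 client created with the new key pair, and it relies on the fact that Parquet files begin with the 4-byte magic value PAR1. The function name is hypothetical.

```python
PARQUET_MAGIC = b"PAR1"  # every Parquet file starts with these four bytes


def check_read_access(s3, bucket: str, prefix: str = "") -> str:
    """Exercise both permissions and return the key of the file that was read.

    s3: an object with the boto3 S3 client methods used below, e.g.
        boto3.client("s3", aws_access_key_id=..., aws_secret_access_key=...)
        (boto3 is not imported in this sketch).
    """
    # Exercise the list permission.
    list_args = {"Bucket": bucket, "MaxKeys": 100}
    if prefix:
        list_args["Prefix"] = prefix
    listing = s3.list_objects_v2(**list_args)

    keys = [
        obj["Key"]
        for obj in listing.get("Contents", [])
        if obj["Key"].endswith(".parquet")
    ]
    if not keys:
        raise LookupError("no .parquet keys found in the first page of results")

    # Exercise the get permission by reading only the first four bytes.
    response = s3.get_object(Bucket=bucket, Key=keys[0], Range="bytes=0-3")
    head = response["Body"].read()
    if head != PARQUET_MAGIC:
        raise ValueError(f"{keys[0]} does not start with the Parquet magic bytes")
    return keys[0]
```

If the policy is missing one of the two actions, the corresponding boto3 call raises an access-denied error, which shows which permission needs fixing.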
Step 2: Create the Data Source in the Propel Console
Now that you have configured your AWS S3 bucket for access, you can create an AWS S3 Data Source in the Propel Console.
- In the left-hand side menu, click on "Data Sources".
- Click on "Create Data Source".
- Select AWS S3 as the Data Source type.
- Enter a unique name and description for your Data Source and click "Next".
- Enter your AWS S3 connection details (the bucket name, plus the access key and secret access key you created above), then click "Next".
- Enter a table name. This can be anything you want, but it should make sense according to the data in your Parquet files.
- Enter a path to the location of the Parquet files that constitute your table, for example path/to/your/files/**/*.parquet. If left blank, the entire bucket will be scanned.
Specifying a path is strongly recommended, especially as the number of Parquet files in your S3 bucket increases. By specifying a path, Propel can skip over directories it doesn't need to read, thereby speeding up the syncing process.
- Define the table schema. There are two ways to do this:
  - Drag and drop a Parquet file from your filesystem. Its schema should match that of the Parquet files located in your S3 bucket. Propel scans the file and generates the schema based on the data. Propel does not upload the file; it is scanned locally in the browser only to discover the schema.
  - Alternatively, define the schema manually by specifying the name, type, and nullability for each column you want to import.
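To get a feel for which files a path such as path/to/your/files/**/*.parquet would select, the sketch below applies a common glob convention to a list of object keys. This guide does not specify Propel's exact matching rules, so the convention used here is an assumption: `**/` spans zero or more directory levels and `*` stays within a single path segment. The function names and sample keys are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob into a regex under the assumed convention:
    '**/' matches zero or more directories, '*' matches within one segment."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")  # any number of directory levels
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")  # anything except a path separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # literal character
            i += 1
    return re.compile("".join(parts))


def matching_keys(pattern: str, keys: list) -> list:
    """Return the keys that match the whole pattern, in their original order."""
    regex = glob_to_regex(pattern)
    return [key for key in keys if regex.fullmatch(key)]


if __name__ == "__main__":
    sample_keys = [
        "path/to/your/files/a.parquet",
        "path/to/your/files/2024/01/b.parquet",
        "path/to/your/files/notes.txt",
        "other/c.parquet",
    ]
    for key in matching_keys("path/to/your/files/**/*.parquet", sample_keys):
        print(key)
```

Under this convention the first two sample keys are selected, while the text file and the file outside the path are skipped.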
Once the schema is defined, click "Next". If successful, you'll see a green "Data Source created" notification along with the S3 Data Source details.
To verify your setup, click on the "Table details" button, where you can view the S3 bucket details and table schema for the imported table.
Now that you have your AWS S3 Data Source created, you can proceed to creating a Data Pool.