Amazon S3 setup guide

This step-by-step guide explains how to connect Propel to your Amazon S3 bucket to sync Parquet files.

This guide covers how to:

Create an IAM policy and user for Propel
Create an Amazon S3 Data Pool

Requirements

You have a Propel account.
You have an Amazon S3 bucket configured with at least one Parquet file.

Step 1: Create an IAM policy and user for Propel

Create a new IAM policy

Create an AWS IAM policy to allow listing bucket contents and retrieving files:

Create a JSON file named propel-s3-policy.json with the following content, replacing <YOUR_BUCKET_NAME> with your actual bucket name:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:List*", "s3:Get*"],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET_NAME>",
        "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
      ]
    }
  ]
}

Use the AWS CLI to create an IAM policy:

aws iam create-policy --policy-name PropelS3Policy --policy-document file://propel-s3-policy.json

Note the ARN of the created policy from the output. You’ll need it when you attach the policy to the IAM user.

Limit policy scope by granting access only to the specific path containing your Parquet files, rather than the entire bucket.
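As a minimal sketch of what a path-scoped policy could look like, the Python snippet below generates a policy document that keeps the same `s3:List*` and `s3:Get*` actions used above but narrows them to one prefix. The function name `build_scoped_policy` and the sample bucket and prefix are hypothetical. Whether Propel connects successfully under a policy narrowed this way is an assumption, so verify it with the credentials test in Step 2.

```python
import json


def build_scoped_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document limited to one prefix of one bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed with a
                # condition on the requested prefix rather than by resource.
                "Effect": "Allow",
                "Action": ["s3:List*"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {
                    "StringLike": {"s3:prefix": [prefix, f"{prefix}/*"]}
                },
            },
            {
                # Object reads are narrowed to keys under the prefix.
                "Effect": "Allow",
                "Action": ["s3:Get*"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix; redirect the output to
    # propel-s3-policy.json to use it with `aws iam create-policy`.
    print(json.dumps(build_scoped_policy("my-bucket", "exports/parquet"), indent=2))
```

Splitting the statement in two is deliberate: the bucket ARN and the object ARNs accept different kinds of restriction, so listing is limited by condition and reading by resource.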

Create a new IAM user

Create an IAM user in your AWS account using the AWS Console or CLI.

aws iam create-user --user-name propel-s3-user

Attach the IAM policy to the user

Attach the IAM policy you created earlier directly to the IAM user you just created, using the policy ARN you noted from the create-policy output.

aws iam attach-user-policy --user-name propel-s3-user --policy-arn arn:aws:iam::<YOUR_ACCOUNT_ID>:policy/<YOUR_POLICY_NAME>
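If you did not keep the ARN from the create-policy output, it can be reassembled from your account ID and the policy name. The sketch below assumes the policy was created in the default IAM path, as in the command above; the helper names `policy_arn` and `attach_command` are hypothetical.

```python
import re


def policy_arn(account_id: str, policy_name: str) -> str:
    """Build the ARN of a customer managed IAM policy in the default path."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are exactly 12 digits")
    return f"arn:aws:iam::{account_id}:policy/{policy_name}"


def attach_command(user_name: str, account_id: str, policy_name: str) -> list[str]:
    """Return the attach-user-policy command as an argument list."""
    return [
        "aws", "iam", "attach-user-policy",
        "--user-name", user_name,
        "--policy-arn", policy_arn(account_id, policy_name),
    ]


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    print(" ".join(attach_command("propel-s3-user", "123456789012", "PropelS3Policy")))
```

The argument list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.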

Create an access key for your IAM user

Create a new Access Key for the user you created.

aws iam create-access-key --user-name propel-s3-user

Note down the secret access key immediately, as it won’t be displayed again.

You’ll use the Access Key ID and Secret Access Key in Step 2.
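Because the secret is shown only once, it can help to capture the command output and extract both values programmatically. The following sketch assumes the AWS CLI is using its default JSON output format; `parse_access_key` is a hypothetical helper, and the sample values are the placeholder credentials used throughout AWS documentation, with an illustrative date.

```python
import json


def parse_access_key(cli_output: str) -> tuple[str, str]:
    """Extract (access key ID, secret access key) from create-access-key JSON."""
    key = json.loads(cli_output)["AccessKey"]
    return key["AccessKeyId"], key["SecretAccessKey"]


# Sample output shaped like the response of `aws iam create-access-key`.
SAMPLE_OUTPUT = """
{
  "AccessKey": {
    "UserName": "propel-s3-user",
    "AccessKeyId": "AKIAIOSFODNN7EXAMPLE",
    "Status": "Active",
    "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "CreateDate": "2024-01-01T00:00:00+00:00"
  }
}
"""

if __name__ == "__main__":
    access_key_id, secret_access_key = parse_access_key(SAMPLE_OUTPUT)
    # Print only the non-secret ID; store the secret in a secrets manager.
    print(access_key_id)
```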

For more information, read the AWS documentation on controlling access to S3 buckets.

Step 2: Create an Amazon S3 Data Pool

Now that you have configured your AWS IAM user for access, you can create an Amazon S3 Data Pool.

Create an Amazon S3 Data Pool

Go to the “Data Pools” section in the Console, click “Create Data Pool”, and click the “Amazon S3” tile.

If this is your first Amazon S3 Data Pool, you must first create Amazon S3 credentials so that Propel can connect to your bucket.

Create your Amazon S3 credentials

To create your Amazon S3 credentials, you will need the following details from Step 1:

Access Key ID: The access key ID for your IAM user.
Secret Access Key: The secret access key for your IAM user.
Bucket Name: The name of your S3 bucket.
Path: The path to the files in your S3 bucket.

Test your credentials

After entering your Amazon S3 credentials, click “Create and test credentials” to ensure Propel can successfully connect to your Amazon S3 bucket.

If the connection is successful, you will see a confirmation message. If not, check your entered credentials and try again.

Define the schema

There are two ways to define the schema for the table:

Drag and drop a Parquet file from your filesystem. The Parquet file’s schema should match the Parquet files located in your S3 bucket. Propel will scan the file and generate the schema based on the data. Propel does not upload the file; it is scanned locally in the browser only to discover the schema.

Alternatively, you can define the schema manually by specifying the name, type, and nullability for each column you want to import.

Once you’ve provided a Parquet file or defined the schema manually, you’ll see the schema preview.

Once the schema is defined, click “Next”.

Configure data type and settings

Select whether your data is “Append-only” or “Mutable data”.

To learn more, read our guide on Selecting table engine and sorting key.

Answer the questions in the wizard to complete the setup.

Confirm your table settings and click “Continue”.

Set sync interval

Specify how often you want Propel to sync your data.

Name your Data Pool

You can enable access policies later to restrict access to the data pool.

Click “Create Data Pool” to complete the setup.

Confirm setup and preview data

Ensure your data pool setup is accurate by checking the following:

Status: Ensure it’s set to “LIVE”.
Records: Validate the number of records matches your dataset.

Lastly, click on the “Preview Data” tab to view a sample of your data and ensure it looks as expected.

To create the Data Pool with the GraphQL API instead of the Console, first create a Data Source that holds your Amazon S3 credentials along with the path and schema of the files you want to ingest.

You’ll need to provide:

The AWS Access Key ID
The AWS Secret Access Key
The name of the S3 bucket
A table name that identifies the path and schema. Any name works, and you can define multiple tables per S3 bucket.
The path to the files you want to ingest

mutation {
  createS3DataSource(input: {
    uniqueName: "AmazonS3Credentials"
    description: "My Amazon S3 Credentials"
    connectionSettings: {
      awsAccessKeyId: "<YOUR_AWS_ACCESS_KEY_ID>"
      awsSecretAccessKey: "<YOUR_AWS_SECRET_ACCESS_KEY>"
      bucket: "MyS3Bucket"
      tables: [
        {
          name: "MyTable"
          path: "**/*.parquet"
          columns: [
            { name: "quantity", type: INT32, nullable: false },
            { name: "taco_name", type: STRING, nullable: false },
            { name: "sauce_name", type: STRING, nullable: false },
            { name: "restaurant_id", type: STRING, nullable: false },
            { name: "restaurant_name", type: STRING, nullable: false },
            { name: "taco_total_price", type: FLOAT, nullable: false },
            { name: "order_item_id", type: STRING, nullable: false },
            { name: "tortilla_id", type: STRING, nullable: false },
            { name: "toppings", type: JSON, nullable: false },
            { name: "sauce_id", type: STRING, nullable: false },
            { name: "taco_unit_price", type: FLOAT, nullable: false },
            { name: "order_id", type: STRING, nullable: false },
            { name: "order_item_generated_at", type: TIMESTAMP, nullable: false },
            { name: "taco_id", type: STRING, nullable: false },
            { name: "timestamp", type: TIMESTAMP, nullable: false },
            { name: "tortilla_name", type: STRING, nullable: false }
          ]
        }
      ]
    }
  }) {
    ...on DataSourceResponse {
      dataSource {
        id
        uniqueName
        status
      }
    }
  }
}
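The guide does not say how to submit the mutation, so here is one hedged way to do it from a script. GraphQL APIs conventionally accept a JSON POST body with a `query` field. The endpoint URL, the token placeholder, and the use of Bearer authentication below are assumptions that do not come from this guide; replace them with the values from Propel's API documentation.

```python
import json
import urllib.request

# Hypothetical placeholders: neither value comes from this guide.
API_URL = "https://example.invalid/graphql"
ACCESS_TOKEN = "<YOUR_PROPEL_ACCESS_TOKEN>"


def build_graphql_request(
    query: str, url: str = API_URL, token: str = ACCESS_TOKEN
) -> urllib.request.Request:
    """Wrap a GraphQL document in a JSON POST request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Bearer authentication is an assumption; check the API docs.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run(query: str) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_graphql_request(query)) as response:
        return json.load(response)
```

Calling `run(mutation_text)` with the mutation above as a string would return the response as a dictionary. With the standard GraphQL response shape, the Data Source `id` would sit under `data.createS3DataSource.dataSource.id`.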

To create the Data Pool, you need to:

Take the id of the Data Source returned by the previous mutation and use it in place of <DATA_SOURCE_ID> in the example below.
Provide the name of the table to ingest in the table field.
Specify the columns you want to ingest.

mutation {
  createDataPoolV2(
    input: {
      dataSource: "<DATA_SOURCE_ID>"
      table: "MyTable"
      timestamp: {
        columnName: "timestamp"
      }
      uniqueName: "AmazonS3DataPool"
      description: "A sample dataset consisting of orders for a taco ordering SaaS"
      accessControlEnabled: true
      tableSettings: {
        engine: {
          mergeTree: {
            type: MERGE_TREE
          }
        }
        orderBy: ["timestamp"]
      }
      columns: [
        { columnName: "quantity", type: INT32, isNullable: false },
        { columnName: "taco_name", type: STRING, isNullable: false },
        { columnName: "sauce_name", type: STRING, isNullable: false },
        { columnName: "restaurant_id", type: STRING, isNullable: false },
        { columnName: "restaurant_name", type: STRING, isNullable: false },
        { columnName: "taco_total_price", type: FLOAT, isNullable: false },
        { columnName: "order_item_id", type: STRING, isNullable: false },
        { columnName: "tortilla_id", type: STRING, isNullable: false },
        { columnName: "toppings", type: JSON, isNullable: false },
        { columnName: "sauce_id", type: STRING, isNullable: false },
        { columnName: "taco_unit_price", type: FLOAT, isNullable: false },
        { columnName: "order_id", type: STRING, isNullable: false },
        { columnName: "order_item_generated_at", type: TIMESTAMP, isNullable: false },
        { columnName: "taco_id", type: STRING, isNullable: false },
        { columnName: "timestamp", type: TIMESTAMP, isNullable: false },
        { columnName: "tortilla_name", type: STRING, isNullable: false }
      ]
    }
  ) {
    dataPool {
      id
      uniqueName
      description
      columns {
        nodes {
          columnName
          clickHouseType
          isNullable
        }
      }
    }
  }
}
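Note that the two mutations name the column fields differently: the Data Source uses `name` and `nullable`, while the Data Pool uses `columnName` and `isNullable`. A small sketch like the one below, with hypothetical helper names, can derive the Data Pool column list from the Data Source definition so the two stay in sync.

```python
def to_data_pool_columns(source_columns: list[dict]) -> list[dict]:
    """Rename Data Source column fields to the names createDataPoolV2 uses."""
    return [
        {"columnName": c["name"], "type": c["type"], "isNullable": c["nullable"]}
        for c in source_columns
    ]


def to_graphql(columns: list[dict]) -> str:
    """Render Data Pool column dicts as GraphQL input objects, one per line."""
    lines = []
    for c in columns:
        nullable = "true" if c["isNullable"] else "false"
        lines.append(
            f'{{ columnName: "{c["columnName"]}", type: {c["type"]}, isNullable: {nullable} }}'
        )
    return ",\n".join(lines)


if __name__ == "__main__":
    source = [
        {"name": "quantity", "type": "INT32", "nullable": False},
        {"name": "timestamp", "type": "TIMESTAMP", "nullable": False},
    ]
    print(to_graphql(to_data_pool_columns(source)))
```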

You can also define the Data Source and Data Pool with Terraform:

resource "propel_data_source" "my_amazon_s3" {
  unique_name = "AmazonS3Credentials"
  description = "My Amazon S3 Credentials"
  type        = "S3"

  s3_connection_settings {
    aws_access_key_id     = var.aws_access_key_id
    aws_secret_access_key = var.aws_secret_access_key
    bucket                = "MyBucket"
  }
}

variable "aws_access_key_id" {
  type      = string
  sensitive = true
}

variable "aws_secret_access_key" {
  type      = string
  sensitive = true
}

resource "propel_data_pool" "amazon_s3_data_pool" {
  unique_name             = "AmazonS3DataPool"
  description             = "A sample dataset consisting of orders for a taco ordering SaaS"
  data_source             = propel_data_source.my_amazon_s3.id
  table                   = "ORDERS"
  timestamp               = "created_at"
  access_control_enabled  = true

  table_settings {
    engine {
      type = "MERGE_TREE"
    }
    order_by = ["created_at"]
  }

  column {
    name     = "quantity"
    type     = "INT32"
    nullable = false
  }
  column {
    name     = "taco_name"
    type     = "STRING"
    nullable = false
  }
  column {
    name     = "sauce_name"
    type     = "STRING"
    nullable = false
  }
  column {
    name     = "created_at"
    type     = "TIMESTAMP"
    nullable = false
  }
  column {
    name     = "restaurant_id"
    type     = "STRING"
    nullable = false
  }
  column {
    name     = "restaurant_name"
    type     = "STRING"
    nullable = false
  }
  column {
    name     = "taco_total_price"
    type     = "FLOAT"
    nullable = false
  }
  column {
    name     = "order_item_id"
    type     = "STRING"
    nullable = false
  }
  column {
    name     = "tortilla_id"
    type     = "STRING"
    nullable = false
  }
  column {
    name     = "toppings"
    type     = "JSON"
    nullable = false
  }
  column {
    name     = "sauce_id"
    type     = "STRING"
    nullable = false
  }
  column {
    name     = "taco_unit_price"
    type     = "FLOAT"
    nullable = false
  }
}
