How to Move Data from MongoDB to Amazon S3 in Parquet

Learn how to move data from MongoDB to Amazon S3 in Parquet format. This in-depth tutorial walks you through the steps of setting up a data pipeline for customer-facing analytics, utilizing MongoDB Atlas's Data Federation feature and AWS services.

Customer-facing analytics has emerged as a key feature for SaaS products to offer their customers. It involves surfacing insights to end customers as part of the product experience. When you log in to Stripe, Shopify, or Google AdWords, they show you all the relevant analytics you need to use their product effectively. To analyze that data effectively, you may need to move it from operational databases, such as MongoDB, to a scalable and cost-effective data storage service like Amazon S3. This transfer is often necessary to perform large-scale data processing and analytics tasks that aren't feasible within the operational database due to resource constraints.

Parquet, a columnar storage file format, is particularly suitable for this data transfer pipeline. It offers efficient data compression and encoding schemes that reduce storage space and improve query performance. Parquet is also optimized for use with big data processing frameworks like Apache Spark, Flink, Hadoop, and DuckDB. Moreover, it supports the complex nested data structures that are common in document-oriented databases like MongoDB, making it an excellent choice for storing MongoDB data.

In this tutorial, you'll learn how to set up a data pipeline to move data from MongoDB to Amazon S3 in Parquet format using MongoDB Atlas's Data Federation feature. This approach leverages MongoDB's existing Atlas infrastructure, so you don't need any external data pipelines, and it lets you query and transform data before exporting it to S3 as well as automate data movement based on event triggers. If you need more flexibility, you could instead write custom scripts in Python or another language to move data from MongoDB to Amazon S3 in Parquet format, but that would require significant development effort.

Prerequisites

You'll need the following to complete this tutorial:

- A MongoDB Atlas account with a cluster (this tutorial uses Cluster0) and the sample data set loaded, so that the sample_supplies database and its sales collection are available
- The mongosh shell installed on your machine
- An AWS account with two S3 buckets for the initial and incremental loads (this tutorial uses demo-initial-load-data-from-mongodb-to-s3 and demo-data-from-mongodb-to-s3)
- The AWS CLI installed and configured with credentials that can manage IAM roles and policies and access the S3 buckets

Moving Data from MongoDB to Amazon S3 in Parquet

Once you have all the required software, tools, and accounts in place, review the diagram below to understand the solution you'll be implementing:

This tutorial explores two different ways of moving data from MongoDB to Amazon S3: manual, one-time migrations and automated, continuous migrations.

Manual, one-time migrations are useful when you need to move a large amount of existing data from MongoDB to S3 in one go. For this, you'll use the mongosh tool to carry out the initial data load from a MongoDB collection to the S3 bucket demo-initial-load-data-from-mongodb-to-s3. This operation is performed manually and is typically done when setting up a new data pipeline or migrating data to a new storage system.

Automated, continuous migrations are useful when you need to continuously move new data from MongoDB to S3 as it comes in. For this, you'll use MongoDB Atlas's Data Federation service for incremental data load operations from the same MongoDB collection to another S3 bucket, demo-data-from-mongodb-to-s3. This operation is performed automatically and continuously, ensuring that your S3 bucket is always up to date with the latest data from MongoDB.

The Data Federation service acts as a bridging layer between the MongoDB collection and the S3 bucket. In both data load operations (initial and continuous), the final data is converted into the Parquet file format. For a successful data load operation, you need to set up AWS IAM roles and policies so that the Atlas account has the required privileges. You'll see how to do this setup shortly.

Verifying the Initial State of the MongoDB Collection

Let's begin by verifying the MongoDB collection and S3 buckets before setup.

When you log in to your Atlas account, you should see a landing overview page:

This shows that you have a MongoDB cluster ready to connect. Clicking the CONNECT button will show you a screen with tools to connect to MongoDB:

Choosing the Shell tool from the listed options will display a screen with database connection information:

Copy the connection string information displayed on the screen to connect to the MongoDB database using the mongosh tool. You should see the connection string information in a format similar to the following:

mongosh "mongodb+srv://cluster0.tvbjhhu.mongodb.net/" --apiVersion 1 --username

Open a terminal and execute the connection string command. You'll be prompted for the password for your MongoDB database (which you set up earlier). Once authentication is successful, you should see the following output:

Execute the following commands in the mongosh terminal to switch to the sample_supplies database and count the documents in the sales collection:

use sample_supplies
db.sales.countDocuments();

If you have loaded the sample data set, the output 5000 should be displayed:

Execute the following command to select a document from the sales collection:

db.sales.findOne()

You should get this output:

{
  _id: ObjectId("5bd761dcae323e45a93ccfe8"),
  saleDate: ISODate("2015-03-23T21:06:49.506Z"),
  items: [
    {
      name: 'printer paper',
      tags: [ 'office', 'stationary' ],
      price: Decimal128("40.01"),
      quantity: 2
    },
    {
      name: 'notepad',
      tags: [ 'office', 'writing', 'school' ],
      price: Decimal128("35.29"),
      quantity: 2
    }
…

You'll use the _id field during the federated database setup.

Keep the terminal with the mongosh session open, as you'll need this later.

Verifying the Initial State of the S3 Buckets

Open another terminal and execute the following commands:

aws s3 ls demo-initial-load-data-from-mongodb-to-s3
aws s3 ls demo-data-from-mongodb-to-s3

After executing these commands, you should see an empty response for each one, indicating that both S3 buckets (demo-initial-load-data-from-mongodb-to-s3 and demo-data-from-mongodb-to-s3) currently have no content. If you don't encounter any errors, your AWS CLI setup is correct. The terminal screen should look like this:
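If either bucket doesn't exist yet, you can create it before continuing. Below is a minimal sketch using this tutorial's bucket names and the ap-south-1 region that appears later in the export command; adjust both to your own setup, keeping in mind that S3 bucket names are globally unique:

# Only needed if the buckets don't already exist
aws s3 mb s3://demo-initial-load-data-from-mongodb-to-s3 --region ap-south-1
aws s3 mb s3://demo-data-from-mongodb-to-s3 --region ap-south-1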

Now that MongoDB and the S3 buckets are ready and verified in their initial state, let's move on to creating the federated database instance in the Atlas console.

Creating a Federated Database in Atlas

You'll set up your federated database instance to establish a connection between your MongoDB cluster and the Amazon S3 buckets and allow data movement between them.

To start, click the Data Federation option from the side menu in the Atlas UI. You'll be presented with a screen that offers several options for querying and transforming your data:

Click the Create New Federated Database dropdown menu and select the Feed Downstream Systems option:

You can read the information displayed on the next screen and click Get Started:

Select the cloud provider and enter a name for the federated database instance. This example uses the default options, so you can click Continue:

Next, choose the source (MongoDB cluster) that the federated database instance will connect to. Choose Cluster0 from the dropdown menu, then select Specific Collections. Make sure that sample_supplies and sales are checked.

You then have to set up an AWS role in your AWS account to perform the data load operation from MongoDB to the Amazon S3 buckets. In the Role ARN dropdown menu, select Authorize an AWS IAM Role to create a new AWS IAM role:

Copy the Atlas AWS account ARN and your unique external ID, and keep them somewhere safe. You'll need this information to add your Atlas account to the trust relationship of the AWS IAM role that you'll create shortly.

On the same screen (you might need to scroll down here), the UI will display some information that shows you how to create a new AWS IAM role in a step-by-step approach:
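For reference, the role-trust-policy.json file that the UI asks you to save should look roughly like the sketch below. The Atlas account ARN and external ID shown here are the sample values that appear later in this tutorial's output; substitute the values you copied from your own Atlas UI:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::536727724300:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "ec136f7b-9c1b-4731-a74f-3835b06d94da"
        }
      }
    }
  ]
}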

Follow the instructions as shown to save the role-trust-policy.json file, then open a terminal and switch to the directory where you stored the file. Execute the create-role AWS CLI command in the terminal:

aws iam create-role --role-name atlas-data-lake-role --assume-role-policy-document file://role-trust-policy.json

You should see the following output:

{
    "Role": {
        "Path": "/",
        "RoleName": "atlas-data-lake-role",
        "RoleId": "AROA5Z4PWBWJFEY7BG4UR",
        "Arn": "arn:aws:iam::123143412506:role/atlas-data-lake-role",
        "CreateDate": "2024-01-03T12:34:54+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": "arn:aws:iam::536727724300:root"
                    },
                    "Action": "sts:AssumeRole",
                    "Condition": {
                        "StringEquals": {
                            "sts:ExternalId": "ec136f7b-9c1b-4731-a74f-3835b06d94da"
                        }
                    }
                }
            ]
        }
    }
}

Copy the AWS role ARN shown in your terminal, which is arn:aws:iam::123143412506:role/atlas-data-lake-role in the above example. Paste the copied ARN into the role ARN field and click Validate AWS IAM role to validate the setup:

Once validated, proceed with filling in the S3 bucket information:

Remember to fill in the incremental load S3 bucket name (demo-data-from-mongodb-to-s3). The schedule you create as part of this federated database setup can only run recurring jobs, not a one-time initial load, so you'll load the existing data into the initial-load S3 bucket later using a different approach.

Once you have filled in the S3 bucket name, the UI will display the contents of adl-s3-policy.json, which is a new file that you'll create next:

Create the adl-s3-policy.json file on your machine using the code on your screen. Next, open a terminal, switch to the directory where you stored this file, and execute the following command:

aws iam put-role-policy --role-name atlas-data-lake-role --policy-name atlas-data-lake-role-policy --policy-document file://adl-s3-policy.json

This command will also be displayed on your UI screen. After executing the command, click Validate AWS S3 bucket access to validate the setup. On successful validation, click Continue.

You'll see a screen with options to schedule the data movement process. For the purposes of this tutorial, configure the values in the fields as shown below:

The above screenshot shows the data movement process scheduled to run every minute, with Parquet selected as the data file format. You don't need to write any code to achieve this format conversion. Specify the values in the other fields as follows:

Entering a value in the File Destination Root Path field generates a directory with that name inside the S3 bucket. As mentioned, the _id field is specified as the date field and uses the format selected in the Date Field Format dropdown menu. MongoDB ObjectId values embed a creation timestamp, so Atlas can treat the _id field as a valid date/timestamp. This is what enables incremental data loads, as it lets the scheduled job fetch only the documents created during a given time interval. You don't need to worry about the internal implementation, since all of this is handled automatically for you.
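If you're curious how a document's _id encodes its creation time, you can check it yourself in the open mongosh session. For example, using the _id of the sample document shown earlier (any ObjectId works):

// Returns the creation timestamp embedded in the ObjectId as an ISODate
ObjectId("5bd761dcae323e45a93ccfe8").getTimestamp()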

Click Continue to proceed to the final review and confirmation screen:

After reviewing your inputs, click Create to create the federated database instance. It might take a while for the Atlas UI screen to complete this operation.

You should be able to see a screen similar to the one below by clicking the Data Federation option in the side menu:

Click Connect, then choose Shell as the tool option on the next screen:

On the next screen, you'll see the connection string information:

Copy the connection string and keep it somewhere safe, as you'll need this information to perform the initial data load operation from MongoDB to an S3 bucket in the next step.

Setting Up the Initial Data Load from MongoDB to S3 in Parquet

First, you must ensure that the IAM role atlas-data-lake-role has the necessary permissions to access the S3 bucket demo-initial-load-data-from-mongodb-to-s3. This involves adding a policy to the IAM role. To do this, edit the adl-s3-policy.json file that you created earlier by replacing its existing contents with the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::demo-initial-load-data-from-mongodb-to-s3",
        "arn:aws:s3:::demo-initial-load-data-from-mongodb-to-s3/*",
		"arn:aws:s3:::demo-data-from-mongodb-to-s3",
        "arn:aws:s3:::demo-data-from-mongodb-to-s3/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::demo-initial-load-data-from-mongodb-to-s3",
        "arn:aws:s3:::demo-initial-load-data-from-mongodb-to-s3/*",
		"arn:aws:s3:::demo-data-from-mongodb-to-s3",
        "arn:aws:s3:::demo-data-from-mongodb-to-s3/*"
      ]
    }
  ]
}

Save the file, then execute the command below in a terminal from the directory where the file is saved:

aws iam put-role-policy --role-name atlas-data-lake-role --policy-name atlas-data-lake-role-policy --policy-document file://adl-s3-policy.json
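Optionally, you can read the inline policy back from IAM to confirm it's attached to the role before moving on. This is just a sanity check using the role and policy names from this tutorial:

aws iam get-role-policy --role-name atlas-data-lake-role --policy-name atlas-data-lake-role-policy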

Open a terminal and execute the federated database connection string command you copied earlier:

mongosh "mongodb://federateddatabaseinstance0-qwyry.a.query.mongodb.net/" --tls --authenticationDatabase admin --username user

You'll be prompted for the password for your MongoDB database. Once authenticated, execute the following command in the mongosh terminal to perform the initial data load:

db.getSiblingDB("sample_supplies").getCollection("sales").aggregate([
  {
    "$out": {
      "s3": {
        "bucket": "demo-initial-load-data-from-mongodb-to-s3",
        "region": "ap-south-1",
        "filename": { "$concat": [ "initial-load/" ] },
        "format": { "name": "parquet", "maxFileSize": "10GB", "maxRowGroupSize": "100MB" }
      }
    }
  }
], { background: true });

This command runs an aggregation that reads the sales collection in the sample_supplies database and writes the output to an S3 bucket (demo-initial-load-data-from-mongodb-to-s3) in Parquet format. The $out stage holds the destination options, and the whole operation runs in the background.
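Because $out is simply the final stage of an aggregation pipeline, you can also filter or reshape the data before it's written. As a minimal sketch that reuses this tutorial's bucket and region and the storeLocation field from the sample data set (the london-sales/ prefix is just an illustrative name), the following would export only sales made in London:

db.getSiblingDB("sample_supplies").getCollection("sales").aggregate([
  // Keep only the documents you want to export
  { "$match": { "storeLocation": "London" } },
  // Write the filtered result to S3 in Parquet format
  {
    "$out": {
      "s3": {
        "bucket": "demo-initial-load-data-from-mongodb-to-s3",
        "region": "ap-south-1",
        "filename": { "$concat": [ "london-sales/" ] },
        "format": { "name": "parquet", "maxFileSize": "10GB", "maxRowGroupSize": "100MB" }
      }
    }
  }
], { background: true });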

Once the initial data load is complete, open another terminal and execute the following AWS CLI command:

aws s3 ls demo-initial-load-data-from-mongodb-to-s3

You should see the following output:

2024-01-05 12:28:54     349259 initial-load.1.parquet

Download the Parquet file to your local machine's current directory by using the command below:

aws s3 cp s3://demo-initial-load-data-from-mongodb-to-s3/initial-load.1.parquet ./

Viewing the Parquet File Using an Online Parquet Reader

Open https://parquetreader.com/ in your browser to upload the Parquet file online. Once the file is uploaded, you'll see an output similar to the one shown below:
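If you'd rather inspect the file locally instead of uploading it, most Parquet-aware tools can read it directly. For example, assuming you have the DuckDB CLI installed (the exact invocation may vary slightly between versions), you can preview a few rows from your terminal:

duckdb -c "SELECT * FROM 'initial-load.1.parquet' LIMIT 5;"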

Setting Up Incremental Data Load from MongoDB to S3 in Parquet

Next, switch your view to the Atlas UI screen in your browser. Navigate to the Triggers page via the side menu:

This trigger was created automatically as part of the federated database setup process, based on the steps you completed when you selected your MongoDB database as a source, set your S3 bucket as a destination, and configured the scheduler. Click the name of the trigger to view its definition:

Scroll down to view the trigger function:

The whole trigger definition is created automatically for you and includes the logic to move data incrementally from the MongoDB collection to the chosen S3 bucket, demo-data-from-mongodb-to-s3.
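To give a rough idea of what's inside, the generated function is an Atlas Function that runs an aggregation with an S3 $out stage against the federated database instance, restricted to documents created since the previous run. The sketch below is purely illustrative: the data source name and output prefix are assumptions, the incremental time-window logic is omitted, and your auto-generated function will look different (don't replace it with this):

exports = function () {
  // The federated database instance is available as a linked data source inside Atlas Functions.
  // "FederatedDatabaseInstance0" is a hypothetical name; use whatever your generated function references.
  const federated = context.services.get("FederatedDatabaseInstance0");
  const sales = federated.db("sample_supplies").collection("sales");

  // The real function also adds a $match stage so that only documents created since the
  // previous run (based on the timestamp embedded in _id) are exported.
  return sales.aggregate([
    {
      $out: {
        s3: {
          bucket: "demo-data-from-mongodb-to-s3",
          region: "ap-south-1",
          filename: { $concat: ["incremental-load/"] },
          format: { name: "parquet", maxFileSize: "10GB", maxRowGroupSize: "100MB" }
        }
      }
    }
  ]).toArray();
};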

Although the trigger is enabled, you won't see any data in demo-data-from-mongodb-to-s3 until new documents are inserted into the sales collection. To check that the incremental load is working through the trigger function, insert the following document into the sales collection using the mongosh session you opened earlier by connecting to the Atlas MongoDB cluster:

use sample_supplies

db.sales.insertOne({
  "saleDate": { "$date": "2024-01-05T16:11:59.565Z" },
  "items": [
    { "name": "binder", "tags": [ "school", "general", "organization" ], "price": { "$numberDecimal": "13.44" }, "quantity": 8 },
    { "name": "binder", "tags": [ "school", "general", "organization" ], "price": { "$numberDecimal": "16.66" }, "quantity": 10 }
  ],
  "storeLocation": "London",
  "customer": { "gender": "M", "age": 44, "email": "owtar@pu.cd", "satisfaction": 2 },
  "couponUsed": false,
  "purchaseMethod": "In store"
});

On successful execution of the above command, you should see the following output:

{
  acknowledged: true,
  insertedId: ObjectId("6597fa048779c7fd66ab863d")
}

Open a terminal and execute the following AWS CLI command to verify that the incremental trigger function has moved the inserted document from the MongoDB collection to the S3 bucket:

aws s3 ls demo-data-from-mongodb-to-s3 --recursive

You should see an output with a Parquet file created in the S3 bucket:

incremental-load/Cluster0/sample_supplies/sales/1704458760378/1.parquet

As an exercise, use the aws s3 cp AWS CLI command from earlier, along with the online Parquet viewer, to verify the contents of the newly created Parquet file; a sketch of the copy command is shown below.
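The object key includes a timestamp directory that will differ in your bucket, so copy the exact path from your aws s3 ls output. Using the example path above:

aws s3 cp s3://demo-data-from-mongodb-to-s3/incremental-load/Cluster0/sample_supplies/sales/1704458760378/1.parquet ./incremental-1.parquet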

Conclusion

You've now completed the tutorial and learned how to move data from MongoDB to Amazon S3 in Parquet format. As part of this tutorial, you set up the necessary environments, created an AWS IAM role and an S3 bucket access policy, connected the MongoDB data federation instance with a MongoDB database and Amazon S3 buckets, moved the data once, and set up a continuous pipeline for future data. By now, you should have a clear understanding of how to implement this pipeline for your own customer-facing analytics use cases.

When you're working on projects or products that involve customer-facing analytics, Propel comes in handy. Propel's Serverless Analytics API platform helps you build high-performance analytics into web and mobile apps with data from your data warehouse, webhooks, streaming service, or transactional database. Propel provides and maintains the serverless ClickHouse infrastructure, allowing dev teams to focus on product experiences like usage reports, insights dashboards, or analytics APIs.

Next steps

To learn how to use S3 in customer-facing analytics in your app, check out our next blog post about Building blazing-fast data-serving APIs powered by S3 data lakes.
