Parquet is a highly compressed, columnar file format for storing tabular data, and it is widely used in data engineering.
Parquet originated in the Apache Hadoop ecosystem and was created in a joint effort by engineers at Twitter and Cloudera. The first version was released in 2013, and Parquet was formally adopted as a top-level open-source project by the Apache Software Foundation in 2015.
Example of a Parquet data set
The first time many developers encounter Parquet is when they want to access analytical data sets. Many example materials use publicly available Parquet files to get a data set into a database for querying. One of the more famous examples is the TLC Trip Record data set, which stores historical trip records of New York City taxis and other vehicles. It's a worthwhile data set to play with, but be aware that some of its files can be quite large!
Where do you see Parquet used in modern data stacks?
Parquet is used as a storage and transport mechanism for moving large amounts of data between different systems. (You'll often hear data engineers refer to this process as "ETL": Extract, Transform, Load.) For example, it's common to see Parquet files stored on cloud storage platforms like Amazon S3. Parquet has long been common among users of Apache Hive, an open-source data warehouse, but commercial offerings like Snowflake's data warehouse also support loading and unloading Parquet files to and from cloud storage. Additionally, Apache Iceberg, an open-source table format we touched on previously, is commonly built around Parquet files in cloud storage.
What's the difference between Parquet and a CSV file?
If you are familiar with CSV (Comma Separated Values), a text file format for storing tabular data, you might see little reason for Parquet to exist. CSV is named precisely for how it works: different values are separated by a comma (,), and each line of the file usually represents a row of tabular data. There are many variations on CSV files, such as different separators like the pipe (|) and conventions such as treating the first line of values as headers, but they behave similarly.
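To make the CSV conventions above concrete, here is a minimal sketch using Python's standard `csv` module. The sample data and the `read_rows` helper are made up for illustration; the sketch parses the same rows once with a comma separator and once with a pipe, treating the first line as headers.

```python
import csv
import io

# Hypothetical sample data: the same two orders, comma-separated and pipe-separated.
COMMA_CSV = "order_id,country,amount\n1,US,19.99\n2,DE,5.00\n"
PIPE_CSV = "order_id|country|amount\n1|US|19.99\n2|DE|5.00\n"


def read_rows(text, delimiter=","):
    """Parse delimited text, treating the first line as column headers."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


comma_rows = read_rows(COMMA_CSV)
pipe_rows = read_rows(PIPE_CSV, delimiter="|")

# Both variants yield the same rows; every value comes back as a string.
print(comma_rows[0])
print(comma_rows == pipe_rows)
```

Note that the parser returns every value as text: CSV carries no type information, so the reader has to decide what is a number and what is a string.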
You'll see the data if you open a CSV file in a text editor. But if you open a Parquet file, you'll see that it is a binary file format. Viewing, editing, and creating a Parquet file requires specific tools or programming language libraries. The reason it's a binary format is that the file is heavily compressed to make it as small as possible. When written, it typically produces a smaller file than a gzipped CSV file, and it is optimized to be read faster when ingested by an appropriate tool. CSV works well for smaller use cases, but when you are dealing with millions of rows or more, Parquet is far more efficient. A key reason is that Parquet organizes data into columns rather than rows. Thus, tools that only need to query a subset of columns can skip over the columns they're uninterested in instead of having to read entire rows.
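The benefit of a column-oriented layout can be sketched in plain Python. This is only an in-memory illustration with made-up sample data and a hypothetical `to_columns` helper, not how Parquet lays out bytes on disk. It shows that once values are grouped by column, a query that needs one column never touches the others.

```python
# Hypothetical sample data: three orders with three fields each.
rows = [
    {"order_id": 1, "country": "US", "amount": 19.99},
    {"order_id": 2, "country": "DE", "amount": 5.00},
    {"order_id": 3, "country": "US", "amount": 12.50},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented: every whole row is visited just to read one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column-oriented: only the "amount" column is read; the others are never visited.
total_from_columns = sum(columns["amount"])

print(columns["country"])
print(total_from_rows == total_from_columns)
```

With three rows the difference is invisible, but with millions of rows and dozens of columns, skipping the unneeded columns means far less data read from disk or over the network.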
Why is Parquet so suitable for storing large amounts of data?
Parquet files are compressed when written, using a variety of algorithms. For example, suppose many rows share the same value in a column (like lots of Orders in sales data having the same value for Country). In that case, Parquet can store each unique Country value only once, along with compact references to where it occurs. This technique is known as dictionary encoding. Instead of storing the same values over and over, as a CSV would, Parquet can achieve much smaller file sizes. This also means that writing Parquet is relatively more expensive than writing simpler file formats. Parquet uses different algorithms to encode different data types, such as strings vs. integers. Because the encoding is chosen per data type, new and better algorithms can be added to Parquet as they are developed, making it even better.
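The idea behind dictionary encoding can be shown in a few lines of Python. This is a simplified sketch of the concept, not Parquet's actual on-disk encoding; the function names and the sample Country column are assumptions made for illustration.

```python
def dictionary_encode(values):
    """Return (dictionary, indices): each unique value is stored once."""
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> its index in the dictionary
    indices = []      # one small integer per original value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and indices."""
    return [dictionary[i] for i in indices]


# Hypothetical Country column from a sales table.
countries = ["US", "US", "DE", "US", "DE", "FR", "US"]

dictionary, indices = dictionary_encode(countries)
print(dictionary)  # ['US', 'DE', 'FR']
print(indices)     # [0, 0, 1, 0, 1, 2, 0]
```

The repeated strings collapse into small integers that point into a short list of unique values. Small integers with long runs of repeats compress well, which is where much of the size saving comes from.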
How can I read a Parquet file?
Parquet has broad support across programming languages and tools, with implementations available for all major languages, which makes it easy to work with. Most big data tools support importing and exporting Parquet. If you want to work quickly with Parquet, a good example is pqrs on GitHub by Manoj Karthick. It's a command line tool for reading Parquet and outputting the data as CSV or JSON. For example, using Karthick's tooling, it's easy to view the schema of a Parquet file.
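Even without a dedicated tool, a few lines of standard-library Python can confirm that a file has Parquet's framing. A Parquet file starts and ends with the 4-byte marker `PAR1`, and the 4 bytes before the trailing marker hold the length of the footer metadata as a little-endian integer. The sketch below checks only that framing. It does not decode the schema or any data, which requires a Parquet library or a tool like pqrs. The `inspect_parquet` name and the stand-in demo file are assumptions made for illustration.

```python
import os
import struct
import tempfile

MAGIC = b"PAR1"


def inspect_parquet(path):
    """Return the footer metadata length if the file has Parquet framing, else None."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Smallest possible layout: magic + 4-byte footer length + magic.
        if size < 12:
            return None
        f.seek(0)
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        footer_length_bytes = f.read(4)
        tail = f.read(4)
    if head != MAGIC or tail != MAGIC:
        return None
    # The footer length is a 4-byte little-endian unsigned integer.
    (footer_length,) = struct.unpack("<I", footer_length_bytes)
    return footer_length


# Demo with a stand-in file that has only the Parquet framing, not real data.
fake_footer = b"\x00" * 10
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
    tmp.write(MAGIC + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC)
    demo_path = tmp.name

demo_length = inspect_parquet(demo_path)
print(demo_length)  # 10
os.remove(demo_path)
```

To inspect a real file, call `inspect_parquet` with its path. A return value of `None` means the file is too short or lacks the `PAR1` markers.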
How does Propel use Parquet?
At Propel, we equip our customers to build customer-facing analytics use cases without having to operate any complex infrastructure. You may already have a bunch of analytical data sitting as Parquet files in AWS S3, or you may see how your data, as it sits today, could be stored that way. Propel can treat a set of Parquet files sitting in AWS S3 as a Data Source. You can authorize access to an AWS S3 bucket containing Parquet files, and Propel will ingest them, making them ready to serve blazingly fast analytics experiences. You can learn more about using AWS S3 with Propel by visiting the documentation.
Parquet is a columnar, binary file format that offers many advantages over row-oriented, textual file formats like CSV. It is designed for big data applications, where storage and bandwidth are at a premium. Parquet can be read and written in many different programming languages, which makes it very flexible. For analytical workloads, its performance is also superior to that of many other data formats. If you're looking for a better way to store your data, Parquet is worth considering.