How to Convert CSV to Parquet Files?


Apache Parquet

Apache Parquet is a columnar data storage format, which provides a way to store tabular data column wise. Columns of same date-time are stored together as rows in Parquet format, so as to offer better storage, compression and data retrieval.

What is Row Oriented Storage Format?

In row oriented storage, data is stored row wise on to the disk.

data is stored row wise on to the disk

For example above table has three columns of different data types (Integer, String and Double). This table will be saved on disk row wise in row oriented storage format as following-

This table will be saved on disk row wise

Columnar Storage Format

In columnar storage format above table will be stored column wise.

Columnar Storage Format

As you can see in this format all the IDs are together and so are names and salaries. A Query selecting Name column will require less I/O time as all the values are adjacent unlike in row oriented format.

Actual implementation of columnar format for Apache Parquet is defined here.

Using Apache Parquet

Using Parquet format has two advantages

  1. Reduced storage
  2. Query performance

Depending on your business use case, Apache Parquet is a good option if you have to provide partial search features i.e. not querying all the columns, and you are not worried about file write time.

Apache Parquet format is supported in all Hadoop based frameworks. Queries selecting few columns from a big set of columns, run faster because disk I/O is much improved because of homogeneous data stored together.

To use Apache spark we need to convert existing data into parquet format. In this article we will learn to convert CSV files to parquet format and then retrieve them back.

CSV to Parquet

We will convert csv files to parquet format using Apache Spark. For Introduction to Spark you can refer to Spark documentation.

Below is pyspark code to convert csv to parquet. You can edit the names and types of columns as per your input.csv

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)
    schema = StructType([
            StructField("col1", IntegerType(), True),
            StructField("col2", IntegerType(), True),
            StructField("col3", StringType(), True),
            StructField("col4", StringType(), True),
            StructField("col5", StringType(), True),
            StructField("col6", DoubleType(), True)])
    rdd = sc.textFile("/home/sarvesh/Desktop/input.csv").map(lambda line: line.split(","))
    df = sqlContext.createDataFrame(rdd, schema)


Above code will create parquet files in input-parquet directory. Files will be in binary format so you will not able to read them. You can check the size of the directory and compare it with size of CSV compressed file.

For a 8 MB csv, when compressed, it generated a 636kb parquet file.

The other way: Parquet to CSV

You can retrieve csv files back from parquet files.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

if __name__ == "__main__":
    sc = SparkContext(appName="Parquet2CSV")
    sqlContext = SQLContext(sc)

    readdf = sqlContext.read.parquet('/home/sarvesh/Desktop/submissions-parquet')
    readdf.rdd.map(tuple).map(lambda row: str(row[0]) + "," + str(row[1]) + ","+ str(row[2]) + ","+ str(row[3])+ ","+
                              str(row[4])+","+ str(row[5])).saveAsTextFile("/home/sarvesh/Desktop/parquet-to-csv.csv")


You can compare the original and converted CSV files.

You can provide parquet files to your Hadoop based applications rather than providing plain CSV files.

For more information about parquet schema and performance comparisons with other storage formats you can refer to Parquet official docs.

You might also like