Efficient data storage and processing are crucial for businesses and organizations dealing with large datasets. Apache Parquet is a popular columnar storage format that offers fast query performance and efficient compression, while CSV is a row-based text format that scales poorly for large analytical workloads. This blog post covers how to convert CSV files to Parquet files in Python, including dropping NaN values to prepare the data for analysis, and concludes by highlighting the advantages of using Parquet files.
Step 1: Import required libraries
We begin by importing the required libraries to read the CSV file, convert it to a Parquet file, and work with data in Pandas DataFrame format.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
```

Here, we use `pandas` to read the CSV file, `pyarrow` to convert the Pandas DataFrame to a PyArrow Table, and `pyarrow.parquet` to write the PyArrow Table to a Parquet file.
Step 2: Define a function to convert CSV to Parquet
We define a function `convert_csv_to_parquet` that takes the following arguments:

- `input_file_path`: path to the CSV file to be converted
- `output_file_path`: path to the output Parquet file
- `drop_option`: option to drop rows or columns with NaN values (if any)
```python
def convert_csv_to_parquet(input_file_path, output_file_path, drop_option):
    # Read CSV file into a Pandas DataFrame
    df = pd.read_csv(input_file_path)

    # Remove rows or columns with NaN fields based on the drop_option argument
    if drop_option == 'row':
        df = df.dropna()
    elif drop_option == 'column':
        df = df.dropna(axis=1)

    # Convert Pandas DataFrame to PyArrow Table
    table = pa.Table.from_pandas(df)

    # Write PyArrow Table to Parquet file
    pq.write_table(table, output_file_path)

    # Read the Parquet file back
    table = pq.read_table(output_file_path)

    # Convert the table to a Pandas DataFrame
    df = table.to_pandas()

    # Print the first 100 rows of the DataFrame
    print(df.head(100))
```
The function reads the CSV file using `pd.read_csv` and stores the data in a Pandas DataFrame. We then drop rows or columns with NaN fields based on the `drop_option` argument using `df.dropna`, with `axis=0` (the default) for rows and `axis=1` for columns.
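To illustrate the two drop options, here is how `dropna` behaves on a small DataFrame (the values are made up purely for demonstration):

```python
import pandas as pd
import numpy as np

# A tiny example DataFrame with one NaN value (illustrative data)
df = pd.DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})

print(df.dropna())        # drops the row containing NaN, keeping rows 0 and 1
print(df.dropna(axis=1))  # drops column "a", keeping only column "b"
```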
Next, we convert the Pandas DataFrame to PyArrow Table format using `pa.Table.from_pandas`, and write the table to a Parquet file using `pq.write_table`.

We then open the Parquet file using `pq.read_table` and convert the table back to a Pandas DataFrame using `table.to_pandas`.

Finally, we print the first 100 rows of the DataFrame using `df.head(100)`.
Step 3: Call the function with appropriate arguments
We call the function `convert_csv_to_parquet` with appropriate arguments to convert the CSV file to Parquet format.

```python
input_file_path = 'input.csv'
output_file_path = 'output.parquet'
drop_option = 'column'  # options: 'row' or 'column'

convert_csv_to_parquet(input_file_path, output_file_path, drop_option)
```

Here, we specify the paths to the input CSV file and output Parquet file, and the option to drop rows or columns with NaN values (if any).