Efficient data storage and processing are crucial for businesses and organizations dealing with large datasets. Apache Parquet is a popular columnar storage format that offers fast query performance and efficient compression, while CSV is a row-based text format that scales poorly for large analytical workloads. This blog post covers how to convert CSV files to Parquet files in Python, including dropping NaN values to prepare the data for analysis, and concludes by highlighting the advantages of using Parquet files.
Step 1: Import required libraries
We begin by importing the required libraries to read the CSV file, convert it to a Parquet file, and work with data in Pandas DataFrame format.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
```

Here, we use `pandas` to read the CSV file, `pyarrow` to convert the Pandas DataFrame to a PyArrow Table, and `pyarrow.parquet` to write the PyArrow Table to a Parquet file.
Step 2: Define a function to convert CSV to Parquet
We define a function `convert_csv_to_parquet` that takes the following arguments:

- `input_file_path`: path to the CSV file to be converted
- `output_file_path`: path to the output Parquet file
- `drop_option`: option to drop rows or columns with NaN values (if any)
```python
def convert_csv_to_parquet(input_file_path, output_file_path, drop_option):
    # Read CSV file into a Pandas DataFrame
    df = pd.read_csv(input_file_path)

    # Remove rows or columns with NaN fields based on the drop_option argument
    if drop_option == 'row':
        df = df.dropna()
    elif drop_option == 'column':
        df = df.dropna(axis=1)

    # Convert Pandas DataFrame to PyArrow Table
    table = pa.Table.from_pandas(df)

    # Write PyArrow Table to Parquet file
    pq.write_table(table, output_file_path)

    # Read the Parquet file back
    table = pq.read_table(output_file_path)

    # Convert the table to a Pandas DataFrame
    df = table.to_pandas()

    # Print the first 100 rows of the DataFrame
    print(df.head(100))
```
The function reads the CSV file using `pd.read_csv` and stores the data in a Pandas DataFrame. We then drop rows or columns with NaN fields based on the `drop_option` argument using `df.dropna`, with `axis=0` (the default) for rows and `axis=1` for columns.
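To illustrate the two drop options, here is how `dropna` behaves on a small DataFrame (the values are made up purely for demonstration):

```python
import pandas as pd
import numpy as np

# A tiny example DataFrame with one NaN value (illustrative data)
df = pd.DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})

print(df.dropna())        # drops the row containing NaN, keeping rows 0 and 1
print(df.dropna(axis=1))  # drops column "a", keeping only column "b"
```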
Next, we convert the Pandas DataFrame to PyArrow Table format using `pa.Table.from_pandas`, and write the table to a Parquet file using `pq.write_table`.

We then open the Parquet file using `pq.read_table` and convert the table back to a Pandas DataFrame using `table.to_pandas`.

Finally, we print the first 100 rows of the DataFrame using `df.head(100)`.
Step 3: Call the function with appropriate arguments
We call the function `convert_csv_to_parquet` with appropriate arguments to convert the CSV file to Parquet format.

```python
input_file_path = 'input.csv'
output_file_path = 'output.parquet'
drop_option = 'column'  # options: 'row' or 'column'

convert_csv_to_parquet(input_file_path, output_file_path, drop_option)
```

Here, we specify the paths to the input CSV file and output Parquet file, and the option to drop rows or columns with NaN values (if any).