Dataframe Column Matches Serialized Series

Compare a DataFrame column with a serialized Series

Location of the snippet: python/pandas/dataframes/dataframe_column_matches_serialized_series.

This snippet compares a column of a DataFrame with a serialized Series. By default, the serialized data is stored in /root/.cache/.local/.trash/, which you should REMOVE before your students have access to the project. The default can be overridden with the location parameter.

When the location parameter is left blank, the assertion looks for the file in the default location /root/.cache/.local/.trash/.
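
One way to carry out the removal is a small cleanup step run once the reference files are no longer needed. The sketch below is an assumption about workflow, not part of the snippet: remove_reference_dir is a hypothetical helper, and it is demonstrated on a temporary directory rather than on the real default path.

```python
import shutil
import tempfile
from pathlib import Path


def remove_reference_dir(location: str) -> bool:
    """Hypothetical helper: delete the reference directory if it exists."""
    path = Path(location)
    if not path.is_dir():
        return False  # nothing to remove
    shutil.rmtree(path)
    return True


# Demonstration on a throwaway directory, not the real default location.
demo_dir = Path(tempfile.mkdtemp()) / ".trash"
demo_dir.mkdir()
(demo_dir / "expected_scores_ex1.pkl").write_bytes(b"placeholder")

removed = remove_reference_dir(str(demo_dir))
removed_again = remove_reference_dir(str(demo_dir))
```

In a real project the argument would be the default location named above, or whatever was passed as location.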

Device Type: Jupyter

Variables:

| Variable Name | Variable Description | Type | Required? | Default |
| --- | --- | --- | --- | --- |
| df_variable_name | Name of the student's DataFrame variable | str | Yes | |
| column_name | Name of the column in the DataFrame to compare | str | Yes | |
| serialized_series_file_name | Name of the serialized Series file | str | Yes | |
| location | Location of the serialized Series file | str | No | |
| read_kwargs | Optional dictionary of additional keyword arguments for reading the file | dict | No | |
| testing_kwargs | Extra keyword arguments to pass to the testing function | dict | No | |
| serialization_method | Explicit method for serialization. If blank, it is inferred from the file extension. Choices: pickle, csv. | str | No | |
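
To make the roles of location, read_kwargs, and serialization_method concrete, here is a minimal sketch of how a loader could resolve them. It is not the snippet's actual implementation: load_reference_series, the extension mapping, and the choice of the last CSV column are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

DEFAULT_LOCATION = "/root/.cache/.local/.trash/"  # default stated on this page

# Assumed mapping from file extension to serialization method.
EXTENSION_TO_METHOD = {".pkl": "pickle", ".pickle": "pickle", ".csv": "csv"}


def load_reference_series(file_name, location=None, read_kwargs=None,
                          serialization_method=None):
    """Hypothetical loader mirroring the parameters in the table above."""
    path = os.path.join(location or DEFAULT_LOCATION, file_name)
    extension = os.path.splitext(file_name)[1].lower()
    # An explicit method wins; otherwise fall back to the file extension.
    method = serialization_method or EXTENSION_TO_METHOD.get(extension)
    if method == "pickle":
        with open(path, "rb") as f:
            return pickle.load(f)
    if method == "csv":
        frame = pd.read_csv(path, **(read_kwargs or {}))
        # Assumption: the reference values sit in the last column.
        return frame.iloc[:, -1]
    raise ValueError(f"Unsupported serialization method: {method!r}")


scores = pd.Series([85, 92, 78], name="Score")
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "ref.pkl"), "wb") as f:
        pickle.dump(scores, f)
    scores.to_csv(os.path.join(tmp, "ref.csv"), index=False)

    from_pickle = load_reference_series("ref.pkl", location=tmp)
    from_csv = load_reference_series("ref.csv", location=tmp)
```

Both calls leave serialization_method blank, so the method is taken from the .pkl and .csv extensions.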

Examples:

1. Basic Column Comparison (Pickle Serialization, Default Location)

This example verifies if a specific column in the student's DataFrame matches a reference Series serialized using Pickle, stored in the default cache location.

Scenario: The student creates a DataFrame student_grades_df and you want to check if their 'Score' column is correct.

Task: Create a DataFrame named student_grades_df that includes at least a 'Name' and 'Score' column.

Example student_grades_df (in student's notebook):

      Name  Score  Attempts
0    Alice     85         2
1      Bob     92         1
2  Charlie     78         3

Placeholder for student's notebook:

student_grades_df = ...

Solution:

import pandas as pd

student_grades_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 92, 78],
    "Attempts": [2, 1, 3]
})

Serialization:

import pandas as pd
import pickle
import os

# Data for the reference Series
expected_scores_series = pd.Series([85, 92, 78], name='Score')

# Define the default location
default_location = "/root/.cache/.local/.trash/"
os.makedirs(default_location, exist_ok=True) # Ensure the directory exists

# Define the reference file path
reference_file_name = "expected_scores_ex1.pkl"
reference_file_path = os.path.join(default_location, reference_file_name)

# Serialize the Series using pickle
with open(reference_file_path, 'wb') as f:
    pickle.dump(expected_scores_series, f)

print(f"Reference Series saved to: {reference_file_path}")

Snippet for the assertion:

| Variable Name | Value |
| --- | --- |
| df_variable_name | student_grades_df |
| column_name | Score |
| serialized_series_file_name | expected_scores_ex1.pkl |
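
This example leaves testing_kwargs blank. The page does not name the testing function, so the sketch below assumes it behaves like pandas.testing.assert_series_equal and shows how an extra keyword such as check_dtype could change the outcome. The helper series_equal and the float-valued student column are hypothetical.

```python
import pandas as pd

expected_scores = pd.Series([85, 92, 78], name="Score")

# Hypothetical student answer: same values, but stored as floats.
student_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85.0, 92.0, 78.0],
})


def series_equal(left, right, **testing_kwargs):
    """Return True if assert_series_equal passes with the given kwargs."""
    try:
        pd.testing.assert_series_equal(left, right, **testing_kwargs)
    except AssertionError:
        return False
    return True


strict = series_equal(student_df["Score"], expected_scores)
relaxed = series_equal(student_df["Score"], expected_scores, check_dtype=False)
print(strict, relaxed)
```

Under that assumption, a testing_kwargs value of {'check_dtype': False} would accept float scores that match an integer reference.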

2. Column Comparison with CSV Serialization and Custom Location

This example demonstrates how to compare a DataFrame column with a Series stored in a CSV file, located in a non-default directory.

Scenario: The student is cleaning a dataset raw_data_df and you want to verify the 'cleaned_category' column. The reference categories are in a CSV file.

Task: Create a DataFrame named raw_data_df and process it to include a cleaned_category column.

Example raw_data_df (in student's notebook after processing):

     ID     Category  Value cleaned_category
0  A001  electronics    100      Electronics
1  B002       cloths     50          Apparel
2  C003     _books__     20            Books

Placeholder:

raw_data_df = ...

Solution:

import pandas as pd

raw_data_df = pd.DataFrame({
    'ID': ['A001', 'B002', 'C003'],
    'Category': ['electronics', 'cloths', '_books__'],
    'Value': [100, 50, 20]
})

raw_data_df['cleaned_category'] = ['Electronics', 'Apparel', 'Books']

Serialization:

import pandas as pd
import os

# Data for the reference Series
expected_categories_series = pd.Series(["Electronics", "Apparel", "Books"], name='cleaned_category')

# Define a custom location
custom_location = "/test_data/reference_series/"
os.makedirs(custom_location, exist_ok=True) # Ensure the directory exists

# Define the reference file path
reference_file_name = "expected_categories_ex2.csv"
reference_file_path = os.path.join(custom_location, reference_file_name)

# Serialize the Series to CSV
expected_categories_series.to_csv(reference_file_path)

print(f"Reference Series saved to: {reference_file_path}")
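
It can help to look at what Series.to_csv writes with its defaults, since the index and a header row are included. The sketch below reproduces the call in a temporary directory and reads the file back. Whether the snippet itself needs read_kwargs such as index_col is not stated on this page, so the index_col=0 used here is only an illustration.

```python
import os
import tempfile

import pandas as pd

expected = pd.Series(["Electronics", "Apparel", "Books"], name="cleaned_category")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "expected_categories_ex2.csv")
    expected.to_csv(path)  # same call as above: index and header are written
    with open(path) as f:
        raw_text = f.read()
    plain = pd.read_csv(path)  # the index comes back as an extra column
    restored = pd.read_csv(path, index_col=0)["cleaned_category"]

print(raw_text)
print(list(plain.columns))

# Raises AssertionError if the round trip changed the Series.
pd.testing.assert_series_equal(restored, expected)
```

Writing the file with index=False is an alternative if the index column is not wanted in the reference file.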

Snippet for the assertion:

| Variable Name | Value |
| --- | --- |
| df_variable_name | raw_data_df |
| column_name | cleaned_category |
| serialized_series_file_name | expected_categories_ex2.csv |
| location | /test_data/reference_series/ |

3. Column Comparison with read_kwargs for CSV Series

This example shows how to use read_kwargs when the reference CSV Series has a format that requires custom parsing (e.g., rows to skip or a different delimiter).

Scenario: The student calculates daily_averages_df and you need to check its 'temperature' column. The reference Series is in a CSV with comments and a custom delimiter.

Task: Create a DataFrame named daily_averages_df with a 'temperature' column.

Example daily_averages_df (in student's notebook):

        Date  temperature  Humidity
0 2023-01-01         25.1      60.5
1 2023-01-02         26.3      62.1
2 2023-01-03         24.8      58.9

Placeholder:

daily_averages_df = ...

Solution:

import pandas as pd

daily_averages_df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
    'temperature': [25.1, 26.3, 24.8],
    'Humidity': [60.5, 62.1, 58.9]
})

Serialization: The reference CSV file looks like this:

# Daily Temperature Averages Reference
25.1;
26.3;
24.8;

The following code snippet creates the reference file:

import pandas as pd
import os

# Create a custom CSV file for the reference Series
custom_location = "/test_data/"
os.makedirs(custom_location, exist_ok=True)

reference_file_name = "expected_temperatures_ex3.csv"
reference_file_path = os.path.join(custom_location, reference_file_name)

# Content with a comment line and semicolon delimiter
csv_content = """# Daily Temperature Averages Reference
25.1;
26.3;
24.8;
"""

with open(reference_file_path, 'w') as f:
    f.write(csv_content)

print(f"Reference CSV Series saved to: {reference_file_path}")

Snippet for the assertion:

| Variable Name | Value |
| --- | --- |
| df_variable_name | daily_averages_df |
| column_name | temperature |
| serialized_series_file_name | expected_temperatures_ex3.csv |
| location | /test_data/ |
| read_kwargs | {'skiprows': 1, 'sep': ';'} |
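
Before configuring the assertion, it can be useful to preview what a given read_kwargs dictionary yields when passed to pandas.read_csv directly. The sketch below does this for the reference content above. How the snippet calls the reader internally is not documented here, so the extra header=None and usecols=[0] arguments are assumptions to verify, not required settings.

```python
import io

import pandas as pd

csv_content = """# Daily Temperature Averages Reference
25.1;
26.3;
24.8;
"""

read_kwargs = {"skiprows": 1, "sep": ";"}

# With only these kwargs, plain read_csv treats "25.1;" as the header row
# and the trailing semicolon produces a second, empty column.
as_given = pd.read_csv(io.StringIO(csv_content), **read_kwargs)
print(as_given.shape)

# Assumed extras: no header row, keep only the first column.
frame = pd.read_csv(io.StringIO(csv_content), header=None, usecols=[0], **read_kwargs)
temperatures = frame.iloc[:, 0]
print(temperatures.tolist())
```

If the assertion reports a length or column mismatch for this file, adding keys like these to read_kwargs is one thing to try.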