DataFrame Column Matches Serialized Series

Compare a DataFrame column with a serialized Series

Location of the snippet: python/pandas/dataframes/dataframe_column_matches_serialized_series.

This snippet compares a column of a DataFrame to a serialized Series. By default, the serialized file is read from /root/.cache/.local/.trash/. Remove this directory before your students have access to the project. To read the file from a different directory, set the location parameter.

If the location parameter is left blank, the assertion looks for the file in the default location /root/.cache/.local/.trash/.
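The fallback described above can be sketched as follows. This is only an illustration under assumptions: `resolve_reference_path` and `DEFAULT_LOCATION` are hypothetical names, not part of the snippet, and the snippet's actual path handling may differ.

```python
import os

# Default location named in this document.
DEFAULT_LOCATION = "/root/.cache/.local/.trash/"


def resolve_reference_path(file_name, location=None):
    """Hypothetical helper: use `location` when given, else the default directory."""
    base_dir = location if location else DEFAULT_LOCATION
    return os.path.join(base_dir, file_name)


# location left blank: the default directory is used
default_path = resolve_reference_path("expected_scores_ex1.pkl")

# location set: the custom directory is used
custom_path = resolve_reference_path(
    "expected_categories_ex2.csv", "/test_data/reference_series/"
)
```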

Device Type
Jupyter

Variables:

Variable Name	Variable Description	Type	Required?
`df_variable_name`	Name of the student's DataFrame variable	`str`	Yes
`column_name`	Name of the column in the DataFrame to compare	`str`	Yes
`serialized_series_file_name`	Name of the serialized Series file	`str`	Yes
`location`	Location of the serialized Series file	`str`	No
`read_kwargs`	Optional dictionary of additional keyword arguments for reading the file.	`dict`	No
`testing_kwargs`	Extra keyword arguments to pass to the testing function	`dict`	No
`serialization_method`	Explicit method for serialization. If blank, it is inferred from the file extension. Choices: pickle, csv.	`str`	No
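To illustrate how `serialization_method` could be inferred from the file extension when it is left blank, here is a minimal sketch. The function name and the exact extension mapping (for example, whether `.pickle` is accepted alongside `.pkl`) are assumptions and not confirmed behavior of the snippet.

```python
import os

# Assumed mapping from file extension to serialization method.
EXTENSION_TO_METHOD = {
    ".pkl": "pickle",
    ".pickle": "pickle",
    ".csv": "csv",
}


def infer_serialization_method(file_name, serialization_method=None):
    """Hypothetical helper: an explicit method wins, else infer from the extension."""
    if serialization_method:
        return serialization_method
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in EXTENSION_TO_METHOD:
        raise ValueError(f"Cannot infer serialization method from {file_name!r}")
    return EXTENSION_TO_METHOD[extension]
```

Under these assumptions, `expected_scores_ex1.pkl` resolves to pickle and `expected_categories_ex2.csv` resolves to csv, which is why the examples below do not set `serialization_method`.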

Examples:

1. Basic Column Comparison (Pickle Serialization, Default Location)

This example verifies if a specific column in the student's DataFrame matches a reference Series serialized using Pickle, stored in the default cache location.

Scenario: The student creates a DataFrame student_grades_df and you want to check if their 'Score' column is correct.

Task: Create a DataFrame named student_grades_df that includes at least a 'Name' and 'Score' column.

Example student_grades_df (in student's notebook):

     Name  Score  Attempts
  Alice     85         2
    Bob     92         1
Charlie     78         3

Placeholder for student's notebook:

student_grades_df = ...

Solution:

import pandas as pd

student_grades_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 92, 78],
    "Attempts": [2, 1, 3]
})

Serialization:

import pandas as pd
import pickle
import os

# Data for the reference Series
expected_scores_series = pd.Series([85, 92, 78], name='Score')

# Define the default location
default_location = "/root/.cache/.local/.trash/"
os.makedirs(default_location, exist_ok=True) # Ensure the directory exists

# Define the reference file path
reference_file_name = "expected_scores_ex1.pkl"
reference_file_path = os.path.join(default_location, reference_file_name)

# Serialize the Series using pickle
with open(reference_file_path, 'wb') as f:
    pickle.dump(expected_scores_series, f)

print(f"Reference Series saved to: {reference_file_path}")

Snippet for the assertion:

Variable Name	Value
`df_variable_name`	`student_grades_df`
`column_name`	`Score`
`serialized_series_file_name`	`expected_scores_ex1.pkl`
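The check this configuration performs is roughly equivalent to the sketch below. It assumes the snippet unpickles the reference file and compares it with `pandas.testing.assert_series_equal`; that is a guess at the mechanism, not the snippet's source. A temporary directory stands in for the default location so the sketch runs anywhere.

```python
import os
import pickle
import tempfile

import pandas as pd

# Student's DataFrame from the solution above.
student_grades_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 92, 78],
    "Attempts": [2, 1, 3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for /root/.cache/.local/.trash/expected_scores_ex1.pkl
    reference_path = os.path.join(tmp_dir, "expected_scores_ex1.pkl")
    with open(reference_path, "wb") as f:
        pickle.dump(pd.Series([85, 92, 78], name="Score"), f)

    # Load the reference Series back from disk.
    with open(reference_path, "rb") as f:
        expected_scores = pickle.load(f)

# Raises AssertionError if values, dtype, index, or name differ.
pd.testing.assert_series_equal(student_grades_df["Score"], expected_scores)
```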

2. Column Comparison with CSV Serialization and Custom Location

This example demonstrates how to compare a DataFrame column with a Series stored in a CSV file, located in a non-default directory.

Scenario: The student is cleaning a dataset raw_data_df and you want to verify the 'cleaned_category' column. The reference categories are in a CSV file.

Task: Create a DataFrame named raw_data_df and process it to include a cleaned_category column.

Example raw_data_df (in student's notebook after processing):

  ID     Category  Value cleaned_category
A001  electronics    100      Electronics
B002       cloths     50          Apparel
C003     _books__     20            Books

Placeholder:

raw_data_df = ...

Solution:

import pandas as pd

raw_data_df = pd.DataFrame({
    'ID': ['A001', 'B002', 'C003'],
    'Category': ['electronics', 'cloths', '_books__'],
    'Value': [100, 50, 20]
})

raw_data_df['cleaned_category'] = ['Electronics', 'Apparel', 'Books']

Serialization:

import pandas as pd
import os

# Data for the reference Series
expected_categories_series = pd.Series(["Electronics", "Apparel", "Books"], name='cleaned_category')

# Define a custom location
custom_location = "/test_data/reference_series/"
os.makedirs(custom_location, exist_ok=True) # Ensure the directory exists

# Define the reference file path
reference_file_name = "expected_categories_ex2.csv"
reference_file_path = os.path.join(custom_location, reference_file_name)

# Serialize the Series to CSV
expected_categories_series.to_csv(reference_file_path)

print(f"Reference Series saved to: {reference_file_path}")

Snippet for the assertion:

Variable Name	Value
`df_variable_name`	`raw_data_df`
`column_name`	`cleaned_category`
`serialized_series_file_name`	`expected_categories_ex2.csv`
`location`	`/test_data/reference_series/`
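Because `Series.to_csv` writes the index and a header row by default, the reference file has to be read back with the index column restored. The sketch below shows one way this could work; treating the first column as the index and passing `check_dtype` through `testing_kwargs` are assumptions about how the snippet might be used, not documented behavior. A temporary directory stands in for `/test_data/reference_series/`.

```python
import os
import tempfile

import pandas as pd

# Only the compared column of the student's DataFrame is needed here.
raw_data_df = pd.DataFrame({
    "cleaned_category": ["Electronics", "Apparel", "Books"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    reference_path = os.path.join(tmp_dir, "expected_categories_ex2.csv")
    reference = pd.Series(
        ["Electronics", "Apparel", "Books"], name="cleaned_category"
    )
    reference.to_csv(reference_path)  # writes index and header by default

    # Assumed read step: first column is the index, remaining column is the Series.
    loaded = pd.read_csv(reference_path, index_col=0).squeeze("columns")

# Hypothetical testing_kwargs value that relaxes the dtype check.
testing_kwargs = {"check_dtype": False}
pd.testing.assert_series_equal(
    raw_data_df["cleaned_category"], loaded, **testing_kwargs
)
```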

3. Column Comparison with `read_kwargs` for CSV Series

This example shows how to use read_kwargs when the reference CSV Series has a specific format that requires custom parsing (e.g., skipped rows or a different delimiter).

Scenario: The student calculates daily_averages_df and you need to check its 'temperature' column. The reference Series is in a CSV with comments and a custom delimiter.

Task: Create a DataFrame named daily_averages_df with a 'temperature' column.

Example daily_averages_df (in student's notebook):

        Date  temperature  Humidity
2023-01-01         25.1      60.5
2023-01-02         26.3      62.1
2023-01-03         24.8      58.9

Placeholder:

daily_averages_df = ...

Solution:

import pandas as pd

daily_averages_df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
    'temperature': [25.1, 26.3, 24.8],
    'Humidity': [60.5, 62.1, 58.9]
})

Serialization: The reference CSV file looks like this:

# Daily Temperature Averages Reference
25.1;
26.3;
24.8;

The following code snippet creates the reference file:

import pandas as pd
import os

# Create a custom CSV file for the reference Series
custom_location = "/test_data/"
os.makedirs(custom_location, exist_ok=True)

reference_file_name = "expected_temperatures_ex3.csv"
reference_file_path = os.path.join(custom_location, reference_file_name)

# Content with a comment line and semicolon delimiter
csv_content = """# Daily Temperature Averages Reference
25.1;
26.3;
24.8;
"""

with open(reference_file_path, 'w') as f:
    f.write(csv_content)

print(f"Reference CSV Series saved to: {reference_file_path}")

Snippet for the assertion:

Variable Name	Value
`df_variable_name`	`daily_averages_df`
`column_name`	`temperature`
`serialized_series_file_name`	`expected_temperatures_ex3.csv`
`location`	`/test_data/`
`read_kwargs`	`{'skiprows': 1, 'sep': ';'}`
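To see what these `read_kwargs` do to the reference file, the sketch below forwards them to `pandas.read_csv`. Two points are assumptions and not confirmed by this page: that the snippet reads the file without a header row (`header=None`), and that it takes the first column as the Series. Without `header=None`, plain `read_csv` would treat the `25.1;` line as the header.

```python
import io

import pandas as pd

# Same content as the reference file written above.
csv_content = """# Daily Temperature Averages Reference
25.1;
26.3;
24.8;
"""

read_kwargs = {"skiprows": 1, "sep": ";"}

# Assumption: no header row. The trailing ';' yields an empty second column.
parsed = pd.read_csv(io.StringIO(csv_content), header=None, **read_kwargs)

# Assumption: the first column holds the reference values.
expected_temperatures = parsed.iloc[:, 0].rename("temperature")

daily_averages_df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "temperature": [25.1, 26.3, 24.8],
    "Humidity": [60.5, 62.1, 58.9],
})

pd.testing.assert_series_equal(
    daily_averages_df["temperature"], expected_temperatures
)
```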
