Skip to main content

Dataframe Variable Matches Serialized Dataframe

Verifies whether a student's DataFrame matches a reference serialized DataFrame (pickle or CSV).

Location of the snippet: python/pandas/dataframes/dataframe_variable_matches_serialized_dataframe

This snippet accepts a wide range of variations, from partial checking of a dataframe (that is, just a few columns) to multiple serialization formats (CSV, Pickle, etc).

The default location to serialize the data is always /root/.cache/.local/.trash/ which you should REMOVE before your students have access to the project. It can optionally be overwritten with the parameter location.

Whenever we leave the location parameter blank, the assertion will look for the file in the default location /root/.cache/.local/.trash/.

Device Type
Jupyter

Variables:

Variable NameVariable DescriptionTypeRequired?Default
df_variable_nameName of the student's DataFrame variable.strYes
serialized_dataframe_file_nameFile name of the serialized reference DataFrame.strYes
locationDirectory where the serialized file is stored.strNo
df_subsetPython list of column names to compare. If left blank, the entire DataFrame is compared.listNo
read_kwargsOptional dictionary of additional keyword arguments for reading the file.dictNo
testing_kwargsOptional dictionary of additional keyword arguments for testing the DataFrame.dictNo
serialization_methodExplicit method for serialization. If blank, it is inferred from the file extension. Choices: pickle, csv.strNo

Examples:

1. Basic DataFrame Comparison (CSV)

This example demonstrates a fundamental comparison of a student's DataFrame variable against a reference CSV file.

Task: Create a DataFrame named sales_df with the following product sales data:

  Product  Quantity  Price
0 Apple 10 1.50
1 Banana 20 0.75
2 Orange 15 1.25

Placeholder:

sales_df = ...

Solution:

import pandas as pd
sales_df = pd.DataFrame({
"Product": ["Apple", "Banana", "Orange"],
"Quantity": [10, 20, 15],
"Price": [1.50, 0.75, 1.25]
})

Serialization:

import pandas as pd

# Data for the reference DataFrame
data = {
"Product": ["Apple", "Banana", "Orange"],
"Quantity": [10, 20, 15],
"Price": [1.50, 0.75, 1.25]
}
sales_df_ref = pd.DataFrame(data)

# Serialization step (creating the reference file)
# This uses the default location: /root/.cache/.local/.trash/
# The serialization method is explicitly set to 'csv'.
reference_file_path = "/root/.cache/.local/.trash/sales_data_ex1.csv"
sales_df_ref.to_csv(reference_file_path)

# Student's 'sales_df' variable will be compared against 'sales_data_ex1.csv'

Snippet for the assertion:

Variable NameValue
df_variable_namesales_df
serialized_dataframe_file_namesales_data_ex1.csv

2. DataFrame Comparison with Pickle Serialization and Custom Location

This example shows how to use Pickle for serialization and store the reference file in a custom directory.

Task: Create a DataFrame named user_data_df representing user profiles:

   UserID     Name  Age
0 101 Alice 30
1 102 Bob 24
2 103 Charlie 35

Placeholder:

user_data_df = ...

Solution:

import pandas as pd

user_data_df = pd.DataFrame({
"UserID": [101, 102, 103],
"Name": ["Alice", "Bob", "Charlie"],
"Age": [30, 24, 35]
})

Serialization:

import pandas as pd

# Data for the reference DataFrame
data = {
"UserID": [101, 102, 103],
"Name": ["Alice", "Bob", "Charlie"],
"Age": [30, 24, 35]
}
user_data_df_ref = pd.DataFrame(data)

# Serialization step (creating the reference file)
# Using a custom location and pickle serialization.

reference_file_path = "/data/custom_data_store/user_profiles_ex2.pkl"

user_data_df_ref.to_pickle(reference_file_path)

# Student's 'user_data_df' variable will be compared against 'user_profiles_ex2.pkl' from the custom location.

Snippet for the assertion:

Variable NameValue
df_variable_nameuser_data_df
serialized_dataframe_file_nameuser_profiles_ex2.pkl
location/data/custom_data_store/

3. Comparing a Subset of DataFrame Columns

This example demonstrates how to configure the checker to compare only a specific subset of columns from the student's DataFrame against the reference.

Task: Create a DataFrame named inventory_df with detailed inventory data. However, for verification, only the Item and Stock columns will be checked.

Reference DataFrame Structure (full data):

     Item Category  Stock  PurchasePrice  SalePrice
0 Laptop Elec 12 800.00 1200.00
1 Mouse Elec 75 10.00 25.00
2 Keyboard Elec 30 20.00 45.00
3 Desk Furniture 5 150.00 250.00

Placeholder:

inventory_df = ... # Should contain all columns: Item, Category, Stock, PurchasePrice, SalePrice

Solution:

import pandas as pd

data = {
"Item": ["Laptop", "Mouse", "Keyboard", "Desk"],
"Category": ["Elec", "Elec", "Elec", "Furniture"],
"Stock": [12, 75, 30, 5],
"PurchasePrice": [800.00, 10.00, 20.00, 150.00],
"SalePrice": [1200.00, 25.00, 45.00, 250.00]
}
inventory_df = pd.DataFrame(data)

Serialization:

import pandas as pd

# Data for the full reference DataFrame
data = {
"Item": ["Laptop", "Mouse", "Keyboard", "Desk"],
"Category": ["Elec", "Elec", "Elec", "Furniture"],
"Stock": [12, 75, 30, 5],
"PurchasePrice": [800.00, 10.00, 20.00, 150.00],
"SalePrice": [1200.00, 25.00, 45.00, 250.00]
}
inventory_df_ref = pd.DataFrame(data)

# Serialization step (creating the reference file with all columns)
# Using default location and CSV format.
reference_file_path = "/root/.cache/.local/.trash/full_inventory_ex3.csv"
inventory_df_ref.to_csv(reference_file_path, index=False)

# Student's 'inventory_df' variable will be compared, but only for columns 'Item' and 'Stock'.

Snippet for the assertion:

Variable NameValue
df_variable_nameinventory_df
serialized_dataframe_file_namefull_inventory_ex3.csv
df_subset['Item', 'Stock']

4. Using read_kwargs for Reading Specially Formatted CSV

This example shows how to use read_kwargs when the reference CSV file has a special format (e.g., custom delimiter, comments to skip).

Task: Create a DataFrame named sensor_log_df. The reference data is stored in a CSV file (sensor_readings_ex4.csv) that uses a semicolon (;) delimiter and has a comment line at the beginning.

Placeholder:

sensor_log_df = ...

Solution:

sensor_log_df = pd.DataFrame({
"Timestamp": ["2023-10-01T10:00:00", "2023-10-01T10:00:05", "2023-10-01T10:00:10"],
"SensorID": ["S1", "S2", "S1"],
"Value": [23.5, 1012.1, 23.6]
})
                Timestamp	SensorID	Value
0 2023-10-01T10:00:00 S1 23.5
1 2023-10-01T10:00:05 S2 1012.1
2 2023-10-01T10:00:10 S1 23.6

Reference File (sensor_readings_ex4.csv) content:

# Sensor Data Log
Timestamp;SensorID;Value
2023-10-01T10:00:00;S1;23.5
2023-10-01T10:00:05;S2;1012.1
2023-10-01T10:00:10;S1;23.6

Snippet for the assertion: The read_kwargs are used by the checker to correctly parse sensor_readings_ex4.csv.

Variable NameValue
df_variable_namesensor_log_df
serialized_dataframe_file_namesensor_readings_ex4.csv
read_kwargs{'delimiter': ';', 'skiprows': 1}

5. Using testing_kwargs for Flexible DataFrame Comparison

This example demonstrates the use of testing_kwargs to allow for minor differences during comparison, such as floating-point tolerances or ignoring column/row order.

Task: Create a DataFrame named calculation_results_df with floating-point numbers. The comparison should tolerate minor differences in float values (relative tolerance of 0.001) and should not be strict about the data types (e.g., int vs float for whole numbers).

Placeholder:

calculation_results_df = ...

Solution:

calculation_results_df = pd.DataFrame({
'ID': [1, 2, 3],
'ValueA': [1/3, 2/3, 1.0], # Results in floating point inaccuracies
'ValueB': [1.0, 2.0, 3.0001] # Slight difference
})
	ID	  ValueA	ValueB
0 1 0.333333 1.0000
1 2 0.666667 2.0000
2 3 1.000000 3.0001

Reference DataFrame structure (saved as calculation_ref_ex5.csv):

   ID      ValueA      ValueB
0 1 0.333000 1.000000
1 2 0.667000 2.000000
2 3 1.000000 3.000000

(Student's DataFrame might have ValueA as 0.333333, ValueB as 1.0, etc.)

Snippet for the assertion: The testing_kwargs are passed to the underlying pandas testing utility (e.g., pd.testing.assert_frame_equal).

Variable NameValue
df_variable_namecalculation_results_df
serialized_dataframe_file_namecalculation_ref_ex5.csv
testing_kwargs{'check_dtype': False, 'rtol': 0.001}