Dataframe Variable Matches Serialized Dataframe
Verifies whether a student's DataFrame matches a reference serialized DataFrame (pickle or CSV).
Location of the snippet: python/pandas/dataframes/dataframe_variable_matches_serialized_dataframe
This snippet accepts a wide range of variations, from partial checking of a dataframe (that is, just a few columns) to multiple serialization formats (CSV, Pickle, etc).
The default location to serialize the data is always /root/.cache/.local/.trash/ which you should REMOVE before your students have access to the project. It can optionally be overwritten with the parameter location.
Whenever we leave the location parameter blank, the assertion will look for the file in the default location /root/.cache/.local/.trash/.
| Device Type |
|---|
| Jupyter |
Variables:
| Variable Name | Variable Description | Type | Required? | Default |
|---|---|---|---|---|
df_variable_name | Name of the student's DataFrame variable. | str | Yes | |
serialized_dataframe_file_name | File name of the serialized reference DataFrame. | str | Yes | |
location | Directory where the serialized file is stored. | str | No | |
df_subset | Python list of column names to compare. If left blank, the entire DataFrame is compared. | list | No | |
read_kwargs | Optional dictionary of additional keyword arguments for reading the file. | dict | No | |
testing_kwargs | Optional dictionary of additional keyword arguments for testing the DataFrame. | dict | No | |
serialization_method | Explicit method for serialization. If blank, it is inferred from the file extension. Choices: pickle, csv. | str | No |
Examples:
1. Basic DataFrame Comparison (CSV)
This example demonstrates a fundamental comparison of a student's DataFrame variable against a reference CSV file.
Task:
Create a DataFrame named sales_df with the following product sales data:
Product Quantity Price
0 Apple 10 1.50
1 Banana 20 0.75
2 Orange 15 1.25
Placeholder:
sales_df = ...
Solution:
import pandas as pd
sales_df = pd.DataFrame({
"Product": ["Apple", "Banana", "Orange"],
"Quantity": [10, 20, 15],
"Price": [1.50, 0.75, 1.25]
})
Serialization:
import pandas as pd
# Data for the reference DataFrame
data = {
"Product": ["Apple", "Banana", "Orange"],
"Quantity": [10, 20, 15],
"Price": [1.50, 0.75, 1.25]
}
sales_df_ref = pd.DataFrame(data)
# Serialization step (creating the reference file)
# This uses the default location: /root/.cache/.local/.trash/
# The serialization method is explicitly set to 'csv'.
reference_file_path = "/root/.cache/.local/.trash/sales_data_ex1.csv"
sales_df_ref.to_csv(reference_file_path)
# Student's 'sales_df' variable will be compared against 'sales_data_ex1.csv'
Snippet for the assertion:
| Variable Name | Value |
|---|---|
df_variable_name | sales_df |
serialized_dataframe_file_name | sales_data_ex1.csv |
2. DataFrame Comparison with Pickle Serialization and Custom Location
This example shows how to use Pickle for serialization and store the reference file in a custom directory.
Task:
Create a DataFrame named user_data_df representing user profiles:
UserID Name Age
0 101 Alice 30
1 102 Bob 24
2 103 Charlie 35
Placeholder:
user_data_df = ...
Solution:
import pandas as pd
user_data_df = pd.DataFrame({
"UserID": [101, 102, 103],
"Name": ["Alice", "Bob", "Charlie"],
"Age": [30, 24, 35]
})
Serialization:
import pandas as pd
# Data for the reference DataFrame
data = {
"UserID": [101, 102, 103],
"Name": ["Alice", "Bob", "Charlie"],
"Age": [30, 24, 35]
}
user_data_df_ref = pd.DataFrame(data)
# Serialization step (creating the reference file)
# Using a custom location and pickle serialization.
reference_file_path = "/data/custom_data_store/user_profiles_ex2.pkl"
user_data_df_ref.to_pickle(reference_file_path)
# Student's 'user_data_df' variable will be compared against 'user_profiles_ex2.pkl' from the custom location.
Snippet for the assertion:
| Variable Name | Value |
|---|---|
df_variable_name | user_data_df |
serialized_dataframe_file_name | user_profiles_ex2.pkl |
location | /data/custom_data_store/ |
3. Comparing a Subset of DataFrame Columns
This example demonstrates how to configure the checker to compare only a specific subset of columns from the student's DataFrame against the reference.
Task:
Create a DataFrame named inventory_df with detailed inventory data. However, for verification, only the Item and Stock columns will be checked.
Reference DataFrame Structure (full data):
Item Category Stock PurchasePrice SalePrice
0 Laptop Elec 12 800.00 1200.00
1 Mouse Elec 75 10.00 25.00
2 Keyboard Elec 30 20.00 45.00
3 Desk Furniture 5 150.00 250.00
Placeholder:
inventory_df = ... # Should contain all columns: Item, Category, Stock, PurchasePrice, SalePrice
Solution:
import pandas as pd
data = {
"Item": ["Laptop", "Mouse", "Keyboard", "Desk"],
"Category": ["Elec", "Elec", "Elec", "Furniture"],
"Stock": [12, 75, 30, 5],
"PurchasePrice": [800.00, 10.00, 20.00, 150.00],
"SalePrice": [1200.00, 25.00, 45.00, 250.00]
}
inventory_df = pd.DataFrame(data)
Serialization:
import pandas as pd
# Data for the full reference DataFrame
data = {
"Item": ["Laptop", "Mouse", "Keyboard", "Desk"],
"Category": ["Elec", "Elec", "Elec", "Furniture"],
"Stock": [12, 75, 30, 5],
"PurchasePrice": [800.00, 10.00, 20.00, 150.00],
"SalePrice": [1200.00, 25.00, 45.00, 250.00]
}
inventory_df_ref = pd.DataFrame(data)
# Serialization step (creating the reference file with all columns)
# Using default location and CSV format.
reference_file_path = "/root/.cache/.local/.trash/full_inventory_ex3.csv"
inventory_df_ref.to_csv(reference_file_path, index=False)
# Student's 'inventory_df' variable will be compared, but only for columns 'Item' and 'Stock'.
Snippet for the assertion:
| Variable Name | Value |
|---|---|
df_variable_name | inventory_df |
serialized_dataframe_file_name | full_inventory_ex3.csv |
df_subset | ['Item', 'Stock'] |
4. Using read_kwargs for Reading Specially Formatted CSV
This example shows how to use read_kwargs when the reference CSV file has a special format (e.g., custom delimiter, comments to skip).
Task:
Create a DataFrame named sensor_log_df. The reference data is stored in a CSV file (sensor_readings_ex4.csv) that uses a semicolon (;) delimiter and has a comment line at the beginning.
Placeholder:
sensor_log_df = ...
Solution:
sensor_log_df = pd.DataFrame({
"Timestamp": ["2023-10-01T10:00:00", "2023-10-01T10:00:05", "2023-10-01T10:00:10"],
"SensorID": ["S1", "S2", "S1"],
"Value": [23.5, 1012.1, 23.6]
})
Timestamp SensorID Value
0 2023-10-01T10:00:00 S1 23.5
1 2023-10-01T10:00:05 S2 1012.1
2 2023-10-01T10:00:10 S1 23.6
Reference File (sensor_readings_ex4.csv) content:
# Sensor Data Log
Timestamp;SensorID;Value
2023-10-01T10:00:00;S1;23.5
2023-10-01T10:00:05;S2;1012.1
2023-10-01T10:00:10;S1;23.6
Snippet for the assertion:
The read_kwargs are used by the checker to correctly parse sensor_readings_ex4.csv.
| Variable Name | Value |
|---|---|
df_variable_name | sensor_log_df |
serialized_dataframe_file_name | sensor_readings_ex4.csv |
read_kwargs | {'delimiter': ';', 'skiprows': 1} |
5. Using testing_kwargs for Flexible DataFrame Comparison
This example demonstrates the use of testing_kwargs to allow for minor differences during comparison, such as floating-point tolerances or ignoring column/row order.
Task:
Create a DataFrame named calculation_results_df with floating-point numbers. The comparison should tolerate minor differences in float values (relative tolerance of 0.001) and should not be strict about the data types (e.g., int vs float for whole numbers).
Placeholder:
calculation_results_df = ...
Solution:
calculation_results_df = pd.DataFrame({
'ID': [1, 2, 3],
'ValueA': [1/3, 2/3, 1.0], # Results in floating point inaccuracies
'ValueB': [1.0, 2.0, 3.0001] # Slight difference
})
ID ValueA ValueB
0 1 0.333333 1.0000
1 2 0.666667 2.0000
2 3 1.000000 3.0001
Reference DataFrame structure (saved as calculation_ref_ex5.csv):
ID ValueA ValueB
0 1 0.333000 1.000000
1 2 0.667000 2.000000
2 3 1.000000 3.000000
(Student's DataFrame might have ValueA as 0.333333, ValueB as 1.0, etc.)
Snippet for the assertion:
The testing_kwargs are passed to the underlying pandas testing utility (e.g., pd.testing.assert_frame_equal).
| Variable Name | Value |
|---|---|
df_variable_name | calculation_results_df |
serialized_dataframe_file_name | calculation_ref_ex5.csv |
testing_kwargs | {'check_dtype': False, 'rtol': 0.001} |