Dataframe Variable Matches Serialized Dataframe

Verifies whether a student's DataFrame matches a reference serialized DataFrame (pickle or CSV).

Location of the snippet: python/pandas/dataframes/dataframe_variable_matches_serialized_dataframe

This snippet accepts a wide range of variations, from partial checking of a dataframe (that is, just a few columns) to multiple serialization formats (CSV, Pickle, etc).

The default location to serialize the data is always /root/.cache/.local/.trash/ which you should REMOVE before your students have access to the project. It can optionally be overwritten with the parameter location.

Whenever we leave the location parameter blank, the assertion will look for the file in the default location /root/.cache/.local/.trash/.

Device Type
Jupyter

Variables:

Variable Name	Variable Description	Type	Required?
`df_variable_name`	Name of the student's DataFrame variable.	`str`	Yes
`serialized_dataframe_file_name`	File name of the serialized reference DataFrame.	`str`	Yes
`location`	Directory where the serialized file is stored.	`str`	No
`df_subset`	Python list of column names to compare. If left blank, the entire DataFrame is compared.	`list`	No
`read_kwargs`	Optional dictionary of additional keyword arguments for reading the file.	`dict`	No
`testing_kwargs`	Optional dictionary of additional keyword arguments for testing the DataFrame.	`dict`	No
`serialization_method`	Explicit method for serialization. If blank, it is inferred from the file extension. Choices: pickle, csv.	`str`	No

Examples:

1. Basic DataFrame Comparison (CSV)

This example demonstrates a fundamental comparison of a student's DataFrame variable against a reference CSV file.

Task: Create a DataFrame named sales_df with the following product sales data:

  Product  Quantity  Price
 Apple        10   1.50
Banana        20   0.75
Orange        15   1.25

Placeholder:

sales_df = ...

Solution:

import pandas as pd
sales_df = pd.DataFrame({
    "Product": ["Apple", "Banana", "Orange"],
    "Quantity": [10, 20, 15],
    "Price": [1.50, 0.75, 1.25]
})

Serialization:

import pandas as pd

# Data for the reference DataFrame
data = {
    "Product": ["Apple", "Banana", "Orange"],
    "Quantity": [10, 20, 15],
    "Price": [1.50, 0.75, 1.25]
}
sales_df_ref = pd.DataFrame(data)

# Serialization step (creating the reference file)
# This uses the default location: /root/.cache/.local/.trash/
# The serialization method is explicitly set to 'csv'.
reference_file_path = "/root/.cache/.local/.trash/sales_data_ex1.csv"
sales_df_ref.to_csv(reference_file_path)

# Student's 'sales_df' variable will be compared against 'sales_data_ex1.csv'

Snippet for the assertion:

Variable Name	Value
`df_variable_name`	`sales_df`
`serialized_dataframe_file_name`	`sales_data_ex1.csv`

2. DataFrame Comparison with Pickle Serialization and Custom Location

This example shows how to use Pickle for serialization and store the reference file in a custom directory.

Task: Create a DataFrame named user_data_df representing user profiles:

   UserID     Name  Age
   101    Alice   30
   102      Bob   24
   103  Charlie   35

Placeholder:

user_data_df = ...

Solution:

import pandas as pd

user_data_df = pd.DataFrame({
    "UserID": [101, 102, 103],
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [30, 24, 35]
})

Serialization:

import pandas as pd

# Data for the reference DataFrame
data = {
    "UserID": [101, 102, 103],
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [30, 24, 35]
}
user_data_df_ref = pd.DataFrame(data)

# Serialization step (creating the reference file)
# Using a custom location and pickle serialization.

reference_file_path = "/data/custom_data_store/user_profiles_ex2.pkl"

user_data_df_ref.to_pickle(reference_file_path)

# Student's 'user_data_df' variable will be compared against 'user_profiles_ex2.pkl' from the custom location.

Snippet for the assertion:

Variable Name	Value
`df_variable_name`	`user_data_df`
`serialized_dataframe_file_name`	`user_profiles_ex2.pkl`
`location`	`/data/custom_data_store/`

3. Comparing a Subset of DataFrame Columns

This example demonstrates how to configure the checker to compare only a specific subset of columns from the student's DataFrame against the reference.

Task: Create a DataFrame named inventory_df with detailed inventory data. However, for verification, only the Item and Stock columns will be checked.

Reference DataFrame Structure (full data):

     Item Category  Stock  PurchasePrice  SalePrice
Laptop      Elec     12         800.00    1200.00
Mouse       Elec     75          10.00      25.00
Keyboard    Elec     30          20.00      45.00
Desk     Furniture   5          150.00     250.00

Placeholder:

inventory_df = ... # Should contain all columns: Item, Category, Stock, PurchasePrice, SalePrice

Solution:

import pandas as pd

data = {
    "Item": ["Laptop", "Mouse", "Keyboard", "Desk"],
    "Category": ["Elec", "Elec", "Elec", "Furniture"],
    "Stock": [12, 75, 30, 5],
    "PurchasePrice": [800.00, 10.00, 20.00, 150.00],
    "SalePrice": [1200.00, 25.00, 45.00, 250.00]
}
inventory_df = pd.DataFrame(data)

Serialization:

import pandas as pd

# Data for the full reference DataFrame
data = {
    "Item": ["Laptop", "Mouse", "Keyboard", "Desk"],
    "Category": ["Elec", "Elec", "Elec", "Furniture"],
    "Stock": [12, 75, 30, 5],
    "PurchasePrice": [800.00, 10.00, 20.00, 150.00],
    "SalePrice": [1200.00, 25.00, 45.00, 250.00]
}
inventory_df_ref = pd.DataFrame(data)

# Serialization step (creating the reference file with all columns)
# Using default location and CSV format.
reference_file_path = "/root/.cache/.local/.trash/full_inventory_ex3.csv"
inventory_df_ref.to_csv(reference_file_path, index=False)

# Student's 'inventory_df' variable will be compared, but only for columns 'Item' and 'Stock'.

Snippet for the assertion:

Variable Name	Value
`df_variable_name`	`inventory_df`
`serialized_dataframe_file_name`	`full_inventory_ex3.csv`
`df_subset`	`['Item', 'Stock']`

4. Using `read_kwargs` for Reading Specially Formatted CSV

This example shows how to use read_kwargs when the reference CSV file has a special format (e.g., custom delimiter, comments to skip).

Task: Create a DataFrame named sensor_log_df. The reference data is stored in a CSV file (sensor_readings_ex4.csv) that uses a semicolon (;) delimiter and has a comment line at the beginning.

Placeholder:

sensor_log_df = ...

Solution:

sensor_log_df = pd.DataFrame({
    "Timestamp": ["2023-10-01T10:00:00", "2023-10-01T10:00:05", "2023-10-01T10:00:10"],
    "SensorID": ["S1", "S2", "S1"],
    "Value": [23.5, 1012.1, 23.6]
})

                Timestamp	SensorID	Value
2023-10-01T10:00:00	          S1	23.5
2023-10-01T10:00:05	          S2	1012.1
2023-10-01T10:00:10	          S1	23.6

Reference File (sensor_readings_ex4.csv) content:

# Sensor Data Log
Timestamp;SensorID;Value
2023-10-01T10:00:00;S1;23.5
2023-10-01T10:00:05;S2;1012.1
2023-10-01T10:00:10;S1;23.6

Snippet for the assertion: The read_kwargs are used by the checker to correctly parse sensor_readings_ex4.csv.

Variable Name	Value
`df_variable_name`	`sensor_log_df`
`serialized_dataframe_file_name`	`sensor_readings_ex4.csv`
`read_kwargs`	`{'delimiter': ';', 'skiprows': 1}`

5. Using `testing_kwargs` for Flexible DataFrame Comparison

This example demonstrates the use of testing_kwargs to allow for minor differences during comparison, such as floating-point tolerances or ignoring column/row order.

Task: Create a DataFrame named calculation_results_df with floating-point numbers. The comparison should tolerate minor differences in float values (relative tolerance of 0.001) and should not be strict about the data types (e.g., int vs float for whole numbers).

Placeholder:

calculation_results_df = ...

Solution:

calculation_results_df = pd.DataFrame({
    'ID': [1, 2, 3],
    'ValueA': [1/3, 2/3, 1.0], # Results in floating point inaccuracies
    'ValueB': [1.0, 2.0, 3.0001] # Slight difference
})

	ID	  ValueA	ValueB
1	0.333333	1.0000
2	0.666667	2.0000
3	1.000000	3.0001

Reference DataFrame structure (saved as calculation_ref_ex5.csv):

   ID      ValueA      ValueB
 1    0.333000    1.000000
 2    0.667000    2.000000
 3    1.000000    3.000000

(Student's DataFrame might have ValueA as 0.333333, ValueB as 1.0, etc.)

Snippet for the assertion: The testing_kwargs are passed to the underlying pandas testing utility (e.g., pd.testing.assert_frame_equal).

Variable Name	Value
`df_variable_name`	`calculation_results_df`
`serialized_dataframe_file_name`	`calculation_ref_ex5.csv`
`testing_kwargs`	`{'check_dtype': False, 'rtol': 0.001}`

Variables:​

Examples:​

1. Basic DataFrame Comparison (CSV)​

2. DataFrame Comparison with Pickle Serialization and Custom Location​

3. Comparing a Subset of DataFrame Columns​

4. Using read_kwargs for Reading Specially Formatted CSV​

5. Using testing_kwargs for Flexible DataFrame Comparison​

Variables:

Examples:

1. Basic DataFrame Comparison (CSV)

2. DataFrame Comparison with Pickle Serialization and Custom Location

3. Comparing a Subset of DataFrame Columns

4. Using `read_kwargs` for Reading Specially Formatted CSV

5. Using `testing_kwargs` for Flexible DataFrame Comparison