[ISSUE] Logstream is flooded with _get_default_value called with key "table", but it is not a known field messages #10897

Open
jschra opened this issue Jan 29, 2025 · 1 comment

Comments


jschra commented Jan 29, 2025

Describe the bug
Whenever I run expectations over my data, my logs get flooded with _get_default_value called with key "table", but it is not a known field messages (see below). For reference, I am using an in-memory pandas DataFrame (via the pandas data source), which I pass to a checkpoint with one expectation suite.

[Screenshot: log stream repeatedly showing the INFO message _get_default_value called with key "table", but it is not a known field]

To Reproduce
Below is a toy example you can run to reproduce the issue. Note that the logging level needs to be set to INFO.

import logging

import great_expectations as gx
import pandas as pd

# -- Set GX constants for artifact creation
NAME_DATA_SOURCE = "pandas"
NAME_DATA_ASSET = "tutorial_data"
NAME_BATCH_DEF = "pandas_tutorial"
NAME_EXPECTATION_SUITE = "pandas_tutorial"
NAME_VALIDATION_DEF = "pandas_validation"
NAME_CHECKPOINT = "pandas"

FILE_CONFIGURE = "data/yellow_tripdata_2021-11.csv"

# -- Configure root logger to emit INFO-level records with a custom format
formatter = logging.Formatter(
    "%(levelname)-8s [ %(asctime)s - %(module)s.%(funcName)s : %(lineno)d ] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logging.basicConfig(level=logging.INFO)
root_logger = logging.getLogger()
root_logger.handlers[0].setFormatter(formatter)


# -- Load data for configuration
df_configure = pd.read_csv(FILE_CONFIGURE)

# -- 1. Initialize GX for configuration & set up in-memory source
context = gx.get_context(mode="file")

data_source = context.data_sources.add_pandas(name=NAME_DATA_SOURCE)
data_asset = data_source.add_dataframe_asset(name=NAME_DATA_ASSET)
batch_definition = data_asset.add_batch_definition_whole_dataframe(NAME_BATCH_DEF)

# -- 2. Configure expectation suite to be called over runtime data later
expectation_suite = gx.ExpectationSuite(name=NAME_EXPECTATION_SUITE)
expectation_suite = context.suites.add(expectation_suite)

# -- 2.1. Define table level expectations
columns = list(df_configure.columns)
expectation = gx.expectations.ExpectTableColumnsToMatchSet(column_set=columns)
expectation_suite.add_expectation(expectation)

# -- 2.2. Define column level expectations
# -- 2.2.1. Ensure vendor ID is either 1 or 2
expected_values = [1, 2]
expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="VendorID",
    value_set=expected_values,
)
expectation_suite.add_expectation(expectation)

# -- 2.2.2. Validate that all columns have non-null values
for column in columns:
    expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column=column)
    expectation_suite.add_expectation(expectation)

# -- 2.2.3. Validate that pickup and dropoff datetimes are in the correct format
date_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
DATE_PATTERN = (
    r"^(?:19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) (?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$"
)
for date_column in date_columns:
    expectation = gx.expectations.ExpectColumnValuesToMatchRegex(
        column=date_column,
        regex=DATE_PATTERN,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.2.4. Validate non-zero columns
numeric_columns = [
    "passenger_count",
    "trip_distance",
    "tip_amount",
]
for numeric_column in numeric_columns:
    expectation = gx.expectations.ExpectColumnValuesToBeBetween(
        column=numeric_column,
        min_value=0,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.3. Evaluate results on test dataset
batch_parameters = {"dataframe": df_configure}
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
validation_results = batch.validate(expectation_suite)

# -- 3. Bundle suite and batch into validation definition and checkpoint w/ bundled
# --    actions for easy execution later
validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=expectation_suite,
    name=NAME_VALIDATION_DEF,
)
_ = context.validation_definitions.add(validation_definition)

action_list = [
    gx.checkpoint.UpdateDataDocsAction(
        name="update_all_data_docs",
    ),
]
checkpoint = gx.Checkpoint(
    name=NAME_CHECKPOINT,
    validation_definitions=[validation_definition],
    actions=action_list,
    result_format={
        "result_format": "COMPLETE",
    },
)
_ = context.checkpoints.add(checkpoint)

# -- 4. Run checkpoint to validate if everything works properly
file_identifier = FILE_CONFIGURE.split("/")[-1]
runid = gx.RunIdentifier(run_name=f"Configuration run - {file_identifier}")
results = checkpoint.run(batch_parameters=batch_parameters, run_id=runid)

Apart from the code that adjusts the root logger, this is the exact same code you can find at https://github.com/jschra/joriktech/tree/main/data_testing_gx_1, so you can use that repository for full reproducibility (by adding the logger snippet); the data is stored there as well.

The logs you get when you run it:

[Screenshot: the same INFO message repeated many times in the log output]

Expected behavior
I would expect my logstream (set to INFO) not to be flooded over and over again with the same message:

INFO [ 2025-01-29 13:50:08 - expectation._get_default_value : 1161 ] _get_default_value called with key "table", but it is not a known field
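
For now, a possible workaround on my end (a sketch, not an official GX API; the logger name below is a guess based on the "expectation._get_default_value" module and function shown in the log record) is to raise the level of just that one logger so the rest of the INFO stream stays intact:

import logging

# Workaround sketch: silence only the logger that appears to emit the repeated
# message. The logger name is an assumption inferred from the module shown in
# the log output; adjust it if your log records point elsewhere.
logging.getLogger("great_expectations.expectations.expectation").setLevel(logging.WARNING)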

Environment (please complete the following information):

  • Operating System: macOS
  • Great Expectations Version: 1.3.3
  • Data Source: pandas
  • Cloud environment: Not relevant

Additional context
None needed, I think.

@adeola-ak (Contributor) commented

Thanks for reporting this and including the steps for reproduction! This seems to have also been previously reported by another user. I will be sure to share this with the team and follow up with you.
