Describe the bug
Whenever I run expectations over my data, my logs get flooded with _get_default_value called with key "table", but it is not a known field (see below). For reference, I am using a pandas in-memory DataFrame (via the pandas data source), which I am passing to a checkpoint with one expectation suite.
To Reproduce
Here is a toy example you can run to reproduce it. Make sure the logging level is set to INFO.
import logging

import great_expectations as gx
import pandas as pd

# -- Set GX constants for artifact creation
NAME_DATA_SOURCE = "pandas"
NAME_DATA_ASSET = "tutorial_data"
NAME_BATCH_DEF = "pandas_tutorial"
NAME_EXPECTATION_SUITE = "pandas_tutorial"
NAME_VALIDATION_DEF = "pandas_validation"
NAME_CHECKPOINT = "pandas"
FILE_CONFIGURE = "data/yellow_tripdata_2021-11.csv"

formatter = logging.Formatter(
    "%(levelname)-8s [ %(asctime)s - %(module)s.%(funcName)s : %(lineno)d ] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
logging.basicConfig(level=logging.INFO)
handler = root_logger.handlers[0]
handler.setFormatter(formatter)

# -- Load data for configuration
df_configure = pd.read_csv(FILE_CONFIGURE)

# -- 1. Initialize GX for configuration & set up in-memory source
context = gx.get_context(mode="file")
data_source = context.data_sources.add_pandas(name=NAME_DATA_SOURCE)
data_asset = data_source.add_dataframe_asset(name=NAME_DATA_ASSET)
batch_definition = data_asset.add_batch_definition_whole_dataframe(NAME_BATCH_DEF)

# -- 2. Configure expectation suite to be called over runtime data later
expectation_suite = gx.ExpectationSuite(name=NAME_EXPECTATION_SUITE)
expectation_suite = context.suites.add(expectation_suite)

# -- 2.1. Define table level expectations
columns = list(df_configure.columns)
expectation = gx.expectations.ExpectTableColumnsToMatchSet(column_set=columns)
expectation_suite.add_expectation(expectation)

# -- 2.2. Define column level expectations
# -- 2.2.1. Ensure vendor ID is either 1 or 2
expected_values = [1, 2]
expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="VendorID",
    value_set=expected_values,
)
expectation_suite.add_expectation(expectation)

# -- 2.2.2. Validate that all columns have non-null values
for column in columns:
    expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column=column)
    expectation_suite.add_expectation(expectation)

# -- 2.2.3. Validate that pickup and dropoff datetimes are in the correct format
date_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
DATE_PATTERN = (
    r"^(?:19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) (?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$"
)
for date_column in date_columns:
    expectation = gx.expectations.ExpectColumnValuesToMatchRegex(
        column=date_column,
        regex=DATE_PATTERN,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.2.4. Validate non-zero columns
numeric_columns = [
    "passenger_count",
    "trip_distance",
    "tip_amount",
]
for numeric_column in numeric_columns:
    expectation = gx.expectations.ExpectColumnValuesToBeBetween(
        column=numeric_column,
        min_value=0,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.3. Evaluate results on test dataset
batch_parameters = {"dataframe": df_configure}
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
validation_results = batch.validate(expectation_suite)

# -- 3. Bundle suite and batch into validation definition and checkpoint w/ bundled
# -- actions for easy execution later
validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=expectation_suite,
    name=NAME_VALIDATION_DEF,
)
_ = context.validation_definitions.add(validation_definition)

action_list = [
    gx.checkpoint.UpdateDataDocsAction(
        name="update_all_data_docs",
    ),
]
checkpoint = gx.Checkpoint(
    name=NAME_CHECKPOINT,
    validation_definitions=[validation_definition],
    actions=action_list,
    result_format={
        "result_format": "COMPLETE",
    },
)
_ = context.checkpoints.add(checkpoint)

# -- 4. Run checkpoint to validate if everything works properly
file_identifier = FILE_CONFIGURE.split("/")[-1]
runid = gx.RunIdentifier(run_name=f"Configuration run - {file_identifier}")
results = checkpoint.run(batch_parameters=batch_parameters, run_id=runid)
Apart from the code that adjusts the root logger, this is exactly the same code as in https://github.com/jschra/joriktech/tree/main/data_testing_gx_1, so you can use that repository for full reproducibility (by adding the logger snippet); the data is stored there as well.
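As a side note, the DATE_PATTERN regex in the script can be exercised standalone (a quick sanity check, just to rule the regex itself out as a factor; the sample timestamps below are illustrative, not from the dataset):

```python
import re

# Same pattern as in the repro script, split across two raw strings
# for readability; the concatenation is byte-identical.
DATE_PATTERN = (
    r"^(?:19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) "
    r"(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$"
)

# Valid timestamps match; out-of-range fields are rejected.
assert re.match(DATE_PATTERN, "2021-11-01 13:50:08")
assert re.match(DATE_PATTERN, "1999-12-31 23:59:59")
assert not re.match(DATE_PATTERN, "2021-13-01 13:50:08")  # month 13 rejected
assert not re.match(DATE_PATTERN, "2021-11-01 24:00:00")  # hour 24 rejected
```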
Logs you'll get when you run it:
Expected behavior
I would expect my log stream (set to INFO) not to be flooded over and over with the same message:
INFO [ 2025-01-29 13:50:08 - expectation._get_default_value : 1161 ] _get_default_value called with key "table", but it is not a known field
Environment (please complete the following information):
Operating System: MacOS
Great Expectations Version: 1.3.3
Data Source: pandas
Cloud environment: Not relevant
Additional context
None needed, I think.
Thanks for reporting this and including the steps for reproduction! This seems to have also been previously reported by another user. I will be sure to share this with the team and follow up with you.