[ISSUE] Logstream is flooded with _get_default_value called with key "table", but it is not a known field messages #10897

Open
jschra opened this issue Jan 29, 2025 · 1 comment

Comments


jschra commented Jan 29, 2025

Describe the bug
Whenever I run expectations over my data, my logs get flooded with _get_default_value called with key "table", but it is not a known field messages (see below). For reference, I am using an in-memory pandas DataFrame (via the pandas data source), which I pass to a checkpoint with one expectation suite.

[Screenshot: log stream repeatedly showing the INFO message _get_default_value called with key "table", but it is not a known field]

To Reproduce
Below is a toy example you can run to reproduce the issue. Note that the logging level needs to be set to INFO.

import logging

import great_expectations as gx
import pandas as pd

# -- Set GX constants for artifact creation
NAME_DATA_SOURCE = "pandas"
NAME_DATA_ASSET = "tutorial_data"
NAME_BATCH_DEF = "pandas_tutorial"
NAME_EXPECTATION_SUITE = "pandas_tutorial"
NAME_VALIDATION_DEF = "pandas_validation"
NAME_CHECKPOINT = "pandas"

FILE_CONFIGURE = "data/yellow_tripdata_2021-11.csv"

# -- Configure root logger to emit INFO-level records with a custom format
formatter = logging.Formatter(
    "%(levelname)-8s [ %(asctime)s - %(module)s.%(funcName)s : %(lineno)d ] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logging.basicConfig(level=logging.INFO)
root_logger = logging.getLogger()
root_logger.handlers[0].setFormatter(formatter)


# -- Load data for configuration
df_configure = pd.read_csv(FILE_CONFIGURE)

# -- 1. Initialize GX for configuration & set up in-memory source
context = gx.get_context(mode="file")

data_source = context.data_sources.add_pandas(name=NAME_DATA_SOURCE)
data_asset = data_source.add_dataframe_asset(name=NAME_DATA_ASSET)
batch_definition = data_asset.add_batch_definition_whole_dataframe(NAME_BATCH_DEF)

# -- 2. Configure expectation suite to be called over runtime data later
expectation_suite = gx.ExpectationSuite(name=NAME_EXPECTATION_SUITE)
expectation_suite = context.suites.add(expectation_suite)

# -- 2.1. Define table level expectations
columns = list(df_configure.columns)
expectation = gx.expectations.ExpectTableColumnsToMatchSet(column_set=columns)
expectation_suite.add_expectation(expectation)

# -- 2.2. Define column level expectations
# -- 2.2.1. Ensure vendor ID is either 1 or 2
expected_values = [1, 2]
expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="VendorID",
    value_set=expected_values,
)
expectation_suite.add_expectation(expectation)

# -- 2.2.2. Validate that all columns have non-null values
for column in columns:
    expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column=column)
    expectation_suite.add_expectation(expectation)

# -- 2.2.3. Validate that pickup and dropoff datetimes are in the correct format
date_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
DATE_PATTERN = (
    r"^(?:19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) (?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$"
)
for date_column in date_columns:
    expectation = gx.expectations.ExpectColumnValuesToMatchRegex(
        column=date_column,
        regex=DATE_PATTERN,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.2.4. Validate non-zero columns
numeric_columns = [
    "passenger_count",
    "trip_distance",
    "tip_amount",
]
for numeric_column in numeric_columns:
    expectation = gx.expectations.ExpectColumnValuesToBeBetween(
        column=numeric_column,
        min_value=0,
    )
    expectation_suite.add_expectation(expectation)

# -- 2.3. Evaluate results on test dataset
batch_parameters = {"dataframe": df_configure}
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
validation_results = batch.validate(expectation_suite)

# -- 3. Bundle suite and batch into validation definition and checkpoint w/ bundled
# --    actions for easy execution later
validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=expectation_suite,
    name=NAME_VALIDATION_DEF,
)
_ = context.validation_definitions.add(validation_definition)

action_list = [
    gx.checkpoint.UpdateDataDocsAction(
        name="update_all_data_docs",
    ),
]
checkpoint = gx.Checkpoint(
    name=NAME_CHECKPOINT,
    validation_definitions=[validation_definition],
    actions=action_list,
    result_format={
        "result_format": "COMPLETE",
    },
)
_ = context.checkpoints.add(checkpoint)

# -- 4. Run checkpoint to validate if everything works properly
file_identifier = FILE_CONFIGURE.split("/")[-1]
runid = gx.RunIdentifier(run_name=f"Configuration run - {file_identifier}")
results = checkpoint.run(batch_parameters=batch_parameters, run_id=runid)

Apart from the code that adjusts the root logger, this is the exact same code you can find at https://github.com/jschra/joriktech/tree/main/data_testing_gx_1, so you can use that repository for full reproducibility (by adding the logger snippet); the data is stored there as well.

The logs you get when you run it:

[Screenshot: the same INFO message repeated many times in the log output]

Expected behavior
I would expect my logstream (set to INFO) not to be flooded over and over again with the same message:

INFO [ 2025-01-29 13:50:08 - expectation._get_default_value : 1161 ] _get_default_value called with key "table", but it is not a known field
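
For now, a possible workaround on my end (a sketch, not an official GX API; the logger name below is a guess based on the "expectation._get_default_value" module and function shown in the log record) is to raise the level of just that one logger so the rest of the INFO stream stays intact:

import logging

# Workaround sketch: silence only the logger that appears to emit the repeated
# message. The logger name is an assumption inferred from the module shown in
# the log output; adjust it if your log records point elsewhere.
logging.getLogger("great_expectations.expectations.expectation").setLevel(logging.WARNING)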

Environment (please complete the following information):

  • Operating System: macOS
  • Great Expectations Version: 1.3.3
  • Data Source: pandas
  • Cloud environment: Not relevant

Additional context
None needed, I think.

@adeola-ak (Contributor) commented

Thanks for reporting this and including the steps for reproduction! This seems to have also been previously reported by another user. I will be sure to share this with the team and follow up with you.
