Difference in features generated during training and inference stages #1099

arjunsatheesan · 2025-01-05T09:46:05Z

The problem:
As part of the training process, I save the generated features as a PySpark DataFrame (train_features_df). At inference time, I call tsfresh's feature_extraction.settings.from_columns on the column names of train_features_df to derive the set of features to generate per column for the inference data:

    columns_of_interest = ["RoomTemp", "CoilTemp", "FanRelay"]

    train_features_df = spark.read.format("parquet").load(<PATH>)
    train_features_pdf = train_features_df.toPandas()
    train_features_pdf = train_features_pdf.drop(columns=["id"])
    features = train_features_pdf.columns.tolist()
    train_kind_to_fc_parameters = tsfresh.feature_extraction.settings.from_columns(features)

    # Inference data consists of the last 24 hours' worth of telemetry
    inference_features_df = generate_features(
        inference_df, columns_of_interest, "normalized", train_kind_to_fc_parameters
    )

    inference_features_pdf = inference_features_df.toPandas()
    inference_features_pdf = inference_features_pdf.drop(columns=["id"])
    inference_features = inference_features_pdf.columns.tolist()
    inference_kind_to_fc_parameters = tsfresh.feature_extraction.settings.from_columns(inference_features)
    print(inference_kind_to_fc_parameters == train_kind_to_fc_parameters)  # Prints False

I use the following function to generate features:

    import pandas as pd
    from pyspark.sql.functions import PandasUDFType, col, expr, pandas_udf
    from tsfresh import extract_features
    from tsfresh.feature_extraction import EfficientFCParameters


    def generate_features(filtered_combined_df, columns_of_interest, prefix, fc_parameters=None):
        # Grouped-map UDF: runs tsfresh on the long-format rows of one id at a time
        @pandas_udf("id string, features map<string, double>", PandasUDFType.GROUPED_MAP)
        def extract_tsfresh_features(pdf):
            if not fc_parameters:
                # Training: no settings given, fall back to EfficientFCParameters
                extracted_features = extract_features(pdf,
                                                      column_id='id', column_sort='time',
                                                      column_kind='kind', column_value='value',
                                                      default_fc_parameters=EfficientFCParameters(),
                                                      disable_progressbar=True)
            else:
                # Inference: reuse the per-kind settings derived from the training columns
                extracted_features = extract_features(pdf,
                                                      column_id='id', column_sort='time',
                                                      column_kind='kind', column_value='value',
                                                      kind_to_fc_parameters=fc_parameters,
                                                      disable_progressbar=True)

            result_pdf = pd.DataFrame({
                "id": extracted_features.index,
                "features": extracted_features.to_dict(orient="records")
            })
            return result_pdf

        # Unpivot the columns of interest into long format: (time, id, kind, value)
        stack_expr = ", ".join([f"'{col_name}', cast({col_name} as string)" for col_name in columns_of_interest])
        df_pivot = filtered_combined_df.selectExpr(
            "time", "UUID",
            f"stack({len(columns_of_interest)}, {stack_expr}) as (kind, value)"
        )
        # Cast values to float and drop rows whose value is null
        df_pivot = df_pivot.withColumn("value", col("value").cast("float")) \
                           .withColumnRenamed("UUID", "id").where(col("value").isNotNull())
        features_df = df_pivot.groupby("id").apply(extract_tsfresh_features)

        # Output column names are taken from the feature keys of the first row only
        first_row_df = features_df.limit(1).selectExpr("explode(features) as (key, value)")
        keys = [row['key'] for row in first_row_df.collect()]
        select_exprs = [col("id")] + [expr(f"features['{key}']").alias(f"{prefix}_{key}") for key in keys]
        features_pivoted_df = features_df.select(*select_exprs)
        print("Features generated successfully.")
        return features_pivoted_df

I notice that the features generated for the inference data differ slightly from those generated for the training data.

When I compare inference_kind_to_fc_parameters with train_kind_to_fc_parameters, I notice that inference_kind_to_fc_parameters has no entry for the FanRelay column. How do I fix the mismatch between the features generated during the training and inference stages?
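A plain equality check only says that the two dictionaries differ, not where. The following is a minimal sketch of a comparison helper, assuming both dictionaries have the kind -> {feature name -> parameters} shape that from_columns returns. The name diff_fc_parameters and the toy dictionaries are hypothetical and not part of tsfresh, and the helper compares feature names only, not parameter values.

```python
def diff_fc_parameters(train_params, inference_params):
    """Summarize how two kind_to_fc_parameters-style dicts differ.

    Both arguments are assumed to map kind -> {feature_name: parameters}.
    Only kinds and feature names are compared, not the parameter values.
    """
    train_kinds = set(train_params)
    inference_kinds = set(inference_params)
    report = {
        "kinds_only_in_train": sorted(train_kinds - inference_kinds),
        "kinds_only_in_inference": sorted(inference_kinds - train_kinds),
        "feature_differences": {},
    }
    # For kinds present on both sides, list the feature names that differ
    for kind in sorted(train_kinds & inference_kinds):
        train_features = set(train_params[kind])
        inference_features = set(inference_params[kind])
        only_in_train = sorted(train_features - inference_features)
        only_in_inference = sorted(inference_features - train_features)
        if only_in_train or only_in_inference:
            report["feature_differences"][kind] = {
                "only_in_train": only_in_train,
                "only_in_inference": only_in_inference,
            }
    return report


# Toy dictionaries for illustration only (not taken from the issue)
train = {
    "RoomTemp": {"mean": None, "maximum": None},
    "FanRelay": {"mean": None},
}
inference = {"RoomTemp": {"mean": None}}

report = diff_fc_parameters(train, inference)
print(report)
```

Running this on the real train_kind_to_fc_parameters and inference_kind_to_fc_parameters would show whether FanRelay is the only difference or whether individual features of the other kinds differ as well.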

Anything else we need to know?:
Note: The training process consumes more than one year's worth of telemetry, whereas the inference data covers only the last 24 hours. I also inspected the FanRelay column in the inference data, and all of its values are floats.
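Since generate_features drops rows with null values and takes the output column names from the first row of features_df only, one thing that might be worth ruling out is an id that has no non-null FanRelay reading inside the 24-hour window. The sketch below works on plain Python records in the same long format (id, kind, value); the helper name kinds_missing_per_id and the sample records are hypothetical, and this is only one possible explanation, not a confirmed cause.

```python
from collections import defaultdict


def kinds_missing_per_id(records, expected_kinds):
    """Return {id: sorted expected kinds that have no non-null value for that id}."""
    expected = set(expected_kinds)
    seen = defaultdict(set)
    for rec in records:
        seen[rec["id"]]  # register the id even if all of its values are null
        if rec["value"] is not None:
            seen[rec["id"]].add(rec["kind"])
    # Keep only the ids for which at least one expected kind is absent
    return {
        id_: sorted(expected - kinds)
        for id_, kinds in seen.items()
        if expected - kinds
    }


# Toy long-format records for illustration only
records = [
    {"id": "unit-a", "kind": "RoomTemp", "value": 21.5},
    {"id": "unit-a", "kind": "FanRelay", "value": None},
    {"id": "unit-b", "kind": "RoomTemp", "value": 20.0},
    {"id": "unit-b", "kind": "FanRelay", "value": 1.0},
]

missing = kinds_missing_per_id(records, ["RoomTemp", "CoilTemp", "FanRelay"])
print(missing)
```

An empty result on the real inference data would suggest that every id has at least one usable value per kind, so the cause would have to lie elsewhere.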

Environment:

Python version: 3.10.12
Operating System: macOS Sequoia
tsfresh version: 0.20.2
Install method (conda, pip, source): pip


nils-braun · 2025-02-16T16:20:36Z

Hi @arjunsatheesan - sorry for the late answer.
The kind_to_fc_parameters dictionary in tsfresh sets the feature settings for the given columns, so we assume each given column is actually present in the data. As a test, you could print or return the data frame just before the feature extraction to check whether the column is really there. Your example code is quite complex and I am not a Spark expert, so I cannot really tell you what is happening, but maybe you accidentally lose the column? I also see that you do some additional operations on the resulting column names (if I am not mistaken), so it might be worth checking whether the unmodified column names match.
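To illustrate the point about modified column names: generate_features aliases every feature as f"{prefix}_{key}", so the saved column names start with the prefix. Assuming tsfresh's `<kind>__<feature>` naming convention, where the kind is the part before the first double underscore, a prefixed name would yield a kind such as normalized_RoomTemp rather than RoomTemp. The helpers kind_of and strip_prefix below are hypothetical illustrations in plain Python, not tsfresh functions, and the sample column names are made up.

```python
def kind_of(column_name):
    """Return the part before the first "__", used as the kind in tsfresh-style names."""
    kind, separator, _ = column_name.partition("__")
    if not separator:
        raise ValueError(f"no '__' separator in {column_name!r}")
    return kind


def strip_prefix(column_name, prefix):
    """Remove a leading "<prefix>_" from a column name, if present."""
    marker = f"{prefix}_"
    if column_name.startswith(marker):
        return column_name[len(marker):]
    return column_name


# Made-up column names in the shape generate_features would produce
columns = ["normalized_RoomTemp__mean", "normalized_FanRelay__maximum"]

kinds_with_prefix = sorted({kind_of(c) for c in columns})
kinds_without_prefix = sorted({kind_of(strip_prefix(c, "normalized")) for c in columns})

print(kinds_with_prefix)
print(kinds_without_prefix)
```

If the kinds derived from the saved training columns carry the prefix while the kind column of the inference data does not, stripping the prefix before deriving the settings may be enough to make the two sides line up.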

arjunsatheesan added the bug label Jan 5, 2025
