The problem:
As part of the training process, I save the generated features as a PySpark dataframe (train_features_df). At inference time, I call tsfresh's feature_extraction.settings.from_columns on the column names of train_features_df to recover the set of features to generate per column for the inference data:
columns_of_interest = [
    "RoomTemp",
    "CoilTemp",
    "FanRelay"
]
train_features_df = spark.read.format("parquet").load(<PATH>)
train_features_pdf = train_features_df.toPandas()
train_features_pdf = train_features_pdf.drop(columns=["id"])
features = train_features_pdf.columns.tolist()
train_kind_to_fc_parameters = tsfresh.feature_extraction.settings.from_columns(features)
# Inference data consists of last 24 hours worth of telemetry
inference_features_df = generate_features(inference_df, columns_of_interest, "normalized",
                                          train_kind_to_fc_parameters)
inference_features_pdf = inference_features_df.toPandas()
inference_features_pdf = inference_features_pdf.drop(columns=["id"])
inference_features = inference_features_pdf.columns.tolist()
inference_kind_to_fc_parameters = tsfresh.feature_extraction.settings.from_columns(inference_features)
print(inference_kind_to_fc_parameters == train_kind_to_fc_parameters) # Prints False
I use the function below to generate features:
import pandas as pd
from pyspark.sql.functions import col, expr, pandas_udf, PandasUDFType
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

def generate_features(filtered_combined_df, columns_of_interest, prefix, fc_parameters=None):
    # Runs once per id group; returns one row per id with a feature-name -> value map
    @pandas_udf("id string, features map<string, double>", PandasUDFType.GROUPED_MAP)
    def extract_tsfresh_features(pdf):
        if not fc_parameters:
            extracted_features = extract_features(pdf,
                                                  column_id='id', column_sort='time',
                                                  column_kind='kind', column_value='value',
                                                  default_fc_parameters=EfficientFCParameters(),
                                                  disable_progressbar=True)
        else:
            extracted_features = extract_features(pdf,
                                                  column_id='id', column_sort='time',
                                                  column_kind='kind', column_value='value',
                                                  kind_to_fc_parameters=fc_parameters,
                                                  disable_progressbar=True)
        result_pdf = pd.DataFrame({
            "id": extracted_features.index,
            "features": extracted_features.to_dict(orient="records")
        })
        return result_pdf

    # Unpivot the wide telemetry columns into long format: (time, id, kind, value)
    stack_expr = ", ".join([f"'{col_name}', cast({col_name} as string)" for col_name in columns_of_interest])
    df_pivot = filtered_combined_df.selectExpr(
        "time", "UUID",
        f"stack({len(columns_of_interest)}, {stack_expr}) as (kind, value)"
    )
    # Rows with null values are dropped here, before feature extraction
    df_pivot = df_pivot.withColumn("value", col("value").cast("float")) \
        .withColumnRenamed("UUID", "id").where(col("value").isNotNull())
    features_df = df_pivot.groupby("id").apply(extract_tsfresh_features)
    # The output column names are taken from the feature map of the first row only
    first_row_df = features_df.limit(1).selectExpr("explode(features) as (key, value)")
    keys = [row['key'] for row in first_row_df.collect()]
    select_exprs = [col("id")] + [expr(f"features['{key}']").alias(f"{prefix}_{key}") for key in keys]
    features_pivoted_df = features_df.select(*select_exprs)
    print("Features generated successfully.")
    return features_pivoted_df
I notice that the features in the inference data are slightly different from those in the training data.
When I compare inference_kind_to_fc_parameters with train_kind_to_fc_parameters, I notice that inference_kind_to_fc_parameters has no entry for the FanRelay column. How do I fix the mismatch between the features generated during the training and inference stages?
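The equality check above only reports that the two mappings differ, not where. A minimal sketch of a more detailed comparison follows. It assumes both mappings have the shape returned by from_columns, a dict of kind to a dict of feature calculator names. The helper name diff_fc_parameters and the toy data are hypothetical, and the sketch compares calculator names only, not their parameter lists.

```python
def diff_fc_parameters(train_params, inference_params):
    """Report kinds and feature calculators found in one mapping but not the other.

    Both arguments are assumed to look like {kind: {feature_name: params_or_None}}.
    Only kind and feature names are compared; parameter lists are ignored.
    """
    train_kinds = set(train_params)
    inference_kinds = set(inference_params)
    report = {
        "kinds_missing_in_inference": sorted(train_kinds - inference_kinds),
        "kinds_missing_in_training": sorted(inference_kinds - train_kinds),
        "features_missing_in_inference": {},
    }
    # For kinds present on both sides, list calculators that only training has
    for kind in train_kinds & inference_kinds:
        missing = sorted(set(train_params[kind]) - set(inference_params[kind]))
        if missing:
            report["features_missing_in_inference"][kind] = missing
    return report


# Toy mappings (hypothetical values, not taken from the real data)
train = {
    "RoomTemp": {"mean": None, "maximum": None},
    "FanRelay": {"mean": None},
}
inference = {
    "RoomTemp": {"mean": None},
}
report = diff_fc_parameters(train, inference)
print(report)
```

Calling it as diff_fc_parameters(train_kind_to_fc_parameters, inference_kind_to_fc_parameters) would show whether FanRelay is missing as a whole kind or whether only some of its calculators are absent.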
Anything else we need to know?:
Note: The training process consumes more than a year's worth of telemetry, whereas inference looks at only the last 24 hours' worth. I also checked the FanRelay column in the inference data, and all of its values are floats.
Environment:
Python version: 3.10.12
Operating System: macOS Sequoia
tsfresh version: 0.20.2
Install method (conda, pip, source): pip
Hi @arjunsatheesan - sorry for the late answer.
The kind_to_fc_parameters dictionary in tsfresh sets the feature settings for the given columns, so we assume each given column is actually in the data. As a test, you could print or return the data frame right before the feature extraction to check whether the column is actually there. Your example code is quite complex and I am not a Spark expert, so I cannot really tell you what is happening, but maybe you accidentally lose the column? I also see that you do some additional operations on the resulting column names (if I am not mistaken), so it might be worth checking whether the unmodified column names match.
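The two checks suggested above can be sketched in plain Python. This is only an illustration under assumptions, not a diagnosis. In the posted generate_features, rows with null values are filtered out before extraction, the output columns are taken from the first row of features_df only, and a prefix is prepended to every feature name. So one possibility worth ruling out is that some id has no FanRelay rows in the 24-hour window. Another is that the kind names recovered from the prefixed columns no longer match the kinds in the data. The helper names below are hypothetical. The only tsfresh convention relied on is that feature column names separate the kind from the calculator name with a double underscore.

```python
from collections import defaultdict


def ids_missing_kinds(rows, expected_kinds):
    """Return {id: [kinds with no rows]} for ids that lack any expected kind.

    rows is an iterable of (id, kind) pairs taken from the long-format data
    right before feature extraction.
    """
    seen = defaultdict(set)
    for row_id, kind in rows:
        seen[row_id].add(kind)
    expected = set(expected_kinds)
    return {
        row_id: sorted(expected - kinds)
        for row_id, kinds in seen.items()
        if expected - kinds
    }


def kinds_from_feature_columns(columns):
    """Recover kind names as the part of each column name before the first '__'."""
    return sorted({name.split("__", 1)[0] for name in columns})


# Toy long-format rows (hypothetical): id "a" has no FanRelay readings
rows = [
    ("a", "RoomTemp"), ("a", "CoilTemp"),
    ("b", "RoomTemp"), ("b", "CoilTemp"), ("b", "FanRelay"),
]
missing = ids_missing_kinds(rows, ["RoomTemp", "CoilTemp", "FanRelay"])
print(missing)

# Toy feature columns (hypothetical) carrying the "normalized" prefix
recovered = kinds_from_feature_columns(
    ["normalized_RoomTemp__mean", "normalized_CoilTemp__maximum"]
)
print(recovered)
```

With Spark, the (id, kind) pairs could come from something like df_pivot.select("id", "kind").distinct().collect(), run inside generate_features where df_pivot is defined. If the recovered kinds carry the prefix while the kind column in the data does not, the settings derived from the training columns would not line up with the kinds in the inference data.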