
Are the EMR-related constants working? #4968

Open

fcas opened this issue Jan 25, 2025 · 0 comments

fcas commented Jan 25, 2025

Expected Behavior

Expected behavior when the following environment variables are set (a minimal invocation sketch follows the numbered list):

FEAST_SPARK_STAGING_LOCATION=s3://...
FEAST_SPARK_LAUNCHER="emr"
FEAST_EMR_CLUSTER_ID=""
FEAST_EMR_LOG_LOCATION=""
  1. Successfully connect to the specified EMR cluster.
  2. Initialize a SparkSession within the EMR environment.
  3. Execute the defined Spark job on the cluster.
  4. Upon completion of the Spark job, terminate the SparkSession and release cluster resources.
  5. Store the results of the Spark job in the online store.
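
For context, this is roughly how the materialization would be invoked; a minimal sketch assuming a local feature repo, and keeping the placeholder values from this report (the real bucket, cluster ID, and log path are elided):

import os
from datetime import datetime, timedelta

from feast import FeatureStore

# Placeholder values copied from this report; the real values are elided.
os.environ["FEAST_SPARK_STAGING_LOCATION"] = "s3://..."
os.environ["FEAST_SPARK_LAUNCHER"] = "emr"
os.environ["FEAST_EMR_CLUSTER_ID"] = ""
os.environ["FEAST_EMR_LOG_LOCATION"] = ""

# Load the repo (feature_store.yaml in the current directory) and materialize
# the last day of feature data into the online store.
store = FeatureStore(repo_path=".")
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)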

Current Behavior

No Spark job is submitted to the EMR cluster; the materialization exits silently, with no indication of success or failure.

Steps to Reproduce

Environment variables:

FEAST_SPARK_STAGING_LOCATION=s3://...
FEAST_SPARK_LAUNCHER="emr"
FEAST_EMR_CLUSTER_ID=""
FEAST_EMR_LOG_LOCATION=""

RepoConfig (a programmatic equivalent is sketched after the JSON):

{
    "project": "",
    "registry": "s3://",
    "provider": "aws",
    "entity_key_serialization_version": 2,
    "online_store":
    {
        "type": "dynamodb",
        "region": "us-west-2"
    },
    "offline_store":
    {
        "type": "spark",
        "region": "us-west-2",
        "staging_location": "s3://",
        "spark_conf":
        {
            "spark.master":"",
            "spark.ui.enabled": "true",
            "spark.eventLog.enabled": "false",
            "spark.sql.catalogImplementation": "hive",
            "spark.sql.parser.quotedRegexColumnNames": "true",
            "spark.sql.session.timeZone": "UTC",
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.1"
        }
    },
    "batch_engine":
    {
        "type": "spark.engine"
    }
}
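
For reference, the same configuration can be constructed programmatically and handed to a FeatureStore; a minimal sketch assuming feast 0.42.0's RepoConfig accepts these fields as keyword arguments (values marked hypothetical stand in for strings elided in this report):

from feast import FeatureStore
from feast.repo_config import RepoConfig

config = RepoConfig(
    project="my_project",                # hypothetical; elided in this report
    registry="s3://bucket/registry.db",  # hypothetical; elided in this report
    provider="aws",
    entity_key_serialization_version=2,
    online_store={"type": "dynamodb", "region": "us-west-2"},
    offline_store={
        "type": "spark",
        "region": "us-west-2",
        "staging_location": "s3://bucket/staging",  # hypothetical
        "spark_conf": {
            "spark.master": "",  # placeholder kept from the report
            "spark.ui.enabled": "true",
            "spark.eventLog.enabled": "false",
            "spark.sql.catalogImplementation": "hive",
            "spark.sql.parser.quotedRegexColumnNames": "true",
            "spark.sql.session.timeZone": "UTC",
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.1",
        },
    },
    batch_engine={"type": "spark.engine"},
)

# Build the store directly from the in-memory config instead of feature_store.yaml.
store = FeatureStore(config=config)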

Specifications

  • Version:
    feast = { extras = ["aws", "gcp", "spark"], version = "==0.42.0" }
  • Platform: macOS

Possible Solution

The only code reference I found for the FEAST_SPARK_LAUNCHER constant dates back four years:

feast_spark_launcher = "emr"

Are the EMR-related constants working in the current Feast version (0.42.0)? These are the constants in question (a diagnostic check follows them):

# EMR cluster to run Feast Spark Jobs in
EMR_CLUSTER_ID: Optional[str] = None

# Region of EMR cluster
EMR_REGION: Optional[str] = None

# Template path of EMR cluster
EMR_CLUSTER_TEMPLATE_PATH: Optional[str] = None

# Log path of EMR cluster
EMR_LOG_LOCATION: Optional[str] = None
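
One way to check whether the installed package still consumes these names is to search the installed source tree. A quick diagnostic sketch (not part of the report; it only greps the installed feast sources):

import pathlib

import feast

# Search the installed feast package for remaining references to the
# EMR-related constant names; prints nothing if no references exist.
src_root = pathlib.Path(feast.__file__).parent
names = ["FEAST_SPARK_LAUNCHER", "EMR_CLUSTER_ID", "EMR_LOG_LOCATION"]
for py in sorted(src_root.rglob("*.py")):
    text = py.read_text(errors="ignore")
    for name in names:
        if name in text:
            print(f"{py.relative_to(src_root)}: {name}")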
