PipesEMRClient wrong log location #27050

Closed
Rahkovsky opened this issue Jan 12, 2025 · 6 comments · Fixed by #27100
Labels: area: dagster-pipes (Related to Dagster Pipes) · integration: aws (Related to dagster-aws) · type: bug (Something isn't working)

Comments

Rahkovsky commented Jan 12, 2025

What's the issue?

PipesEMRClient expects logs to be inside a "containers" folder in S3. This folder is created if we run YARN on EMR. If we run EMR with steps only, this folder does not exist and there are only steps/ and nodes/ folders for logs. Because I don't have a containers folder in my EMR logs, the command:

  result = pipes_emr_client.run(
      context=context,
      run_job_flow_params=job_flow_params,
      extras={"verbose_logging": True},
  ).get_materialize_result()

produces an error:

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "emr_job_flow":
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/execute_plan.py", line 245, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/execute_step.py", line 500, in core_dagster_event_sequence_for_step
    for user_event in _step_output_error_checked_user_event_sequence(
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/execute_step.py", line 183, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/execute_step.py", line 87, in _process_asset_results_to_events
    for user_event in user_event_sequence:
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/compute.py", line 193, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn, compute_context):
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/compute.py", line 162, in _yield_compute_results
    for event in iterate_with_context(
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 488, in iterate_with_context
    with context_fn():
  File "/home/ubuntu/miniforge3/envs/venv/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 84, in op_execution_error_boundary
    raise error_cls(
The above exception was caused by the following exception:
KeyError: 'Contents'
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster/_utils/__init__.py", line 490, in iterate_with_context
    next_output = next(iterator)
                  ^^^^^^^^^^^^^^
  File "/home/ubuntu/github/boosting-behavior-models/pipelines/assets/factory.py", line 52, in asset_function
    yield from factory_function(context, **inputs)
  File "/home/ubuntu/github/boosting-behavior-models/pipelines/assets/logic.py", line 555, in build_emr_job_flow
    result = pipes_emr_client.run(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster_aws/pipes/clients/emr.py", line 149, in run
    self._read_application_logs(context, session, wait_response)
  File "/home/ubuntu/github/boosting-behavior-models/.env/lib/python3.11/site-packages/dagster_aws/pipes/clients/emr.py", line 328, in _read_application_logs
    for obj in self.message_reader.client.list_objects_v2(

version: dagster_aws=0.25.6

It would be great if "containers" were not hard-coded, but instead passed as a parameter. If I change line 313 in [emr.py](https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-aws/dagster_aws/pipes/clients/emr.py#L313) from

containers_prefix = os.path.join(prefix, f"{cluster_id}/containers/")

to

containers_prefix = os.path.join(prefix, f"{cluster_id}/steps/")

the code runs successfully.

I suggest adding a log_folder parameter to PipesEMRClient._read_application_logs and PipesEMRClient.run, with the default value "containers" (rough sketch below).
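
For illustration, a minimal sketch of what the parameterized prefix could look like, assuming the prefix is still built with os.path.join as on line 313 today (the helper name _build_logs_prefix is hypothetical, not actual dagster-aws code):

import os

# Hypothetical helper sketching the proposed log_folder parameter;
# not the actual dagster-aws implementation.
def _build_logs_prefix(prefix: str, cluster_id: str, log_folder: str = "containers") -> str:
    # The default preserves today's behavior (YARN container logs).
    # Passing log_folder="steps" would cover steps-only clusters,
    # where no containers/ prefix is ever written to S3.
    return os.path.join(prefix, f"{cluster_id}/{log_folder}/")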

What did you expect to happen?

Dagster expects EMR logs to be in the containers folder.

How to reproduce?

Run a non-Spark EMR job (a steps-only job flow; a rough repro sketch follows).
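
For illustration only, run_job_flow_params for a steps-only (non-YARN) job might look like the sketch below; the bucket, release label, and instance settings are placeholder assumptions, not taken from the original report:

# Hypothetical repro sketch: a steps-only EMR job flow (no Spark/YARN),
# so only steps/ and node/ prefixes appear under LogUri in S3.
job_flow_params = {
    "Name": "pipes-emr-steps-only-repro",
    "ReleaseLabel": "emr-7.0.0",
    "LogUri": "s3://my-bucket/logs/",  # placeholder bucket
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [
        {
            "Name": "non-spark-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["python3", "-c", "print('hello from EMR')"],
            },
        }
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Passing these params to pipes_emr_client.run(...) as in the snippet above
# then fails in _read_application_logs because the containers/ prefix is missing.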

Dagster version

1.9.6

Deployment type

Other

Deployment details

Code is run on an EC2 Linux instance:

Python 3.11.6
dagit                              1.9.6
dagster                            1.9.6
dagster-aws                        0.25.6
dagster-duckdb                     0.25.6
dagster-graphql                    1.9.6
dagster-pipes                      1.9.6
dagster-postgres                   0.25.6
dagster-webserver                  1.9.6

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@danielgafni (Contributor)

Hey @Rahkovsky!

I don't have a setup to test this right now.
I remember the /steps directory existed with Yarn too, but it didn't contain application logs.
Instead, it contained the logs from the steps spawning Spark applications (e.g. spark-submit logs).

Just to double-check: are you sure that application logs appear in /steps in your setup? Happy to apply the suggested fix in that case.

Rahkovsky (Author) commented Jan 21, 2025

@danielgafni ,

I want to update the issue. The problem was solved by the flag include_stdio_in_messages=True in:

import boto3
from dagster_aws.pipes import PipesEMRClient, PipesS3MessageReader

pipes_emr_resource = PipesEMRClient(
    message_reader=PipesS3MessageReader(
        client=boto3.client("s3"),
        bucket="ppd-data-science",
        include_stdio_in_messages=True,  # forward stdout/stderr via Pipes messages
    )
)

Though it may still be a good idea to add this parameter in case someone wants to download the full logs that were not generated by YARN.

@danielgafni (Contributor)

Happy the flag solved your problem! It was quite a journey to get this working :)

Getting back to my previous question: are you sure /steps contains your Python app logs and not system logs from something else? In that case we'll just merge my PR.

@Rahkovsky (Author)

> Getting back to my previous question: are you sure /steps contains your Python app logs and not system logs from something else?

Yes, I am sure. I see these log files:

aws s3 ls ".../logs/j-31IAUBURMDVNQ/"
                           PRE node/
                           PRE steps/


aws s3 ls ".../logs/j-31IAUBURMDVNQ/steps/s-05981531W5ISFGG65JTF/"
2025-01-19 08:06:14       1508 controller.txt.gz
2025-01-19 08:06:14       1712 stderr.txt.gz
2025-01-19 08:06:14       9059 stdout.txt.gz
2025-01-19 08:06:14        169 syslog.txt.gz

@danielgafni (Contributor)

Alright!

danielgafni added a commit that referenced this issue Jan 22, 2025
@Rahkovsky (Author)

Thank you @danielgafni!
