Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimeoutError: Worker process failed to connect #4

Open
lemeow125 opened this issue Jan 3, 2025 · 9 comments · May be fixed by #5
Open

TimeoutError: Worker process failed to connect #4

lemeow125 opened this issue Jan 3, 2025 · 9 comments · May be fixed by #5

Comments

@lemeow125
Copy link

I'm getting the following error.

2025-01-03 14:06:33 2025-01-03 06:06:33 [info     ] Started TCP server             port=41063 session_id=38cbed58
2025-01-03 14:06:33 2025-01-03 06:06:33 [info     ] Starting worker process        game_id=csgo game_type=Diamond python=/root/.cache-data/dweam-venv/bin/python script=/app/dweam/game_process.py session_id=38cbed58
2025-01-03 14:06:33 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-03 14:06:33 task: <Task pending name='Task-5473' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-03 14:06:33 2025-01-03 06:06:33 [info     ] Started worker process         pid=451 session_id=38cbed58
2025-01-03 14:06:33 2025-01-03 06:06:33 [debug    ] Starting server                session_id=38cbed58
2025-01-03 14:06:38 2025-01-03 06:06:38 [debug    ] Reading stdout                 session_id=38cbed58
2025-01-03 14:06:40 2025-01-03 06:06:40 [warning  ] Timeout reading stdout         session_id=38cbed58
2025-01-03 14:06:40 2025-01-03 06:06:40 [error    ] Worker failed to connect       returncode=None session_id=38cbed58 stderr=None stdout=None
2025-01-03 14:06:40 2025-01-03 06:06:40 [error    ] Error during connection        session_id=38cbed58
2025-01-03 14:06:40 Traceback (most recent call last):
2025-01-03 14:06:40   File "/app/dweam/worker.py", line 229, in start
2025-01-03 14:06:40     raise asyncio.TimeoutError("Worker process failed to connect")
2025-01-03 14:06:40 asyncio.exceptions.TimeoutError: Worker process failed to connect
2025-01-03 14:06:40 
2025-01-03 14:06:40 During handling of the above exception, another exception occurred:
2025-01-03 14:06:40 
2025-01-03 14:06:40 Traceback (most recent call last):
2025-01-03 14:06:40   File "/app/dweam/worker.py", line 240, in start
2025-01-03 14:06:40     raise TimeoutError("Worker process failed to connect")
2025-01-03 14:06:40 TimeoutError: Worker process failed to connect
2025-01-03 14:06:40 2025-01-03 06:06:40 [error    ] Error starting game worker     session_id=38cbed58
2025-01-03 14:06:40 Traceback (most recent call last):
2025-01-03 14:06:40   File "/app/dweam/server.py", line 226, in event_generator
2025-01-03 14:06:40     answer = await run_task
2025-01-03 14:06:40   File "/app/dweam/worker.py", line 275, in run
2025-01-03 14:06:40     await self.start()
2025-01-03 14:06:40   File "/app/dweam/worker.py", line 270, in start
2025-01-03 14:06:40     raise RuntimeError(f"Failed to start worker after {max_retries} attempts")
2025-01-03 14:06:40 RuntimeError: Failed to start worker after 3 attempts
2025-01-03 14:06:40 DEBUG:sse_starlette.sse:chunk: b'event: error\r\ndata: Failed to start worker after 3 attempts\r\n\r\n'
2025-01-03 14:06:40 DEBUG:sse_starlette.sse:Got event: http.disconnect. Stop streaming.

After 3 retries, the frontend app shows the following error.
image

@irgolic
Copy link
Contributor

irgolic commented Jan 3, 2025

Hmm, this one isn't immediately obvious.

Could you provide the full logs of the run? You can put them in a toggleable summary tag, as shown here https://gist.github.com/scmx/eca72d44afee0113ceb0349dd54a84a2

@lemeow125
Copy link
Author

lemeow125 commented Jan 3, 2025

I can't seem to get that formatted to a code block. I've uploaded log files instead.
dweam_backend.log
dweam_turnserver.log
dweam_frontend.log

@irgolic
Copy link
Contributor

irgolic commented Jan 3, 2025

Thanks for the logs!

Could you try #5 to see if it fixes your issue? I can't replicate it locally.

@lemeow125
Copy link
Author

Still running into an error unfortunately.

2025-01-03 16:01:12 INFO:     172.19.0.2:39864 - "GET /turn-credentials HTTP/1.1" 200 OK
2025-01-03 16:01:12 INFO:     172.19.0.2:39868 - "POST /offer/Lucid/minecraft HTTP/1.1" 200 OK
2025-01-03 16:01:12 2025-01-03 08:01:12 [info     ] Started TCP server             port=45807 session_id=9f8fd580
2025-01-03 16:01:12 2025-01-03 08:01:12 [info     ] Starting worker process        game_id=minecraft game_type=Lucid python=/root/.cache-data/dweam-venv/bin/python script=/app/dweam/game_process.py session_id=9f8fd580
2025-01-03 16:01:12 2025-01-03 08:01:12 [info     ] Started worker process         pid=415 session_id=9f8fd580
2025-01-03 16:01:12 2025-01-03 08:01:12 [info     ] Attempting connection          attempt=1 session_id=9f8fd580 timeout=5.0
2025-01-03 16:01:17 2025-01-03 08:01:17 [warning  ] Connection attempt timed out   is_running=True pid=415 session_id=9f8fd580
2025-01-03 16:01:17 2025-01-03 08:01:17 [info     ] Attempting connection          attempt=2 session_id=9f8fd580 timeout=10.0
2025-01-03 16:01:27 DEBUG:sse_starlette.sse:ping: b': ping - 2025-01-03 08:01:27.531381+00:00\r\n\r\n'
2025-01-03 16:01:27 2025-01-03 08:01:27 [warning  ] Connection attempt timed out   is_running=True pid=415 session_id=9f8fd580
2025-01-03 16:01:27 2025-01-03 08:01:27 [error    ] Error during connection        session_id=9f8fd580
2025-01-03 16:01:27 Traceback (most recent call last):
2025-01-03 16:01:27   File "/app/dweam/worker.py", line 259, in start
2025-01-03 16:01:27     await self.attempt_connection(server, client_connected)
2025-01-03 16:01:27   File "/app/dweam/worker.py", line 168, in attempt_connection
2025-01-03 16:01:27     await server_task
2025-01-03 16:01:27   File "/usr/lib/python3.10/asyncio/base_events.py", line 365, in serve_forever
2025-01-03 16:01:27     raise RuntimeError(f'server {self!r} is closed')
2025-01-03 16:01:27 RuntimeError: server <Server sockets=()> is closed
2025-01-03 16:01:27 2025-01-03 08:01:27 [info     ] Retry attempt 2/3              session_id=9f8fd580
2025-01-03 16:01:28 2025-01-03 08:01:28 [info     ] Started TCP server             port=42467 session_id=9f8fd580
2025-01-03 16:01:28 2025-01-03 08:01:28 [info     ] Starting worker process        game_id=minecraft game_type=Lucid python=/root/.cache-data/dweam-venv/bin/python script=/app/dweam/game_process.py session_id=9f8fd580
2025-01-03 16:01:28 2025-01-03 08:01:28 [info     ] Started worker process         pid=429 session_id=9f8fd580
2025-01-03 16:01:28 2025-01-03 08:01:28 [info     ] Attempting connection          attempt=1 session_id=9f8fd580 timeout=5.0
2025-01-03 16:01:33 2025-01-03 08:01:33 [warning  ] Connection attempt timed out   is_running=True pid=429 session_id=9f8fd580
2025-01-03 16:01:33 2025-01-03 08:01:33 [info     ] Attempting connection          attempt=2 session_id=9f8fd580 timeout=10.0
2025-01-03 16:01:42 DEBUG:sse_starlette.sse:ping: b': ping - 2025-01-03 08:01:42.533132+00:00\r\n\r\n'
2025-01-03 16:01:43 2025-01-03 08:01:43 [warning  ] Connection attempt timed out   is_running=True pid=429 session_id=9f8fd580
2025-01-03 16:01:43 2025-01-03 08:01:43 [error    ] Error during connection        session_id=9f8fd580
2025-01-03 16:01:43 Traceback (most recent call last):
2025-01-03 16:01:43   File "/app/dweam/worker.py", line 259, in start
2025-01-03 16:01:43     await self.attempt_connection(server, client_connected)
2025-01-03 16:01:43   File "/app/dweam/worker.py", line 168, in attempt_connection
2025-01-03 16:01:43     await server_task
2025-01-03 16:01:43   File "/usr/lib/python3.10/asyncio/base_events.py", line 365, in serve_forever
2025-01-03 16:01:43     raise RuntimeError(f'server {self!r} is closed')
2025-01-03 16:01:43 RuntimeError: server <Server sockets=()> is closed
2025-01-03 16:01:43 2025-01-03 08:01:43 [info     ] Retry attempt 3/3              session_id=9f8fd580
2025-01-03 16:01:44 2025-01-03 08:01:44 [info     ] Started TCP server             port=39899 session_id=9f8fd580
2025-01-03 16:01:44 2025-01-03 08:01:44 [info     ] Starting worker process        game_id=minecraft game_type=Lucid python=/root/.cache-data/dweam-venv/bin/python script=/app/dweam/game_process.py session_id=9f8fd580
2025-01-03 16:01:44 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-03 16:01:44 task: <Task pending name='Task-150' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-03 16:01:44 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-03 16:01:44 task: <Task pending name='Task-153' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-03 16:01:44 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-03 16:01:44 task: <Task pending name='Task-159' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-03 16:01:44 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-03 16:01:44 task: <Task pending name='Task-162' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-03 16:01:44 2025-01-03 08:01:44 [info     ] Started worker process         pid=443 session_id=9f8fd580
2025-01-03 16:01:44 2025-01-03 08:01:44 [info     ] Attempting connection          attempt=1 session_id=9f8fd580 timeout=5.0

@irgolic
Copy link
Contributor

irgolic commented Jan 3, 2025

Thanks for testing the fix, and for your patience. It seems like the worker process DOES run, but doesn't connect to the main process. I thought the 5s timeout mightn't be long enough on your system, but seems like that's not the issue.

In #5 I've now:

  • simplified the timeout logic; it'll try thrice, waiting 5s, 10s, 20s (this may fix it for you)
  • made the worker process write its stdout/stderr to a file in the cache directory. It'll be at .cache-data/worker_logs/game_process_{PID}.log

Could you give it another shot? Hope this fix works, but if not, please share the worker logs in .cache-data/worker_logs

@lemeow125
Copy link
Author

Hi, thanks for the help again.

I can't seem to find the directory you've provided for logs.
image

Here's a snippet from my Docker Desktop logs instead.

2025-01-04 00:35:07 2025-01-03 16:35:07 [error    ] Error during connection        session_id=893669c5
2025-01-04 00:35:07 Traceback (most recent call last):
2025-01-04 00:35:07   File "/app/dweam/worker.py", line 227, in start
2025-01-04 00:35:07     raise asyncio.TimeoutError("Worker process failed to connect")
2025-01-04 00:35:07 asyncio.exceptions.TimeoutError: Worker process failed to connect
2025-01-04 00:35:07 
2025-01-04 00:35:07 During handling of the above exception, another exception occurred:
2025-01-04 00:35:07 
2025-01-04 00:35:07 Traceback (most recent call last):
2025-01-04 00:35:07   File "/app/dweam/worker.py", line 238, in start
2025-01-04 00:35:07     raise TimeoutError("Worker process failed to connect")
2025-01-04 00:35:07 TimeoutError: Worker process failed to connect
2025-01-04 00:35:07 2025-01-03 16:35:07 [info     ] Started TCP server             port=44541 session_id=893669c5
2025-01-04 00:35:07 2025-01-03 16:35:07 [info     ] Starting worker process        game_id=csgo game_type=Diamond python=/root/.cache-data/dweam-venv/bin/python script=/app/dweam/game_process.py session_id=893669c5
2025-01-04 00:35:07 2025-01-03 16:35:07 [info     ] Started worker process         pid=474 session_id=893669c5
2025-01-04 00:35:07 2025-01-03 16:35:07 [debug    ] Starting server                session_id=893669c5
2025-01-04 00:35:17 2025-01-03 16:35:17 [debug    ] Reading stdout                 session_id=893669c5
2025-01-04 00:35:19 2025-01-03 16:35:19 [warning  ] Timeout reading stdout         session_id=893669c5
2025-01-04 00:35:19 2025-01-03 16:35:19 [error    ] Worker failed to connect       returncode=None session_id=893669c5 stderr=None stdout=None
2025-01-04 00:35:19 2025-01-03 16:35:19 [error    ] Error during connection        session_id=893669c5
2025-01-04 00:35:19 Traceback (most recent call last):
2025-01-04 00:35:19   File "/app/dweam/worker.py", line 227, in start
2025-01-04 00:35:19     raise asyncio.TimeoutError("Worker process failed to connect")
2025-01-04 00:35:19 asyncio.exceptions.TimeoutError: Worker process failed to connect
2025-01-04 00:35:19 
2025-01-04 00:35:19 During handling of the above exception, another exception occurred:
2025-01-04 00:35:19 
2025-01-04 00:35:19 Traceback (most recent call last):
2025-01-04 00:35:19   File "/app/dweam/worker.py", line 238, in start
2025-01-04 00:35:19     raise TimeoutError("Worker process failed to connect")
2025-01-04 00:35:19 TimeoutError: Worker process failed to connect
2025-01-04 00:35:19 2025-01-03 16:35:19 [info     ] Started TCP server             port=38011 session_id=893669c5
2025-01-04 00:35:19 2025-01-03 16:35:19 [info     ] Starting worker process        game_id=csgo game_type=Diamond python=/root/.cache-data/dweam-venv/bin/python script=/app/dweam/game_process.py session_id=893669c5
2025-01-04 00:35:19 2025-01-03 16:35:19 [info     ] Started worker process         pid=488 session_id=893669c5
2025-01-04 00:35:19 2025-01-03 16:35:19 [debug    ] Starting server                session_id=893669c5
2025-01-04 00:35:39 2025-01-03 16:35:39 [debug    ] Reading stdout                 session_id=893669c5
2025-01-04 00:35:41 2025-01-03 16:35:41 [warning  ] Timeout reading stdout         session_id=893669c5
2025-01-04 00:35:41 2025-01-03 16:35:41 [error    ] Worker failed to connect       returncode=None session_id=893669c5 stderr=None stdout=None
2025-01-04 00:35:41 2025-01-03 16:35:41 [error    ] Error during connection        session_id=893669c5
2025-01-04 00:35:41 Traceback (most recent call last):
2025-01-04 00:35:41   File "/app/dweam/worker.py", line 227, in start
2025-01-04 00:35:41     raise asyncio.TimeoutError("Worker process failed to connect")
2025-01-04 00:35:41 asyncio.exceptions.TimeoutError: Worker process failed to connect
2025-01-04 00:35:41 
2025-01-04 00:35:41 During handling of the above exception, another exception occurred:
2025-01-04 00:35:41 
2025-01-04 00:35:41 Traceback (most recent call last):
2025-01-04 00:35:41   File "/app/dweam/worker.py", line 238, in start
2025-01-04 00:35:41     raise TimeoutError("Worker process failed to connect")
2025-01-04 00:35:41 TimeoutError: Worker process failed to connect
2025-01-04 00:35:41 2025-01-03 16:35:41 [error    ] Attempt 3 failed               session_id=893669c5
2025-01-04 00:35:41 Traceback (most recent call last):
2025-01-04 00:35:41   File "/app/dweam/worker.py", line 227, in start
2025-01-04 00:35:41     raise asyncio.TimeoutError("Worker process failed to connect")
2025-01-04 00:35:41 asyncio.exceptions.TimeoutError: Worker process failed to connect
2025-01-04 00:35:41 
2025-01-04 00:35:41 During handling of the above exception, another exception occurred:
2025-01-04 00:35:41 
2025-01-04 00:35:41 Traceback (most recent call last):
2025-01-04 00:35:41   File "/app/dweam/worker.py", line 238, in start
2025-01-04 00:35:41     raise TimeoutError("Worker process failed to connect")
2025-01-04 00:35:41 TimeoutError: Worker process failed to connect
2025-01-04 00:35:41 2025-01-03 16:35:41 [error    ] Error starting game worker     session_id=893669c5
2025-01-04 00:35:41 Traceback (most recent call last):
2025-01-04 00:35:41   File "/app/dweam/worker.py", line 227, in start
2025-01-04 00:35:41     raise asyncio.TimeoutError("Worker process failed to connect")
2025-01-04 00:35:41 asyncio.exceptions.TimeoutError: Worker process failed to connect
2025-01-04 00:35:41 
2025-01-04 00:35:41 During handling of the above exception, another exception occurred:
2025-01-04 00:35:41 
2025-01-04 00:35:41 Traceback (most recent call last):
2025-01-04 00:35:41   File "/app/dweam/server.py", line 226, in event_generator
2025-01-04 00:35:41     answer = await run_task
2025-01-04 00:35:41   File "/app/dweam/worker.py", line 275, in run
2025-01-04 00:35:41     await self.start()
2025-01-04 00:35:41   File "/app/dweam/worker.py", line 238, in start
2025-01-04 00:35:41     raise TimeoutError("Worker process failed to connect")
2025-01-04 00:35:41 TimeoutError: Worker process failed to connect
2025-01-03 23:05:54 DEBUG:sse_starlette.sse:ping: b': ping - 2025-01-03 15:05:54.786962+00:00\r\n\r\n'
2025-01-03 23:06:09 DEBUG:sse_starlette.sse:ping: b': ping - 2025-01-03 15:06:09.788955+00:00\r\n\r\n'
2025-01-03 23:06:21 DEBUG:sse_starlette.sse:chunk: b'event: error\r\ndata: Worker process failed to connect\r\n\r\n'
2025-01-03 23:06:21 DEBUG:sse_starlette.sse:Got event: http.disconnect. Stop streaming.
2025-01-04 00:35:00 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-04 00:35:00 task: <Task pending name='Task-9514' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-04 00:35:00 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-04 00:35:00 task: <Task pending name='Task-9521' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-04 00:35:00 ERROR:asyncio:Task was destroyed but it is pending!
2025-01-04 00:35:00 task: <Task pending name='Task-9528' coro=<Event.wait() done, defined at /usr/lib/python3.10/asyncio/locks.py:201> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2025-01-04 00:35:15 DEBUG:sse_starlette.sse:ping: b': ping - 2025-01-03 16:35:15.165167+00:00\r\n\r\n'
2025-01-04 00:35:30 DEBUG:sse_starlette.sse:ping: b': ping - 2025-01-03 16:35:30.165087+00:00\r\n\r\n'
2025-01-04 00:35:41 DEBUG:sse_starlette.sse:chunk: b'event: error\r\ndata: Worker process failed to connect\r\n\r\n'
2025-01-04 00:35:41 DEBUG:sse_starlette.sse:Got event: http.disconnect. Stop streaming.

@irgolic
Copy link
Contributor

irgolic commented Jan 3, 2025

Hi, if you're looking inside the container check /root/.cache-data; it's mounted from .cache-data in the directory you're running docker compose from.

@lemeow125
Copy link
Author

Found it, thanks.

Here's one of the logs generated.
game_process_418.log

@irgolic
Copy link
Contributor

irgolic commented Jan 3, 2025

Hey, thanks for the worker logs, and for your patience.

I've got two potential fixes :P

First one's on #5 – the guess here is that your PC is faster than mine, and therefore the subprocess wins the race condition against the reverse connection TCP socket server in the main process. It now ensure that the server starts before starting the worker process.

If that doesn't work, second one is on a branch called worker-not-connecting-2 – it allows the worker to retry connecting a few times.

Could you give it another try? Thanks for playtesting.

EDIT: sorry, neglected to push the 2nd branch until right now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants