Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move input files into temporary directory #483

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

niwa2
Copy link
Contributor

@niwa2 niwa2 commented Jan 26, 2025

Related issues / PRs. Summarize issues.

Gradio is copies every input file into a cache directory and then provides the path to those files.
The problem is, that Gradio regularly cleans out the cache directory (currently every file older than 1 hour is deleted)
This is a problem if you have multiple files in the queue or if BGM removal is disabled and diarization is enabled.
If the transcription in one of the above cases takes longer than one hour, the input files are gone and everything goes south.

Summarize Changes

  1. move all files from the Gradio cache into a local temprary folder, that is deleted when everything is done or the app closes
  2. update the input file browser upon completion to be empty again

@niwa2 niwa2 force-pushed the fix_missing_input_files branch from bd0e097 to 4d47cd4 Compare January 26, 2025 14:35
… deleting the cached files before everything is finished
@niwa2 niwa2 force-pushed the fix_missing_input_files branch from 4d47cd4 to cc4d389 Compare January 26, 2025 14:54
@jhj0517 jhj0517 added the enhancement New feature or request label Jan 27, 2025
@jhj0517
Copy link
Owner

jhj0517 commented Jan 27, 2025

Hi. This is handled by gradio's delete_cache parameter: https://www.gradio.app/guides/resource-cleanup#automatic-cache-cleanup-via-delete-cache`.

Right now the default value the application uses is (60, 3600), which means the maxium age of the cached file is 3600 sec, with a cleaning frequency of 60 sec.

I think it's better to add --delete_cache as a cli option to let users customize the values.

@niwa2
Copy link
Contributor Author

niwa2 commented Jan 27, 2025

well, i am not sure what the best approach might be.
Increasing the timeout, can lead to the accumulation of a lot of unneeded data. That is the reason why this parameter was introduced in the first place.
A too small timeout will cause issues as described above, especially when one does not have a GPU to speed things up.
As far as i have seen, there is no possibility to trigger the cache cleaning on demand (e.g. when processing is finished).

I was also thinking about copying the data into the temporary folder instead of moving.
The gradio cache does not seem to be needed anymore, moving is a lot faster than copying and does not need the additional space.
With this solution one does not need to think about the delete_cache parameters. So it is "fool prove" and works out of the box. The user does not need to know that this parameter exists and also does not need to find out which value is needed in the users specific use case. (e.g. i was running a transcription an a dated NAS, on the CPU only and got the message, that the diarization could not load the audio file after roughly 12 hours. Then had to investigate what has happend, why this happened and how to prevent it from happening again, before i could start from the beginning. It would be nice, if the average user does not need to go through that.)
The only issue i have is, how to reliably detect, if the files are really from the gradio cache folder, because i do not know if there is any possibility, that we might get handed a file from a folder other than the gradio cache folder (like it is done in the pytests).
Copying would be the safer solution in this case.

Regading the issue with BGM disabled and then the diarization might fail after the transcription, one could also think about loading the audio at the begining (where the deepcopy takes place, but actually load the audio into RAM). This could solve the failing diarization at the cost of RAM usage, but it would still eventually fail if mutliple files are queued.

@jhj0517
Copy link
Owner

jhj0517 commented Jan 28, 2025

Thank you for explaining your driving.

I agree that some people might run into this problem more often than I thought, because the default caching age in the app might be too short now—currently it's 1 hour.

Originally the delete_cache thing did not exist, and the cached files remain stored on disk forever, and it gives inconvenience to people just like #10.
But since the gradio suggested solution with delete_cache now, I'd stick to gradio's solution as it's common approach to deal with when caching the file.

So the desirable way I think now is to increase the default caching age —probably 1 day or more— in the app and document it on the wiki page.

@niwa2
Copy link
Contributor Author

niwa2 commented Jan 28, 2025

I was thinking about the queing, when two users or two different tabs click the "GENERATE SUBTITLE FILE"-Button.
I am not sure how gradio behaves, but if the click action is not executed immediately but is queued until the previous one is finished, then this solution of copieing the files would probably not work anyway.
So it indeed seems like increasing the timeout might be the best option.
It would be nice though to have the possibility to clean up the temp directory when the job is done.
Maybe i will investigate in that direction.

What do you think?:
Does it make sense to simply delete the input files from a folder containing "gradio" in the path after the job is finished?
Or could that do more harm than it helps?

@jhj0517
Copy link
Owner

jhj0517 commented Feb 2, 2025

Sorry for late reply. I didn't have a time.

but is queued until the previous one is finished, then this solution of copieing the files would probably not work anyway.

This is right point, It seriazably work if it's in the same queue, so this won't work anyway.
To enable multiple queues, #452 should be merged. But there are some UI issues about the queue in the gradio now.

In my opinion I see no problem if we set the caching age to a high value, but if we have to deal with this then it would be the way to add something in the finally scope in the function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants