-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework garbage collection to keep repo directory #190
Comments
I don't understand this one. Is it because KV store doesn't have a rename (changing the key) and only add/remove? The current GC is the way it is because it originally allowed multiple clients on different desktop computers to do whatever they need to do, and sync later over dropbox/syncthing. That was the original use case I wrote rdedup for (since that's how I tend to use my desktop/laptops: setup syncthing everywhere, and use it as a free sync & backup). I don't think the proposed solution necessarily works in that case, and it also has operational issues (user has to setup some additional With time people pop up with all sorts of different usecases, including industrial level backups, and even trying to make rdedup disk-like storage and so on . So, we're reaching the point where Possibilities:
|
Yes i wanted to make the use-case thing explicit by #189. Sorry if that feels overwhelming. But i am quite confident that - with reasonable tradeoffs - rdedup can stay as lean as it is, which is the reason i chose to put some energy into it (sorry no rust coding for now...). |
As of this proposal, i studied your comments in #172 and want to find a way to solve that. If you bear with me, maybe we can come up with something.
Yes, that's my use case too, but with rclone.
Can you elaborate? It is not so far from what you said might work in #172.
(Hmm, you did not finish the sentence, and i'm unable to guess) |
Or, let me elaborate the process, maybe this helps. (As i understand it,) GC as described in #132 maintains 2 directories with magic names, and in #172 you said if it helps we might rename them to cur and old. The above breaks doen to: There is only cur. When we start, we create a local temporary directory, and in it for each cur chunk an empty file in tmp-old. For each used chunk, we delete its placeholder in tmp-old (we could copy to tmp-cur, but we won't need it anyway). In the end, for every left over placeholder in tmp-old, we delete the corresponding chunk. And in the end, we did approximately the same as above, but the problems in #172 and additionally those i mentioned in the IS won't manifest. Or in other words, nothing bad can happen as the effects are the same as in current GC, but som bad things wont happen. For the syncthing case, it's less file renaming, and for the rsync (@pfernie) and rclone (@geek-merlin ;-) case, it's relief of file re-copying. (But i'm only proposing this, due to my lack of rust-fu, it will only happen if @pfernie from the other ticket finds this convincing and gives it a shot.) |
I would also like to keep the implementation here simple, and not fall in the trap of trying to support "all use cases". But I do think there is a scheme that remains simple and shaves some of the warts. @geek-merlin has already linked and summarized some history, but from my understanding the current generational scheme is an elegant solution to the needs of incremental GC, as state is tracked in the filesystem. But outside this implementation detail, generations are not helpful for any use cases in and of themselves. To run with the idea @geek-merlin sketched out here, here is an approach that hopefully retains the elegance of the current scheme, but can avoid the generational implementation. On invoking
I believe this approach supports the contemplated use cases. It is not an explicit intent of this approach, but it seems this does also facilitate GC being carried out on multiple hosts for a store. That is, for a store synchronized between multiple hosts, initiating a Some additional comments:
|
I don't think you understand the original use-case well. I have 2 or more laptops. There's dropbox/syncthing syncing between them the rdedup repo. But they are not connected with each other at all times. The network can be down, syncthing can be down, laptop can be turned off, etc. Each can be doing anything they want - GC, backup, etc, doesn't matter. And when connectivity is restored, they will do dumb file sync and everything will still work, no data can lost, and no file conflicts ever happen. That's what the current scheme was designed to support. Before actually deleting any files, there's a timeout (in days), to allow time for sync. All that is required is that all hosts sync everything at least that frequently. If any host introduced files to the old generation that was GCed on another host, from all hosts perspective, it looks like the incremental GC was not actually completed. On the next GC run new files in old generation will be brought into the new generation re-completing the garbage collection, and timeout will be restarted. Eventually all hosts will be aware of the new generation and write to the latest generation directly, and old generation(s) will be considered stale and all files in it will be actually removed. I'm quite fond of this scheme, because it's a fully distributed p2p garbage collection system, where everything is idempotent and immutable. No "locked during this enumeration" can work in such a use-case, because there is no locking possible. And there can't be any mutable files, because dropbox/syncthing can not synchronize between concurent changes to the same file. Now, after this was all working, there came the use-case of "remote storage". So instead of having N copies of all files, synchronized with dropbox, there's now just one copy remotely accessed by all hosts. It's a completely different scheme. And there locking (of the storage) was added and development went forward. But personally I'm still syncing with syncthing and using rdedup to scratch my original use-case niche. Now, having said that I kind of read through the design and it's like 90% the same as what's currently being done. There's just no |
The only weakness of the current scheme that I'm aware of is that no data can be ever mutated in place, so things have to change names This is not a problem for https://lincolnloop.com/blog/detecting-file-moves-renames-rsync/ I really think that
|
First of all, thanks for taking the time and effort to review and comment on these suggestions. In writing this up, a new issue occurred to me, which I opened separately as #191. In regards to this thread: you're right, I had an incomplete view of the motivating use case. I had drawn a wrong conclusion that in practice there would only ever be two generations ( The I have some thoughts on adapting the current scheme; they would probably involve altering the store structure a bit, so maybe still not desirable relative to an option like |
Yes. That's exactly what happens and how it works. I think we're in sync now. :) |
Sorry for being off so long. This issue is still important to me, but "the house was on fire"... Thanks a lot @dpc to take all the time and explain your thoughts and use case. Whe i find time i will add that to the "use cases" issue. To be clear: I did not start that issue to mandatorily cover all use cases (though it might turn out). In fact i wanted to get an overview of what people may want (and i want) and refer to that so we can clarify hidden assumptions. |
That said, i still have the hope that we can find a GC strategy that works for all possible scenarios. Just to get everyone onto the same boat: @dpc, we talked about duplicacy's Lock Free Deduplication in another issue and you said it's similar to the current strategy. @pfernie : Do you know and grok that? My hope is that we find inspiration in it that advances this. |
I doubt there's going to be one strategy for all use-cases, but my bet would be that 2, max 3 would be all that is ever needed, and they would be fairly invisible for the rest of the system. |
Problem
In #172, it was discovered that the changing repo dir name causes several problems.
Refs
GC was updated in #32 #75 #132
Currrent situation,
Let's first state the current situation and how it evolved, as reality check.
GC checks for every chunk, if it has a direct or indicŕect reference from a name, and only keeps the referenced.
There are sophisticated ways to do that without locking, so backup and GC can safely act at the same time, but for the sake of simplicity, we are using a suitable form of locking (creating a lock file in the repo). (But note that #188 might benefit from no locks.)
Some time ago, the GC reference checks had been done in a memory based Key-Value store (in fact we only need the keys), but this soon gets huge and does not scale.
So in #132, this was changed to a method where chunk files were moved when a reference was seen, so the repositlry itself is used as a key-value-store.
While this has scalability wins, it also brings us some drawbacks:
Proposal
Let's combine the best of both approaches and use a scalable KV store like this:
GC service:
The trashlist service can be any scalable (non-memory-only) KV store (we only use keys, and leave value empty).
Let's e.g. start wigh the caves crate and its local file backend, and evaluate its rocksdb backend and maybe other options in a followup.
The text was updated successfully, but these errors were encountered: