Proposal: chained pipelines #58

Closed
towhans opened this issue Mar 20, 2019 · 7 comments

Comments

@towhans

towhans commented Mar 20, 2019

A batcher would be able to forward messages to another pipeline. Acks would be performed by the very last batcher.

The use case is to have smaller pipelines that can be put together without needing an intermediate external queue or topic.

@josevalim
Member

Hi @towhans, can you please expand on the use case? Generally speaking, you don't want to pass the data through multiple processes, as that incurs copying. So our concern with "connecting pipelines" is that users will end up using pipelines for code organization purposes instead of modelling runtime concerns.

So can you describe why you would need to pass the data around? Thanks!

@towhans
Author

towhans commented Mar 21, 2019

transformer1 -> processor1 -> batcher1
transformer2 -> processor2 -> batcher2

The case is that transformer2 is to be applied after processor1. processor1 is stateful. transformer2 is stateless. If we make:

transformer1 -> processor1 |> transformer2 |> processor2 -> batcher

then we can't specify different parallelization for transformer2.

So the case is about interleaving stateful and stateless transformations.

@josevalim
Member

So the case is about interleaving stateful and stateless transformations.

Which kind of transformations though? What is stateful and what isn't?

In theory, the only benefit of creating new pipelines / new stages is if different parts of those stages depend on different IO resources, and we plan to do it as part of #39. Stateful or stateless should not matter. :)

@towhans
Author

towhans commented Mar 28, 2019

Sorry for taking so long to respond. I had to think it through again. I get your point about avoiding the anti-pattern of using gen_stages for code organization. In our case transformers are stateless and processors are stateful. But that doesn't really matter. The important realization for me is that the "chain of pipelines" is a higher-level thing that can be assembled into one single Broadway pipeline. So I retract the proposal and thank you for your replies. They were very helpful.

@josevalim
Member

josevalim commented Mar 28, 2019

Thanks for following up! The unnecessary creation of processes/stages is exactly what we want to avoid, so when we add multiple processors, we have to be really careful in documenting those concerns!

@kwando

kwando commented Jul 5, 2019

I have a use case for this, I think.

Stream of user ids -> batch lookup of profiles for users -> partition / filter profiles -> do something with batches of profiles.

I can sort of make this work by moving the profile lookup into the producer, but then I need to build out that convenient batching logic myself instead.

@msaraiva
Collaborator

msaraiva commented Jul 5, 2019

Hi @kwando!

Thanks for the feedback.

I believe you'll be able to achieve that after we implement #39.
