I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
currently lookback accepts an integer, representing the [today-n : today] range of the incremental run
in most companies the distribution of delayed data is very skewed towards the newer end of the lookback range [citation needed].
i.e. 90+% of delayed data arrives within 1 day, and then comes the long tail.
to improve efficiency, implement a lookback that accepts [0, 1, n], where n is the greatest possible delay. when running regularly, this would not immediately update the data in [1 < x < n], which saves significant compute by skipping those days. instead, the data would be fully updated after n days, in a rolling fashion.
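to make the proposal concrete, here is a minimal python sketch of which days a single run would touch. it follows the [today-n : today] reading of lookback used in this issue. the names `lookback_offsets` and `batch_dates` are made up for illustration and are not part of dbt.

```python
from datetime import date, timedelta
from typing import Sequence, Union

# hypothetical type: an int (current behaviour) or a list of day offsets (proposal)
Lookback = Union[int, Sequence[int]]


def lookback_offsets(lookback: Lookback) -> list[int]:
    """Return the day offsets (0 = today) that one run would reprocess."""
    if isinstance(lookback, int):
        # integer: contiguous range [today-n : today], no gaps
        return list(range(lookback + 1))
    # list: only the listed offsets, every day in between is skipped
    return sorted(set(lookback))


def batch_dates(lookback: Lookback, today: date) -> list[date]:
    """Translate the offsets into concrete calendar days for a given run day."""
    return [today - timedelta(days=d) for d in lookback_offsets(lookback)]


if __name__ == "__main__":
    run_day = date(2024, 11, 7)
    print(batch_dates(6, run_day))          # 7 days, today back to today-6
    print(batch_dates([0, 1, 6], run_day))  # 3 days: today, today-1, today-6
```

with daily runs, every day is still revisited once it is n days old, which is the rolling correction described above.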
Describe alternatives you've considered
we implemented our own version of this a while ago, with a date range macro that accepts either an integer (range without gaps) or an array of integers (range with gaps, or just specific days).
simplified:
{# reprocessing specific days -#}
{% if lookback is sequence -%}
  ({% for day in lookback %}
    {{ event_date }} between
      current_date - {{ day }}
      and current_date - {{ day }}
    {{ 'or' if not loop.last -}}
  {% endfor -%}
  )
{# reprocessing last x days -#}
{% else -%}
  {{ event_date }} between
    current_date - {{ lookback }}
    and current_date
{% endif -%}
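for readers who want to see what the macro renders without running dbt, here is a rough python mirror of the same branching. it is only a sketch: `date_range_predicate` is a hypothetical helper name, and the output matches the macro only up to whitespace.

```python
from typing import Sequence, Union


def date_range_predicate(event_date: str, lookback: Union[int, Sequence[int]]) -> str:
    """Build the WHERE-clause fragment that the simplified macro would render."""
    if isinstance(lookback, int):
        # reprocessing last x days: one contiguous range
        return f"{event_date} between current_date - {lookback} and current_date"
    # reprocessing specific days: one single-day range per offset, joined with "or"
    clauses = [
        f"{event_date} between current_date - {day} and current_date - {day}"
        for day in lookback
    ]
    return "(" + " or ".join(clauses) + ")"


if __name__ == "__main__":
    print(date_range_predicate("event_date", 3))
    print(date_range_predicate("event_date", [0, 1, 6]))
```

the list branch wraps the whole expression in parentheses so it can be combined safely with other filters.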
Who will this benefit?
any clients...
with large datasets, i.e. where computation is a significant cost factor
who "want all data"
whose delayed data has a typical recency skew
Are you interested in contributing this feature?
yes, if it's as easy as our macro ;)
Anything else?
for perspective, this is currently a blocker for us for implementing microbatches. the advantage of calculating daily batches is completely offset by not being able to skip "plot-irrelevant" days.
> instead, the data would be fully updated after n days, in a rolling fashion.
Hello! Thanks for your feedback on microbatch and for sharing this use case. Just to make sure I understand the feature request, this would essentially allow you to say "wait to process data for X date until Y days have passed"?
Any specific reason you would want this included in the lookback config vs. a new config?
I'm imagining you could configure your microbatch model with something like a delay config. If I set delay: 2, then I would not process any new data unless 2 days have passed between event_time and now.
hi! how this is implemented is secondary, i will be happy either way, since this is currently a blocker for us for moving to microbatch. but the proposed delay would not work (as i understand it). let me clarify the nuance:
we (and maybe others with similar delay skew) want to process some delayed data, but not all. given an example with max delay: 6 days and avg delay ratio: 10%
i want to process today (or not, if plots should only have complete days, but ignore for now)
i want to process today-1: since this is the first full day, with 90% complete data
i want to process today-2: since this day is 98% complete by now
i don't want to process today-3: since this day does not add much delayed data
i don't want to process today-4: since this day does not add much delayed data
i don't want to process today-5: since this day does not add much delayed data
i want to process today-6: since this is the first day with 100% complete data (the 'rolling correction')
hence, i want to skip 3 days of processing, instead of using lookback=6.
as stated, what we implemented is equivalent to lookback=[0,1,2,6].
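the example above can be checked with a small simulation of daily runs. this is a sketch under the assumption of exactly one run per day. `offsets` and `simulate` are hypothetical names and not dbt functions.

```python
from datetime import date, timedelta


def offsets(lookback):
    """Day offsets processed per run: full range for an int, listed days for a list."""
    if isinstance(lookback, int):
        return list(range(lookback + 1))
    return sorted(lookback)


def simulate(lookback, first_run, n_runs):
    """Run once per day; return {event_date: [ages in days at which it was processed]}."""
    history = {}
    for i in range(n_runs):
        run_day = first_run + timedelta(days=i)
        for age in offsets(lookback):
            history.setdefault(run_day - timedelta(days=age), []).append(age)
    return history


if __name__ == "__main__":
    start = date(2024, 1, 1)
    # follow one event date through its first 7 daily runs
    print(simulate(6, start, 7)[start])             # processed at every age 0..6
    print(simulate([0, 1, 2, 6], start, 7)[start])  # processed at ages 0, 1, 2 and 6 only
    # batches per run: 7 with lookback=6, 4 with the sparse list
    print(len(offsets(6)) - len(offsets([0, 1, 2, 6])))
```

each event date is still processed one last time at age 6, so the final state is complete, while every run handles 4 daily batches instead of 7.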