[Feature] allow gaps in the lookback range for microbatch #11242

data-blade · 2025-01-27T14:13:09Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

currently lookback accepts an integer n, representing the [today-n : today] range of the incremental run.

in most companies the distribution of delayed data is very skewed towards the newer end of the lookback range [citation needed].

e.g. 90+% of delayed data arrives within 1 day, and then comes the long tail.

to improve efficiency, implement a lookback that also accepts a list such as [0, 1, n], where n is the greatest possible delay. when running regularly, this would not immediately update the data in [1 < x < n], saving significant compute by skipping those days. instead, the data would be fully updated after n days, in a rolling fashion.
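A minimal Python sketch of the proposed semantics, assuming daily batches and the [today-n : today] reading of the integer form given above. The helper name `batch_dates` is hypothetical and is not part of dbt; it only illustrates which days a run would touch under each form.

```python
from datetime import date, timedelta
from typing import Sequence, Union

Lookback = Union[int, Sequence[int]]


def batch_dates(today: date, lookback: Lookback) -> list[date]:
    """Return the daily batches a run on `today` would reprocess, oldest first."""
    if isinstance(lookback, int):
        # Integer form: the contiguous range [today - n, today].
        offsets = range(lookback + 1)
    else:
        # List form (the proposal): only the listed day offsets.
        offsets = lookback
    return sorted(today - timedelta(days=d) for d in set(offsets))


today = date(2025, 1, 27)
contiguous = batch_dates(today, 6)       # every day from today-6 to today
gapped = batch_dates(today, [0, 1, 6])   # today, today-1 and today-6 only
skipped = sorted(set(contiguous) - set(gapped))  # today-5 .. today-2
```

With these inputs the contiguous form yields 7 batches and the gapped form 3, so 4 days are skipped on each run.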

Describe alternatives you've considered

we implemented our own version of this a while ago, with a date range macro that accepts either an integer (range without gaps) or an array of integers (range with gaps, or just specific days).

simplified:

{# reprocessing specific days -#}
{% if lookback is sequence -%}
	({% for day in lookback %}
		{{ event_date }} between
			current_date - {{ day }}
			and current_date - {{ day }}
		{{ 'or' if not loop.last -}}
	{% endfor -%}
	)
{# reprocessing last x days -#}
{% else -%}
	{{ event_date }} between
		current_date - {{ lookback }}
		and current_date
{% endif -%}
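To inspect what the two branches render without running dbt, the same logic can be mirrored in plain Python. This is a sketch only: the function name `lookback_predicate` is hypothetical, and whitespace is normalized rather than matching the macro's output exactly.

```python
from typing import Sequence, Union


def lookback_predicate(event_date: str, lookback: Union[int, Sequence[int]]) -> str:
    """Build the WHERE-clause fragment that the macro's two branches render."""
    if isinstance(lookback, int):
        # Integer: one contiguous range ending today.
        return f"{event_date} between current_date - {lookback} and current_date"
    # Sequence: one single-day range per listed offset, joined with OR.
    clauses = [
        f"{event_date} between current_date - {day} and current_date - {day}"
        for day in lookback
    ]
    return "(" + " or ".join(clauses) + ")"


print(lookback_predicate("event_date", 3))
print(lookback_predicate("event_date", [0, 6]))
```

An empty list renders an empty pair of parentheses here, as it would in the macro, so a real implementation would probably want to guard against that case.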

Who will this benefit?

any clients with...

large datasets, i.e. computation is a significant cost factor
a requirement to "want all data"
delayed data that has a typical recency skew

Are you interested in contributing this feature?

yes, if it's as easy as our macro ;)

Anything else?

for perspective, this is currently a blocker for us for implementing microbatches. the advantage of calculating daily batches is completely offset by not being able to skip "plot-irrelevant" days.


graciegoheen · 2025-02-10T18:39:26Z

> instead, the data would be fully updated after n days, in a rolling fashion.

Hello! Thanks for your feedback on microbatch and for sharing this use case. Just to make sure I understand the feature request, this would essentially allow you to say "wait to process data for X date until Y days have passed"?

Any specific reason you would want this included in the lookback config vs. a new config?

I'm imagining you could configure your microbatch model with something like a delay config. If I set delay: 2, then I would not process any new data unless 2 days have passed between event_time and now.
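One possible reading of that `delay` idea, sketched in Python. `delay` is only a suggestion in this thread, not an existing dbt config, and `is_eligible` is a hypothetical helper; the assumption here is that a daily batch becomes eligible once at least `delay` days separate it from today.

```python
from datetime import date, timedelta


def is_eligible(event_day: date, today: date, delay: int) -> bool:
    """True once at least `delay` full days separate the batch from today."""
    return (today - event_day).days >= delay


today = date(2025, 2, 10)
# Map each day offset (0 = today) to whether a run today would process it.
status = {
    offset: is_eligible(today - timedelta(days=offset), today, delay=2)
    for offset in range(4)
}
print(status)
```

Under this reading, `delay: 2` holds back today and today-1 and lets every older batch through.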

data-blade · 2025-02-11T10:31:03Z

hi! how this is implemented is secondary, i will be happy either way, since this is currently a blocker for us for moving to microbatch. but the proposed delay would not work (as i understand it), let me clarify the nuance:

we (and maybe others with similar delay skew) want to process some delayed data, but not all. given an example with max delay: 6 days and avg delay ratio: 10%

i want to process today (or not, if plots should only have complete days, but ignore for now)
i want to process today-1: since this is the first full day, with 90% complete data
i want to process today-2: since this day is 98% complete by now
i don't want to process today-3: since this day does not add much delayed data
i don't want to process today-4: since this day does not add much delayed data
i don't want to process today-5: since this day does not add much delayed data
i want to process today-6: since this is the first day with 100% complete data (the 'rolling correction')

hence, i want to skip 3 days of processing, instead of lookback=6.
hence, as stated, we implemented the same as lookback=[0,1,2,6].
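The rolling correction in this example can be checked with a small simulation. This is a sketch assuming one run per day and daily batches; `simulate` is a hypothetical helper, not dbt behavior.

```python
from collections import defaultdict
from datetime import date, timedelta

LOOKBACK = [0, 1, 2, 6]  # day offsets from the example above


def simulate(first_run: date, runs: int, lookback: list[int]) -> dict[date, list[int]]:
    """Map each event date to the ages (in days) at which daily runs reprocess it."""
    ages = defaultdict(list)
    for i in range(runs):
        run_day = first_run + timedelta(days=i)
        for offset in lookback:
            ages[run_day - timedelta(days=offset)].append(offset)
    return dict(ages)


history = simulate(date(2025, 2, 1), runs=10, lookback=LOOKBACK)
feb_1 = history[date(2025, 2, 1)]        # ages at which Feb 1 was reprocessed
batches_per_run = len(LOOKBACK)          # batches per run with gaps
contiguous_batches = max(LOOKBACK) + 1   # batches per run for lookback=6 without gaps
```

Each event date fully covered by the simulation is reprocessed at ages 0, 1, 2 and 6, so it is final after 6 days, while every run handles 4 batches instead of 7.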

data-blade added the enhancement and triage labels Jan 27, 2025

graciegoheen added awaiting_response and removed triage labels Feb 10, 2025

QMalcolm mentioned this issue Feb 10, 2025

[EPIC] Microbatch Follow-ups and Bug Fixes #11292

Open

github-actions bot added triage and removed awaiting_response labels Feb 11, 2025
