[Feature] allow gaps in the lookback range for microbatch #11242

Open
3 tasks done
Tracked by #11292
data-blade opened this issue Jan 27, 2025 · 2 comments
Labels
enhancement (New feature or request), triage

Comments

@data-blade

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

currently lookback accepts an integer n, representing the [today-n : today] range of the incremental run

in most companies the distribution of delayed data is very skewed towards the newer end of the lookback range [citation needed].

i.e. 90+% of delayed data has arrived after 1 day, and then comes the long tail.

to improve efficiency, implement a lookback that accepts a list such as [0, 1, n], where n is the greatest possible delay. on regular runs, the days in between (1 < x < n) would not be updated immediately; skipping them saves significant compute. instead, the data would be fully updated after n days, in a rolling fashion.
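
a minimal sketch of the batch selection this implies, in plain python rather than against dbt's internals. `batch_dates` is a hypothetical helper, not an existing dbt function; it assumes daily batches, treats an integer as the contiguous range and a list as explicit day offsets.

```python
from datetime import date, timedelta
from typing import Sequence, Union


def batch_dates(today: date, lookback: Union[int, Sequence[int]]) -> list[date]:
    """Return the daily batches to reprocess, newest first.

    Hypothetical helper (not part of dbt): an int n means the contiguous
    range [today - n, today]; a sequence means only those day offsets.
    """
    if isinstance(lookback, int):
        offsets = range(lookback + 1)
    else:
        # de-duplicate and sort so the result is stable for any input order
        offsets = sorted(set(lookback))
    return [today - timedelta(days=d) for d in offsets]


today = date(2025, 1, 27)
contiguous = batch_dates(today, 6)      # every day from today back to today-6
gapped = batch_dates(today, [0, 1, 6])  # only today, today-1 and today-6
```

with these inputs the contiguous form selects 7 batches per run and the gapped form selects 3.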

Describe alternatives you've considered

we implemented our own version of this a while ago, with a date range macro that accepts either an integer (range without gaps) or an array of integers (range with gaps, or just specific days).

simplified:

{# reprocessing specific days -#}
{% if lookback is sequence -%}
	({% for day in lookback %}
		{{ event_date }} between
			current_date - {{ day }}
			and current_date - {{ day }}
		{{ 'or' if not loop.last -}}
	{% endfor -%}
	)
{# reprocessing last x days -#}
{% else -%}
	{{ event_date }} between
		current_date - {{ lookback }}
		and current_date
{% endif -%}
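
to make the rendered output easier to see, here is a rough python mirror of the macro's branching. `date_filter` is a hypothetical name used only for illustration; it writes the per-day condition as `= x`, which is equivalent to the macro's `between x and x`, and it rejects an empty list because that would render an empty pair of parentheses.

```python
from typing import Sequence, Union


def date_filter(event_date: str, lookback: Union[int, Sequence[int]]) -> str:
    """Build the SQL predicate the macro would render (illustrative only)."""
    if isinstance(lookback, int):
        # reprocessing the last x days
        return f"{event_date} between current_date - {lookback} and current_date"
    if not lookback:
        raise ValueError("lookback list must contain at least one day offset")
    # reprocessing specific days
    clauses = [f"{event_date} = current_date - {int(day)}" for day in lookback]
    return "(" + " or ".join(clauses) + ")"


print(date_filter("event_date", [0, 1, 6]))
print(date_filter("event_date", 3))
```

the parentheses around the `or` chain matter as soon as the predicate is combined with other conditions via `and`.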

Who will this benefit?

any clients...

  • with large datasets, i.e. where computation is a significant cost factor
  • who "want all data"
  • whose delayed data has a typical recency skew

Are you interested in contributing this feature?

yes, if it's as easy as our macro ;)

Anything else?

for perspective, this is currently a blocker for us for implementing microbatches. the advantage of calculating daily batches is completely offset by not being able to skip "plot-irrelevant" days.

data-blade added the enhancement and triage labels on Jan 27, 2025
@graciegoheen
Contributor

> instead, the data would be fully updated after n days, in a rolling fashion.

Hello! Thanks for your feedback on microbatch and for sharing this use case. Just to make sure I understand the feature request, this would essentially allow you to say "wait to process data for X date until Y days have passed"?

Any specific reason you would want this included in the lookback config vs. a new config?

I'm imagining you could configure your microbatch model with something like a delay config. If I set delay: 2, then I would not process any new data unless 2 days have passed between event_time and now.
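
A rough sketch of how such a delay check might behave, assuming daily batches. Note that `delay` is only the config floated in this comment, not an existing dbt option, and `is_eligible` is a hypothetical helper name.

```python
from datetime import date, timedelta


def is_eligible(event_day: date, today: date, delay: int) -> bool:
    """A batch is processed only once at least `delay` days have passed."""
    return (today - event_day).days >= delay


today = date(2025, 2, 1)
# day offsets within a 6-day window that would be processed with delay=2
eligible = [
    offset
    for offset in range(7)
    if is_eligible(today - timedelta(days=offset), today, delay=2)
]
print(eligible)
```

Under this reading, today and today-1 are held back, and every older day in the window is still processed.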

@data-blade
Author

data-blade commented Feb 11, 2025

hi! how this is implemented is secondary, i will be happy either way, since this is currently a blocker for us for moving to microbatch. but the proposed delay would not work (as i understand it), let me clarify the nuance:

we (and maybe others with similar delay skew) want to process some delayed data, but not all. given an example with max delay: 6 days and avg delay ratio: 10%

  • i want to process today (or not, if plots should only have complete days, but ignore for now)
  • i want to process today-1: since this is the first full day, with 90% complete data
  • i want to process today-2: since this day is 98% complete by now
  • i don't want to process today-3: since this day does not add much delayed data
  • i don't want to process today-4: since this day does not add much delayed data
  • i don't want to process today-5: since this day does not add much delayed data
  • i want to process today-6: since this is the first day with 100% complete data (the 'rolling correction')

hence, i want to skip 3 days of processing (today-3 through today-5) compared to lookback=6.
which is why, as stated, we implemented the equivalent of lookback=[0,1,2,6].
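
the example above can be checked with a small simulation. this is only a sketch in plain python, assuming one run per day and daily batches; the names are made up for illustration and nothing here touches dbt itself.

```python
from datetime import date, timedelta

GAPPED_LOOKBACK = [0, 1, 2, 6]  # day offsets processed on every run
MAX_DELAY = 6                   # greatest possible delay, in days

# offsets a contiguous lookback=6 would process but the gapped list skips
skipped = [d for d in range(MAX_DELAY + 1) if d not in GAPPED_LOOKBACK]

# follow a single event day across 10 consecutive daily runs and record
# how old it is each time a run reprocesses it
event_day = date(2025, 2, 1)
ages_processed = []
for run in range(10):
    run_day = event_day + timedelta(days=run)
    age = (run_day - event_day).days
    if age in GAPPED_LOOKBACK:
        ages_processed.append(age)

batches_per_run_gapped = len(GAPPED_LOOKBACK)
batches_per_run_contiguous = MAX_DELAY + 1
print(skipped, ages_processed, batches_per_run_gapped, batches_per_run_contiguous)
```

each event day is processed at ages 0, 1, 2 and then once more at age 6 (the rolling correction), so every run handles 4 batches instead of 7.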

2 participants