[8.17](backport #41817) [aws] [s3] Introduce ignore_older & start_timestamp for S3 input allowing better registry cleanups #42717

Merged
merged 2 commits into from
Feb 14, 2025

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Feb 14, 2025

Proposed commit message

Introduce ignore_older and start_timestamp properties to AWS S3 input. This is a follow-up for #41694.

The configurations introduced here act as input object filters. If an object fails to match the derived filters, its entry is cleaned up from the registry, reducing Filebeat memory consumption.

The introduced configurations are:

  • ignore_older: A time duration (look-back period); objects last modified within this duration, counted back from the current time, are accepted for processing
  • start_timestamp: A timestamp from which objects are accepted for processing

For both options, the object's last modified timestamp is used for the comparison. See the Use cases section for further explanation.
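
The two comparisons can be sketched in Go roughly as follows. This is a minimal illustration and not the code from the PR: the function names, the sample timestamps, and the treatment of an object whose last modified time falls exactly on a boundary are all assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Sample values used only for illustration.
var (
	now          = time.Date(2025, 2, 14, 12, 0, 0, 0, time.UTC)
	startTime    = time.Date(2025, 2, 14, 0, 0, 0, 0, time.UTC)
	lookBack     = 2 * time.Hour
	oldObject    = time.Date(2025, 2, 13, 6, 0, 0, 0, time.UTC)
	recentObject = time.Date(2025, 2, 14, 11, 0, 0, 0, time.UTC)
)

// acceptedByStartTimestamp reports whether an object's last modified time is
// after the configured start timestamp. Hypothetical helper; excluding the
// exact boundary is an assumption.
func acceptedByStartTimestamp(lastModified, start time.Time) bool {
	return lastModified.After(start)
}

// acceptedByIgnoreOlder reports whether lastModified falls within the
// look-back window of length ignoreOlder that ends at now.
func acceptedByIgnoreOlder(lastModified, now time.Time, ignoreOlder time.Duration) bool {
	return lastModified.After(now.Add(-ignoreOlder))
}

func main() {
	fmt.Println(acceptedByStartTimestamp(oldObject, startTime))    // false
	fmt.Println(acceptedByStartTimestamp(recentObject, startTime)) // true
	fmt.Println(acceptedByIgnoreOlder(oldObject, now, lookBack))    // false
	fmt.Println(acceptedByIgnoreOlder(recentObject, now, lookBack)) // true
}
```

Note that the start timestamp is a fixed point in time, while the ignore_older cutoff moves forward with every evaluation because it is derived from the current time.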

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None, as both options are disabled by default. However, when the configurations introduced here are used, the following can have an impact on the user:

  • When start_timestamp is defined, objects with a last modified timestamp prior to that timestamp are excluded from processing (documented 1)
  • When ignore_older is defined, objects that do not fall within the look-back period when processing starts (polling run) are ignored (documented 1)
  • When both start_timestamp & ignore_older are defined, the initial run will process all entries back to start_timestamp. Subsequent runs will not include entries that fall outside ignore_older, even if processing failed for an object. (documented 1)

How to test this PR locally

  • Build filebeat from the changeset included in the PR
  • Prepare a source S3 bucket with objects (you may use this tool 2 to create entries)
  • Try configuring filebeat with different values for ignore_older & start_timestamp to see how data ingestion changes with their values. See the Use cases section for further explanation

Related issues

Use cases

Consider the diagrams below, which show three objects, Object A, Object B and Object C, with last modified timestamps of t1, t2 and t3 respectively.

Then consider how filebeat processes objects and tracks registry entries in the following scenarios.

Default behavior

If none of the configurations are used, filebeat processes all objects, and the internal registry keeps tracking them until they are removed from the bucket.

image

Use start_timestamp

If start_timestamp is used, objects newer than the timestamp are accepted for processing. The registry will keep growing unless objects are removed from the bucket by other means (e.g. a lifecycle policy).

image

Use ignore_older

If ignore_older is defined, the input will process objects whose last modified timestamp falls within the provided duration, calculated back from the current time. The registry will track objects within the current time frame, and the others will eventually get cleaned up by subsequent runs.
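
The registry cleanup described here can be pictured with the following Go sketch. It is only an illustration under stated assumptions: the `registry` type, the `prune` method, the object keys, and the sample times are hypothetical and do not reflect Filebeat's internal state store.

```go
package main

import (
	"fmt"
	"time"
)

// registry maps an object key to its last modified time. It is a simplified
// stand-in for the input's state tracking, not its real data structure.
type registry map[string]time.Time

// prune removes every entry whose last modified time is not after
// now - ignoreOlder and returns the number of removed entries.
func (r registry) prune(now time.Time, ignoreOlder time.Duration) int {
	cutoff := now.Add(-ignoreOlder)
	removed := 0
	for key, lastModified := range r {
		if !lastModified.After(cutoff) {
			delete(r, key) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}

// Sample values used only for illustration.
var (
	pollTime = time.Date(2025, 2, 14, 12, 0, 0, 0, time.UTC)
	window   = 24 * time.Hour
)

// sampleRegistry returns three tracked objects with increasing timestamps.
func sampleRegistry() registry {
	return registry{
		"object-a": pollTime.Add(-72 * time.Hour), // t1
		"object-b": pollTime.Add(-30 * time.Hour), // t2
		"object-c": pollTime.Add(-1 * time.Hour),  // t3
	}
}

func main() {
	r := sampleRegistry()
	removed := r.prune(pollTime, window)
	fmt.Println(removed, len(r)) // 2 1
}
```

Because the cutoff is recomputed from the current time on every run, an entry that is tracked today can fall outside the window on a later run and be dropped then.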

image

Use both ignore_older & start_timestamp

If both properties are defined,

  • The initial run will include entries newer than start_timestamp (the ignore_older duration is not applied).
  • Subsequent runs will only consider entries within the ignore_older duration.
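
A rough Go sketch of this run-dependent behavior follows. The `filterConfig` type, its `accept` method, and the sample values are hypothetical, and the sketch assumes that start_timestamp continues to apply on subsequent runs alongside ignore_older; the PR description does not state this explicitly.

```go
package main

import (
	"fmt"
	"time"
)

// filterConfig holds both options; zero values mean "not configured".
type filterConfig struct {
	startTimestamp time.Time
	ignoreOlder    time.Duration
}

// accept reports whether an object should be processed in a polling run.
// On the initial run with a start timestamp configured, ignore_older is
// skipped. Keeping the start timestamp check on later runs is an assumption.
func (c filterConfig) accept(lastModified, now time.Time, initialRun bool) bool {
	hasStart := !c.startTimestamp.IsZero()
	if hasStart && !lastModified.After(c.startTimestamp) {
		return false
	}
	if c.ignoreOlder > 0 && !(initialRun && hasStart) {
		return lastModified.After(now.Add(-c.ignoreOlder))
	}
	return true
}

// Sample values used only for illustration.
var (
	pollTime = time.Date(2025, 2, 14, 12, 0, 0, 0, time.UTC)
	cfg      = filterConfig{
		startTimestamp: time.Date(2025, 2, 10, 0, 0, 0, 0, time.UTC),
		ignoreOlder:    24 * time.Hour,
	}
	objA = time.Date(2025, 2, 9, 0, 0, 0, 0, time.UTC)   // before start_timestamp
	objB = time.Date(2025, 2, 12, 0, 0, 0, 0, time.UTC)  // after start, outside ignore_older
	objC = time.Date(2025, 2, 14, 11, 0, 0, 0, time.UTC) // inside ignore_older
)

func main() {
	for _, initial := range []bool{true, false} {
		fmt.Println(initial,
			cfg.accept(objA, pollTime, initial),
			cfg.accept(objB, pollTime, initial),
			cfg.accept(objC, pollTime, initial))
	}
	// Output:
	// true false true true
	// false false false true
}
```

In this sketch Object B is processed only if the initial run picks it up, which mirrors the caveat that a failed object outside the ignore_older window is not retried on later runs.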

image


This is an automatic backport of pull request #41817 done by [Mergify](https://mergify.com).

Footnotes

  1. https://github.com/elastic/beats/pull/41817/files#diff-422765b7341c5bbf6de7af38927e34e00a5073b188585a7af3c4fee1175b64a6

  2. https://github.com/Kavindu-Dodan/data-gen

…wing better registry cleanups (#41817)

* add changelog entry

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* sort config entries

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* introduce ignore old and start timestamp configurations and document them

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* add filtering logic

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* filter tests

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* add component test for filtering and fix lint issues

Signed-off-by: Kavindu Dodanduwa <[email protected]>

# Conflicts:
#	x-pack/filebeat/input/awss3/s3_test.go

* add changelog entry

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* improve documentation

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* review changes - improve naming, change signature and improve documentation

Signed-off-by: Kavindu Dodanduwa <[email protected]>

---------

Signed-off-by: Kavindu Dodanduwa <[email protected]>
(cherry picked from commit 4ba7d1c)

# Conflicts:
#	x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
#	x-pack/filebeat/input/awss3/s3_test.go
@mergify mergify bot added backport conflicts There is a conflict in the backported pull request labels Feb 14, 2025
@mergify mergify bot requested review from a team as code owners February 14, 2025 14:59
@mergify mergify bot requested review from rdner and VihasMakwana and removed request for a team February 14, 2025 14:59
Contributor Author

mergify bot commented Feb 14, 2025

Cherry-pick of 4ba7d1c has failed:

On branch mergify/bp/8.17/pr-41817
Your branch is up to date with 'origin/8.17'.

You are currently cherry-picking commit 4ba7d1c9a.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   CHANGELOG.next.asciidoc
	modified:   x-pack/filebeat/_meta/config/filebeat.inputs.reference.xpack.yml.tmpl
	modified:   x-pack/filebeat/filebeat.reference.yml
	modified:   x-pack/filebeat/input/awss3/config.go
	modified:   x-pack/filebeat/input/awss3/config_test.go
	modified:   x-pack/filebeat/input/awss3/input_benchmark_test.go
	new file:   x-pack/filebeat/input/awss3/s3_filters.go
	new file:   x-pack/filebeat/input/awss3/s3_filters_test.go
	modified:   x-pack/filebeat/input/awss3/s3_input.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
	both modified:   x-pack/filebeat/input/awss3/s3_test.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 14, 2025
@pierrehilbert pierrehilbert added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:Cloud-Monitoring Label for the Cloud Monitoring team labels Feb 14, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 14, 2025
Signed-off-by: Kavindu Dodanduwa <[email protected]>
@Kavindu-Dodan Kavindu-Dodan force-pushed the mergify/bp/8.17/pr-41817 branch from c2a9343 to f435c7c Compare February 14, 2025 16:10
@Kavindu-Dodan Kavindu-Dodan added the Team:obs-ds-hosted-services Label for the Observability Hosted Services team label Feb 14, 2025
@elasticmachine
Collaborator

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@Kavindu-Dodan Kavindu-Dodan merged commit 6b8489b into 8.17 Feb 14, 2025
22 checks passed
@Kavindu-Dodan Kavindu-Dodan deleted the mergify/bp/8.17/pr-41817 branch February 14, 2025 19:27
Labels
backport conflicts There is a conflict in the backported pull request Team:Cloud-Monitoring Label for the Cloud Monitoring team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:obs-ds-hosted-services Label for the Observability Hosted Services team
3 participants