Backport PR #16968 to 8.x: Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string #17019

github-actions · 2025-02-05T10:15:49Z

Backport PR #16968 to 8.x branch, original message:

Release notes

[rn:skip]

What does this PR do?

This is a second attempt to fix the processing of tokens from the tokenizer after a buffer full error. The first attempt, #16482, was rolled back because of the encoding error #16694.
The first attempt failed to return the tokens in the same encoding as the input.
This PR does a couple of things:

accumulates the tokens, so that after a buffer full condition it can resume with the tokens that follow the offending one.
respects the encoding of the input string. It uses the concat method instead of addAll, which avoids converting RubyString to String and back to RubyString. When returning the head StringBuilder, it enforces the encoding of the input charset.
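
The two points above can be illustrated with a minimal plain-Ruby sketch. This is not the Logstash implementation: the class `SketchTokenizer`, its `BufferFullError`, and the keyword arguments are hypothetical names chosen for illustration, and the sketch assumes a single-byte delimiter. It queues complete tokens so that a later call can return the ones that follow an oversized token, and it re-applies the encoding of the input to everything it returns.

```ruby
# Hypothetical sketch, NOT the Logstash BufferedTokenizer.
class SketchTokenizer
  class BufferFullError < StandardError; end

  def initialize(delimiter: "\n", size_limit: nil)
    @delimiter  = delimiter.b       # work on raw bytes internally
    @size_limit = size_limit
    @buffer     = String.new        # binary buffer holding the partial token
    @pending    = []                # complete tokens not yet handed out
    @dropping   = false             # true while skipping an oversized token
    @encoding   = Encoding::UTF_8
  end

  # Returns the complete tokens seen so far, or raises BufferFullError when a
  # token exceeds the size limit. Tokens after the offending one stay queued
  # and are returned by the next call.
  def extract(data)
    @encoding = data.encoding
    @buffer << data.b
    overflow = false

    while (idx = @buffer.index(@delimiter))
      token = @buffer.slice!(0, idx)
      @buffer.slice!(0, @delimiter.bytesize)
      if @dropping
        @dropping = false           # end of the oversized token: discard it
      elsif too_big?(token)
        overflow = true             # complete but oversized: discard it
      else
        @pending << token
      end
    end

    if !@dropping && too_big?(@buffer)
      @dropping = true              # partial token already over the limit
      overflow = true
    end
    @buffer.clear if @dropping

    raise BufferFullError, "input buffer full" if overflow
    drain
  end

  # Returns whatever partial token is left, in the input's encoding.
  def flush
    rest = @buffer.dup.force_encoding(@encoding)
    @buffer.clear
    rest
  end

  private

  def too_big?(str)
    @size_limit && str.bytesize > @size_limit
  end

  def drain
    tokens = @pending.map { |t| t.force_encoding(@encoding) }
    @pending = []
    tokens
  end
end

tok = SketchTokenizer.new(size_limit: 5)
begin
  tok.extract("aaaaaaaaaa\nbb\ncc\nd")  # first token exceeds the limit
rescue SketchTokenizer::BufferFullError
  # the oversized token is dropped; "bb" and "cc" stay queued
end
resumed = tok.extract("")               # ["bb", "cc"]
rest    = tok.flush                     # "d"

latin = SketchTokenizer.new
latin.extract("\xA3".b.force_encoding(Encoding::ISO_8859_1))
pound = latin.flush                     # still ISO-8859-1, single byte 0xA3
```

Keeping the buffer binary and calling `force_encoding` only on the way out avoids any transcoding of the payload, which is the property the second point is about.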

Why is it important/What is the impact to the user?

It makes the tokenizer usable in contexts where a line is bigger than the size limit.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files (and/or docker env variables)~~
I have added tests that prove my fix is effective or that my feature works

Author's Checklist

[ ]

How to test this PR locally

The test plan has two parts:

one checks that size limiting behaves as expected. For this, follow the instructions in "BufferedTokenizer doesn't dice correctly the payload when restart processing after buffer full error" #16483.
the other verifies that the encoding is respected.

How to test the encoding is respected

Start a REPL with Logstash and exercise the tokenizer:

$> bin/logstash -i irb
> buftok = FileWatch::BufferedTokenizer.new
> buftok.extract("\xA3".force_encoding("ISO8859-1")); buftok.flush.bytes

or use the following script:

require 'socket'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

text = "\xA3" # the £ symbol in ISO-8859-1 aka Latin-1
text.force_encoding("ISO-8859-1")
socket.puts(text)

socket.close

with Logstash run as:

bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"

In the output, £ has to be present, not Â£.
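
As a hedged illustration of why Â£ would show up, the plain-Ruby sketch below reproduces the general double-transcoding effect outside Logstash. It assumes the garbled output comes from UTF-8 bytes being read again as ISO-8859-1; the variable names are illustrative only.

```ruby
# One byte, 0xA3: the pound sign in ISO-8859-1.
latin1  = "\xA3".b.force_encoding(Encoding::ISO_8859_1)

# Correct path: transcode once to UTF-8, giving bytes C2 A3.
correct = latin1.encode(Encoding::UTF_8)

# Faulty path: the UTF-8 bytes are treated as ISO-8859-1 and transcoded again,
# so 0xC2 becomes "Â" and 0xA3 becomes "£".
garbled = correct.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts correct
puts garbled
```

If the tokenizer hands back the original single byte in the input's encoding, only the first path is taken.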

Related issues

…n respecting the encoding of the input string (#16968)

Makes the tokenizer usable in contexts where a line is bigger than a limit.

Fixes an issue related to the token size limit error: when the offending token was bigger than the input fragment, the tokenizer was unable to recover the token stream from the first delimiter after the offending token and instead lost part of the tokens.

## How the problem is solved

This is a second attempt to fix the processing of tokens from the tokenizer after a buffer full error. The first attempt, #16482, was rolled back because of the encoding error #16694. The first attempt failed to return the tokens in the same encoding as the input. This PR does a couple of things:

- accumulates the tokens, so that after a buffer full condition it can resume with the tokens that follow the offending one.
- respects the encoding of the input string. It uses the `concat` method instead of `addAll`, which avoids converting RubyString to String and back to RubyString. When returning the head `StringBuilder`, it enforces the encoding of the input charset.

(cherry picked from commit 1c8cf54)

elastic-sonarqube · 2025-02-05T10:30:25Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

elasticmachine · 2025-02-05T10:40:07Z

💚 Build Succeeded

Buildkite Build
Commit: c5527e6

github-actions bot added backport v8.19.0 labels Feb 5, 2025

andsel merged commit 40a7fdf into 8.x Feb 5, 2025
5 checks passed

andsel deleted the backport_16968_8.x branch February 5, 2025 11:13
