Redaction occurring in non-user-input fields #792

ferg-dev · 2024-12-12T00:59:58Z

Describe the bug
Redaction is checked on all keys in request/response objects, with the sole exclusion of FirstSeen and LastSeen. This can lead to invalid data and failed imports into OpenSearch. Adding more keys to the exclusion list is possible, but potentially over 100 keys would need to be added to cover all scenarios.

To Reproduce
Comprehend needs to identify PII. The impact seems greatest when it identifies a single digit/character or a common element (like QID). This appears to show up only in higher-volume usage.

Expected behavior
User input should be redacted when enabled, but bot information, session IDs, timestamps, and settings should be skipped.

Please complete the following information about the solution:

Version: v6.1.1
Region: us-west-2
Was the solution modified from the version published on this repository? Yes
If the answer to the previous question was yes, are the changes available on GitHub? No
Have you checked your service quotas for the services this solution uses? I checked to ensure Comprehend average usage was lower than service threshold
Were there any errors in the CloudWatch Logs? No errors found in Fulfillment Lambda logs with debug enabled.

Screenshots
Comprehend identified '2' as a problematic digit:

All 2's were redacted out of other string fields (with the exception of FirstSeen and LastSeen):

The invalid date caused error importing into OpenSearch:

Additional context
The solution appears to use Comprehend to analyze the input transcript/question for the specific locations of detected PII, then stores the strings at those locations as regex capture groups in an environment variable. A regex then uses this environment variable to redact data in all fields. So if Comprehend recognizes "QID" as PII, "QID" is then removed from all string values within the request and response.
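A minimal sketch of the behavior described above, written in Python purely for illustration. It is not the solution's actual code: the function names, the sample record, and the "X" mask are all hypothetical, and it assumes the detected strings are joined into a single alternation pattern that is applied to every string value.

```python
import re


def build_redaction_pattern(detected_strings):
    """Join detected PII strings into one alternation regex (assumed behavior)."""
    return re.compile("|".join(re.escape(s) for s in detected_strings))


def redact_all_strings(value, pattern, mask="X"):
    """Recursively apply the pattern to every string in a nested record."""
    if isinstance(value, str):
        return pattern.sub(mask, value)
    if isinstance(value, dict):
        return {k: redact_all_strings(v, pattern, mask) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_all_strings(v, pattern, mask) for v in value]
    return value


# Hypothetical record mixing user input with system-generated fields.
record = {
    "inputTranscript": "my id is QID 2",
    "qid": "QID.001",
    "datetime": "2024-12-12T00:59:58Z",
}

# Suppose "QID" and "2" were both flagged as PII in the user input.
pattern = build_redaction_pattern(["QID", "2"])
redacted = redact_all_strings(record, pattern)

print(redacted["inputTranscript"])  # my id is X X
print(redacted["qid"])              # X.001
print(redacted["datetime"])         # X0X4-1X-1XT00:59:58Z
```

Under these assumptions the user input is redacted as intended, but the identifier and the timestamp are rewritten as well, and the timestamp is no longer a valid date.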

fhoueto-amz · 2024-12-12T20:17:19Z

Thanks. We will review and get back to you.

abhirpat · 2024-12-20T19:27:06Z

Hi @ferg-dev, thanks for reporting this. I tried to reproduce this on the latest v6.1.5 but was unsuccessful. In my test, I enabled redacting using Comprehend and provided SSN (all 2s), credit card (all 2s), and number 2. The redaction appeared to work successfully based on what I observed in OpenSearch Dashboards > Discover.

To help further investigate, please let us know:

1. You mentioned you're using v6.1.1 in the description. Are you able to see this issue in the latest version? We released improvements in v6.1.4, and it would be helpful if you could try this version since I am unable to reproduce the issue.
2. In which logs did you see the parsing error you posted?
3. You mentioned it failed. Could you please confirm whether you are performing any manual imports outside of the automatic workflows in QnABot?
4. Are you customizing other settings, such as the redaction confidence score, and setting it lower than the default 0.99? I tried both 0.99 and 0.8 but was unable to reproduce the issue.

abhirpat · 2025-01-03T19:07:27Z

Hi @ferg-dev, have you had a chance to see my previous message? Kindly let me know if you have any further information on the questions I posted.

ferg-dev · 2025-01-03T22:01:09Z

@abhirpat

I was able to reproduce the issue:

1. Deployed the 6.1.5 CloudFormation template
2. In QnABot Admin panel, went to settings -> security and privacy -> toggled ENABLE_REDACTING_WITH_COMPREHEND to true
3. In the test client, entered the text: "My check fromDec 2,2024 has not been deposited into my account. 2. 2"
4. Checked the OpenSearch logs:

This is just one example, mocked from sampled utterances, that appears to reliably reproduce the issue. I believe you were not able to reproduce it because Comprehend didn't detect the single digit 2 in your user input, while in this particular example it does. This leads to the regex matching the "2" before the "2024" and redacting it prior to matching 2024.
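The ordering effect can be illustrated with a short Python sketch. The timestamp, the "X" mask, and both patterns are made up for the example; the order of alternatives in the solution's real pattern is not confirmed here. Regex alternation tries alternatives left to right, so when "2" comes first it wins at every position and "2024" never matches as a unit.

```python
import re

# Hypothetical system-generated timestamp (not user input).
timestamp = "2024-12-02T10:15:00Z"

# Shorter alternative first: "2" matches wherever "2024" could have started.
short_first = re.compile("2|2024")

# Longer alternative first: "2024" is consumed whole, then stray 2s still match.
long_first = re.compile("2024|2")

short_result = short_first.sub("X", timestamp)
long_result = long_first.sub("X", timestamp)

print(short_result)  # X0X4-1X-0XT10:15:00Z
print(long_result)   # X-1X-0XT10:15:00Z
```

Either ordering leaves the timestamp invalid, which suggests that reordering the pattern alone would not prevent the failed imports as long as the pattern is applied to date fields.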

To answer your questions:

1. See above; reproduced on 6.1.5.
2. When Firehose fails to write to the OpenSearch endpoint, it creates an S3 object that contains this error.
3. This was the Firehose flow included in QnABot. It doesn't happen all the time, only when a digit/letter matches something in a date field and makes it invalid.
4. The confidence score is set to the default of 0.99.

abhirpat · 2025-01-21T16:07:09Z

Thank you for these details, @ferg-dev. I was able to see a few non-user-input keys, like timing, being redacted. Please note that inputTranscript is supposed to be redacted, as that is user input.

Issue Summary:

The issue occurs when Amazon Comprehend incorrectly identifies and redacts certain numerical combinations in strings, followed by other identifiers. This appears to be due to incorrect detection behavior in Comprehend.

Action Items:

We will report this issue to the Amazon Comprehend team internally.
We will review the design and provide some mitigation.

For immediate assistance, you can also open a Support Case through your AWS Console to reach the Comprehend team directly. We'll update this thread once we have more information or the mitigation is implemented.

abhirpat · 2025-01-23T21:58:10Z

Hi @ferg-dev, thank you for your patience. In the latest QnABot release, v7.0.0, we have excluded additional non-user keys to help mitigate this. For more information on this release, please refer to CHANGELOG.md.

Please let us know if you have any questions.

ferg-dev · 2025-01-29T18:55:33Z

Hi @abhirpat, thank you for the update.

I reviewed the changes and ran some tests on version 7.0.0, and I believe it will help in some cases. However, I believe it still leaves a lot of unexcluded fields open to redaction, which leads to data quality issues. I could potentially add the 100+ fields to the exclusion list to prevent this, but that seems an inefficient way to prevent these occurrences, and it leaves the door open for this to recur with future enhancements.

Here's one example from the sandbox where I deployed v7.0.0:

I understand that these are often incorrect identifications from Comprehend, but realistically Comprehend will never be 100% accurate. Are there any plans to separate out the user input portion for redaction rather than redacting the entire record?
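To make the request concrete, here is a rough Python sketch of what redacting only the user-input portion could look like. It is an illustration, not a proposed patch: `USER_INPUT_KEYS`, `redact_user_input`, the "X" mask, and the sample record are hypothetical, and only `inputTranscript` is confirmed in this thread as a user-input field.

```python
import re

# Hypothetical allowlist of keys that carry user input.
# "inputTranscript" comes from this thread; "question" is an assumption.
USER_INPUT_KEYS = {"inputTranscript", "question"}


def redact_user_input(value, pattern, mask="X", key=None):
    """Redact strings only when they sit under an allowlisted key."""
    if isinstance(value, dict):
        return {k: redact_user_input(v, pattern, mask, k) for k, v in value.items()}
    if isinstance(value, list):
        # List items inherit the key of the list that contains them.
        return [redact_user_input(v, pattern, mask, key) for v in value]
    if isinstance(value, str) and key in USER_INPUT_KEYS:
        return pattern.sub(mask, value)
    return value


record = {
    "inputTranscript": "pay 2 now",
    "datetime": "2024-12-02T10:15:00Z",
    "session": {"qid": "QID.2", "question": "pay 2 now"},
}

pattern = re.compile(re.escape("2"))
safe = redact_user_input(record, pattern)

print(safe["inputTranscript"])      # pay X now
print(safe["datetime"])             # 2024-12-02T10:15:00Z
print(safe["session"]["qid"])       # QID.2
print(safe["session"]["question"])  # pay X now
```

With an allowlist, a new system field added in a later release is left alone by default, whereas an exclusion list has to be extended every time such a field appears.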

ferg-dev added the bug label Dec 12, 2024

fhoueto-amz assigned fhoueto-amz and abhirpat and unassigned fhoueto-amz Dec 19, 2024
