Redaction occurring in non-user-input fields #792

Open
6 tasks done
ferg-dev opened this issue Dec 12, 2024 · 7 comments

Comments

@ferg-dev

Describe the bug
Redaction is checked on all keys in request/response objects, apart from the specific exclusion of FirstSeen and LastSeen. This can lead to invalid data and failed imports to OpenSearch. Adding more keys to the exclusion list is possible, but potentially over 100 keys would need to be added to cover all scenarios.

To Reproduce
Comprehend needs to identify PII in the user input. The issue seems most impactful when Comprehend flags a single digit or character, or a common element (like QID). It appears to show up only under higher-volume usage.

Expected behavior
User input should be redacted when enabled, but bot information, session IDs, timestamps, and settings should be skipped.

Please complete the following information about the solution:

  • Version: v6.1.1
  • Region: us-west-2
  • Was the solution modified from the version published on this repository? Yes
  • If the answer to the previous question was yes, are the changes available on GitHub? No
  • Have you checked your service quotas for the services this solution uses? I checked to ensure Comprehend average usage was lower than service threshold
  • Were there any errors in the CloudWatch Logs? No errors found in Fulfillment Lambda logs with debug enabled.

Screenshots
Comprehend identified '2' as a problematic digit:
1_question
All 2's were redacted out of other string fields (with the exception of FirstSeen and LastSeen):
2_other_values_redacted
The invalid date caused error importing into OpenSearch:
Screenshot 2024-12-11 174337

Additional context
The solution appears to use Comprehend to analyze the input transcript/question for the specific locations with detected PII, then stores the strings at those locations as regex capture groups in an environment variable. A regex built from this environment variable is then used to redact data in all fields. So if Comprehend recognizes "QID" as PII, that QID is then removed from all string values within the request and response.
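
The mechanism described above can be sketched roughly as follows. This is a minimal illustration in Python, not QnABot's actual code: the function names, the record layout, the `XXXXXX` replacement token, and the sample values are all assumptions made for the example. Only the `FirstSeen` and `LastSeen` exclusions come from the report.

```python
import re

# Keys skipped by redaction, per the report. Everything else is scanned.
EXCLUDED_KEYS = {"FirstSeen", "LastSeen"}
REPLACEMENT = "XXXXXX"  # hypothetical replacement token


def build_pattern(detected):
    """Join the strings flagged as PII into one alternation regex."""
    return re.compile("|".join(re.escape(s) for s in detected))


def redact_record(value, pattern, key=None):
    """Recursively redact every string value except those under excluded keys."""
    if key in EXCLUDED_KEYS:
        return value
    if isinstance(value, dict):
        return {k: redact_record(v, pattern, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_record(v, pattern) for v in value]
    if isinstance(value, str):
        return pattern.sub(REPLACEMENT, value)
    return value


# Hypothetical record: only "question" is user input.
record = {
    "question": "My account number is 2",
    "FirstSeen": "2024-12-11T17:43:37Z",
    "timestamp": "2024-12-11T17:43:37Z",
    "session": {"id": "abc-123"},
}

# Suppose the single digit "2" was flagged as PII in the question.
redacted = redact_record(record, build_pattern(["2"]))
print(redacted["timestamp"])  # no longer a valid date string
```

In this sketch the excluded `FirstSeen` value survives, while the non-excluded `timestamp` and the session ID are rewritten even though neither contains user input.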

@ferg-dev ferg-dev added the bug label Dec 12, 2024
@fhoueto-amz
Contributor

Thanks. We will review and get back to you.

@abhirpat
Member

abhirpat commented Dec 20, 2024

Hi @ferg-dev, thanks for reporting this. I tried to reproduce this on the latest v6.1.5 but was unsuccessful. In my test, I enabled redacting using Comprehend and provided SSN (all 2s), credit card (all 2s), and number 2. The redaction appeared to work successfully based on what I observed in OpenSearch Dashboards > Discover.

Screenshot 2024-12-19 at 4 47 25 PM Screenshot 2024-12-19 at 4 48 16 PM

To help further investigate, please let us know:

  1. You mentioned you're using v6.1.1 in the description. Are you able to see this issue in the latest version? We released improvements in v6.1.4, and it would be helpful if you could try this version since I am unable to reproduce the issue.

  2. In which logs did you see the parsing error you posted?

  3. You mentioned it failed. Could you please confirm whether you are performing any manual imports outside of the automatic workflows in QnABot?

  4. Are you customizing other settings such as redaction confidence score and setting it lower than the default 0.99? I tried both 0.99 and 0.8 but was unable to reproduce the issue.

@abhirpat
Member

abhirpat commented Jan 3, 2025

Hi @ferg-dev, have you had a chance to see my previous message? Kindly let me know if you have any further information on the questions I posted.

@ferg-dev
Author

ferg-dev commented Jan 3, 2025

@abhirpat

I was able to reproduce the issue:

  1. Deployed the 6.1.5 CloudFormation template
  2. In QnABot Admin panel, went to settings -> security and privacy -> toggled ENABLE_REDACTING_WITH_COMPREHEND to true
  3. In the test client, entered the text: "My check fromDec 2,2024 has not been deposited into my account. 2. 2"
  4. Checked the OpenSearch logs:
    image

This is just one example that I've mocked up based on sampled utterances, and it appears to reproduce the issue reliably. I believe the reason you were not able to reproduce it is that Comprehend didn't detect the single digit 2 in your user input, while in this particular example it does. This leads to the regex matching the "2" before the "2024" and redacting it prior to matching 2024.
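
The ordering effect described here can be illustrated with a small Python sketch. It assumes the detected strings are joined into a single alternation regex. The order of the alternatives and the `XXXXXX` token are assumptions for the example, not confirmed details of QnABot. In a leftmost, ordered alternation, a shorter alternative listed first wins inside a longer number.

```python
import re

text = "My check fromDec 2,2024 has not been deposited into my account. 2. 2"

# Hypothetical detection results, joined in two different orders.
short_first = re.compile("|".join(re.escape(s) for s in ["2", "2024"]))
long_first = re.compile("|".join(re.escape(s) for s in ["2024", "2"]))

short_result = short_first.sub("XXXXXX", text)
long_result = long_first.sub("XXXXXX", text)

print(short_result)  # "2024" is broken up digit by digit
print(long_result)   # "2024" is replaced as a whole
```

Listing the longer string first keeps the year intact as one redaction in the user input, but in either order a lone "2" in the pattern would still rewrite every 2 in any other field the regex is applied to.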

To answer your questions:

  1. See above, reproduced on 6.1.5
  2. When Firehose fails to write to the OpenSearch endpoint it creates an S3 object that contains this error.
  3. This was the Firehose flow included in QnABot. It doesn't happen all the time, only when a detected digit or letter matches part of a date field and makes it invalid
  4. Confidence score is set to the default of 0.99

@abhirpat
Member

abhirpat commented Jan 21, 2025

Thank you for these details, @ferg-dev. I was able to see a few non-user-input keys, such as timing, being redacted. Please note that inputTranscript is supposed to be redacted, as it is user input.

Issue Summary:

  • The issue occurs when Amazon Comprehend incorrectly identifies and redacts certain numerical combinations in strings, followed by other identifiers. This appears to be due to incorrect detection behavior in Comprehend.

Action Items:

  • We will report this issue to the Amazon Comprehend team internally.
  • We will review the design and provide some mitigation.

For immediate assistance, you can also open a Support Case through your AWS Console to reach the Comprehend team directly. We'll update this thread once we have more information or the mitigation is implemented.

@abhirpat
Member

Hi @ferg-dev, thank you for your patience. In the latest release, QnABot v7.0.0, we have excluded additional non-user keys to help mitigate this. For more information on this release, please refer to CHANGELOG.md.

Please let us know if you have any questions.

@ferg-dev
Author

Hi @abhirpat, thank you for the update.

I reviewed the changes and ran some tests on version 7.0.0, and I believe it will help in some cases. However, it still leaves many non-excluded fields open to redaction, which leads to data quality issues. I could add the 100+ fields to the exclusion list to prevent this, but that seems an inefficient approach and leaves the door open for the problem to recur with future enhancements.

Here's one example from the sandbox where I deployed v7.0.0:
Image

I understand that these are often incorrect identifications from Comprehend, but realistically Comprehend will never be 100% accurate. Are there any plans to redact only the user input portion rather than the entire record?
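
One way to picture the separation being asked about is an allowlist of user-input fields instead of an exclusion list. The Python sketch below is only an illustration of that idea, not a proposal for QnABot's actual implementation: the function name, the record layout, and the `question` key are hypothetical, and `inputTranscript` is the one field named in this thread.

```python
import re

# Hypothetical allowlist of user-input fields. `inputTranscript` is named in
# this thread; `question` is an assumed example.
USER_INPUT_KEYS = {"inputTranscript", "question"}


def redact_user_input(value, pattern, key=None, replacement="XXXXXX"):
    """Redact strings only when they sit under an allowlisted key."""
    if isinstance(value, dict):
        return {k: redact_user_input(v, pattern, k, replacement)
                for k, v in value.items()}
    if isinstance(value, list):
        # List items inherit the key of the list that contains them.
        return [redact_user_input(v, pattern, key, replacement) for v in value]
    if isinstance(value, str) and key in USER_INPUT_KEYS:
        return pattern.sub(replacement, value)
    return value


# Hypothetical record mixing user input with system fields.
record = {
    "inputTranscript": "My PIN is 2",
    "timestamp": "2024-12-11T17:43:37Z",
    "session": {"id": "abc-123"},
}

result = redact_user_input(record, re.compile(re.escape("2")))
print(result)
```

With an allowlist, fields added by future enhancements are left untouched by default, so a false positive from Comprehend can only affect the user-input fields themselves.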
