Redaction occurring in non-user-input fields #792
Comments
Thanks. We will review and get back to you.
Hi @ferg-dev, thanks for reporting this. I tried to reproduce this on the latest v6.1.5 but was unsuccessful. In my test, I enabled redacting using Comprehend and provided SSN (all 2s), credit card (all 2s), and number 2. The redaction appeared to work successfully based on what I observed in OpenSearch Dashboards > Discover. To help further investigate, please let us know:
Hi @ferg-dev, have you had a chance to see my previous message? Please let me know if you have any further information on the questions I posted.
Thank you for these details, @ferg-dev. I was able to see a few non-user-input keys like
Issue Summary:
Action Items:
For immediate assistance, you can also open a Support Case through your AWS Console to reach the Comprehend team directly. We'll update this thread once we have more information or the mitigation is implemented.
Hi @ferg-dev, thank you for your patience. In the latest QnABot release, v7.0.0, we have excluded additional non-user keys to help mitigate this. For more information on this release, please refer to CHANGELOG.md. Please let us know if you have any questions.
Hi @abhirpat, thank you for the update. I reviewed the changes and ran some tests on version 7.0.0, and I believe it will help in some cases. However, it still leaves many unexcluded fields open to redaction, which leads to data quality issues. I could add the 100+ fields to the exclusion list, but that seems an inefficient way to prevent these occurrences, and it leaves the door open for the problem to recur with future enhancements.

Here's one example from the sandbox where I deployed v7.0.0:

I understand that these are often incorrect identifications from Comprehend, but realistically Comprehend will never be 100% accurate. Are there any plans to separate the user input portion for redaction rather than redacting the entire record?
Describe the bug
Redaction is applied to every key in the request/response objects except the specifically excluded FirstSeen and LastSeen. This can lead to invalid data and failed imports into OpenSearch. Adding more keys to the exclusion list is possible, but potentially over 100 keys would need to be added to cover all scenarios.
To Reproduce
Comprehend needs to identify PII. The impact seems greatest when it identifies a single digit/character or a common element (like QID). This appears to show up only under higher-volume usage.
Expected behavior
User input should be redacted when enabled, but bot information, session IDs, timestamps, and settings should be skipped.
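To illustrate the expected behavior, here is a minimal sketch in Python of an allowlist approach, where only known user-input keys are redacted and everything else passes through untouched. This is not QnABot's code: the key names (`question`, `inputTranscript`), the `XXXXXX` replacement token, and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical set of keys that carry user-typed text; the real key names
# in QnABot's request/response objects may differ.
USER_INPUT_KEYS = {"question", "inputTranscript"}


def redact_user_input_only(record, pattern, user_keys=USER_INPUT_KEYS, token="XXXXXX"):
    """Return a copy of `record` with `pattern` redacted only in user-input keys."""
    out = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse so nested user-input keys are still covered.
            out[key] = redact_user_input_only(value, pattern, user_keys, token)
        elif key in user_keys and isinstance(value, str):
            out[key] = pattern.sub(token, value)
        else:
            # Session IDs, timestamps, settings, etc. are left as-is.
            out[key] = value
    return out


allow_record = {
    "question": "my number is 2",
    "sessionId": "abc-222",
    "datetime": "2024-02-12T10:22:00Z",
    "nested": {"inputTranscript": "it is 2", "qid": "Q.2"},
}
allow_result = redact_user_input_only(allow_record, re.compile(re.escape("2")))
print(allow_result)
```

With an allowlist, a false positive from Comprehend can only damage the user-input text itself, so new non-user fields added in future releases would not need to be added to an exclusion list.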
Please complete the following information about the solution:
Screenshots
Comprehend identified '2' as a problematic digit:
All 2's were redacted out of other string fields (with the exception of FirstSeen and LastSeen):
The invalid date caused error importing into OpenSearch:
Additional context
The solution appears to use Comprehend to analyze the input transcript/question for the specific locations of detected PII, then stores the strings at those locations as regex capture groups in an environment variable. A regex then uses this environment variable to redact data in all fields. So if Comprehend recognizes "QID" as PII, that QID is then removed from all string values within the request and response.
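The behavior described above can be reproduced with a small sketch. This is Python for illustration only and is not taken from the QnABot source: the function names, the record keys other than FirstSeen and LastSeen, and the `XXXXXX` token are assumptions, and the pattern is passed directly rather than through an environment variable.

```python
import re
from datetime import datetime


def build_redaction_regex(detected_strings):
    # Join every string flagged as PII into one alternation pattern.
    return re.compile("|".join(re.escape(s) for s in detected_strings))


def redact_all_fields(record, pattern, excluded_keys=("FirstSeen", "LastSeen")):
    """Apply the pattern to every string value except the excluded keys."""
    out = {}
    for key, value in record.items():
        if key in excluded_keys:
            out[key] = value
        elif isinstance(value, dict):
            out[key] = redact_all_fields(value, pattern, excluded_keys)
        elif isinstance(value, str):
            out[key] = pattern.sub("XXXXXX", value)
        else:
            out[key] = value
    return out


def parses_as_timestamp(value):
    # True if the value is still a valid ISO-style UTC timestamp.
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False


record = {
    "question": "my number is 2",
    "datetime": "2024-02-12T10:22:00Z",   # hypothetical non-user field
    "FirstSeen": "2024-02-12T10:00:00Z",  # excluded, so left untouched
}
# Suppose '2' was flagged as PII in the user's question.
redacted = redact_all_fields(record, build_redaction_regex(["2"]))
print(redacted)
print(parses_as_timestamp(redacted["datetime"]))
```

In this sketch the single detected character is stripped from the timestamp as well as from the question, so the timestamp no longer parses, which matches the failed OpenSearch import described in the screenshots.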