Sensitive data leakage #1374

Open
pesmeriz opened this issue Sep 25, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@pesmeriz
Contributor

System Info

OS version: macOS Sequoia 15.0

My pyproject.toml

[project]
name = "pandasai-benchmark"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "numpy==1.26.4",
    "pandasai>=2.2.14",
    "python-decouple>=3.8",
    "pyyaml>=6.0.2",
]

🐛 Describe the bug

Using "enforce_privacy": True does not anonymize the data. Even if you set custom_head on your SmartDataframe, the Agent still shares the data from the original dataframe. My example:

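import os

import pandas as pd
from decouple import config as cfg
from pandasai import Agent, SmartDataframe
from pandasai.llm import BambooLLM
from pandasai.llm.local_llm import LocalLLM

# Toggle between a local OpenAI-compatible endpoint and BambooLLM.
local = False
# When True, pass the API key directly instead of via PANDASAI_API_KEY.
bypass = True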

if local:
    llm = LocalLLM(api_base="http://localhost:11434/v1", model="qwen2.5-coder:latest")
else:
    if bypass:
        llm = BambooLLM(api_key=cfg("BAMBOO_API_KEY"))
    else:
        os.environ["PANDASAI_API_KEY"] = cfg("BAMBOO_API_KEY")
        llm = BambooLLM()


employee_head = pd.DataFrame(
    [
        [1, "Pedro", 600, 1],
        [2, "Tone", 1200, 2],
        [3, "Turo", 900, 3],
        [4, "ks", 750, 4],
        [5, "none", 950, 2],
    ],
    columns=["id", "name", "salary", "department_id"],
)

employee = SmartDataframe(
    pd.DataFrame(
        [
            [1, "John Dow", 60000, 1],
            [2, "Jane Smith", 120000, 2],
            [3, "Taro Yamada", 90000, 3],
            [4, "Maria Silva", 75000, 4],
            [5, "Michal Johnson", 95000, 2],
        ],
        columns=["id", "name", "salary", "department_id"],
    ),
    custom_head=employee_head,
    config={"custom_head": employee_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

department_head = pd.DataFrame(
    [
        [1, "HR", 1000, 1],
        [2, "E", 5000, 2],
        [3, "M", 2000, 3],
        [4, "S", 3000, 4],
    ],
    columns=["id", "name", "budget", "country_id"],
)

department = SmartDataframe(
    pd.DataFrame(
        [
            [1, "Human Resources", 100000, 1],
            [2, "Engineering", 500000, 2],
            [3, "Marketing", 200000, 3],
            [4, "Sales", 300000, 4],
        ],
        columns=["id", "name", "budget", "country_id"],
    ),
    custom_head=department_head,
    config={"custom_head": department_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

country_head = pd.DataFrame(
    [
        [1, "United States"],
        [2, "Germany"],
        [3, "Japan"],
        [4, "Brazil"],
    ],
    columns=["id", "name"],
)

country = SmartDataframe(
    pd.DataFrame(
        [
            [1, "United States"],
            [2, "Germany"],
            [3, "Japan"],
            [4, "Brazil"],
        ],
        columns=["id", "name"],
    ),
    custom_head=country_head,
    config={"custom_head": country_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

agent = Agent(
    [country, employee, department],
    config={
        "llm": llm,
        "verbose": True,
        "enforce_privacy": True,
        "enable_cache": False,
    },
)
response = agent.chat(
    "show me the employees with a salary above 80k and their respective salary"
    # "pivot table of the average salary of employees cross department with country"
    # "Who earns the least out of the Germany?"
)
print(response)

You can check this on /pandasai/llm/bamboo_llm.py line 18.

from typing import Optional

from ..helpers.request import Session
from ..prompts.base import BasePrompt
from .base import LLM


class BambooLLM(LLM):
    _session: Session

    def __init__(
        self, endpoint_url: Optional[str] = None, api_key: Optional[str] = None
    ):
        self._session = Session(endpoint_url=endpoint_url, api_key=api_key)

    def call(self, instruction: BasePrompt, _context=None) -> str:
        data = instruction.to_json()
        response = self._session.post("/llm/chat", json=data)
        return response["data"]

    @property
    def type(self) -> str:
        return "bamboo_llm"
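
One way to confirm the leak independently of the backend is to capture the outgoing payload as text (for example by logging `data` inside `BambooLLM.call`) and scan it for values that exist only in the real dataframe. The sketch below assumes you already have that payload as a string. `find_leaked_values` is a hypothetical helper written for this illustration and is not part of pandasai. It uses plain substring matching, so very short values can produce false positives.

```python
import pandas as pd


def find_leaked_values(
    payload: str, real_df: pd.DataFrame, head_df: pd.DataFrame
) -> list[str]:
    """Return values found in real_df but not in head_df that occur in payload.

    Hypothetical helper for illustration; not a pandasai API.
    """
    real_values = {str(v) for v in real_df.to_numpy().ravel()}
    head_values = {str(v) for v in head_df.to_numpy().ravel()}
    # Values shared by both frames (ids, for instance) cannot prove a leak.
    only_real = real_values - head_values
    return sorted(v for v in only_real if v in payload)


columns = ["id", "name", "salary", "department_id"]
real = pd.DataFrame(
    [[1, "John Dow", 60000, 1], [2, "Jane Smith", 120000, 2]], columns=columns
)
head = pd.DataFrame([[1, "Pedro", 600, 1], [2, "Tone", 1200, 2]], columns=columns)

# Stand-ins for captured payloads: one built from the custom head, one from real data.
safe_payload = head.to_csv(index=False)
leaky_payload = real.to_csv(index=False)

print(find_leaked_values(safe_payload, real, head))
print(find_leaked_values(leaky_payload, real, head))
```

An empty list for a captured payload would suggest that only the custom head was serialized. Any entry in the list is a real value that left the machine.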

dosubot bot commented Sep 25, 2024

Hey @pesmeriz! 👋 I'm here to help you out with any issues or questions you have. If you need assistance, feel free to ask!

There is a known issue with the enforce_privacy setting not working as expected. The problem is that the current implementation does not actually perform any data anonymization based on the enforce_privacy setting. The enforce_privacy property is defined and can be set, but there is no logic that uses this setting to modify or anonymize the data [1].

A similar problem was discussed in an earlier issue, which is now closed. The suggested temporary fix modifies the convert_df_to_csv function in pandasai/helpers/dataframe_serializer.py so that it does not add the dataframe rows when enforce_privacy is set to True [2].

Here is the suggested temporary fix:

Replace the following code in convert_df_to_csv:

# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"

With this:

# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"

This change keeps only the column headers in the serialized dataframe, so the original rows are not included. Note that, as written, the replacement applies regardless of the enforce_privacy setting.
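
The header-only idea can be tried in isolation with plain pandas. The sketch below is a minimal stand-in, assuming a serializer that receives a regular `pandas.DataFrame`, its position in `dfs`, and a boolean flag. `serialize_for_prompt` and its parameters are hypothetical names and do not reproduce pandasai's actual `convert_df_to_csv` signature. Unlike the snippet quoted above, it drops the rows only when the flag is set.

```python
import pandas as pd


def serialize_for_prompt(df: pd.DataFrame, index: int, enforce_privacy: bool) -> str:
    """Describe a dataframe for a prompt (hypothetical helper, not pandasai code)."""
    rows, cols = df.shape
    # With privacy enforced, keep the schema but drop every row.
    body_df = pd.DataFrame(columns=df.columns) if enforce_privacy else df
    return f"dfs[{index}]:{rows}x{cols}\n{body_df.to_csv(index=False)}"


employees = pd.DataFrame(
    [[1, "John Dow", 60000, 1], [2, "Jane Smith", 120000, 2]],
    columns=["id", "name", "salary", "department_id"],
)

private_text = serialize_for_prompt(employees, 0, enforce_privacy=True)
public_text = serialize_for_prompt(employees, 0, enforce_privacy=False)
print(private_text)
```

Even in the private case the row and column counts are still reported, so this hides values but not the shape of the data.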


@dosubot dosubot bot added the bug Something isn't working label Sep 25, 2024
@pesmeriz
Contributor Author

pesmeriz commented Sep 25, 2024

I tried the proposed solution and the problem remains.

Also, this should be made clearer in the docs. Otherwise people start using the feature, hit a bug (or worse, don't notice it), and are left without a solution and with time wasted.

@pesmeriz pesmeriz reopened this Sep 25, 2024