Sensitive data leakage #1374

pesmeriz · 2024-09-25T11:32:18Z

System Info

OS version: MacOS Sequoia 15.0

My pyproject.toml

[project]
name = "pandasai-benchmark"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "numpy==1.26.4",
    "pandasai>=2.2.14",
    "python-decouple>=3.8",
    "pyyaml>=6.0.2",
]

🐛 Describe the bug

Using "enforce_privacy": True does not anonymize the data. Even if you set custom_head on your SmartDataframe, the Agent still shares the data from the original dataframe. My example:

import os

import pandas as pd
from decouple import config as cfg
from pandasai import SmartDataframe, Agent
from pandasai.llm import BambooLLM
from pandasai.llm.local_llm import LocalLLM

local = False 
bypass = True

if local:
    llm = LocalLLM(api_base="http://localhost:11434/v1", model="qwen2.5-coder:latest")
else:
    if bypass:
        llm = BambooLLM(api_key=cfg("BAMBOO_API_KEY"))
    else:
        os.environ["PANDASAI_API_KEY"] = cfg("BAMBOO_API_KEY")
        llm = BambooLLM()


employee_head = pd.DataFrame(
    [
        [1, "Pedro", 600, 1],
        [2, "Tone", 1200, 2],
        [3, "Turo", 900, 3],
        [4, "ks", 750, 4],
        [5, "none", 950, 2],
    ],
    columns=["id", "name", "salary", "department_id"],
)

employee = SmartDataframe(
    pd.DataFrame(
        [
            [1, "John Dow", 60000, 1],
            [2, "Jane Smith", 120000, 2],
            [3, "Taro Yamada", 90000, 3],
            [4, "Maria Silva", 75000, 4],
            [5, "Michal Johnson", 95000, 2],
        ],
        columns=["id", "name", "salary", "department_id"],
    ),
    custom_head=employee_head,
    config={"custom_head": employee_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

department_head = pd.DataFrame(
    [
        [1, "HR", 1000, 1],
        [2, "E", 5000, 2],
        [3, "M", 2000, 3],
        [4, "S", 3000, 4],
    ],
    columns=["id", "name", "budget", "country_id"],
)

department = SmartDataframe(
    pd.DataFrame(
        [
            [1, "Human Resources", 100000, 1],
            [2, "Engineering", 500000, 2],
            [3, "Marketing", 200000, 3],
            [4, "Sales", 300000, 4],
        ],
        columns=["id", "name", "budget", "country_id"],
    ),
    custom_head=department_head,
    config={"custom_head": department_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

country_head = pd.DataFrame(
    [
        [1, "United States"],
        [2, "Germany"],
        [3, "Japan"],
        [4, "Brazil"],
    ],
    columns=["id", "name"],
)

country = SmartDataframe(
    pd.DataFrame(
        [
            [1, "United States"],
            [2, "Germany"],
            [3, "Japan"],
            [4, "Brazil"],
        ],
        columns=["id", "name"],
    ),
    custom_head=country_head,
    config={"custom_head": country_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

agent = Agent(
    [country, employee, department],
    config={
        "llm": llm,
        "verbose": True,
        "enforce_privacy": True,
        "enable_cache": False,
    },
)
response = agent.chat(
    "show me the employees with a salary above 80k and their respective salary"
    # "pivot table of the average salary of employees cross department with country"
    # "Who earns the least out of the Germany?"
)
print(response)

You can check this on /pandasai/llm/bamboo_llm.py line 18.

from typing import Optional

from ..helpers.request import Session
from ..prompts.base import BasePrompt
from .base import LLM


class BambooLLM(LLM):
    _session: Session

    def __init__(
        self, endpoint_url: Optional[str] = None, api_key: Optional[str] = None
    ):
        self._session = Session(endpoint_url=endpoint_url, api_key=api_key)

    def call(self, instruction: BasePrompt, _context=None) -> str:
        data = instruction.to_json()
        response = self._session.post("/llm/chat", json=data)
        return response["data"]

    @property
    def type(self) -> str:
        return "bamboo_llm"
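
Because call serializes the whole prompt with instruction.to_json() and posts it, one way to confirm what actually leaves the machine is to scan that serialized payload for known sensitive values before it is sent. The following is only a minimal sketch: find_leaked_values is a hypothetical helper that is not part of pandasai, and the payload shape shown here is an assumption made for illustration, not the real structure returned by to_json().

```python
import json
from typing import Any, Iterable


def find_leaked_values(payload: Any, sensitive_values: Iterable[Any]) -> list[str]:
    """Return the sensitive values whose text appears anywhere in the payload."""
    # Serialize once so nested dicts and lists are searched as plain text.
    serialized = json.dumps(payload, ensure_ascii=False, default=str)
    return [str(v) for v in sensitive_values if str(v) in serialized]


# Hypothetical payload shape, for illustration only.
payload = {
    "prompt": "dfs[0]:5x4\nid,name,salary,department_id\n1,John Dow,60000,1\n",
    "conversation": "show me the employees with a salary above 80k",
}

leaks = find_leaked_values(payload, ["John Dow", "Jane Smith", 120000])
print(leaks)
```

This is a plain substring check, so short numeric values can match unrelated text and values containing quotes or backslashes may be missed after JSON escaping. It is meant as a quick audit, not as a privacy guarantee.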


dosubot · 2024-09-25T11:33:16Z

Hey @pesmeriz! 👋 I'm here to help you out with any issues or questions you have. If you need assistance, feel free to ask!

There is a known issue with the enforce_privacy setting not working as expected. The problem is that the current implementation does not actually perform any data anonymization based on the enforce_privacy setting. The enforce_privacy property is defined and can be set, but there is no logic that uses this setting to modify or anonymize the data ^[1].

A similar problem was discussed in an earlier issue, which is now closed. The suggested temporary fix modifies the convert_df_to_csv function in pandasai/helpers/dataframe_serializer.py so that it does not add the dataframe contents when enforce_privacy is set to True ^[2].

Here is the suggested temporary fix:

Replace the following code in convert_df_to_csv:

# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"

With this:

# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"

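This change replaces the serialized dataframe with an empty one that has the same columns, so only the column names are included and not the original rows. Note that the replacement as shown does not itself check enforce_privacy, so it applies whether or not that setting is True.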
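
For comparison, the same idea can be made conditional on the privacy flag. This is a sketch under stated assumptions, not the library's implementation: serialize_for_prompt is a hypothetical standalone function, it takes a plain pandas DataFrame rather than pandasai's internal dataframe wrapper, and the enforce_privacy argument is passed in explicitly instead of being read from the pandasai config.

```python
import pandas as pd


def serialize_for_prompt(df: pd.DataFrame, index: int, enforce_privacy: bool) -> str:
    """Hypothetical helper: build the text block that describes one dataframe."""
    rows, cols = df.shape
    if enforce_privacy:
        # Keep only the column names; no cell values reach the prompt.
        body = pd.DataFrame(columns=df.columns).to_csv(index=False)
    else:
        body = df.to_csv(index=False)
    return f"\ndfs[{index}]:{rows}x{cols}\n{body}"


employee = pd.DataFrame(
    [[1, "John Dow", 60000, 1], [2, "Jane Smith", 120000, 2]],
    columns=["id", "name", "salary", "department_id"],
)

private_text = serialize_for_prompt(employee, 0, enforce_privacy=True)
public_text = serialize_for_prompt(employee, 0, enforce_privacy=False)
print(private_text)
```

With the flag on, the output still reveals the column names and the dataframe shape, which may or may not be acceptable depending on what counts as sensitive.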


pesmeriz · 2024-09-25T11:40:14Z

I tried the proposed solution and the problem remains.

Also, this should be made clearer in the docs. Otherwise people start using the setting, find a bug (or worse, they don't), and are then left without a solution and with their time wasted.

dosubot bot added the bug Something isn't working label Sep 25, 2024

pesmeriz closed this as completed Sep 25, 2024

pesmeriz reopened this Sep 25, 2024
