
[Feature] LocalDocs support for CSV, JSON, XML #2059

Open
tbennett6421 opened this issue Feb 29, 2024 · 7 comments
Labels
chat (gpt4all-chat issues), enhancement (New feature or request), local-docs

Comments


tbennett6421 commented Feb 29, 2024

Feature Request

Please add to the roadmap for gpt4all-localdocs the ability to parse CSV, JSON, and XML files. LLMs are prone to making things up, so I intend to use LocalDocs to provide databases of concrete facts. Most of this data comes in CSV, JSON, or XML format.

LocalDocs currently supports plain text files (.txt, .md, and .rst) and PDF files (.pdf).

Example use cases:

  • dumping logs into a folder and asking questions about the data
  • dumping databases into a folder and requesting experimental data such as molecular weight, melting/freezing point, and solubility (mw, mp/fp, solubility)
  • dumping financial spreadsheets and asking questions about transcripts
  • and more
@tbennett6421 tbennett6421 added the enhancement New feature or request label Feb 29, 2024
@cebtenzzre cebtenzzre added chat gpt4all-chat issues local-docs labels Feb 29, 2024
@cebtenzzre cebtenzzre changed the title [Feature] LocalDocs machine readable formats [Feature] LocalDocs support for CSV, JSON, XML Feb 29, 2024
manyoso (Collaborator) commented Mar 10, 2024

LocalDocs currently does not have any support for custom file parsing, though this would be a nice addition.

mishaxz commented Mar 10, 2024

I concur; right now you have to rename your .csv files to .txt.

By the way, does anyone know what the fastest models are for this kind of thing? I'm using Nous Hermes 2 Mistral DPO on the renamed CSV file right now, but it is kind of slow.

tbennett6421 (Author) commented

What would it take to implement some kind of parser in LocalDocs? I'd be willing to look at doing a PR for it, either in Python or in C.

tbennett6421 (Author) commented Mar 12, 2024

#1344 could help address one of the points above:

  > dumping databases into a folder and requesting experimental data such as molecular weight, melting/freezing point, and solubility (mw, mp/fp, solubility)

specifically when GPT hallucinates or fabricates empirically measured data.

cebtenzzre (Member) commented

  > LocalDocs currently does not have any support for custom file parsing, though this would be a nice addition.

Since these are plain-text formats, a minimum-effort implementation would be to just add them back to the whitelist. When I removed them, I wanted to start with a clean slate, because the list contained a lot of formats that, even if they worked, nobody seemed likely to be using.

That said, I don't think it makes sense to use the LocalDocs feature as-is to process structured input, since it breaks documents into chunks and destroys the global structure... although it clearly worked well enough for a few people in the past. A slightly more useful implementation would, for example, keep the header row in snippets of CSV, and keep the outer structure for XML and JSON.
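The header-preserving CSV chunking suggested above could be sketched roughly like this. This is a minimal illustration in Python, not the actual gpt4all implementation (which is C++); the function name `chunk_csv` and the chunk size are hypothetical:

```python
import csv
import io

def chunk_csv(text: str, rows_per_chunk: int = 20) -> list[str]:
    """Split CSV text into chunks, repeating the header row in each chunk
    so every snippet stays self-describing when retrieved in isolation."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(header)                      # repeat the header per chunk
        writer.writerows(body[i:i + rows_per_chunk]) # then a slice of data rows
        chunks.append(out.getvalue())
    return chunks
```

Because every chunk carries the column names, a retrieved snippet still tells the model what each field means, which is exactly the structure a naive fixed-size text splitter would destroy.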

spiralofhope commented
I see both csv and xml listed in database.cpp.

So if I understand this feature request, it is asking for proper support for the data as a structured format, not just as a dump of plain-text words. Is that right?

tbennett6421 (Author) commented

CSV hasn't been the worst. For instance, I've dropped big CSV files without headers into it, and it does a good enough job of getting what I mean out of them. My big issue is consuming JSON and XML structured input.

Generally I write something to consume data from some endpoint/URL as JSON, then painstakingly convert the results to markdown or plain text before adding them to my LocalDocs collections. That means writing unique "data-massaging" scripts to support each data-collection effort. If I don't do that beforehand, I get giant database files, which hurts the performance of GPT4All.

That's my specific use case.
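The "data-massaging" step described above can be sketched as a small pre-flattening script. This is a hypothetical helper in Python (not part of GPT4All) that turns nested JSON into one "path: value" line per fact, so chunking can't sever a value from the keys that give it meaning:

```python
import json

def flatten_json(obj, prefix: str = "") -> list[str]:
    """Flatten a nested JSON structure into 'dotted.path: value' lines,
    one self-contained fact per line, suitable for plain-text ingestion."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines += flatten_json(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            lines += flatten_json(value, f"{prefix}[{i}]")
    else:
        lines.append(f"{prefix}: {obj}")
    return lines

# Example: an experimental-data record like the ones mentioned earlier
record = json.loads('{"compound": "NaCl", "properties": {"mp_c": 801, "solubility_g_l": 360}}')
text = "\n".join(flatten_json(record))
```

Each output line is independently meaningful, so even a naive fixed-size chunker keeps the association between a measurement and the compound it belongs to.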
