[Feature] Better LocalDocs support for structured data formats such as XML #2048

Closed
SINAPSA-IC opened this issue Feb 28, 2024 · 5 comments
Labels
chat (gpt4all-chat issues), enhancement (New feature or request), local-docs

Comments

@SINAPSA-IC

SINAPSA-IC commented Feb 28, 2024

Bug Report

Hello.

I apologize if this is not a bug per se, but I don't know where else to place this...

[Aside from the subject of this post, the fact is that the search in LocalDocs does provide unexpected, random results - poor/wrong replies mixed with correct ones - regardless of the Prompts of whichever type (System, Template, User).
Add to this the censored LLMs that put forward really stunted info: I asked something about one, then about another, controversial individual, both of them being referred to in PDF files in a collection made available to the LLM, yet it replied as if no such info whatsoever was available (even though it was in the very titles of the PDFs) - and it did this in between replies where it *did* quote PDFs in the same collection that did not refer to those individuals. This has happened several times. Most annoying; it amounts to ditching those models in favor of some uncensored ones - which would those be?]

Steps to Reproduce

Number of LocalDocs collections in use: 1 (the one of primary interest), then 2 (the former, plus 1 random collection) - the result is the same.

Files in the LocalDocs collection of primary interest: ~10 XML files (XML both in extension and in format).

Intent: to extract the number of times the attribute value "XYZ" appears in those files.
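(For reference, this kind of count can be computed deterministically outside the LLM; a minimal Python sketch follows, assuming well-formed XML files in one folder - the folder path is a placeholder:)

```python
# Minimal sketch (not GPT4All code): count how many times the attribute value
# "XYZ" appears across a folder of XML files. The folder path is a placeholder.
from pathlib import Path
import xml.etree.ElementTree as ET

def count_attribute_value(folder: str, value: str = "XYZ") -> int:
    count = 0
    for xml_file in Path(folder).glob("*.xml"):
        root = ET.parse(xml_file).getroot()
        for element in root.iter():
            # Check every attribute of every element for an exact match.
            count += sum(1 for v in element.attrib.values() if v == value)
    return count

print(count_attribute_value("path/to/localdocs_collection"))
```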

Prompt Template (as user-provided instructions): the result is the same with these 2 PTs:

  1. None
  2. When you build the answer to a question or the reply to a request, you will only use information that is not already in your programming*
  • this is how Mistral referred to its /internal knowledge, so I used the same wording.

Model: Mistral

Question:
in how many distinct places (external information, such as XML files) can you identify the value "XYZ" of an attribute?

GPT4All/LLM behavior:
searching localdocs: ... ... (followed by the name/s of the 1 or 2 collections)
(this behavior was expected)

Answer:
Based on the provided context, it appears that the text "XYZ" is not explicitly mentioned in any external information or XML files. However, it is possible that the text may be present within some of the downloaded documents, but it would require further analysis to determine if it exists and where it can be found.


  1. Date: 2023, March 31. In ...... Context
  2. Date: 2023, December 17. In ...... Context
  3. Date: 2023, December 15. In ...... Context
    ... ...

These^ were all snippets from the supplementary LocalDocs collection.
Apparently, either
a) no search was performed in the collection whose files did contain "XYZ", or
b) a search was performed in that collection, yet no such text was found.


Another example of a wrong reply after searching LocalDocs collection(s)

  • with PDF files
  • documenting the AT&T outage of February 2024:

Question:
what happened to AT&T?

Answer:
I am not aware of any specific events that have caused an outage with AT&T's wireless network. However, it's worth noting that cellular outages can occur due to a variety of reasons, including technical issues, cyber attacks, or natural disasters. It's important to approach any information or claims made about the cause of the outage with a critical eye and to seek out multiple perspectives before forming an opinion.


  1. "Cellular outages hit thousands in U.S., AT&T users most affected". Date: 2024, February 22. In 20240222-CNBC-Cellular outages hit thousands in US - AT_T users most affected.pdf. Page 1. Context
  2. "Cellular outages hit thousands in U.S., AT&T users most affected". Date: 2024, February 22. In 20240222-CNBC-Cellular outages hit thousands in US - AT_T users most affected.pdf. Page 1. Context
  3. "Cellular outages hit thousands in U.S., AT&T users most affected". Date: 2024, February 22. In 20240222-CNBC-Cellular outages hit thousands in US - AT_T users most affected.pdf. Page 2. Context
  4. "More than 60,000 At&T customers say they're unable to make calls with others reporting their phones automatically calling 911 a". Date: 2024, February 22. In 20240222-DailyMail-More than 60000 AT_T customers say they're unable to make calls with others reporting their phones automatically calling 911 and reverting to SOS as map shows hotspots.pdf. Page 15. Context
  5. "More than 60,000 At&T customers say they're unable to make calls with others reporting their phones automatically calling 911 a". Date: 2024, February 22. In 20240222-DailyMail-More than 60000 AT_T customers say they're unable to make calls with others reporting their phones automatically calling 911 and reverting to SOS as map shows hotspots.pdf. Page 16. Context
  6. "More than 60,000 At&T customers say they're unable to make calls with others reporting their phones automatically calling 911 a". Date: 2024, February 22. In 20240222-DailyMail-More than 60000 AT_T customers say they're unable to make calls with others reporting their phones automatically calling 911 and reverting to SOS as map shows hotspots.pdf. Page 23. Context

Question:
what happened to AT&T in February 2024?

Answer:
I am not aware of any specific events that have caused an outage with AT&T's wireless network in February 2024. However, it's worth noting that cellular outages can occur due to a variety of reasons, including technical issues, cyber attacks, or natural disasters. It's important to approach any information or claims made about the cause of the outage with a critical eye and to seek out multiple perspectives before forming an opinion.


  1. "(Scientists Increasingly Can\222t Explain How AI Works)". Date: 2024, February 22. In 20221101-Vice-Scientists increasingly can't explain how AI works.pdf. Page 2. Context
    2 to 5 - news items in PDFs totally unrelated to the question. Like 6, for that matter:

  2. "(Incredible moment world\222s largest SNAKE is discovered by TV crew with 26ft beast as thick as a car tyre f...)". Date: 2024, February 21. In 20240220-TheSun-Incredible moment world’s largest SNAKE is discovered by TV crew with 26ft beast as thick as a car tyre filmed in water.pdf. Page 1. Context


Next:

  • checked one more LocalDocs collection, besides the one with the PDFs containing news about AT&T
  • did not reload the LLM

Question:
(the same as the last above)
(searching localdocs: ... the 2 collections... )
Answer:
(the same as the last above, including the same irrelevant/unrelated snippets)

Next:

  • reloaded the LLM
    Question:
    (the same as the last above)
    (searching localdocs: ... the 2 collections... )
    Answer:
    (the same as the last above, including the same irrelevant/unrelated snippets)

Next:

  • refreshed it all(?) (using the icon to the right of the model combobox)
    Question:
    (the same as the last above)
    (searching localdocs: ... the 2 collections... )
    Answer:
    I apologize, but I do not have access to real-time information and my knowledge cutoff is January 2nd, 2024. As of that date, there were no major public events or announcements regarding AT&T that I am aware of. If you could provide more specific details or a later date, I would be happy to try and provide an update.

(quoted the same irrelevant/unrelated Context snippets as above)

as above, so below...


Model: Falcon
(the same story, up to points a) and b) above)

Next:

  • unloaded Falcon
  • loaded Mistral
  • loaded Falcon
    Question:
    (the same as the last above)
    (searching localdocs: ... the 2 collections... )
    Answer:
    (also does not know anything)

(quoted the same irrelevant/unrelated Context snippets as above)

Next:

  • unchecked the supplementary LocalDocs collection, leaving only the /news one checked
    Question:
    what happened to AT&T in 2024?
    Answer:
    In 2024, AT&T experienced a major outage that affected thousands of customers across the United States. The outage was reported by several carriers, including Verizon and T-Mobile, and it was believed to be caused by a software issue with AT&T's network. The outage lasted for several hours and caused disruptions in emergency services, as well as in other areas such as internet connectivity and mobile phone calls.

The incident highlighted the importance of reliable communication networks and the need for carriers to invest in infrastructure and technology to ensure that their networks can handle increasing demand and provide consistent service to customers. It also raised questions about the role of government regulators in overseeing the telecommunications industry and ensuring that carriers are held accountable for providing high-quality service to their customers.


  1. "Cellular outages hit thousands in U.S., AT&T users most affected". Date: 2024, February 22. In 20240222-CNBC-Cellular outages hit thousands in US - AT_T users most affected.pdf. Page 1. Context
  2. "Cellular outages hit thousands in U.S., AT&T users most affected". Date: 2024, February 22. In 20240222-CNBC-Cellular outages hit thousands in US - AT_T users most affected.pdf. Page 2. Context
  3. "More than 60,000 At&T customers say they're unable to make calls with others reporting their phones automatically calling 911 a". Date: 2024, February 22. In 20240222-DailyMail-More than 60000 AT_T customers say they're unable to make calls with others reporting their phones automatically calling 911 and reverting to SOS as map shows hotspots.pdf. Page 15. Context
  4. "More than 60,000 At&T customers say they're unable to make calls with others reporting their phones automatically calling 911 a". Date: 2024, February 22. In 20240222-DailyMail-More than 60000 AT_T customers say they're unable to make calls with others reporting their phones automatically calling 911 and reverting to SOS as map shows hotspots.pdf. Page 16. Context
  5. "More than 60,000 At&T customers say they're unable to make calls with others reporting their phones automatically calling 911 a". Date: 2024, February 22. In 20240222-DailyMail-More than 60000 AT_T customers say they're unable to make calls with others reporting their phones automatically calling 911 and reverting to SOS as map shows hotspots.pdf. Page 22. Context
  6. "More than 60,000 At&T customers say they're unable to make calls with others reporting their phones automatically calling 911 a". Date: 2024, February 22. In 20240222-DailyMail-More than 60000 AT_T customers say they're unable to make calls with others reporting their phones automatically calling 911 and reverting to SOS as map shows hotspots.pdf. Page 23. Context

^which is OK - but, but, but... let's call it a day.

Expected Behavior

Expectations:
I was hoping that a number, whatever it might have been, would appear in the answer, as the query was simple (in my view) and the structure of the XML files is really simple and easily parsable, with "XYZ" being a separate piece of text, in capital letters.

Your Environment

System: Windows 10, latest updates as of 2024.02.28
GPT4All 2.7.1.
LLMs: Mistral Instruct, GPT4All Falcon (7B)

I repeat my opinion that, as of 2.7.1, the user has no way of knowing whether the LocalDocs collections (un)checked in the respective dialog are or are not being used, or are or are not going to be used, by an LLM. The rectangle around the disks icon is not reliably related to this behavior: it indicates that some LocalDocs collections have been selected for use by LLMs, yet in practice they either are used indeed (both for building the reply and for the Context snippets) or they are not.

What's more, the Context snippets are not always related to the contents of the respective reply (this looks like a distinct bug), regardless of the LLM.

Suggestion:
Either put on hold the use of the LocalDocs functionality beyond that of indexing docs,
or warn the user that it is not so reliable (and under which conditions),
or specify which steps should be followed, for example: unload the LLM, check collections, confirm(?) the selection by clicking the dialog's X or(!) outside the LocalDocs dialog, reload the LLM, check 1, 2, n more collections, unload, uncheck 1, 2, n collections, reload, ask, take the reply for granted...

Thank you.

@SINAPSA-IC added the bug-unconfirmed and chat (gpt4all-chat issues) labels Feb 28, 2024
@SINAPSA-IC changed the title from "GPT4All 2.7.1. - Erratical behaviour when using LocalDocs" to "GPT4All 2.7.1. - Erratic behavior when using LocalDocs" Feb 28, 2024
@cebtenzzre added the question (general questions) and local-docs labels and removed the bug-unconfirmed label Feb 28, 2024
@cebtenzzre
Member

That's a long post. One thing to note is that LocalDocs currently makes no attempt to read XML files. Is that a feature you are interested in?

Also, please note that the LLM has no knowledge of your documents actually existing on disk, and does not participate in the search process - the SBert model is currently used to search your documents and provide relevant context to the LLM.
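Roughly, the retrieval step looks like the following minimal sketch (this uses the sentence-transformers package purely for illustration - it is not the actual gpt4all-chat code, and the model name and snippets are placeholders):

```python
# Illustrative sketch of embedding-based retrieval (not GPT4All's implementation).
# The model name and the document snippets below are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

snippets = [
    "Cellular outages hit thousands in U.S., AT&T users most affected ...",
    "Scientists increasingly can't explain how AI works ...",
]
query = "what happened to AT&T in February 2024?"

# Embed the query and the snippets, then rank snippets by cosine similarity.
snippet_embeddings = model.encode(snippets, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, snippet_embeddings)[0]

# The top-scoring snippets are what gets pasted into the LLM's context window;
# the LLM itself never searches the files.
for score, text in sorted(zip(scores.tolist(), snippets), reverse=True):
    print(f"{score:.3f}  {text[:60]}")
```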

There are experimental attempts to get LLMs to imitate function calls so they can interact with things in a more powerful way than just text in and text out. It seems like that's how most people expect localdocs to work - that's a feature you could request.
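For illustration only, here is a hedged sketch of what such a function-calling loop could look like - the tool name, the JSON shape, and the helper are hypothetical, not an existing GPT4All API:

```python
# Hypothetical sketch of a function-calling loop (not an existing GPT4All feature).
# The LLM would be prompted to answer with JSON like:
#   {"tool": "count_attribute_value", "arguments": {"value": "XYZ"}}
# and the host application would execute the matching function deterministically.
import json

def count_attribute_value(value: str) -> int:
    # Placeholder body - e.g. the XML-counting function sketched earlier.
    return 42

TOOLS = {"count_attribute_value": count_attribute_value}

def handle_model_output(model_output: str) -> str:
    try:
        call = json.loads(model_output)
        result = TOOLS[call["tool"]](**call["arguments"])
        return f"Tool result: {result}"
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not a tool call - treat the output as a normal text answer.
        return model_output

print(handle_model_output('{"tool": "count_attribute_value", "arguments": {"value": "XYZ"}}'))
```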

@SINAPSA-IC
Author

SINAPSA-IC commented Feb 28, 2024

Ah... okay... sorry, I didn't know that XML files cannot be parsed. The list of parsable file formats that I have is the one from 2.5.4, and I haven't looked at it since :) sorry!

Yes, it would be useful for me/us at our workplace, where we use a lot of XMLs that contain important info as attribute values - these could be counted, identified (x was spotted on day y, where y is the value of another attribute), etc.

I have downloaded the SBert model as well - as I understand it, that is what enables local files to be read and information to be extracted from them.

@cebtenzzre
Member

cebtenzzre commented Feb 28, 2024

I don't think LocalDocs in its current form is going to be able to do what you want - the LLM won't see the full structure of the XML file, only snippets of it matched by relevance against the plain English text of your query.
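If XML support were added, one simple (hypothetical) approach would be to flatten elements and attributes into plain-text lines before the usual chunking and embedding - a sketch under that assumption, not existing or planned code:

```python
# Hypothetical sketch: flatten an XML file into plain-text lines so that the
# existing chunk-and-embed pipeline could index attribute values.
# Not existing GPT4All code; the file name is a placeholder.
import xml.etree.ElementTree as ET

def flatten_xml(path: str) -> list[str]:
    lines = []
    for element in ET.parse(path).getroot().iter():
        attrs = " ".join(f'{k}="{v}"' for k, v in element.attrib.items())
        text = (element.text or "").strip()
        lines.append(" ".join(part for part in (element.tag, attrs, text) if part))
    return lines

# Each line could then be embedded and retrieved like a sentence from a PDF,
# though counting occurrences would still be up to the LLM (or a tool call).
for line in flatten_xml("example.xml"):
    print(line)
```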

We don't even support .doc, .docx, .rtf, .pages, or .odt yet - I think those are a higher priority.

@SINAPSA-IC
Author

I guess you are right - you know best. I for one would stop at PDF :) since it 1) has as much complexity as N other formats put together and 2) is a really widespread file format... and at XML too, due to it being highly structured - and simply structured, generally...

Thank you! for clarifying the XML issue.

The problem remains that information is sometimes taken from PDFs and sometimes not - this time yes, next time no - and that the Context snippets are taken from sources (PDFs) other than those seemingly used for building the reply.

@cebtenzzre changed the title from "GPT4All 2.7.1. - Erratic behavior when using LocalDocs" to "[Feature] Better LocalDocs support for structured data formats such as XML" Feb 29, 2024
@cebtenzzre added the enhancement (New feature or request) label and removed the question (general questions) label Feb 29, 2024
@cebtenzzre
Member

Closing as duplicate of #2059 since it more concisely describes what is needed.

@cebtenzzre closed this as not planned (duplicate) Feb 29, 2024