[Feature] Better LocalDocs support for structured data formats such as XML #2048
Comments
That's a long post. One thing to note is that LocalDocs currently makes no attempt to read XML files. Is that a feature you are interested in? Also, please note that the LLM has no knowledge of your documents actually existing on disk, and does not participate in the search process; the SBert model is currently used to search your documents and provide relevant context to the LLM. There are experimental attempts to get LLMs to imitate function calls so they can interact with things in a more powerful way than just text in and text out. It seems like that's how most people expect LocalDocs to work; that's a feature you could request.
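To illustrate why this matters for XML: retrieval of the kind described above ranks snippets by similarity to the query, so a plain-English question tends to match plain-English text, not attribute values buried in markup. Below is a toy sketch of that idea, not GPT4All's actual code; the real pipeline uses SBert embeddings, which are approximated here with simple term-frequency vectors, and the sample snippets are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    # Toy stand-in for an SBert embedding: a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_snippets(query, snippets, k=2):
    # Rank snippets by similarity to the query and keep the best k.
    q = vectorize(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, vectorize(s)), reverse=True)
    return ranked[:k]

# Hypothetical document snippets: two plain-English ones and one XML record.
snippets = [
    "AT&T suffered a nationwide wireless outage in February 2024.",
    '<record date="2024-02-21" code="XYZ"/>',
    "Scientists increasingly cannot explain how AI works.",
]

# The plain-English snippet ranks first; the XML record shares no tokens
# with the query, so it would never be surfaced as context.
print(top_snippets("what happened to AT&T in February 2024?", snippets, k=1))
```

A query like "how many places contain the value XYZ" fares even worse: counting is an aggregate over the whole corpus, while retrieval only ever hands the LLM a few top-ranked snippets.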
Ah... okay, sorry, I didn't know that XML files cannot be parsed; the list of parsable file formats I have is the one from 2.5.4, and I hadn't looked at it since :) Sorry! Yes, it would be useful for me/us at our workplace, where we use a lot of XMLs that contain important info as values of attributes; those values could be counted, identified (x was spotted on day y, which is the value of another attribute), etc. I have downloaded the SBert model as well, which I see causes local files to be read and info to be extracted from them.
I don't think LocalDocs in its current form is going to be able to do what you want - the LLM won't see the full structure of the XML file, only snippets of it matched by relevance against the plain English text of your query. We don't even support .doc, .docx, .rtf, .pages, or .odt yet - I think those are a higher priority.
I guess you are right; you know best. For my part, I would stop at PDF :) since it 1) has as much complexity as N other formats put together, and 2) is a really widespread file format... and at XML too, due to it being highly structured, and simply structured, generally. Thank you for clarifying the XML issue. The problem remains with info sometimes being taken from PDFs and sometimes not, this time yes but next time not, and with the snippets being taken from other sources (PDFs) than those seemingly used for building a reply.
Closing as duplicate of #2059 since it more concisely describes what is needed. |
Bug Report
Hello.
I apologize if this is not a bug per se, yet I don't know where else to place this...
[Aside from the subject of this post, the fact is that the search in LocalDocs provides unexpected, random results, poor/wrong replies mixed with correct ones, regardless of the Prompts of whichever type (System, Template, User).
Adding to this, the censored LLMs put forward really stunted info: I asked something about one, then about another, controversial individual, both of them being referred to in PDF files in a collection forwarded to the LLM, yet it replied as if no such info whatsoever was available (even though it was in the very titles of the PDFs). It was doing this between replies where it _did_ quote PDFs in the same collection that were not referring to those individuals. This has happened several times. Most annoying; it equals ditching those models in favor of some uncensored ones (which would those be?)]
Steps to Reproduce
Number of LocalDocs collections being used: 1 (of primary interest), then 2 (the former, plus 1 random collection) - the result is the same.
Files in the LocalDocs collection of primary interest: ~10 XML files, both as extension and format.
Intent: to extract the number of times when the value "XYZ" of an attribute appears in those files.
Prompt Template (as user-provided instructions): the result is the same with these 2 prompt templates:
Question:
in how many distinct places (external information, such as XML files) can you identify the value "XYZ" of an attribute?
GPT4All/LLM behavior:
searching localdocs: ... ... (followed by the name/s of the 1 or 2 collections)
(this behavior was expected)
Answer:
Based on the provided context, it appears that the text "XYZ" is not explicitly mentioned in any external information or XML files. However, it is possible that the text may be present within some of the downloaded documents, but it would require further analysis to determine if it exists and where it can be found.
... ...
These^ were all snippets from the supplementary LocalDocs collection.
Apparently, either:
a) no search was performed in the collection whose files did indeed contain "XYZ", or
b) a search was performed in this collection, yet no such text was found.
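For reference, the counting task itself is trivial outside the LLM. A minimal sketch using only the Python standard library, where the sample XML content and attribute names are hypothetical stand-ins for the ~10 files described above:

```python
import xml.etree.ElementTree as ET
from io import StringIO

def count_attribute_value(xml_source, value):
    """Count how many elements carry any attribute equal to `value`."""
    tree = ET.parse(xml_source)
    return sum(
        1
        for elem in tree.iter()          # walk every element in the tree
        for v in elem.attrib.values()    # inspect each of its attributes
        if v == value
    )

# Hypothetical sample standing in for one of the XML files in the collection.
sample = StringIO(
    '<log>'
    '<event day="1" code="XYZ"/>'
    '<event day="2" code="ABC"/>'
    '<event day="3" code="XYZ"/>'
    '</log>'
)
print(count_attribute_value(sample, "XYZ"))  # → 2
```

This is exactly the kind of exact, structure-aware aggregation that snippet-based retrieval cannot deliver, which is why the answers above come back empty or random.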
Another example of a wrong reply after searching LocalDocs collection/s:
Question:
what happened to AT&T?
Answer:
I am not aware of any specific events that have caused an outage with AT&T's wireless network. However, it's worth noting that cellular outages can occur due to a variety of reasons, including technical issues, cyber attacks, or natural disasters. It's important to approach any information or claims made about the cause of the outage with a critical eye and to seek out multiple perspectives before forming an opinion.
Question:
what happened to AT&T in February 2024?
Answer:
I am not aware of any specific events that have caused an outage with AT&T's wireless network in February 2024. However, it's worth noting that cellular outages can occur due to a variety of reasons, including technical issues, cyber attacks, or natural disasters. It's important to approach any information or claims made about the cause of the outage with a critical eye and to seek out multiple perspectives before forming an opinion.
"(Scientists Increasingly Can't Explain How AI Works)". Date: 2024, February 22. In 20221101-Vice-Scientists increasingly can't explain how AI works.pdf. Page 2. Context
Context snippets 2 to 5: news items in PDFs totally unrelated to the question. As is snippet 6, for that matter:
"(Incredible moment world's largest SNAKE is discovered by TV crew with 26ft beast as thick as a car tyre f...)". Date: 2024, February 21. In 20240220-TheSun-Incredible moment world's largest SNAKE is discovered by TV crew with 26ft beast as thick as a car tyre filmed in water.pdf. Page 1. Context
Next:
Question:
(the same as the last above)
(searching localdocs: ... the 2 collections... )
Answer:
(the same as the last above, including the same irrelevant/unrelated snippets)
Next:
Question:
(the same as the last above)
(searching localdocs: ... the 2 collections... )
Answer:
(the same as the last above, including the same irrelevant/unrelated snippets)
Next:
Question:
(the same as the last above)
(searching localdocs: ... the 2 collections... )
Answer:
I apologize, but I do not have access to real-time information and my knowledge cutoff is January 2nd, 2024. As of that date, there were no major public events or announcements regarding AT&T that I am aware of. If you could provide more specific details or a later date, I would be happy to try and provide an update.
(quoted the same irrelevant/unrelated snippets about the Context, as above)
as above, so below...
(the same story down to points a) and b) above.)
Next:
Question:
(the same as the last above)
(searching localdocs: ... the 2 collections... )
Answer:
(also does not know anything)
(quoted the same irrelevant/unrelated snippets about the Context, as above)
Next:
Question:
what happened to AT&T in 2024?
Answer:
In 2024, AT&T experienced a major outage that affected thousands of customers across the United States. The outage was reported by several carriers, including Verizon and T-Mobile, and it was believed to be caused by a software issue with AT&T's network. The outage lasted for several hours and caused disruptions in emergency services, as well as in other areas such as internet connectivity and mobile phone calls.
The incident highlighted the importance of reliable communication networks and the need for carriers to invest in infrastructure and technology to ensure that their networks can handle increasing demand and provide consistent service to customers. It also raised questions about the role of government regulators in overseeing the telecommunications industry and ensuring that carriers are held accountable for providing high-quality service to their customers.
^which is OK, but, but, but... call it a day.
Expected Behavior
Expectations:
I was hoping that a number, whatever it would have been, would appear in the answer, as the query was simple (in my view) and the structure of the XML files really simple and easily parsable, with "XYZ" being a separate text, in capital letters.
Your Environment
System: Windows 10, latest updates as of 2024.02.28
GPT4All 2.7.1.
LLM: Mistral Instruct, GPT4All Falcon (7B)
I repeat my opinion that, as of 2.7.1, the user has no way of knowing whether the LocalDocs collections (un)checked in the respective dialog are (or are not) being used, or are to be used, by an LLM; the rectangle around the disks icon is not reliably related to this behavior: it says that some LocalDocs collections have been selected in order to be used by LLMs, yet they either are indeed being used (both for building the reply and the Context snippets) or they are not.
What's more, the Context snippets are not always related to the contents of the respective reply (this looks like a distinct bug), regardless of the LLM.
Suggestion:
Either placing on hold the use of LocalDocs functionality beyond that of Indexing docs,
or warning the user about it being not-so-reliable(-under-which-conditions),
or specifying what steps should be followed, for example: unload the LLM, check collections, confirm(?) the selection by clicking the controlbox X or(!) outside the LocalDocs dialog, reload the LLM, check 1 2 n more collections, unload, uncheck 1 2 n collections, reload, ask, take the reply for granted...
Thank you.