Search with Page Content Crawl Feedback | 搜索 - 页面深度抓取功能反馈 #6632

Open
arvinxx opened this issue Mar 2, 2025 · 38 comments

Comments

@arvinxx
Contributor

arvinxx commented Mar 2, 2025

Image

Regarding #6482, v1.67.1 officially releases the web scraping feature for online search. We welcome everyone to provide feedback on any bad cases encountered during use, so we can improve the web scraping experience together.

@dream-north

Nice! gpt-4o works fine. However, the qwen series models don't seem to trigger the crawl call properly. Is there a way around this?


@arvinxx
Contributor Author

arvinxx commented Mar 2, 2025

@dream-north If the model doesn't support function calling, it probably won't work.

@dream-north

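@dream-north If the model doesn't support function calling, it probably won't work.

@arvinxx I'm using qwen-plus, which does support function calling. Here is what I observed:

  • Smart online search on, "use the model's built-in search engine" off, web-browsing on: the SearXNG search page was not triggered, and crawl was not triggered.
  • Smart online search off, web-browsing on: the SearXNG search page was triggered, but crawl was not.
  • Passing a URL directly, as in your example: crawl was not triggered.

With smart online search on, gpt-4o reliably calls both SearXNG and crawl.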

@arvinxx
Contributor Author

arvinxx commented Mar 2, 2025

@dream-north Does qwen-plus have built-in web search?


@dream-north

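@dream-north Does qwen-plus have built-in web search?

@arvinxx Yes, exactly. It feels like the two kinds of web search are getting mixed up: with smart online search on, qwen-plus did not call SearXNG even when the built-in search engine was turned off.

After switching to qwen2.5-72b, which has no built-in search, SearXNG is called reliably, but unlike gpt-4o it does not call crawl on the search results.

I tried again, and with smart online search on, qwen-plus can call crawl to summarize a page as in your example. Nice.


On a separate note, Bailian's qwen API now seems to support streaming output with function calling. I changed stream to true in src/libs/agent-runtime/qwen/index.ts and streaming output works normally. Related issue: #4567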

Image

@Cheukfung

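My tests work fine.
Version: docker database 1.67.1
Model: qwen-plus (supports function calling, has built-in web search)
Settings: smart online search on, model built-in search off; sending a link successfully triggers crawl
Screenshot: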

Image


@dream-north

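@Cheukfung Thanks, sending a link now triggers crawl for me too.

For search, asking "search today's NBA games" like you did triggers SearXNG;
but asking "why did Trump and Zelensky quarrel" appears to use the built-in search, and it returned a bogus example.com reference link.

Image

With the same question, gpt-4o-mini calls crawl after searching; that has never been triggered with the qwen series models.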

Image


@Sun-drenched

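@dream-north Does qwen-plus have built-in web search?

Models with built-in web search seem very likely not to call the LobeChat-side search tool (even when "use the model's built-in search engine" is unchecked). Using current-events questions, I reproduced the same problem on the Gemini 2.0 series models that was reported above for the qwen models.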


@Sun-drenched

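The naive crawl method is quite limited: it easily picks up useless content or images (links) that the model cannot interpret. The truncation mechanism makes this worse, because it keeps the useless content and cuts the useful content. In this test, 3 pages were crawled, and of the 21,000 characters perhaps only 5,000 were useful.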

Image

Image

Image

Image

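Could a mini model be added as a pre-filter for naively crawled content, to avoid mechanical truncation that hurts quality and wastes the advanced model's tokens? With a cheap or free mini model, or a locally deployed model, both quality and cost-effectiveness should improve considerably. That would also make it possible to crawl more pages.

Alternatively, make Jina the first choice (20 rpm should be enough for most community-edition users who do not offer a public service).

By the way, how is the crawl method currently selected? There doesn't seem to be an effective quality detection and switching mechanism. The page shown in the first image was crawled with naive, and none of the 38000/7000 characters were useful. The next two pages were of better quality (at least they did not trigger a login), yet those were crawled with Jina.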
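The quality detection and switching idea raised in this comment could be sketched roughly as follows. This is only an illustration, not LobeChat's actual crawler code: the `Crawler` type, the `NamedCrawler` shape, the function names, and the thresholds are all assumptions, and the login check is a deliberately crude heuristic.

```typescript
// Hypothetical crawler abstraction: takes a URL, resolves to extracted page text.
type Crawler = (url: string) => Promise<string>;

interface NamedCrawler {
  name: string; // e.g. "naive" or "jina" (labels only, no real API is implied)
  crawl: Crawler;
}

// Share of the text that remains after removing link and image markup.
function usefulRatio(text: string): number {
  if (text.length === 0) return 0;
  const stripped = text
    .replace(/!?\[[^\]]*\]\([^)]*\)/g, "") // markdown links and images
    .replace(/https?:\/\/\S+/g, "") // bare URLs
    .replace(/\s+/g, " ")
    .trim();
  return stripped.length / text.length;
}

// Crude usability check; the default thresholds are arbitrary placeholders.
function looksUsable(text: string, minChars = 200, minRatio = 0.5): boolean {
  // Treat a login prompt near the top of the page as a sign of a login wall.
  const loginWall = /sign in|log in|登录/i.test(text.slice(0, 500));
  return text.length >= minChars && usefulRatio(text) >= minRatio && !loginWall;
}

// Try each crawler in order and return the first result that passes the check.
async function crawlWithFallback(
  url: string,
  crawlers: NamedCrawler[],
): Promise<{ name: string; text: string } | undefined> {
  for (const { name, crawl } of crawlers) {
    try {
      const text = await crawl(url);
      if (looksUsable(text)) return { name, text };
    } catch {
      // A failed crawler is treated like an unusable result: move on to the next one.
    }
  }
  return undefined;
}
```

With this shape, the order of the `crawlers` array expresses the preference (for example Jina first, then naive), and a page that yields mostly link or image markup falls through to the next crawler instead of being truncated and passed to the model.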


@dream-north

After deciding to search, qwen-plus calls SearXNG but does not call crawl, and the answer is wrong.
Image

gpt-4o calls crawl after every search, and the answer is correct.
Image


@AAEE86

AAEE86 commented Mar 3, 2025

Image

The searchWithSearXNG search page is not shown when using DeepSeek R1.

@ChenLuoi

ChenLuoi commented Mar 3, 2025

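Relying on function calling to decide whether a page needs to be crawled depends heavily on the model's own capability and inclination. A sufficiently capable model can iterate over multiple steps and achieve something similar to DeepResearch, but an average model usually won't go on to fetch page content with crawl.
I suggest offering an optional forced mechanism that crawls the page content of the N most relevant sources among the SearXNG results. This could noticeably improve the experience in certain search scenarios.
The same approach can carry over when general search is built later. Users could configure, as needed, the threshold for deep-crawling search results.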
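A minimal sketch of the selection step behind such a forced mechanism, assuming each search result carries a numeric relevance score. The `SearchResult` and `ForcedCrawlOptions` types and the `selectUrlsToCrawl` function are hypothetical names for illustration; the actual SearXNG result shape used by LobeChat may differ.

```typescript
// Hypothetical search result shape; field names are assumptions.
interface SearchResult {
  url: string;
  title: string;
  score: number; // relevance, higher is better
}

interface ForcedCrawlOptions {
  topN: number; // maximum number of pages to crawl
  minScore: number; // user-configurable threshold for deep crawling
}

// Choose which URLs to crawl unconditionally, without waiting for the
// model to issue a crawl tool call on its own.
function selectUrlsToCrawl(
  results: SearchResult[],
  options: ForcedCrawlOptions,
): string[] {
  const seen = new Set<string>();
  return [...results]
    .filter((r) => r.score >= options.minScore) // apply the threshold
    .sort((a, b) => b.score - a.score) // most relevant first
    .filter((r) => {
      // Drop duplicate URLs, keeping the highest-scored occurrence.
      if (seen.has(r.url)) return false;
      seen.add(r.url);
      return true;
    })
    .slice(0, options.topN)
    .map((r) => r.url);
}
```

Because the selection is a pure function, the threshold and N can be exposed as user settings, and the crawl itself can stay in whatever crawler the deployment already uses.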


@AnotiaWang

The searchWithSearXNG search page is not shown when using DeepSeek R1.

That's a hallucination: it simply made up the search results and citation data.


@AAEE86

AAEE86 commented Mar 3, 2025

The searchWithSearXNG search page is not shown when using DeepSeek R1.

That's a hallucination: it simply made up the search results and citation data.

It really is.....


@arvinxx
Contributor Author

arvinxx commented Mar 3, 2025

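Relying on function calling to decide whether a page needs to be crawled depends heavily on the model's own capability and inclination. A sufficiently capable model can iterate over multiple steps and achieve something similar to DeepResearch, but an average model usually won't go on to fetch page content with crawl.
I suggest offering an optional forced mechanism that crawls the page content of the N most relevant sources among the SearXNG results. This could noticeably improve the experience in certain search scenarios.

Great minds think alike. This is exactly what Step 5 and Step 6 of #6277 plan to do. There will be a "forced online search" mode, in which a small model can be integrated for upfront intent understanding.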


@chung1912
Contributor

A button to toggle deep search on and off is needed.




10 participants