Search with Page Content Crawl Feedback | 搜索 - 页面深度抓取功能反馈 #6632

arvinxx · 2025-03-02T14:54:34Z

Following up on #6482: v1.67.1 officially releases the page crawl feature for online search. Please share any bad cases you run into while using it, so we can improve the page crawl experience together.

dream-north · 2025-03-02T15:02:50Z

Nice! gpt-4o works fine, but the qwen series models don't seem to trigger the Crawl call. Is there a way to fix this?

arvinxx · 2025-03-02T15:26:29Z

@dream-north If the model doesn't support function calling, it probably won't work.

dream-north · 2025-03-02T15:49:00Z

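@arvinxx I'm using qwen-plus, which does support function calling. Here is what I see:

With smart web search on, "use the model's built-in search engine" off, and web-browsing on: the SearXNG search page is not triggered, and crawl is not triggered.
With smart web search off and web-browsing on: the SearXNG search page is triggered, but crawl is not.
Passing a URL directly, as in your example: crawl is not triggered.

With gpt-4o, SearXNG and crawl are called reliably once smart web search is on.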

arvinxx · 2025-03-02T15:57:28Z

@dream-north Does qwen-plus have a built-in web search feature?

dream-north · 2025-03-02T16:15:40Z

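@arvinxx Yes, exactly. It feels like the two search paths are getting mixed up: with smart web search on, qwen-plus does not call SearXNG even when "use the model's built-in search engine" is turned off.

After switching to qwen2.5-72b, which has no built-in search, SearXNG is called reliably, but unlike gpt-4o it does not call crawl on the search results.

I tried again: with smart web search on, qwen-plus can call crawl to summarize a page as in your example. Nice.

On a separate note, Bailian's qwen API now seems to support streaming output with function calling. After changing stream to true in src/libs/agent-runtime/qwen/index.ts, streaming output works normally. Related issue: #4567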

Cheukfung · 2025-03-02T16:19:09Z

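My tests work fine.
Version: docker database 1.67.1
Model: qwen-plus (supports function calling, has built-in web search)
Settings: smart web search on, the model's built-in search off; sending a link successfully triggers crawl
Screenshot: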

dream-north · 2025-03-02T16:37:54Z

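@Cheukfung Thanks, sending a link now triggers crawl for me too.

For search, when I ask "search for today's NBA games" as you did, SearXNG is triggered;
but when I ask "why did Trump and Zelensky argue", it looks like the built-in search was called, and the answer cited a bogus example.com reference link.

With gpt-4o-mini, the same question triggers crawl after the search. That has never happened with the qwen series models.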

Sun-drenched · 2025-03-02T16:41:49Z

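Models with built-in web search seem very likely to skip the search tool on the LobeChat side (even when "use the model's built-in search engine" is unchecked). Using current-events questions, I reproduced the same problem on the Gemini 2.0 series models as with the qwen models above.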

Sun-drenched · 2025-03-02T17:00:34Z

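The naive crawl method is quite limited: it easily picks up useless content or images (links) that the model can't interpret. The truncation mechanism makes this worse, since it can keep the useless content and cut the useful content. In this test, 3 pages were crawled, and of the 21,000 characters perhaps only 5,000 were useful.

Could a mini model be added as a pre-filter for the naively crawled content, to avoid mechanical truncation, which hurts quality and wastes the main model's tokens? With a cheap or free mini model, or a locally deployed one, both quality and cost-effectiveness should improve a lot. That would also make it possible to crawl more pages.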
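
A minimal sketch of the "filter first, truncate second" idea, for illustration only. It is not LobeChat's implementation: `BlockScorer`, `heuristicScore`, and `filterThenTruncate` are hypothetical names, and the link-density heuristic is just a cheap stand-in for the mini model proposed above. A real pre-filter could plug a model call in as the `score` function.

```typescript
// Hypothetical sketch: none of these names come from LobeChat.
// A scorer rates a block of crawled text from 0 (useless) to 1 (useful).
type BlockScorer = (block: string) => number;

// Cheap stand-in for a mini model: the share of characters that are
// NOT part of markdown links/images or bare URLs.
const heuristicScore: BlockScorer = (block) => {
  const text = block.trim();
  if (text.length === 0) return 0;
  const matches = text.match(/!?\[[^\]]*\]\([^)]*\)|https?:\/\/\S+/g) ?? [];
  const linkChars = matches.reduce((sum, m) => sum + m.length, 0);
  return (text.length - linkChars) / text.length;
};

// Drop low-scoring blocks first, then fill the character budget in
// document order, so truncation only ever cuts content that passed the filter.
function filterThenTruncate(
  content: string,
  maxChars: number,
  score: BlockScorer = heuristicScore,
  minScore = 0.5,
): string {
  const blocks = content
    .split(/\n{2,}/)
    .map((b) => b.trim())
    .filter((b) => b.length > 0);

  const kept: string[] = [];
  let used = 0;
  for (const block of blocks) {
    if (score(block) < minScore) continue; // skip navigation, image links, etc.
    const cost = block.length + (kept.length > 0 ? 2 : 0); // 2 = "\n\n" separator
    if (used + cost > maxChars) break; // budget exhausted
    kept.push(block);
    used += cost;
  }
  return kept.join("\n\n");
}
```

The threshold `minScore` and the block splitting on blank lines are arbitrary choices for the sketch; both would need tuning against real crawled pages.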

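Alternatively, make Jina the first choice (20 rpm should be enough for most community-edition users who don't offer a public service).

By the way, how is the crawl method currently selected? There doesn't seem to be any effective quality detection or switching mechanism. The page in the first screenshot was crawled with naive, and none of the 38000/7000 characters were useful. The next two pages were of better quality (at least they didn't hit a login wall), yet those were the ones crawled with Jina.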
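
For illustration, a quality check with fallback between crawlers could look roughly like the sketch below. Everything here is an assumption rather than LobeChat's actual selection logic: the `Crawler` interface, `looksUsable`, `crawlWithFallback`, the 200-character minimum, and the login-wall markers are all hypothetical.

```typescript
// Hypothetical sketch: these types and names are not from LobeChat.
interface Crawler {
  name: string;
  crawl: (url: string) => Promise<string>;
}

interface CrawlResult {
  crawler: string;
  content: string;
}

// Very rough quality check: reject results that are too short, and short
// pages that look like a login wall. The markers are illustrative only.
function looksUsable(content: string, minChars = 200): boolean {
  const text = content.trim();
  if (text.length < minChars) return false;
  const loginWall = /sign in|log in|登录/i;
  // A long article may mention "log in" somewhere, so only treat the
  // marker as fatal when the page is still fairly short.
  return !(loginWall.test(text) && text.length < minChars * 5);
}

// Try each crawler in order and return the first result that passes
// the quality check; fall through on errors or low-quality output.
async function crawlWithFallback(
  url: string,
  crawlers: Crawler[],
): Promise<CrawlResult | undefined> {
  for (const crawler of crawlers) {
    try {
      const content = await crawler.crawl(url);
      if (looksUsable(content)) return { crawler: crawler.name, content };
    } catch {
      // Network or parsing failure: move on to the next crawler.
    }
  }
  return undefined;
}
```

With a list such as `[naive, jina]`, a page that naive can only fetch as a login wall would be retried with the second crawler instead of being passed to the model as-is.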

dream-north · 2025-03-03T00:11:55Z

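qwen-plus does call SearXNG once it decides to search, but it doesn't call crawl, and the answer is wrong.

gpt-4o calls crawl after every search, and the answer is correct.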

AAEE86 · 2025-03-03T01:02:23Z

When using DeepSeek R1, the searchWithSearXNG search page is not displayed.

ChenLuoi · 2025-03-03T02:29:45Z

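Using function calling to decide whether a page needs to be crawled depends heavily on the model's own capability and inclinations. A sufficiently capable model can iterate over multiple steps and achieve something like DeepResearch, but an average model usually won't go on to crawl page content.
I suggest providing an optional forced mechanism that crawls the page content of the N most relevant source sites among the SearXNG results. This could noticeably improve the experience in certain search scenarios.
The same idea can carry over when general search is built later. Users could configure the threshold for deep-crawling search results as needed.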
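
The suggestion above can be sketched as a small selection step that runs after the search and before any model decision. This is only an illustration under stated assumptions: `SearchResult`, its `score` field, and `selectUrlsToCrawl` are hypothetical, and nothing here reflects how LobeChat or the planned forced mode actually works.

```typescript
// Hypothetical sketch: the result shape and function name are assumptions.
interface SearchResult {
  url: string;
  title: string;
  score: number; // relevance score, higher is better (assumed to be available)
}

// Pick up to `topN` URLs to crawl unconditionally: highest score first,
// at most one URL per site, and nothing below the user-configured threshold.
function selectUrlsToCrawl(
  results: SearchResult[],
  topN: number,
  minScore = 0,
): string[] {
  const ranked = [...results].sort((a, b) => b.score - a.score);
  const seenHosts = new Set<string>();
  const selected: string[] = [];

  for (const result of ranked) {
    if (selected.length >= topN) break;
    if (result.score < minScore) break; // everything after this ranks lower
    const match = result.url.match(/^https?:\/\/([^/]+)/i);
    if (!match) continue; // skip anything that is not an http(s) URL
    const host = match[1].toLowerCase();
    if (seenHosts.has(host)) continue; // one page per source site
    seenHosts.add(host);
    selected.push(result.url);
  }
  return selected;
}
```

Here `topN` and `minScore` play the role of the user-configurable threshold: setting `topN` to 0 disables forced crawling, while a higher `minScore` limits it to clearly relevant results.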

AnotiaWang · 2025-03-03T02:40:30Z

This is a hallucination: the model simply fabricated the search results and citations.

AAEE86 · 2025-03-03T02:47:46Z

It really is.....

arvinxx · 2025-03-03T03:39:02Z

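Great minds think alike. This is what Step 5 and Step 6 of #6277 are planned to cover. There will be a "forced search" mode, in which a small model can be integrated for upfront intent understanding.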

chung1912 · 2025-03-03T05:07:52Z

A button to toggle deep search on and off needs to be added.

arvinxx mentioned this issue Mar 2, 2025

Application level Search Feedback | 应用级搜索反馈 #6482

Closed

arvinxx pinned this issue Mar 2, 2025

arvinxx mentioned this issue Mar 3, 2025

🐛 fix: Fix page crash with crawler error #6662

Merged
