Search with Page Content Crawl Feedback | 搜索 - 页面深度抓取功能反馈 #6632
Comments
Nice! gpt-4o works fine. However, the qwen series models don't seem to trigger the Crawl call properly. Is there a way to fix this?
@dream-north If the model doesn't support function calling, it won't work.
@arvinxx I'm using qwen-plus, which does support function calling. In my testing, gpt-4o reliably calls searXNG and crawl once smart web search is enabled.
@dream-north Does qwen-plus have built-in web search?
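@arvinxx Yes, exactly. It feels like the two web search paths are getting mixed up: with smart web search enabled, qwen-plus doesn't call searXNG even when "Use model's built-in search engine" is turned off. Switching to qwen2.5-72b, which has no built-in search, calls searXNG reliably, but it doesn't call crawl on the search results the way gpt-4o does. I tried again, and with smart web search enabled qwen-plus can call crawl to summarize a page when following your example. Nice. On a separate note, Bailian's qwen API now seems to support streaming output with function calling: after changing stream to true in src/libs/agent-runtime/qwen/index.ts, streaming output works normally. Related issue: #4567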
@Cheukfung Thanks, sending a link now triggers crawl successfully for me too. For search, asking "搜索今天的nba比赛" ("Search today's NBA games") like you did triggers SearXNG. With the same question, gpt-4o-mini calls crawl after searching, which I have never seen happen with the qwen series models.
It seems that models with built-in web search will most likely not call the LobeChat-side search tool (even when "Use model's built-in search engine" is unchecked). Using current-affairs questions, I reproduced the same problem seen with the qwen models above on the Gemini 2.0 series models.
The naive crawling method is quite limited: it easily picks up useless content or images (links) that the model cannot interpret. On top of that, the truncation mechanism ends up keeping the useless content and cutting the useful content. In this test, 3 web pages were crawled, and of 21,000 characters perhaps only 5,000 were useful.

Could a mini model be added as a pre-filter for naively crawled content, to avoid mechanical truncation that hurts results and wastes tokens on the advanced model? With a cheap or free mini model, or a locally deployed model, both quality and cost-effectiveness should be much better. That would also make it possible to crawl more pages.

Alternatively, make Jina the first choice (20 rpm should be enough for most community edition users who don't offer a public service).

By the way, how is the crawling method currently selected? It feels like there is no effective quality detection or switching mechanism. The page shown in Figure 1 was crawled with naive, and not one of its 38000/7000 characters was useful. The next two pages were of better quality (at least they didn't trigger a login), yet those were the ones crawled with Jina.
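To make the suggestion above concrete, here is a minimal sketch of quality detection, filtering before truncation, and switching between crawlers. It uses a cheap line-based heuristic as a stand-in for the proposed mini model, and every name in it (`usefulRatio`, `filterThenTruncate`, `crawlWithFallback`, the 40-character and 0.5 thresholds) is a hypothetical choice for illustration, not LobeChat's actual implementation.

```typescript
// Hypothetical sketch: none of these names or thresholds come from LobeChat.
type Crawler = (url: string) => Promise<string>;

interface ScoredCrawl {
  crawler: string;
  content: string;
  ratio: number;
}

const NOISE_PATTERNS: RegExp[] = [
  /^!?\[[^\]]*\]\([^)]*\)$/, // a line that is only a Markdown link or image
  /^https?:\/\/\S+$/, // a line that is only a bare URL
  /^(sign in|log in|sign up)\b/i, // login-wall prompts
];

function nonEmptyLines(content: string): string[] {
  return content
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// A line counts as useful if it is long enough to be prose and is not pure noise.
function isUseful(line: string, minLength = 40): boolean {
  return line.length >= minLength && !NOISE_PATTERNS.some((p) => p.test(line));
}

// Share of characters that sit on useful lines, between 0 and 1.
function usefulRatio(content: string): number {
  const lines = nonEmptyLines(content);
  if (lines.length === 0) return 0;
  const total = lines.reduce((n, line) => n + line.length, 0);
  const useful = lines
    .filter((line) => isUseful(line))
    .reduce((n, line) => n + line.length, 0);
  return useful / total;
}

// Drop noise first, then truncate, so the character budget goes to useful text.
function filterThenTruncate(content: string, maxChars: number): string {
  return nonEmptyLines(content)
    .filter((line) => isUseful(line))
    .join("\n")
    .slice(0, maxChars);
}

// Try crawlers in order; stop at the first result that clears the quality bar,
// otherwise return the best-scoring result seen.
async function crawlWithFallback(
  url: string,
  crawlers: [string, Crawler][],
  minRatio = 0.5,
): Promise<ScoredCrawl> {
  let best: ScoredCrawl | undefined;
  for (const [name, crawl] of crawlers) {
    let content: string;
    try {
      content = await crawl(url);
    } catch {
      continue; // a failed crawler just means we move on to the next one
    }
    const ratio = usefulRatio(content);
    if (!best || ratio > best.ratio) best = { crawler: name, content, ratio };
    if (ratio >= minRatio) break;
  }
  if (!best) throw new Error(`all crawlers failed for ${url}`);
  return best;
}
```

A caller would pass something like `[["naive", naiveCrawl], ["jina", jinaCrawl]]`, where both functions are placeholders for whatever crawlers are available. A small model could replace `isUseful` without changing the rest of the flow.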
Using function calling to decide whether to crawl a page depends heavily on the model's own capability and inclination. A sufficiently capable model can iterate over multiple steps and achieve something similar to DeepResearch, but an average model usually won't go on to fetch page content through crawl.
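The difference described above can be sketched as a tool loop in which the model alone chooses each next step. This is an illustrative sketch only: the types, the `runAgent` function, and the two mock models are hypothetical, the tools return canned strings, and the loop is kept synchronous for brevity, whereas a real model call would be asynchronous.

```typescript
// Hypothetical sketch: the types and function names are illustrative, not LobeChat's API.
type ToolName = "search" | "crawl";

interface ToolCall {
  tool: ToolName;
  input: string;
}

interface FinalAnswer {
  answer: string;
}

type ModelTurn = ToolCall | FinalAnswer;

// A model looks at the transcript so far and picks the next turn.
type Model = (transcript: string[]) => ModelTurn;
type Tools = Record<ToolName, (input: string) => string>;

function runAgent(
  model: Model,
  tools: Tools,
  question: string,
  maxSteps = 5,
): { answer: string; calls: ToolName[] } {
  const transcript = [`user: ${question}`];
  const calls: ToolName[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = model(transcript);
    // The loop never forces a crawl: if the model answers, the run ends.
    if ("answer" in turn) return { answer: turn.answer, calls };
    calls.push(turn.tool);
    const output = tools[turn.tool](turn.input);
    transcript.push(`${turn.tool}(${turn.input}) -> ${output}`);
  }
  return { answer: "step limit reached", calls };
}

// Canned tools so the sketch runs without network access.
const mockTools: Tools = {
  search: (query) => `https://example.com/results?q=${encodeURIComponent(query)}`,
  crawl: (url) => `page text from ${url}`,
};

// Mock of a model that answers straight from search snippets.
const stopsAfterSearch: Model = (transcript) =>
  transcript.length === 1
    ? { tool: "search", input: "nba games today" }
    : { answer: "summary from snippets only" };

// Mock of a model that follows the search with a crawl before answering.
const searchesThenCrawls: Model = (transcript) => {
  if (transcript.length === 1) return { tool: "search", input: "nba games today" };
  if (transcript.length === 2) return { tool: "crawl", input: "https://example.com/nba" };
  return { answer: "summary from full page text" };
};
```

Running both mocks through `runAgent` with the same question gives a `calls` list of `search` only for the first and `search` then `crawl` for the second. Whether crawl happens at all is decided inside the model, which is why behavior varies so much across models.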
This is a hallucination: the model simply fabricated the search results and cited data.
It really is...
Great minds think alike. This is what Step 5 and Step 6 of #6277 are planned to cover. There will be a "forced web search" mode, in which a small model can be integrated for upfront intent understanding.
A button to toggle deep search needs to be added.
Following up on #6482: v1.67.1 officially releases the page crawling feature for web search. Everyone is welcome to report any bad cases encountered during use, so we can improve the page crawling experience together.