RAG Web Browser implementation #1
Conversation
I need to change this actor to incorporate standby mode, as I initially misunderstood the standby functionality. Originally, I thought that standby mode would automatically wrap the existing actor into an "API server", exposing the running container. Therefore, I expected that sending a POST request to container_url with the payload would work.
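As an illustration, a minimal sketch of the request I originally expected to work (the container URL and the payload shape are placeholders, not the actual Standby API):

```typescript
// Hypothetical: the URL and payload fields are illustrative only.
const response = await fetch('https://<container-url>/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: 'apify', maxResults: 3 }),
});
const results = await response.json();
```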
Good work @jirispilka, here is feedback from my side:
How about RAG Web Browser (rag-web-browser)?
Absolutely, I was thinking about INPUT like this:
The Standby API should just replicate the input 1:1 and expose a GET API endpoint at e.g.
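A hedged sketch of what such a call might look like (the `/search` path and the parameter names are hypothetical, simply mirroring the Actor input 1:1):

```typescript
// Hypothetical: parameter names mirror the Actor input 1:1; the exact
// endpoint path is not finalized here.
const url = new URL('https://<standby-url>/search');
url.searchParams.set('query', 'apify');
url.searchParams.set('maxResults', '3');
const res = await fetch(url);
const data = await res.json();
```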
Sounds great
Yeah, makes sense.
Yeah absolutely, also if we use just one browser (Firefox?), it will keep the Docker image small and thus fast to start.
Sounds good, the results could look like this (either stored to the output dataset, or returned directly from the Standby API):

```jsonc
[
  {
    "text": "...chunk of text from target page...",
    "markdown": "...", // if enabled
    "metadata": {
      "url": "http://www.example.com/dir/page.htm"
      // "createdAt": ISODate, // date of creation, if available
      // "author": "Bob Dylan", // author of the page, if available
    }
  }
  // ...
]
```
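For clarity, the same draft expressed as a TypeScript type (field names come from the JSON sketch above; which fields are optional is an assumption):

```typescript
// Sketch of the proposed output shape; matches the JSON draft above.
interface SearchResult {
    text: string;           // chunk of text from the target page
    markdown?: string;      // only if Markdown output is enabled
    metadata: {
        url: string;
        createdAt?: string; // ISO date of creation, if available
        author?: string;    // author of the page, if available
    };
}

type Output = SearchResult[];
```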
Yeah, definitely, we just need to figure out how we are going to return file content in the Standby API: in Base64, or via the kv-store?
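A sketch of the two options using the Apify SDK (the record key and the return shape are assumptions, not a decided design):

```typescript
import { Actor } from 'apify';

// Sketch: two ways to return file content from the Standby API.
async function returnFile(fileBuffer: Buffer) {
    // Option 1: inline Base64 (simple, but inflates large JSON responses).
    const base64 = fileBuffer.toString('base64');

    // Option 2: store the file in the default key-value store and return
    // a link instead (the key 'result-file' is a hypothetical example).
    await Actor.setValue('result-file', fileBuffer, {
        contentType: 'application/octet-stream',
    });

    return { base64 };
}
```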
@jancurn Thanks for the feedback!
Ales has suggested:
I don't want to confuse users about whether this refers to the number of Google search pages to crawl or the number of Google search results to return. I think
Thanks, will do
yes, firefox
OK, I'm using a flat structure now, i.e.
Good point with the metadata. If we want to store additional info, it would be better to create the metadata field right away.
For inspiration on the results format, see https://tavily.com/#api. Note that they just use
Looks great!
* Working solution on the platform
* Rename actor
* Working normal mode -> had to set keepAlive=false
* feat: Handle inputs correctly, define output format (#3)
* Output format definition
* Correctly handle proxy configuration
* Remove Actor.isAtHome() and simplify standby and normal mode. Add a comment explaining why we have the crawlers map (see the sketch after this list).
* feat: Add in-memory request queue (#4)
* Move apify/google-search and website-content-crawler code to separate directories to improve readability.
* Update README.md and add measured time.
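A minimal sketch of the crawlers-map idea referenced above (the function name and key scheme are assumptions): keep one crawler per distinct configuration so that standby requests with the same options reuse an already-running crawler instead of starting a new one.

```typescript
import { PlaywrightCrawler, PlaywrightCrawlerOptions } from 'crawlee';

// One crawler per distinct configuration; repeated standby requests with
// the same options reuse a warm crawler.
const crawlers = new Map<string, PlaywrightCrawler>();

function getCrawler(options: PlaywrightCrawlerOptions): PlaywrightCrawler {
    const key = JSON.stringify(options);
    let crawler = crawlers.get(key);
    if (!crawler) {
        crawler = new PlaywrightCrawler(options);
        crawlers.set(key, crawler);
    }
    return crawler;
}
```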
@metalwarrior665 @jancurn I would like to ask you again for a review. Performance-wise, there is still plenty of room for improvement. I did some ad-hoc testing, with the following summary: here is a typical latency breakdown for the RAG Web Browser. The numbers below are based on the search terms "apify", "Donald Trump", and "boston", averaged over the three queries.
You can find a breakdown in the performance_measures.md
There is some low-hanging fruit, like:
Great job. 2 GB is definitely suboptimal for ideal speed. I think you could even go higher than 4 GB for the default, depending on whether cost or latency is the bigger priority.
How many pages do you think it will usually scrape? If 5 is typical, you could theoretically hyper-optimize by spawning those in separate standby Actors.
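A rough sketch of that fan-out idea (the standby URL and the `/scrape` endpoint are hypothetical):

```typescript
// Hypothetical hyper-optimization: fetch each page via a separate standby
// Actor request in parallel, instead of one crawler scraping sequentially.
const urls = ['https://example.com/a', 'https://example.com/b'];
const pages = await Promise.all(
    urls.map(async (u) => {
        const res = await fetch(`https://<standby-url>/scrape?url=${encodeURIComponent(u)}`);
        return res.json();
    }),
);
```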
* Update README.md with information about memory/latency
I updated the README in this spirit
I'm not sure how many pages people might use; my guess is somewhere around 1 to 5. I'm starting Playwright with
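The exact launch options are cut off above; assuming the Firefox setup discussed earlier (to keep the Docker image small), the launch might look roughly like this, with the options shown being illustrative:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Assumption: Firefox as the single browser, per the discussion above;
// the remaining options are placeholders.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: true },
    },
});
```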
* Add description, title, and url to the userData and response
* Add info about GPT actions
* Fix Google Search output
* Fix response with GoogleSearchResult, fix Output, update latency in README.md
* Add paragraph about timeout
* Disable debugMode
* Update README.md
* Add ability to create new crawlers using query parameters
* feat: Update Dockerfile to Node version 22, fix Playwright key creation (#13)
* Update README.md to include tips on improving latency
* Set initialConcurrency to 5
* Set minConcurrency to 3
* Update README.md with information about the default country for Google Search
Let me merge this PR now for the following reasons:
Description
Output: Configurable options include text (default), HTML (optional), and Markdown (optional).
See the Slack discussion for reference.
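For illustration, requesting the optional formats could look like this (the `outputFormats` parameter name is an assumption, not the finalized input schema):

```typescript
// Hypothetical query enabling Markdown output alongside the default text.
const url = new URL('https://<standby-url>/search');
url.searchParams.set('query', 'apify');
url.searchParams.set('outputFormats', 'text,markdown');
const results = await (await fetch(url)).json();
```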