RAG Web Browser implementation #1

Merged: 27 commits merged from feat/actor-impl into master on Oct 2, 2024
Conversation

@jirispilka (Collaborator) commented Aug 16, 2024

Description

  • Inputs: I aim to keep the configuration minimal yet usable (limited to EN, US for now).
  • Functionality: Only extract titles and URLs from Google search results (adapted from Google extractors).
  • Cheerio: Used for Google Search to improve speed.
  • Playwright: Used for scraping (might be slow, but JavaScript rendering is a requirement). I avoided complicating this with adaptive crawling or crawler selection. We can keep it for later.
  • Output: Configurable options include text (default), HTML (optional), and Markdown (optional).

See the Slack discussion for reference.

@jirispilka (Collaborator, Author)

I need to change this Actor to incorporate standby mode, as I initially misunderstood the standby functionality. Originally, I thought that standby mode would automatically wrap the existing Actor into an "API server", exposing the running container. Therefore, I expected that sending a POST request to container_url with the payload would work.
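To illustrate the distinction, in standby mode the Actor itself has to run an HTTP server that the platform routes requests to; it is not wrapped in an API automatically. A minimal sketch follows; the `ACTOR_STANDBY_PORT` variable name, the route, and the response shape are assumptions for illustration, not taken from this PR:

```typescript
import { createServer } from 'node:http';

// Build the JSON body for one standby request (a pure function, easy to test).
export function handleSearch(rawUrl: string): string {
    const url = new URL(rawUrl, 'http://localhost');
    const query = url.searchParams.get('query') ?? '';
    return JSON.stringify({ query, results: [] });
}

// The Actor must start its own server in standby mode; the platform then
// forwards incoming requests to it. Port variable name is an assumption.
const port = Number(process.env.ACTOR_STANDBY_PORT ?? 0);
if (port > 0) {
    createServer((req, res) => {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(handleSearch(req.url ?? '/'));
    }).listen(port);
}
```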

@jancurn (Member) commented Aug 22, 2024

Good work @jirispilka, here is feedback from my side:

  • Name: I asked Ales W. for a good name (to avoid overlaps with the existing actors)

How about RAG Web browser (rag-web-browser)?

  • Inputs: I aim to keep the configuration minimal yet usable (limit to EN, US for now).

Absolutely, I was thinking about INPUT like this:

  • I’d use query instead of queries, just to keep it simple and nice in code examples. In the Standby API, you can call this endpoint multiple times for multiple queries anyway, which better balances the performance.
  • maxPages - Maximum resulting number of web pages to load.
  • FUTURE: searchType (select: search|images|...) - It would be great to expand to Google Image search too, now that "traditional or naive Retrieval-Augmented Generation (RAG) is dead" :)
  • formats (multi-option: text|markdown|html|image|pdf|docx|file) - Select the formats to be included in the results.
  • pageTimeoutSecs - How many seconds to wait for the target web page to load. If the timeout is reached, the page will be skipped in the results. This is to ensure we're within the ~40-second timeout for function calling in GPTs and elsewhere. Default: 30 seconds.

The Standby API should just replicate input 1:1, and expose GET API endpoint at e.g. /search?query=test&formats=html,markdown&pageTimeoutSecs=5&...
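Replicating the input 1:1 could be sketched as a small parser from query parameters to the Actor input. This is illustrative only: field names follow the proposal above, while the defaults are assumptions.

```typescript
// Sketch: map the standby GET query parameters 1:1 onto the Actor input.
// Field names follow the proposal above; the defaults are assumptions.
export interface Input {
    query: string;
    maxPages: number;
    formats: string[];       // e.g. ['text', 'markdown', 'html']
    pageTimeoutSecs: number;
}

export function parseInput(params: URLSearchParams): Input {
    return {
        query: params.get('query') ?? '',
        maxPages: Number(params.get('maxPages') ?? 1),
        formats: (params.get('formats') ?? 'text').split(','),
        pageTimeoutSecs: Number(params.get('pageTimeoutSecs') ?? 30),
    };
}
```

For example, `/search?query=test&formats=html,markdown&pageTimeoutSecs=5` would parse into `{ query: 'test', formats: ['html', 'markdown'], pageTimeoutSecs: 5, ... }`.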

  • Functionality: Only extract titles and URLs from Google search results (adapted from Google extractors).

Sounds great

  • Cheerio: Used for Google Search to improve speed.

Yeah, makes sense.

  • Playwright: Used for scraping (might be slow, but JavaScript rendering is a requirement). I avoided complicating this with adaptive crawling or crawler selection. We can keep it for later.

Yeah absolutely, also if we use just one browser (Firefox?), it will keep the Docker image small and thus fast to start.

Output: Configurable options include text (default), HTML (optional), and Markdown (optional).

Sounds good, the results could look like this (either stored to output dataset, or returned directly from Standby API):

[
  {
    "text": "...chunk of text from target page...",
    "markdown": "...",  // if enabled
    "metadata": {
      "url": "http://www.example.com/dir/page.htm"
      // "createdAt": ISO date of creation, if available
      // "author": "Bob Dylan", author of the page, if available
    }
  },
  ...
]
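For illustration, the proposed result shape could be captured in TypeScript types like these (a sketch; which fields are optional is an assumption, not specified in the thread):

```typescript
// Sketch of the proposed result item. markdown appears only when that
// format is enabled; metadata fields only when available.
export interface ResultMetadata {
    url: string;
    createdAt?: string; // ISO date of creation, if available
    author?: string;    // author of the page, if available
}

export interface ResultItem {
    text: string;
    markdown?: string;
    metadata: ResultMetadata;
}

// Hypothetical helper building the minimal valid item.
export function makeItem(text: string, url: string): ResultItem {
    return { text, metadata: { url } };
}
```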

TODOs:

  • add standby mode
  • extract PDF, Docx content?

Yeah, definitely. We just need to figure out how we are going to return file content in the Standby API: in Base64, or via the key-value store?


@jirispilka (Collaborator, Author)

@jancurn Thanks for the feedback!

  • Name: I asked Ales W. for a good name (to avoid overlaps with the existing actors)

How about RAG Web browser (rag-web-browser) ?

Ales has suggested Google Search Data Extractor, which I like. I think it would be good to also include "web browser" in the name.

  • I’d use query instead of queries, just to keep it simple and nice in code example. In Standby API, you can call this endpoint multiple times for multiple queries anyway, which better balances the performance.

I kept queries the same as in the google-search-actor, but I will change it to query (it sounds better).

  • maxPages - Maximum resulting number of web pages to load.

I don't want users to be confused about whether this refers to the number of Google search pages to crawl or the number of Google search results to return. I think maxResults is clearer in that sense. @jancurn what do you think?

  • formats (multi-option: text|markdown|html|image|pdf|docx|file) - Select text formats to be included in the results.
  • pageTimeoutSecs - How many seconds to wait for the target web page to load. If the timeout is reached, the page will be skipped in the results. This is to ensure we're within the ~40-second timeout for function calling in GPTs and elsewhere. Default: 30 seconds.

The Standby API should just replicate input 1:1, and expose GET API endpoint at e.g. /search?query=test&formats=html,markdown&pageTimeoutSecs=5&...

Thanks, will do

  • Playwright: Used for scraping (might be slow, but JavaScript rendering is a requirement). I avoided complicating this with adaptive crawling or crawler selection. We can keep it for later.

Yeah absolutely, also if we use just one browser (Firefox?), it will keep the Docker image small and thus fast to start.

Yes, Firefox.

Output: Configurable options include text (default), HTML (optional), and Markdown (optional).

Sounds good, the results could look like this (either stored to output dataset, or returned directly from Standby API):

[
  {
    "text": "...chunk of text from target page...",
    "markdown": "...",  // if enabled
    "metadata": {
      "url": "http://www.example.com/dir/page.htm"
      // "createdAt": ISO date of creation, if available
      // "author": "Bob Dylan", author of the page, if available
    }
  },
  ...
]

OK, I'm using a flat structure now, i.e.:

{
  "title": "...",
  "url": "...",
  "text": "..."
}

Good point about the metadata. If we want to store additional info, it would be better to create the metadata field right away.

@jancurn (Member) commented Aug 23, 2024

  • "Google Search Data Extractor" is way too close to "Google Search Results Scraper" and all the other Google Search scrapers we already have in Store (see https://apify.com/store/categories?search=google+search). Plus it sounds like it just extracts data from Google Search, which is not true: the data is in web pages! And it doesn't communicate the main application of this Actor, being a web browser for RAG. Let's go with RAG Web browser (apify/rag-web-browser).
  • re maxResults - fair enough, the results can be pages, but also files or images in the future
  • "ok, I'm having flat structure now, i.e." - we need to return results from more than one page, so it needs to be an array. Good point about adding title too.

@jancurn (Member) commented Aug 23, 2024

For inspiration of results, see https://tavily.com/#api:

(Screenshot: example Tavily API results, taken Aug 23, 2024)

Note that they probably just use content for all other kinds of content. But what if we want to return both text and markdown together? I'd stick with our data model.

@metalwarrior665 (Member) left a comment

Looks great!

Review comments were left on src/main.ts, src/extractors.ts, and src/crawlers.ts (since resolved).
* Working solution at platform
* Rename actor
* Working normal mode -> had to set keepAlive=false
* feat: Handle inputs correctly, define output format (#3)
* Output format definition
* Correctly handle proxy configuration
* Remove Actor.isAtHome() and simplify standby and normal mode. Add comment why we have crawlers map.
* feat: Add in-memory request queue (#4)
* Move apify/google-search and website-content-crawler code to separate directories to improve readability.

* Handle failedRequests
* Update eslint config
* Add debug mode and functionality to measure time. Refactor the code to keep the line-len to 120 chars.
* Update information about standby mode.
* Update README.md and add measured time.
@jirispilka changed the title from "WIP: Initial working version" to "RAG Web Browser implementation" on Sep 2, 2024
@jirispilka (Collaborator, Author)

@metalwarrior665 @jancurn I would like to ask you again for review.

Performance-wise, there is still plenty of room for improvement. I did ad-hoc testing, with the following summary:

Here is a typical latency breakdown for the RAG Web Browser. Please note that these results are only indicative and may vary based on the search term and the target websites.

The numbers below are based on the following search terms: "apify", "Donald Trump", "boston". Results were averaged over the three queries.

| Memory (GB) | Max Results | Latency (s) |
|-------------|-------------|-------------|
| 2           | 1           | 36          |
| 2           | 5           | 88          |
| 4           | 1           | 22          |
| 4           | 5           | 46          |

You can find a breakdown in the performance_measures.md

Average times per phase (ms):

request-received: 0
before-cheerio-queue-add: 129
cheerio-request-handler-start: 2499
before-playwright-queue-add: 87
playwright-request-start: 15703
playwright-wait-dynamic-content: 7029
playwright-remove-cookie: 1073
playwright-parse-with-cheerio: 5564
playwright-process-html: 3829
playwright-before-response-send: 236

Time taken for each request (ms): [49762, 16004, 42676]

There is some low-hanging fruit, such as playwright-wait-dynamic-content and playwright-remove-cookie. However, I'm uncertain about the potential improvements for other steps, like playwright-request-start.
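The per-phase numbers above come from instrumenting each step of the request. A minimal sketch of such instrumentation follows; the class and method names are illustrative, not the PR's actual helper:

```typescript
// Sketch: record elapsed milliseconds since the request was received,
// keyed by phase name, so a latency breakdown can be logged per request.
export class TimeMeasures {
    private readonly start = Date.now();
    private readonly marks: { name: string; timeMs: number }[] = [];

    mark(name: string): void {
        this.marks.push({ name, timeMs: Date.now() - this.start });
    }

    // Convert cumulative marks into per-phase durations, as in the
    // breakdown above (each value is time spent in that phase alone).
    breakdown(): Record<string, number> {
        const out: Record<string, number> = {};
        let prev = 0;
        for (const m of this.marks) {
            out[m.name] = m.timeMs - prev;
            prev = m.timeMs;
        }
        return out;
    }
}
```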

@jirispilka jirispilka marked this pull request as ready for review September 2, 2024 14:04
@metalwarrior665 (Member) left a comment

Great job. 2 GB is definitely suboptimal for ideal speed. I think you could even go higher than 4 GB for the default, depending on whether cost or latency is the bigger priority.

How many pages do you think it will usually scrape? If 5 is typical, you could theoretically hyper-optimize by spawning those in separate standby Actors.

* Update README.md with information about memory/latency
@jirispilka (Collaborator, Author)

Great job. 2 GB is definitely suboptimal for ideal speed. I think you could even go higher than 4 GB for the default, depending on whether cost or latency is the bigger priority.

I updated the README in this spirit

Based on your requirements, if low latency is a priority, consider running the Actor with 4GB or more of memory.
However, if you're looking for a cost-effective solution, you can run the Actor with 2GB of memory.

How many pages do you think it will usually scrape? If 5 is typical, you could theoretically hyper-optimize by spawning those in separate standby Actors.

I'm not sure how many pages people might request; my guess is somewhere around 1 to 5. I'm starting Playwright with desiredConcurrency = 3.
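The concurrency settings mentioned in this thread could be collected in an options object like the sketch below. Only desiredConcurrency = 3 is stated here; the other values (maxConcurrency, timeout) are illustrative assumptions, and whether each option lives at the top level or under autoscaledPoolOptions in Crawlee is worth checking against the Crawlee docs.

```typescript
// Sketch of concurrency-related crawler options discussed in this thread.
// desiredConcurrency = 3 mirrors the comment above; other values are
// illustrative assumptions, not taken from the PR.
export const playwrightCrawlerOptions = {
    minConcurrency: 3,
    maxConcurrency: 5,
    requestHandlerTimeoutSecs: 30, // roughly matching pageTimeoutSecs
    autoscaledPoolOptions: {
        desiredConcurrency: 3,
    },
};
```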

jirispilka and others added 11 commits September 4, 2024 16:25
* Add description, title, and url to the userData and response
* Add info about GPT actions
* Fix Google Search output
* Fix response with GoogleSearchResult, fix Output, update latency in README.md
* Add paragraph about time out
* Disable debugMode
* Update README.md
* Add ability to create new crawlers using query parameters
* feat: Update Dockerfile to node version 22, fix playwright key creation (#13)
* Updated README.md to include tips on improving latency
* Set initialConcurrency to 5
* Set minConcurrency to 3
* Update README.md with information about default country for Google Search
@jirispilka
Copy link
Collaborator Author

Let me merge this PR now for the following reasons:

  • The Actor has already been published, so it’s better to have the code in the master branch for anyone visiting the repository.
  • This PR has become quite lengthy and complex.

@jirispilka jirispilka merged commit 932a411 into master Oct 2, 2024
1 check passed
@jirispilka jirispilka deleted the feat/actor-impl branch October 2, 2024 07:27