RAG Web Browser implementation #1

Merged: 27 commits merged from feat/actor-impl into master on Oct 2, 2024
Conversation

@jirispilka (Collaborator) commented Aug 16, 2024

Description

  • Inputs: I aim to keep the configuration minimal yet usable (limited to EN, US for now).
  • Functionality: Only extract titles and URLs from Google search results (adapted from Google extractors).
  • Cheerio: Used for Google Search to improve speed.
  • Playwright: Used for scraping (might be slow, but JavaScript rendering is a requirement). I avoided complicating this with adaptive crawling or crawler selection. We can keep it for later.
  • Output: Configurable options include text (default), HTML (optional), and Markdown (optional).

See the Slack discussion for reference.

@jirispilka (Collaborator, Author)

I need to change this Actor to incorporate standby mode, as I initially misunderstood the standby functionality. Originally, I thought that standby mode would automatically wrap the existing Actor into an "API server", exposing the running container. Therefore, I expected that sending a POST request to container_url with the payload would work.
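To illustrate the distinction, in standby mode the Actor itself has to run an HTTP server that the platform routes requests to; it is not wrapped in an API automatically. A minimal sketch follows; the `ACTOR_STANDBY_PORT` variable name, the route, and the response shape are assumptions for illustration, not taken from this PR:

```typescript
import { createServer } from 'node:http';

// Build the JSON body for one standby request (a pure function, easy to test).
export function handleSearch(rawUrl: string): string {
    const url = new URL(rawUrl, 'http://localhost');
    const query = url.searchParams.get('query') ?? '';
    return JSON.stringify({ query, results: [] });
}

// The Actor must start its own server in standby mode; the platform then
// forwards incoming requests to it. Port variable name is an assumption.
const port = Number(process.env.ACTOR_STANDBY_PORT ?? 0);
if (port > 0) {
    createServer((req, res) => {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(handleSearch(req.url ?? '/'));
    }).listen(port);
}
```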

@jancurn (Member) commented Aug 22, 2024

Good work @jirispilka, here is feedback from my side:

  • Name: I asked Ales W. for a good name (to avoid overlaps with the existing actors)

How about RAG Web browser (rag-web-browser)?

  • Inputs: I aim to keep the configuration minimal yet usable (limit to EN, US for now).

Absolutely, I was thinking about INPUT like this:

  • I’d use query instead of queries, just to keep it simple and nice in code examples. In the Standby API, you can call this endpoint multiple times for multiple queries anyway, which better balances the performance.
  • maxPages - Maximum resulting number of web pages to load.
  • FUTURE: searchType (select: search|images|...) - It would be great to expand to Google Image search too, now that "traditional or naive Retrieval-Augmented Generation (RAG) is dead" :)
  • formats (multi-option: text|markdown|html|image|pdf|docx|file) - Select the formats to be included in the results.
  • pageTimeoutSecs - How many seconds to wait for the target web page to load. If the timeout is reached, the page will be skipped in the results. This is to ensure we're within the ~40-second timeout for function calling in GPTs and elsewhere. Default: 30 seconds.

The Standby API should just replicate input 1:1, and expose GET API endpoint at e.g. /search?query=test&formats=html,markdown&pageTimeoutSecs=5&...
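Replicating the input 1:1 could be sketched as a small parser from query parameters to the Actor input. This is illustrative only: field names follow the proposal above, while the defaults are assumptions.

```typescript
// Sketch: map the standby GET query parameters 1:1 onto the Actor input.
// Field names follow the proposal above; the defaults are assumptions.
export interface Input {
    query: string;
    maxPages: number;
    formats: string[];       // e.g. ['text', 'markdown', 'html']
    pageTimeoutSecs: number;
}

export function parseInput(params: URLSearchParams): Input {
    return {
        query: params.get('query') ?? '',
        maxPages: Number(params.get('maxPages') ?? 1),
        formats: (params.get('formats') ?? 'text').split(','),
        pageTimeoutSecs: Number(params.get('pageTimeoutSecs') ?? 30),
    };
}
```

For example, `/search?query=test&formats=html,markdown&pageTimeoutSecs=5` would parse into `{ query: 'test', formats: ['html', 'markdown'], pageTimeoutSecs: 5, ... }`.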

  • Functionality: Only extract titles and URLs from Google search results (adapted from Google extractors).

Sounds great

  • Cheerio: Used for Google Search to improve speed.

Yeah, makes sense.

  • Playwright: Used for scraping (might be slow, but JavaScript rendering is a requirement). I avoided complicating this with adaptive crawling or crawler selection. We can keep it for later.

Yeah absolutely, also if we use just one browser (Firefox?), it will keep the Docker image small and thus fast to start.

Output: Configurable options include text (default), HTML (optional), and Markdown (optional).

Sounds good, the results could look like this (either stored to output dataset, or returned directly from Standby API):

[
  {
    "text": "...chunk of text from target page...",
    "markdown": "...",  // if enabled
    "metadata": {
      "url": "http://www.example.com/dir/page.htm"
      // "createdAt": ISO date of creation, if available
      // "author": "Bob Dylan", author of the page, if available
    }
  },
  ...
]
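For illustration, the proposed result shape could be captured in TypeScript types like these (a sketch; which fields are optional is an assumption, not specified in the thread):

```typescript
// Sketch of the proposed result item. markdown appears only when that
// format is enabled; metadata fields only when available.
export interface ResultMetadata {
    url: string;
    createdAt?: string; // ISO date of creation, if available
    author?: string;    // author of the page, if available
}

export interface ResultItem {
    text: string;
    markdown?: string;
    metadata: ResultMetadata;
}

// Hypothetical helper building the minimal valid item.
export function makeItem(text: string, url: string): ResultItem {
    return { text, metadata: { url } };
}
```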

TODOs:

  • add standby mode
  • extract PDF, Docx content?

Yeah, definitely. We just need to figure out how we are going to return file content in the Standby API: in Base64, or via the key-value store?


@jirispilka (Collaborator, Author)

@jancurn Thanks for the feedback!

  • Name: I asked Ales W. for a good name (to avoid overlaps with the existing actors)

How about RAG Web browser (rag-web-browser) ?

Ales has suggested Google Search Data Extractor, which I like. I think it would be good to also include "web browser" in the name.

  • I’d use query instead of queries, just to keep it simple and nice in code example. In Standby API, you can call this endpoint multiple times for multiple queries anyway, which better balances the performance.

I kept queries the same as in the google-search-actor, but I will change it to query (it sounds better).

  • maxPages - Maximum resulting number of web pages to load.

I don't want users to be confused about whether this refers to the number of Google search pages to crawl or the number of Google search results to return. I think maxResults is clearer in that sense. @jancurn what do you think?

  • formats (multi-option: text|markdown|html|image|pdf|docx|file) - Select text formats to be included in the results.
  • pageTimeoutSecs - How many seconds to wait for the target web page to load. If the timeout is reached, the page will be skipped in the results. This is to ensure we're within the ~40-second timeout for function calling in GPTs and elsewhere. Default: 30 seconds.

The Standby API should just replicate input 1:1, and expose GET API endpoint at e.g. /search?query=test&formats=html,markdown&pageTimeoutSecs=5&...

Thanks, will do

  • Playwright: Used for scraping (might be slow, but JavaScript rendering is a requirement). I avoided complicating this with adaptive crawling or crawler selection. We can keep it for later.

Yeah absolutely, also if we use just one browser (Firefox?), it will keep the Docker image small and thus fast to start.

Yes, Firefox.

Output: Configurable options include text (default), HTML (optional), and Markdown (optional).

Sounds good, the results could look like this (either stored to output dataset, or returned directly from Standby API):

[
  {
    "text": "...chunk of text from target page...",
    "markdown": "...",  // if enabled
    "metadata": {
      "url": "http://www.example.com/dir/page.htm"
      // "createdAt": ISO date of creation, if available
      // "author": "Bob Dylan", author of the page, if available
    }
  },
  ...
]

OK, I'm using a flat structure now, i.e.:

{
  "title": "...",
  "url": "...",
  "text": "..."
}

Good point about the metadata. If we want to store additional info, it would be better to create the metadata field right away.

@jancurn (Member) commented Aug 23, 2024

  • "Google Search Data Extractor" is way too close to "Google Search Results Scraper" and all the other Google Search scrapers we already have in Store (see https://apify.com/store/categories?search=google+search). Plus it sounds like it just extracts data from Google Search, which is not true: the data is in web pages! And it doesn't communicate the main application of this Actor, being a web browser for RAG. Let's go with RAG Web browser (apify/rag-web-browser).
  • re maxResults - fair enough, the results can be pages, but also files or images in the future
  • "ok, I'm having flat structure now, i.e." - we need to return results from more than one page, so it needs to be an array. Good point about adding title too.

@jancurn (Member) commented Aug 23, 2024

For inspiration of results, see https://tavily.com/#api:

(Screenshot: example Tavily API results, taken Aug 23, 2024)

Note that they probably just use content for all other kinds of content. But what if we want to return both text and markdown together? I'd stick with our data model.

@metalwarrior665 (Member) left a comment

Looks great!

Review comments were left on src/main.ts, src/extractors.ts, and src/crawlers.ts (since resolved).
* Working solution at platform
* Rename actor
* Working normal mode -> had to set keepAlive=false
* feat: Handle inputs correctly, define output format (#3)
* Output format definition
* Correctly handle proxy configuration
* Remove Actor.isAtHome() and simplify standby and normal mode. Add comment why we have crawlers map.
* feat: Add in-memory request queue (#4)
* Move apify/google-search and website-content-crawler code to separate directories to improve readability.

* Handle failedRequests
* Update eslint config
* Add debug mode and functionality to measure time. Refactor the code to keep the line-len to 120 chars.
* Update information about standby mode.
* Update README.md and add measured time.
@jirispilka changed the title from "WIP: Initial working version" to "RAG Web Browser implementation" on Sep 2, 2024
@jirispilka (Collaborator, Author)

@metalwarrior665 @jancurn I would like to ask you again for review.

Performance-wise, there is still plenty of room for improvement. I did ad-hoc testing, with the following summary:

Here is a typical latency breakdown for the RAG Web Browser. Please note that these results are only indicative and may vary based on the search term and the target websites.

The numbers below are based on the following search terms: "apify", "Donald Trump", "boston". Results were averaged over the three queries.

| Memory (GB) | Max Results | Latency (s) |
|-------------|-------------|-------------|
| 2           | 1           | 36          |
| 2           | 5           | 88          |
| 4           | 1           | 22          |
| 4           | 5           | 46          |

You can find a breakdown in the performance_measures.md

Average times per phase (ms):

request-received: 0
before-cheerio-queue-add: 129
cheerio-request-handler-start: 2499
before-playwright-queue-add: 87
playwright-request-start: 15703
playwright-wait-dynamic-content: 7029
playwright-remove-cookie: 1073
playwright-parse-with-cheerio: 5564
playwright-process-html: 3829
playwright-before-response-send: 236

Time taken for each request (ms): [49762, 16004, 42676]

There is some low-hanging fruit, such as playwright-wait-dynamic-content and playwright-remove-cookie. However, I'm uncertain about the potential improvements for other steps, like playwright-request-start.
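The per-phase numbers above come from instrumenting each step of the request. A minimal sketch of such instrumentation follows; the class and method names are illustrative, not the PR's actual helper:

```typescript
// Sketch: record elapsed milliseconds since the request was received,
// keyed by phase name, so a latency breakdown can be logged per request.
export class TimeMeasures {
    private readonly start = Date.now();
    private readonly marks: { name: string; timeMs: number }[] = [];

    mark(name: string): void {
        this.marks.push({ name, timeMs: Date.now() - this.start });
    }

    // Convert cumulative marks into per-phase durations, as in the
    // breakdown above (each value is time spent in that phase alone).
    breakdown(): Record<string, number> {
        const out: Record<string, number> = {};
        let prev = 0;
        for (const m of this.marks) {
            out[m.name] = m.timeMs - prev;
            prev = m.timeMs;
        }
        return out;
    }
}
```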

@jirispilka jirispilka marked this pull request as ready for review September 2, 2024 14:04
@metalwarrior665 (Member) left a comment

Great job. 2 GB is definitely suboptimal for ideal speed. I think you could even go higher than 4 GB for the default, depending on whether cost or latency is the bigger priority.

How many pages do you think it will usually scrape? If 5 is typical, you could theoretically hyper-optimize by spawning those in separate standby Actors.

* Update README.md with information about memory/latency
@jirispilka (Collaborator, Author)

Great job. 2 GB is definitely suboptimal for ideal speed. I think you could even go higher than 4 GB for the default, depending on whether cost or latency is the bigger priority.

I updated the README in this spirit

Based on your requirements, if low latency is a priority, consider running the Actor with 4GB or more of memory.
However, if you're looking for a cost-effective solution, you can run the Actor with 2GB of memory.

How many pages do you think it will usually scrape? If 5 is typical, you could theoretically hyper-optimize by spawning those in separate standby Actors.

I'm not sure how many pages people might request; my guess is somewhere around 1 to 5. I'm starting Playwright with desiredConcurrency = 3.
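The concurrency settings mentioned in this thread could be collected in an options object like the sketch below. Only desiredConcurrency = 3 is stated here; the other values (maxConcurrency, timeout) are illustrative assumptions, and whether each option lives at the top level or under autoscaledPoolOptions in Crawlee is worth checking against the Crawlee docs.

```typescript
// Sketch of concurrency-related crawler options discussed in this thread.
// desiredConcurrency = 3 mirrors the comment above; other values are
// illustrative assumptions, not taken from the PR.
export const playwrightCrawlerOptions = {
    minConcurrency: 3,
    maxConcurrency: 5,
    requestHandlerTimeoutSecs: 30, // roughly matching pageTimeoutSecs
    autoscaledPoolOptions: {
        desiredConcurrency: 3,
    },
};
```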

jirispilka and others added 11 commits September 4, 2024 16:25
* Add description, title, and url to the userData and response
* Add info about GPT actions
* Fix Google Search output
* Fix response with GoogleSearchResult, fix Output, update latency in README.md
* Add paragraph about time out
* Disable debugMode
* Update README.md
* Add ability to create new crawlers using query parameters
* feat: Update Dockerfile to node version 22, fix playwright key creation (#13)
* Updated README.md to include tips on improving latency
* Set initialConcurrency to 5
* Set minConcurrency to 3
* Update README.md with information about default country for Google Search
@jirispilka
Copy link
Collaborator Author

Let me merge this PR now for the following reasons:

  • The Actor has already been published, so it’s better to have the code in the master branch for anyone visiting the repository.
  • This PR has become quite lengthy and complex.

@jirispilka jirispilka merged commit 932a411 into master Oct 2, 2024
1 check passed
@jirispilka jirispilka deleted the feat/actor-impl branch October 2, 2024 07:27