diff --git a/README.md b/README.md
index ab644c6..e886d4a 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,21 @@
-## RAG Web Browser
+## 🌐 RAG Web Browser

 This Actor retrieves website content from the top Google Search Results Pages (SERPs).
 Given a search query, it fetches the top Google search result URLs and then follows each URL to extract the text content from the targeted websites.

 The RAG Web Browser is designed for Large Language Model (LLM) applications or LLM agents to provide up-to-date Google search knowledge.

-**Main features**:
+**🚀 Main features**:
 - Searches Google and extracts the top Organic results.
 - Follows the top URLs to scrape HTML and extract website text, excluding navigation, ads, banners, etc.
 - Capable of extracting content from JavaScript-enabled websites and bypassing anti-scraping protections.
 - Output formats include plain text, markdown, and HTML.

 This Actor is a combination of a two specialized actors:
-- [Google Search Results Scraper](https://apify.com/apify/google-search-scraper)
-- [Website Content Crawler](https://apify.com/apify/website-content-crawler)
+- Are you looking to scrape Google Search Results? Check out the [Google Search Results Scraper](https://apify.com/apify/google-search-scraper) actor.
+- Do you need extract content from a list of URLs? Explore the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor.

-### Fast responses using the Standby mode
+### 🏎️ Fast responses using the Standby mode

 This Actor can be run in both normal and [standby modes](https://docs.apify.com/platform/actors/running/standby).
 Normal mode is useful for testing and running in ad-hoc settings, but it comes with some overhead due to the Actor's initial startup time.
@@ -26,7 +26,7 @@ This allows the Actor to stay active, enabling it to retrieve results with lower
 *Limitations*: Running the Actor in Standby mode does not support changing crawling and scraping configurations using query parameters.
 Supporting this would require creating crawlers on the fly, which would add an overhead of 1-2 seconds.

-#### How to start the Actor in a Standby mode?
+#### 🔥 How to start the Actor in a Standby mode?

 You need the Actor's standby URL and `APIFY_API_TOKEN`.
 Then, you can send requests to the `/search` path along with your `query` and the number of results (`maxResults`) you want to retrieve.
@@ -60,7 +60,7 @@ Here’s an example of the server response (truncated for brevity):
 The Standby mode has several configuration parameters, such as Max Requests per Run, Memory, and Idle Timeout.
 You can find the details in the [Standby Mode documentation](https://docs.apify.com/platform/actors/running/standby#how-do-i-customize-standby-configuration).

-## API parameters
+## 📧 API parameters

 When running in the standby mode the RAG Web Browser accept the following query parameters:

@@ -72,36 +72,34 @@ When running in the standby mode the RAG Web Browser accept the following query
 | `requestTimeoutSecs` | Timeout (in seconds) for making the search request and processing its response |


-### What is the best way to run the RAG Web Browser?
+### 🏃 What is the best way to run the RAG Web Browser?

 The RAG Web Browser is designed to be run in Standby mode for optimal performance.
 The Standby mode allows the Actor to stay active, enabling it to retrieve results with lower latency.
+
+### 🕒 What is the expected latency?
+
 The latency is proportional to the memory allocated to the Actor and number of results requested.
 Here is a typical latency breakdown for the RAG Web Browser.
-Please note the these results are only indicative and may vary based on the search term and the target websites.
+Please note the these results are only indicative and may vary based on the search term, the target websites,
+and network latency.

-The numbers below are based on the following search terms: "apify", "Donald Trump", "boston". Results were averaged for the three queries.
+The numbers below are based on the following search terms: "apify", "Donald Trump", "boston".
+Results were averaged for the three queries.

 | Memory (GB) | Max Results | Latency (s) |
 |-------------|-------------|-------------|
 | 2 | 1 | 36 |
 | 2 | 5 | 88 |
 | 4 | 1 | 22 |
+| 4 | 3 | 31 |
 | 4 | 5 | 46 |

+Based on your requirements, if low latency is a priority, consider running the Actor with 4GB or more of memory.
+However, if you're looking for a cost-effective solution, you can run the Actor with 2GB of memory.

-#### Looking to scrape Google Search Results?
-- Check out the [Google Search Results Scraper](https://apify.com/apify/google-search-scraper) actor.
-
-#### Need to extract content from a list of URLs?
-- Explore the the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor.
-
-Browsing Tool
-- https://community.openai.com/t/new-assistants-browse-with-bing-ability/479383/27
-
-
-### Development
+### 👷🏼 Development

 #### Run STANDBY mode using apify-cli for development
 ```bash
diff --git a/data/performance_measures.md b/data/performance_measures.md
index cfaaa82..d84163b 100644
--- a/data/performance_measures.md
+++ b/data/performance_measures.md
@@ -22,7 +22,7 @@ playwright-wait-dynamic-content: 7029
 playwright-remove-cookie: 1073
 playwright-parse-with-cheerio: 5564
 playwright-process-html: 3829
-playwright-before-response-send: 236
+playwright-before-response-send: 236
 Time taken for each request: [ 49762, 16004, 42676 ]
 Time taken on average 36147.333333333336
@@ -132,15 +132,88 @@ before-cheerio-queue-add: 123
 cheerio-request-handler-start: 2637
 before-playwright-queue-add: 12
 playwright-request-start: 8517
-playwright-wait-dynamic-content: 6013
-playwright-remove-cookie: 497
-playwright-parse-with-cheerio: 2296
-playwright-process-html: 1664
-playwright-before-response-send: 110
+playwright-wait-dynamic-content: 6013
+playwright-remove-cookie: 497
+playwright-parse-with-cheerio: 2296
+playwright-process-html: 1664
+playwright-before-response-send: 110
 Time taken for each request: [ 25433, 14899, 25276 ]
 Time taken on average 21869.333333333332
 ```

+# Memory 4GB, Max Results 3, Proxy: auto
+
+```text
+Average time for each time measure event: Map(10) {
+  'request-received' => [
+    0, 0, 0, 0, 0,
+    0, 0, 0, 0
+  ],
+  'before-cheerio-queue-add' => [
+    157, 157, 157,
+    107, 107, 107,
+    122, 122, 122
+  ],
+  'cheerio-request-handler-start' => [
+    1699, 1699, 1699,
+    4312, 4312, 4312,
+    2506, 2506, 2506
+  ],
+  'before-playwright-queue-add' => [
+    10, 10, 10, 13, 13,
+    13, 5, 5, 5
+  ],
+  'playwright-request-start' => [
+    16249, 17254, 26159,
+    6726, 9821, 11124,
+    7349, 8212, 29345
+  ],
+  'playwright-wait-dynamic-content' => [
+    1110, 10080, 10076,
+    6132, 1524, 18367,
+    3077, 2508, 10001
+  ],
+  'playwright-remove-cookie' => [
+    1883, 914, 133,
+    1176, 5072, 241,
+    793, 4234, 120
+  ],
+  'playwright-parse-with-cheerio' => [
+    1203, 1490, 801,
+    698, 2919, 507,
+    798, 1378, 2756
+  ],
+  'playwright-process-html' => [
+    2597, 1304, 1398,
+    1099, 6756, 1031,
+    2110, 5416, 2028
+  ],
+  'playwright-before-response-send' => [
+    105, 112, 74,
+    501, 3381, 26,
+    101, 1570, 69
+  ]
+}
+request-received: 0 s
+before-cheerio-queue-add: 129 s
+cheerio-request-handler-start: 2839 s
+before-playwright-queue-add: 9 s
+playwright-request-start: 14693 s
+playwright-wait-dynamic-content: 6986 s
+playwright-remove-cookie: 1618 s
+playwright-parse-with-cheerio: 1394 s
+playwright-process-html: 2638 s
+playwright-before-response-send: 660 s
+Time taken for each request: [
+  25013, 33020,
+  40507, 20764,
+  33905, 35728,
+  16861, 25951,
+  46952
+]
+Time taken on average 30966.777777777777
+```
+
 # Memory 4GB, Max Results 5, Proxy: auto

 ```text
@@ -205,15 +278,15 @@ Time taken on average 21869.333333333332
   ]
 }
 request-received: 0 s
-before-cheerio-queue-add: 145
-cheerio-request-handler-start: 3117
-before-playwright-queue-add: 41
-playwright-request-start: 31449
-playwright-wait-dynamic-content: 4987
-playwright-remove-cookie: 1742
-playwright-parse-with-cheerio: 2020
-playwright-process-html: 2451
-playwright-before-response-send: 558
+before-cheerio-queue-add: 145
+cheerio-request-handler-start: 3117
+before-playwright-queue-add: 41
+playwright-request-start: 31449
+playwright-wait-dynamic-content: 4987
+playwright-remove-cookie: 1742
+playwright-parse-with-cheerio: 2020
+playwright-process-html: 2451
+playwright-before-response-send: 558
 Time taken for each request: [
   26517, 33101, 58388,
   71906, 81101, 30794,
diff --git a/src/performance-measures.ts b/src/performance-measures.ts
index 97fe9b1..5e54875 100644
--- a/src/performance-measures.ts
+++ b/src/performance-measures.ts
@@ -7,6 +7,7 @@ import { Actor } from 'apify';
 // const datasetId = 'aDnsnaBqGb8eTdpGv'; // 2GB, maxResults=1
 // const datasetId = 'giAPLL8dhd2PDqPlf'; // 2GB, maxResults=5
 // const datasetId = 'VKzel6raVqisgIYfe'; // 4GB, maxResults=1
+// const datasetId = 'KkTaLd70HbFgAO35y'; // 4GB, maxResults=3
 const datasetId = 'fm9tO0GDBUagMT0df'; // 4GB, maxResults=5

 // set environment variables APIFY_TOKEN
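
Reviewer note: the new `| 4 | 3 | 31 |` row in the README can be cross-checked against the per-request totals added to `data/performance_measures.md`. The sketch below is only an illustration and is not taken from `src/performance-measures.ts`; the helper name `averageMs` is hypothetical, and it assumes the logged values are milliseconds (the summary lines print an `s` suffix, but the README reports 31 seconds for this configuration).

```typescript
// Hypothetical helper (not part of the repository): arithmetic mean of timing samples.
function averageMs(samples: number[]): number {
    if (samples.length === 0) {
        throw new Error('Cannot average an empty list of samples');
    }
    return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Per-request totals copied from the "Memory 4GB, Max Results 3" block in this diff.
// Assumption: the values are milliseconds.
const requestTimesMs: number[] = [
    25013, 33020, 40507, 20764, 33905, 35728, 16861, 25951, 46952,
];

// Samples for one event ('before-cheerio-queue-add') from the same block.
const beforeCheerioQueueAddMs: number[] = [157, 157, 157, 107, 107, 107, 122, 122, 122];

const meanMs = averageMs(requestTimesMs);
const meanSeconds = Math.round(meanMs / 1000);
const eventMeanMs = Math.round(averageMs(beforeCheerioQueueAddMs));

console.log(`Average request time: ${meanMs} ms (about ${meanSeconds} s)`);
console.log(`before-cheerio-queue-add: ${eventMeanMs} ms`);
```

Under that assumption the rounded mean agrees with the 31 s latency in the README table, and the per-event mean agrees with the `129` printed in the summary lines.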