docs: update readme latency (#10)
* Update README.md with information about memory/latency
jirispilka authored Sep 4, 2024
1 parent 1cfe48c commit 90c7bd0
Showing 3 changed files with 108 additions and 36 deletions.
40 changes: 19 additions & 21 deletions README.md
@@ -1,21 +1,21 @@
## RAG Web Browser
## 🌐 RAG Web Browser

This Actor retrieves website content from the top Google Search Results Pages (SERPs).
Given a search query, it fetches the top Google search result URLs and then follows each URL to extract the text content from the targeted websites.

The RAG Web Browser is designed for Large Language Model (LLM) applications or LLM agents to provide up-to-date Google search knowledge.

**Main features**:
**🚀 Main features**:
- Searches Google and extracts the top Organic results.
- Follows the top URLs to scrape HTML and extract website text, excluding navigation, ads, banners, etc.
- Capable of extracting content from JavaScript-enabled websites and bypassing anti-scraping protections.
- Output formats include plain text, markdown, and HTML.

This Actor is a combination of two specialized actors:
- [Google Search Results Scraper](https://apify.com/apify/google-search-scraper)
- [Website Content Crawler](https://apify.com/apify/website-content-crawler)
- Are you looking to scrape Google Search Results? Check out the [Google Search Results Scraper](https://apify.com/apify/google-search-scraper) actor.
- Do you need to extract content from a list of URLs? Explore the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor.

### Fast responses using the Standby mode
### 🏎️ Fast responses using the Standby mode

This Actor can be run in both normal and [standby modes](https://docs.apify.com/platform/actors/running/standby).
Normal mode is useful for testing and running in ad-hoc settings, but it comes with some overhead due to the Actor's initial startup time.
@@ -26,7 +26,7 @@ This allows the Actor to stay active, enabling it to retrieve results with lower
*Limitations*: Running the Actor in Standby mode does not support changing crawling and scraping configurations using query parameters.
Supporting this would require creating crawlers on the fly, which would add an overhead of 1-2 seconds.

#### How to start the Actor in a Standby mode?
#### 🔥 How to start the Actor in a Standby mode?

You need the Actor's standby URL and `APIFY_API_TOKEN`. Then, you can send requests to the `/search` path along with your `query` and the number of results (`maxResults`) you want to retrieve.
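
As a minimal sketch of the request described above, the following TypeScript builds the `/search` URL with the `query` and `maxResults` parameters and sends it. The helper names `buildSearchUrl` and `search` are hypothetical, the standby URL is whatever your Actor exposes, and passing the token as a `Bearer` header, a global `fetch` (Node 18+), and a JSON response body are assumptions rather than details confirmed by this README.

```typescript
// Hypothetical helper: builds the URL for a Standby-mode search request.
// `standbyUrl` is the Actor's standby URL (an assumption: supplied by the caller).
function buildSearchUrl(standbyUrl: string, query: string, maxResults: number): string {
    const url = new URL('/search', standbyUrl);
    url.searchParams.set('query', query);
    url.searchParams.set('maxResults', String(maxResults));
    return url.toString();
}

// Hypothetical helper: sends the request and returns the parsed body.
// Assumes a global `fetch`, Bearer-token authentication, and a JSON response.
async function search(
    standbyUrl: string,
    apifyToken: string,
    query: string,
    maxResults: number,
): Promise<unknown> {
    const response = await fetch(buildSearchUrl(standbyUrl, query, maxResults), {
        headers: { Authorization: `Bearer ${apifyToken}` },
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json();
}
```

Keeping the URL construction separate from the network call makes the parameter encoding easy to check without sending a request.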

@@ -60,7 +60,7 @@ Here’s an example of the server response (truncated for brevity):
The Standby mode has several configuration parameters, such as Max Requests per Run, Memory, and Idle Timeout.
You can find the details in the [Standby Mode documentation](https://docs.apify.com/platform/actors/running/standby#how-do-i-customize-standby-configuration).

## API parameters
## 📧 API parameters

When running in Standby mode, the RAG Web Browser accepts the following query parameters:

@@ -72,36 +72,34 @@ When running in the standby mode the RAG Web Browser accept the following query
| `requestTimeoutSecs` | Timeout (in seconds) for making the search request and processing its response |


### What is the best way to run the RAG Web Browser?
### 🏃 What is the best way to run the RAG Web Browser?

The RAG Web Browser is designed to be run in Standby mode for optimal performance.
The Standby mode allows the Actor to stay active, enabling it to retrieve results with lower latency.

### 🕒 What is the expected latency?

Latency depends on the memory allocated to the Actor and on the number of results requested: more memory lowers it, while more results raise it.

Here is a typical latency breakdown for the RAG Web Browser.
Please note that these results are only indicative and may vary based on the search term and the target websites.
Please note that these results are only indicative and may vary based on the search term, the target websites,
and network latency.

The numbers below are based on the following search terms: "apify", "Donald Trump", "boston". Results were averaged for the three queries.
The numbers below are based on the following search terms: "apify", "Donald Trump", "boston".
Results were averaged for the three queries.

| Memory (GB) | Max Results | Latency (s) |
|-------------|-------------|-------------|
| 2 | 1 | 36 |
| 2 | 5 | 88 |
| 4 | 1 | 22 |
| 4 | 3 | 31 |
| 4 | 5 | 46 |

Based on your requirements, if low latency is a priority, consider running the Actor with 4GB or more of memory.
However, if you're looking for a cost-effective solution, you can run the Actor with 2GB of memory.
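
The trade-off above can be sketched as a lookup over the measured rows. This is only an illustration: the `Measurement` type and the `smallestMemoryWithin` helper are hypothetical, the data is copied from the table, and combinations that were not measured (for example 2 GB with 3 results) simply yield no answer.

```typescript
// One row of the latency table (field names are hypothetical).
interface Measurement {
    memoryGb: number;
    maxResults: number;
    latencySecs: number;
}

// Values copied from the latency table in this README.
const measurements: Measurement[] = [
    { memoryGb: 2, maxResults: 1, latencySecs: 36 },
    { memoryGb: 2, maxResults: 5, latencySecs: 88 },
    { memoryGb: 4, maxResults: 1, latencySecs: 22 },
    { memoryGb: 4, maxResults: 3, latencySecs: 31 },
    { memoryGb: 4, maxResults: 5, latencySecs: 46 },
];

// Returns the smallest measured memory size whose indicative latency fits
// the budget, or undefined when no measured row qualifies.
function smallestMemoryWithin(maxResults: number, budgetSecs: number): number | undefined {
    const candidates = measurements
        .filter((m) => m.maxResults === maxResults && m.latencySecs <= budgetSecs)
        .map((m) => m.memoryGb);
    return candidates.length > 0 ? Math.min(...candidates) : undefined;
}
```

Because the measurements are indicative averages over three queries, a result from this lookup is a starting point for your own testing rather than a guarantee.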

#### Looking to scrape Google Search Results?
- Check out the [Google Search Results Scraper](https://apify.com/apify/google-search-scraper) actor.

#### Need to extract content from a list of URLs?
- Explore the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor.

Browsing Tool
- https://community.openai.com/t/new-assistants-browse-with-bing-ability/479383/27


### Development
### 👷🏼 Development

#### Run STANDBY mode using apify-cli for development
```bash
103 changes: 88 additions & 15 deletions data/performance_measures.md
@@ -22,7 +22,7 @@ playwright-wait-dynamic-content: 7029
playwright-remove-cookie: 1073
playwright-parse-with-cheerio: 5564
playwright-process-html: 3829
playwright-before-response-send: 236
playwright-before-response-send: 236
Time taken for each request: [ 49762, 16004, 42676 ]
Time taken on average 36147.333333333336
@@ -132,15 +132,88 @@ before-cheerio-queue-add: 123
cheerio-request-handler-start: 2637
before-playwright-queue-add: 12
playwright-request-start: 8517
playwright-wait-dynamic-content: 6013
playwright-remove-cookie: 497
playwright-parse-with-cheerio: 2296
playwright-process-html: 1664
playwright-before-response-send: 110
playwright-wait-dynamic-content: 6013
playwright-remove-cookie: 497
playwright-parse-with-cheerio: 2296
playwright-process-html: 1664
playwright-before-response-send: 110
Time taken for each request: [ 25433, 14899, 25276 ]
Time taken on average 21869.333333333332
```

# Memory 4GB, Max Results 3, Proxy: auto

```text
Average time for each time measure event: Map(10) {
'request-received' => [
0, 0, 0, 0, 0,
0, 0, 0, 0
],
'before-cheerio-queue-add' => [
157, 157, 157,
107, 107, 107,
122, 122, 122
],
'cheerio-request-handler-start' => [
1699, 1699, 1699,
4312, 4312, 4312,
2506, 2506, 2506
],
'before-playwright-queue-add' => [
10, 10, 10, 13, 13,
13, 5, 5, 5
],
'playwright-request-start' => [
16249, 17254, 26159,
6726, 9821, 11124,
7349, 8212, 29345
],
'playwright-wait-dynamic-content' => [
1110, 10080, 10076,
6132, 1524, 18367,
3077, 2508, 10001
],
'playwright-remove-cookie' => [
1883, 914, 133,
1176, 5072, 241,
793, 4234, 120
],
'playwright-parse-with-cheerio' => [
1203, 1490, 801,
698, 2919, 507,
798, 1378, 2756
],
'playwright-process-html' => [
2597, 1304, 1398,
1099, 6756, 1031,
2110, 5416, 2028
],
'playwright-before-response-send' => [
105, 112, 74,
501, 3381, 26,
101, 1570, 69
]
}
request-received: 0 s
before-cheerio-queue-add: 129 s
cheerio-request-handler-start: 2839 s
before-playwright-queue-add: 9 s
playwright-request-start: 14693 s
playwright-wait-dynamic-content: 6986 s
playwright-remove-cookie: 1618 s
playwright-parse-with-cheerio: 1394 s
playwright-process-html: 2638 s
playwright-before-response-send: 660 s
Time taken for each request: [
25013, 33020,
40507, 20764,
33905, 35728,
16861, 25951,
46952
]
Time taken on average 30966.777777777777
```
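
The summary lines in the block above can be reproduced from the raw per-request arrays. The sketch below is not the code in `src/performance-measures.ts`; the helper names `averageMs` and `averagePerEvent` are hypothetical, and only two of the ten events are copied in for brevity. The values appear to be milliseconds despite the `s` suffix in the log, since the overall average of about 30967 lines up with the 31 s reported in the README table.

```typescript
// Arithmetic mean of a list of timings (assumed to be in milliseconds).
function averageMs(values: number[]): number {
    if (values.length === 0) {
        throw new Error('Cannot average an empty list');
    }
    const sum = values.reduce((acc, value) => acc + value, 0);
    return sum / values.length;
}

// Rounded average for every event in a map of event name to timings.
function averagePerEvent(timings: Map<string, number[]>): Map<string, number> {
    const result = new Map<string, number>();
    timings.forEach((values, event) => {
        result.set(event, Math.round(averageMs(values)));
    });
    return result;
}

// Two events copied from the log above (4 GB, Max Results 3).
const timings = new Map<string, number[]>([
    ['before-cheerio-queue-add', [157, 157, 157, 107, 107, 107, 122, 122, 122]],
    ['playwright-request-start', [16249, 17254, 26159, 6726, 9821, 11124, 7349, 8212, 29345]],
]);

// Total time per request, copied from the same log.
const requestTimes = [25013, 33020, 40507, 20764, 33905, 35728, 16861, 25951, 46952];

const perEvent = averagePerEvent(timings);
const overallAverage = averageMs(requestTimes);
```

With this data, `perEvent` holds 129 and 14693 for the two events, matching the logged summary, and `overallAverage` is roughly 30966.78.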

# Memory 4GB, Max Results 5, Proxy: auto

```text
@@ -205,15 +278,15 @@ Time taken on average 21869.333333333332
]
}
request-received: 0 s
before-cheerio-queue-add: 145
cheerio-request-handler-start: 3117
before-playwright-queue-add: 41
playwright-request-start: 31449
playwright-wait-dynamic-content: 4987
playwright-remove-cookie: 1742
playwright-parse-with-cheerio: 2020
playwright-process-html: 2451
playwright-before-response-send: 558
before-cheerio-queue-add: 145
cheerio-request-handler-start: 3117
before-playwright-queue-add: 41
playwright-request-start: 31449
playwright-wait-dynamic-content: 4987
playwright-remove-cookie: 1742
playwright-parse-with-cheerio: 2020
playwright-process-html: 2451
playwright-before-response-send: 558
Time taken for each request: [
26517, 33101, 58388,
71906, 81101, 30794,
1 change: 1 addition & 0 deletions src/performance-measures.ts
@@ -7,6 +7,7 @@ import { Actor } from 'apify';
// const datasetId = 'aDnsnaBqGb8eTdpGv'; // 2GB, maxResults=1
// const datasetId = 'giAPLL8dhd2PDqPlf'; // 2GB, maxResults=5
// const datasetId = 'VKzel6raVqisgIYfe'; // 4GB, maxResults=1
// const datasetId = 'KkTaLd70HbFgAO35y'; // 4GB, maxResults=3
const datasetId = 'fm9tO0GDBUagMT0df'; // 4GB, maxResults=5

// set environment variables APIFY_TOKEN