RAG Web Browser implementation #1

Merged
merged 27 commits on Oct 2, 2024
Changes from 9 commits
Commits
183a544
Initial working version
jirispilka Aug 16, 2024
df0db55
Organize imports
jirispilka Aug 16, 2024
c248d3c
Fixed issue with HTML parsing
jirispilka Aug 16, 2024
dba1d34
Fixed imports
jirispilka Aug 16, 2024
ca67d19
Fix playwright config
jirispilka Aug 16, 2024
aad1b4b
Add proxy and fix google search URL
jirispilka Aug 16, 2024
b706092
Add query to input schema
jirispilka Aug 16, 2024
a6ae01e
Move crawler options to processInput function
jirispilka Aug 19, 2024
2d698be
Feat: Add standby mode (query parameters are the same as input in the…
jirispilka Aug 22, 2024
d5a66d4
Feat: handle PR review comments, fix actor running in normal mode (#6)
jirispilka Aug 28, 2024
01d296a
Feat: refactor response handling and add debugMode with timeMeasures …
jirispilka Aug 30, 2024
f769492
Update README.md (#8)
jirispilka Aug 30, 2024
1ed5d14
Docs: update readme (#9)
jirispilka Sep 2, 2024
d14bf41
Update performance_measures.md
jirispilka Sep 2, 2024
1cfe48c
Update performance_measures.md
jirispilka Sep 2, 2024
90c7bd0
docs: update readme latency (#10)
jirispilka Sep 4, 2024
8a787e6
Add information about limitation w.r.t. different configuration in a …
jirispilka Sep 4, 2024
11fa1b0
Feat: add google title, description, improve Readme (#11)
jirispilka Sep 11, 2024
e375551
Add CHANGELOG.md
jirispilka Sep 11, 2024
232456a
feat: Disable debugMode (#12)
jirispilka Sep 11, 2024
dbe6c63
Provide better defaults search query in input_schema.json
jirispilka Sep 11, 2024
6059d0b
Update README.md
jirispilka Sep 12, 2024
4bb8c22
Feat: create crawlers on the fly (#14)
jirispilka Sep 16, 2024
9bc2f2b
Update CHANGELOG.md
jirispilka Sep 16, 2024
7eb9bb1
Fix response format when crawler fails (#15)
jirispilka Sep 20, 2024
edc4bea
docs: update readme min concurrency (#16)
jirispilka Sep 24, 2024
7bc3ebc
Fix: update readme with information about Google country (#17)
jirispilka Oct 2, 2024
53 changes: 53 additions & 0 deletions .actor/Dockerfile
@@ -0,0 +1,53 @@
# Specify the base Docker image. You can read more about
# the available images at https://crawlee.dev/docs/guides/docker-images
# You can also use any other image from Docker Hub.
FROM apify/actor-node-playwright-chrome:18-1.40.0 AS builder

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY --chown=myuser package*.json ./

# Install all dependencies. Don't audit to speed up the installation.
RUN npm install --include=dev --audit=false

# Next, copy the source files using the user set
# in the base image.
COPY --chown=myuser . ./

# Build the project (dependencies were already installed above).
RUN npm run build

# Create final image
FROM apify/actor-node-playwright-firefox:18-1.40.0

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY --chown=myuser package*.json ./

# Install NPM packages, skipping optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
&& npm install --omit=dev --omit=optional \
&& echo "Installed NPM packages:" \
&& (npm list --omit=dev --all || true) \
&& echo "Node.js version:" \
&& node --version \
&& echo "NPM version:" \
&& npm --version \
&& rm -r ~/.npm

# Copy built JS files from builder image
COPY --from=builder --chown=myuser /home/myuser/dist ./dist

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
# for most source file changes.
COPY --chown=myuser . ./

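# Replace Firefox's bundled NSS certificate store (libnssckbi.so) with the system
# p11-kit trust module, presumably so the browser trusts certificates installed in
# the system store (e.g. when traffic is intercepted by a proxy).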
RUN rm $PLAYWRIGHT_BROWSERS_PATH/firefox-*/firefox/libnssckbi.so
RUN ln -s /usr/lib/x86_64-linux-gnu/pkcs11/p11-kit-trust.so $(ls -d $PLAYWRIGHT_BROWSERS_PATH/firefox-*)/firefox/libnssckbi.so

# Run the image.
CMD npm run start:prod --silent
63 changes: 63 additions & 0 deletions .actor/actor.json
@@ -0,0 +1,63 @@
{
    "actorSpecification": 1,
    "name": "serp-content-crawler",
    "title": "SERP Content Crawler",
    "description": "Retrieve website content from the top Google Search Results Pages (SERPs)",
    "version": "0.0",
    "meta": {
        "templateId": "ts-crawlee-cheerio"
    },
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile",
    "storage": {
        "dataset": {
            "actorSpecification": 1,
            "title": "SERP Content Crawler Dataset",
            "description": "",
            "views": {
                "default": {
                    "title": "Text",
                    "description": "View of URLs of web pages and their content as simple plain text.",
                    "transformation": {
                        "fields": [
                            "url",
                            "text"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "url": {
                                "label": "Webpage URL"
                            },
                            "text": {
                                "label": "Extracted text"
                            }
                        }
                    }
                },
                "markdown": {
                    "title": "Markdown",
                    "description": "View of URLs of web pages and their content as Markdown with formatting.",
                    "transformation": {
                        "fields": [
                            "url",
                            "markdown"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "url": {
                                "label": "Webpage URL"
                            },
                            "markdown": {
                                "label": "Extracted Markdown"
                            }
                        }
                    }
                }
            }
        }
    }
}
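The two views above only project fields from each dataset record; a minimal sketch of the item shape they imply (the field names come from the view definitions, the interface itself is not part of this PR):

```typescript
// Shape of a dataset item consistent with the "default" (Text) and "markdown" views.
interface DatasetItem {
    url: string;      // shown as "Webpage URL"
    text: string;     // shown as "Extracted text"
    markdown: string; // shown as "Extracted Markdown"
}
```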
35 changes: 35 additions & 0 deletions .actor/input_schema.json
@@ -0,0 +1,35 @@
{
    "title": "SERP Content Crawler",
    "description": "Retrieve website content from the top Google Search Results Pages (SERPs)",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "queries": {
            "title": "Search term(s)",
            "type": "string",
            "description": "Use regular search words or enter Google Search URLs. You can also apply [advanced Google search techniques](https://blog.apify.com/how-to-scrape-google-like-a-pro/), such as <code>AI site:twitter.com</code> or <code>javascript OR python</code>.",
            "prefill": "apify\nllm",
            "editor": "textarea",
            "pattern": "[^\\s]+"
        },
        "maxResults": {
            "title": "Max results to return",
            "type": "integer",
            "description": "",
            "prefill": 3,
            "minimum": 1
        },
        "proxyConfiguration": {
            "title": "Crawler: Proxy configuration",
            "type": "object",
            "description": "Enables loading the websites from IP addresses in specific geographies and helps circumvent blocking.",
            "default": {
                "useApifyProxy": true
            },
            "prefill": {
                "useApifyProxy": true
            },
            "editor": "proxy"
        }
    }
}
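For context, a minimal sketch (not part of this PR) of how an LLM application might call the Actor with the input fields defined above, using the `apify-client` package; the Actor ID, token handling, and relative file names are assumptions:

```typescript
import { ApifyClient } from 'apify-client';

// Placeholder token and Actor ID – replace with your own values.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

async function main(): Promise<void> {
    // Input fields mirror .actor/input_schema.json above.
    const run = await client.actor('<username>/serp-content-crawler').call({
        queries: 'apify\nllm',
        maxResults: 3,
        proxyConfiguration: { useApifyProxy: true },
    });

    // Each item is expected to expose the url/text/markdown fields used by
    // the dataset views defined in .actor/actor.json.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items.map((item) => item.url));
}

main().catch(console.error);
```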
13 changes: 13 additions & 0 deletions .dockerignore
@@ -0,0 +1,13 @@
# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
node_modules

# git folder
.git
9 changes: 9 additions & 0 deletions .editorconfig
@@ -0,0 +1,9 @@
root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf
38 changes: 38 additions & 0 deletions .eslintrc
@@ -0,0 +1,38 @@
{
    "root": true,
    "env": {
        "browser": true,
        "es2020": true,
        "node": true
    },
    "extends": [
        "@apify/eslint-config-ts"
    ],
    "parserOptions": {
        "project": "./tsconfig.json",
        "ecmaVersion": 2020
    },
    "ignorePatterns": [
        "node_modules",
        "dist",
        "**/*.d.ts"
    ],
    "plugins": ["import"],
    "rules": {
        "import/order": [
            "error",
            {
                "groups": [
                    ["builtin", "external"],
                    "internal",
                    ["parent", "sibling", "index"]
                ],
                "newlines-between": "always",
                "alphabetize": {
                    "order": "asc",
                    "caseInsensitive": true
                }
            }
        ]
    }
}
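As a rough illustration of what the `import/order` rule above enforces (the module names below are hypothetical, not taken from this PR):

```typescript
// Built-in and external modules form the first group, alphabetized case-insensitively.
import { CheerioCrawler } from 'crawlee';
import { setTimeout } from 'node:timers/promises';

// Parent, sibling and index modules form the last group, separated by a blank line.
import { processInput } from './input.js';
import { createResponse } from './responses.js';
```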
11 changes: 11 additions & 0 deletions .gitignore
@@ -0,0 +1,11 @@
# This file tells Git which files shouldn't be added to source control

.DS_Store
.idea
dist
node_modules
apify_storage
storage

# Added by Apify CLI
.venv
38 changes: 15 additions & 23 deletions README.md
@@ -1,28 +1,20 @@
-# Solution template
-This repository serves as a template for creating repositories for new solutions. Using a template makes it easier to create new repos with pre-defined contents and ensures consistency between the repositories.
+## Fast Google Search Result Content Crawler

-## How to use this template
+This Actor retrieves website content from the top Google Search Results Pages (SERPs).
+Given a search query, it fetches the first page of Google search results, then crawls the top sites to extract text content.
+It is capable of extracting content from JavaScript-enabled websites and can bypass anti-scraping protections.
+The extracted web content is saved as plain text or markdown.
+This Actor is ideal for adding up-to-date Google search knowledge to your LLM applications.

-1. Click the Use this template button in the top right corner.
-2. Choose a name for the repository. It should be in the format `vendor`-`customer`-`solution`-`...`.
-   - **Examples:**
-     - apify-thorn-facebook-scraper
-     - topmonks-microsoft-google-scraper
-     - topmonks-microsoft-google-data-processor
-     - devbros-apple-some-codename-scraper
-3. Make sure the repo is **private**.
-4. Create the repo.
+This Actor is a combination of two more powerful Apify Actors:
+- [Google Search Results Scraper](https://apify.com/apify/google-search-scraper)
+- [Website Content Crawler](https://apify.com/apify/website-content-crawler)

-**Once you have the repo created:**
+#### Looking to scrape Google Search Results?
+- Check out the [Google Search Results Scraper](https://apify.com/apify/google-search-scraper) actor.

-1. Go to Settings -> Manage Access -> Invite teams or people.
-2. Add the **Apify Team** as **admin**. If the solution will be delivered by a partner, add their team as **admin** too.
-4. Edit this README and fill in the details in the template below. If a field cannot be filled, write **N/A**.
-5. Finally, delete this guide from the Readme, so that only the newly added details will remain.
-6. You're done! Thanks for using the template!
+#### Need to extract content from a list of URLs?
+- Explore the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor.

-# vendor-customer-solution
-
-**Kanban link:** Add link to the Apify Kanban card.
-
-**Issue link:** Add link to the issue created in Delivery Issue Tracker or some other tracking issue.
+Browsing Tool
+- https://community.openai.com/t/new-assistants-browse-with-bing-ability/479383/27