feat(actor): Docling Actor on Apify infrastructure #875
Open: vancura wants to merge 28 commits into DS4SD:main from vancura:main
- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.

- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.

Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:
- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.
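The quoting, validation, and cleanup-trap points in this commit message can be illustrated with a minimal sketch. It is not the script from the PR: the function names, the scratch directory, and the list of accepted formats are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch; names and the format list are illustrative, not from the PR.

# Validate an output format against an assumed allow-list.
is_valid_format() {
    case "$1" in
        md|json|html|text|doctags) return 0 ;;
        *) return 1 ;;
    esac
}

# Create a scratch directory and register a trap that removes it when the shell exits.
WORK_DIR="$(mktemp -d)"
cleanup() {
    rm -rf "$WORK_DIR"
}
trap cleanup EXIT

# Quote every expansion so a value containing spaces stays a single argument.
OUTPUT_NAME="my report.md"
touch "$WORK_DIR/$OUTPUT_NAME"

if is_valid_format "md"; then
    echo "format accepted"
fi
```

Without the quotes around `"$WORK_DIR/$OUTPUT_NAME"`, `touch` would receive two arguments and create two files, which is the word-splitting problem the commit refers to.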
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup
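A rough sketch of the pattern this commit describes, logging a command's output and capturing its exit code instead of aborting, might look as follows. The `run_logged` helper is hypothetical, and the log is written to a `mktemp` file rather than `/tmp/docling.log` so the example is harmless to run; a placeholder `sh -c` command stands in for the Docling invocation.

```shell
#!/bin/sh
# Hypothetical sketch: run_logged and the temporary log location are illustrative.
LOG_FILE="$(mktemp)"

# Run a command, append its stdout and stderr to the log, and return its exit code.
run_logged() {
    status=0
    "$@" >>"$LOG_FILE" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with exit code $status: $*" >>"$LOG_FILE"
    fi
    return "$status"
}

# Capture the exit code rather than exiting on the first failure.
EXIT_CODE=0
run_logged sh -c 'echo converting; exit 3' || EXIT_CODE=$?

if [ "$EXIT_CODE" -ne 0 ]; then
    echo "conversion failed, see $LOG_FILE"
fi
```

Capturing the status with `|| EXIT_CODE=$?` keeps the script running even under `set -e`, so the caller can still store the log and report the failure.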

- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
vancura changed the title from "Docling Actor on Apify infrastructure" to "feat(actor): Docling Actor on Apify infrastructure" on Feb 3, 2025
Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code
Dear Docling maintainers,
I have wrapped Docling as an Apify Actor by adding the Actor definition in the `.actor` directory and published the Docling Actor on Apify Store. I've also added the Actor status badge and a brief usage description to the README, including the “Run on Apify” button.
For the full description of the Actor, please see the README file in the `.actor` directory.
Docling can now be used in the cloud without installation, free of charge. Users can avoid managing Python, OCR libraries, and ML model dependencies locally. The Actor can be used either from Apify Console, API, or CLI locally:
The Actor processes documents and stores the results in Apify's key-value store under the `OUTPUT_RESULT` key. It supports multiple output formats:
Technical implementation
The Actor provides:
I've packaged Docling's environment (~6GB Docker image) with all necessary dependencies:
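To illustrate how a result stored under the `OUTPUT_RESULT` key could be read back, here is a minimal sketch that builds the URL of a key-value store record in the Apify API and wraps the download in a `curl` call. The helper names, the store ID, and the `APIFY_TOKEN` variable are assumptions for the example; the download function is defined but not called, since it needs network access and a real store ID.

```shell
#!/bin/sh
# Hypothetical sketch: helper names, store ID, and APIFY_TOKEN are illustrative.

# Build the Apify API URL for one record in a key-value store.
record_url() {
    printf 'https://api.apify.com/v2/key-value-stores/%s/records/%s' "$1" "$2"
}

# Download the OUTPUT_RESULT record of the given store (requires network access).
fetch_output_result() {
    curl --fail --silent --show-error \
        --header "Authorization: Bearer $APIFY_TOKEN" \
        "$(record_url "$1" OUTPUT_RESULT)"
}

# Print the URL that would be requested for an example store ID.
echo "$(record_url my-store-id OUTPUT_RESULT)"
```

With a real run, the store ID would be that of the run's default key-value store, and `fetch_output_result "<store id>"` would print the stored record.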
Apify will sponsor your project
All the links to Apify in this PR are affiliate links under the Apify open source fair share program with id `docling` in the passive tier of the program. In the passive tier, Apify commits to sending a monthly commission via the GitHub Sponsor button from all new sign-ups that come through your link. The only action required on your part is to accept the pull request and ensure your GitHub Sponsor button is set up.
You can earn a larger commission and gain insights into traffic by registering directly with Apify, claiming ownership of the Actor on the Apify Store, and maintaining the Actor yourself. Simply contact support after signing up and pass the ownership challenge. The Actor will then be transferred, e.g., to `ds4sd/docling`, and you’ll see it under your Apify account.
To further increase your income from Apify, you can convert your Actor on Apify Store to the pay-per-event pricing model and join the active developer tier. We offer an individual competitive advantage for the active developer tier in the form of either a significantly reduced Apify margin or discounted compute unit pricing. Feel free to ask for it!
Benefits of the Actor Programming Model
The Web Actor Programming Model is a new concept for building serverless microapps, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud. Actors are web automation scripts that are easy to integrate and scale up. The main benefit is that even a small piece of software can be turned into a public cloud service in a heartbeat.
Apify is the largest ecosystem where developers build, deploy, and publish data extraction, web automation tools, and AI agents. With over 3,000 Actors on Apify Store and 10 years of experience in the market, Apify makes Docling accessible to over 250,000 developers using the platform monthly. This also enables integration with other Actors on Store, custom Actors, and platform integrations that can create much more powerful workflows than just individual parts.
Full disclosure
I work at Apify. Apify doesn’t sell your software, but we sell the computing resources needed to run your software in the cloud to the end users. Your project is one of the first we selected to pilot Apify's open source fair share program. Please let me know if there’s anything I can do to help you accept this PR! If you do, we’d be pleased to feature your project in our marketing communication.
If you have any questions or need assistance, don’t hesitate to reach out to me (@vancura) or @netmilk, the Apify VP of DX, or just write to us at [email protected].