feat(actor): Docling Actor on Apify infrastructure #875

Open, wants to merge 28 commits into main

@vancura commented Feb 3, 2025

Dear Docling maintainers,

I have wrapped Docling as an Apify Actor by adding the Actor definition in the .actor directory and published the Docling Actor on Apify Store. I've also added the Actor status badge and a brief usage description to the README, including the “Run on Apify” button.

For the full description of the Actor, please see the README file in the .actor directory.

Docling can now be used in the cloud without installation, free of charge. Users can avoid managing Python, OCR libraries, and ML model dependencies locally. The Actor can be run from Apify Console, through the API, or locally with the CLI:

apify call vancura/docling -i '{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "json",
    "ocr": true
}'

The Actor processes documents and stores the results in Apify's key-value store under the OUTPUT_RESULT key. It supports multiple output formats:

  • Markdown
  • JSON
  • HTML
  • Plain text
  • Doctags (structured format)
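
To illustrate the input shape used in the `apify call` example above, here is a minimal Python sketch that assembles and checks the same JSON before a run. The function name `build_actor_input` is hypothetical, and the set of format identifiers is an assumption: only `"json"` appears in this PR, and the other values are guesses that should be checked against the Actor's input schema.

```python
import json
from urllib.parse import urlparse

# Assumption: only "json" is confirmed by the PR text; the remaining
# identifiers are guesses for Markdown, HTML, plain text, and Doctags.
ASSUMED_FORMATS = {"md", "json", "html", "text", "doctags"}


def build_actor_input(document_url, output_format="json", ocr=True):
    """Return an input dict with the three fields shown in the PR example."""
    parsed = urlparse(document_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"documentUrl must be an http(s) URL: {document_url!r}")
    if output_format not in ASSUMED_FORMATS:
        raise ValueError(f"unsupported outputFormat: {output_format!r}")
    if not isinstance(ocr, bool):
        raise TypeError("ocr must be a boolean")
    return {
        "documentUrl": document_url,
        "outputFormat": output_format,
        "ocr": ocr,
    }


if __name__ == "__main__":
    actor_input = build_actor_input("https://arxiv.org/pdf/2408.09869.pdf")
    # The printed JSON can be passed to `apify call vancura/docling -i '...'`.
    print(json.dumps(actor_input, indent=4))
```

Validating on the client side only catches obvious mistakes early; the Actor's own input validation remains the source of truth.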

Technical implementation

The Actor provides:

  • Cloud-based document processing through Apify's infrastructure
  • API access for easy integration
  • Support for multiple output formats
  • OCR capabilities for scanned documents
  • Integration potential with other Apify Actors
  • Clean error handling and input validation
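
As a rough sketch of the API access mentioned in the list above, the following Python snippet prepares two HTTP requests with the standard library: one that starts a run and one that reads the `OUTPUT_RESULT` record. It assumes the public Apify API v2 URL layout (`/acts/{actorId}/runs` and `/key-value-stores/{storeId}/records/{key}`) and Bearer-token authentication; the helper names are hypothetical, and nothing here is taken from the Actor's own code.

```python
import json
from urllib.request import Request

# Assumption: base URL of the public Apify API v2.
API_BASE = "https://api.apify.com/v2"


def start_run_request(actor_id, run_input, token):
    """Prepare (but do not send) a POST request that starts an Actor run."""
    return Request(
        f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def result_record_request(store_id, token, key="OUTPUT_RESULT"):
    """Prepare a GET request for the record the Actor writes its result to."""
    return Request(
        f"{API_BASE}/key-value-stores/{store_id}/records/{key}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Either request can be sent with `urllib.request.urlopen`. In URLs the Actor is addressed as `vancura~docling` (tilde instead of slash), and the store ID for the second call is expected in the run object returned by the first, which should be confirmed against the API reference.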

I've packaged Docling's environment (~6 GB Docker image) with all necessary dependencies:

  • Python 3.11
  • OCR libraries
  • ML models
  • Node.js 20.x
  • All required system binaries

Apify will sponsor your project

All the links to Apify in this PR are affiliate links under the Apify open source fair share program with id docling in the passive tier of the program. In the passive tier, Apify commits to sending a monthly commission via the GitHub Sponsor button from all new sign-ups that come through your link. The only action required on your part is to accept the pull request and ensure your GitHub Sponsor button is set up.

You can earn a larger commission and gain insights into traffic by registering directly with Apify, claiming ownership of the Actor on the Apify Store, and maintaining the Actor yourself. Simply contact support after signing up and pass the ownership challenge. The Actor will then be transferred, e.g., to ds4sd/docling, and you’ll see it under your Apify account.

To further increase your income from Apify, you can convert your Actor on Apify Store to the pay-per-event pricing model and join the active developer tier. We offer an individual competitive advantage for the active developer tier in the form of either a significantly reduced Apify margin or discounted compute unit pricing. Feel free to ask for it!

Benefits of the Actor Programming Model

The Web Actor Programming Model is a new concept for building serverless microapps, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud. Actors are web automation scripts that are easy to integrate and scale up. The main benefit is that even a small piece of software can be turned into a public cloud service in a heartbeat.

Apify is the largest ecosystem where developers build, deploy, and publish data extraction, web automation tools, and AI agents. With over 3,000 Actors on Apify Store and 10 years of experience in the market, Apify makes Docling accessible to over 250,000 developers using the platform monthly. This also enables integration with other Actors on Store, custom Actors, and platform integrations, which together can create workflows more powerful than the individual parts alone.

Full disclosure

I work at Apify. Apify doesn’t sell your software, but we sell the computing resources needed to run your software in the cloud to the end users. Your project is one of the first we selected to pilot Apify's open source fair share program. Please let me know if there’s anything I can do to help you accept this PR! If you do, we’d be pleased to feature your project in our marketing communication.

If you have any questions or need assistance, don’t hesitate to reach out to me (@vancura) or @netmilk, the Apify VP of DX, or write to us at [email protected].

vancura and others added 21 commits January 13, 2025 12:30
- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.
- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.
Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.
Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.
- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup
- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.

mergify bot commented Feb 3, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
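
The rule's pattern can be tried locally before pushing a title change. This is a small sketch using Python's `re` module with the pattern copied from the rule above; the function name `is_conventional` is hypothetical, and `re.search` is assumed to be a close enough stand-in for the bot's `~=` operator.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


if __name__ == "__main__":
    for title in (
        "Docling Actor on Apify infrastructure",
        "feat(actor): Docling Actor on Apify infrastructure",
    ):
        print(f"{title!r}: {is_conventional(title)}")
```

The original PR title fails this check and the renamed one passes, which is consistent with the title change recorded in this thread.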

@vancura changed the title from "Docling Actor on Apify infrastructure" to "feat(actor): Docling Actor on Apify infrastructure" Feb 3, 2025
@vancura marked this pull request as ready for review February 5, 2025 20:50