
Commit

initial implementation of the simple python APIs for invoking transforms (#413)

* initial implementation of the simple python APIs for invoking transforms

* addressed some comments

* add json configuration, transforms list, additional libraries removal and test

* add ray support and test

* add additional import

* fixed testing issues

* fixed testing issues

* added notebook example

* add s3 support

* cleaned configuration

* simplified code

* removed duplicate logs

* addressed review comments

* updated notebook

* updated notebook

* addressed review comments

* small enhancements, added ededup Python, run pre_commit

* adding documentation
blublinsky authored Aug 13, 2024
1 parent c21ef4e commit 93308cd
Showing 102 changed files with 4,454 additions and 1,297 deletions.
133 changes: 68 additions & 65 deletions .github/workflows/test.yml
@@ -7,33 +7,33 @@ on:
- "dev"
- "releases/**"
tags:
- "*"
pull_request:
branches:
- "dev"
- "releases/**"
jobs:
check_if_push_images:
# check whether the Docker images should be pushed to the remote repository
# The images are pushed if it is a merge to dev branch or a new tag is created.
# The latter being part of the release process.
# The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file.
runs-on: ubuntu-22.04
outputs:
publish_images: ${{ steps.version.outputs.publish_images }}
steps:
- id: version
run: |
publish_images='false'
if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT"
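The branch/tag gating above can be restated as a small predicate, which is convenient for reasoning about (or locally testing) the publish conditions. This is an illustration only, not part of the workflow; the function name is ours:

```python
def publish_images(ref: str, event: str, repo: str) -> bool:
    """Mirror the workflow gating: publish on a merge to dev (but not on a
    pull request) or on any tag, and only in the upstream repository."""
    if repo != "IBM/data-prep-kit":
        return False
    if ref == "refs/heads/dev" and event != "pull_request":
        return True
    return ref.startswith("refs/tags/")
```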
test-make:
runs-on: ubuntu-22.04
steps:
@@ -216,8 +216,8 @@ jobs:
if: needs.check_if_push_images.outputs.publish_images == 'true'
runs-on: ubuntu-22.04
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
timeout-minutes: 30
steps:
- name: Checkout
@@ -233,23 +233,24 @@ jobs:
df -h
- name: Test Code Transform Images
run: |
make -C data-processing-lib/spark DOCKER=docker image
- name: Print space
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
docker images
- name: Publish images
if: needs.check_if_push_images.outputs.publish_images == 'true'
run: |
make -C data-processing-lib/spark publish-image
test-code-images:
needs: [check_if_push_images]
runs-on: ubuntu-22.04
timeout-minutes: 30
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -266,22 +267,23 @@ jobs:
run: |
make -C data-processing-lib DOCKER=docker image
make -C transforms/code DOCKER=docker test-image
- name: Print space
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
docker images
- name: Publish images
if: needs.check_if_push_images.outputs.publish_images == 'true'
run: |
make -C transforms/code publish
test-language-images:
needs: [check_if_push_images]
runs-on: ubuntu-22.04
timeout-minutes: 120
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -304,14 +306,14 @@ jobs:
- name: Publish images
if: needs.check_if_push_images.outputs.publish_images == 'true'
run: make -C transforms/language publish

test-universal-images:
needs: [check_if_push_images]
runs-on: ubuntu-22.04
timeout-minutes: 120
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -329,40 +331,41 @@ jobs:
run: |
make -C data-processing-lib/spark DOCKER=docker image
make -C transforms/universal DOCKER=docker test-image
- name: Print space
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
docker images
- name: Publish images
if: needs.check_if_push_images.outputs.publish_images == 'true'
run: make -C transforms/universal publish
build-kfp-components:
needs: [check_if_push_images]
runs-on: ubuntu-22.04
timeout-minutes: 30
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Free up space in github runner
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Build
run: |
make -C kfp/kfp_ray_components DOCKER=docker image
make KFPv2=1 -C kfp/kfp_ray_components DOCKER=docker image
- name: Publish images
if: needs.check_if_push_images.outputs.publish_images == 'true'
run: make -C kfp/kfp_ray_components publish
test-tool-images:
runs-on: ubuntu-22.04
timeout-minutes: 30
@@ -371,4 +374,4 @@ jobs:
uses: actions/checkout@v4
- name: Build and Test Tool images
run: |
make -C tools/ingest2parquet DOCKER=docker test-image
1 change: 1 addition & 0 deletions data-processing-lib/doc/overview.md
@@ -45,6 +45,7 @@ To learn more consider the following:
* [Transform Exceptions](transform-exceptions.md)
* [Transform Runtimes](transform-runtimes.md)
* [Transform Examples](transform-tutorial-examples.md)
* [Simplified transform APIs](simplified_transform_apis.md)
* [Data Access Factory](data-access-factory.md)
* [Testing Transforms](transform-testing.md)
* [Utilities](transformer-utilities.md)
113 changes: 113 additions & 0 deletions data-processing-lib/doc/simplified_transform_apis.md
@@ -0,0 +1,113 @@
# Simplified APIs for invoking transforms

Transform invocation currently requires defining the transform parameters, passing them via `sys.argv`, and
then invoking a runtime-specific launcher with the transform/runtime-specific configuration (see, for example,
[NOOP Python invocation](../../transforms/universal/noop/python/src/noop_local_python.py),
[NOOP Ray invocation](../../transforms/universal/noop/ray/src/noop_local_ray.py) and
[NOOP Spark invocation](../../transforms/universal/noop/spark/src/noop_local_spark.py)).

The simplified APIs described here make invocation simpler by eliminating the need for this boilerplate code.
We currently provide two APIs for simplified transform invocation:
* [execute_python_transform](../python/src/data_processing/runtime/pure_python/transform_invoker.py)
* [execute_ray_transform](../ray/src/data_processing_ray/runtime/ray/transform_invoker.py)

Both APIs have the same signature (the runtime is encoded in the API name) and accept the following parameters:
* transform name
* transforms configuration object (see below)
* `input_folder` containing the data to be processed; currently either a local file system or S3-compatible storage
* `output_folder` defining where the execution results will be placed; currently either a local file system or S3-compatible storage
* S3 configuration, required only when the input/output folders are in S3
* transform params: a dictionary of transform-specific parameters

Both APIs return `True` if the transform execution succeeds and `False` otherwise.

The API implementation leverages the [TransformsConfiguration class](../python/src/data_processing/utils/transform_configurator.py),
which manages the configurations of all existing transforms. By default, transform information is loaded
from this [JSON file](../python/src/data_processing/utils/transform_configuration.json), but it can be overridden
by the user.

Additionally, the configuration provides a method for listing the existing (known) transforms.

Finally, because the configurator knows about all existing transforms (and their dependencies), it checks
whether the transform's code is installed locally and installs it if it is not (using the
[PipInstaller](../python/src/data_processing/utils/pipinstaller.py)). Any transforms installed this way
are removed after transform execution completes.
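The install-on-demand behavior can be sketched roughly as follows. This uses `pip` via `subprocess` directly and is only an approximation of what the library's `PipInstaller` does; its actual API is not shown in this document:

```python
import importlib.util
import subprocess
import sys


def ensure_installed(package: str, module_name: str = "") -> bool:
    """Install `package` with pip if its top-level module cannot be found.

    Returns True only if this call actually performed an installation,
    False if the package was already available locally.
    """
    module_name = module_name or package
    if importlib.util.find_spec(module_name) is not None:
        return False  # already installed, nothing to do
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    return True
```

An installer like this would record which packages it installed so they can be uninstalled after the run, which is the cleanup behavior described above.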

An example of using the APIs can be found in this [notebook](../../examples/notebooks/code/demo_with_apis.ipynb).

Creating and using the configuration:

```python
from data_processing.utils import TransformsConfiguration

t_configuration = TransformsConfiguration()
transforms = t_configuration.get_available_transforms()
```

Invoking a Python transform:

```python
import ast
import os

from data_processing.runtime.pure_python import execute_python_transform
from data_processing.utils import ParamsUtils

code_location = {"github": "github", "commit_hash": "12345", "path": "path"}

runtime_python_params = {
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}
# zip_input_folder and parquet_data_output are defined elsewhere (e.g. earlier in the notebook)
input_folder = os.path.abspath(zip_input_folder)
output_folder = os.path.abspath(parquet_data_output)
supported_languages_file = os.path.abspath(
    "../../../transforms/code/code2parquet/python/test-data/languages/lang_extensions.json"
)

ingest_config = {
    "data_files_to_use": ast.literal_eval("['.zip']"),
    "code2parquet_supported_langs_file": supported_languages_file,
    "code2parquet_detect_programming_lang": True,
}

execute_python_transform(
    configuration=t_configuration,
    name="code2parquet",
    input_folder=input_folder,
    output_folder=output_folder,
    params=runtime_python_params | ingest_config,
)
```

In the fragment above, we first define the Python runtime parameters. Most of them have defaults, so specifying
them is optional (needed only to override a default). We then define the input/output folders and the location
of the supporting file, define the code2parquet-specific parameters, and finally invoke the transform itself.
Note that `runtime_python_params` can be defined once and then reused across several Python transform
invocations.
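The `params=runtime_python_params | ingest_config` argument relies on Python 3.9's dict union operator, which is what makes this reuse convenient. Keys from the right-hand operand win on conflict (the parameter names below are illustrative):

```python
runtime_params = {"runtime_pipeline_id": "pipeline_id", "runtime_job_id": "job_id"}
transform_params = {"runtime_job_id": "job_42", "noop_sleep_sec": 1}  # illustrative keys

# `|` builds a new dict; on duplicate keys the right-hand side overrides,
# so transform-specific values take precedence over the shared runtime defaults
merged = runtime_params | transform_params
```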

Invoking a Ray transform:

```python
from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray.transform_invoker import execute_ray_transform

# worker_options and code_location are defined elsewhere; illustrative values shown here
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}

runtime_ray_params = {
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}

ededup_config = {
    "ededup_hash_cpu": 0.5,
    "ededup_num_hashes": 2,
    "ededup_doc_column": "contents",
}

execute_ray_transform(
    configuration=t_configuration,
    name="ededup",
    input_folder=input_folder,
    output_folder=output_folder,
    params=runtime_ray_params | ededup_config,
)
```

In the fragment above, we first define the Ray runtime parameters. Most of them have defaults, so specifying
them is optional (needed only to override a default). We then define the input/output folders and the
ededup-specific parameters, and finally invoke the transform itself.
Note that `runtime_ray_params` can be defined once and then reused across several Ray transform
invocations.
1 change: 1 addition & 0 deletions data-processing-lib/python/pyproject.toml
@@ -11,6 +11,7 @@ authors = [
{ name = "Boris Lublinsky", email = "[email protected]" },
]
dependencies = [
"numpy < 1.29.0",
"pyarrow==16.1.0",
"boto3==1.34.69",
"argparse",
@@ -2,3 +2,4 @@
from data_processing.runtime.pure_python.transform_file_processor import PythonTransformFileProcessor
from data_processing.runtime.pure_python.transform_orchestrator import orchestrate
from data_processing.runtime.pure_python.transform_launcher import PythonTransformLauncher
from data_processing.runtime.pure_python.transform_invoker import invoke_transform, execute_python_transform
