A curated list of notable ETL (extract, transform, load) frameworks, libraries and software.
- Airflow - "Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed." (DAG sketch after the list.)
- Azkaban - "A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows."
- Dray.it - "Docker workflow engine. Allows users to separate a workflow into discrete steps, each to be handled by a single container."
- Luigi - "A Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in." (Task-class sketch after the list.)
- Mara Pipelines - "A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow."
- Pinball - "A scalable workflow management platform developed at Pinterest. It is built based on a layered approach."
- Prefect - "A new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest." (Flow sketch after the list.)
- TaskFlow - "Allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows) in a declarative manner. It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted."
- Toil - Similar to Luigi: jobs are classes with a run method. Supports executing jobs on other machines (workers), which can include AWS spot instances.
- Argo - Container-based workflow management system for Kubernetes. Workflows are specified as a directed acyclic graph (DAG), and each step runs in its own container on a Kubernetes Pod. There is also support for Airflow DAGs.
- Dagster - "Dagster is a data orchestrator for machine learning, analytics, and ETL. It lets you define pipelines in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of pipelines and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke."
- Chronos - "A distributed and fault-tolerant scheduler that runs on top of Apache Mesos that can be used for job orchestration."
- Dagobah - "A simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface."
- Jenkins - "The leading open-source automation server. Built with Java, it provides over 1000 plugins to support automating virtually anything, so that humans can actually spend their time doing things machines cannot."
- GETL - Groovy toolbox for ETL tasks from practicing architects.
- JSR 352 - Java native API for batch processing.
- Scriptella - Java-XML ETL toolbox for everyday use.
- Spring Batch - ETL on the Spring ecosystem.
- BeautifulSoup - Popular library used to extract data from web pages. (Scraping sketch after the list.)
- Blaze - "Translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems."
- Bonobo - Simple, modern and atomic data transformation graphs for Python 3.5+. (Graph sketch after the list.)
- Celery - "An asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well." (Task sketch after the list.)
- Dask - Ever tried using Pandas to process data that won't fit into memory? Dask makes it easy. Dask also makes it easy to process continuous streams of data. (Out-of-core sketch after the list.)
- dataset - A wrapper around SQLAlchemy that simplifies database operations, including upserting. (Upsert sketch after the list.)
- ijson - Allows processing JSON iteratively (as a stream) without loading the whole file into memory at once. (Streaming sketch after the list.)
- joblib - "A set of tools to provide lightweight pipelining in Python."
- lxml - Parses XML using the C libraries libxml2 and libxslt, so it's very fast. Also supports a "recover" mode that will try its best to use invalid XML or discard it. Great for large XML files and advanced functionality (like using XPaths). (Recover-mode sketch after the list.) IBM also has a great article on high-performance parsing with lxml here: http://www.ibm.com/developerworks/library/x-hiperfparse/
- mrjob - "Lets you write MapReduce jobs in Python 2.6+ and run them on several platforms. The easiest route to writing Python programs that run on Hadoop." (Word-count sketch after the list.)
- odo - Moves data across containers and formats (e.g., CSV, JSON, SQL databases, Pandas DataFrames).
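
The Airflow DAG sketch referenced above. A minimal sketch against the Airflow 2.x API; the DAG id, schedule, and bash commands are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two tasks wired into a tiny DAG; the scheduler runs `extract`
# before `load` because of the dependency declared at the bottom.
with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # declare the dependency edge
```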
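The Luigi task-class sketch referenced above. Task names and file paths are hypothetical; the requires/output/run structure is Luigi's standard pattern:

```python
import luigi

class ExtractTask(luigi.Task):
    # Hypothetical task: writes raw data to a local target.
    def output(self):
        return luigi.LocalTarget("raw.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data\n")

class TransformTask(luigi.Task):
    # Depends on ExtractTask; Luigi resolves the ordering.
    def requires(self):
        return ExtractTask()

    def output(self):
        return luigi.LocalTarget("clean.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())

if __name__ == "__main__":
    luigi.build([TransformTask()], local_scheduler=True)
```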
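The Prefect flow sketch referenced above, written against the Prefect Core (1.x) API the description mentions; the task names are placeholders:

```python
from prefect import task, Flow

@task
def extract():
    return [1, 2, 3]

@task
def load(rows):
    print(f"loaded {len(rows)} rows")

# Tasks called inside the Flow context are wired into a graph;
# Prefect infers the dependency from the data passed between them.
with Flow("etl") as flow:
    rows = extract()
    load(rows)

if __name__ == "__main__":
    flow.run()  # executes locally with Prefect Core
```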
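The BeautifulSoup scraping sketch referenced above; the URL is just a placeholder:

```python
import urllib.request
from bs4 import BeautifulSoup

# Fetch a page and pull every link out of it.
html = urllib.request.urlopen("https://example.com").read()
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    print(a.get("href"), a.get_text(strip=True))
```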
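The Bonobo graph sketch referenced above; the three node functions are hypothetical stand-ins for real extract/transform/load steps:

```python
import bonobo

def extract():
    # Hypothetical source: yield rows one at a time.
    yield from ["alice", "bob"]

def transform(name):
    yield name.title()

def load(name):
    print(name)

# Nodes chained into a graph; rows stream through node by node.
graph = bonobo.Graph(extract, transform, load)

if __name__ == "__main__":
    bonobo.run(graph)
```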
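The Celery task sketch referenced above. It assumes a local Redis broker; any broker URL Celery supports would do:

```python
from celery import Celery

# Placeholder broker URL; swap in your own message broker.
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def add(x, y):
    return x + y

# From another process, add.delay(2, 3) queues the task for a
# worker started with: celery -A tasks worker
```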
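The Dask out-of-core sketch referenced above; the CSV pattern and column name are placeholders:

```python
import dask.dataframe as dd

# Reads a set of CSVs lazily as a partitioned, pandas-like dataframe;
# nothing is loaded into memory until .compute() is called.
df = dd.read_csv("events-*.csv")
daily_counts = df.groupby("date").size().compute()
print(daily_counts)
```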
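The dataset upsert sketch referenced above; the table and column names are hypothetical:

```python
import dataset

db = dataset.connect("sqlite:///:memory:")
users = db["users"]  # the table is created on first write

users.insert(dict(name="alice", age=30))
# Upsert: update the row matching `name` if it exists, insert otherwise.
users.upsert(dict(name="alice", age=31), ["name"])

print(list(users.all()))
```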
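The ijson streaming sketch referenced above. It assumes a top-level {"items": [...]} layout in a placeholder file:

```python
import ijson

# Streams matching objects one at a time instead of loading the
# whole document into memory.
with open("big.json", "rb") as f:
    for record in ijson.items(f, "items.item"):
        print(record)
```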
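The lxml recover-mode sketch referenced above, using a deliberately malformed snippet:

```python
from lxml import etree

broken = "<root><item>ok</item><item>unclosed</root>"

# recover=True makes the parser salvage what it can from invalid XML.
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(broken, parser=parser)

# XPath queries work on the recovered tree.
for item in tree.xpath("//item"):
    print(item.text)
```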
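The mrjob word-count sketch referenced above, the classic MapReduce example run locally:

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    # The mapper emits (word, 1) pairs for each input line,
    # and the reducer sums the counts per word.
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()  # run locally: python wordcount.py input.txt
```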