Stars
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
TL;DR FOSS: what sustainability means for open source maintainers, and resources for funding.
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
Open, Multi-modal Catalog for Data & AI
A guide for technical professionals looking to start consulting
Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
The Startup CTO's Handbook, a book covering leadership, management and technical topics for leaders of software engineering teams
A curated list of awesome projects and resources related to Argo (a CNCF graduated project)
📚 Community guides for open source creators
A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
A curated and opinionated list of resources for Chief Technology Officers, with an emphasis on startups
Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team co…
The Metadata Platform for your Data and AI Stack
SeaTunnel is a next-generation, high-performance, distributed tool for massive data integration.
Apache Superset is a Data Visualization and Data Exploration Platform
Uniffle is a high-performance, general-purpose Remote Shuffle Service.
MinIO is a high-performance, S3 compatible object store, open sourced under GNU AGPLv3 license.
Compare tables within or across databases
The official home of the Presto distributed SQL query engine for big data
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.