Feature Request: support dataset layers #5

Open
Sitin opened this issue Dec 18, 2020 · 0 comments

Kedro has a handy feature: layers, special tags that classify datasets according to the data engineering convention. They are quite useful in combination with Kedro-Viz or (as in my case) when building a UI on top of Kedro that sorts and filters datasets by their position in a pipeline.

I think we need to support this feature and provide some way to specify the layer for pipeline checkpoints.

Currently I subclass KedroWings and apply a simple rule based on the first two characters of dataset locations:

from typing import Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro_wings import KedroWings


class ExplicitLark(KedroWings):
    # Two-digit location prefix -> layer name (data engineering convention).
    LAYERS: Dict[str, str] = {
        '01': 'raw',
        '02': 'intermediate',
        '03': 'primary',
        '04': 'feature',
        '05': 'model_input',
        '06': 'model',
        '07': 'model_output',
        '08': 'reporting',
    }

    @hook_impl
    def before_pipeline_run(
        self, run_params: Dict, pipeline: Pipeline, catalog: DataCatalog, name: str = None,
    ):
        # Let KedroWings register its datasets first, then assign layers.
        super().before_pipeline_run(run_params, pipeline, catalog)
        self._update_layers(catalog)

    def _update_layers(self, catalog: DataCatalog):
        # Only datasets whose names start with digits and an underscore are considered.
        for dataset_name in catalog.list(regex_search=r'^\d+_.*'):
            layer_code = dataset_name[:2]
            if layer_code in self.LAYERS:
                layer_name = self.LAYERS[layer_code]
                catalog.layers.setdefault(layer_name, set()).add(dataset_name)

I think we can add a parameter to KedroWings that accepts a dictionary mapping regular expressions to layer names, and fall back to the default convention for XX_* datasets when it is not provided.
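To make the proposal concrete, here is a rough sketch of how such a mapping could be resolved. It is independent of Kedro and KedroWings: the names `DEFAULT_LAYER_RULES` and `assign_layers` are hypothetical, as is the choice that the first matching rule wins. It only illustrates the idea and is not an existing KedroWings API.

```python
import re
from typing import Dict, Iterable, Optional, Set

# Hypothetical default rules: the XX_* convention expressed as regular expressions.
DEFAULT_LAYER_RULES: Dict[str, str] = {
    r'^01_': 'raw',
    r'^02_': 'intermediate',
    r'^03_': 'primary',
    r'^04_': 'feature',
    r'^05_': 'model_input',
    r'^06_': 'model',
    r'^07_': 'model_output',
    r'^08_': 'reporting',
}


def assign_layers(
    dataset_names: Iterable[str],
    layer_rules: Optional[Dict[str, str]] = None,
) -> Dict[str, Set[str]]:
    """Group dataset names by layer using regex -> layer name rules."""
    rules = DEFAULT_LAYER_RULES if layer_rules is None else layer_rules
    compiled = [(re.compile(pattern), layer) for pattern, layer in rules.items()]
    layers: Dict[str, Set[str]] = {}
    for name in dataset_names:
        for pattern, layer in compiled:
            if pattern.search(name):
                layers.setdefault(layer, set()).add(name)
                break  # assumption: the first matching rule wins
    return layers
```

The hook could then merge the returned dictionary into `catalog.layers`, and a user with a different directory layout could pass their own rules, for example `assign_layers(names, {r'^models/': 'model'})`.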
