Refactor pipeline to use grain crop dictionaries #1022

SylviaWhittle · 2024-11-22T17:19:21Z

This is a draft PR and documentation / tests will be added before full PR is made

This PR is huge (sorry)

Main things

This PR is designed to improve how we handle grains in the processing stages of TopoStats, from the grain finding stage up to the disordered tracing stage. In future this might be extended through disordered tracing and beyond, but I've restricted the scope of this PR for the sake of everyone's sanity. The reason for stopping at disordered tracing is that once it returns, all the data is already wrapped up in neatly structured dictionaries, by grain and molecule, similar to what I've implemented here, so I deemed it close enough not to bother changing yet.

This PR tries to standardize how we handle grains by using dataclasses:

ImageGrainCrops
- has two attributes, above and below, each holding a GrainCropsDirection object for that direction's grain crops
GrainCropsDirection
- has two attributes: crops and full_mask_tensor
- crops stores a dictionary of GrainCrop objects (dict[int, GrainCrop])
- full_mask_tensor stores a full-sized mask for the image, of size NxNxC where C is the number of classes. This is NOT automatically updated when crops is edited, because we don't want to update things during a loop. This can be discussed if it is the wrong decision!
GrainCrop
- stores various properties of the grain, such as mask, image, bbox, padding etc.
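
For anyone skimming, here is a rough sketch of how the three containers nest. It is not the code in this PR: the field names on GrainCrop, the bbox convention and the body of update_full_mask_tensor are assumptions made for illustration, and background-class handling is left out. It mainly shows that the full mask tensor stays stale until it is rebuilt explicitly.

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass
class GrainCrop:
    """One cropped grain (field names are illustrative assumptions)."""

    image: np.ndarray  # (h, w) height data for the crop
    mask: np.ndarray  # (h, w, C) boolean mask, one channel per class
    bbox: tuple[int, int, int, int]  # assumed (row_min, col_min, row_max, col_max)
    padding: int
    pixel_to_nm_scaling: float


@dataclass
class GrainCropsDirection:
    """All crops for one direction plus the full-image mask."""

    crops: dict[int, GrainCrop]
    full_mask_tensor: np.ndarray  # (N, N, C)

    def update_full_mask_tensor(self) -> None:
        """Rebuild the full mask from the crops; never called automatically."""
        rebuilt = np.zeros_like(self.full_mask_tensor)
        for crop in self.crops.values():
            r0, c0, r1, c1 = crop.bbox
            rebuilt[r0:r1, c0:c1, :] |= crop.mask
        self.full_mask_tensor = rebuilt


@dataclass
class ImageGrainCrops:
    """Grain crops for both directions of one image."""

    above: GrainCropsDirection | None = None
    below: GrainCropsDirection | None = None


# Hypothetical 8x8 image with two classes and a single 3x3 crop.
crop_mask = np.zeros((3, 3, 2), dtype=bool)
crop_mask[1, 1, 1] = True
crop = GrainCrop(
    image=np.zeros((3, 3)),
    mask=crop_mask,
    bbox=(2, 2, 5, 5),
    padding=1,
    pixel_to_nm_scaling=0.5,
)
direction = GrainCropsDirection(
    crops={0: crop},
    full_mask_tensor=np.zeros((8, 8, 2), dtype=bool),
)
image_grain_crops = ImageGrainCrops(above=direction)

# The tensor does not follow the crops until we ask it to.
stale_pixels = int(direction.full_mask_tensor.sum())
direction.update_full_mask_tensor()
fresh_pixels = int(direction.full_mask_tensor.sum())
```

In this sketch a loop can add, remove or edit entries in crops freely and call update_full_mask_tensor once afterwards.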

This has the benefit of standardizing how we handle grains going forward, as we had previously been rather discordant in the types of data structures that we use in various parts of the codebase.

It also adds a helpful (I hope!!) layer of abstraction to processing functions. For example, the run_grainstats function in processing no longer needs to take image, grain_masks and pixel_to_nm_scaling; it now takes just image_grain_crops, which contains all the data for each crop.
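
To illustrate the kind of signature this allows, here is a toy function that takes only an image_grain_crops-like object. The function name grain_areas, the SimpleNamespace stand-ins and the single-class nested-list masks are all hypothetical and much simpler than the real dataclasses; skipping a direction that is None is likewise an assumption.

```python
from types import SimpleNamespace


def grain_areas(image_grain_crops) -> list[dict]:
    """Return one row per grain using only data carried by the crops."""
    rows = []
    for direction in ("above", "below"):
        direction_crops = getattr(image_grain_crops, direction)
        if direction_crops is None:
            # Nothing was found in this direction, so there is nothing to report.
            continue
        for grain_number, crop in direction_crops.crops.items():
            pixels = sum(sum(row) for row in crop.mask)
            rows.append(
                {
                    "direction": direction,
                    "grain_number": grain_number,
                    "area_nm2": pixels * crop.pixel_to_nm_scaling**2,
                }
            )
    return rows


# Stand-in objects: one grain above the surface, nothing below.
toy_crop = SimpleNamespace(mask=[[0, 1], [1, 1]], pixel_to_nm_scaling=2.0)
toy_image_grain_crops = SimpleNamespace(
    above=SimpleNamespace(crops={0: toy_crop}),
    below=None,
)
area_rows = grain_areas(toy_image_grain_crops)
```

The caller no longer has to pass the image, the masks and the scaling separately, because each crop carries its own copy.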

This does of course come at the cost of increased memory usage, as parts of images are duplicated in the data structures and the pixel_to_nm_scaling factor etc. is stored repeatedly. However, I personally find that the benefits far outweigh the negatives. When working on the harbo-rings project, I found myself naturally extracting all the grains and storing them in a dictionary rather than keeping track of full image masks, and I know Max also does this, based on how he's handled the tracing code.

`disordered_tracing.py`

Removed prep_arrays. It is no longer needed: it built a dictionary of grain crops, which the refactor already provides.

TopoStats Pull Requests

Please provide a descriptive summary of the changes your Pull Request introduces.

The Software Development section of
the Contributing Guidelines may be useful if you are unfamiliar with linting, pre-commit, docstrings and testing.

NB - This header should be replaced with the description but please complete the below checklist or a short
description of why a particular item is not relevant.

Before submitting a Pull Request please check the following.

Existing tests pass.
Documentation has been updated and builds. Remember to update as required...
- docs/configuration.md
- docs/usage.md
- docs/data_dictionary.md
- docs/advanced.md and new pages it should link to.
Pre-commit checks pass.
New functions/methods have typehints and docstrings.
New functions/methods have tests which check the intended behaviour is correct.

Optional

`topostats/default_config.yaml`

If adding options to topostats/default_config.yaml please ensure.

There is a comment adjacent to the option explaining what it is and the valid values.
A check is made in topostats/validation.py to ensure entries are valid.
Add the option to the relevant sub-parser in topostats/entry_point.py.

SylviaWhittle · 2024-12-03T17:19:26Z

Proposed solution to the data frame issue

|----------------------------------------------------------------------------------------------------------------------|
|   image   |    direction   |     class         | grain | molecule | ... <grainstats> ... | ... <dnatracingstats> ... |
| mini.spm  |    above       |   dna_only        | 0     | 0        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   dna_only        | 0     | 1        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   dna_only        | 1     | 0        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   dna_only        | 1     | 1        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   protein_only    | 0     | 0        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   protein_only    | 0     | 1        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   protein_only    | 1     | 0        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   protein_only    | 1     | 1        | ... <stats> ...      |          NONE             |
| mini.spm  |    above       |   combined_mask   | 0     | 0        | ... <stats> ...      |     ... <stats> ...       |
|----------------------------------------------------------------------------------------------------------------------|
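
As a sketch of how rows in that shape could be assembled, the snippet below builds one dictionary per image, direction, class, grain and molecule and writes them out as CSV. The helper name build_rows is made up for illustration, and the grainstats and tracing statistics columns are omitted.

```python
import csv
import io

COLUMNS = ["image", "direction", "class", "grain", "molecule"]


def build_rows(image, direction, classes, n_grains, n_molecules):
    """Return one row (a dict) per class, grain and molecule combination."""
    return [
        {
            "image": image,
            "direction": direction,
            "class": class_name,
            "grain": grain,
            "molecule": molecule,
        }
        for class_name in classes
        for grain in range(n_grains)
        for molecule in range(n_molecules)
    ]


rows = build_rows("mini.spm", "above", ["dna_only", "protein_only"], 2, 2)
# The combined mask gets its own row, as in the last line of the table.
rows.append(
    {
        "image": "mini.spm",
        "direction": "above",
        "class": "combined_mask",
        "grain": 0,
        "molecule": 0,
    }
)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Keeping everything in one long table means statistics that only exist for some classes would be empty in the other rows, as with the NONE entries above.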

ns-rse · 2024-12-03T17:26:42Z

It's that or split into separate files.

I'm ambivalent about the preferred solution as I don't use the output, but consideration should be given to end users. Whilst data management, manipulation, summarisation and plotting are, in my view, core skills for researchers these days, experience levels vary widely and I don't know what would be easiest.

ns-rse · 2024-12-10T12:02:21Z

Are we aiming to include this refactoring in v2.3.0 release?

SylviaWhittle · 2025-02-10T22:00:03Z

"it works on my machine!" (tests fail on GH-A, but not on my machine, seems to be due to this issue again:

ns-rse · 2025-02-11T10:06:52Z

"it works on my machine!" (tests fail on GH-A, but not on my machine, seems to be due to this issue again:

I think this was addressed in #995.

It's down to topoly versions. The GitHub Action uses topoly-1.0.4, and the changeset of this pull request shows that the values in the target CSV tests/resources/tracing/ordered_tracing/catenanes_ordered_tracing_molstats.csv have been reverted to those from the older version.

Do you by chance have topoly-1.0.2 on your system/virtualenv? (pip show topoly) If so, can you upgrade to topoly-1.0.4 and revert the above-mentioned file? I think tests would then pass both locally for you and on GitHub Actions.
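
If it is handier than pip show, a small check along these lines reports which topoly version the active environment has. It uses only the standard library; the helper names are made up, and version_tuple assumes plain dotted numeric versions such as 1.0.4.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str | None:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(text: str) -> tuple[int, ...]:
    """Turn '1.0.4' into (1, 0, 4); only handles plain dotted numbers."""
    return tuple(int(part) for part in text.split("."))


EXPECTED = "1.0.4"  # the version the GitHub Action is reported to use
found = installed_version("topoly")
if found is None:
    print("topoly is not installed in this environment")
elif version_tuple(found) < version_tuple(EXPECTED):
    print(f"topoly {found} is older than {EXPECTED}")
else:
    print(f"topoly {found} is at least {EXPECTED}")
```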

SylviaWhittle · 2025-02-11T14:44:55Z

Okay I have ticked all the boxes in the PR guide (that is such a good feature) I believe. I'm absolutely sure I'll have missed big things, but I'll submit for review now.

Reviewers, please do just flag stuff as it comes. Don't assume I've done something right and waste time trying to comprehend my silly ways of doing things; I don't want to take up more of your time than is absolutely necessary 👍

Also @ns-rse, IIRC you like to go through a PR commit-by-commit. If so, I would strongly advise against this: I developed this PR in a very broken state to begin with, so the early commits won't make sense or reflect the end PR. I am very sorry for how awful this PR is 😅

SylviaWhittle and others added 21 commits November 20, 2024 14:35

WIP: Scope out refactor for grains.py

58757ad

WIP: Begin grains > grainstats pipeline overhaul. Outline data structures.

f49b1e6

WIP: Scope out changes to GrainStats.calculate_stats to allow for multi class & subgrains

4334027

WIP: Initial proposal for grainstats using grain dictionary refactor

d106b0f

Add function: graincrops_merge_classes

1acee67

Add function: graincrops_update_background_class

a7b9075

Update: extract_grains_from_full_image now works in theory, untested

913e3b5

Fix: extract_grains_from_full_image_mask: allocating region to empty mask required bool. Tested in debugger.

843a606

WIP: Switch vetting, merging and update background to work on dicts of GrainCrops

535bea8

WIP: Update vet_grains to take / return dicts of GrainCrops

36c07e2

WIP: Update grainstats handling of dataframe to use list of dicts for each row

d1e4c1d

Fix: validate_full_mask_tensor_shape: Require len(shape) == 3, shape[2] >= 2, shape[1]==shape[2]

ace46ea

Edit: find_grains now stores grains in self.image_grain_crops: ImageGrainCrops. Locally debugged working

e2824f9

WIP: Handle ImageGrainCrops between run_grains and run_grainstats

f21eccd

WIP: Graintstats handles ImageGrainCrops

63d0003

Fix: grainstats: process scan no longer needing grain plots returned

02ecf43

WIP: Begin grains > disordered_tracing pipeline overhaul

651e89c

Merge branch 'main' into SylviaWhittle/grain_restructure

a0370cc

[pre-commit.ci] Fixing issues with pre-commit

a03524e

WIP: grains > disorderd_tracing pipeline | fix typing and remove whole image plotting

8d215fd

[pre-commit.ci] Fixing issues with pre-commit

3a18880

Add: class index to disordered tracing config

7781c86

[WIP] Fix: Attempt to fix grain_number double index issue

80b711c

MaxGamill-Sheffield reviewed Dec 10, 2024

topostats/processing.py (outdated)

MaxGamill-Sheffield reviewed Dec 10, 2024

topostats/default_config.yaml (outdated)

MaxGamill-Sheffield reviewed Dec 10, 2024

topostats/tracing/disordered_tracing.py (outdated)

remove raising error on empty direction

4c6f0f3

Co-authored-by: Neil Shephard <[email protected]>

SylviaWhittle and others added 12 commits February 3, 2025 15:07

Add test: test_validate_full_mask_tensor

857e920

Add test: test_graincropsdirection_update_full_mask_tensor

236e695

Linting: processing.py

fb8aaaa

Linting: grainstats.py

f0bf8a7

Tidy up disorderd_tracing.py: remove prep_arrays, pad_width argument, linting

1c5ddd8

Tidy up nodestats, remove pad_width argument

23562f7

Tidy up ordered_tracing, remove pad_width argument

7b8c764

Tidy up unet_masking.py

e4659e1

Merge main into grains refactor

be2e8ad

[pre-commit.ci] Fixing issues with pre-commit

41a682d

Actually finally delete testmondata (sorry)

0761c9c

Chore: Ignore testmondata

cf6b706

SylviaWhittle force-pushed the SylviaWhittle/grain_restructure branch from f517856 to cf6b706 on February 5, 2025 10:28

SylviaWhittle added 4 commits February 10, 2025 15:02

Edit: .topostats file contains full mask tensors now

b046f74

Tidy: Remove print statements

8c26042

Tidy: conftest.py

0061b7c

Tidy: test_processing.py

eded504

SylviaWhittle added 5 commits February 11, 2025 11:01

Fix test: test_ordered_tracing_image: topoly version issue

59d7398

Add documentation for added config options

59bab1b

Add parameters to entry_point.py

6f4ba99

Add documentation to data_dictionary.md for subgrain number and class number

b67926d

Add brief explanations in advanced/grain_finding.md for the GrainCrops and tensors

3beda32

SylviaWhittle marked this pull request as ready for review February 11, 2025 14:45

SylviaWhittle requested review from ns-rse, llwiggins and MaxGamill-Sheffield February 11, 2025 14:46

Fix test: entry points parameter tests

c3e930e
