Changing the default dimension order #8
@cscherrer you've thought about this for SampleChains. Correct me if I'm wrong, but Soss/Tilde indeed represent draws as parameter vectors for sampling with HMC, but after transforming to constrained samples, draws are represented as NamedTuples of arrays. Since SampleChains is geared towards sampling using iterators, it chooses a memory layout that, for a given parameter, places all draws of that parameter from a single chain within a single array. For univariate parameters, this would motivate the memory layout ...
No dimensions should be "special"; that is the main reason for using named dimensions. ArviZ Python just uses this order rule for ... As in Python xarray, slicing/reordering dimensions (in a view) should be doable if needed. Memory layout is a good reason to use a specific order.
The underlying data structure for a TupleVector (which SampleChains uses under the hood) is a (possibly nested) NamedTuple, where each leaf is an ArrayOfArrays (from the package of similar name). This is then given an interface as a Vector of NamedTuples, similar to the approach used by StructArrays.

As for why I took this approach... Say you have a model that draws from an `LKJCholesky`:

```julia
julia> t = as(LKJCholesky(4))
CorrCholesky(4)

julia> TV.dimension(t)
6

julia> x = randn(6)
6-element Vector{Float64}:
 -0.634943
  1.18258
 -0.742168
  0.13168
  0.0724018
 -0.162739

julia> y = TV.transform(t, x)
Cholesky{Float64, UpperTriangular{Float64, Matrix{Float64}}}
U factor:
4×4 UpperTriangular{Float64, Matrix{Float64}}:
 1.0  -0.307219   0.530824    0.0657448
  ⋅    0.951639  -0.300805    0.0361068
  ⋅     ⋅          0.792302  -0.0809617
  ⋅     ⋅           ⋅          0.993891
```

For evaluating the log-density, I'd much rather have `y` than the flat vector `x`. I think our next iteration of this will be something like the following (for a fixed number of samples) ...
Oh, and about iterators... We don't only use ArraysOfArrays.jl, but also ElasticArrays.jl. This makes it so each leaf allows a ...
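A rough sketch of the idea described above (an illustration, not SampleChains' actual code; the parameter size, draw count, and the use of `append!`/`nestedview` are my assumptions):

```julia
using ArraysOfArrays, ElasticArrays

# One leaf of the NamedTuple: all draws of a length-3 parameter from one chain,
# stored as columns of a single matrix that can grow along its last dimension.
flat = ElasticArray{Float64}(undef, 3, 0)
for _ in 1:1000
    append!(flat, randn(3))   # appending a draw grows the last (draw) dimension in place
end

# View the same memory as a vector of per-draw parameter vectors.
draws = nestedview(flat, 1)
length(draws), size(draws[1])   # (1000, (3,))
```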
@oschulz has also thought a lot about this, so he might have more insights.
@cscherrer that makes sense, yes. So IIUC, in the ...

This certainly could fit into ...
Currently using ArraysOfArrays, a vector of ...

Can you say more about your ...
No, I don't think so in terms of storage. To make the storage ragged, you would need to pad the shorter arrays with ...
Not necessarily: ArraysOfArrays.jl provides ...
Here I'm talking about our ...
That is essentially what I built ValueShapes for (switching between structured and flat-array representation). I'm not sure if it can do everything you need already, but it could be modified/extended. ValueShapes can handle vectors of samples (which get flattened into vectors of vectors, stored as a single flat matrix via ArraysOfArrays).
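For readers unfamiliar with ValueShapes, here is a minimal sketch of that flat/structured switch (the parameter names and sizes are made up for illustration):

```julia
using ValueShapes

# Describe the structured layout of one draw.
shape = NamedTupleShape(a = ArrayShape{Real}(3), b = ArrayShape{Real}(2, 2))
totalndof(shape)            # 7: length of the flat representation

# Applying the shape to a flat vector gives a structured, NamedTuple-like view.
x_flat = randn(totalndof(shape))
x = shape(x_flat)
x.a                         # length-3 view into x_flat
x.b                         # 2×2 view into x_flat
```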
I think it sounds like a good idea to store results as a ...
I'll need to look closely at ...
If we're gearing towards ...

Since this discussion is orthogonal to this issue, I'll open a new issue to discuss this.
Yes, this is exactly my concern -- "regular array of floating point values of the same type" is a very strong constraint, and it's very easy to come up with examples where it doesn't make sense.
Thanks @sethaxen. I hope my earlier comment has been useful and not too much of a distraction. Looking at it again, the scope is a lot narrower than I had in mind as I was writing my response. As for compatibility with downstream tools, it seems similar to the situation with models and samplers. We can write lots of different models, but a given sampler might constrain that space. HMC wants fixed-dimension continuous parameter spaces, etc. It's a reasonable constraint, but it's also reasonable to allow more general models than HMC can handle -- in those cases you just need a different inference method. EDIT: We can move this comment to the new issue, once that exists.
Closed by #12
Currently by default we interpret the dimensions of arrays as `(chain, draw, param_dims...)`. I propose a breaking change that uses the new default ordering `(param_dims..., draw, chain)` for several reasons.

TL/DR: The main arguments for this are motivated by efficient memory usage in Julia. The main arguments against this are consistency with Python and with current Julia code, but when using deserialized data, this choice actually makes us more consistent with how Python ArviZ does things.
To demonstrate the reasons for doing this, let's generate "data" in a sensible default structure we might expect from a PPL. When sampling with MCMC, each chain is usually sampled separately (whether in parallel or in serial), so the draws from different chains are separate in memory. For a given chain, the draws are taken in sequence, so by default, for a single draw, the values of the different parameters are stored adjacent to each other in memory. So suppose we have 2 parameters `a` and `b` with sizes `(3,)` and `(2, 2)`. Our samples might then take the form shown in the sketch below.

Now we simply concatenate all of these arrays in the sensible default order, i.e. we concatenate all parameter arrays for a given draw, then all draws from the same chain, then all chains.
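A minimal sketch of what this looks like (the parameter names `a` and `b`, their sizes, and the draw/chain counts are illustrative):

```julia
ndraws, nchains = 100, 4

# What a PPL might hand us: for each chain, a vector of draws,
# where each draw is a NamedTuple of parameter arrays.
chains = [[(a = randn(3), b = randn(2, 2)) for _ in 1:ndraws] for _ in 1:nchains]

# Concatenate parameters within a draw, draws within a chain, then chains.
flatten_draw(draw) = vcat(vec(draw.a), vec(draw.b))           # length 3 + 4 = 7
flatten_chain(chain) = reduce(vcat, map(flatten_draw, chain))
flat = reduce(vcat, map(flatten_chain, chains))
length(flat) == 7 * ndraws * nchains                           # true
```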
Reason 1: the proposed ordering makes the most sense for memory layout
We've constructed a vector whose ordering of parameters keeps parameters that were close in memory (i.e. in the same array) during sampling at least as close in memory. And if we just reshape this array, we get the ordering proposed in this issue, as sketched below.
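A sketch of that reshape (same illustrative sizes as above; `flat` stands in for the concatenated vector):

```julia
ndraws, nchains, nflat = 100, 4, 7            # 7 = 3 values of `a` + 4 values of `b`
flat = randn(nflat * ndraws * nchains)        # stands in for the concatenated samples above

# A single reshape gives the proposed ordering; it shares memory with `flat`, no copy is made.
post = reshape(flat, nflat, ndraws, nchains)  # (flattened params, draw, chain)
```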
Reason 2: If all sampled parameters are returned in this format, we can use just reshapes and views to represent all parameter arrays with the proposed dimension ordering

The above array has everything flattened into a single array, but with no extra allocations we can construct reshaped views of the individual parameter arrays in the proposed shape, as in the sketch below. Note that `a` stays flattened while `b` gets an extra dimension.
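A sketch of those views (same illustrative sizes as above):

```julia
ndraws, nchains = 100, 4
flat = randn(7 * ndraws * nchains)               # concatenated samples as above
post = reshape(flat, 7, ndraws, nchains)

# Views of the individual parameters in the proposed (param_dims..., draw, chain) shape,
# without copying the underlying data.
a = view(post, 1:3, :, :)                                    # (3, draw, chain): `a` stays flattened
b = reshape(view(post, 4:7, :, :), 2, 2, ndraws, nchains)    # (2, 2, draw, chain): `b` regains its shape
```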
Reason 3: We can construct common reshapes with no allocations
The 2 most common reshaping operations we need in ArviZ are 1) flattening array-valued parameters to a vector and 2) combining `chain` and `draw` into a single `sample` dimension. The proposed ordering allows us to do both with `reshape` and no extra allocations, as sketched below.

Because DimensionalData works hard to avoid copying data unless necessary, this ordering should let these common operations be allocation-free, which is especially useful for large datasets.
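A sketch of both reshapes for a single array-valued parameter (illustrative sizes; `b` has parameter dims `(2, 2)`):

```julia
ndraws, nchains = 100, 4
b = randn(2, 2, ndraws, nchains)                 # (param_dims..., draw, chain)

b_flat   = reshape(b, 4, ndraws, nchains)        # 1) flatten the parameter dims to a vector
b_sample = reshape(b, 2, 2, ndraws * nchains)    # 2) combine draw and chain into one sample dim

# Both are reshapes of the same memory; no data is copied.
```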
Reason 4: This ordering is more consistent with how NCDatasets.jl and Zarr.jl both read serialized arrays
Julia uses column-major (i.e. Fortran) ordering, while Numpy uses row-major (i.e. C) ordering. These are opposite, so when an array is serialized into a NetCDF or Zarr format with C ordering, NCDatasets.jl and Zarr.jl deserialize it with the order of all dimensions reversed. Here's an example using InferenceObjectsNetCDF.jl:
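A rough stand-in for that example, using NCDatasets.jl directly; the file name and group layout are my assumptions about a NetCDF file written by Python ArviZ:

```julia
using NCDatasets

ds = NCDataset("centered_eight.nc")      # hypothetical file written by Python ArviZ
theta = ds.group["posterior"]["theta"]   # ArviZ stores each InferenceData group as a NetCDF group
dimnames(theta)                          # expect ("school", "draw", "chain"), i.e. reversed vs. Python's (chain, draw, school)
close(ds)
```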
Here we can see that `theta` is deserialized in Julia with the dimensions reversed compared to the order Python serialized it in (i.e. Python wrote it in `(chain, draw, school)`, and it is read out as `(school, draw, chain)`). The reverse is also true: if in Julia we assume arrays have dimensions in the proposed order, then when we serialize them and they are deserialized by a Python user, they will come out in the default order used by Python ArviZ. To deliver these deserialized arrays with the same dimension order as Python would require us to use `PermutedDimsArray` for all arrays, which we certainly could do.

There's a catch here. This means that if Python wrote a parameter `x` as `(chain, draw, xdim0, xdim1)`, Julia would read it back as `(xdim1, xdim0, draw, chain)`. This is a little nonintuitive, but I think it's fine.

Reason 5: This is the dimension ordering planned for MCMCDiagnosticTools
We plan to port our diagnostic functions to MCMCDiagnosticTools, and in that package we seem to be converging on the ordering proposed here as the ordering to require (see TuringLang/MCMCDiagnosticTools.jl#5). If the proposed default ordering is used, then these functions could be called directly on our `DimensionalArray`s with reshaping only, so there would be no copying of data. (cc @ParadaCarleton and @devmotion in case their viewpoints have changed)

Drawbacks
There are a few drawbacks to this change, but in my opinion they are outweighed by the advantages:

1. To access the `draw` and `chain` dimensions by position, one has to count from the end with `end`, but this cannot be set to an index variable, and one can't use `size(arr, end)`. I don't think this is a major issue because users should probably always be interacting with the dimensions through the DimensionalData interface and not assuming a specific ordering of dimensions (a sketch of name-based access follows at the end of this issue).
2. Users who currently pass arrays in the old ordering to constructors like `from_namedtuple` would need to update code. This is mitigated by marking the change as breaking, but further, in most cases an error will be raised if the user assumes the old ordering, since the sample dimensions will be misaligned.
3. To exactly match the order in which Python serialized the data, we would instead need `(reverse(param_dims)..., draw, chain)`, but this would just make it so Julia always uses nonintuitive parameter dimensions, and this lacks the nice memory layout properties noted above, so I don't think this makes sense. This point is an argument for using `PermutedDimsArray` every time we (de)serialize data.

Most importantly, as with Python ArviZ, all code that consumes `InferenceData` or `Dataset` should work the same regardless of where `chain` and `draw` appear in the dimension ordering, so this is more about which default ordering makes the most sense in the end.
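Regarding the first drawback and the closing point above, a minimal sketch (with assumed dimension names and sizes) of name-based dimension access with DimensionalData, which sidesteps positional indexing entirely:

```julia
using DimensionalData

# An array in the proposed (param_dims..., draw, chain) order, with named dimensions.
x = DimArray(randn(8, 100, 4), (Dim{:school}(1:8), Dim{:draw}(1:100), Dim{:chain}(1:4)))

x[chain=1]       # all draws of the first chain, wherever :chain sits in the ordering
dims(x, :draw)   # look the draw dimension up by name instead of position or `end`
```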