Trouble implementing models in Axon #522

Closed
krstopro opened this issue Aug 19, 2023 · 30 comments · Fixed by #524

Comments

@krstopro
Member

Recently I tried coding a somewhat popular model in Axon, only to give up after a few hours. The reason is that I found implementing custom models very hard, if not impossible. It could be that I am missing something, that I need more practice, or that I am biased towards PyTorch, which I had been using for years. Still, I would like to ask some questions.
Is there any particular reason why Axon.layer_name was chosen to return a function and not a struct with parameters? I know the latter is more OOP than functional (as stated here), but implementing custom models seems simpler to me that way, and it allows parameters to be reused easily. I also think this is exactly the approach taken in Nx.Scholar.
Many popular algorithms involve applying the same set of weights over and over again to the input: for example, recurrent neural networks (potentially with custom cells), Deep Sets, meta-learning (MAML and related algorithms), Neural Cellular Automata, etc. Currently, I don't think it is even possible to implement these in Axon (and I am not sure what is going on with the get_parameters and set_parameters functions).
Am I missing something? Or are there any plans to change the approach? I am aware of the "Add weight sharing" issue still being open.

krstopro changed the title from "Trouble with implementing models in Axon" to "Trouble implementing models in Axon" on Aug 19, 2023
@seanmor5
Contributor

Can you share which model you tried implementing?

The reason layer names are functions is to make inference deterministic. Previously we relied on unique_integer to ensure layers/parameters had unique names and would not accidentally be shared across layers. An issue with this is that if you have a function which returns an Axon model, you would always get a unique model unless you explicitly specified every layer name. That means every time you tried to use one of these models, it would need to be fully recompiled by Nx/EXLA.

Note that you can still pass a binary as a layer name; the function is really only used internally. The only other way to enforce this would be to force :name to be a required parameter of each layer.
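
For example, something like this (an illustrative snippet; the names and shapes are arbitrary):

# An explicit binary name keeps the layer and parameter names stable, so a
# function that builds this model always returns the same graph and the
# compiled function can be cached instead of recompiled.
input = Axon.input("features", shape: {nil, 16})
model = Axon.dense(input, 32, name: "encoder_dense")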

Looking at some of the model implementations in Bumblebee might help a bit as well. It's difficult to know for sure, though, without understanding what you're trying to do.

@krstopro
Member Author

krstopro commented Aug 19, 2023

Can you share which model you tried implementing?

Sure! I was trying to implement Neural Cellular Automata (see https://distill.pub/2020/growing-ca/) for fun and practice.
The main issue was applying the same module (a sequence of layers) to the input for a fixed number of steps (which could be passed as an option to the model). I think the figure at that link explains what is going on reasonably well. There are existing PyTorch and TensorFlow implementations, so I don't need to reinvent the wheel.
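
Roughly, the pattern I could not express is the following (just a sketch; update_block stands in for whatever reusable, weight-sharing block would be needed, and initial_state is the seed grid):

# The same block (one set of weights) is applied to the grid state for a fixed
# number of steps.
num_steps = 64

final_state =
  Enum.reduce(1..num_steps, initial_state, fn _step, state ->
    update_block.(state)
  end)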

Looking at some of the model implementations in Bumblebee might help a bit as well.

This might be a good next step for me. Thanks!

@seanmor5
Contributor

Is there a particular reason why, for example, we have Axon.dense as a function, instead of an Axon.Dense module with Axon.Dense.new (which returns the parameters) and Axon.Dense.apply or Axon.Dense.forward (which takes those parameters together with the input)?

I think the approach we take at the moment is much more versatile and functional than the module-based approach. I'm not quite sure how having separate modules per layer would work. In Scholar those algorithms are standalone, but in Axon layers are composable, and the module-based approach is not very composable. Note that you can get "apply" by just going to the low-level Axon.Layers implementations. You will need to manage parameter initialization yourself, but it is possible.
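
For example, something along these lines (a minimal sketch; x1 and x2 are assumed to be {batch, 16} tensors, and the shapes are arbitrary):

# Initialize the parameters yourself, then reuse them for as many inputs as you
# like by calling the functional implementations in Axon.Layers directly.
key = Nx.Random.key(42)
{kernel, key} = Nx.Random.normal(key, 0.0, 1.0, shape: {16, 32})
{bias, _key} = Nx.Random.normal(key, 0.0, 1.0, shape: {32})

y1 = Axon.Layers.dense(x1, kernel, bias)
y2 = Axon.Layers.dense(x2, kernel, bias)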

@krstopro
Member Author

In Scholar those algorithms are standalone, but in Axon layers are composable, and the module-based approach is not very composable.

In a module-based approach, composability should be easily achievable with a Sequential API (e.g. as done in PyTorch). Parameter sharing in that approach is very simple, since all you need to do is pass the parameters as arguments. With the current approach, I don't think parameter sharing is even possible.

Note that you can get "apply" by just going to the low-level Axon.Layers implementations. You will need to manage parameter initialization yourself, but it is possible.

Indeed, I am aware of the transforms implemented in Axon.Layers, but I don't think that solves parameter sharing.

@josevalim
Contributor

I think framing it as module vs. function is barking up the wrong tree. :) A module is nothing more than a collection of functions, and it won't add any inherent capabilities. Structs, as mentioned earlier, could introduce new capabilities, and they would tie data to a module, but due to Elixir's functional nature, and given that a struct in Elixir is nothing more than a map with a special key, the same could be replicated easily in other ways.

I think it is worth taking a step back (well, at least for me, since I am not well versed in ML). We are talking a lot about the solution, but I am trying to grasp the problem. If the issue is parameter sharing, wouldn't it be a matter of replacing this line:

https://github.com/elixir-nx/axon/blob/main/lib/axon.ex#L722

By something like:

kernel = opts[:kernel_param] || param("kernel", kernel_shape, initializer: opts[:kernel_initializer])

?

@josevalim
Contributor

Ah, I guess the issue with the above is that the parameters are not considered shareable anyway. But we could likely introduce axon = Axon.add_shared_param(axon, ...), with the shared params all stored in a global configuration in Axon.

@krstopro
Member Author

Ah, I guess the issue with the above is that the parameters are not considered shareable anyway. But we could likely introduce axon = Axon.add_shared_param(axon, ...), with the shared params all stored in a global configuration in Axon.

But how do you pass a parameter to the layer? I don't think that is possible at the moment.

@josevalim
Contributor

@krstopro change the layer to either:

  1. allow the parameter to be given as an option
  2. allow a layer param to be converted to shared after the fact

Anyway, the point I am getting at is that I don't think this is a design issue, and I think treating it as one is going to lead in the wrong direction. Remember that objects (or rather, mutability) allow you to give everything an identity based on its position in memory (with all the complexity that comes from that, because now you need to track how things change over time!). So you share by making things point to the same place. There is nothing in modules or structs in Elixir that will give you that. If you want to share something, you need to give it an explicit name and put it somewhere shared.

@krstopro
Member Author

krstopro commented Aug 20, 2023

@josevalim

allow the parameter to be given as an option

This might solve the problem.

allow a layer param to be converted to shared after the fact

Hmmm this might require solution number 1.

What I was describing is the following approach (it is exactly what is done in Nx.Scholar; something similar is done in PyTorch):

defmodule Axon.Dense do
  defstruct [:weights, :bias]
  
  def new(num_units) do
    # returns struct with weights and bias parameters
  end
  
  def apply(x, %__MODULE__{weights: weights, bias: bias}) do # or def forward
    # applies weights and bias to x
  end
end

Then, suppose we have two inputs x1 and x2. It is very easy to apply the same layer to these inputs by doing

layer = Axon.Dense.new(num_units)
y1 = Axon.Dense.apply(x1, layer)
y2 = Axon.Dense.apply(x2, layer)

I don't think we can do something like this at the moment.

@josevalim
Contributor

josevalim commented Aug 20, 2023

That would not necessarily solve it. Look at this Python code:

>>> class User:
...   name = None
...
>>> user = User()
>>> user.name = "josé"
>>> other = User()
>>> other.name = "josé"
>>> user == other
False

and then:

>>> from dataclasses import dataclass
>>> @dataclass
... class User:
...   name: str = None
...
>>> user = User()
>>> user.name = "josé"
>>> other = User()
>>> other.name = "josé"
>>> user == other
True

In Elixir, everything is data (like the second example). So if you have two parameters with the same name and shape, does that mean they are shared? No, it doesn't: it could be a coincidence that they are represented the same way, and they may still get different values on execution. That's why having a struct would not help; we still could not simply assume that because something looks the same, it is the same. With objects, they would point to different memory addresses, and that's how you would know they are different.

One way to do this in Elixir is by adding a unique value, such as make_ref(), to each parameter. This way we would know whether two parameters are the same by looking at the ref, so a parameter appearing in different layers with the same ref would be the same: great!

The problem is that it breaks equality:

iex> Axon.dense(32) == Axon.dense(32)
false

This would be false because each call now points to something unique (as objects/memory addresses would), which feels very counterintuitive when everything we have is data.
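
In plain Elixir terms (not actual Axon API), the trade-off looks something like this:

# Two params built with fresh refs are never equal, even if they otherwise look
# identical; only the exact same ref marks two occurrences as the same parameter.
new_param = fn name -> %{name: name, ref: make_ref()} end

new_param.("kernel") == new_param.("kernel")
#=> false

param = new_param.("kernel")
param == param
#=> true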

@josevalim
Contributor

Ok, here is a more concrete API:

  • Add Axon.shared_param(...), the same as Axon.param but returns Axon.SharedParam

  • Add :weight and :bias option to dense (you can deprecate/replace :use_bias by setting :bias to false)

  • Once an Axon.shared_param is given to a layer, Axon will also store it in a field called shared_params = %{shared_param_key => shared_param_struct}. All shared_params with the same name must be exactly the same

  • Shared params must be given under a different namespace when executing (perhaps "_shared" - this may even mean we could implement all of this by having a fake layer where all shared params are stored - but that's an implementation detail)
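
For illustration, a hypothetical usage of the above (neither Axon.shared_param nor the :weight option exists yet; this is just a sketch of the proposal):

# Both dense layers read the same 16-feature input, so the shared {16, 32}
# kernel fits both; Axon would record it under the shared params namespace.
shared_kernel = Axon.shared_param("shared_kernel", {16, 32})

input = Axon.input("features", shape: {nil, 16})
left = Axon.dense(input, 32, weight: shared_kernel, name: "left")
right = Axon.dense(input, 32, weight: shared_kernel, name: "right")
model = Axon.add(left, right)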

@krstopro
Member Author

@josevalim I see now, thanks! I think this is what @seanmor5 already wrote here.

@josevalim
Contributor

Yeah, exactly. We made these mistakes in the past, which is one way of learning. :D

@krstopro
Member Author

krstopro commented Aug 20, 2023

Ok, here is a more concrete API:

  • Add Axon.shared_param(...), the same as Axon.param but returns Axon.SharedParam
  • Add :weight and :bias option to dense (you can deprecate/replace :use_bias by setting :bias to false)
  • Once an Axon.shared_param is given to a layer, Axon will also store it in a field called shared_params = %{shared_param_key => shared_param_struct}. All shared_params with the same name must be exactly the same
  • Shared params must be given under a different namespace when executing (perhaps "_shared" - this may even mean we could implement all of this by having a fake layer where all shared params are stored - but that's an implementation detail)

@josevalim Correct me if I'm wrong, but implementing these would require changing every existing layer in Axon, right?

@josevalim
Contributor

josevalim commented Aug 20, 2023

Good call. Layers would need to declare which parameters they allow to be shared, correct. Making a param shared after it is defined would not require that, though. Something like this:

axon
|> Axon.dense(32, name: "dense1")
|> Axon.share_param("dense1.bias", as: "shared_dense_bias")
|> Axon.dense(32, name: "dense2")
|> Axon.share_param("dense2.bias", as: "shared_dense_bias")

We would still need to store the shared params somewhere to make sure the shapes match, but the API above would be less bureaucratic, yeah. Or even:

axon
|> Axon.dense(32, name: "dense1")
|> Axon.dense(32, name: "dense2")
|> Axon.share_param(["dense1.bias", "dense2.bias"], as: "shared_dense_bias")

@krstopro
Member Author

krstopro commented Aug 20, 2023

One of the problems might be the following (again, I might be wrong).
Suppose we have a complex model, e.g. something like this (taken directly from https://hexdocs.pm/axon/Axon.html):

model =
  input
  |> Axon.dense(128, activation: :relu)
  |> Axon.batch_norm()
  |> Axon.dropout(rate: 0.8)
  |> Axon.dense(64)
  |> Axon.tanh()
  |> Axon.dense(10)
  |> Axon.activation(:softmax)

How do I share its parameters?

@josevalim
Contributor

IIRC, each layer gets a name even if you don't give one explicitly (e.g. dense1, dense2, dense3, etc.). You could rely on those generated names, but I would instead explicitly name the layers so we can share the params. Something like:

model =
  input
  |> Axon.dense(128, activation: :relu, name: "dense1")
  |> Axon.batch_norm()
  |> Axon.dropout(rate: 0.8)
  |> Axon.dense(64, name: "dense2")
  |> Axon.tanh()
  |> Axon.dense(10, name: "dense3")
  |> Axon.activation(:softmax)
  |> Axon.share_param(["dense1.bias", "dense2.bias", "dense3.bias"], as: "shared_bias")

@krstopro
Member Author

I've been thinking about this, but I am not sure it solves the problem. If we have two inputs, say input1 = Axon.input("input1", shape: {nil, dim}) and input2 = Axon.input("input2", shape: {nil, dim}), how do we apply the same sequence of layers (with the same weights!) to both input1 and input2?

Another issue with accessing layers by name might arise if we want to do the same with a complex model with a lot of layers (e.g. ResNet-101, which is 101 layers deep). Would we need to name every layer in the model and then iterate over them?

@polvalente
Contributor

polvalente commented Aug 20, 2023

I believe it would be something like this:

input1 = Axon.input("x", shape: {nil, 10})
input2 = Axon.input("y", shape: {nil, 10})

model_fn = fn input, i ->
  input
  |> Axon.dense(16, name: "dense0_#{i}", activation: :relu)
  |> Axon.dense(32, name: "dense1_#{i}", activation: :relu)
end

model = 
  Axon.concatenate([model_fn.(input1, "x"), model_fn.(input2, "y")])
  |> Axon.dense(20, activation: :relu)
  |> Axon.dense(1, activation: :sigmoid)
  |> Axon.share_param(["dense0_x.bias", "dense0_y.bias"], as: "shared_bias0")
  |> Axon.share_param(["dense1_x.bias", "dense1_y.bias"], as: "shared_bias1") 

Although I believe we could get away with marking a given subgraph as shared, as if it was possible to do:

model_fn = Axon.shared_params("shared_params0", fn input -> 
  input
  |> Axon.dense(16, activation: :relu)
  |> Axon.dense(32, activation: :relu)
end)

This function would then be usable in the same fashion as above, but instead of each call creating fully separate nodes, the second call would know to use the same parameters as the first one, similar to an Axon.namespace.
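
Hypothetically, the usage would then look like this (Axon.shared_params does not exist; the point is just that both calls reuse one set of parameters):

# Both branches go through the same shared-parameter block before being merged.
model =
  Axon.concatenate([model_fn.(input1), model_fn.(input2)])
  |> Axon.dense(20, activation: :relu)
  |> Axon.dense(1, activation: :sigmoid)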

@josevalim
Contributor

@polvalente your version would conflict on the name, no?

@polvalente
Contributor

@josevalim As it currently stands, the name is ignored and the generated graph contains two separate instances, at least as shown via Axon.Display:

[Screenshot: Axon.Display output showing the two dense blocks as separate subgraphs]

@polvalente
Contributor

Although, re-reading it, my example is kind of nonsense. :)
I'll edit the comment with a corrected one.

@seanmor5
Contributor

@polvalente Your solution is kind of what I was thinking of with Axon.block, which would represent a reusable block where the parameters are always the same. Though maybe it makes sense to have something more explicit, like Axon.clone, i.e. use this function to create a clone of the contained subgraph every time it is used. It would return an anonymous function, and anywhere that function is used, the same parameters are reused.
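
A rough sketch of what I mean (the API is not final, so take the exact names with a grain of salt):

# Axon.block would wrap a subgraph-building function; every call of the
# returned function reuses the same parameters instead of creating new ones.
block =
  Axon.block(fn input ->
    input
    |> Axon.dense(16, activation: :relu)
    |> Axon.dense(32, activation: :relu)
  end)

out1 = block.(input1)
out2 = block.(input2)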

@krstopro
Member Author

krstopro commented Aug 20, 2023

@polvalente Your solution is kind of what I was thinking of with Axon.block, which would represent a reusable block where the parameters are always the same. Though maybe it makes sense to have something more explicit, like Axon.clone, i.e. use this function to create a clone of the contained subgraph every time it is used. It would return an anonymous function, and anywhere that function is used, the same parameters are reused.

Would it make sense to have higher-order functions, such as an Axon.map that applies the same layer (with the same parameters) to an Enum of inputs?

@polvalente
Contributor

polvalente commented Aug 20, 2023

If you have that Axon.block/clone returning an anonymous arity-1 function, you can just Enum.map that over your Enum of inputs
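
Something like this (assuming block is the arity-1 function returned by that Axon.block/clone and inputs is a list of Axon inputs):

# Apply the same shared-parameter block to every input, then merge the results.
outputs = Enum.map(inputs, block)
merged = Axon.concatenate(outputs)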

@krstopro
Member Author

If you have that Axon.block returning an anonymous arity-1 function, you can just Enum.map that over your Enum of inputs

I guess I have to check how Axon.block/clone works. :)

@polvalente
Contributor

If you have that Axon.block returning an anonymous arity-1 function, you can just Enum.map that over your Enum of inputs

I guess I have to check how Axon.block/clone works. :)

That's the suggestion Sean made right above, in the comment you replied to.

@seanmor5
Contributor

seanmor5 commented Aug 20, 2023

I'll take a crack at Axon.block this week and post the branch for some feedback

@krstopro
Member Author

I'll take a crack at Axon.block this week and post the branch for some feedback

That's awesome, thanks for the quick action! Will be happy to contribute, assuming I can.

@seanmor5
Contributor

@krstopro There is a draft of blocks here: #524

Please give it a try and let me know if it helps a bit :)

The idea is that you can use blocks almost like you would a PyTorch module. It's an incomplete draft, so expect bugs and limitations. I will continue working on it this week
