Hi, I've been trying to understand the `get_residual_entropy` method, and can't quite figure it out :) Awesome paper by the way :)
Really, my underlying need is to be able to use it on input where the language L has a vocab size larger than 2. I'm unclear whether that's possible in the existing implementation? I'm sort of guessing that it's using a greedy approach to minimize over all partitions, and that the greedy approach only works for a vocab size of 2 in L?
One question I'm wondering about: is the current implementation actually normalizing, as per the formula in section 5.1 of the paper? My entropy skills are not that great, so I may be misunderstanding. Do you mind if I walk through a simplish case study, and you can tell me where I'm going wrong?
Let's suppose that we have num categories = 2, and each category C[i] can take 3 possible values. The utterances will have a vocab size of 2 (as per the paper) and total 6 bits. Then L* will comprise all nine combinations of the two category values (with each category's values offset, per your convention). The entropy of each C[i] will be (and I'm using base-2 entropy here, since I like thinking of it as the number of bits required) H(C[i]) = log2(3) ≈ 1.58496 bits.
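To make the setup concrete, here's the throwaway script I'm using to build the toy meaning space and to double-check that per-category entropy. The names (`meanings`, `category_entropy`) and the encoding are just mine for illustration, not the repo's data format:

```python
import itertools
import math
from collections import Counter

# Toy meaning space: 2 categories, 3 values each -> 9 meanings.
# (Purely illustrative; not trying to match the repo's exact encoding.)
num_categories = 2
num_att_vals = 3
meanings = list(itertools.product(range(num_att_vals), repeat=num_categories))

def category_entropy(meanings, i):
    """Empirical base-2 entropy of category i over the meaning space."""
    counts = Counter(m[i] for m in meanings)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(category_entropy(meanings, 0))  # log2(3) ≈ 1.58496
```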
We then generate a random holistic language L, where every utterance is different, and pass L* and L through the `get_residual_entropy` method from this repo. I've modified the method in two ways: it now also outputs `indx`, so we can see the partition, and I've replaced the 10 with `num_att_vals`, which is passed in as a parameter. The result is that the partition comprises one sub-sequence containing the entire z, and one empty sub-sequence.
Let's suppose that somehow `indx` is the partition which results from minimizing over all partitions. I don't quite understand how this happens, but let's suppose that somehow it does, e.g. maybe by using some kind of greedy approach?
Looking at the formula in 5.1, we need to calculate the conditional entropy H(C[i] | z) twice: once for the category assigned the empty sub-sequence, and once for the category assigned the sub-sequence containing all positions.
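For reference, this is the formula as I'm reading it from section 5.1 — my own transcription, so please correct me if I've mangled it; here $z_{p_i}$ is the sub-sequence of utterance positions that the partition $p$ assigns to category $C_i$:

$$
\mathrm{re}(p, L) = \frac{1}{|C|} \sum_i \frac{H(C_i \mid z_{p_i})}{H(C_i)}
\qquad\qquad
\mathrm{RE}(L) = \min_p \ \mathrm{re}(p, L)
$$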
The conditional entropy of C[i] given the entire utterance will simply be 0, since every L* meaning is unique and every L utterance is unique, so we can ignore this one.
The conditional entropy of C[i] given no utterance at all will, I guess, simply be the entropy of C[i], which we calculated above to be 1.58496.
Then, if we ignore the denominator of H(C[i]), we have:

1/2 × (1.58496 + 0) = 0.7925
... which is what the `get_residual_entropy` method returns.
However, this is without the denominator of H(C[i]). If we add in the denominator of H(C[i]), then we have... well, to be honest, I'm not sure of the difference between H(C[i]) with respect to L (in the numerator) and H(C[i]) with respect to L* (in the denominator). I'm assuming they are the same, since L and L* have the same length and every utterance in both is unique. So this gives:
1/2 × (1.58496/1.58496 + 0) = 1/2
... which is different from what the `get_residual_entropy` method returns?
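Just to show my arithmetic end to end, here's the tiny check I'm doing by hand — plain Python, nothing from the repo; `h_ci` is the per-category entropy from above, and the two conditional entropies are the 0 and H(C[i]) cases I described:

```python
import math

h_ci = math.log2(3)          # entropy of each category, ≈ 1.58496 bits

# partition from the run above: one category gets the whole utterance,
# the other gets the empty sub-sequence
cond_entropies = [0.0,       # H(C[i] | entire z): 0, every utterance is unique
                  h_ci]      # H(C[i] | empty sub-sequence): just H(C[i])

unnormalized = sum(cond_entropies) / len(cond_entropies)
normalized = sum(h / h_ci for h in cond_entropies) / len(cond_entropies)

print(unnormalized)  # ≈ 0.7925  (what get_residual_entropy returns for me)
print(normalized)    # 0.5       (what I get from my reading of the 5.1 formula)
```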
I don't suppose... do you have a moment to go over with me what the difference is between H(C[i]) with respect to L and H(C[i]) with respect to L*? It looks like H(C[i]) with respect to L* must somehow evaluate to 1?
Also, if you have a moment, I'm really interested in knowing how the code manages to minimize over all possible partitions.
And lastly, what do I need to modify in the code in order to measure residual entropy in languages L having a vocab size greater than 2? (This is my actual goal, in fact :) so if there is a way of short-circuiting my having to understand the code and just running this, that would be great too :) )
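In case it helps the discussion, here's the brute-force version I put together while trying to check my understanding: it enumerates every assignment of utterance positions to categories, computes the empirical conditional entropies, and takes the minimum. It's entirely my own sketch (the function names and the assumption that `L_star` and `L` are parallel lists of tuples are mine, and it assumes every category has non-zero entropy), so it may well not match what the repo's implementation is doing, but it doesn't care about vocab size:

```python
import itertools
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Base-2 entropy of a Counter of outcomes."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(pairs):
    """H(X | Y) estimated from a list of (x, y) samples."""
    total = len(pairs)
    by_y = defaultdict(Counter)
    for x, y in pairs:
        by_y[y][x] += 1
    return sum((sum(cnt.values()) / total) * entropy(cnt) for cnt in by_y.values())

def brute_force_residual_entropy(L_star, L, num_categories):
    """L_star: list of meaning tuples (one value per category).
    L: parallel list of utterance tuples (any vocab size).
    Tries every assignment of utterance positions to categories."""
    num_positions = len(L[0])
    h_c = [entropy(Counter(m[i] for m in L_star)) for i in range(num_categories)]
    best = float("inf")
    best_assignment = None
    # assignment[j] = which category position j is assigned to
    for assignment in itertools.product(range(num_categories), repeat=num_positions):
        score = 0.0
        for i in range(num_categories):
            positions = [j for j in range(num_positions) if assignment[j] == i]
            pairs = [(m[i], tuple(z[j] for j in positions)) for m, z in zip(L_star, L)]
            score += conditional_entropy(pairs) / h_c[i]
        score /= num_categories
        if score < best:
            best, best_assignment = score, assignment
    return best, best_assignment
```

The enumeration is `num_categories ** num_positions` assignments, which is tiny for 6-position utterances but obviously won't scale to long messages; I'm guessing that's where a greedy approach would come in.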