First time ever posting an issue, so apologies if I've written something incorrectly or am missing something obvious.
On lines 82-88 of `transformer-xl/pytorch/eval.py`, perplexity is computed by accumulating a total loss and a total segment length:
```python
mems = tuple()
for idx, (data, target, seq_len) in enumerate(eval_iter):
    ret = model(data, target, *mems)
    loss, mems = ret[0], ret[1:]
    loss = loss.mean()
    total_loss += seq_len * loss.item()
    total_len += seq_len
```
Rather than adding `loss.sum()` to the total loss, the implementation multiplies the mean loss by `seq_len`. However, there should be only `seq_len - 1` losses in the output of the model: in language modeling you predict each token from the previous tokens, so no loss is computed for the very first token, as the short sketch below illustrates.
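To make the off-by-one concrete, here's a minimal PyTorch sketch (a toy illustration with random tensors, not the Transformer-XL model): a standalone sequence of `T` tokens produces only `T - 1` cross-entropy terms, because the first token has no preceding context to predict it from.

```python
import torch
import torch.nn.functional as F

T, vocab = 8, 50
tokens = torch.randint(vocab, (T,))   # a toy token sequence
logits = torch.randn(T - 1, vocab)    # predictions for positions 1..T-1

# One cross-entropy term per predicted position: the targets are the
# tokens shifted by one, so the first token is never a target.
losses = F.cross_entropy(logits, tokens[1:], reduction='none')
assert losses.numel() == T - 1        # seq_len - 1 terms, not seq_len
```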
(Compare this against the TF implementation in `transformer-xl/tf/train_gpu.py`:
```python
if len(tower_losses) > 1:
    loss = tf.add_n(tower_losses) / len(tower_losses)
else:
    loss = tower_losses[0]
```
This issue is avoided here because all the losses are appended to a list, `tower_losses`, and then summed and divided by the length of that list.)
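To see what the `seq_len` weighting does to the average, here's a quick sketch with made-up per-token losses (the names and numbers are hypothetical, not from the repo): weighting each segment's mean by one more term than it actually produced skews the result away from the true mean.

```python
from statistics import mean

# Per-token losses for two toy segments; per the argument above, a
# segment of seq_len tokens yields seq_len - 1 loss terms, so these
# stand for segments with seq_len = 4 and seq_len = 3.
segments = [[2.0, 1.0, 3.0], [0.5, 1.5]]

# Correct accumulation: count the loss terms that actually exist.
true_mean = sum(sum(s) for s in segments) / sum(len(s) for s in segments)

# eval.py-style accumulation: weight each segment mean by seq_len
# (len(s) + 1 here), pretending one extra loss term per segment.
biased = sum((len(s) + 1) * mean(s) for s in segments)
biased_mean = biased / sum(len(s) + 1 for s in segments)

print(true_mean, biased_mean)   # 1.6 vs. ~1.571 -- close but not equal
```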
This is subtle because the perplexity value will still look plausible, but the computation is effectively pretending to include one extra loss term per segment. I think this is the correct implementation:
```python
mems = tuple()
for idx, (data, target, seq_len) in enumerate(eval_iter):
    ret = model(data, target, *mems)
    loss, mems = ret[0], ret[1:]
    loss = loss.mean()
    total_loss += (seq_len - 1) * loss.item()
    total_len += seq_len - 1
```
Or:
```python
mems = tuple()
for idx, (data, target, seq_len) in enumerate(eval_iter):
    ret = model(data, target, *mems)
    loss, mems = ret[0], ret[1:]
    total_loss += loss.sum().item()
    total_len += seq_len - 1
```
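For what it's worth, the two variants should agree whenever `loss` really does hold `seq_len - 1` per-token terms, since `(seq_len - 1) * loss.mean() == loss.sum()` in that case; the final number reported is then (if I'm reading eval.py's logging right) `math.exp(total_loss / total_len)`. A quick sanity check:

```python
import math
import torch

seq_len = 6
loss = torch.rand(seq_len - 1)   # stand-in for the model's per-token losses

# The two proposed accumulations coincide when loss has seq_len - 1 terms.
assert torch.isclose((seq_len - 1) * loss.mean(), loss.sum())

# Perplexity as eval.py reports it (to my reading of the logging code).
total_loss, total_len = loss.sum().item(), seq_len - 1
ppl = math.exp(total_loss / total_len)
```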
Is this a bug? Am I missing something? Thanks!