You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Doge is an architecture that combines the advantages of state-space and self-attention. It solves the problem of self-attention getting lost in long sequences by computing dynamic mask from cached value states using zeroth-order holding. It can also use wsd_scheduler on top of dense weight checkpoints to additionally train a sparsely activated feedforward network expansion layer.
Model description
Doge is an architecture that combines the advantages of state-space and self-attention. It solves the problem of self-attention getting lost in long sequences by computing dynamic mask from cached value states using zeroth-order holding. It can also use
wsd_scheduler
on top of dense weight checkpoints to additionally train a sparsely activated feedforward network expansion layer.paper: https://arxiv.org/abs/2412.11834
Open source status
Provide useful links for the implementation
Repository: https://github.com/LoserCheems/WonderfulMatrices
Weights: https://huggingface.co/collections/JingzeShi/doge-slm-677fd879f8c4fd0f43e05458
The text was updated successfully, but these errors were encountered: