Add THL-150 model architecture implementation #36407

Open · wants to merge 5 commits into main
Conversation


@ErebusTN commented Feb 25, 2025

Description

  • Implement core THL-150 architecture with sliding window attention
  • Add configuration with RoPE embedding support
  • Include fast/slow tokenizers using BPE
  • Implement all model heads (CausalLM, Classification, QA, TokenClass)
  • Add dynamic RoPE scaling and GQA support
  • Validate attention mask generation for long sequences
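As context for the attention-mask bullet above, here is a minimal sketch of sliding-window causal masking, assuming a Mistral-style window in which each position attends to at most the previous `window` positions. The PR's actual mask code is not reproduced on this page, so the function name and shapes are illustrative:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query position i may attend to key position j,
    i.e. causal (j <= i) and within the window (i - j < window)."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]               # j <= i
    in_window = (idx[:, None] - idx[None, :]) < window  # i - j < window
    return causal & in_window

# With window=4, position 5 attends to positions 2..5 only.
mask = sliding_window_causal_mask(seq_len=8, window=4)
```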

Motivation

This implementation enables:

  • Efficient 32k context window processing
  • Flexible attention mechanisms (sliding window + full attention hybrid)
  • Compatibility with HF Transformers pipelines
  • Modern architecture features like Grouped Query Attention

Context

  • Built for long-context NLP tasks
  • Implements an architecture similar to LLaMA, with sliding-window extensions
  • Designed for easy integration with existing HF ecosystems

Dependencies

  • Requires PyTorch >= 2.0
  • Recommends flash-attn >= 2.3 for optimal performance
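Because flash-attn is optional, loading code can select the attention backend at runtime. A minimal sketch using the transformers `attn_implementation` argument; the repo id is a placeholder, since no THL-150 checkpoint has been published:

```python
import importlib.util

from transformers import AutoModelForCausalLM

# flash-attn is an optional dependency; fall back to PyTorch SDPA if absent.
attn_implementation = (
    "flash_attention_2"
    if importlib.util.find_spec("flash_attn") is not None
    else "sdpa"
)

# "some-org/thl-150" is a placeholder repo id -- no checkpoint exists yet.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/thl-150",
    attn_implementation=attn_implementation,
)
```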

Checklist

  • Updated documentation in docstrings
  • Verified config validation
  • Tests will be added in a subsequent PR (the current focus is the core implementation)

Notes

  • Implements EGen License v0.1
  • Initial release focused on base architecture
  • Special thanks to the reviewer for consultation

Authored-by: @ErebusTN [[email protected]]

ErebusTN and others added 5 commits February 25, 2025 23:46
Core Components:
- Added THL150Config with sliding window attention parameters
- Implemented THL150Model with RoPE embeddings and GQA support
- Created slow/fast tokenizers with BPE preprocessing
- Included all model heads:
  * THL150ForCausalLM
  * THL150ForSequenceClassification
  * THL150ForTokenClassification
  * THL150ForQuestionAnswering
- Added configuration validation and attention mask handling

Key Features:
- 32k context window support
- Sliding window attention implementation
- Dynamic RoPE scaling
- Multi-query attention compatibility
- HF Transformers integration ready

Code Structure:
src/transformers/models/thl_150/
├── __init__.py
├── configuration_thl_150.py
├── modeling_thl_150.py
├── tokenization_thl_150.py
└── tokenization_thl_150_fast.py

Fixes Included:
- Implemented missing loss functions in model heads
- Fixed RoPE initialization fallback
- Added proper BOS token handling
- Resolved attention mask generation issues
- Validated configuration imports

License: EGen License v0.1
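Since the diff itself is not reproduced on this page, here is a hypothetical sketch of what a configuration like THL150Config might expose, with field names borrowed from Llama/Mistral-style configs; all names and defaults are assumptions, not the PR's actual values:

```python
from transformers import PretrainedConfig

class THL150Config(PretrainedConfig):
    """Hypothetical sketch; the real THL150Config lives in the PR diff."""

    model_type = "thl_150"

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        num_attention_heads=32,
        num_key_value_heads=8,          # fewer KV heads than query heads -> GQA
        sliding_window=4096,            # local attention window size
        max_position_embeddings=32768,  # the 32k context window
        rope_theta=10000.0,
        rope_scaling=None,              # e.g. {"type": "dynamic", "factor": 2.0}
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.sliding_window = sliding_window
        self.max_position_embeddings = max_position_embeddings
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        super().__init__(**kwargs)
```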
@Rocketknight1 (Member)

Hi @ErebusTN, we usually don't add architectures to the library until there's an existing pretrained model. Is there a relevant model repo or paper anywhere?

@ErebusTN (Author) commented Feb 26, 2025

Hi @Rocketknight1, I'm working on a model, EGen V1; it's my PhD project and I'm going to update it frequently. It's just in its first steps as v0.1 and I still have a lot of work to do to improve it, so I haven't uploaded it yet. Stay tuned: the next update will be the first release, and I will upload the model as open source.

@Rocketknight1 (Member)

Cool! One thing we'd advise is that models can be uploaded as custom code using the steps here. This will let you share the model immediately, and it'll work exactly the same as a library model (except that users will need to set trust_remote_code=True).

This can be a lot faster than actually getting a PR into transformers, and it's a good way to validate the model and get users, which will help speed up the PR later!
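For anyone following along, loading a custom-code model from the Hub looks like this; the repo id is a placeholder until EGen V1 is actually uploaded:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True opts in to executing the modeling code shipped
# inside the Hub repo. "ErebusTN/EGen-V1" is a placeholder id -- the
# checkpoint has not been published yet.
repo_id = "ErebusTN/EGen-V1"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```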
