-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
model: Support Janus-pro #3203
base: main
Are you sure you want to change the base?
model: Support Janus-pro #3203
Conversation
Coooooool! |
b6542eb
to
a29f545
Compare
Forgive my noobishness, but how to test the PR?
And it aborts with the error below:
|
Could you make sure the local repo has been updated, and python has chosen the correct |
The issue was caused by a missing dependency. Managed to get it working after running "pip install addict". |
Any plans to add text to image? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for late review, I take a basic review through it and leave some comments here, I'll take a closer look these days.
8446258
to
a8b0764
Compare
0ee809f
to
85d8d81
Compare
janus-pro支持多模态理解和多模态生成,大佬,当前这套代码是仅支持多模态理解吧? |
9e8aecd
to
a1226ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave some questions and comments, review is continued...
@@ -134,7 +152,7 @@ def calculate_max_num_frames() -> int: | |||
ret = (max_req_input_len - len(input_ids)) // self.NUM_TOKEN_PER_FRAME | |||
return min(ret, 100) | |||
|
|||
MAX_NUM_FRAMES = calculate_max_num_frames() | |||
MAX_NUM_FRAMES = 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we change max_num_frames
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because I find it hard to estimate the final number of tokens expanded from an image, it's implementation-specific. As a result, a hard limit is set here. Maybe I should slightly loose it a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible way to address it, is to introduce something like max_sampled_frames
into SamplingParams
. But I haven't seen anything like this in existing popular frameworks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is hard to estimate, I think we can remove calculate_max_num_frames()
, and we need tune MAX_NUM_FRAMES
to a value which don't cause OOM in most case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but how much memory should we assume we have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can set the number to server_args or env variable (default 30 is ok), once oom is raised, we can leave an entrance for user to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine to me, will do it in a separate pr
Could we paste mmmu benchmark result here? |
hi @BaiStone2017, I think this commit only support multimodal understanding |
mmmu result:
|
It seems to change too many files. There are somes refractor code and some of them will change the structure under the folder of
|
Motivation
Support deepseek-ai/Janus-Pro models, #3195
Modifications
Checklist