talkinghead changes TODO list #206
Yup, wxPython is not needed by `app.py`.
This got rid of the crash on exit, too. It was likely because the `wx.App` was never cleaned up properly.

There's still a lot to fix. It's still eating a lot of resources, but at least it's doing something useful with them. We'll need better idle animations, too, now that the framerate is improved. PR upcoming eventually. In the meantime, preview here: https://github.com/Technologicat/SillyTavern-Extras/tree/appfixes
@Cohee1207: One specific question: you mentioned that you find the produced animation not particularly pleasant to look at.

The framerate is now at least somewhat fixed, and I might have some ideas to improve the idle animations, but the quality of the AI interpolation is what it is as long as we're running on the THA3 models. Hardware is also what it is, so I think this is the right model size right now.

But judging by the pictures in the tech reports, THA3 should be capable of rather impressive quality, given the right input. With regard to this part, how to produce a suitable input easily with SD/GIMP is the open question.
I don't find the motion produced to look pleasant or enjoyable to the degree of just a good static image. Maybe I'm biased or influenced by the publicity of some well-known "AI animations" that have a similar vibe to them (for reference).
Thanks for your input! Personally, I don't see anything wrong with the animation in the link you provided, except that in animation, my own preference is 2D. Used to be an anime fan back in the day. 3D CG has always looked wrong to me. I suppose it's a matter of taste.

The specific question was because at the higher framerate, the current idle animations of `talkinghead` leave something to be desired.
As of 4a25a1e, the remnants of the IFacialMocap stuff are gone from the code. Animation logic rewritten, for great justice.

Next, to clean up the repo. And to figure out what I borked in my local git; it's telling me that...
Repo cleaned up. The plugin is now ~400 lines, and the code that remains looks much cleaner. :)
PR posted, see #207. EDIT: Planned next:

TODO:
Implemented a framerate limiter in talkinghead-next@Technologicat. This branch also includes a new command-line flag for `server.py`. Result:
The available render FPS measures how fast the animator can run on the current hardware. The rate-limited network FPS measures the actual time between network sends, after applying the framerate limiter.

The code limits the network FPS to a hard-coded 25 (0.04 seconds wait per frame). Due to the simplistic way that I currently calculate the wait time, this sets an upper bound that's never reached exactly. I tried more sophisticated ways to calculate the wait time, but they turned out brittle, and didn't improve the result. So I think the simplistic version is the best - ~24 FPS is just fine.

We now render only as many frames as the client consumes, so as long as the render FPS > network FPS, this will save GPU compute resources compared to the previous versions.

Next up: Now that we always run near 24 FPS given enough GPU compute, I'll leave the timestep implementation as-is, unless there is interest in supporting lower-spec hardware that can't reach that 24 FPS. So the next step is improving the idle animations.

EDIT: But there's also CPU mode, which runs at ~2 FPS on an i7-12700H, but may be useful for testing. Fixed the framerate limiter to work correctly also when render FPS < network FPS (in that case, the latest rendered frame is re-sent until a new one becomes available). But the animation logic needs to account for this.
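For illustration, here is roughly what the "simplistic" limiter means (a hypothetical sketch, not the actual code in `app.py`): the wait is computed after the render, so loop overhead is never compensated for, and the real FPS always lands slightly below the target.

```python
import time

TARGET_NETWORK_FPS = 25           # hard-coded target; 0.04 s budget per frame
FRAME_BUDGET = 1.0 / TARGET_NETWORK_FPS

def limited_frames(render_frame):
    """Yield frames, sleeping so consecutive yields are at least FRAME_BUDGET apart."""
    while True:
        t0 = time.monotonic()
        frame = render_frame()            # however long rendering takes...
        elapsed = time.monotonic() - t0
        # ...wait out the remainder of the budget; max() covers the case where
        # rendering alone already exceeded the budget.
        time.sleep(max(0.0, FRAME_BUDGET - elapsed))
        yield frame
```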
I already have a plan for how to improve the sway animation. Stay tuned...
Um... turns out that while testing, I had accidentally underclocked my GPU to 1100 MHz (from 1700 MHz). I mean, I do that on purpose to reduce fan noise, but the underclock wasn't supposed to be active during the performance test. I meant to run at factory settings, to give a better idea of performance on a stock RTX 3070 Ti mobile GPU chip (for reference, 125W TDP; there are various laptop brands/models with the same GPU, but different GPU TDP). So, rerunning the test at full clock rate. Result:
Compare with the GPU underclocked:
So it turns out this thing can render at 60 FPS when the GPU runs at its full clock rate. I can only speculate how fast it would run (and how much power it would draw) on a desktop GPU.
One more TODO that has not been mentioned here yet:
Also, to investigate:
It's still experimental, but here, a small xmas present to the open source community.
Yes, she's translucent:

The scanlines and noise are dynamic, and the bloom (fake HDR) imitates the look of early 2000s anime. GPU-powered, as usual. This consists essentially of a few small fragment shaders written in Torch. :P Man, I love open source.

TODO: clean up the code, make this configurable, and see if we can improve performance (at full GPU power, 48 FPS render, 18 FPS network send; underclocked, 39 FPS render, 18 FPS network send).
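To give an idea of what "fragment shaders written in Torch" means in practice, here is a purely illustrative sketch of a scanlines-plus-noise-plus-translucency pass over an RGBA tensor (parameter names made up, not the actual filter code):

```python
import torch

def scanlines_translucency(image: torch.Tensor,
                           alpha: float = 0.7,
                           line_strength: float = 0.25,
                           noise_strength: float = 0.05) -> torch.Tensor:
    """image: float tensor of shape [4, H, W], RGBA, values in [0, 1]."""
    out = image.clone()
    c, h, w = out.shape
    # Darken every other row to fake CRT scanlines.
    out[:3, 1::2, :] *= (1.0 - line_strength)
    # Add a little dynamic noise to the color channels.
    out[:3] = (out[:3] + noise_strength * torch.rand(3, h, w, device=out.device)).clamp(0.0, 1.0)
    # Make the whole character translucent by scaling the alpha channel.
    out[3] *= alpha
    return out
```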
Thanks, that looks cool. But I'm not sure many people frequently monitor these issues. To get a wider audience for this, do a post on resources like Reddit.
It would be great to have it all in one, talkinghead + Live2D, or at least to call some features from Live2D to make it more compatible.
@Cohee1207: Good point. OTOH, there's still much coding to do before this is ready for prime time. For current thoughts, see the new TODO. Also, I feel I'm not the right person to spend much time engaging with an audience. I could post my thoughts on a devblog or something, but regularly responding to reader comments is probably too much.

@Katehuuh: Compatibility is a nice long-term goal, but it's not in the immediate future. Frankly, the first I heard of Live2D specifically was when I happened to run into ...so I don't really have a handle on what features anyone but me expects. :)

@ everyone reading this: Personally, I have an artistic and technical vision as to where I want to take this. I'm doing this for two reasons: 1) Cover my own use case of making an AI assistant character feel less "cold" to interact with, and 2) Give back to the SillyTavern community by contributing potentially useful changes.

I think ST is, at this time, the most comprehensive platform for my use case. It has a vector store that can ingest PDFs (for research use), its hardware requirements are tolerable for a laptop user, and it has a unique focus on making the AI into a character to interact with (with all the features that work toward that goal). As for the THA3 posing engine, the fact that it works in anime style, specifically, is a major bonus for me.

New dev branch, talkinghead-next2@Technologicat. Rebased. Changelog after PR #209:
As of Technologicat@162d27e, I think that's enough overdoing the postprocessing filters for now. Next up, refactoring the postprocessor, which currently takes up fully one half of `app.py`. EDIT: And as of Technologicat@3e0ac73, the postprocessor now lives in its own module.

@Cohee1207: I'll soon need to expose some configuration options for `talkinghead`. This needs a very small amount of string/bool/int/float options that could be stored as JSON. In the long term, per-character settings storage would be preferable. Also, I really want the settings to be modifiable live, to allow interactive experimentation with the live character's look and feel.

So a question: What is the preferred way to do this? For example, should I also modify the main SillyTavern code, adding a new configuration panel next to Character Expressions in the client, make the server save the settings under the character's folder, and send them to the backend?

I can handle the extras side easily, but I haven't yet looked at the main ST code. For JavaScript, I'll need some code examples to get going, but I suppose I can get those by looking at the code of the existing config panels and at how the system currently interacts with `talkinghead`.

EDIT: This is essentially what I know about JS (I wrote that in early 2020, when working on a full-stack project with a Python backend).
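As a concrete, purely hypothetical example of the kind of small settings blob meant above (a handful of strings/bools/ints/floats; the keys here are invented for illustration, not the actual schema):

```json
{
  "target_fps": 25,
  "postprocessor_chain": [
    {"filter": "bloom", "strength": 0.3},
    {"filter": "scanlines", "dynamic": true},
    {"filter": "translucency", "alpha": 0.9}
  ]
}
```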
Update: As of Technologicat@7876ecb, frame timing is good now. Also, PNG is fine as transport if we drop to the fastest compression setting.

The system now uses three threads. Regardless of the global interpreter lock, in my tests this improves throughput. In general, while one frame is being encoded and sent, the next one can already be rendering.

- Only at most as many frames are rendered as are actually sent.
- Each new frame is encoded only once.
- The network output is isolated from any hiccups in render and/or encode. If a new frame is not available, it re-sends the latest available one.

Example on the RTX 3070 Ti mobile, underclocked to 1100 MHz to reduce fan noise. This is with some postproc filters enabled (specifically: bloom, chromatic aberration, vignetting, translucency, alpha noise, banding, scanlines):
In this example, although a render+encode combo would take ~48ms if run serially, it actually completes in ~34ms, as is seen from the ~6ms spent in "send sync wait". This means that the encoder has an encoded frame ready, but is waiting for the previous encoded frame to be consumed (sent over the network) before updating its output. At that time, the render for the next frame is already in progress; it starts in parallel as soon as the encoder starts encoding the current one. The three-part division of responsibilities also makes it obvious which part is the slow one in CPU mode:
So yeah, now that the plugin has been optimized, the bottleneck is the inference of the deep learning model. This can't be easily optimized further, so I only recommend live mode on GPU. (I haven't looked into why the encoder is slower in CPU mode - maybe the renderer is competing for the same resources. Doesn't matter in the grand scheme of things, though. In GPU mode, the encoder runs fine, and in CPU mode, the encoder is not the bottleneck.)

Before PRing this in, I'd like to add the client-side configurability (because we have a postprocessor now and it doesn't make sense to have it always on), but I'll actually start by fixing some bugs. For details, see the TODO.

@Cohee1207: I suppose I'll just modify the SillyTavern client, too, and send PRs simultaneously to both repos?
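For readers who want the gist of the three-thread arrangement described above, here is a simplified, hypothetical sketch (names and structure invented; the real code lives in `app.py`): one thread renders, one encodes, one sends on a fixed network interval, re-sending the latest encoded frame if nothing newer exists yet.

```python
import threading
import time

class FramePipeline:
    """Sketch: render -> encode -> send, each in its own thread."""
    def __init__(self, render, encode, send, network_fps=25):
        self.render, self.encode, self.send = render, encode, send
        self.network_interval = 1.0 / network_fps
        self.rendered = None      # handed from renderer to encoder
        self.encoded = None       # latest encoded frame, kept for re-sending
        self.lock = threading.Lock()

    def render_loop(self):
        while True:
            with self.lock:
                busy = self.rendered is not None
            if busy:                      # encoder hasn't picked up the last frame yet
                time.sleep(0.001)
                continue
            frame = self.render()         # slow part: deep-learning inference
            with self.lock:
                self.rendered = frame

    def encode_loop(self):
        while True:
            with self.lock:
                frame, self.rendered = self.rendered, None   # consume, unblocking the renderer
            if frame is None:
                time.sleep(0.001)
                continue
            blob = self.encode(frame)     # e.g. PNG encoding; each frame encoded only once
            with self.lock:
                self.encoded = blob

    def send_loop(self):
        while True:
            t0 = time.monotonic()
            with self.lock:
                blob = self.encoded
            if blob is not None:
                self.send(blob)           # re-sends the same blob if nothing newer exists
            time.sleep(max(0.0, self.network_interval - (time.monotonic() - t0)))

    def start(self):
        for loop in (self.render_loop, self.encode_loop, self.send_loop):
            threading.Thread(target=loop, daemon=True).start()
```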
EDIT: Itemized list in the TODO. Current link.

Now that the talking animation is actually working (see the PR auto-linked above), I think I'll look at the backend next. Right now it's randomizing the mouth every frame, which at the target 25 FPS looks too fast. Early 2000s anime used ~12 FPS as the fastest actual frame rate of new cels (notwithstanding camera panning effects and similar), which might look better. Also, the mouth should probably be set to its final position as specified by the current emotion as soon as the talking animation ends. I don't recall off the top of my head whether it does that now.

Then, on an unrelated note, there's still the matter of sending postprocessor configurations from the client all the way to `talkinghead`. I'm thinking that at least initially, I'll just let the user provide some JSON files (per character, in the character's folder).

Also, we could implement per-character emotion templates the same way, as JSON files in the character's folder. More stuff for the TODO.

And, we should fix the bug of the thumbnails in the ST GUI not updating when a new sprite is uploaded in the character expressions settings to replace an old one. Not an issue during normal use, but during testing the code, and during development of new characters, it would be useful to see the correct state if I quickly upload a different sprite.

Then there's some cleanliness work on the backend. The internals of `talkinghead` could use some cleaning up.
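Regarding the ~12 FPS mouth idea above, the gist is simply to hold each randomized mouth pose for a couple of render frames instead of re-rolling every frame. A hypothetical sketch (the morph names and the `pose` dict are illustrative, not the actual data model):

```python
import random
import time

TALKING_FPS = 12          # new mouth "cels" at most this often (early-2000s anime feel)
MOUTH_MORPHS = ["mouth_aaa", "mouth_iii", "mouth_uuu", "mouth_eee", "mouth_ooo"]

class TalkingAnimation:
    def __init__(self):
        self.last_change = 0.0
        self.current_morph = None

    def update(self, pose: dict) -> dict:
        """Randomize the mouth at most TALKING_FPS times per second, regardless of render FPS."""
        now = time.monotonic()
        if now - self.last_change >= 1.0 / TALKING_FPS:
            self.current_morph = random.choice(MOUTH_MORPHS)
            self.last_change = now
        for name in MOUTH_MORPHS:
            pose[name] = 1.0 if name == self.current_morph else 0.0
        return pose
```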
Heads-up: upcoming changes in the backend:
I still have some animation reliability work to do that I want to include before PR'ing this set of backend changes in. Specifically, I'll look into decoupling the animation rate from the render framerate, to make the result look better when the rendering is slow (choppy animation would be better than a slow-motion crawl).
As of Technologicat@11ff18e, animation speed decoupled from render FPS.

EDIT: This and the talking animation changes are now posted in the following PRs: SillyTavern/SillyTavern#1656 (frontend), #214 (backend).

EDIT: Both merged as of 9 January 2024.

Remaining TODOs for the near future:
After these, I'll likely declare `talkinghead` good enough for now.
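On the "decoupled from render FPS" point above: the usual trick is to scale each animation step by the actual elapsed wall-clock time instead of assuming a fixed per-frame step, so slow rendering becomes choppy motion rather than slow motion. A hypothetical sketch (not the actual implementation):

```python
import time

REFERENCE_FPS = 25.0   # the frame rate the animation parameters were tuned for

class DeltaTimeScaler:
    """Scale per-frame animation steps by the actual elapsed time between frames."""
    def __init__(self):
        self.last = time.monotonic()

    def step_scale(self) -> float:
        now = time.monotonic()
        dt = now - self.last
        self.last = now
        return dt * REFERENCE_FPS   # 1.0 at the reference FPS, >1.0 when rendering lags

# usage: new_value = old_value + per_frame_delta * scaler.step_scale()
```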
As of #216 and SillyTavern/SillyTavern#1683, per-character configurability has been implemented (no GUI, just JSON files at least for now), and how to use it is explained in the README.

EDIT: Both PRs have been merged as of 15 January 2024.
Fantastic effort. Thank you. I can't wait to see it.
Merged as of this Monday :)

After you pull the latest SillyTavern-extras from git, instructions for configuring `talkinghead` can be found in its README. Note that to enable some features, you'll need to update your SillyTavern frontend, too. These features include the postprocessor, the talking animation (while the LLM is streaming text), and the per-character configuration.

To check that the software works, you can use the example character. Just copy the example sprite into a character's folder. Also, obviously, to go beyond testing with the example character, you'll need to make a suitable input image for your own character.

On the backend side, the next upcoming dev branch is talkinghead-next5@Technologicat. No changes since Monday's merge yet, aside from some TODO updates. The up-to-date TODO list can be found here. As usual, no promises which items I'll ever get around to doing. The most likely scenario is, I'll fix some bugs, polish up the documentation to a state worthy of an actual version-numbered release, and then switch to other projects for a while. I originally intended to hack on this only briefly.
@Technologicat To enable talkinghead mode in Character Expressions, check the checkbox Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras). I have this:
Talkinghead can't be used with local classification. If you have it enabled, talkinghead is hidden and disabled.
OK, I had to uncheck Local server classification.
@biship: Yeah, it sometimes gets finicky and likes to be restarted. But not very often, so I haven't found out the cause.

To check that different expressions work, you can set each emotion manually; I think the Character Expressions control panel also has a full list of emotions.

To make the character change expression automatically based on the AI character's current reply, enable classification. Not the local one, but the one served by SillyTavern-extras. Be sure to enable the `classify` module. For example, my SillyTavern-extras config is:
Pruning the unrelated stuff, this should work:
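(The original command is not preserved in this thread. As an illustration only, a minimal SillyTavern-extras invocation along those lines might look like the following; the flags are assumptions, so double-check against `python server.py --help` in your installation:)

```bash
python server.py --enable-modules=classify,talkinghead --talkinghead-gpu
```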
As for positioning the sprite on the screen, the position is currently static. Due to the base pose used by the posing engine THA3, the character's legs are always cut off at the bottom of the image, so the sprite needs to be placed at the bottom.

By the way, thanks for testing - this interaction is great for debugging what I've forgotten to include in the docs. :)
@Cohee1207: Come to think of it, is there a reason for that, other than that previous versions of `talkinghead` worked that way?
@Cohee1207, @biship:
Ah, yes, the new instructions are more helpful, thanks. You could mention that if you enable "Moving UI" then you can move the image.
@Cohee1207: Thanks. I wasn't aware of what "Moving UI" did (or how to use it), at least in my installation.

Yes, the blank background is part of the live feed itself, which is silly, but the engine is what it is. I suppose 512x512 is just a de facto standard size for AI image processing input these days. I could add a crop filter...
@Cohee1207: One more thing: technically it's in the TODO, but I know you're busy, so I'll mention it here: I aim to update the user manual for `talkinghead`.

I think we should de-emphasize AITubing/VTubing, given the different aim of the software (animating AI character avatars, not user avatars). The new README accounts for this already. I still have to combine any relevant information from the old user manual, and add some screenshots. I'll finish the README first. Once done, we can then see if it should be moved to replace the old user manual.
And just to avoid any surprises, mentioning this too: most of the postprocessing filters now have automatic FPS correction. The postprocessor also reports its average elapsed time per frame (extras log, info level). Note that as of this writing, the postproc time is also included in the reported render time. This might still change before the next release.

The VHS glitches filter is still missing FPS correction. I'll try to fix that tomorrow (or in the next few days).

EDIT: Ah, and fixed a minor race condition in the postprocessor when replacing the filter chain on the fly.
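For context, "FPS correction" here means scaling time-dependent filter parameters by the actual frame rate, so the visible per-second effect stays the same no matter how fast rendering runs. Schematically (hypothetical helper functions, not the actual filter code):

```python
REFERENCE_FPS = 25.0

def corrected_increment(per_frame_increment: float, actual_fps: float) -> float:
    """Additive per-frame change, scaled so the per-second change stays constant."""
    return per_frame_increment * (REFERENCE_FPS / actual_fps)

def corrected_decay(per_frame_decay: float, actual_fps: float) -> float:
    """Multiplicative per-frame decay factor (e.g. for afterglow), scaled the same way."""
    return per_frame_decay ** (REFERENCE_FPS / actual_fps)
```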
@Cohee1207: I quickly added a simple crop filter to the backend, now available in the dev branch. However, there seems to be some logic at the frontend side that reserves a square shape for the talkinghead sprite output, regardless of the image dimensions or aspect ratio of the actual `talkinghead` output.

This already makes the postprocessor's job lighter, since it doesn't have to handle that much empty space. I'll need to take a closer look at the frontend...

In other news, all postprocessor filters are now framerate-independent, you can have multiple filters of the same kind in the postprocessor chain, and the TODO has been updated to more clearly indicate priorities.
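The crop filter itself is conceptually just slicing the image tensor; a hypothetical sketch (parameter names may differ from the actual filter):

```python
import torch

def crop(image: torch.Tensor, left: int = 0, right: int = 0,
         top: int = 0, bottom: int = 0) -> torch.Tensor:
    """Crop the given number of pixels from each edge of a [C, H, W] image tensor."""
    c, h, w = image.shape
    return image[:, top:h - bottom, left:w - right]
```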
Added server-side animator and postprocessor settings to `talkinghead`, loaded from JSON config files. Three-level system:
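My reading of the three-level system (an assumption based on the surrounding discussion, not a quote) is: built-in factory defaults, overridden by a server-side config file, overridden by a per-character config file. In sketch form, with illustrative file names:

```python
import json
import pathlib

def load_settings(defaults: dict, *config_paths: str) -> dict:
    """Merge JSON config files over built-in defaults; later files win, missing files are skipped."""
    settings = dict(defaults)
    for path in config_paths:
        p = pathlib.Path(path)
        if p.exists():
            settings.update(json.loads(p.read_text(encoding="utf-8")))
    return settings

# e.g. load_settings(FACTORY_DEFAULTS,
#                    "talkinghead/animator.json",              # server-wide (illustrative path)
#                    "characters/mycharacter/animator.json")   # per-character (illustrative path)
```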
Today's update (up to commit 6193074):
EDIT: Backend nearing completion for now. Posted the PR, #219.
Next priority areas are some frontend fixes (not affecting extras, just the main ST), and polishing the documentation for release (which does affect extras).

EDIT: PR created for these postproc filters: #221.
Note to self: Investigate this properly (possibly much) later. THA3 has the morphs already; the remaining issue is to extract the phoneme from the TTS audio - currently I have no idea how VRM does it. But to even develop this functionality, I'd first need to get a TTS setup working. EDIT: Clarification: TTS is currently not a priority for me, and likely won't be in the near future, for various reasons.
So, while I find TTS lip-syncing an interesting technical problem and might look into it later, for practical use I don't need it right now.
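If someone else wants to experiment: the mapping side would presumably be a simple lookup from phoneme classes to THA3 mouth morphs, with the hard part being getting time-aligned phonemes out of the TTS in the first place. Purely speculative sketch (morph names and the `pose` dict are illustrative):

```python
# Map rough phoneme classes to THA3-style mouth morphs (names illustrative).
PHONEME_TO_MORPH = {
    "a": "mouth_aaa", "i": "mouth_iii", "u": "mouth_uuu",
    "e": "mouth_eee", "o": "mouth_ooo", "sil": None,   # silence: mouth closed
}
ALL_MOUTH_MORPHS = ("mouth_aaa", "mouth_iii", "mouth_uuu", "mouth_eee", "mouth_ooo")

def mouth_pose_for(phoneme: str, pose: dict) -> dict:
    """Set exactly one mouth morph according to the current phoneme."""
    for morph in ALL_MOUTH_MORPHS:
        pose[morph] = 0.0
    morph = PHONEME_TO_MORPH.get(phoneme)
    if morph is not None:
        pose[morph] = 1.0
    return pose
```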
Some small client-side Talkinghead issues fixed in the following PRs (merged as of 6 February 2024). EDIT: Specifically, the fixes are:
EDIT: This has become more of a temporary devblog and less of a TODO list.
❗ Development can move fast. Old posts are old. ❗ See the latest posts below for what is currently going on.
Talkinghead TODOs, as of the latest merged PR at any given moment, can be found in `talkinghead/TODO.md` in the main SillyTavern-extras repo.

EDIT: ❗ The rest of this post is old, preserved for archival purposes only. ❗
This is primarily for myself to keep track of what I'm doing, as well as to record any leftover ideas that are not so likely to get done.
- Strip away unnecessary stuff from `app.py`. EDIT: Done, the plugin is now a bit over 400 lines.
  - It seems the Mac-specific IFacialMocap stuff has mostly been stripped already, but the app suffers from the pieces that remain. EDIT: Removed in Technologicat@4a25a1e.
  - Thus, I'll probably make `app.py` only serve the SillyTavern-extras plugin, and remove the standalone app mode from it. There's no way to send it events in the standalone mode anyway, and the separate manual poser app already covers the other use case. EDIT: As of Technologicat@e02c1d9, `app.py` only serves the live mode.
  - Currently, it has some calls into wxPython, but I'll see if I can strip it. EDIT: Yup, stripped.
  - When `talkinghead` is enabled, SillyTavern-extras segfaults on exit every time. And doesn't when the `talkinghead` module is not loaded. I wonder if the unnecessary `wx.App` never being cleaned up properly might have something to do with that. EDIT: Yes, exactly. No more crashes.
- Emotions are loaded from disk every frame, twice. Surely this is unnecessary. Even if the read hits the OS disk cache, we could just explicitly keep the emotion presets in memory, like the manual poser now does. EDIT: This has been fixed.
- The animation logic is a mess. There are probably a few steps that can be dropped. However, this will likely improve the code clarity much more than performance. EDIT: Fixed.
- There are two sway animation functions, one "good" and the other... implicitly, bad? Investigate which is actually better, and keep only one. EDIT: They were pretty much identical; kept the one that was already in use.
- `head_y`: improving this would make the animation look nicer.
- Add an option to `server.py` to choose `float32` or `float16` for `talkinghead`. Regardless of speed, useful with low VRAM; can save ~280 MB by switching to `float16`. EDIT: Done.
- Refactor stuff shared between the manual poser and live mode into a common app-level utility module. For example, the updated emotion preset loading code currently lives in the manual poser app. EDIT: Done. Moved to a new `talkinghead/tha3/app/util.py`, along with other code shared between the apps.
- Investigate possibilities for improving the speed. EDIT: Overhauling `app.py` did it! Now the live mode runs at ≈30 FPS on an RTX 3070 Ti mobile. Technologicat@e02c1d9
  - The live mode currently runs at ≈10 FPS.
  - Watching the GPU (`nvtop`) suggests that inference isn't the largest bottleneck. (It may be the second largest, though.)
  - Another thing to look at: `ffmpeg`. (More observations may follow later as investigation proceeds.) EDIT: There were no other factors slowing it down.
- Improve live mode simulation timestep handling.
- For future-proofing and saving GPU compute, add a configurable framerate limiter. Could be given as a command-line option to `server.py`. EDIT: Framerate limiter added, but currently fixed to ~24 FPS.
- Write new README/manual for `talkinghead`. Cover also the emotion presets in `talkinghead/emotions/`.
- On a 4k display, the character becomes rather small, which looks jarring on the default backgrounds. This needs a fast, but high-quality scaling mechanism.
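(On the scaling item: one obvious candidate would be GPU-side bicubic resampling before sending, e.g. via `torch.nn.functional.interpolate`. Illustrative sketch only; this is not what the code currently does:)

```python
import torch.nn.functional as F

def upscale(image, factor: float = 2.0):
    """image: [C, H, W] float tensor in [0, 1]; returns a bicubically upscaled copy."""
    return F.interpolate(image.unsqueeze(0), scale_factor=factor,
                         mode="bicubic", align_corners=False).squeeze(0).clamp(0.0, 1.0)
```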
Notes: