Inevitable Collapse: Why Character Consistency Isn't Maintained - It's Never Possible That Way


Character consistency. It has been about a year since this channel was renamed, and now I am finally putting out something I have never written about in this much detail.

First, let me start from here.

I have never really spoken about this before, because I kept missing both the right timing and the right place to talk about it.

For that reason, I will probably never write about it in this much detail anywhere else.

I have touched on it lightly on YouTube, though.

About the structure

Without exception, AI video generation is a frame-by-frame shooting method.

It creates motion by shifting images little by little. Well, it is the same principle as film.

So why is it that characters are maintained to some extent at all? That is the clue.

Even when motion is added, the model remembers at least something like the immediately preceding image. In other words, it keeps drawing a character with the same Seed value.

At that point, what it is referring to is that memory and the prompt you wrote.

As for the environment, everything is cloud-based, and depending on the tool, some services can handle millions of simultaneous connections.

In reality, even for a video of only 5 seconds, depending on the motion, thousands of images may be drawn to create that movement.

So you can imagine that it is unrealistic to assume that all of those images, for every user, are fully saved.
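To put a rough number on that (illustrative figures, not data from any specific tool): a 5-second clip at 24 fps is 120 output frames, and if each frame goes through a few dozen denoising steps, the intermediates alone run into the thousands.

```python
# Back-of-the-envelope estimate of how many images one short clip implies.
# All three numbers are illustrative assumptions, not figures from a real tool.
fps = 24               # assumed output frame rate
seconds = 5            # clip length from the example above
steps_per_frame = 30   # assumed denoising steps per frame

output_frames = fps * seconds                          # 120 final frames
intermediate_images = output_frames * steps_per_frame  # 3,600 intermediates

print(f"{output_frames} output frames, ~{intermediate_images} intermediate images")
```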

About characters

Even so, some tools allow you to specify the Seed value.

This is already quite generous, but at best it only applies to a single image, or a few images. For video generation, that is still just a drop in the bucket.

And if a tool does not even have that function, then functionally speaking, it means it cannot preserve the Seed value at all.

That said, in order to create continuity, some of that data is still preserved behind the scenes in ways the user does not notice.

What character consistency is

The answer to character consistency is to prepare a large number of character images with the same Seed value, and then use those for video generation.

And by skillfully inserting those images at points such as the start and end frames, you keep the Seed value fixed.

Character consistency and its method

First, let us understand the structure by which a Seed is chosen.

AI looks at the prompt and then searches for a Seed value.

And the timings at which it refers back to the prompt are fixed.

There are three points:

  • the very beginning
  • when the session is cut
  • when the character is lost

Yes, the answer is simple.

If you insert images with the same Seed value at those three timings, such as in the initial image, then character consistency will be maintained.

If you do not insert images with the same Seed value, then the character will be drawn only from your prompt.

In other words, things like “he” or “Japanese man.” lol

That is why the result looks vaguely similar, yet ends up as a clearly different character.

In theory, if you keep supplying images with the same Seed value, then naturally the character with that same Seed value will continue to be selected forever.

And although I said “in theory,” our channel has been proving this through hundreds of videos.
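As a minimal sketch of that loop (generate_clip and its start_image/end_image parameters are hypothetical names here, since every tool exposes this interface differently), the structure looks like this:

```python
# Minimal sketch of the same-Seed keyframe loop described above.
# generate_clip() is a hypothetical stand-in for whatever start/end-frame
# interface your video generation tool actually exposes.
from typing import List

def generate_clip(prompt: str, start_image: str, end_image: str) -> str:
    """Placeholder: call your video tool here; returns a path to the clip."""
    return f"clip({start_image} -> {end_image})"

def generate_long_video(prompt: str, seed_images: List[str], clips: int) -> List[str]:
    """Chain short clips, re-inserting same-Seed images at every boundary
    so the character is never redrawn from the prompt alone."""
    outputs = []
    for i in range(clips):
        start = seed_images[i % len(seed_images)]      # pin the character at the start
        end = seed_images[(i + 1) % len(seed_images)]  # ...and at the end of the clip
        outputs.append(generate_clip(prompt, start_image=start, end_image=end))
    return outputs
```

The point of the structure is that every clip boundary, including the very first frame, is covered by one of the prepared same-Seed images, which is exactly the "keep supplying images" idea above.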

So then, why does the character still sometimes collapse?

That happens overwhelmingly when the number of image samples is too small.

Originally, to represent a character and give that character motion, a large number of images is necessary.

Many people have probably had the experience of suddenly noticing how beautiful the side profile of a classmate they like is. It is like that.

In order to represent that person, you need to recognize them from many different angles.

The key point is whether, in every image you prepare, the character being drawn is one you yourself can accept.

Of course, this varies from person to person, and some people may not need that much.

So I cannot say exactly how many images are required, but I think you understand that many are needed.

For reference, the videos on our channel vary because I experiment a lot, but for the ones I later look back on and think, "yes, this one turned out really well," a 3-minute video usually means I prepared 20 to 30 images, all with the same Seed value.

How to prepare the images

Now, here is today’s most important topic.

I believe this is the part many people are getting wrong.

A lot of people naturally prepare images for video generation AI by using image generation AI, but this is a bad move.

The Seed values produced by image generation AI are always adjusted so that the exact same thing is not output again.

You may be able to get closer with LoRA, but even then it will never be perfect.

That is right. The answer is not there.

I am sure many people are stuck at that point.

In my case, I do not use image generation AI for video generation in the first place.

It has already been about a year since I stopped doing that.

I first shoot a rough video using the same video generation AI that will be used for the final output, extract screenshots from that, and then do the real production.

At most, depending on my mood, I may use image generation AI for the very first image only.

Why?

Because that guarantees that images with the same Seed value will continue to be selected in the actual production.

And I also care about the number of patterns.

For songs, I used to capture dozens of variations of mouth openings alone.

Even so, it is possible to get around 10 images from just 5 seconds or so, so once you get used to it, there is nothing troublesome about it at all.
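For the extraction step itself, here is a minimal sketch. It assumes ffmpeg is installed; the file name rough.mp4 and the 2 fps rate are illustrative choices, not fixed settings.

```python
# Pull ~10 evenly spaced stills from a 5-second rough clip, for reuse as
# start/end frames in the real production run.
import os
import subprocess

os.makedirs("seed_frames", exist_ok=True)
subprocess.run(
    [
        "ffmpeg",
        "-i", "rough.mp4",             # rough clip shot with the same video tool
        "-vf", "fps=2",                # 2 frames per second -> ~10 stills from 5 s
        "seed_frames/frame_%02d.png",  # numbered screenshots of the same character
    ],
    check=True,
)
```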

Final Thought

Character consistency can be achieved through structural understanding and a bit of technique. There is no difficult theory or hidden mystery behind it.

Even so, there are two reasons why people still cannot do it.

One is that many people assume that companies will eventually solve it by adding new features.

That can be clearly denied.

For a service used by millions of people, even temporarily storing dozens of images per user would, from the company's point of view, turn it into a service that ignores profit entirely.

For that reason, that kind of world is not coming any time soon.

Let us just give up on that quickly.

The other reason, I think, is that someone declared LoRA to be the correct answer, and that idea spread.

We used to get frequent probing emails ourselves, asking things like, “What do you think about using LoRA?”

Of course, I assume they had already made up their minds and were trying to steer the topic toward illegality, or toward some kind of scam narrative. But who knows.

Also, I think some people are simply reluctant to use a large amount of video generation casually.

As for me, after using them all sufficiently beforehand, I now maintain annual Ultimate contracts only for Ray and Dream Machine.

There are major reasons for that, and I understand their strengths very clearly.

As for Ray, I would like to write about just how impressive it is in an article at some point.



Hide

I was born and raised in Japan. After working for 30 years in the IT industry as an engineer and manager, I became fascinated by the true potential of technology and founded "havefunwithAIch."