
Inevitable Collapse: Why I Still Experiment with Veo — And Why I Still Recommend Ray

Recently, I have started hiding the word “AI” itself inside some of my articles for certain reasons.

This one is not hidden at all.

And there is, in fact, a reason why I am still talking about Veo on YouTube at this stage.

Veo shifted toward gacha-style recovery — or rather, it always had that tendency

Veo drew major attention from the moment it appeared, and the reason was clear. Its pricing model was extremely aggressive, and at the same time, it came out with obvious gacha characteristics.

Even back then, the only truly usable plan cost around 30,000 yen. That alone was already a very strong stance.

On top of that, it had locked away ending-frame control and leaned heavily into randomness, so I dismissed it as outdated and ignored it for a long time.

At that time, the Sink Seeding Method was already established on my side. Since that method required feeding multiple images into the process, I was barely paying attention to Veo’s video quality in the first place.

The only reason I started looking at it more seriously was because I wanted to use Gemini 3 itself.

Gemini hides too much, but the abilities it actually holds are impressive.

Since Veo was available as part of that broader environment, I decided to test it properly.

So I designed prompts strictly according to the principles of the Sink Seeding Method and ran experiments through it.

The result was good.

Except, of course, once lip sync entered the equation.

In exchange for lip sync, Veo gave up part of the AI’s intelligence

“Gave up” may sound too harsh.

More precisely, the system pushed its computational discipline to the front.

It seems to have moved in the direction of treating AI-style interpretive thinking as something less useful, and instead prioritized clean functional execution.

Originally, AI makes decisions with a certain degree of fluctuation.

That fluctuation is part of its nature.

But Veo appears to have shifted toward the opposite goal: perfectly aligning mouth movement and sound, even if that means suppressing that interpretive layer.

I would not call that entirely wrong.

But my reaction is simple:

If that is the case, then any high-performance computer would serve your purpose just as well.

My own approach has always been the opposite.

I have considered the ideal path to be controlling the fluctuation, not erasing it.

That is why my impression is that this company and I are working from almost completely opposite philosophies.

I am not playing with Veo for views — I am using it as an experiment

Because the philosophy is so different, I became interested.

I wanted to challenge it.

That is the real reason I keep putting out works hidden behind things like Holo Hemini.

I imagine many users are already struggling with the fact that even a 30-second video can still trigger strong gacha behavior.

Even before that, I hear that more and more people are exhausted by creators who publish historical videos while hiding the fact that they are AI-generated.

Even so, there are still people who want to create longer scenes.

That desire itself probably comes from a very specific purpose.

So, for those people, here is the good news:

There is, in fact, a way to preserve character consistency in Veo while continuing to match voice and image.

Properly.

But first, let us start with why it fails.

Why Veo breaks

In the earlier design, 8 seconds was the strongest memory unit, and the system could extend from there up to around 1 minute.

However, even then, it could not maintain consistency for the full duration.

Now, finally, Veo has become capable of connecting image sequences more seamlessly.

But audio is still cut into chunks of around 8 seconds.

And the reason is obvious: continuously preserving sound is physically difficult.

The same is true for systems like Suno. Unless they generate in one shot, they cannot truly sustain that continuity either.

The files are simply too large.

What this means is that Veo’s memory effectively drops at the 8-second mark.

The session breaks there.

Even if the image looks seamless, the structure underneath is still fundamentally working in 8-second cuts.
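One practical consequence of those 8-second cuts is that dialogue has to be pre-split so that no line straddles a window boundary. A minimal sketch of that idea follows; the helper names and the speaking rate of roughly 2.5 words per second are my own assumptions, not anything specified by Veo:

```python
# Sketch: pre-split a script into chunks that can plausibly be
# spoken inside one 8-second generation window.
# The 2.5 words/second speaking rate is an assumption, not a Veo spec.
WINDOW_SECONDS = 8
WORDS_PER_SECOND = 2.5
MAX_WORDS = int(WINDOW_SECONDS * WORDS_PER_SECOND)  # 20 words

def fits_in_window(line: str) -> bool:
    """True if the line can plausibly be read within one window."""
    return len(line.split()) <= MAX_WORDS

def split_script(script: str) -> list[list[str]]:
    """Group consecutive lines into 8-second-safe chunks."""
    chunks: list[list[str]] = []
    current: list[str] = []
    count = 0
    for line in script.splitlines():
        words = len(line.split())
        if current and count + words > MAX_WORDS:
            chunks.append(current)
            current, count = [], 0
        current.append(line)
        count += words
    if current:
        chunks.append(current)
    return chunks
```

The point is not the exact numbers but the discipline: decide the cuts yourself before generation, instead of letting the model discover the boundary mid-sentence.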

How to make it work anyway

That means the memory must be given back continuously and concisely.

That is the foundation of the Sink Seeding Method.

And beyond that, you must force the AI to anticipate the future.

In practical terms, if you provide the upcoming scenario in advance, the AI begins to understand what it should be doing now.

For example, if a tool needs to appear later, the system will introduce it naturally before the moment arrives.

That prevents absurd failures such as:

  • a door suddenly emerging from the ground
  • a wall somehow becoming a door without explanation

Then there is audio.

Structurally speaking, Veo does appear to preserve the voice seed.

As long as the same character continues to appear, the same seeded voice can continue as well.

So the practical conclusion is this

  • Make every line of dialogue short enough to be read naturally within 8 seconds.
  • Re-feed the global overview, scene overview, shot description, and current shot position every single time.
  • Do not trust “seamless” behavior. Cover every pattern in which memory can be lost. Once the subject leaves the screen, it is over.
  • Veo does not properly understand gender.
  • Do not expect much from Japanese. However, if you use only hiragana, remove punctuation, and force it to read continuously, it can work surprisingly well.
  • Do not expect miracles. The key is how much control you are able to impose.
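The checklist above can be sketched as a prompt builder. Everything here is hypothetical scaffolding (the field names and the build_shot_prompt helper are mine, not part of any Veo interface); the point it illustrates is that every single shot request carries the full context again, including what comes next:

```python
from dataclasses import dataclass

@dataclass
class Shot:
    index: int          # position of this shot in the sequence
    description: str    # what happens in this ~8-second shot
    dialogue: str       # one short line, readable within 8 seconds
    upcoming: str       # what happens next, so the model can anticipate

def build_shot_prompt(global_overview: str, scene_overview: str,
                      shot: Shot, total_shots: int) -> str:
    """Re-feed the full context on every shot, never trusting the
    model's own memory across the 8-second boundary."""
    return "\n".join([
        f"GLOBAL OVERVIEW: {global_overview}",
        f"SCENE OVERVIEW: {scene_overview}",
        f"SHOT {shot.index} OF {total_shots}: {shot.description}",
        f"DIALOGUE (keep within 8 seconds): {shot.dialogue}",
        f"UPCOMING: {shot.upcoming}",  # lets props appear before they are needed
    ])
```

A builder like this also makes the "anticipate the future" rule mechanical: because the upcoming scenario is part of every request, a door that must open later can be established in frame well before the moment arrives.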

Even now, I still recommend Ray.

That is all.


Discussion will be added here later.

I was born and raised in Japan. After working for 30 years in the IT industry as an engineer and manager, I became fascinated by the true potential of technology and founded "havefunwithAIch."