
Introduction
My secretary always scolds me for not writing this part first.
So let me say it up front: my articles are almost entirely AI-generated, even the English. I only have them checked to see whether they read clearly to everyone.
The ideas and the words, though, are entirely my own.
And this time, once again, I'll be talking about tools. But this isn't a criticism of the tools themselves; that would go against my basic principles.
It's an article about what technically works at this point, and what doesn't. That's the kind of article it is.
Occasionally, I do watch other people’s YouTube videos.
Honestly, I’m usually too busy to watch them.
I’m not a YouTuber, but from the outside, I probably look like one.
From that perspective, I think we’re all pretty similar.
Now, about that video: it was a guy with quite a bit of charisma just talking.
And, well, it turned out to be another one of those "we're deceiving people" moments.
He said, "Be more yourself" (so far, I agree), and if you don't like that, just create a character and have AI do the talking.
Yes, there it is.
I hate these irresponsible consulting shows: the host hasn't actually done anything himself, and the ideas are completely empty, the "I just thought of this" type.
He said to record the voice with ElevenLabs, based on information he'd apparently picked up somewhere, and if that wasn't possible, to just use your own voice.
Anyone who actually tries this will run into reality.
The video seems to have come out quite a while ago, but even now, no one has solved the following problems. He doesn't seem to know that.
Placing a character within the video and achieving 100% perfect lip-syncing still requires considerable technical skill.
For example, even Veo, the current cutting edge, can't do it.
It’s a huge mistake to think everyone can do it.
Incidentally, he probably doesn't even know this, but with Veo, it's done like this:
1. First, prepare the voice data. Either ElevenLabs or Fish will do.
2. Transcribe it into text and have Veo read the script aloud.
3. Have Veo generate plenty of video of the reading. However, a certain technique is required here.
4. Finally, in editing, discard Veo's own audio and overlay the voice from step 1.
Sounds easy, right? That’s where the trap lies.
Steps 3 and 4 are the important ones. In step 3, you need to make sure the reading finishes at roughly the same pace in every 8-second segment.
Veo was originally designed to build a video by extending it in 8-second increments, up to about one minute.
Recently it has finally broken out of that shell in some respects, but the audio is still trapped inside that original design philosophy.
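The pacing constraint above can be sketched as a small sanity check. This is a hypothetical helper, not part of any tool's API: it assumes you have measured the spoken duration inside each 8-second clip and simply tests whether the readout pace is roughly uniform across segments.

```python
def pace_consistent(segment_speech_secs, tol=0.5):
    """Hypothetical check for step 3: `segment_speech_secs` holds the
    spoken duration (in seconds) inside each 8-second Veo clip.
    Returns True when every clip's reading finishes within `tol`
    seconds of the mean, i.e. the pace is roughly uniform."""
    mean = sum(segment_speech_secs) / len(segment_speech_secs)
    return all(abs(s - mean) <= tol for s in segment_speech_secs)
```

If a batch of takes fails this kind of check, the cheapest fix is to regenerate the offending segments rather than try to repair them in editing.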
You can probably already picture step 4: you stretch or shorten the video to match the audio, because Veo's readout pace is inconsistent from one 8-second segment to the next.
The important point is that I can only recommend stretching by about 5% at most.
Beyond that, the video becomes strangely slow and jarring, and gaps in the video that were previously hidden become apparent.
It's absolutely not something that can be released publicly. Well, that kind of thing is rampant, too.
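The 5% rule for step 4 can be made concrete with a short sketch. The helper names here are my own invention; the only real tool assumed is ffmpeg, whose `setpts` video filter rescales timestamps (values above 1 slow the clip down) and whose `-an` flag drops the audio track.

```python
MAX_STRETCH = 0.05  # the roughly-5% ceiling discussed above

def stretch_plan(video_sec, audio_sec):
    """Return (ratio, ok): `ratio` multiplies the video's timestamps so it
    matches the real voice track; `ok` is False when the required stretch
    exceeds the ceiling and the clip should be regenerated instead."""
    ratio = audio_sec / video_sec
    return ratio, abs(ratio - 1.0) <= MAX_STRETCH

def ffmpeg_stretch_cmd(src, dst, ratio):
    # setpts=R*PTS slows the video down when R > 1; -an discards Veo's
    # own audio so the voice track from step 1 can be overlaid later.
    return ["ffmpeg", "-i", src, "-filter:v", f"setpts={ratio:.4f}*PTS",
            "-an", dst]
```

For an 8.0-second clip paired with 8.2 seconds of voice, the ratio is 1.025, comfortably inside the ceiling; pair it with 9.0 seconds of voice and the plan correctly refuses.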
In conclusion, with the capabilities of current AI tools, there's virtually no way to perform long-form lip-syncing.
We haven't reached that point yet.
"You said 'virtually,' didn't you?" you might ask.
Yes, there is a way. That's the answer you're looking for.
I mentioned the design philosophy earlier, didn’t I?
For example, Veo's current design philosophy targets a time frame of about one minute, while Ray's is shorter, under 30 seconds.
However, Ray doesn't build audio output into its architecture, and it currently has no method for producing audio either.
But that doesn’t mean it doesn’t understand audio.
To put it in a roundabout way, it's this: the core of any of these tools has a strong understanding of both audio and music, because it has become thoroughly familiar with them through training.
And yet, unfortunately, there isn't a single tool in the world that integrates video and audio, builds its logic around that integration, and links it to the output. Simply put, Suno is music only, and Ray is video only.
Veo, which is trying to do both, is desperately stitching voice and video together after the fact.
That approach doesn't make use of the core. At least, that's how it looks to me from the results.
So what can users do?
Currently, the only way is to access the core directly and draw output from it without being dragged around by the unnecessary logic layered on top.
Is that even possible? Yes, it is.
I call it the SinkSeeding Method.
A little something to remember
Don’t be fooled by consulting shows.
They have a lot of good points, so focus on those.
For example, they speak confidently in front of the camera. Of course, the camera angle and everything else is calculated.
They have techniques to make themselves look good.
There’s a lot to learn. But there are also mistakes.
See things correctly.
Nevertheless, I see him as one of the people who stoked the current atmosphere. If you're going to use the media and put your words in front of a large audience, you should take responsibility for what those words bring about. That's what I believe.