Introducing Veo: Technical Flaws, Practical Uses, and Future Potential


Today, I’d like to introduce Veo.

Let me preface this by saying that I haven't seen a review this detailed anywhere else. Other reviews point out flaws; this one also covers root-cause analysis, practical usefulness, and future potential.

The “Failed” Video

First, please watch this video.
The story is as follows (this is the exact prompt used):

In this story, realistic, anthropomorphic dinosaurs perform comical acts.

Scene 1:
Tricera Dad, wearing a tattered, worn-out suit, is walking home with a red face and his stomach exposed.

Scene 2:
Raptor, a police officer, is worried about Tricera Dad and shines his flashlight on his face while asking a question.
Raptor: “Hey, you. Can you get home? What’s your address?”
Tricera Dad: “A dress? Hic… I’m wearing a suit!”

Scene 3:
Tricera Dad and Raptor walk side by side.
Raptor: “Are you okay…?”
Tricera Dad: “I haven’t been drinking… hiccup.”

Scene 4: Izakaya (Japanese pub)
The two are drinking together.
Raptor: “Back then, the chief…”
Tricera Dad: “Mom, give this guy more alcohol!”
Tricera Mom: “You, that’s enough! Just because you want friends…”

Let me be clear: this does not meet the quality standards of our channel. However, it is instructive, for both good and bad reasons.

The Good Points

  • The video now connects seamlessly.
  • The picture quality has improved dramatically.
  • Voices and ambient sound now stay consistent for up to 8 seconds.

Honestly, if this had come out six months ago, it would have been an absolute game-changer.

The Bad Points

  • It cannot speak Japanese properly.
  • In the izakaya scene, the police officer ignores the prompt and speaks the other character’s lines.

Analyzing the Good: The Pinnacle of Physics Simulation

The level of completion is impressive considering the method used. The quality alone surpasses what other companies offer. It's probably the best available right now in terms of simple sound and visual continuity.

On the video side, Google clearly understands that the resolution of the initial footage doesn't directly dictate the final output, and has poured massive amounts of technology and brute-force computing power into processing it. Video quality isn't just about resolution; frame processing and bitrate matter far more.
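To make the bitrate point concrete, here is a rough back-of-the-envelope sketch in Python. The resolutions, the frame rate, and the 8 Mbps figure are illustrative numbers I picked, not Veo's actual settings, and the function name bits_per_pixel is my own.

```python
def bits_per_pixel(width: int, height: int, fps: float, bitrate_bps: float) -> float:
    """Average number of encoded bits available per pixel per frame."""
    return bitrate_bps / (width * height * fps)


# Illustrative settings only (assumed, not measured from Veo output).
BITRATE = 8_000_000  # 8 Mbps

hd = bits_per_pixel(1920, 1080, 30, BITRATE)
uhd = bits_per_pixel(3840, 2160, 30, BITRATE)

print(f"1080p: {hd:.3f} bits/pixel")
print(f"4K:    {uhd:.3f} bits/pixel")
print(f"4K gets {hd / uhd:.0f}x fewer bits per pixel at the same bitrate")
```

At a fixed bitrate, four times the pixels means a quarter of the bits for each one, which is why a higher-resolution file can still look worse than a well-processed lower-resolution one.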

Analyzing the Bad: The Architectural Flaw

1. The inability to speak Japanese
This is most likely due to a lack of Japanese training data and insufficient testing before launch. It is definitely a solvable problem, and I sincerely hope Google works hard on it. In the grand scheme, it's a minor issue.

2. The inability to fix which character says which line
This, of course, cannot be fixed with prompting. I saw the same behavior in other videos: the only way to get the right speaker is to regenerate and hope for a lucky roll (gacha).

To me, the main reason seems glaringly obvious: they prioritized extending the video so that clips connect seamlessly, but they failed to include the sound (both voice and ambient) in that state-saving loop.

To put it more technically: it seems the system successfully links the end frame of the previous clip to the start frame of the next in its saved data. But did they include the voice, ambient sounds, and music in that package?

You didn’t, right? Because that’s structurally difficult. Lol.

Visual data solidifies instantly in a single frame, making it easy to link and transfer. That’s practically beginner-level stuff now.
But audio is a completely different beast. No matter how you try to track it, it’s incredibly complex.
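Here is a minimal Python sketch of what I suspect is going on. To be clear, this is my hypothesis turned into a toy model, not Veo's real architecture: the Handoff class, the next_clip function, the frame label, and the voice IDs are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Handoff:
    """Hypothetical state passed from one clip to the next."""
    last_frame: str  # visual state: a single frame is enough
    # Audio state: which voice belongs to which character (empty if not saved).
    voices: Dict[str, str] = field(default_factory=dict)


def next_clip(handoff: Handoff, scripted_speaker: str) -> dict:
    """Describe what the next clip can rely on, given the hand-off state."""
    return {
        # The picture continues from the previous clip's final frame.
        "first_frame": handoff.last_frame,
        # The speaker is only pinned if the voice mapping was carried over.
        "speaker_pinned": scripted_speaker in handoff.voices,
    }


# What I believe happens today: only the picture is carried over.
visual_only = Handoff(last_frame="izakaya_end_frame")

# What would be needed to keep the right character talking.
with_audio = Handoff(
    last_frame="izakaya_end_frame",
    voices={"Tricera Dad": "voice_a", "Raptor": "voice_b"},
)

print(next_clip(visual_only, "Tricera Dad"))
print(next_clip(with_audio, "Tricera Dad"))
```

In the first case the picture connects but nothing ties the line to Tricera Dad, so who speaks is left to chance. In the second case the mapping survives the cut, which is exactly the part I think is missing.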

Let’s say you do try to save that audio state. Where exactly are you going to store it? We aren’t talking about lightweight text data here; we’re talking about heavy, newly generated audio files. You will hit a physical limit in processing speed and storage instantly.

So, looking at the big picture: they achieved a “perfect” system, but only within a strict 8-second window. Up to that point, it’s genuinely impressive. They achieved cross-linking.

But then what?
In a word: “Let’s just leave it to the AI…”

That’s not cross-linking. That’s just hitting a wall.

Can you see the limits of this architecture?
Will you rely entirely on sheer brute force to push through, even if it means bleeding money? Or will you widen the data processing window from 8 seconds to 3 minutes and just cram it all in? If you do that, what else breaks?

I can almost hear the developers arguing about this.
Since the creator decides the length of the video, there will always be cuts, no matter how much you try to extend a single generation. Are they going to try and take that control away from us too?

To me, this just looks like they catered to the recent trend of short-form content. The result has become one of the driving factors behind the flood of "AI Slop" and the growing disappointment in the market.


How to Actually Use Veo (The Workaround)

But don’t worry, there are ways to make Veo work for you.
We know voices will glitch or change randomly at scene transitions, so a different character will likely end up speaking. You have to structure your story to accommodate these breaks.

  • Avoid group shots for the finale: Do not write a story where all the characters gather and speak in the final shot.
  • Force the context: Create scenarios where only one character can speak.
  • Kill the ambient sound: If the ambient noise shifts awkwardly between cuts, be bold—cut it completely in post-editing.
  • Use Suno for music instead: Accept that Veo's built-in music generation will be off-key. However, be aware that Suno also struggles when lip-syncing is involved. The AI prioritizes lip-sync timing over musicality, which destroys the rhythm, so the mouth and the melody will never perfectly align.
  • The Golden Rule: Complete your story within segments of 8 seconds or less.
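If you plan your script in a spreadsheet or a text file, the rules above are easy to check mechanically. Below is a small Python sketch of such a checker. The 8-second limit comes from this article; the Segment class, the function names, and the sample plan are my own assumptions, and nothing here calls Veo itself.

```python
import math
from dataclasses import dataclass
from typing import List

MAX_CLIP_SECONDS = 8  # the per-clip limit discussed in this article


@dataclass
class Segment:
    seconds: float
    speakers: List[str]  # characters who have lines in this segment


def check_segment(seg: Segment) -> List[str]:
    """Return a list of rule violations for one planned segment."""
    problems = []
    if seg.seconds > MAX_CLIP_SECONDS:
        problems.append(f"too long: {seg.seconds}s > {MAX_CLIP_SECONDS}s")
    unique = sorted(set(seg.speakers))
    if len(unique) > 1:
        problems.append("more than one speaker: " + ", ".join(unique))
    return problems


def clips_needed(total_seconds: float) -> int:
    """Minimum number of clips (and therefore cuts to plan around)."""
    return math.ceil(total_seconds / MAX_CLIP_SECONDS)


# A sample plan (hypothetical) based on the dinosaur story.
plan = [
    Segment(6, ["Raptor"]),
    Segment(8, ["Tricera Dad"]),
    Segment(10, ["Raptor", "Tricera Dad", "Tricera Mom"]),  # the risky finale
]

for number, seg in enumerate(plan, start=1):
    print(number, check_segment(seg) or "ok")

print("clips for a 3-minute video:", clips_needed(180))
```

The third segment fails both checks, which matches what happened in my izakaya scene. A 3-minute story needs at least 23 clips, so there are many transitions where the audio can drift.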

A Final Thought

Perhaps I was a little too harsh.
But I believe creators are paying the price for these architectural shortcuts too—losing the very atmosphere and soul they tried to build.

Google is one of the giants. I genuinely want them to step up and lead… seriously.

So, here is my takeaway for today:

  • “Don’t dismiss Veo just yet.”
  • “It is a fantastic tool—if you ignore the audio logic.”
  • “Besides, you always have the option of using dedicated AI tools to handle the sound separately.”

I was born and raised in Japan. After working for 30 years in the IT industry as an engineer and manager, I became fascinated by the true potential of technology and founded "havefunwithAIch," which I currently run.