I came back from a Europe trip with 58 files from a DJI Osmo Pocket 3 — 173 minutes of unstructured 4K. No shot list, no plan, just walking and filming. Instead of losing a weekend to a timeline editor, I built a pipeline and let AI cut the film.
How the pipeline works
Frame analysis. The system extracted ~10,000 frames, scored each for sharpness and exposure, and classified its camera-motion type. All of it ran locally on a Mac Mini M4, with no cloud compute bill.
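To make the frame scoring concrete, here is a minimal sketch of two common per-frame metrics. It is an assumption, not the pipeline's actual code: sharpness is approximated by the variance of a Laplacian response, exposure by the fraction of pixels that are neither crushed nor blown out. The function names and the clipping thresholds are hypothetical, and camera-motion classification is left out.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Sharpness proxy: variance of a 4-neighbour Laplacian over interior pixels.

    `gray` is a 2D list of 0-255 grayscale values, at least 3x3.
    Blurry frames have weak edges, so the variance is low.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def exposure_score(gray, clip_low=5, clip_high=250):
    """Fraction of pixels that are not clipped (thresholds are assumed values)."""
    pixels = [p for row in gray for p in row]
    clipped = sum(1 for p in pixels if p <= clip_low or p >= clip_high)
    return 1.0 - clipped / len(pixels)
```

At 4K resolution a pure-Python loop like this would be slow; a real implementation would more likely downscale the frame first or use an array library.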
Speech as raw material. Whisper's first pass found no speech at all because the field audio was too quiet, so the pipeline gained a preprocessing step: 2x gain plus noise filtering, then re-transcription. With that in place, Whisper found Russian speech in 20 of the first 23 processed clips.
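One way to do that preprocessing is a single ffmpeg pass before transcription. The sketch below only builds the command. The 2x gain comes from the text above; the specific noise filters (`highpass` and `afftdn`), the 100 Hz cutoff, the filter order, and the function name are assumptions rather than the pipeline's actual settings.

```python
def build_preprocess_cmd(src, dst, gain=2.0, highpass_hz=100):
    """Return an ffmpeg argument list that boosts and denoises the audio track."""
    audio_filters = ",".join([
        f"highpass=f={highpass_hz}",  # cut low-frequency rumble (assumed cutoff)
        "afftdn",                     # FFT-based denoiser with default settings
        f"volume={gain}",             # the 2x gain described in the post
    ])
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,
        "-vn",                 # drop the video stream
        "-af", audio_filters,
        "-ac", "1",            # mono
        "-ar", "16000",        # 16 kHz, the rate Whisper resamples to anyway
        dst,
    ]
```

Running it would look like `subprocess.run(build_preprocess_cmd("clip.mp4", "clip.wav"), check=True)`, after which the WAV file goes to Whisper instead of the original clip.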
Scene understanding. Claude looked at key frames and tagged what it saw: which city (Paris / Amsterdam / Barcelona), landmarks, people, mood, and a visual-interest score from 1 to 10.
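Tags like these are easier to use downstream if the model's answer is forced into a fixed shape and validated. The sketch below assumes the model is asked to reply with JSON. The field names, the `SceneTag` class, and the `parse_scene_tag` helper are hypothetical; only the three cities and the 1 to 10 score come from the description above.

```python
import json
from dataclasses import dataclass

CITIES = {"Paris", "Amsterdam", "Barcelona"}


@dataclass
class SceneTag:
    city: str
    landmarks: list[str]
    people: list[str]
    mood: str
    interest: int  # visual-interest score, 1 to 10


def parse_scene_tag(raw_json):
    """Parse one model reply and reject values outside the expected ranges."""
    data = json.loads(raw_json)
    city = data["city"]
    interest = data["interest"]
    if city not in CITIES:
        raise ValueError(f"unexpected city: {city!r}")
    if not isinstance(interest, int) or not 1 <= interest <= 10:
        raise ValueError(f"interest must be an integer from 1 to 10, got {interest!r}")
    return SceneTag(
        city=city,
        landmarks=list(data.get("landmarks", [])),
        people=list(data.get("people", [])),
        mood=data.get("mood", ""),
        interest=interest,
    )
```

Rejecting a malformed reply up front means a bad tag shows up as an error at tagging time, not as a strange cut later.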
Creative direction. Claude Opus took everything — frame scores, transcripts, scene tags — and produced an edit plan. Its key structural decision: use the speech moments as narrative anchors. "We're going to Paris!", jokes under the Eiffel Tower, a heated FEBO debate in Amsterdam — those became the spine of the film, with music auto-ducking whenever someone talks.
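The auto-ducking part is simple to sketch. Given the speech intervals on the final timeline, the music gain drops to a lower level while someone talks and ramps back up afterwards. This is an illustration under assumed values, not the project's code: the duck level of 0.25, the 0.5 second fade, and the function name are all hypothetical.

```python
def music_gain(t, speech, duck=0.25, fade=0.5):
    """Music gain at time t (seconds).

    `speech` is a list of (start, end) intervals. The gain is 1.0 away from
    speech, `duck` during it, with linear ramps of `fade` seconds on each side.
    """
    gain = 1.0
    for start, end in speech:
        if start <= t <= end:
            return duck
        # Distance from t to the nearest edge of this interval.
        gap = start - t if t < start else t - end
        if gap < fade:
            gain = min(gain, duck + (1.0 - duck) * gap / fade)
    return gain
```

With one speech moment from 10 s to 12 s, the music sits at full volume until 9.5 s, fades down to 0.25 by 10 s, stays there until 12 s, and is back to full volume by 12.5 s.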
Five versions to "I like it"
- V1 — fast cuts, no sound design. Verdict: a slideshow.
- V2 — music and a hook. Music crushed everything; slow-mo stuttered.
- V3 — pauses in the music. Silence with no motivation.
- V4 — even music bed. Technically fine, emotionally flat.
- V5 — speech-anchored narrative. That one I actually liked.
The finished cut: 5:30, 33 of 56 clips, 17 segments flowing Paris → Amsterdam → Barcelona, three royalty-free tracks, ten speech moments carrying the story.
What I learned
Speech is what makes a travel film alive. Every version without it felt like a screensaver. The moment real voices anchor the cut, it becomes a story about people.
AI doesn't nail it on the first pass — it iterates fast. Five full versions took less effort than one manual rough cut. The loop "watch → say what's wrong in one sentence → get a new cut" is the actual product.
Whisper needs help in the field. Camera-mic audio from a windy street defeats it silently — you get an empty transcript, not an error. Preprocessing is mandatory, and "no speech found" should always be treated as a bug until proven otherwise.
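In code, "treat an empty transcript as a bug" can be a thin wrapper around the transcription call. The sketch below takes the transcriber and the preprocessing step as plain callables, so it makes no assumption about Whisper's API. The names `Transcript` and `transcribe_checked` are hypothetical.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    text: str
    retried: bool       # True if the preprocessed audio was needed
    needs_review: bool  # True if the transcript is still empty after the retry


def transcribe_checked(path, transcribe, preprocess):
    """Transcribe `path`; on an empty result, preprocess the audio and retry once.

    `transcribe(path) -> str` and `preprocess(path) -> new_path` are supplied
    by the caller.
    """
    text = transcribe(path).strip()
    if text:
        return Transcript(text, retried=False, needs_review=False)
    text = transcribe(preprocess(path)).strip()
    return Transcript(text, retried=True, needs_review=not text)
```

A clip that comes back with `needs_review` set is one to listen to by hand before concluding that nobody spoke in it.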
A pocket camera plus a Mac Mini is a full video studio now. No videographer, no editor, no render farm. The marginal cost of the film was zero on top of subscriptions I already pay.
The code is open source: github.com/vboldyrev16/ai-video-editor.
Originally discussed (in Russian) on my Telegram channel.