MercyNews
Two Brothers Build Text-to-Video Model from Scratch
Technology

Hacker News · 6h ago

Key Facts

  • Sahil and Manu, two brothers, spent two years training a text-to-video model entirely from scratch, releasing it under the Apache 2.0 license.
  • The 2B parameter model generates 2-5 seconds of footage at either 360p or 720p resolution. Its closest comparison in size is Alibaba's Wan 2.1 1.3B model.
  • Development focused heavily on building effective curation pipelines, including hand-labeling aesthetic properties and fine-tuning VLMs for large-scale filtering.
  • The model uses T5 for text encoding, Wan 2.1 VAE for compression, and a DiT-variant backbone trained with flow matching.
  • Current strengths include cartoon/animated styles, food and nature scenes, and simple character motion, while complex physics and fast motion remain challenging.
  • The brothers view this as a stepping stone toward state-of-the-art capabilities, with future plans for post-training, distillation, and audio integration.

In This Article

  1. Quick Summary
  2. The Two-Year Journey
  3. Technical Architecture
  4. Capabilities & Limitations
  5. Why Build Another Model?
  6. Future Roadmap
  7. Looking Ahead

Quick Summary

Two brothers have completed a two-year journey to build a text-to-video model entirely from scratch, releasing it as open-source software. The project, led by Sahil and Manu, demonstrates that independent developers can compete in the advanced AI space without massive corporate resources.

The resulting model contains 2 billion parameters and can generate short video clips from text descriptions. While not claiming to match the performance of commercial systems like Sora or Veo, the brothers view their work as a crucial stepping stone toward state-of-the-art capabilities.

The Two-Year Journey

The brothers began their work in early 2024, shipping their first model in January of that year—before OpenAI's Sora made headlines. Their initial release was a 180p, 1-second GIF bot that was bootstrapped off Stable Diffusion XL. However, they quickly discovered fundamental limitations with using image-based models for video generation.

Image VAEs don't understand temporal coherence, and without the original training data, it's impossible to smoothly transition between image and video distributions. At some point, the brothers determined they were better off starting over rather than trying to patch existing solutions.

Their second version represents a complete rebuild from the ground up. The model uses:

  • T5 for text encoding
  • Wan 2.1 VAE for compression
  • A DiT-variant backbone trained with flow matching

Interestingly, while they built their own temporal VAE, they ultimately used Wan's smaller version because it offered equivalent performance while saving on embedding costs. The brothers have committed to open-sourcing their VAE shortly.
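
For readers unfamiliar with flow matching, the following is a minimal sketch of the idea in plain Python. It is not the brothers' training code: the linear noise-to-data path, the time convention (t=0 is noise, t=1 is data), and every function name here are assumptions chosen for illustration. A real model would replace the oracle velocity function with a DiT that predicts velocity from the noisy latent, the timestep, and the text embedding.

```python
def flow_matching_pair(data, noise, t):
    """Build one training example on a straight path from noise (t=0) to data (t=1).

    Returns the interpolated point x_t and the velocity the network should predict.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity


def mse(pred, target):
    """Mean squared error, the usual regression loss on the predicted velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def euler_sample(velocity_fn, noise, steps=10):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    data, noise = [1.0, 2.0], [0.0, 0.0]
    x_t, v = flow_matching_pair(data, noise, t=0.5)
    print("x_t:", x_t, "target velocity:", v)

    # Stand-in for a trained network: always returns the true velocity.
    def oracle(x, t):
        return [d - n for n, d in zip(noise, data)]

    print("sample:", euler_sample(oracle, noise))
```

With a perfect velocity predictor, the Euler loop carries the noise vector to the data point, which is the behavior sampling from a trained model approximates.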

"We're not claiming to have reached the frontier. For us, this is a stepping stone towards SOTA - proof we can train these models end-to-end ourselves."

— Sahil and Manu, Model Developers

Technical Architecture

The model generates 2-5 seconds of footage at either 360p or 720p resolution. In terms of model size, the closest comparison is Alibaba's Wan 2.1 1.3B model, though the brothers report that their model achieves significantly better motion capture and aesthetics in their testing.
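
To give a sense of the tensor sizes involved, here is a rough sketch of how a clip's duration and resolution map to the latent grid a diffusion backbone would process. The article does not state the frame rate or exact pixel dimensions, so those are inputs here. The default strides (4x temporal, 8x spatial) and 16 latent channels are assumptions based on the publicly described Wan 2.1 VAE, and `latent_shape` is a hypothetical helper, not part of any released code.

```python
def latent_shape(seconds, fps, width, height,
                 t_stride=4, s_stride=8, channels=16):
    """Estimate the latent tensor shape (channels, frames, height, width).

    Assumes a causal video VAE that encodes the first frame on its own and
    then compresses the remaining frames by t_stride (an assumption).
    """
    frames = int(seconds * fps)
    latent_frames = 1 + (frames - 1) // t_stride
    return (channels, latent_frames, height // s_stride, width // s_stride)


if __name__ == "__main__":
    # fps=16 is an illustrative value only; the article does not give one.
    print(latent_shape(seconds=5, fps=16, width=1280, height=720))
    print(latent_shape(seconds=2, fps=16, width=640, height=360))
```

Under these assumptions, moving from 360p to 720p quadruples the number of spatial latent positions per frame, which is one reason higher resolutions cost so much more to train and sample.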

The bulk of their development time wasn't spent on the model architecture itself, but on building curation pipelines that actually work. This involved hand-labeling aesthetic properties and fine-tuning Vision-Language Models (VLMs) to filter training data at scale.
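
The filtering step described above can be pictured with a small sketch. Everything in it is hypothetical: the property names, the score range, and the thresholds are placeholders, and the scores are assumed to come from some fine-tuned VLM that is not shown. The article does not describe the actual pipeline in this detail.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    path: str
    # Hypothetical per-property scores in [0, 1], e.g. produced by a VLM
    # fine-tuned on hand-labeled examples.
    scores: dict = field(default_factory=dict)


def curate(clips, thresholds):
    """Keep clips whose score meets the threshold for every listed property.

    A clip missing a score for a property is treated as failing it.
    """
    kept = []
    for clip in clips:
        if all(clip.scores.get(name, 0.0) >= cutoff
               for name, cutoff in thresholds.items()):
            kept.append(clip)
    return kept


if __name__ == "__main__":
    clips = [
        Clip("a.mp4", {"lighting": 0.9, "composition": 0.8}),
        Clip("b.mp4", {"lighting": 0.4, "composition": 0.9}),
        Clip("c.mp4", {"lighting": 0.7}),
    ]
    thresholds = {"lighting": 0.6, "composition": 0.5}
    print([c.path for c in curate(clips, thresholds)])
```

Requiring every property to pass keeps the filter strict. A production pipeline might instead weight properties or keep borderline clips for a lower-resolution training stage.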

When asked about their approach, the brothers explained their philosophy:

"Products are extensions of the underlying model's capabilities. If users want a feature the model doesn't support (character consistency, camera controls, editing, style mapping, etc.), you're stuck. To build the product we want, we need to update the model itself."

This perspective drives their decision to own the entire development process, despite the significant computational costs involved.

Capabilities & Limitations

The model demonstrates particular strengths in specific domains. Through extensive testing, the brothers identified what works best:

  • Cartoon and animated styles
  • Food and nature scenes
  • Simple character motion

However, the model still faces challenges with more complex scenarios. Areas that don't work well include:

  • Complex physics simulations
  • Fast motion sequences (gymnastics, dancing)
  • Consistent text rendering

The brothers are transparent about their model's position in the current landscape. They explicitly state: "We're not claiming to have reached the frontier." Instead, they view this release as proof of concept—demonstrating they can train these models end-to-end themselves.

Why Build Another Model?

With commercial offerings like Google's Veo and OpenAI's Sora already available, the brothers' decision to build from scratch might seem counterintuitive. Their reasoning centers on product control and flexibility.

When commercial models don't support specific features, developers are limited by what those models can do. The brothers believe that to build the product they envision, they need to update the model itself. This requires owning the development process rather than relying on external APIs.

It's a significant bet that requires substantial GPU compute resources and time to pay off, but they believe it's the right long-term strategy. Their approach allows them to:

  • Customize capabilities for specific use cases
  • Iterate quickly on model improvements
  • Control the entire technology stack
  • Build features that commercial models don't support

Future Roadmap

The brothers have outlined a clear roadmap for future development. Their immediate priorities include:

  • Post-training for physics and deformations
  • Distillation for speed optimization
  • Audio capabilities integration
  • Model scaling for improved performance

They've also maintained a detailed "lab notebook" of all their experiments in Notion, which they're willing to share with others interested in the technical details of building models from zero to one.

The model is released under the Apache 2.0 license, making it freely available for commercial and non-commercial use. This open-source approach aligns with their goal of democratizing access to advanced AI capabilities.

Looking Ahead

The release of this 2B parameter model represents more than just a technical achievement—it demonstrates that independent developers can compete in the advanced AI space with sufficient dedication and resources. The brothers' two-year journey from a 180p GIF bot to a sophisticated text-to-video model shows what's possible with focused effort.

While the model may not yet match the performance of commercial giants, it serves as a stepping stone toward state-of-the-art capabilities. The brothers' commitment to open-source development and transparent documentation could inspire other independent researchers to pursue similar projects.

As the AI landscape continues to evolve, projects like this highlight the importance of diversity in development approaches. Rather than relying solely on large corporate research labs, the field benefits from contributions from independent developers who bring different perspectives and priorities to the table.

"Products are extensions of the underlying model's capabilities. If users want a feature the model doesn't support—character consistency, camera controls, editing, style mapping, etc.—you're stuck."

— Sahil and Manu, Model Developers

"To build the product we want, we need to update the model itself. That means owning the development process."

— Sahil and Manu, Model Developers
