Key Facts
- ✓ Sahil and Manu, two brothers, spent two years training a text-to-video model entirely from scratch, releasing it under the Apache 2.0 license.
- ✓ The 2B parameter model generates 2-5 seconds of footage at either 360p or 720p resolution; its closest peer by size is Alibaba's Wan 2.1 1.3B model.
- ✓ Development focused heavily on building effective curation pipelines, including hand-labeling aesthetic properties and fine-tuning VLMs for large-scale filtering.
- ✓ The model uses T5 for text encoding, Wan 2.1 VAE for compression, and a DiT-variant backbone trained with flow matching.
- ✓ Current strengths include cartoon/animated styles, food and nature scenes, and simple character motion, while complex physics and fast motion remain challenging.
- ✓ The brothers view this as a stepping stone toward state-of-the-art capabilities, with future plans for post-training, distillation, and audio integration.
Quick Summary
Two brothers have completed a two-year journey to build a text-to-video model entirely from scratch, releasing it as open-source software. The project, led by Sahil and Manu, demonstrates that independent developers can compete in the advanced AI space without massive corporate resources.
The resulting model contains 2 billion parameters and can generate short video clips from text descriptions. While not claiming to match the performance of commercial systems like Sora or Veo, the brothers view their work as a crucial stepping stone toward state-of-the-art capabilities.
The Two-Year Journey
The brothers began their work in early 2024, shipping their first model that January, before OpenAI's Sora made headlines. That initial release was a 180p, one-second GIF bot bootstrapped from Stable Diffusion XL. However, they quickly hit fundamental limitations in using an image-based model for video generation.
Image VAEs don't model temporal coherence, and without the original training data it's effectively impossible to transition smoothly between the image and video distributions. Eventually the brothers concluded they were better off starting over than patching an existing image model into a video one.
Their second version is a complete rebuild from the ground up. The model uses:
- T5 for text encoding
- the Wan 2.1 VAE for spatio-temporal compression
- a DiT-variant backbone trained with flow matching
Interestingly, while they built their own temporal VAE, they ultimately used Wan's smaller version because it offered equivalent performance while saving on embedding costs. The brothers have committed to open-sourcing their VAE shortly.
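As a rough illustration of that recipe, the sketch below shows what a single flow-matching training step can look like: clips are compressed into latents by the VAE, captions are embedded with T5, and a DiT-style backbone learns to predict the velocity between noise and data. The linear noise-to-data path, the function signatures, and the names `dit`, `vae_encode`, and `t5_encode` are illustrative assumptions, not the brothers' actual code.

```python
# Minimal sketch of a flow-matching training step on video latents.
# The modules `dit`, `vae_encode`, and `t5_encode` are placeholders;
# only the overall recipe (T5 conditioning, VAE latents, DiT backbone,
# flow-matching objective) comes from the article.
import torch
import torch.nn.functional as F

def flow_matching_step(dit, vae_encode, t5_encode, videos, captions, optimizer):
    with torch.no_grad():
        x1 = vae_encode(videos)      # clean latents, shape (B, C, T, H, W)
        text = t5_encode(captions)   # T5 token embeddings for conditioning

    x0 = torch.randn_like(x1)                       # pure noise
    t = torch.rand(x1.shape[0], device=x1.device)   # time in [0, 1], t=1 is data
    t_ = t.view(-1, 1, 1, 1, 1)

    xt = (1.0 - t_) * x0 + t_ * x1    # linear path between noise and data
    target = x1 - x0                  # velocity along that path

    pred = dit(xt, t, text)           # backbone predicts the velocity
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At generation time the learned velocity field is integrated from noise toward data with an ODE solver (plain Euler steps are the simplest choice), and the final latents are decoded back to pixels with the VAE.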
"We're not claiming to have reached the frontier. For us, this is a stepping stone towards SOTA - proof we can train these models end-to-end ourselves."
— Sahil and Manu, Model Developers
Technical Architecture
The model generates 2-5 seconds of footage at either 360p or 720p resolution. The closest comparison by size is Alibaba's Wan 2.1 1.3B model, though the brothers report significantly better motion and aesthetics in their own testing.
The bulk of their development time wasn't spent on the model architecture itself, but on building curation pipelines that actually work. This involved hand-labeling aesthetic properties and fine-tuning Vision-Language Models (VLMs) to filter training data at scale.
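The article doesn't spell out how that filter is wired, but a minimal sketch, assuming a fine-tuned VLM that returns per-clip scores, might look like the following; the score fields, thresholds, and the `score_clip` callable are all hypothetical.

```python
# Hypothetical sketch of a VLM-based curation filter. `score_clip` stands in
# for a fine-tuned VLM that rates sampled frames; the fields and thresholds
# below are illustrative, not the brothers' published criteria.
from dataclasses import dataclass

@dataclass
class ClipScores:
    aesthetic: float     # e.g. 0-10, learned from hand-labeled examples
    motion: float        # e.g. 0-10, how much genuine motion the clip has
    has_watermark: bool

def keep_clip(scores: ClipScores,
              min_aesthetic: float = 5.5,
              min_motion: float = 2.0) -> bool:
    """Decide whether a clip stays in the training set."""
    if scores.has_watermark:
        return False
    return scores.aesthetic >= min_aesthetic and scores.motion >= min_motion

def curate(clips, score_clip):
    """Run the scorer over a corpus and keep only the clips that pass."""
    return [clip for clip in clips if keep_clip(score_clip(clip))]
```

The thresholds themselves are trivial; the hard part is producing a scorer reliable enough to run over an entire corpus, which is where the hand-labeling and VLM fine-tuning the brothers describe come in.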
When asked about their approach, the brothers explained their philosophy:
"Products are extensions of the underlying model's capabilities. If users want a feature the model doesn't support—character consistency, camera controls, editing, style mapping, etc.—you're stuck. To build the product we want, we need to update the model itself."
— Sahil and Manu, Model Developers
This perspective drives their decision to own the entire development process, despite the significant computational costs involved.
Capabilities & Limitations
The model demonstrates particular strengths in specific domains. Through extensive testing, the brothers identified what works best:
- Cartoon and animated styles
- Food and nature scenes
- Simple character motion
However, the model still faces challenges with more complex scenarios. Areas that don't work well include:
- Complex physics simulations
- Fast motion sequences (gymnastics, dancing)
- Consistent text rendering
The brothers are transparent about their model's position in the current landscape. They explicitly state: "We're not claiming to have reached the frontier." Instead, they view this release as proof of concept—demonstrating they can train these models end-to-end themselves.
Why Build Another Model?
With commercial offerings like Google's Veo and OpenAI's Sora already available, the brothers' decision to build from scratch might seem counterintuitive. Their reasoning centers on product control and flexibility.
When commercial models don't support specific features, developers are limited by what those models can do. The brothers believe that to build the product they envision, they need to update the model itself. This requires owning the development process rather than relying on external APIs.
It's a significant bet, one that will take substantial GPU compute and time to pay off, but they believe it's the right long-term strategy. Their approach allows them to:
- Customize capabilities for specific use cases
- Iterate quickly on model improvements
- Control the entire technology stack
- Build features that commercial models don't support
Future Roadmap
The brothers have outlined a clear roadmap for future development. Their immediate priorities include:
- Post-training for physics and deformations
- Distillation for speed optimization (a rough sketch of the idea follows this list)
- Audio capabilities integration
- Model scaling for improved performance
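For the distillation item, here is a toy sketch of one common approach, not necessarily the one the brothers will take: freeze a multi-step teacher sampler and train a student to reach the same latents in a single step. The module names and the plain Euler sampler are illustrative assumptions.

```python
# Toy sketch of step-count distillation for a flow-matching sampler: a frozen
# teacher integrates many Euler steps, and a student learns to produce the
# same latents in one step. This is a generic illustration, not the brothers'
# actual recipe; `teacher`, `student`, and `text` are placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_sample(teacher, x0, text, steps=50):
    """Integrate the teacher's velocity field from noise (t=0) to data (t=1)."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + teacher(x, t, text) * dt   # one Euler step along the velocity
    return x

def distill_step(teacher, student, text, latent_shape, optimizer, device="cuda"):
    x0 = torch.randn(latent_shape, device=device)
    target = teacher_sample(teacher, x0, text)   # expensive multi-step target
    t0 = torch.zeros(latent_shape[0], device=device)
    pred = x0 + student(x0, t0, text)            # single Euler step spanning dt = 1
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Real distillation recipes (consistency training, adversarial distillation, and so on) are more involved, but they share this basic shape: trade many small solver steps for one or a few learned jumps.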
They've also maintained a detailed "lab notebook" of all their experiments in Notion, which they're willing to share with others interested in the technical details of building models from zero to one.
The model is released under the Apache 2.0 license, making it freely available for commercial and non-commercial use. This open-source approach aligns with their goal of democratizing access to advanced AI capabilities.
Looking Ahead
The release of this 2B parameter model represents more than just a technical achievement—it demonstrates that independent developers can compete in the advanced AI space with sufficient dedication and resources. The brothers' two-year journey from a 180p GIF bot to a sophisticated text-to-video model shows what's possible with focused effort.
While the model may not yet match the performance of commercial giants, it serves as a stepping stone toward state-of-the-art capabilities. The brothers' commitment to open-source development and transparent documentation could inspire other independent researchers to pursue similar projects.
As the AI landscape continues to evolve, projects like this highlight the importance of diversity in development approaches. Rather than relying solely on large corporate research labs, the field benefits from contributions from independent developers who bring different perspectives and priorities to the table.
"Products are extensions of the underlying model's capabilities. If users want a feature the model doesn't support—character consistency, camera controls, editing, style mapping, etc.—you're stuck."
— Sahil and Manu, Model Developers
"To build the product we want, we need to update the model itself. That means owning the development process."
— Sahil and Manu, Model Developers