Phenaki is a model for generating realistic videos from textual descriptions. It tackles several challenges in text-to-video generation: the high computational cost of video generation, videos of variable length, and the limited availability of high-quality text-video training data.
Phenaki's approach involves two main components. First, an encoder-decoder model compresses videos into sequences of discrete embeddings, or tokens. Because it uses causal attention in time, this tokenizer can handle videos of variable length. Second, a bidirectional masked transformer, conditioned on pre-computed text tokens, translates text into video tokens; the generated video tokens are then de-tokenized to produce the actual video.
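To make these two ideas concrete, here is a minimal PyTorch sketch of the pieces involved: a causal temporal attention mask like the one the tokenizer relies on, and a MaskGIT-style iterative sampling loop for the bidirectional masked transformer. The function names, shapes, number of steps, and cosine schedule are illustrative assumptions, not Phenaki's actual implementation.

```python
import torch

def causal_temporal_mask(num_frames: int) -> torch.Tensor:
    """Lower-triangular attention mask over time: tokens of frame t may
    attend only to frames <= t. This causality is what lets the tokenizer
    encode (and later extend) videos of arbitrary length."""
    return torch.tril(torch.ones(num_frames, num_frames, dtype=torch.bool))

@torch.no_grad()
def maskgit_sample(transformer, text_emb, seq_len, mask_id, steps=12):
    """MaskGIT-style iterative decoding: start from an all-masked canvas,
    predict every token in parallel, commit the most confident predictions,
    and re-mask the rest for the next step."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = transformer(tokens, text_emb)     # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)    # per-token confidence
        still_masked = tokens == mask_id
        tokens = torch.where(still_masked, pred, tokens)
        # Cosine schedule: the masked fraction shrinks toward zero.
        frac = torch.cos(torch.tensor((step + 1) / steps * torch.pi / 2))
        num_to_mask = int(frac * seq_len)
        if num_to_mask > 0:
            # Never re-mask tokens committed in earlier steps.
            conf = conf.masked_fill(~still_masked, float("inf"))
            remask = conf[0].topk(num_to_mask, largest=False).indices
            tokens[0, remask] = mask_id
    return tokens

# Stand-in transformer (random logits over a 1024-token codebook), just to
# exercise the sampling loop and show the shapes involved.
dummy = lambda toks, txt: torch.randn(1, toks.shape[1], 1024)
video_tokens = maskgit_sample(dummy, text_emb=None, seq_len=256, mask_id=1024)
```

The key design choice this sketch captures is that, unlike left-to-right autoregressive decoding, the bidirectional masked transformer fills in many tokens per step, which is what keeps sampling cost manageable for long token sequences.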
One of the key innovations of Phenaki is its ability to generate arbitrarily long videos based on a sequence of prompts. This is particularly useful for creating videos that tell a story or follow a time-variable text sequence. The model has been trained on a large corpus of image-text pairs and a smaller number of video-text examples, allowing it to generalize beyond the available video datasets.
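The long-video mechanism can be sketched as a simple loop: each new prompt extends the video by conditioning generation on the tokens of the last few frames already produced. In this hedged sketch, `encode_text` and `generate_segment` are hypothetical stand-ins for the text encoder and the masked-transformer sampler above, and the segment and context lengths are illustrative.

```python
def generate_story(prompts, encode_text, generate_segment,
                   frames_per_segment=11, context_frames=5):
    """Extend the video one prompt at a time, conditioning each new
    segment on the token representation of the last few frames already
    generated, so consecutive segments stay visually continuous."""
    frame_tokens = []  # one token grid per generated frame
    for prompt in prompts:
        text_emb = encode_text(prompt)
        context = frame_tokens[-context_frames:]  # tail of prior video
        frame_tokens.extend(
            generate_segment(text_emb, context, frames_per_segment))
    return frame_tokens  # de-tokenize with the video decoder to get frames
```

Because the causal tokenizer can encode any prefix of a video, this conditioning step works regardless of how long the accumulated video has grown, which is what makes the output length effectively unbounded.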
Phenaki's performance has been impressive: its video tokenizer outperforms per-frame baselines in spatio-temporal quality while requiring fewer tokens per video. This makes it a significant advancement in video synthesis, opening up new possibilities for creating realistic and engaging video content from textual descriptions.