

A couple of weeks ago, I was listening to music on YouTube when I ran into a newish song by El Kuelge: Diganselon.
If you don’t know them, El Kuelge is an Argentinian indie band I only discovered in the last couple of years, but they’ve grown on me quite a bit. Check them out!
Anyway, I was watching the music video when I noticed what it was doing, or more accurately, how it was made. The video follows someone riding a bike through Buenos Aires. It is shot from behind, and for most of it, the biker’s t-shirt is used as a screen to show short snippets related to the song’s content.
Immediately, it dawned on me that not only did I like the idea, but I could probably reproduce the same effect with my machine learning skills. I went to work.
The steps to achieve a similar video can be described as follows.
We want to know where the t-shirt is at any point in time. More precisely, we want a mask of where the t-shirt is. A mask is a matrix of 1s and 0s indicating whether something is present at each pixel. Since images and videos are matrices of pixels, we need another matrix, the mask, to know where the t-shirt is. If we call the mask m, then we can compute something like w * m + (1 - m) * v, where v is a frame of the main video in which the main character is biking, and w is the corresponding frame of the secondary video that we want to play on the t-shirt. Wherever the mask is 1, the secondary video shows through; everywhere else, we keep the original frame.
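To make the formula concrete, here is a minimal sketch of that blend in NumPy. The function name is mine, and I’m assuming frames are standard uint8 H x W x 3 arrays:

```python
import numpy as np

def composite(v: np.ndarray, w: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Blend one frame of the main video (v) with one frame of the
    secondary video (w) using a 0/1 t-shirt mask m of shape (H, W)."""
    m = m.astype(np.float32)[..., None]   # (H, W) -> (H, W, 1), broadcasts over the color channels
    out = w * m + v * (1.0 - m)           # secondary video where the t-shirt is, original elsewhere
    return out.astype(np.uint8)
```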
To find m, we can use a segmentation model on each frame v of the first video V. Segmentation models are trained on a fixed set of object classes and can only recognize those classes. That means that if I wanted to reuse a model from the internet, I had two options: search for a long time until I found one whose original dataset included t-shirts, or fine-tune a pre-trained model, which basically means building a new dataset of t-shirts and teaching the model to recognize them. Both struck me as too time-consuming for this project. I wanted something simpler.
Luckily, with the advent of multi-modal models, we now have access to zero-shot segmentation. These models work similarly to ChatGPT: they accept a prompt and return something (text, but, because they are multi-modal, also images or masks). In this case, I went for a particular multi-modal model designed to map text + image pairs into masks.
Unlike with a model trained specifically on t-shirts, I knew that the quality of the masks would likely be lower. However, I was happy with this experiment as long as the result was usable.
I downloaded two stock videos from Pexels (one of someone walking around in a t-shirt and another with a soothing background), grabbed a CLIP-based zero-shot segmentation model, and quickly tested it on a frame.
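If you want to try the same test, a minimal sketch with CLIPSeg from the Hugging Face Hub looks roughly like this; the checkpoint, the helper name, and the frame path are placeholders of mine, not necessarily what ended up in the repo:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Assumed checkpoint: the publicly available CLIPSeg weights on the Hugging Face Hub.
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def tshirt_probs(image: Image.Image) -> torch.Tensor:
    """Return a per-pixel probability map of where the t-shirt is."""
    inputs = processor(text=["a t-shirt"], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # low-resolution logit map, roughly 352x352
    return torch.sigmoid(logits).squeeze()       # probabilities in [0, 1]

frame = Image.open("frame_0001.png").convert("RGB")  # hypothetical test frame
probs = tshirt_probs(frame)
```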
The results were good but not perfect; in particular, I was getting a per-pixel probability of there being a t-shirt, not the boolean 1 or 0 I needed.
I played with several thresholds (mask >= 0.something) and quickly settled on 0.3 as a decent value for what I wanted. After that, it was time to run the models.
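For reference, the thresholding itself is a one-liner on top of the probability map from the sketch above:

```python
# Turn the soft probability map into the hard 0/1 mask; 0.3 is just what looked right to me.
mask = (probs >= 0.3).float()
```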
We split the videos into frames and truncated the longer one to the length of the shorter one. Then, we needed to change the aspect ratio of the videos; the CLIP-based model only takes square images, so we went ahead and shrank all frames in both videos to 352x352, the input resolution of the model.
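A sketch of that preprocessing with OpenCV; the file names and the helper are mine:

```python
import cv2

def read_frames(path: str, size: int = 352) -> list:
    """Read a video and return its frames resized to size x size (BGR arrays)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    return frames

main_frames = read_frames("biking.mp4")        # hypothetical file names
overlay_frames = read_frames("background.mp4")

# Truncate the longer video so both have the same number of frames.
n = min(len(main_frames), len(overlay_frames))
main_frames, overlay_frames = main_frames[:n], overlay_frames[:n]
```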
After that, it was as simple as running the models, generating the new frame with our magic formula, and putting the video back together.
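Stitching the sketches above together, the main loop looks roughly like this; the output name and frame rate are made up:

```python
import cv2
from PIL import Image

fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("tshirt_screen.mp4", fourcc, 30.0, (352, 352))

for v, w in zip(main_frames, overlay_frames):
    rgb = Image.fromarray(cv2.cvtColor(v, cv2.COLOR_BGR2RGB))  # the model expects RGB, OpenCV gives BGR
    probs = tshirt_probs(rgb)                                   # per-pixel t-shirt probabilities
    m = (probs >= 0.3).float().numpy()                          # hard 0/1 mask
    m = cv2.resize(m, (v.shape[1], v.shape[0]))                 # match the frame size (352x352 here)
    writer.write(composite(v, w, m))                            # the blend from the beginning of the post

writer.release()
```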
You can find the results below.
And the code on GitHub.
Overall, I’m pretty happy with the results. The mask is not perfect, and there are several ways to improve it, but I’ll leave that as an exercise for the reader ;)
Until the next time!