This page presents visualizations of my solution to the comma.ai compression challenge.
I hope you will like these visualizations as much as I do. I really enjoyed the whole process of working on this challenge because the decompressed videos are visually appealing and often look like 80s video games full of neon.
For more details and a description of the Generator network, see Sections 5 and 6 of this page.
Thanks to the uncertainty video, I was able to find my model's weak spots and target them with special loss functions such as a boundary loss. The video also displays the original mask for reference, the segnet and posenet losses, and a speedometer added for fun :). The uncertainty map clearly shows that the biggest improvement in segnet distortion can be made on distant objects.
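If you are curious what such a boundary loss can look like, here is a minimal PyTorch sketch (the edge detection and the `edge_weight` value are illustrative choices, not the exact formulation from my training code):

```python
import torch.nn.functional as F

def boundary_weighted_l1(pred, target, mask, edge_weight=5.0):
    """L1 loss up-weighted near mask boundaries (illustrative weighting).

    pred, target: (B, C, H, W) frames; mask: (B, 1, H, W) float class map.
    """
    # Morphological gradient: a pixel lies on a boundary if its 3x3
    # neighborhood is not constant (dilation differs from erosion).
    dilated = F.max_pool2d(mask, 3, stride=1, padding=1)
    eroded = -F.max_pool2d(-mask, 3, stride=1, padding=1)
    edges = (dilated != eroded).float()
    weight = 1.0 + edge_weight * edges
    return (weight * (pred - target).abs()).mean()
```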
This video shows what happens to the input data at each step of the Generator. The colors are PCA projections into RGB space (similar colors mean the convolutional layers interpret those areas as the same object). You can see that most details are lost in the U-net down layer. If I had more time, this is where I would focus and try to preserve more detail.
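The PCA-to-RGB trick itself is simple; here is a rough sketch of how such a visualization can be produced (the `features_to_rgb` helper is illustrative, not my exact plotting code):

```python
import numpy as np
from sklearn.decomposition import PCA

def features_to_rgb(features):
    """Project a (C, H, W) activation map to an RGB image via PCA.

    Each pixel's C-dim feature vector is reduced to 3 components,
    then min-max scaled to [0, 1] so it can be shown as a color.
    """
    c, h, w = features.shape
    flat = features.reshape(c, -1).T                 # (H*W, C) pixel vectors
    comps = PCA(n_components=3).fit_transform(flat)  # (H*W, 3)
    comps -= comps.min(axis=0)
    comps /= comps.max(axis=0) + 1e-8
    return comps.reshape(h, w, 3)
```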
The first video is generated by an approximately 4× larger network that preserves details better.
The second video is generated by a model with higher road-lane luminance.
The embedding space nicely shows the relationships between classes. For example, the model separates undrivable well from the rest of the objects, and, interestingly, ego-vehicle and road lanes are closely related in the embedding space. You can also see that ego-vehicle and vehicle sit on opposite sides of the embedding space; maybe this means the model sees other cars as obstacles :).
In this image, you can see the first two convolutional layers of my Generator network. Note that many of the kernels learned to look for similar features, so future work could replace the classic convolutions with Ghost Convolutions, which approximate feature maps with cheap linear transformations and achieve similar results while saving some memory.
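For illustration, a Ghost convolution can be sketched like this (a simplified version of the GhostNet idea, without normalization or activations; not code from my model):

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution: compute half the output channels with a regular
    conv, then derive the other half with a cheap depthwise conv.
    Assumes out_ch is even."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Conv2d(in_ch, primary_ch, kernel_size,
                                 padding=kernel_size // 2)
        # Cheap linear transformation: a depthwise 3x3 conv over the
        # primary feature maps generates the "ghost" feature maps.
        self.cheap = nn.Conv2d(primary_ch, primary_ch, 3, padding=1,
                               groups=primary_ch)

    def forward(self, x):
        primary = self.primary(x)
        return torch.cat([primary, self.cheap(primary)], dim=1)
```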
My approach uses a Generator network that extracts masks, inter-pair poses, and intra-pair poses, and generates pairs of yuv6 frames. The Generator consists of a MaskEncoder that extracts features from the mask and a FrameHead that applies shifts to those features. Both use depthwise separable convolutions with residual connections, together with SqueezeExcitation and PixelAttention blocks.
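To make this concrete, here is a simplified sketch of what such a building block can look like (the PixelAttention part and the exact channel counts are omitted; this is not my exact implementation):

```python
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel attention: global-average-pool, then rescale channels."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class DSResBlock(nn.Module):
    """Depthwise separable conv + squeeze-excitation, with a residual."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise
            nn.Conv2d(ch, ch, 1),                        # pointwise
            nn.ReLU(),
            SqueezeExcitation(ch),
        )

    def forward(self, x):
        return x + self.body(x)
```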
Here is a quick breakdown of what is happening "under the hood" for those interested in the technical side of my solution.
My first instinct was to use FFmpeg. I managed to shrink the 37 MB video down to just a few hundred KB, but the quality was terrible. The heavy compression created blocky artifacts that lacked the texture PoseNet needs to work correctly, and the block edges were hard for the learned Enhancer to remove. That is why I pivoted to building a custom neural reconstruction system.
I designed a Generator network with roughly 110,000 parameters. It is quite small because the model size itself counts towards the final score. It consists of three main parts:
To keep the model size tiny while maintaining quality, I used several "cheap" neural tricks:
In the end, while the video looks a bit like a neon game from the 80s, the approach was highly effective for the challenge. I managed to get a score of 0.72 and secure 9th place out of 30 participants at the time of my submission, which I'm pretty happy with!