Generator Visualizations

This page includes visualizations of my solution to the comma.ai compression challenge.

I hope you will like my visualizations as much as I do. I really enjoyed the whole process of working on this challenge, because the decompressed videos are visually very appealing and often look like 80s video games full of neon.

For more details and a description of the Generator network, see Sections 5 and 6 of this page.

1. Performance visualizations

Uncertainty video

Thanks to the uncertainty video, I was able to find my model's weak spots and target them with special loss functions such as a boundary loss. It also displays the original mask for reference, the segnet and posenet losses, and a speedometer added for fun :). The uncertainty map clearly shows that the biggest improvement in segnet distortion can be made on distant objects.
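As a rough illustration of how such an uncertainty map can be built (my own sketch, not the exact loss code from the project), one can take the per-pixel cross-entropy between the predicted class probabilities and the original mask; bright pixels then mark where the model is least certain:

```python
import numpy as np

def uncertainty_map(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-pixel cross-entropy of softmax outputs vs. the ground-truth mask.

    probs:  (C, H, W) predicted class probabilities
    labels: (H, W) integer class ids from the original mask
    """
    h, w = labels.shape
    # pick the predicted probability of the true class at each pixel
    p_true = probs[labels, np.arange(h)[:, None], np.arange(w)]
    return -np.log(np.clip(p_true, 1e-8, None))

# toy example: 2 classes on a 2x2 image
probs = np.array([[[0.9, 0.2], [0.5, 0.1]],
                  [[0.1, 0.8], [0.5, 0.9]]])
labels = np.array([[0, 1], [0, 1]])
heat = uncertainty_map(probs, labels)  # high values = uncertain pixels
```

Rendering `heat` as a grayscale frame next to the decoded video gives exactly the kind of "weak spot" overlay described above.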

Image transformations

This video shows what happens to the input data at each step of the Generator. The colors are PCA projections into RGB space (similar colors mean the convolution layers interpret the area as the same object). You can see that most details are lost in the U-net down layer. If I had more time, this is where I would focus and try to preserve more detail.
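For readers curious how a many-channel feature map turns into an RGB image, here is a minimal numpy sketch of the PCA trick (an assumed reconstruction of the technique, not the project's actual code): treat each pixel's channel vector as a sample, project onto the top three principal components, and normalize into [0, 1]:

```python
import numpy as np

def features_to_rgb(feats: np.ndarray) -> np.ndarray:
    """Project a (C, H, W) feature map onto its top-3 principal
    components so that similar activations get similar colors."""
    c, h, w = feats.shape
    x = feats.reshape(c, -1).T            # (H*W, C): pixels as samples
    x = x - x.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    rgb = x @ vt[:3].T                    # (H*W, 3) projection
    rgb -= rgb.min(axis=0)
    rgb /= rgb.max(axis=0) + 1e-8         # each channel into [0, 1]
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
img = features_to_rgb(rng.standard_normal((16, 8, 8)))
```

Running this on the activations after every layer, then stacking the frames, gives the kind of per-step video described above.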

2. Generated Videos

Decompressed video from the best version of the Generator model

Interesting intermediate products of Generator training

The first video is generated by an approximately 4× larger network that preserved details better.

The second video is generated by a model with higher road-lane luminance.

3. Embedding space

The embedding space aptly shows the relationships between classes. For example, the model separates undrivable well from the rest of the objects, and interestingly, ego-vehicle and road lanes sit close together in the embedding space. You can also see ego-vehicle and vehicle on opposite sides of the embedding space; maybe this means the model sees other cars as obstacles :).

Embedding space Visualization
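A plot like this can be produced by reducing the per-class embedding vectors to 2D. Below is a hedged sketch (the class names and 8-dimensional embeddings are hypothetical placeholders, and I am assuming a simple PCA projection rather than whatever the project actually used):

```python
import numpy as np

def embed_2d(class_vectors: np.ndarray) -> np.ndarray:
    """PCA-project per-class embedding vectors to 2D for plotting."""
    x = class_vectors - class_vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T

# hypothetical 8-D embeddings for four example classes
classes = ["road", "lane", "undrivable", "vehicle"]
rng = np.random.default_rng(1)
vecs = rng.standard_normal((4, 8))
xy = embed_2d(vecs)  # one (x, y) point per class, ready to scatter-plot
```

Distances in the resulting scatter plot are what make relationships like "ego-vehicle close to road lanes" visible at a glance.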

4. Convolutional Kernel Analysis

Learned 3x3 Filters (Stem Layer)

In this image, you can see the first two convolutional layers of my Generator network. Note that many of the kernels learned to look for similar features; future work could therefore replace the classic convolutions with Ghost Convolutions, which approximate feature maps with cheap linear transformations and achieve similar results while saving some memory.

Convolution Kernels Visualization

5. Generator Architecture

My approach uses a Generator network that extracts masks, inter-pair poses, and intra-pair poses, and generates pairs of yuv6 frames. The Generator consists of a MaskEncoder that extracts features from the mask and a FrameHead that applies shifts to the features. Both use depthwise separable convolutions with residual connections, plus SqueezeExcitation and PixelAttention blocks.

Generator Architecture
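Of the building blocks named above, Squeeze-and-Excitation is perhaps the least self-explanatory. Here is a minimal numpy sketch of the idea (my own illustration with tiny random weights, not the project's actual block): pool each channel to a single number, pass it through a small ReLU bottleneck, and use the resulting sigmoid gates to rescale the channels:

```python
import numpy as np

def squeeze_excitation(feats: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-Excitation on a (C, H, W) feature map."""
    s = feats.mean(axis=(1, 2))            # squeeze: (C,) channel averages
    z = np.maximum(w1 @ s, 0.0)            # excite: bottleneck + ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ z)))    # sigmoid gates in (0, 1)
    return feats * g[:, None, None]        # rescale each channel

rng = np.random.default_rng(2)
feats = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8)) * 0.1     # reduce 8 -> 2 channels
w2 = rng.standard_normal((8, 2)) * 0.1     # expand 2 -> 8 channels
out = squeeze_excitation(feats, w1, w2)
```

The bottleneck makes the block very cheap in parameters, which fits the tiny-model constraint described in Section 6.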

6. How it works (Technical Summary)

Here is a quick breakdown of what is happening "under the hood" for those interested in the technical side of my solution.

Why not just use standard video compression?

My first instinct was to use FFmpeg. I managed to shrink the 37 MB video down to just a few hundred KB, but the quality was terrible. The heavy compression created blocky artifacts that lacked the texture PoseNet needs to work correctly, and the block edges were hard for the learned Enhancer to remove. That is why I pivoted to building a custom neural reconstruction system.

The Generator Architecture

I designed a Generator network with roughly 110,000 parameters. It is quite small because the model size itself counts towards the final score. It consists of three main parts:

Optimization Tricks

To keep the model size tiny while maintaining quality, I used several "cheap" neural tricks:

In the end, while the video looks a bit like a neon game from the 80s, the approach was highly effective for the challenge. I managed to get a score of 0.72 and secure 9th place out of 30 participants at the time of my submission, which I'm pretty happy with!