Image segmentation is a machine learning task in which one must not only categorize what is seen in an image, but do so at the per-pixel level. Here, we want to go from a satellite image to a map representation in which the things in the image are automatically categorized, much like generating a map view automatically from satellite images.

Automatic Satellite Image to Map View


In the past few months, I have worked on such an image classifier, whose goal is to precisely identify objects in satellite images. This was done by training a few U-Net Convolutional Neural Networks (one per category of object, or class, to predict) with Keras and TensorFlow, using GPU servers in the cloud.

clouds made of linux servers joke

In short, I built a pipeline for training and evaluating neural networks that automatically refines the neural architecture and configuration to best solve a given problem. This was done using Hyperopt.

More specifically, I worked on solving the DSTL’s Kaggle competition for satellite imagery feature detection. I would have ended up in approximately the top 3.5% of the participants (10th position on the private leaderboard) if the competition were still open, but the comparison is hard to make since the competition is already finished and a few solutions have been shared.

I managed to use as few computational resources as possible (approximately 4 times less than what the competition winners suggest) to build, train and evaluate my pipeline. Although some competition winners shared code that exactly reproduces their winning submission (it was released after I had started working on my pipeline), that code does not include much of what is required to come up with it in the first place, which matters if one wants to apply the pipeline to another problem or dataset.

With a view to using the neural network architecture on another dataset later on, the difficulty is that, although some winners shared their code, what they released is explicitly stated to be unsuited for a production environment. Competition-winning code is rarely clean. Moreover, building neural networks is an iterative process in which one needs to start somewhere and slowly increase a metric (an evaluation score) by modifying the neural network architecture and hyperparameters (configuration). This is mostly done as a function of the data and the problem at hand. Transferring a neural network solution to another dataset or problem always requires tweaking its configuration in order to obtain a good classifier. Still, when trying the official code of the 3rd place winners, it was possible to achieve results worthy of 2nd place.

Defining the data & the problem to solve

First off, here is a glance at the training data of the competition:

Automatic Satellite Image to Map View


The DSTL’s Satellite Imagery Feature Detection Challenge is a challenge in which participants need to code a model capable of making those predictions. The images just above are taken from the dataset: they represent an example (X, Y) pair from the training data.

First off, the satellite images are provided with 20 usable bands rather than 3. Where red, green and blue represent 3 bands (RGB), the 20 bands contain a lot more information for the neural network to ponder upon, which eases learning and improves the quality of the predictions: the network is aware of visual features that humans do not naturally see. The extra bands of available light are called the P, M and A bands. Just below is a nice visualization of the visible RGB bands, as well as the other M and A bands. This image is taken from the 3rd place winners’ interview post on Kaggle’s blog:


Note that the last band, the P band, is not illustrated. It stands for “Panchromatic”, and it captures information across a wide part of the spectrum in 1 large channel rather than in many small, specific channels.

Basically, only 25 satellite images are labelled and provided as training data. Aggregated as one big 5×5 patch, the whole training set looks like this:

Kaggle Dstl Satellite Imagery Feature Detection full training data


It is also possible to see that some images are repeated, hence there are even fewer than 25 training images in terms of distinct terrain. It goes without saying that training data is quite scarce: by comparison, reaching good accuracy on 10 classes would normally take about 50k training images as of now, as researchers have experimented with on the CIFAR-10 and CIFAR-100 datasets. Moreover, the DSTL’s data is highly imbalanced: crops cover nearly half of the surface, while most of the classes to predict, such as buildings, structures, roads, tracks and a few others, cover less than 5% of the surface.

On the bright side, the resolution of each image is quite large, which partly compensates as if more data were available. Images from the M band have varying resolutions close to 850×840 px (and sometimes as low as 835×835 px when cropped to obtain square images). RGB images have a considerably higher resolution, close to 3560×3560 px. Stretching and aligning the images is part of the competition. Moreover, some bands are misaligned with the others and were shot at different times, so simple scaling does not result in a proper alignment: an offset needs to be corrected with an image registration. Cars in the pictures may still be offset from one band to another because of the time difference. Predictions on the car categories therefore cannot be accurate and, after inspection, the labeled data for cars is of bad quality, as pictured in a Kaggle kernel.

The score evaluated is the average Jaccard index. Put simply, it is the intersection of the predicted surface with the true labels, divided by their union. This “intersection over union” score is computed for each class and then averaged over all classes. First, the Jaccard index is a good metric when training data availability is imbalanced (e.g., not enough cars in the training data), since it penalizes both false positives and false negatives. Second, it is a good metric in cases of class imbalance (e.g., there are more tree examples than car examples), since the score is averaged per class rather than per pixel.
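As a rough illustration of the metric described above, here is a minimal NumPy sketch. It is not the competition's official scoring code: the function names, the `eps` guard and the toy masks (stored as height × width × classes) are assumptions made for the example.

```python
import numpy as np

def jaccard_index(y_true, y_pred, eps=1e-12):
    """Intersection over union of two boolean masks of the same shape."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    # eps only guards against 0/0 when both masks are empty.
    return intersection / (union + eps)

def mean_jaccard(y_true, y_pred):
    """Average the per-class Jaccard index over the last (class) axis."""
    n_classes = y_true.shape[-1]
    scores = [jaccard_index(y_true[..., c], y_pred[..., c])
              for c in range(n_classes)]
    return float(np.mean(scores))

# Toy example: a 2x2 image with 2 classes (hypothetical data).
truth = np.array([[[1, 0], [1, 0]],
                  [[0, 1], [0, 0]]])
pred = np.array([[[1, 0], [0, 0]],
                 [[0, 1], [0, 1]]])
print(mean_jaccard(truth, pred))
```

Because each class contributes one score to the mean regardless of how many pixels it covers, a rare class such as cars weighs as much as a dominant class such as crops.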

By the way, we have also used Git LFS for managing some huge files within a Git workflow.

A Convolutional Neural Network (CNN) for image segmentation

Over the years, many techniques have enabled image segmentation using Convolutional Neural Networks (CNNs). A few recent techniques, up to 2017, are discussed in Meet Pragnesh Shah’s post and also in another post.

Why U-Net?

In our case, using a U-Net is a good choice because of the lack of training data, and it also seems to be the choice of most Kagglers who participated in this satellite imagery competition. This neural network architecture has proven to be very good in this situation. U-Nets are able to learn from low to medium quantities of training data, and the amount of training data available in this competition is considered low. Also, a U-Net can be adapted to recent research, since its architecture is quite similar to the PSPNet or the One Hundred Layers Tiramisu, which are recent improvements, notably for dealing with more data.

How does a U-Net work exactly?

A U-Net is like a convolutional autoencoder, but it also has skip-like connections with the feature maps located before the bottleneck (compressed embedding) layer, so that in the decoder part some information comes from earlier layers, bypassing the compressive bottleneck.

See the figure below, taken from the official paper of the U-Net:


Thus, in the decoder, data is not only recovered from a compression, but is also concatenated with the state of the information before it was passed into the compression bottleneck, so as to augment the context for the decoding layers to come. That way, the neural network still learns to generalize in the compressed latent representation (located at the bottom of the “U” shape in the figure), but also maps its latent generalizations back to a spatial representation with the proper per-pixel semantic alignment in the right part of the U of the U-Net.
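To make the data flow of a skip connection concrete, here is a shape-level sketch in plain NumPy. It is not a trainable U-Net: there are no convolutions or learned weights, and the helper names and the 8×8×4 feature map are assumptions chosen only to show how the decoder input is concatenated with the pre-bottleneck features.

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample an (H, W, C) feature map by taking the max over 2x2 blocks."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour upsampling of an (H, W, C) feature map by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Hypothetical encoder feature map: 8x8 pixels, 4 channels.
encoder_features = np.random.rand(8, 8, 4)

# Encoder path: compress spatially (convolutions are omitted in this sketch).
bottleneck = max_pool_2x2(encoder_features)        # shape (4, 4, 4)

# Decoder path: recover the resolution, then concatenate the skip connection
# so that the next decoding layer sees both sources of information.
decoded = upsample_2x(bottleneck)                  # shape (8, 8, 4)
merged = np.concatenate([encoder_features, decoded], axis=-1)

print(bottleneck.shape, decoded.shape, merged.shape)
```

The concatenation doubles the channel count here; in a real U-Net, the convolutions that follow would mix the fine spatial detail from the skip path with the coarser, more abstract features coming up from the bottleneck.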

You might have guessed that it is called a U-Net because it makes the shape of a U.

To optimally train a U-Net, one needs to considerably augment the dataset. The 3rd place winners used the 4 possible 90-degree rotations, as well as a mirrored version of those rotations, which can increase the training data 8-fold: this set of transformations belongs to the D4 dihedral group. We proceeded with an extra random 45-degree rotation, augmenting the data 16-fold, which represents the D8 dihedral group. Thus, every time an image is fed to the network, it is randomly rotated and mirrored before the network trains on it. In the official U-Net paper, some elastic transforms are also used in the preprocessing. I think it would be interesting to use such data transformations, as depicted in a Kaggle kernel from another similar competition. In our pipeline, we also normalize every channel of the inputs and then apply per-channel randomized linear transformations to every patch.
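The 8-fold rotation and mirroring augmentation mentioned above can be sketched with NumPy as follows. This is a minimal illustration, not the pipeline's actual code: the function names and the random toy data are assumptions, and the extra 45-degree rotation is left out.

```python
import numpy as np

def d4_transforms(image):
    """Return the 8 rotated/mirrored versions of an (H, W, C) image."""
    variants = []
    for k in range(4):                               # 0, 90, 180, 270 degrees
        rotated = np.rot90(image, k=k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))    # mirrored copy
    return variants

def random_d4(image, mask, rng):
    """Apply the same random D4 transform to an image and its label mask."""
    k = int(rng.integers(4))
    mirror = bool(rng.integers(2))
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if mirror:
        image, mask = np.flip(image, axis=1), np.flip(mask, axis=1)
    return image, mask

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))                 # hypothetical 3-band patch
mask = rng.integers(0, 2, size=(4, 4, 1))     # hypothetical 1-class label
aug_image, aug_mask = random_d4(image, mask, rng)
print(len(d4_transforms(image)), aug_image.shape, aug_mask.shape)
```

The important detail is that the image and its mask go through exactly the same transform, otherwise the per-pixel labels no longer line up with the pixels.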

In the case of satellite imaging especially, it is possible to use the CCCI, NDWI, SAVI and EVI indexes. Those are extra channels (bands) computed from the naturally available bands. Such channels are used to extract more information from the images, especially for isolating reflective objects, vegetation, buildings, roads, or water. For example, the 3rd place winners obtained good results for predicting the occurrence of water by directly thresholding those indexes, with the threshold found by a quick search. The four extra channels (from indexes) are shown below, along with the original image’s RGB human-visible channels for comparison:
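For reference, here is a sketch of how such index channels can be computed from band arrays. It uses the commonly published definitions of the water index (NDWI), SAVI, EVI and CCCI; the exact variants and band choices used by the 3rd place winners may differ. The function names, the `EPS` guard, the soil factor of 0.5 and the two toy pixels are assumptions, and the inputs are assumed to be reflectances scaled to [0, 1].

```python
import numpy as np

EPS = 1e-6  # avoids division by zero on very dark pixels

def ndwi(green, nir):
    """Normalized Difference Water Index: high on water, low on vegetation."""
    return (green - nir) / (green + nir + EPS)

def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)

def evi(nir, red, blue):
    """Enhanced Vegetation Index (the denominator is not guarded here)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

def ccci(nir, red, red_edge):
    """Canopy Chlorophyll Content Index: NDRE divided by NDVI."""
    ndre = (nir - red_edge) / (nir + red_edge + EPS)
    ndvi = (nir - red) / (nir + red + EPS)
    # Leave the index at 0 where NDVI is too close to zero to divide by.
    return np.divide(ndre, ndvi, out=np.zeros_like(ndre),
                     where=np.abs(ndvi) > EPS)

# Two hypothetical pixels: [water, vegetation].
blue = np.array([0.06, 0.03])
green = np.array([0.10, 0.08])
red = np.array([0.05, 0.05])
red_edge = np.array([0.03, 0.30])
nir = np.array([0.02, 0.50])

extra_channels = np.stack([ccci(nir, red, red_edge), ndwi(green, nir),
                           savi(nir, red), evi(nir, red, blue)], axis=-1)
water_mask = ndwi(green, nir) > 0.0   # illustrative threshold, not a tuned one
print(extra_channels.shape, water_mask)
```

The last two lines mirror the idea of predicting water by thresholding an index directly; in practice the threshold would be searched for on the labelled data rather than fixed at 0.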

RGB image for computing indexes
CCCI NDWI SAVI EVI indexes from satellite view

Here is the U-Net architecture from the 3rd place winners:


It’s good to reuse winners’ code… but how does one arrive at such code in the first place?

The only open-source code from winners that we found online was from the 3rd place winners. The other winners seem not to have shared their code. The 3rd place winners say that their code is not ready for production. It is provided in a minimal form to reproduce the solution, but arriving at such code in the first place is a long process of refinement and iteration.

According to Andrew Ng, VP & Chief Scientist of Baidu, Co-Founder of Coursera and Adjunct Professor at Stanford University, a neural network architecture, however good and well-fitted it may be for one task, needs to be tuned again to fit another task. Moreover, according to him, one must not hesitate to change the architecture of a neural network: trying new things along the way is good.

It has recently become very clear that, to get a good neural network, training it is not enough, and neither is adjusting the learning rate and hyperparameters. One may also want to adjust the whole architecture of the neural network, automatically. Here is how Andrej Karpathy, Director of AI at Tesla and previously a Research Scientist at OpenAI, describes that recent progress:

Arriving at such code is challenging. I started working on the project before the winners’ official code was publicly available, so I built my own development and production pipeline in that direction, which proves useful not only for solving the problem, but also for coding a neural network that can eventually be transferred to other datasets. For example, the 1st place winner used a very complicated set of custom neural networks to win the competition by a great margin (1st on the private leaderboard AND on the public leaderboard). I guess that he used an automatic system to come up with those neural networks in the first place, though I don’t know exactly what he did.

Taking experts out of the equation

The architecture I managed to develop was first derived from public open-source code released before the end of the competition:

It seems that a lot of participants developed their architectures on top of precisely this shared code, which was both very helpful and a bar raiser that participants had to keep up with on the leaderboard.

Adding improvements taken from that code, and merging it with another project of mine in which I used Hyperopt and Keras, I managed to create the following humongous monster of 71 convolutional layers. In the following image, our “U”-Net is flipped on its side; it was generated automatically by Keras’ visualization tool, which is why it seems skewed:


Hyperopt is used here to automatically figure out the best neural network architecture to fit the data of the given problem, so this humongous neural network has been grown automatically. The high-level architecture is still like a U-Net, albeit a mutant one. Using Hyperopt is thus comparable to taking the expert out of the equation in case the code ever needs to be transferred to another task: it is automatically and lazily maintainable.

In other words, to use Hyperopt, I first needed to define a hyperparameter space, such as the range over which the learning rate can vary, the number of layers, the number of neurons in height, width and depth in the layers, how the layers are stacked, and so on. Running Hyperopt then takes time, but what it does could be compared to using genetic algorithms to perform breeding and natural selection, except that there is no breeding here: just a readjustment based on past trials to pick new trials in a way that balances exploration of new architectures against optimization of the architecture near local maxima of performance.

If you would like to learn how Hyperopt works, you may want to read another blog post of mine, which covers the basics of using it and of defining the hyperparameter space. A deep learning engineer or developer can define the hyperparameter space. From that point on, it should be easy to transfer the project to another task and to hand the code to a team working on the problem. It is also very instructive to then visualize the results of a training run and how the hyperparameters interact.

Once searched, a hyperspace can be refined to narrow down the range in which hyperparameters are tested, in the event of having to restart the meta-optimization. Conversely, if some hyperparameter ranges turn out to be too narrow (e.g., the best points are near the limit of the range), it is still possible to widen them.
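The narrowing and widening described above can be sketched as a small helper. This is a hypothetical heuristic written for illustration, not part of Hyperopt or of the pipeline: the function name, the 10% edge margin and the shrink factor are all assumptions.

```python
def refine_range(low, high, best, edge_fraction=0.1, shrink=0.5):
    """Return a new (low, high) search range for one hyperparameter.

    If the best value found sits within `edge_fraction` of a bound, the
    range is widened on that side; otherwise it is narrowed around the
    best value.
    """
    width = high - low
    margin = edge_fraction * width
    if best - low <= margin:        # best point hugs the lower bound
        return low - width, high
    if high - best <= margin:       # best point hugs the upper bound
        return low, high + width
    half = 0.5 * shrink * width     # narrow around the best value
    return max(low, best - half), min(high, best + half)

# Ranges expressed as log10 of the learning rate (hypothetical values).
print(refine_range(-5.0, -1.0, best=-3.0))   # best in the middle: narrow
print(refine_range(-5.0, -1.0, best=-1.1))   # best near the top: widen
```

Working in log space for scale-like hyperparameters such as the learning rate keeps the refinement symmetric around the best value.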

Having set up a meta-optimization environment such as Hyperopt on an algorithm, the hardest thing in applying the algorithm to another problem should be formatting the new data properly to fit the pipeline. A team workflow can be interesting in which beginners learn to manipulate the data and to launch the optimization, then slowly start modifying the hyperparameter space and eventually add new parameters based on new research. One could also try to merge new research breakthroughs and new neural network layers into that architecture and its hyperspace.


On blending prediction patches: a bit of code for tech lovers

Note that I have also managed to create a module for smoothly blending prediction patches. Fun fact: it requires the use of a 5-dimensional (5D) array. Prediction patches would look jagged once put together without rotation and merging techniques to counter this. See here the difference between using my module and using nothing at all to make the prediction patches smooth:

The code for doing this is available on the company’s GitHub under the MIT License.
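To give an idea of why overlapping and weighting patches removes the jagged seams, here is a simplified NumPy sketch. It is not the module's code: it only shows weighted blending of overlapping patches with a Hann-like window (an assumed choice), and it leaves out the rotation and mirroring merges as well as the 5D array mentioned above. The function name, patch size and stride are also assumptions.

```python
import numpy as np

def blend_patches(image, predict_fn, patch=8, stride=4):
    """Predict on overlapping patches and merge them with a smooth 2D window.

    Assumes (H - patch) and (W - patch) are multiples of `stride`, and that
    predict_fn maps a (patch, patch, C) array to a (patch, patch, K) array.
    """
    h, w = image.shape[:2]
    # Strictly positive Hann-like window: high weight in the centre of a
    # patch, low weight near its borders where predictions lack context.
    w1d = np.hanning(patch + 2)[1:-1]
    window = np.outer(w1d, w1d)[..., None]

    total, weights = None, np.zeros((h, w, 1))
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            pred = predict_fn(image[top:top + patch, left:left + patch])
            if total is None:
                total = np.zeros((h, w, pred.shape[-1]))
            total[top:top + patch, left:left + patch] += pred * window
            weights[top:top + patch, left:left + patch] += window
    # Normalize by the accumulated weights so overlaps average out.
    return total / weights

# Sanity check with an identity "model": blending must give back the input.
image = np.random.rand(16, 16, 3)
blended = blend_patches(image, predict_fn=lambda tile: tile)
print(np.allclose(blended, image))
```

With a real network as `predict_fn`, each output pixel becomes a weighted average of several predictions, dominated by the patches in which that pixel was far from the border.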



Running the improved 3rd place winners’ code, it is possible to get a score worthy of 2nd position, because they recoded their neural networks from scratch after the competition to make the code public and more usable. By merging my predictions for the car classes (which the 3rd place winners dropped) with their predictions, I can get a private leaderboard score of 0.48161: 2nd position.

The results yielded by my custom-grown neural network would have placed me in the top 3.5% of the participants (10th position) if the competition had not been finished. I used the 3rd place winners’ post-processing code, which rounds the prediction masks in a smoother way and corrects a few errors, such as removing building predictions where water is also predicted for a given pixel, and so on, but the neural network behind it is still quite custom.

Possible improvements

It is possible to improve my pipeline’s effectiveness. The 1st place winner appears to have used a different neural network architecture for every class. On my side, I used one neural architecture optimized with Hyperopt on all classes at once, and then took this already-evolved architecture and trained it on a single class at a time. There is therefore an imbalance in the bias/variance of the neural networks, since each is used on a single task rather than on all tasks at once as during meta-optimization. Moving to this “one-vs-all” classifier strategy did bring an improvement; however, it would have been better to use Hyperopt directly on individual classes from the start.

Also, I realized afterwards that I forgot a crucial thing in my hyperoptimization sweeps: a hyperparameter for the initialization of the convolutional weights, for example to enable He’s technique (named “he_uniform” in Keras), which seems to perform best on a variety of U-Net CNNs. My default was always set to “glorot_uniform”, which might have been harmful in this special case of a U-Net architecture.
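To see what the two initializers change, here is a small sketch of the bounds of the uniform distributions they draw from, following the formulas given in the Keras documentation: `sqrt(6 / (fan_in + fan_out))` for “glorot_uniform” and `sqrt(6 / fan_in)` for “he_uniform”. The helper names and the 3×3, 64-to-64 channel layer are assumptions chosen for the example.

```python
import math

def conv_fans(kernel_h, kernel_w, in_channels, out_channels):
    """Fan-in and fan-out of a 2D convolution kernel."""
    receptive_field = kernel_h * kernel_w
    return receptive_field * in_channels, receptive_field * out_channels

def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn uniformly from [-limit, limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """Weights are drawn uniformly from [-limit, limit]."""
    return math.sqrt(6.0 / fan_in)

# Hypothetical 3x3 convolution with 64 input and 64 output channels.
fan_in, fan_out = conv_fans(3, 3, 64, 64)
print(glorot_uniform_limit(fan_in, fan_out), he_uniform_limit(fan_in))
```

When fan-in equals fan-out, He's bound is larger by a factor of √2, so the initial weights have twice the variance, which matters more and more as layers are stacked. In Keras, the choice is made through the `kernel_initializer` argument of a layer, which could be exposed as one more categorical hyperparameter in the search space.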

Moreover, I have omitted the use of mini-networks (except in the inception-pooling layers that I used), such as pictured in Figure 1 of a paper from Google on Inception. In contrast, the 3rd place winners and most of the recent papers on image segmentation use them or something similar:


Implementing mini-networks and combinations of them in the hyperparameter space may help.

Another thing that may increase performance would be to use densely connected convolutional blocks, as depicted in the research paper that obtained the Best Paper Award at CVPR 2017. This new convolutional block can be applied to image segmentation, as in the One Hundred Layers Tiramisu. According to the 3rd place winner, the One Hundred Layers Tiramisu might not help for fitting the DSTL’s dataset, but at least it has a lot of capacity for being applied to larger and richer datasets, thanks to the recently discovered densely connected convolutional blocks. Another 2017 breakthrough is the PSPNet, in which pyramidal pooling is used so as to do convolutions at different scales.

Also, it might be interesting to have a better data preprocessing pipeline, such as using elastic transforms and slight variations in zoom level so as to further augment the data.

Finally, my architecture would perform better if the prediction output were cropped so as to be smaller than the input, and thus profit from a larger receptive field and more context near the edges.

We are currently implementing a few of those ideas in our neural network architecture to improve it.


To sum up, I have managed to reuse the competition winners’ code to obtain good results, but also to create a custom pipeline which can be further used and improved on other datasets and problems.
