Semantic segmentation of urban areas is crucial for many applications. Self-supervised networks require few or no labels for training, which makes them highly appealing. One such network is STEGO [1], which builds upon DINO [2] and, without any labeled data, effectively segments buildings, vegetation, and roads in the ISPRS Potsdam dataset [3]. The resulting segmentations are refined using a Conditional Random Field (CRF). In remote sensing, additional channels such as the Normalized Digital Surface Model (NDSM) can aid segmentation, since pixels of the same class often exhibit similar elevation, especially when adjacent. Because the transformer-based DINO network is built for RGB data, we extend the CRF with NDSM information to overcome this limitation, introducing a second pairwise potential that encourages neighboring pixels with similar elevation to take the same label. For evaluation with both the linear and the cluster probe, we employ Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI), in addition to standard IoU and accuracy metrics, to assess the segmentation against the six classes of the Potsdam dataset. Enhancing the CRF with elevation information improves the mIoU of the cluster probe by 0.83% over the RGB-only baseline, a considerable improvement.
KEYWORDS: RGB color model, Convolution, Buildings, Image segmentation, Computer programming, Image fusion, Data modeling, Near infrared, Vegetation, Data fusion
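The elevation-aware CRF described above can be sketched with the pydensecrf library: alongside the usual smoothness and RGB appearance kernels, a second bilateral term is built from the NDSM channel so that neighboring pixels with similar height are encouraged to share a label. The kernel widths, compatibility weights, and iteration count below are illustrative placeholders, not the values used in the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import (unary_from_softmax,
                              create_pairwise_gaussian,
                              create_pairwise_bilateral)

def refine_with_elevation(softmax, rgb, ndsm, n_labels=6, iters=5):
    """CRF refinement with an extra elevation-based pairwise potential.

    softmax: (n_labels, H, W) class probabilities from the probe head
    rgb:     (H, W, 3) uint8 image
    ndsm:    (H, W) normalized digital surface model
    All kernel widths and compatibilities are illustrative values.
    """
    H, W = ndsm.shape
    d = dcrf.DenseCRF(H * W, n_labels)
    d.setUnaryEnergy(unary_from_softmax(softmax))

    # Standard smoothness and RGB appearance kernels.
    d.addPairwiseEnergy(
        create_pairwise_gaussian(sdims=(3, 3), shape=(H, W)), compat=3)
    d.addPairwiseEnergy(
        create_pairwise_bilateral(sdims=(60, 60), schan=(13, 13, 13),
                                  img=rgb, chdim=2), compat=10)

    # Second pairwise potential: neighboring pixels with similar
    # elevation are encouraged to take the same label.
    elev = ndsm[..., np.newaxis].astype(np.float32)
    d.addPairwiseEnergy(
        create_pairwise_bilateral(sdims=(60, 60), schan=(5,),
                                  img=elev, chdim=2), compat=4)

    Q = d.inference(iters)
    return np.argmax(Q, axis=0).reshape(H, W)
```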
Many deep learning architectures exist for semantic segmentation. In this paper, their application to multi-modal remote sensing data is examined. Two well-known network architectures, U-Net and DeepLab V3+, originally developed for RGB image data, are modified to accept additional input channels such as near infrared or depth information. Both networks use ResNet101 as the backbone, and the data-preprocessing steps, including data augmentation, are identical. We compare the two networks and experiment with different fusion techniques in U-Net and with hyper-parameters for weighting the input channels for fusion in DeepLab V3+. We also evaluate the effect of pre-training on RGB and non-RGB data. The results show a marginally better performance of the DeepLab V3+ model compared to U-Net, while for certain classes, such as vehicles, U-Net yields slightly superior accuracy.