We developed several methods in the framework of remote sensing (RS) image understanding, search and retrieval for fast and accurate information discovery from massive data archives. To achieve accurate RS image representations, we introduced: i) a multi-attention driven approach; ii) a graph-theoretic deep representation learning method; iii) a plasticity-stability preserving multi-task learning method that jointly learns different tasks; and iv) several label-noise robust deep learning (DL) models that reduce the negative impact of noisy land-use and land-cover annotations.
Due to the dramatically increased volume of RS image archives, images are usually stored in compressed format to reduce storage size. Existing content-based RS image retrieval and classification systems require fully decoded images as input, which makes large-scale image retrieval computationally demanding. To overcome this limitation, we developed novel systems, such as: 1) a system that achieves coarse-to-fine progressive RS image description and retrieval in the partially decoded JPEG 2000 compressed domain; 2) a system that performs scene classification with deep neural networks in the JPEG 2000 compressed domain; and 3) a system that achieves simultaneous DL-based image compression and hashing-based image indexing. The developed systems significantly reduce the computational time while achieving retrieval and classification accuracies similar to those of traditional approaches.
To achieve highly time-efficient search within huge data archives, we also investigated deep hashing methods that encode high-dimensional image descriptors into a low-dimensional Hamming space, where the image descriptors are represented by binary hash codes. In this way, the (approximate) nearest neighbors among the images can be efficiently identified based on the Hamming distance with simple bit-wise operations.
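As an illustration only, and not the implementation used in our systems, the Hamming-distance search over binary hash codes can be sketched as follows. The sketch assumes NumPy, and the helper names `pack_codes` and `hamming_search`, the 16-bit code length and the toy archive are all hypothetical choices made for this example.

```python
import numpy as np

# Lookup table: number of set bits in every possible byte value (0..255).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def pack_codes(bits: np.ndarray) -> np.ndarray:
    """Pack an (n, n_bits) array of 0/1 values into rows of uint8 bytes."""
    return np.packbits(bits.astype(np.uint8), axis=1)


def hamming_search(query: np.ndarray, archive: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k archive codes nearest to the query.

    `query` is one packed code of shape (n_bytes,), `archive` holds packed
    codes of shape (n, n_bytes). Distance is the number of differing bits.
    """
    differing = np.bitwise_xor(archive, query)  # XOR marks the bits that differ
    distances = POPCOUNT[differing].sum(axis=1)  # count differing bits per code
    return np.argsort(distances, kind="stable")[:k]


# Toy archive of three 16-bit hash codes (hypothetical values).
archive_bits = np.array([[0] * 16, [1] * 16, [1] * 8 + [0] * 8])
query_bits = np.array([[1] + [0] * 15])

archive = pack_codes(archive_bits)
query = pack_codes(query_bits)[0]

# Distances to the three codes are 1, 15 and 7 bits, respectively.
print(hamming_search(query, archive, k=2))  # prints [0 2]
```

The exhaustive scan above only shows why the comparison is cheap: each distance reduces to an XOR followed by a bit count. For very large archives, the same codes could instead serve as keys of a hash table for approximate search.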
One of the methods that we developed is a metric-learning based hashing network, which learns: 1) a semantic-based metric space for effective feature representation; and 2) compact binary hash codes for fast archive search.
To integrate the feature representations of different RS image modalities into a unified representation, we developed several multi-modal learning methods and tools. As an example, we introduced a self-supervised cross-modal RS image retrieval method that: i) models mutual information between different modalities in a self-supervised manner; ii) keeps the distributions of the modality-specific feature spaces similar to each other; and iii) identifies the most similar images within each modality without requiring any annotated training images. Moreover, we explored the effectiveness of masked autoencoders for sensor-agnostic (modality-agnostic) image search and retrieval in RS, and derived a guideline for exploiting masked image modeling in uni-modal and cross-modal search and retrieval problems in RS.
Most DL models require a huge number of annotated images during training to optimize the model parameters and reach high performance during evaluation. The availability and quality of such data determine the feasibility of many DL models. To address this issue, we introduced benchmark datasets (e.g. BigEarthNet, HySpecNet-11k). BigEarthNet is a large-scale benchmark archive for RS image understanding (it is available at http://bigearth.net) and is the most impactful dataset that we developed. It is made up of 590,326 pairs of Sentinel-1 and Sentinel-2 image patches and supports data-hungry DL algorithms in multi-label RS image retrieval and classification tasks. It thus represents a significant advancement for the use of DL in RS, opening up promising directions for DL-based research on RS image scene classification and retrieval. All the data and the DL models are publicly available, offering an important resource to guide future progress on image scene classification and retrieval problems in RS.