Change Detection in Satellite Imagery With Region Proposal Networks


This article presents a deep learning approach called Proposal-based Efficient Adaptive Region Learning (PEARL) to change detection (CD) in satellite imagery based on region proposal networks. Region proposals are useful for selecting regions of interest in convolutional networks used for object detection and localization. The generated region proposals compare object-like characteristics in target and reference images taken at different times. The effectiveness of the PEARL approach is demonstrated with an airplane dataset generated by physics-based synthetic data from Digital Imaging and Remote Sensing Image Generation (DIRSIG).


Change Detection

The dramatic proliferation of satellite images that require inspection has increased the demand for automating challenging tasks, including CD, to assist analysts. Detecting changes in satellite images of the same location, but observed at different times, is important in surveillance and security, disaster management, demographic estimation, crop monitoring, and other important applications.

However, CD remains a challenging problem because of the wide variation of potential changes of interest, including environment illumination, atmospheric changes, weather conditions, seasonal variations, and other factors such as sensor resolution and noise.  Hence, there is a need for algorithm robustness against changing operating conditions.  To deal with this complexity, data-driven approaches collect as many representative examples as possible, while model-driven approaches generate synthetic data based on the fundamental physics of the sensor, environment, and target (SET) operating conditions.

The task of CD can be broadly categorized into three types: structured, unstructured, and semistructured. Structured CD works by identifying changes in known object classes (e.g., the movement of vehicles or aircraft on the ground). These known classes are used to label data and are explicitly included in the training dataset seeding the algorithms. The Siamese network shown in Figure 1 is an example of a model [1] trained to detect vehicles, helicopters, or other objects of interest. Unstructured CD is more general and difficult, as the desired objects are unlabeled and include wider variations, such as piles of materials (dirt or coal), construction sites, or containers at a port. The difficulty with unstructured CD arises because the changes of interest are not anticipated in advance or incorporated in the training data. Semistructured CD approaches use a small set of labeled data from a subject matter expert to seed the algorithm, which then learns subtle variations without requiring a large amount of labeled data for training.

DIRSIG Data Generation

Another challenge with CD is that datasets for algorithm training and testing are not readily available, and collecting and annotating them is a laborious task. We address this need by utilizing DIRSIG [2] to generate datasets for training or testing our models. DIRSIG is a physics-driven synthetic image generation model developed by the Digital Imaging and Remote Sensing Laboratory at RIT. The model can produce passive single-band, multispectral, or hyperspectral imagery from the visible through the thermal infrared region of the electromagnetic spectrum.

DIRSIG is a first-principle radiative transfer image and data simulation tool.  It generates airborne and satellite imagery through unified path-tracing approaches for light transport.  This process generates sensor-reaching radiance from geometric objects and the Earth’s surface, attributed with reflectivity and emissivity models.  The model can simulate many modalities, including panchromatic; red, green, blue (RGB); multi- and hyperspectral; thermal; polarization; low-light level; light detection and ranging (LIDAR); and synthetic aperture radar.  Example imagery of DIRSIG outputs can be seen in Figure 2.

Sensor response functions and noise models enhance the data to produce realistic imagery. Training chips are easily created using the DIRSIG Chip Maker plugin. This plugin simplifies and automates the generation of training images (e.g., with various view angles, illumination angles, and ground sample distances). Unlike game engines, DIRSIG does not drop detail in order to maintain frame rate. Additionally, fine-grained sensor control provides detail for the synthetic sensor. Over the years, DIRSIG has been evaluated in many verification and validation studies [3].

High-Performance Computing (HPC)

DIRSIG requires large computational resources, so the U.S. Air Force ported DIRSIG to an HPC platform [4]. The High Performance Computing Modernization Office (HPCMO) provided HPC systems and expertise to install DIRSIG across U.S. Department of Defense (DoD) systems. The HPCMO offers free HPC access to all DoD civilians, military, and contractors across all the Services. To request access, a DoD Supercomputing Resource Center (DSRC) account needs to be established. After a DSRC account is established, a user can request to be added to the DIRSIG group by submitting a help desk ticket.

The DIRSIG-HPC has three advantages—processing speed, testing agility, and scene variety. DIRSIG scales across HPC systems easily because DIRSIG 5 now supports multiple cores. A user with DSRC access can render several hundred frames in parallel with DIRSIG 5 using PBS Job Arrays. DIRSIG 5 also includes a Chip Maker module built specifically for training artificial intelligence (AI)/machine-learning (ML) systems. The data generated by DIRSIG can be tailored to match imagery obtained from a particular sensor or satellite platform.

A big advantage of using DIRSIG for AI/ML research is having perfect truth: the simulations precisely record what has changed because the geometry of the three-dimensional (3-D) model used to generate the synthetic data is fully controlled. Another reason for using DIRSIG is that we can generate a limitless amount of training data. In our previous work [1, 5], we used DIRSIG-generated pairs of helicopters and backgrounds to train and test our models.

Change Detection With Region Proposal Networks

Our PEARL approach to CD uses deep learning methodologies that leverage the strengths of popular network architectures. Deep convolutional neural networks (CNNs) demonstrating breakthrough advances in image classification [6–8] have been used in various applications, including face recognition [9], semantic segmentation [10], object tracking [11], and CD [1, 12]. In our work, we build a simple and scalable model for CD that uses region proposal networks to generate a change map between image pairs. We demonstrate the performance of PEARL on a DIRSIG-generated CD dataset of airplanes.

The key insight behind PEARL is that an image pair of the same location with no major structural changes will have similar object proposals generated in both images. On the other hand, a pair of images with major structural changes (e.g., presence of a plane) will yield different object proposals. After the region proposals are generated for the two images, a change map is constructed and analyzed to make a change or no-change decision.

The change map highlights arrivals, departures, and deviations.  PEARL takes advantage of the proposal information before the final classification stage so that the detected changes are less specific to particular object classes.  The initial stages of our investigation show promise of the PEARL method’s potential to generalize toward unstructured CD, semistructured activity monitoring, and multimodal transfer learning analysis.


Siamese networks are promising for CD [1], object tracking [13], and person reidentification [14]. The Siamese models use two identical network channels (with shared weights) to generate feature maps that are further processed for the application of interest. The network channels used for CD in Rahman et al. [1] consist of the convolutional part of pretrained Visual Geometry Group (VGG) object detectors [6]. The VGG features feed a decision network trained to identify changes, as illustrated in Figure 1.

Region Proposal Generation

PEARL utilizes region proposals, which are rectangles identifying the regions of the image most likely to contain objects [15].  Region proposals were incorporated in the Faster R-CNN architecture [8] shown in Figure 3, which is a popular architecture for object detection. Faster R-CNN is a region-based convolutional neural network composed of a feature extractor network, the Region Proposal Network (RPN), and a fully connected network used for detection. The typical feature extractor is a CNN (e.g., ResNet [7]) pretrained on the ImageNet dataset [16].

After receiving the feature maps, the RPN generates a number of bounding boxes, called regions of interest (ROIs), which have a high probability of containing an object. The RPN generates region proposals based on anchors, which are boxes of different sizes and aspect ratios. For each window on the feature map, nine anchors based on three different scales and three aspect ratios are generated.
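The anchor layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the configuration used for PEARL: the scales (128, 256, 512), the aspect ratios (0.5, 1, 2), and the feature-map stride of 16 pixels are the defaults reported for Faster R-CNN [8] and are assumed here, and the function names are hypothetical.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the anchors (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2 and height/width equal to ratio.
    Scales and ratios are assumed defaults, not values from the article.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def anchors_for_feature_map(fm_h, fm_w, stride=16):
    """Generate anchors for every cell of an fm_h x fm_w feature map.

    The stride (image pixels per feature-map cell) is an assumption.
    """
    all_anchors = []
    for row in range(fm_h):
        for col in range(fm_w):
            # Center of the feature-map cell in image coordinates.
            cx = col * stride + stride / 2
            cy = row * stride + stride / 2
            all_anchors.extend(make_anchors(cx, cy))
    return all_anchors

anchors = anchors_for_feature_map(2, 3)
print(len(anchors))  # 2 * 3 locations * 9 anchors = 54
```

With three scales and three aspect ratios, every feature-map location contributes nine candidate boxes, which is why the number of raw proposals grows quickly with image size.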

The last stage of the network has a classifier and a regressor.  The classifier takes each of the nine anchor boxes for each location and determines if the object belongs to the background or foreground.  The regressor predicts four coordinates of a bounding box relative to each anchor.  The process generates many proposals for a single image.  The RPN then sorts the proposals to find the ones with the highest probability of containing an object.  Sorting includes applying nonmax suppression to keep the proposals with the highest confidence.
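As a rough sketch of how the regressor output relates to an anchor, the snippet below decodes four offsets into a box and keeps the highest-scoring candidates. It assumes the box parameterization published for Faster R-CNN [8] (center offsets scaled by the anchor size, log-scaled width and height); the function names and the value of `k` are hypothetical and are not taken from the PEARL implementation.

```python
import math

def decode_box(anchor, deltas):
    """Apply regression offsets (tx, ty, tw, th) to an anchor (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + aw / 2, ay1 + ah / 2
    tx, ty, tw, th = deltas
    # Center shifts are relative to the anchor size; sizes are log-scaled.
    cx = acx + tx * aw
    cy = acy + ty * ah
    w = aw * math.exp(tw)
    h = ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def top_proposals(anchors, deltas, scores, k):
    """Decode every anchor and keep the k boxes with the highest objectness score."""
    decoded = [(s, decode_box(a, d)) for a, d, s in zip(anchors, deltas, scores)]
    decoded.sort(key=lambda item: item[0], reverse=True)
    return decoded[:k]
```

Zero offsets return the anchor unchanged, so the regressor only has to learn small corrections around each anchor.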

Nonmax suppression is a method that reduces multiple detections of the same object to just one.  It starts with the most confident detection and then looks at all the remaining detection rectangles to determine which ones have significant overlap.  The detections with large overlap are suppressed with the assumption that they are generated from the same object.  This procedure yields a small number of detections per object.  The ROIs produced from the RPN are rectangles of different sizes.  However, if their intended use is for further CNN processing, they have to be a predetermined, fixed size.  The ROI pooling layer splits the input feature map to a fixed number of equal regions, and max pooling applies to every region.
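The greedy suppression procedure described above can be sketched as follows. Overlap is measured here with intersection over union (IoU), and the threshold of 0.5 is an illustrative assumption; the article does not state which overlap measure or threshold was used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In this toy example the second box overlaps the first almost entirely and is suppressed, while the third box, which covers a different location, is retained.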

These fixed feature regions are input into a classifier and regressor. The classifier infers the class and probability of the object belonging to that class. The regressor provides the bounding box coordinates after regression. The detected objects are marked with bounding boxes, which are referred to as region proposals.
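The ROI pooling step that produces these fixed-size feature regions can be illustrated with a small sketch. It operates on a single-channel feature map stored as a list of rows, uses a 2 x 2 output grid, and splits the ROI with integer bin boundaries; all of these are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (r1, c1, r2, c2) into an out_h x out_w grid.

    Row and column ends are exclusive, and the ROI is assumed to lie
    inside the feature map and to contain at least one cell.
    """
    r1, c1, r2, c2 = roi
    height, width = r2 - r1, c2 - c1
    pooled = []
    for i in range(out_h):
        # Bin boundaries; every bin covers at least one row.
        rs = r1 + (i * height) // out_h
        re = max(rs + 1, r1 + ((i + 1) * height) // out_h)
        row = []
        for j in range(out_w):
            cs = c1 + (j * width) // out_w
            ce = max(cs + 1, c1 + ((j + 1) * width) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled

feature_map = [[0, 1, 2, 3],
               [4, 5, 6, 7],
               [8, 9, 10, 11],
               [12, 13, 14, 15]]
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[5, 7], [13, 15]]
```

Whatever the size of the ROI, the output always has the same shape, which is what allows the subsequent fully connected layers to accept it.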

CD With Region Proposals

Figure 4 presents the CD process. A Faster R-CNN network examines each image pair, consisting of the target and reference images, and generates the bounding box coordinates of the proposals. Proposal overlap maps are then created by accumulating the contributions of each region proposal box. The proposal overlap maps provide information about where in the image objects are most likely to exist. These object maps are generated for both the target and reference images in the image pair and are then compared using mean square error (MSE). No-change image pairs have a low MSE, as the same objects are present at the same location in the image pair. Change image pairs have a higher MSE, which indicates a significant structural change between the two images.
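The comparison step can be sketched as follows. This is a simplified illustration under stated assumptions, not the PEARL implementation: each proposal box adds a constant weight of one to every pixel it covers (the article does not specify the weighting), box coordinates are treated as end-exclusive pixel indices, and the decision threshold is a hypothetical parameter that would need to be tuned on data.

```python
def proposal_overlap_map(boxes, height, width):
    """Accumulate one count per pixel for every proposal box covering it."""
    overlap = [[0.0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the image bounds before accumulating.
        for y in range(max(0, int(y1)), min(height, int(y2))):
            for x in range(max(0, int(x1)), min(width, int(x2))):
                overlap[y][x] += 1.0
    return overlap

def mean_square_error(map_a, map_b):
    """Mean of the squared per-pixel differences between two equal-size maps."""
    total, count = 0.0, 0
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count

def detect_change(ref_boxes, tgt_boxes, height, width, threshold):
    """Return (changed, mse) for a reference/target pair of proposal lists."""
    ref_map = proposal_overlap_map(ref_boxes, height, width)
    tgt_map = proposal_overlap_map(tgt_boxes, height, width)
    mse = mean_square_error(ref_map, tgt_map)
    return mse > threshold, mse

# Toy 4 x 4 example: a proposal appears only in the target image.
print(detect_change([], [(0, 0, 2, 2)], 4, 4, threshold=0.1))  # (True, 0.25)
```

Identical proposal sets give an MSE of zero, while a proposal that appears in only one image raises the MSE in proportion to the area it covers. For full-size images, an array library would be a more practical choice than nested Python loops.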


Training and Testing Datasets

The Faster R-CNN network was originally trained on ImageNet [16], which contains ground-level imagery.  However, these types of images are not well suited for remote sensing applications. Since we had to generate a CD map for satellite imagery, we utilized a version of Faster R-CNN [17] trained on the Dataset for Object Detection in Aerial Images (DOTA) [18]. (DOTA is a large-scale dataset for object detection that uses Google Earth images of various locations around the globe.) The DOTA image sizes ranged from 800 x 800 to 4000 x 4000 pixels, with objects in various scales, orientations, and shapes. Annotating the images resulted in 15 object categories, where the most important classes of interest for CD were helicopters, planes, and ships.

The DIRSIG planes dataset was the CD dataset used for testing because DOTA is not a CD dataset and could not be used for testing our method. The planes dataset generated using DIRSIG consisted of scenes with target and reference images measuring 512 x 512 pixels. Different backgrounds and illuminations, with spectral and structural properties similar to commercial satellite imagery, enhanced the dataset.  The objects used were four types of planes on different backgrounds and illuminations.  The dataset consisted of 1,358 images.  Each image had a change and no-change counterpart, making the total number of image pairs 2,716.  Representative DIRSIG planes images are shown in Figure 5.

Change Detection Results

We applied our CD methodology to the DIRSIG planes dataset. Examples of the region proposal bounding boxes for change and no-change pairs are shown in Figure 6, where the left, center, and right images represent the original airplane rendering and the background and plane at different illuminations. After contrast stretching the images and applying nonmax suppression, the accuracy of change/no-change detection based on the MSE of the overlap maps was 97%. (Note that the changes in illumination did not have a strong effect on generating region proposals.) Additionally, no proposals were generated for the images without planes. These results highlight the advantages of PEARL, including its robustness to illumination changes and small variations.
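The article does not state which contrast-stretching method was applied before proposal generation. As one common possibility, a linear min-max stretch on a single-band image can be sketched as follows; the function name and the output range of 0 to 255 are assumptions made for illustration.

```python
def contrast_stretch(image, out_min=0.0, out_max=255.0):
    """Linearly rescale pixel values so they span [out_min, out_max].

    The image is a list of rows of numeric pixel values (single band).
    """
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: nothing to stretch, so map every pixel to out_min.
        return [[out_min for _ in row] for row in image]
    scale = (out_max - out_min) / (hi - lo)
    return [[out_min + (p - lo) * scale for p in row] for row in image]
```

Stretching maps the darkest pixel to the bottom of the output range and the brightest to the top, which can reduce the effect of global illumination differences between the target and reference images.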


We presented a new approach for CD in satellite imagery motivated by deep learning architectures for region proposal networks. Our PEARL method had a high success rate in detecting structural changes in the DIRSIG-generated planes dataset. The performance of PEARL was robust to changes in illumination and small variations. However, the proposals generated by current networks were influenced by the object types contained in the training dataset (e.g., ground-level images in ImageNet vs. satellite images in DOTA). Thus, suitable datasets are still needed for training and testing CD methods. For this reason, synthetic dataset generation with DIRSIG was of great value.


This research was funded in part by Kitware Inc., the U.S. Air Force Research Laboratory (AFRL), Air Force Office of Scientific Research (AFOSR), and the Center for Emerging and Innovative Sciences, an Empire State Development-designated Center for Advanced Technology.

  1. Rahman, F., B. Vasu, J. Van Cor, J. Kerekes, and A. Savakis. “Siamese Network With Multi-Level Features for Patch-Based Change Detection in Satellite Imagery.” Proceedings of the 6th IEEE Global Conference on Signal and Information Processing (GlobalSIP), Anaheim, CA, November 2018.
  2. Goodenough, A., and S. D. Brown. “DIRSIG5: Next-Generation Remote Sensing Data and Image Simulation Framework.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, November 2017.
  3. “DIRSIG Verification and Validation Studies.” Accessed 4 August 2019.
  4. Hardin, D. “Air Force’s Newest Supercomputer Supports Game-Changing Research.” Accessed 4 August 2019.
  5. Han, S., A. Fafard, J. Kerekes, M. Gartley, E. Ientilucci, A. Savakis, C. Law, J. Parhan, M. Turek, K. Fieldhouse, and T. Rovito. “Efficient Generation of Image Chips for Training Deep Learning Algorithms.” SPIE Defense and Commercial Sensing, Automatic Target Recognition XXVII, April 2017.
  6. Simonyan, K., and A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” International Conference on Learning Representations (ICLR), 2015.
  7. He, K., X. Zhang, S. Ren, and J. Sun. “Deep Residual Learning for Image Recognition.” Computer Vision and Pattern Recognition (CVPR), 2016.
  8. Ren, S., K. He, R. Girshick, and J. Sun. “Faster R-CNN: Towards Real-Time Object Detection With Region Proposal Networks.” Advances in Neural Information Processing Systems, 2015.
  9. Schroff, F., D. Kalenichenko, and J. Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” Computer Vision and Pattern Recognition (CVPR), 2015.
  10. Long, J., E. Shelhamer, and T. Darrell. “Fully Convolutional Networks for Semantic Segmentation.” Computer Vision and Pattern Recognition (CVPR), 2015.
  11. Minnehan, B., A. Salmin, K. Salva, and A. Savakis. “Benchmarking Deep Learning Trackers on Aerial Videos.” SPIE Defense and Security Symposium, Pattern Recognition and Tracking XXIX, April 2018.
  12. Daudt, R. C., B. Le Saux, and A. Boulch. “Fully Convolutional Siamese Networks for Change Detection.” IEEE International Conference on Image Processing (ICIP), 2018.
  13. Bertinetto, L., J. Valmadre, J. F. Henriques, et al. “Fully-Convolutional Siamese Networks for Object Tracking.” Proceedings of the European Conference on Computer Vision (ECCV), pp. 850–865, Springer, 2016.
  14. Chung, D., K. Tahboub, and E. J. Delp. “A Two Stream Siamese Convolutional Neural Network for Person Re-identification.” International Conference on Computer Vision (ICCV), 2017.
  15. Yang, F., H. Fan, P. Chu, E. Blasch, and H. Ling. “Clustered Object Detection in Aerial Images.” International Conference on Computer Vision (ICCV), 2019.
  16. Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database.” Computer Vision and Pattern Recognition (CVPR), 2009.
  17. Zhu, Z., and J. Ding. “DOTA Faster R-CNN.” Accessed 4 August 2019.
  18. Xia, G. S., et al. “DOTA: A Large-Scale Dataset for Object Detection in Aerial Images.” Computer Vision and Pattern Recognition (CVPR), 2018.

ANDREAS SAVAKIS is a professor of computer engineering and Director of the Center for Human-aware Artificial Intelligence at RIT. His research interests include computer vision, deep learning, machine learning, and applications. Prior to joining RIT, he was with Kodak Research Labs. He has coauthored over 120 publications and holds 12 U.S. patents. Dr. Savakis received his Ph.D. in electrical and computer engineering from North Carolina State University.

NAVYA NAGANANDA is a research assistant at RIT’s Vision and Image Processing Lab under the supervision of Prof. Savakis. Her research interests are in deep learning, image analysis, and machine learning. Ms. Nagananda received her B.S. and M.S. degrees in physics from Indian Institute of Science Education and Research, Thiruvananthapuram, India, and is pursuing her Ph.D. in imaging science at RIT.

JOHN P. KEREKES is a professor in the Chester F. Carlson Center for Imaging Science at RIT. His research interests include remote sensing system analyses and performance sensitivity studies using simulation and modeling techniques. Prior to joining RIT, he was a staff member at the MIT Lincoln Laboratory. Dr. Kerekes received his B.S., M.S., and Ph.D. degrees in electrical engineering from Purdue University.

EMMETT J. IENTILUCCI is an assistant professor at RIT. He has been a program reviewer for the National Aeronautics and Space Administration and the DoD and is Chair for both the Society of Photographic Instrumentation Engineers Imaging Spectrometry Conference and the Western New York IEEE Geoscience and Remote Sensing Society. He has over 65 publications and is currently working on a textbook titled “Radiometry and Radiation Propagation.” Dr. Ientilucci received his Ph.D. in imaging science from RIT.

RUSSELL (RUSTY) BLUE is a technical lead in the Computer Vision program at Kitware Inc. His work focuses on leading a team of developers on Kitware’s efforts across several government programs in visual content analysis, including the U.S. Air Force Research Laboratory (AFRL) Phase III VIGILANT Small Business Innovation Research (SBIR) and AFRL Do-It-Yourself AI efforts, which both have significant deep learning components. Previously, he developed advanced visualizations in scientific and medical computing based on Kitware’s open-source Visualization Toolkit.

WILLIAM HICKS is a senior research and development engineer in the Computer Vision department at Kitware Inc. His work focuses on automated analysis of overhead imagery through the AFRL Phase III VIGILANT SBIR, as well as content-based retrieval across a range of government-sponsored programs.

TODD V. ROVITO is a senior research computer scientist, Decision Science Branch, Multi-Domain Sensing Autonomy Division at AFRL, Wright-Patterson Air Force Base, OH. He works on remote sensing exploitation research and is currently focusing on passive 3-D reconstruction and reasoning and deep learning object detection from commercial space satellite systems.

ERIK BLASCH is a program officer at AFOSR supporting research in dynamic data-driven applications systems. He has been with AFRL as a civilian and reservist since 1996, compiling 700+ scientific papers and 19 patents. He is also the author of multiple books and a SPIE and IEEE Fellow, as well as an AIAA Associate Fellow. Dr. Blasch earned a bachelor’s degree in mechanical engineering from MIT; master’s degrees in mechanical, health science, and industrial engineering (human factors) from Georgia Tech; and an MBA, master’s degrees in electronics and economics, and Ph.D. in electrical engineering from Wright State University. He is also a graduate of the Air War College.