Found In The Desert
A humanitarian use-case for computer vision: image geolocation as an aid in desert wilderness search and rescue.
Every year, hundreds of people die trying to cross a particularly dangerous section of the Sonoran Desert in Southern Arizona. Injury, heat stroke, dehydration or disorientation quickly become life-threatening in this environment, with few options for communication or rescue. In some cases, people in distress find one of the scattered patches of cell phone coverage and text a photo of their surroundings to friends who then pass it on to search and rescue (SAR) groups.
The people crossing are mostly undocumented migrants who, unlike most people who require search and rescue help, are generally trying to hide from authorities. To that end, they use an end-to-end encrypted messaging service — typically WhatsApp — to communicate with contacts in the U.S. Unfortunately, these messaging apps strip the location metadata from images as they are sent. Without GPS data, there is no quick way to determine the subject’s location from the photograph. Painstaking investigation of satellite images can sometimes provide broad location clues, as can the personal experience of humanitarian aid volunteers who might recognize a particular landmark in a photo. But in most cases, the guessed locations provide only rough guidance to searchers. This means a different way to rapidly geolocate these texted images for SAR purposes is necessary.
The geographical area in question, from the U.S.–Mexico border north to I-8 and from Growler Valley east to the Sauceda Mountains, encompasses about 3,000 square miles. In some cases, where there is a prominent landmark in a photo, a location can be narrowed down to 20–50 square miles. This is still an impossible scale for any search team, especially when a person without water may die within a day or two during the summer and the rescuers themselves face the same hazards working in this extreme environment.
In a different scenario, a person will emerge after days in the desert with photographs of the location where someone was left behind who may still be alive. Even if not found alive, a timely remains recovery can provide closure to the family.
In either case — a person texting for help or requesting help for someone else later — the goal of this project is to apply existing computer vision tools to quickly narrow the search area with maximum possible accuracy, increasing the odds of a person being found alive while also reducing risks to rescuers.
The problem — resolving ground view vs. satellite view — is specifically called “cross view matching” in the field of image geo-localization. In practice, it would look something like this: A lost person texts a photo of their location to a SAR group. The photo is passed on to a computer scientist who then enters it into a computer-vision model.* The model will have been trained on a database of actual photographs and satellite images of the region, allowing it to infer the GPS coordinates of the person who took the photo.
Below are three photos of the same place: the Google Earth satellite photo with my location pinned, the Google Earth virtual “street view” from that point facing toward the dark ridge to the north, and the actual photographic image I shot with my phone camera from the pinned spot, facing the same direction as the street view computational image.
A properly trained computer model, if given Figure 3 (except without GPS or other metadata), could identify the location (the “camera pose”) with reasonable accuracy and guide rescuers to the coordinates in Figure 1. After inputting the satellite imagery for the search area, the computer will have a picture of the desert in fine-grained detail. Each pixel carries elevation data, so it can create a virtual mountain range with very accurate topography, shading (based on season and time of day) and surface texture (rocky, sandy, scattered large cactus, etc.).
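To make the “elevation data in every pixel” idea concrete, here is a small sketch of Horn’s hillshading method, the standard way terrain renderers turn an elevation grid into the shaded relief you see in satellite-derived views. Everything here is illustrative: a real pipeline would read a DEM raster with a library such as GDAL rather than a hand-built list of lists, and `cell` (ground distance per pixel) and the sun angles are hypothetical values.

```python
import math

def hillshade(elev, cell, sun_alt_deg=45.0, sun_az_deg=315.0):
    """Shade each interior cell of an elevation grid (meters) using
    Horn's method: brightness depends on how squarely each slope
    faces the sun. `cell` is the ground distance per pixel."""
    zen = math.radians(90.0 - sun_alt_deg)   # sun zenith angle
    az = math.radians(sun_az_deg)            # sun azimuth (315 = NW)
    rows, cols = len(elev), len(elev[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # slope components from central differences
            dzdx = (elev[r][c + 1] - elev[r][c - 1]) / (2 * cell)
            dzdy = (elev[r + 1][c] - elev[r - 1][c]) / (2 * cell)
            slope = math.atan(math.hypot(dzdx, dzdy))
            aspect = math.atan2(dzdy, -dzdx)
            shade = (math.cos(zen) * math.cos(slope)
                     + math.sin(zen) * math.sin(slope) * math.cos(az - aspect))
            out[r][c] = max(0.0, shade)      # clamp shadowed faces to 0
    return out
```

Re-running this with different sun angles is how the rendering can reflect season and time of day, as described above.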
A human searcher can also scour satellite images click by click, but this takes a massive amount of time, familiarity with the landscape and lucky guessing as to where to start. Furthermore, the eye-level appearance of the ground surface when viewing satellite images in street view is massively distorted by pixel stretching** (Figure 2), which makes it extremely difficult to pinpoint the camera person’s exact coordinates. The computer model, on the other hand, would “see” ground-surface details from above (the overhead or “nadir” view) the same way a human can, but would also be trained to recognize what any given slope looks like from ground level. This is part of the “training” of the model (see below).
The first challenge will be the lack of a large, thorough image set for the region in question. Neural networks used for cross-view matching are typically trained on large, publicly available image sets such as Flickr or Google Street View. The area in question is remote desert wilderness, so no such handy image set exists. It would have to be created, but this is not as difficult as it might seem. Humanitarian aid volunteers (doing water drops, SAR, reconnaissance and recreation) hike through this area regularly. Each trip could have one or more people designated to snap photos every few hundred yards or so, facing in all directions. Since the texted images in question are simple phone snaps, no special camera equipment is required to generate a database of similar images. Within a few months, thousands of images, covering diverse areas of the landscape at different times of day, could easily be amassed.
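One small but essential step in building that database is pulling the GPS fix out of each phone snap’s metadata. Phone cameras store latitude and longitude in EXIF as degree/minute/second tuples plus a hemisphere letter; the sketch below shows only the standard conversion to signed decimal degrees. (Reading the EXIF tags themselves would be done with a library such as Pillow; the function name and example values here are hypothetical.)

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert the degree/minute/second GPS tuple stored in a photo's
    EXIF metadata into a signed decimal coordinate.
    `ref` is the hemisphere letter: 'N', 'S', 'E' or 'W'."""
    dec = degrees + minutes / 60.0 + seconds / 3600.0
    # South and West are negative by convention
    return -dec if ref in ('S', 'W') else dec
```

Each training image would then be stored alongside its decimal latitude/longitude, compass heading, tilt and timestamp, exactly the metadata fields listed below.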
It is not necessary to photographically capture every single possible view of this landscape for a computer model to be able to identify an exact location. The process is not simple image matching or content identification as in the familiar bot-screening CAPTCHAs where we are asked to “click all squares with a bicycle”. Rather, the model would work by inference, “learning”*** what the eye-level view of a satellite image should look like after being trained on a relatively small number of actual eye-level photographs (e.g. Figure 3). (I’ve seen studies that demonstrated this, so I know it’s possible.) This is important because covering every possible view of every canyon, ridge, flat, outcrop and wash over 3,000 square miles is impractical if not impossible. Building a sparse but representative training database of diverse views of these geological features and surface vegetation patterns is easy.
It works something like this: The model is given the complete satellite imagery as its reference. From that dataset, it generates a complete “street view” landscape — i.e. virtual, distorted eye-level imagery (like in Figure 2). In other words, it’s creating the Google Earth street view (Figure 2) because it already has enough information in each pixel to do so. Then a series of photographic training images (the phone snaps volunteers collect; e.g. Figure 3) with complete metadata (GPS, direction, tilt, time, date, lens data, etc.) are input to the model. Using the example of Figure 3, the model would generate the computational/virtual eye-level view (Figure 2) based on the GPS and other metadata, then compare the computed image with the photographic image and store that comparison in its memory.
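The “compare and store” step above is usually implemented by training the network so that a photo’s descriptor (the vector of numbers the network boils an image down to) ends up close to the descriptor of the rendered view from its true GPS point, and far from renders of other spots. A common way to express that is a triplet loss; this is a toy, stdlib-only sketch of the idea, not the code any particular paper uses, and all the vectors are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptor vectors (1 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def triplet_loss(photo, rendered_same_spot, rendered_other_spot, margin=0.3):
    """Zero once the photo's descriptor is already closer (by at least
    `margin`) to the rendered view of its true location than to a
    rendered view of some other location; positive otherwise."""
    return max(0.0, margin
               - cosine(photo, rendered_same_spot)
               + cosine(photo, rendered_other_spot))
```

Training nudges the network’s weights to drive this loss toward zero over many (photo, true render, wrong render) triples, which is what “learning what the eye-level view of a satellite image should look like” amounts to in practice.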
After doing this for enough images, it could be given an image without the metadata (e.g. Figures A, B; called a query image), generate a computational image, compare it against the reference database of computed images already generated for the entire landscape, and infer the location from which the texted image was taken (inferring the location like this is called “regressing to pose”). From what I’ve read, I’m convinced that in many cases this could be accurate to an area the size of a football field or less.
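At query time, the simplest version of this lookup is a nearest-neighbor search: compare the query photo’s descriptor against one precomputed descriptor per rendered viewpoint and return the pose of the best match. A minimal sketch, with hypothetical coordinates and tiny made-up descriptors standing in for the real high-dimensional ones:

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def regress_to_pose(query_descriptor, reference):
    """`reference` holds one (lat, lon, heading_deg, descriptor) entry per
    rendered viewpoint across the search area. Return the pose of the most
    similar descriptor -- the model's best guess at where the photo was taken."""
    lat, lon, heading, _ = max(
        reference, key=lambda entry: cosine(query_descriptor, entry[3]))
    return lat, lon, heading
```

A real system would refine this coarse retrieval (e.g. interpolating between nearby viewpoints), but the retrieval step is what bounds how tightly the search area can be narrowed.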
(I’m guessing here that what I’ve described above is the most likely process. But I see how it could go the other direction. Instead of turning photographic images into computational ones and searching a computational landscape, the program could wallpaper its computational landscape with photorealistic surfaces and then compare the real photo to the photorealistic, computed landscape in its little brain. That is what I’d assumed would be the case, but the papers I read seem to go the other direction.)
The desert landscape has the disadvantage of not having obviously identifiable elements of a built environment to match with such as buildings or highways. But I think that the desert environment has significant other advantages, like being able to see and clearly identify individual, unique-looking cactus and rocks in satellite imagery and being able to see sharp ridge lines and other topographic features from distances up to 20 miles.
I’m hoping that someone in the computer vision world takes an interest in this. From reading enough papers, it’s obvious to me that most of this work has already been done and only needs to be applied in this new context. Please contact me [email protected] or via text/Signal +1 510-334-8194.
A couple of notes:
“Why not just put the photos online and use an OSINT model?”
Context is everything. If we were looking for normal lost hikers, this would probably work very well. The problem is that, while these people for the most part want to be found and rescued, they are also politically vulnerable since most are crossing without documentation. Right-wing militias and other vigilantes operate in the borderlands area and could use the photos to locate and assault these already vulnerable individuals.
While it is unlikely that vigilantes would be organized enough to identify a lost person’s location from a photo, grab their guns and hike far into the desert, the mere perception in the migrant and activist community that this is possible would be enough to deter people from ever sharing the photos. What would be more likely is that vigilantes would spam the channel with hate, DDOS attempts and false “leads” about the likely location of a photo.
“How about a private, vetted OSINT group?”
This is the backup plan, but it takes significant organizational resources that are likely beyond my lone ability at the moment. It would look like this: a webmap would be created with the landscape divided up by zones. Each basin and range would either have its own zone or, for the larger physiographic features, multiple zones. Once volunteers were vetted, they’d have secure login access to the map. When photo geolocalization was required, an email/text would be sent out. People would volunteer to take a zone and begin searching through satellite images and mark their best guesses on the map. A secure chat function would allow discussion. People with local experience would winnow the location guesses down to the top few, then identify the most likely match and begin a search.
“How would SAR volunteers access the model? How would a person with a photo from a lost friend contact the group?”
I had originally conceived of the model having a user-friendly web interface. I no longer think this is practical or necessary, though maybe it is something that could be done once the process has proven itself. All that is needed is a secure channel for photos to be sent to whoever understands how to operate the model. I’m guessing that this process is computationally intensive and, unless someone wants to develop a GUI for it, is best left to the computer scientists who developed it.
Currently, multiple humanitarian aid and SAR groups have online presences, and this is how people pass on photos and info about lost people. Having a simple one-page site devoted to explaining the project, with contact info for humanitarian aid groups, would be useful. This could also serve as a portal for random volunteers to upload their own phone snaps to be included in the training database. I don’t think it would be a good idea to have people upload photos sent by lost people, since this could easily be abused by vigilantes to send searchers on wild goose chases. When the photos are sent directly to the humanitarian aid groups embedded in the migrant communities, it’s easier for them to filter out this kind of sabotage.
A few more examples of photographic and computational images
The following image pairs are from my own random desert hikes. The first is the iPhone photo and the second is the Google Earth “street view” based on the GPS and other metadata recorded with the image. After a little experimentation, it seems that Google Earth renders the best images if the topographical feature is nearby. Distant mountains, even when very distinct, tend to lose detail when viewed in “street view”. That said, a computer vision model should be able to match images despite this limitation, which hampers human viewers far more than a trained model.
*Note that I’m writing this both for colleagues who can use this tool and for the people who are in the computer vision field whose interest I’m hoping to inspire, so I’m trying to not get technical beyond my understanding, while also explaining terms to those who’ve never heard them. For nontechnical colleagues, when I say “model” or “neural network”, think of that as simply “computer program” though, to be specific, they are called convolutional neural networks, sometimes supplemented by generative adversarial networks.
**I believe the process of generating a ground level view from a satellite/nadir view is called a polar transformation. If anyone can confirm that or tell me what it is properly called, please do.
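For what it’s worth, “polar transform” is indeed the term used in the cross-view matching literature for warping an overhead image toward a ground-panorama layout: each output column becomes a compass bearing from the camera point and each row a ground distance. The sketch below is a toy nearest-neighbor version of that resampling (grid sizes and values are made up; real implementations interpolate and work on full-resolution imagery). Whether Google Earth’s own street-view rendering uses exactly this is beyond me, so treat this as illustration only.

```python
import math

def polar_transform(aerial, width, height):
    """Resample a square overhead image (list of rows, camera point at the
    centre) into a panorama-like strip: column = compass bearing,
    row = ground distance, with the farthest terrain at the top."""
    n = len(aerial)
    cx = cy = (n - 1) / 2.0
    out = [[0] * width for _ in range(height)]
    for row in range(height):
        # radius shrinks toward the bottom of the strip (near terrain)
        r = (height - 1 - row) / (height - 1) * cx
        for col in range(width):
            theta = 2 * math.pi * col / width   # bearing, 0 = north
            x = int(round(cx + r * math.sin(theta)))
            y = int(round(cy - r * math.cos(theta)))
            out[row][col] = aerial[y][x]        # nearest-neighbor sample
    return out
```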
***I have a minor philosophical quibble with attributing human characteristics to machines, hence the scare quotes around these terms. I’m also writing this for my activist/humanitarian aid colleagues who probably aren’t familiar with these terms, so the quotes are meant to indicate that the word is being used in a nonstandard sense specific to the field.