The Bear Necessity of AI in Conservation
The results of the "AI for Bears" Challenge, which aims to improve the monitoring and identification of bears using advanced computer vision techniques.
The "AI for Bears" Challenge aims to improve the monitoring and identification of bears using advanced computer vision techniques. Brown bears, critical to ecosystem balance, are notoriously difficult to track due to their vast ranges and lack of unique markings. Traditional monitoring involves invasive methods like physical tagging, which poses significant challenges. Therefore, we focus on developing non-invasive AI technologies to enhance bear conservation efforts.
With our main goal of identifying bears in mind, we formed four teams, each tackling one part of the Challenge:
- Team 1: Create efficient AI models that accurately classify images based on the presence of bears. The goal was to filter and process a large amount of camera trap data so that subsequent AI processes could focus only on relevant bear images.
- Team 2: Once accurate classification was established, precisely detect and segment the bear faces within the images. This was needed to identify individuals, as it isolates the bear features required for the subsequent identification step.
- Team 3: At the core of the challenge, develop AI models that can identify individual bears from the segmented face images.
- Team 4: Handle the practical deployment of the models developed by Teams 1 to 3, optimizing them to run efficiently on edge devices.
In 10 weeks we built a machine learning pipeline that takes photographs and camera trap imagery, finds a bear face, segments it, and compares it to a database of known bears. Using metric learning, we find the correct match in more than 90% of the cases!
We leveraged multiple datasets from the BearID Project. The first dataset contained 3300 photographs of bears which we used for training our models.
Figure 1: Example image from the dataset.
The other data consisted of "chips", the cut-out versions of the bear images, together with unique bear IDs.
Figure 2: An example of a bear chip.
We would like to express our gratitude for the technical and financial support provided by ARM and NXP, which played a crucial role in the success of this Challenge.
ARM contributed to the Challenge by providing their specialized virtual hardware platform and tools tailored for AI development on edge devices. This enabled the teams to optimize their machine learning models for ARM processors. ARM also offered technical expertise, which helped enhance the development and optimization process. Their sponsorship provided essential financial support, which was instrumental in ensuring the Challenge's success.
NXP, on the other hand, contributed to the Challenge by providing advanced microcontrollers (i.MX 93 chip) and processors designed for efficient AI processing in remote settings. They also provided integration support to ensure that this hardware worked seamlessly with the necessary sensors and camera systems used in wildlife monitoring. Alongside ARM, NXP co-sponsored the overall costs of the Challenge, covering technical and operational expenses, which were crucial for the execution of this initiative.
The objective of this team was to develop AI models for the detection of bears in video frames captured from camera traps placed in the wild. Given the edge deployment of these models, size and efficiency were among the main constraints.
Requirements for the solution:

- run on edge devices in the field (camera traps)
- small model size
- efficient, low-power inference
To solve this problem, we chose to train a classification model instead of a detection model, because classification models are generally smaller and require fewer computational resources. One drawback of classification models is that they are not designed to localize the subject of an image, so they may pick up on features that are useful for minimizing the loss but do not generalize robustly. Nevertheless, since the final use case doesn't require localization information, we decided to proceed with classification models, prioritizing efficiency.
For efficiency, we chose the MobileNetV3-Small architecture, a modern classification CNN designed specifically for this purpose. Moreover, this model is already available in TensorFlow 2.15, making it easy to use. All models were initialized with the provided ImageNet weights.
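For illustration, a minimal sketch of how such a classifier can be set up in TensorFlow/Keras is shown below; the input resolution, dropout rate, optimizer settings, and class ordering are assumptions rather than the exact configuration used in the Challenge.

```python
import tensorflow as tf

NUM_CLASSES = 3  # assumed ordering: empty frame, other animal, bear

# MobileNetV3-Small backbone with ImageNet weights, original 1000-class head removed.
backbone = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3),  # assumed input resolution
    include_top=False,
    weights="imagenet",
    pooling="avg",
)

# Attach a small 3-class classification head.
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```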
This approach (Figure 3) consists of a single MobileNetV3-Small trained to classify the images into three classes: an empty frame (🌳), another animal (🐺), or a bear (🐻).
Figure 3. Schematic representation of the Single MobileNet-V3 approach.
This approach (Figure 4) consists of two MobileNetV3-Small models acting in sequence. The first model is trained to identify whether the frame is empty (🌳) or contains something (🌳❌).
The second model is executed only if the first model detects a non-empty frame (🌳❌), to identify whether it contains another animal (🐺) or a bear (🐻).
This pipeline was designed to address the problems caused by the recurring backgrounds in the dataset and by the inability of classification models to localize the main subjects of the image. By splitting the pipeline in two, each model can focus on a simpler task at the cost of some computational overhead.
Figure 4. Schematic for Double MobileNetV3-Small Pipeline approach.
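As a rough sketch, inference with the two-stage cascade could look like the snippet below; it assumes both stages are Keras classifiers with the class indices given in the comments, and all names are illustrative.

```python
import numpy as np

# Hypothetical class indices for the two stages.
EMPTY = 0   # stage 1: 🌳 (empty) vs 🌳❌ (non-empty)
BEAR = 1    # stage 2: 🐺 (other animal) vs 🐻 (bear)

def classify_frame(frame, stage1_model, stage2_model):
    """Run the cascade: the second model is only invoked on non-empty frames."""
    x = np.expand_dims(frame, axis=0)  # add batch dimension
    stage1_probs = stage1_model.predict(x, verbose=0)[0]
    if np.argmax(stage1_probs) == EMPTY:
        return "empty"
    stage2_probs = stage2_model.predict(x, verbose=0)[0]
    return "bear" if np.argmax(stage2_probs) == BEAR else "other_animal"
```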
The following metrics were calculated on the held-out test set from the data split described earlier. These metrics are commonly used in machine learning classification problems: Accuracy, Recall, Precision, and F1 Score. All metrics except Accuracy are reported per class in a one-vs-all manner.
| Metric (%) | Class | Approach A (MobileNetV3-Small) | Approach B (Two MobileNetV3-Small Sequential Pipeline) | Approach A Quantized (MobileNetV3-Small) |
|---|---|---|---|---|
| Accuracy | – | 91.70 | 93.10 | 82.84 |
| Recall | 🌳 | 80.16 | 87.30 | 94.15 |
| Recall | 🐺 | 96.95 | 97.10 | 82.91 |
| Recall | 🐻 | 93.95 | 93.23 | 77.09 |
| Precision | 🌳 | 93.81 | 91.18 | 67.96 |
| Precision | 🐺 | 95.08 | 95.77 | 97.84 |
| Precision | 🐻 | 88.60 | 92.14 | 84.57 |
| F1 Score | 🌳 | 86.44 | 89.20 | 78.94 |
| F1 Score | 🐺 | 95.96 | 96.43 | 89.76 |
| F1 Score | 🐻 | 91.12 | 92.68 | 80.65 |
Table 1: Evaluation metrics for the bear classification models.
Overall, both approaches perform well and to a similar degree. Replicating (oversampling) the empty frames seen in each epoch was a fundamental step in improving model performance. Both approaches reach an accuracy of over 90%, with a drop of roughly 10 percentage points for the quantized version.
The goal of the Bear Face Detection and Segmentation team was to take a camera trap image of a bear and produce an image of the bear's head with the background removed. This cutout is called a "chip" and is used as input by the bear identification team to re-identify individual bears.
We combined an object detector with a SAM segmentation model to create a dataset of segmented bears. This dataset was then used to fine-tune a YOLO segmentation model, which produces a bear mask. Combining the trained bear face detector with this segmentation model yields an image of a bear's head with no background.
Figure 5: Training process and inference pipeline of the bear face detection
For the bear face detection, we used a YOLOv8 model. After 20 epochs, this model achieves an mAP50 of 0.9, which is close to perfect. The model was trained on the BearID dataset with the corrected bounding boxes from Roboflow. Hyperparameter tuning was not needed. The results of this model are shown in Figure 6.
Figure 6: Results of a YOLOv8 bear face detector model.
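With the Ultralytics API, training such a detector takes only a few calls. The sketch below assumes a dataset config file (here called bear_faces.yaml) pointing at the BearID images with the corrected Roboflow bounding boxes; the image size is an illustrative default.

```python
from ultralytics import YOLO

# Start from pretrained YOLOv8 nano detection weights.
model = YOLO("yolov8n.pt")

# "bear_faces.yaml" is a placeholder for the dataset configuration.
model.train(data="bear_faces.yaml", epochs=20, imgsz=640)

# Evaluate on the validation split (reports mAP50, mAP50-95, ...).
metrics = model.val()
```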
We used a YOLOv8n-seg model to segment the bear faces. As with the previous model, we obtained a high mAP50 of almost 0.99 after only a few epochs. Once more, no hyperparameter tuning was needed. The results of the bear face segmentation model are shown in Figure 7.
Figure 7: Results of a YOLOv8n-seg bear face segmentation model.
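To illustrate how the two models can be combined at inference time to produce a background-free chip, here is a sketch; the weight file names are placeholders and the exact cropping and post-processing used in the Challenge may differ.

```python
import cv2
import numpy as np
from ultralytics import YOLO

face_detector = YOLO("bear_face_detector.pt")  # placeholder weight files
face_segmenter = YOLO("bear_face_seg.pt")

def make_chip(image_path: str):
    """Detect a bear face, segment it, and return the cutout with no background."""
    image = cv2.imread(image_path)

    # 1. Detect the bear face and crop it.
    det = face_detector(image)[0]
    if len(det.boxes) == 0:
        return None
    x1, y1, x2, y2 = det.boxes.xyxy[0].int().tolist()
    crop = image[y1:y2, x1:x2]

    # 2. Segment the face inside the crop and zero out the background.
    seg = face_segmenter(crop)[0]
    if seg.masks is None:
        return None
    mask = seg.masks.data[0].cpu().numpy()
    mask = cv2.resize(mask, (crop.shape[1], crop.shape[0])) > 0.5
    return np.where(mask[..., None], crop, 0)  # the "chip"
```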
Here we tackle the main goal of the Challenge: identifying individual bears. We do this using a technique called metric learning.
Figure 8: Complete pipeline to identify bears.
Metric learning is the process of training models that produce embedding vectors. The aim is to minimize the differences between embeddings of the same individual and maximize the differences between embeddings of different individuals. To achieve this, we used various loss functions for metric learning, each with its own properties and advantages.
Based on findings from previous Fruitpunch AI challenges, we only want to use the faces of the bears. Backgrounds can cause the model to rely on them for identification: if bear X is always photographed in the water, for example, the model will encode that in the output embeddings. The bodies of bears also change over time, which can make it harder for the model to learn useful features. Therefore, the other teams built a model that extracts the face and removes the background from images of bears. The output of their model is our input data, and the labels are the bear IDs provided by the BearID experts. During the first part of the challenge, we used a total of 4662 images of 132 bears. Near the end of the challenge, we got access to additional data, increasing the total to 51042 images of 144 bears.
Given the constraint of deploying our solution on edge hardware, we prioritized efficiency and performance when selecting our feature extractors. Our exploration included ConvNeXt-Tiny, EfficientNetV2-S, and ResNet-50 (see Table 2).
For each of these architectures, we utilized ImageNet pre-trained models as our starting point. Recognizing the unique requirements of our task, we replaced the models' final classification layers with fully-connected layers of varying output dimensionalities. This modification aims to tailor the networks for our specific objective of bear re-identification.
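As a sketch, swapping the final classification layer of an ImageNet-pretrained backbone for an embedding head might look like this in PyTorch/torchvision; the 256-dimensional output is an illustrative choice, not necessarily the dimensionality we settled on.

```python
import torch
import torch.nn as nn
from torchvision import models

EMBEDDING_DIM = 256  # assumed output dimensionality

# ImageNet-pretrained ConvNeXt-Tiny backbone.
backbone = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1)

# Replace the final 1000-class classifier with an embedding layer.
in_features = backbone.classifier[2].in_features
backbone.classifier[2] = nn.Linear(in_features, EMBEDDING_DIM)

def embed(images: torch.Tensor) -> torch.Tensor:
    """Return L2-normalized embeddings, as is common in metric learning."""
    return nn.functional.normalize(backbone(images), dim=1)
```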
Our experimentation with loss functions included circle loss, which we used for the results reported below, among others.
These loss functions were chosen for their ability to effectively train models for tasks requiring fine-grained distinction between classes, such as individual animal re-identification.
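Circle loss, the loss behind the results in Table 2 below, is available off the shelf in the pytorch-metric-learning library. The margin and scale values in this sketch are the library defaults, not necessarily the ones we trained with.

```python
from pytorch_metric_learning import losses

# Circle loss pulls embeddings of the same bear together and pushes
# embeddings of different bears apart, with adaptive per-pair weighting.
loss_func = losses.CircleLoss(m=0.4, gamma=80)

# Inside the training loop, given a batch of embeddings [batch, dim]
# and the corresponding bear-ID labels [batch]:
#   loss = loss_func(embeddings, labels)
#   loss.backward()
```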
In the realm of bear re-identification, accurately evaluating the performance of our models is paramount. The BearID researchers informed us that they would use our model to generate a list of top candidates for each bear. To match this setting, we use the Hit Rate @ K metric, defined as the fraction of queries for which the list of K predicted bear IDs contains the correct bear ID. So if we take K = 3, we return a list of three bear IDs (with possible duplicates); if it contains the correct one, we have a hit, otherwise we don't. Averaging over all queries gives the hit rate. Specifically, we used 1, 3, and 5 as values for K.
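A minimal sketch of how Hit Rate @ K can be computed, assuming queries are matched against a database of embeddings of known bears via cosine similarity; all names are illustrative.

```python
import numpy as np

def hit_rate_at_k(query_emb, query_ids, db_emb, db_ids, k=3):
    """Fraction of queries whose k nearest database entries contain the true bear ID."""
    db_ids = np.asarray(db_ids)
    hits = 0
    for emb, true_id in zip(query_emb, query_ids):
        sims = db_emb @ emb              # cosine similarity (embeddings assumed L2-normalized)
        topk = np.argsort(-sims)[:k]     # indices of the k most similar database entries
        hits += int(true_id in db_ids[topk])
    return hits / len(query_ids)
```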
The results of the experiments are in the table below. It turns out that the ConvNext_tiny was the best-performing backbone. Also, the larger dataset increased the performance significantly.
| Backbone | Loss function | Hit Rate @ 1 | Hit Rate @ 3 | Hit Rate @ 5 |
|---|---|---|---|---|
| ConvNext_tiny | circle loss | 0.938 | 0.963 | 0.972 |
| Efficientnet_v2_s | circle loss | 0.7867 | 0.889 | 0.893 |
| ResNet-50 | circle loss | 0.678 | 0.819 | 0.862 |

Table 2: Hit Rate @ K scores for different backbones.
In addition to training and fine-tuning our own networks, we also fine-tuned the open-source MegaDescriptor model, a transformer-based architecture trained specifically on animal re-identification tasks. The model comes in different sizes; we used the smallest one, i.e. tiny, to align with our objective of running the model on edge hardware.
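MegaDescriptor checkpoints are published on the Hugging Face hub and can be loaded through timm. The hub identifier below is our assumption for the tiny variant, so treat this as a sketch rather than the exact call we used.

```python
import timm

# Load the tiny MegaDescriptor variant from the Hugging Face hub
# (identifier assumed; check the BVRA organization for the exact name).
model = timm.create_model(
    "hf-hub:BVRA/MegaDescriptor-T-224",
    pretrained=True,
    num_classes=0,  # drop the classification head, keep the feature extractor
)
```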
As shown in Table 3, this approach already achieves remarkable performance.
| Hit Rate @ 1 | Hit Rate @ 3 | Hit Rate @ 5 |
|---|---|---|
| 0.905 | 0.943 | 0.956 |

Table 3: Hit Rate @ K scores for the fine-tuned MegaDescriptor tiny.
The Edge Pipeline team was tasked with optimizing the performance of the PyTorch models on the NXP i.MX 93 chip. We selected the i.MX 93 chip because it has an NPU that accelerates machine learning model operations. The team worked on converting the models to the appropriate frameworks that enable acceleration on the chip. Additionally, they quantized the models to ensure efficient inference performance.
The i.MX 93 chip is placed on an evaluation board with 2 GB of working memory and 16 GB of storage. The evaluation kit comes with a camera interface that allows connection to bear trap cameras. What's more, the chip is equipped with an Arm Ethos-U65 microNPU, which accelerates critical machine learning operations such as matrix multiplications and convolutions, enabling fast processing of these tasks.
The bear identification pipeline comprises multiple components and uses different models, so several models needed to be converted. Figure 9 illustrates the pipeline and its various components. The numbers in red indicate the different components of the pipeline where a machine learning model was utilized. These components were crucial for our team, as they can be accelerated.
| Pipeline part | Objective | Model |
|---|---|---|
| 1 | Detecting bears in the frame | MobileNetV3 |
| 2 | Detecting and segmenting the bear faces | YOLOv8 |
| 3 | Matching bear faces | ConvNextTiny |
The most crucial part of the pipeline is the model for component 1, as it is the part that runs the most frequently. The remaining components of the pipeline will only run when a bear is actually detected in a frame. Given that component 1 runs the most, it is important to make this component as efficient as possible to reduce energy consumption. Therefore, having a quantized model that can be accelerated is of great importance.
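A sketch of post-training full-integer quantization with the TensorFlow Lite converter, which is the usual first step before compiling a model for the Ethos-U NPU with NXP's eIQ/Vela tooling; the model path and calibration data here are placeholders.

```python
import numpy as np
import tensorflow as tf

# Placeholders: in practice this is the trained MobileNetV3-Small classifier
# and a few hundred real, preprocessed camera-trap frames.
classifier_model = tf.keras.models.load_model("bear_classifier.keras")
calibration_images = np.random.rand(200, 224, 224, 3).astype(np.float32)

def representative_dataset():
    for image in calibration_images:
        yield [np.expand_dims(image, 0)]

converter = tf.lite.TFLiteConverter.from_keras_model(classifier_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization so the NPU can execute every operator.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("bear_classifier_int8.tflite", "wb") as f:
    f.write(tflite_model)
```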
Figure 9: Components of the Bear identification pipeline
After converting the classification, detection, and segmentation models, we compared the inference speed and model performance of the original and converted models. The results are summarized in Table 4. For the classification model, we observed a significant increase in inference speed, with the time per frame dropping from 38 ms to 2 ms. However, there was a notable decrease in performance in terms of recall, dropping from 93.4 to 77.1. For the detection and segmentation model, we also saw a substantial boost: the converted model runs roughly five times faster, with inference time decreasing from ~1500 ms to ~300 ms. The gains in inference speed also resulted in a significant reduction in power consumption.
| Pipeline component | Model | i.MX 93 | Inference time | Recall before conversion | Recall after conversion |
|---|---|---|---|---|---|
| 1. Classification | MobileNetV3 | CPU | ~38 ms | ~94 | 93.4 (unquantized) |
| 1. Classification | MobileNetV3 | NPU | ~2 ms | ~94 | 77.1 (quantized) |
| 2. Detection and segmentation | YOLOv8 | CPU | ~1500 ms | – | – |
| 2. Detection and segmentation | YOLOv8 | NPU | ~300 ms | – | – |

Table 4: Results when performing inference on the converted models.
As we wrap up the "AI for Bears" Challenge, we celebrate significant advancements in using AI to identify bears non-invasively, setting a new standard in wildlife conservation. Over ten weeks, our dedicated teams developed a sophisticated machine-learning pipeline that efficiently identifies individual bears using non-invasive AI technologies!
A heartfelt thanks to our partners, ARM and NXP, for their crucial support, and to all participants whose expertise and dedication have driven this project's success!
Looking ahead, we're excited to see the implementation of these models in the wild.
Big thank you to everyone involved for making this challenge a success:
Ed Miller, Melanie Clapham, Mary Bennion, Brian de Bart, Martijn van der Linden, Laurens Polgar, Hiram Rayo Torres Rodriquez, Matthias Wilkens, David Tischler, Adam, Bart Emons, Carmen Martinez Barbosa, Arthur Caillua, Matthieu Fraissinet-Tachet, Tristan Koskamp, Meredith Palmer, Anton Alvarez, Thor Veen, Christos Panagiotopoulos, Simon Dahrs, Yuri Shlyakhter, Thea Shin, Davide Coppola, John Cooper, Gaspard Bos, Jesse Wiers, Paul L, Jaka Cikač, Prishani Sokay, Sako Arts, Dorian Groen