May 14, 2024

The Bear Necessity of AI in Conservation

The results of the AI for Bears Challenge, which aims to improve the monitoring and identification of bears using advanced computer vision techniques.

The "AI for Bears" Challenge aims to improve the monitoring and identification of bears using advanced computer vision techniques. Brown bears, critical to ecosystem balance, are notoriously difficult to track due to their vast ranges and lack of unique markings. Traditional monitoring involves invasive methods like physical tagging, which poses significant challenges. Therefore, we focus on developing non-invasive AI technologies to enhance bear conservation efforts.

With our main goal of identifying bears in mind, we formed four teams that each tackled one part of the Challenge:

  • Team 1: Bear classification
  • Team 2: Bear face detection and segmentation
  • Team 3: Bear identification
  • Team 4: ML pipeline on edge

The challenge in short:

The main objective of Team 1 was to create efficient AI models that can accurately classify images based on the presence of bears. The goal was to filter and process a large amount of camera trap data so that subsequent AI processes could focus only on relevant bear images. Once accurate classification was established, Team 2 concentrated on precisely detecting and segmenting the bear faces within the images. This was needed to identify individuals, as it isolates the necessary bear features for the subsequent identification process. Team 3 was at the core of the challenge and was tasked with developing AI models that could identify individual bears from the segmented face images. The final team worked on the practical deployment of the AI models developed by Teams 1 to 3. Team 4 focused on optimizing these models to run efficiently on edge devices.

In 10 weeks we built a machine learning pipeline that takes photographs and camera trap imagery, finds a bear face, segments it, and compares it to a database of known bears. Using metric learning, we find the correct match in more than 90% of cases!

We leveraged multiple datasets from the BearID Project. The first dataset contained 3,300 photographs of bears, which we used for training our models.

Figure 1: Example image from the dataset.

The other data consisted of “chips”, the cutout versions of the bear images, together with unique bear IDs.


Figure 2: An example of a bear chip.

Before we dive into the details…

We would like to express our gratitude for the technical and financial support provided by ARM and NXP, which played a crucial role in the success of this Challenge.

ARM contributed to the Challenge by providing their specialized virtual hardware platform and tools tailored for AI development on edge devices. This enabled the teams to optimize their machine learning models for ARM processors. ARM also offered technical expertise, which helped enhance the development and optimization process. Their sponsorship provided essential financial support, which was instrumental in ensuring the Challenge's success.

NXP, on the other hand, contributed to the Challenge by providing advanced microcontrollers (i.MX 93 chip) and processors designed for efficient AI processing in remote settings. They also provided integration support to ensure that this hardware worked seamlessly with the necessary sensors and camera systems used in wildlife monitoring. Alongside ARM, NXP co-sponsored the overall costs of the Challenge, covering technical and operational expenses, which were crucial for the execution of this initiative.

Classification

The objective of this team was to develop AI models for the detection of bears in video frames captured from camera traps placed in the wild. Given the edge deployment of these models, size and efficiency were among the main constraints.

Requirements for the solution:

  • Work on camera trap frames, both daytime and nighttime vision
  • An AI model suitable for edge deployment on low-powered devices
    • Small model size
    • Fast computation
  • High recall on identifying bears
    • Missing bears has a high cost

To solve this problem, we chose to train a classification model instead of a detection model, because classification models are generally smaller and require fewer computational resources. One drawback of classification models is that they are not designed to localize the subject of an image, meaning they may pick up on features that minimize training loss but do not generalize robustly. Nevertheless, since the final use case doesn't require localization information, we decided to proceed with classification models, prioritizing efficiency.

We chose MobileNetV3-Small, a modern classification CNN designed specifically for efficiency. Moreover, this model is already available in TensorFlow 2.15, making it easy to use. All models were initialized with the provided ImageNet weights.
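As a minimal sketch of this setup (the input size, 3-class head, and training configuration are our illustrative assumptions, not the teams' exact code):

```python
import tensorflow as tf

# MobileNetV3-Small backbone with ImageNet weights and a fresh
# 3-class head (empty / other animal / bear). Input size and head
# layout are illustrative assumptions.
base = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3),
    include_top=False,
    weights="imagenet",
    pooling="avg",
)
outputs = tf.keras.layers.Dense(3, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```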

Approach A | Single MobileNetV3-Small

This approach (Figure 3) consists of a single MobileNetV3-Small trained to classify the images into three classes:

  • empty 🌳
  • any other animal 🐺
  • bear 🐻


Figure 3. Schematic representation of the Single MobileNet-V3 approach.

Approach B | Double MobileNetV3-Small Pipeline

This approach (Figure 4) consists of two MobileNetV3-Small models acting in sequence. The first model is trained to identify whether the frame is empty or contains something:

  • empty 🌳
  • not empty 🌳❌

The second model is executed only if the first model detects a non-empty frame (🌳❌) to identify the presence of either:

  • any other animal 🐺
  • bear 🐻

This pipeline was designed to address the problems caused by the recurring backgrounds in the dataset and by the inability of classification models to localize the main subjects of the image. By splitting the pipeline in two, each model can focus on a simpler task at the cost of some computational overhead.
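A hedged sketch of how the two models chain together at inference time (the model names and the 0.5 thresholds are illustrative assumptions):

```python
def classify_frame(frame, empty_model, animal_model):
    """Two-stage pipeline: run the cheap empty/not-empty model first,
    and only invoke the animal/bear model on non-empty frames."""
    batch = frame[None, ...]  # add a batch dimension
    # Stage 1: empty vs. not empty
    p_not_empty = empty_model.predict(batch, verbose=0)[0][1]
    if p_not_empty < 0.5:
        return "empty"
    # Stage 2: any other animal vs. bear
    p_bear = animal_model.predict(batch, verbose=0)[0][1]
    return "bear" if p_bear >= 0.5 else "other animal"
```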


Figure 4. Schematic for Double MobileNetV3-Small Pipeline approach.

The following metrics were calculated on the held-out test set from our data split. These metrics are commonly used in machine learning classification problems: Accuracy, Recall, Precision, and F1 Score. All metrics except Accuracy are reported per class, in a one-vs-all manner.

Metric (%)      Approach A:          Approach B: Two MobileNetV3-     Approach A Quantized:
                MobileNetV3-Small    Small Sequential Pipeline        MobileNetV3-Small

Accuracy        91.70                93.10                            82.84
Recall 🌳       80.16                87.30                            94.15
Recall 🐺       96.95                97.10                            82.91
Recall 🐻       93.95                93.23                            77.09
Precision 🌳    93.81                91.18                            67.96
Precision 🐺    95.08                95.77                            97.84
Precision 🐻    88.60                92.14                            84.57
F1 Score 🌳     86.44                89.20                            78.94
F1 Score 🐺     95.96                96.43                            89.76
F1 Score 🐻     91.12                92.68                            80.65

Table 1. Evaluation metrics for the bear classification models.

Overall, both approaches perform well and to a similar degree. Replicating the empty frames the models see in each epoch (oversampling them) was a fundamental step in increasing the models' performance.

Both approaches have an average accuracy of over 90%, with a drop of ~10% in the quantized version.

Face Detection & Segmentation

The goal of the bear face detection and segmentation team is to produce an image of a bear's head with no background, given a camera trap image of a bear. This background-free head image is called a “chip” and serves as input for the bear identification team to re-identify bears.

We combined an object detector with a SAM segmentation model to create a dataset of segmented bears, which we then used to fine-tune a YOLO segmentation model. The fine-tuned model produces a bear mask; combining it with a trained bear head detector yields a bear's head with no background.

Figure 5: Training process and inference pipeline of the bear face detection.

Training process:

For bear face detection, we used a YOLOv8 model. After 20 epochs, it reaches a mAP50 of 0.9, which is close to perfect. The model was trained on the BearID dataset with the corrected bounding boxes from Roboflow; no hyperparameter tuning was needed. The results of this model are shown in Figure 6.
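With the Ultralytics API this training loop is short; below is a hedged sketch, where "bearid_faces.yaml" is a hypothetical dataset config pointing at the BearID images with the Roboflow-corrected bounding boxes:

```python
from ultralytics import YOLO

# Fine-tune a pretrained YOLOv8 detector on bear faces.
model = YOLO("yolov8n.pt")
model.train(data="bearid_faces.yaml", epochs=20, imgsz=640)

metrics = model.val()        # reports mAP50 among other metrics
results = model("bear.jpg")  # run detection on a single image
```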

Figure 6: Results of a YOLOv8 bear face detector model.

Bear face segmentation

We use a YOLOv8n-seg model to segment the bear faces. Like the previous model, it reaches a high mAP50 of almost 0.99 after a couple of epochs. Once more, no hyperparameter tuning was needed. The results of the bear face segmentation model are shown in Figure 7.


Figure 7: Results of a YOLOv8n-seg bear face segmentation model.
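Putting the two models together, the sketch below shows how a chip could be cut out of a camera trap frame. The weight file names are hypothetical, and the handling of thresholds and multiple detections is simplified:

```python
import cv2
import numpy as np
from ultralytics import YOLO

det = YOLO("bear_face_det.pt")  # hypothetical fine-tuned detector weights
seg = YOLO("bear_face_seg.pt")  # hypothetical fine-tuned segmentation weights

image = cv2.imread("camera_trap_frame.jpg")

# Crop the first detected bear face
x1, y1, x2, y2 = det(image)[0].boxes.xyxy[0].int().tolist()
head = image[y1:y2, x1:x2]

# Mask out the background within the crop to obtain the chip
mask = seg(head)[0].masks.data[0].cpu().numpy()
mask = cv2.resize(mask, (head.shape[1], head.shape[0]))
chip = (head * (mask[..., None] > 0.5)).astype(np.uint8)
cv2.imwrite("chip.png", chip)
```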

Bear Identification

Here we tackle the main goal of the Challenge: identifying individual bears. We do this using a technique called metric learning.


Figure 8: Complete pipeline to identify bears.

Metric learning is the process of training models that produce embedding vectors. These vectors aim to minimize the differences between embeddings of the same individual and maximize the differences between embeddings of different individuals. To achieve this, we used various metric learning loss functions, each with its own properties and advantages.

Dataset used

Due to findings in previous FruitPunch AI challenges, we only use the faces of the bears. Backgrounds can cause the model to rely on them for identification: if bear X is always photographed in the water, for example, that feature leaks into the output embeddings. Also, the bodies of bears change over time, which can make it harder for the model to learn useful features. Therefore, the other teams built models that extract the face and remove the background from images of bears. The output of their models is our input data, with labels being the bear IDs provided by the BearID experts. During the first part of the challenge, we used a total of 4662 images of 132 bears. Near the end of the challenge, we got access to additional data, increasing the total to 51042 images of 144 bears.

Feature Extractors and Model Architectures

Given the constraint of deploying our solution on edge hardware, we prioritized efficiency and performance in selecting our feature extractors. Our exploration included:

  1. ResNet-50: Known for its depth and efficiency, making it a robust choice for feature extraction.
  2. EfficientNet Small: Optimized for speed and accuracy, particularly well-suited for edge devices.
  3. ConvNext Tiny: A newer architecture that promises improvements over traditional ConvNets.

For each of these architectures, we utilized ImageNet pre-trained models as our starting point. Recognizing the unique requirements of our task, we replaced the models' final classification layers with fully-connected layers of varying output dimensionalities. This modification aims to tailor the networks for our specific objective of bear re-identification.
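As an example, here is a hedged sketch of this head swap for the ConvNeXt Tiny backbone in torchvision; the 256-dimensional output is an illustrative choice, since the teams tried varying dimensionalities:

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ConvNeXt Tiny and replace its final
# classification layer with an embedding head.
backbone = models.convnext_tiny(weights="IMAGENET1K_V1")
in_features = backbone.classifier[2].in_features      # final Linear layer
backbone.classifier[2] = nn.Linear(in_features, 256)  # 256-d embeddings
```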

Loss Functions

Our experimentation with loss functions included:

  1. Triplet Loss: Encourages the model to distinguish between anchor, positive, and negative samples.
  2. Circle Loss: Aims to enhance the discriminative power of the embedding space.
  3. ArcFace Loss: Focuses on increasing the angular margin between classes to improve separability.

These loss functions were chosen for their ability to effectively train models for tasks requiring fine-grained distinction between classes, such as individual animal re-identification.
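These losses are all available in the pytorch-metric-learning library; the sketch below shows how they could be instantiated and applied. The hyperparameters are illustrative defaults, not our tuned values:

```python
import torch
from pytorch_metric_learning import losses

embedding_size, num_classes = 256, 132  # illustrative values

triplet = losses.TripletMarginLoss(margin=0.2)
circle = losses.CircleLoss(m=0.4, gamma=80)
arcface = losses.ArcFaceLoss(num_classes=num_classes,
                             embedding_size=embedding_size)

# A training step computes the loss directly on embeddings and bear IDs:
embeddings = torch.randn(32, embedding_size)   # batch of face embeddings
labels = torch.randint(0, num_classes, (32,))  # corresponding bear IDs
loss = circle(embeddings, labels)
```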

Evaluation metrics

In bear re-identification, accurately evaluating model performance is paramount. The BearID researchers informed us that they would use our model to generate a list of top candidates for each bear. To match this setting, we use the Hit Rate @ k metric: the fraction of queries for which the list of k predicted bear IDs contains the correct one. For k = 3, for example, we return a list of three bear IDs (possibly with duplicates); if it contains the correct ID, we count a hit, otherwise we don't. Averaging over all queries gives the hit rate. Specifically, we used 1, 3, and 5 as values for k.
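A minimal sketch of how Hit Rate @ k can be computed from embeddings, assuming cosine similarity for retrieval (the similarity measure is our assumption):

```python
import numpy as np

def hit_rate_at_k(query_emb, query_ids, gallery_emb, gallery_ids, k):
    """Fraction of queries whose k nearest gallery embeddings include
    the correct bear ID. All arguments are NumPy arrays."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                           # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best matches
    hits = [qid in gallery_ids[idx] for qid, idx in zip(query_ids, topk)]
    return float(np.mean(hits))
```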

Results

The results of the experiments are shown in the table below. ConvNext_tiny turned out to be the best-performing backbone. Also, the larger dataset increased performance significantly.

Backbone            Loss function    Hit Rate @ 1    Hit Rate @ 3    Hit Rate @ 5
ConvNext_tiny       circle loss      0.938           0.963           0.972
Efficientnet_v2_s   circle loss      0.7867          0.889           0.893
ResNet-50           circle loss      0.678           0.819           0.862

Table 2: Hit Rate @ k scores for different backbones.


Next to training and fine-tuning our own networks, we also fine-tuned the open-source MegaDescriptor model, a transformer-based architecture trained specifically on animal re-identification tasks. The model comes in different sizes; we used the smallest, i.e. tiny, to align with our objective of running the model on edge hardware.
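For reference, the tiny MegaDescriptor checkpoint can be loaded through timm from the HuggingFace Hub; a hedged sketch (the hub id reflects the published checkpoints, but verify it against the current release):

```python
import timm

# Load MegaDescriptor tiny as a feature extractor (num_classes=0
# strips the classification head so the model outputs embeddings).
model = timm.create_model(
    "hf-hub:BVRA/MegaDescriptor-T-224",
    pretrained=True,
    num_classes=0,
)
```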

As shown in Table 3, this approach already achieves remarkable performance.

Hit Rate @ 1    Hit Rate @ 3    Hit Rate @ 5
0.905           0.943           0.956

Table 3: Hit Rate @ k scores for the fine-tuned MegaDescriptor tiny.


ML Pipeline on Edge

The Edge Pipeline team was tasked with optimizing the performance of the PyTorch models on the NXP i.MX 93 chip. We selected the i.MX 93 because it has an NPU that accelerates machine learning operations. The team worked on converting the models to frameworks that enable acceleration on the chip, and additionally quantized the models to ensure efficient inference performance.

The i.MX 93 chip

The i.MX 93 chip sits on an evaluation board with 2 GB of working memory and 16 GB of storage. The evaluation kit comes with a camera interface that allows connection to camera traps. Moreover, the chip is equipped with an Arm Ethos-U65 microNPU, which accelerates critical machine learning operations such as matrix multiplications and convolutions, enabling quick processing of these tasks.
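The NPU accelerates integer models, which is why quantization matters here. Below is a hedged sketch of post-training full-integer quantization with TensorFlow Lite, one common route to an NPU-ready model; `model` and `calibration_images` are assumed names, and the actual conversion path from PyTorch involves additional steps:

```python
import tensorflow as tf

def representative_data():
    # A small sample of real camera trap frames, preprocessed the same
    # way as during training (`calibration_images` assumed to exist).
    for image in calibration_images[:100]:
        yield [image[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)  # `model` assumed
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("mobilenetv3_int8.tflite", "wb") as f:
    f.write(converter.convert())
```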

The bear identification pipeline comprises multiple components and uses different models, so several models needed to be converted. Figure 9 illustrates the pipeline and its various components. The numbers in red indicate the different components of the pipeline where a machine learning model was utilized. These components were crucial for our team, as they can be accelerated.

Pipeline part    Objective                                   Model
1                Detecting bears in the frame                MobileNetV3
2                Detecting and segmenting the bear faces     YOLOv8
3                Matching bear faces                         ConvNextTiny

The most crucial part of the pipeline is the model for component 1, as it is the part that runs the most frequently. The remaining components of the pipeline will only run when a bear is actually detected in a frame. Given that component 1 runs the most, it is important to make this component as efficient as possible to reduce energy consumption. Therefore, having a quantized model that can be accelerated is of great importance.


Figure 9: Components of the Bear identification pipeline

Results

After converting the classification, detection, and segmentation models, we compared the inference speed and model performance of the original and converted models. The results are summarized in Table 4. For the classification model, we observed a significant increase in inference speed, with the time decreasing from 38 ms to 2 ms. However, there was a notable decrease in recall, dropping from 93.4 to 77.1. The detection and segmentation model also saw a substantial boost: a 5-fold speedup, from 1500 ms down to 300 ms. These gains in inference speed also resulted in a significant reduction in power consumption.

Pipeline component              Model          Processor    i.MX 93 inference time    Recall before conversion    Recall after conversion
1. Classification               MobileNetV3    CPU          ~38 ms                    ~94                         93.4 (unquantized)
1. Classification               MobileNetV3    NPU          ~2 ms                     ~94                         77.1 (quantized)
2. Detection and segmentation   YOLOv8         CPU          ~1500 ms
2. Detection and segmentation   YOLOv8         NPU          ~300 ms

Table 4: Results when performing inference on the converted models.

Concluding

As we wrap up the "AI for Bears" Challenge, we celebrate significant advances in using AI to identify bears non-invasively, setting a new standard in wildlife conservation. Over ten weeks, our dedicated teams developed a sophisticated machine learning pipeline that efficiently identifies individual bears!

A heartfelt thanks to our partners, ARM and NXP, for their crucial support, and to all participants whose expertise and dedication have driven this project's success!

Looking ahead, we're excited to see the implementation of these models in the wild.

Big thank you to everyone involved for making this challenge a success:

Ed Miller, Melanie Clapham, Mary Bennion, Brian de Bart, Martijn van der Linden, Laurens Polgar, Hiram Rayo Torres Rodriquez, Matthias Wilkens, David Tischler, Adam, Bart Emons, Carmen Martinez Barbosa, Arthur Caillua, Matthieu Fraissinet-Tachet, Tristan Koskamp, Meredith Palmer, Anton Alvarez, Thor Veen, Christos Panagiotopoulos, Simon Dahrs, Yuri Shlyakhter, Thea Shin, Davide Coppola, John Cooper, Gaspard Bos, Jesse Wiers, Paul L, Jaka Cikač, Prishani Sokay, Sako Arts, Dorian Groen

AI for Good
AI for Wildlife
Artificial Intelligence
Computer vision
Deep Learning
Edge Computing
Challenge results