In the dense rainforests of Central Africa, a captivating endeavor is underway, driven by researchers from Cornell University. Their mission? To track and protect elephants through the power of audio monitoring. These majestic creatures emit resounding rumbles at an incredibly low frequency, nearly imperceptible to the human ear. These deep calls, traversing vast distances, serve as a concealed communication network for elephants, veiled from our understanding for centuries.
Fortunately, technological advancements have granted us the ability to record and analyze these mysterious elephant rumbles. Over the past few years, a network of more than 50 recorders has been carefully planted across the heart of Central Africa's tropical rainforest, tirelessly capturing the continuous symphony of sounds these gentle giants produce. Yet, faced with such an immense volume of data, human efforts alone struggle to efficiently unearth valuable insights from this wealth of information. So that’s where our AI Challenge comes in!
After experimenting with various model architectures, the two most promising approaches were the Audio Spectrogram Transformer (AST) and the Hierarchical Token-Semantic Audio Transformer (HTS-AT). Their performance is shown in the table below. Our team also took a look at Cornell's existing data processing pipeline to see whether we could speed it up. Overall, we were able to decrease the total inference time on a 24-hour audio file from about 75 minutes to 16 minutes in the worst case and 4 minutes in the best case, on a 6-core Intel i7 CPU.
Last but not least, we explored new ways to monitor rumbles in real time using an NXP i.MX8M Plus device. Using Edge Impulse, we were able to easily deploy a YAMNet model and reduce its size from 11 MB to 4 MB. Comparing CPU and NPU inference on the NXP device, we measured a roughly 30% boost in inference speed with the NPU.
African forest elephants are key ecosystem engineers in Central Africa’s tropical rainforests, but they’re critically endangered—even more threatened than the larger bush elephants and their cousins in Asia. They help keep their home lush and healthy, but poaching is driving them to extinction.
To make matters worse, these elephants are losing their habitat to deforestation, and their food supply is shrinking due to climate change. The resourceful giants are therefore increasingly raiding farmers' crops. Understandably, this doesn't win them much sympathy from farmers, further jeopardizing peaceful co-existence between people and elephants.
But what do you do when you can’t see the animals you want to protect? You listen! 50 engineers from all over the globe worked for 10 weeks to recognize elephant voices and gunshots on sound monitoring devices, using AI.
The AI enthusiasts teamed up with the world’s authority on conservation bioacoustics at Cornell University, who were introduced to FruitPunch by Zambezi Partners, to bring an experimental machine learning model from Stanford University into the jungle. Rainforest Connection (or RFCx), a frontrunner in acoustic wildlife monitoring, gladly shared its knowledge of sound-based wildlife observation with the Challenge participants.
Where elephants live, there is no power grid or network connection. The detection model needed to run on a small chip, powered by only a battery or maybe a solar panel. This pushed the FruitPunch community into the domain of edge computing. Luckily, they had some backup: global semiconductor powerhouse NXP provided them with several SoC (system-on-a-chip) boards for developing a prototype acoustic detector.
To make a small-scale model that runs on edge hardware, it's easiest to first build a regular model and downsize it afterward. AWS provided the teams with access to SageMaker Studio Lab and a whole bunch of compute credits to train the best possible model. To then shrink the result down to run on the NXP boards, Edge Impulse lent a hand as well, opening up their convenient edge computing platform to downscale the detector and deploy it to the device.
Dr. Andreas Paepcke’s students had developed a deep learning model that was able to detect elephants’ deep rumbles, as well as poachers’ gunshots. Researchers at the K. Lisa Yang Center for Conservation Bioacoustics at Cornell then continued the work and generated a labeled dataset of sound recordings from the Central African rainforests.
The resulting model could process raw audio and detect gunshots and rumbles with high accuracy. It was, however, too slow. The detector needed a lot of work before it could support further research, and even more before it could run inference in real time out in the rainforest. In this Challenge, we worked on four aspects of improving the RumbleDetector.
You can read all about our work in the Challenge reports which you will find at the bottom of this page. But to give you a summary, here are some of our interesting findings.
For modeling, we focused on two architectures. Both AST and HTS-AT are transformer-based, which has the added benefit that they tend to be faster than RNN-based models, making them well suited for the fast inference this use case demands.
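To give an idea of what working with an AST looks like in practice, here is a minimal sketch that loads a pretrained checkpoint from the Hugging Face transformers library and adapts its classification head to a small rumble/gunshot label set. The checkpoint name and labels are illustrative, not the exact configuration used in the Challenge.

```python
# Minimal sketch: loading a pretrained Audio Spectrogram Transformer (AST)
# and adapting its head for a small rumble/gunshot label set. The checkpoint
# and labels below are illustrative, not the Challenge's exact configuration.
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

LABELS = ["background", "rumble", "gunshot"]  # hypothetical label set
CHECKPOINT = "MIT/ast-finetuned-audioset-10-10-0.4593"

feature_extractor = ASTFeatureExtractor.from_pretrained(CHECKPOINT)
model = ASTForAudioClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(LABELS),
    ignore_mismatched_sizes=True,  # swap the AudioSet head for a 3-class head
)

# A 10-second mono clip at 16 kHz; random noise stands in for a real recording.
waveform = torch.randn(16000 * 10)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, len(LABELS))
print(LABELS[int(logits.argmax(dim=-1))])
```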
In the first table, you can see the metrics for both the gunshots and the rumbles. AST outperforms HTS-AT on the development set. In the second table, you can see the scores when the events are limited to rumbles only. In this case, performance drops a bit and does not seem to beat the previous model.

Based on published benchmarks, the HTS-AT is designed to outperform the AST; its architecture is a clear upgrade. However, we had to consider the size of our dataset relative to the model. The HTS-AT is specifically built to perform well on the AudioSet dataset, which contains over 2 million labeled 10-second clips, equating to more than 5,500 hours of audio. Although we do have 1,560 hours of recordings, they include only 32,089 clips of gunshots and rumbles. This small dataset is insufficient for training the HTS-AT, leading to painful overfitting.

Finally, we would like to mention that the new models are incredibly speedy, capable of processing 24 hours' worth of audio in just 20 seconds, whereas the older models need upwards of 10 minutes for the same task. This disparity is primarily due to the chosen architecture: transformer-based models can process all timesteps in parallel.
After examining the Stanford code with the Scalene profiler, we discovered that the slowest part of the code involved fetching the 24-hour spectrogram, dividing it into slices, and feeding them to the model for generating predictions. We identified this as the primary area for optimization and aimed to develop a pipeline that would enable easy replacement of the model with a more efficient version.
Although spectrogram conversion was also a processing bottleneck, it was less time-consuming than the aforementioned issues and therefore not given as much priority. The model's prediction generation was the next slowest aspect, but since the Model Development Team was already tasked with creating a more efficient model, we did not include it as part of our objectives.
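As a simplified illustration of the slicing step we optimized, the sketch below cuts a precomputed long spectrogram into overlapping, model-ready windows using numpy's sliding_window_view, which produces views rather than copies. The window and hop sizes, and the spectrogram shape, are placeholders rather than the pipeline's actual values.

```python
# Simplified sketch of the slicing step: cutting a precomputed long spectrogram
# into overlapping, model-ready windows without copying. The window/hop sizes
# and spectrogram shape are placeholders, not the pipeline's actual values.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def slice_spectrogram(spec: np.ndarray, window_frames: int, hop_frames: int) -> np.ndarray:
    """Cut a (freq_bins, time_frames) spectrogram into overlapping windows.

    Returns an array of shape (num_windows, freq_bins, window_frames).
    sliding_window_view yields views instead of copies, avoiding the kind of
    per-slice copying that can dominate the runtime of a naive loop.
    """
    windows = sliding_window_view(spec, window_frames, axis=1)  # (freq, positions, window)
    windows = windows[:, ::hop_frames, :]                       # apply the hop
    return np.transpose(windows, (1, 0, 2))                     # (num_windows, freq, window)

# Example: a fake spectrogram with 128 mel bins and 100,000 time frames.
spec = np.random.rand(128, 100_000).astype(np.float32)
batches = slice_spectrogram(spec, window_frames=256, hop_frames=128)
print(batches.shape)  # (num_windows, 128, 256)
```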
Overall, we managed to reduce the total inference time on a 24-hour audio file from approximately 75 minutes to 16 minutes in the worst-case scenario and 4 minutes in the best case, on a 6-core Intel i7 CPU. With the integration of the new model and improvements in spectrogram generation, we anticipate a best-case time of about 1 minute.
Our team aimed to utilize the capabilities of the NXP i.MX8M Plus device. We successfully implemented a proof-of-concept deployment by using the YAMNet deep learning model with the help of Edge Impulse and TFLite.
After conducting extensive experiments with Edge Impulse, we were able to compress the original 11 MB YAMNet model to a much smaller 4 MB. We also ran performance tests comparing the CPU with the NPU (Neural Processing Unit) for YAMNet. The results were encouraging: the CPU had an average inference time of 28 ms for a 0.975-second clip, while the NPU, once warmed up, achieved 19 ms per inference, a roughly 30% increase in inference speed. Although the outcomes are predominantly positive, there are still obstacles to overcome before a complete, operational system is in place. Networking, power supply, and durability in challenging environments are crucial challenges that need to be addressed.
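Edge Impulse handled the compression for us, but to illustrate the general idea, the sketch below applies post-training int8 quantization with the TensorFlow Lite converter, which stores weights in a quarter of the space of float32. The SavedModel path and representative dataset are placeholders; this is not the exact export pipeline we used.

```python
# Rough illustration of post-training int8 quantization with the TensorFlow Lite
# converter. The SavedModel path and representative dataset are placeholders;
# Edge Impulse performs a comparable optimization when exporting for the device.
import numpy as np
import tensorflow as tf

def representative_audio_frames():
    """Yield example YAMNet-sized inputs (0.975 s at 16 kHz = 15600 samples)."""
    for _ in range(100):
        yield [np.random.uniform(-1.0, 1.0, size=(1, 15600)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yamnet_savedmodel")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_audio_frames
# Restrict to int8 ops so the model can run on the NPU:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("yamnet_int8.tflite", "wb") as f:
    f.write(tflite_model)
```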
The primary objective of the project was centered around deploying the model, with a particular emphasis on the preprocessing and model inference stages. The team utilized the YAMNet project as a starting point and adapted the mel spectrogram generation, successfully integrating it into the ML model, which reduced the need for extensive preprocessing. However, further research is necessary to determine which model, whether through transfer learning or training from scratch, would be most effective in detecting elephant rumbles and gunshots. Improving the dataset quality and refining the input-prediction relationship are both essential for enhancing the model's accuracy and effectiveness.
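As a rough sketch of what folding the preprocessing into the model graph can look like, the snippet below computes a log-mel spectrogram with TensorFlow signal ops, so it can be traced into the exported model rather than run as a separate step. The frame, hop, and mel-bin settings are typical YAMNet-style values, not necessarily the exact ones we shipped.

```python
# Sketch of computing a log-mel spectrogram with TensorFlow ops so it can be
# traced into the exported model graph instead of running as a separate step.
# Frame/hop/mel settings are YAMNet-style defaults, not necessarily the exact
# values the team shipped; infrasonic rumbles would likely need a much lower
# lower_edge_hertz than the 125 Hz used here.
import tensorflow as tf

SAMPLE_RATE = 16000
FRAME_LENGTH = 400   # 25 ms at 16 kHz
FRAME_STEP = 160     # 10 ms hop
MEL_BINS = 64
FMIN, FMAX = 125.0, 7500.0

def log_mel_spectrogram(waveform: tf.Tensor) -> tf.Tensor:
    """waveform: float32 tensor of shape (num_samples,) sampled at 16 kHz."""
    stft = tf.signal.stft(waveform, frame_length=FRAME_LENGTH, frame_step=FRAME_STEP)
    magnitude = tf.abs(stft)                                   # (frames, fft_bins)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=MEL_BINS,
        num_spectrogram_bins=magnitude.shape[-1],
        sample_rate=SAMPLE_RATE,
        lower_edge_hertz=FMIN,
        upper_edge_hertz=FMAX,
    )
    mel = tf.matmul(magnitude, mel_matrix)                     # (frames, MEL_BINS)
    return tf.math.log(mel + 1e-6)                             # log compression

# Example: one 0.975-second clip (15600 samples) of silence.
clip = tf.zeros(15600, dtype=tf.float32)
print(log_mel_spectrogram(clip).shape)  # (96, 64)
```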
The NXP i.MX8M Plus device is promising for real-time audio data analysis, which can help track forest elephants and prevent poaching. As progress is made in this area, the success of real-time monitoring will depend on the quality of the model inference. Therefore, additional efforts should be focused on improving the model, selecting the most suitable architecture, and refining the dataset to achieve superior accuracy and meet specific use case requirements. With continued dedication, AI-powered audio monitoring has the potential to revolutionize elephant conservation and combat poaching activities in the African rainforest.
Special thanks to all the collaborators: Alexandra, Estine, Masum, Simon, Feline, Jackline, Jaka, Juan, Kumaran, Muhammad, Ronan, Sebas, Timothy, Gerson, Karl, Patryk, Ryan, Traun & Emile