There is growing interest in using deep learning models to automate wildlife detection in aerial imaging surveys and increase their efficiency, but human-generated annotations remain necessary for model training. However, even skilled observers may diverge in interpreting aerial imagery of complex environments, and such disagreement can destabilize the models trained on their annotations. In this study, we present a framework for assessing annotation reliability by calculating agreement metrics for individual observers against an aggregated annotation set, generated by clustering the observations of multiple observers and selecting the mode classification within each cluster.
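The aggregation step described above can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' implementation: it assumes each annotation is a tuple of (observer id, x, y, class label), clusters annotations across observers with a simple greedy proximity rule, takes the mode label in each cluster as the consensus, and scores each observer's agreement as the fraction of their annotations matching the consensus. The `radius` parameter and greedy clustering are assumptions for illustration; the actual study may use a different clustering method and agreement metric.

```python
from collections import Counter

def cluster_annotations(annotations, radius=10.0):
    """Greedy clustering: join the first cluster whose centroid lies
    within `radius` of the annotation; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of (observer, x, y, label) tuples
    for ann in annotations:
        _, x, y, _ = ann
        for cl in clusters:
            cx = sum(a[1] for a in cl) / len(cl)
            cy = sum(a[2] for a in cl) / len(cl)
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                cl.append(ann)
                break
        else:
            clusters.append([ann])
    return clusters

def consensus_and_agreement(annotations, radius=10.0):
    """Return the mode label per cluster and each observer's
    fraction of annotations that match the cluster consensus."""
    clusters = cluster_annotations(annotations, radius)
    consensus, matches, totals = [], Counter(), Counter()
    for cl in clusters:
        mode_label = Counter(a[3] for a in cl).most_common(1)[0][0]
        consensus.append(mode_label)
        for obs, _, _, label in cl:
            totals[obs] += 1
            if label == mode_label:
                matches[obs] += 1
    agreement = {obs: matches[obs] / totals[obs] for obs in totals}
    return consensus, agreement

# Toy data: three observers annotate two animals; observer C
# mislabels the first one.
anns = [
    ("A", 0, 0, "deer"), ("B", 1, 1, "deer"), ("C", 0, 1, "elk"),
    ("A", 50, 50, "elk"), ("B", 51, 50, "elk"), ("C", 50, 51, "elk"),
]
consensus, agreement = consensus_and_agreement(anns)
# consensus is ["deer", "elk"]; observer C agrees on only 1 of 2 clusters
```

In this toy example, observers A and B reach full agreement with the consensus, while observer C's score of 0.5 flags them for review, which is the kind of per-observer reliability signal the framework is built to produce.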