
Scientia Silvae Sinicae, 2025, Vol. 61, Issue (2): 12-20. doi: 10.11707/j.1001-7488.LYKX20240436

• Special subject: Smart forestry

Zero-Shot Classification of Bird Audio Based on Taxonomy

Shanshan Xie, Junguo Zhang, Jiangjian Xie*, Changchun Zhang

  1. School of Technology, Beijing Forestry University; Key Laboratory of National Forestry and Grassland Administration on Forestry Equipment and Automation; State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083
  • Received: 2024-07-13  Online: 2025-02-25  Published: 2025-03-03
  • Contact: Jiangjian Xie  E-mail: xieshanshan@bjfu.edu.cn; shyneforce@bjfu.edu.cn

Abstract:

Objective: A bird audio pretraining model built from a large number of audio-text pairs can use side information about species classes to perform zero-shot classification of bird audio for which training samples are insufficient. This approach can reduce the burden of data collection, provide an effective theoretical basis for zero-shot classification of bird audio, and support ecological monitoring and the analysis of changes in species distribution in open environments.
Method: Taxonomic information reflecting the phylogenetic relationships of birds was used as side information for species classes. A pretrained RoBERTa text encoder and a pretrained HTSAT audio encoder were used to extract semantic embeddings of the taxonomy and acoustic embeddings of the audio, respectively. Contrastive learning was then used to compute the similarity between semantic and acoustic embeddings and to construct a contrastive language-audio pretraining model for birds (CLAP-Bird). Finally, zero-shot classification of bird audio was performed using the side information of the zero-shot classes and the CLAP-Bird model.
Result: The proposed method was trained and evaluated on a large, imbalanced bird audio dataset containing 725 hours of recordings. The average F1-score across five different test sets, each containing 8 to 10 classes, was 0.289. Compared with baseline models that used bird scientific names, life histories, and basic characteristics as side information for species classes, the proposed model significantly improved zero-shot classification performance on bird audio.
Conclusion: Using bird taxonomy as side information for species classes provides insight into the biological and genetic relationships among bird species, helps the model better understand the connections between bird sounds, and improves the performance of zero-shot bird audio classification.
Moreover, the closer the taxonomic relationship between the training set and the test set, the better the zero-shot classification performance on the test set.
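The Method paragraph describes a CLIP-style pipeline: taxonomy text and audio are embedded, a contrastive loss aligns matched pairs, and an unseen class is predicted by picking the class text embedding most similar to the audio embedding. The following is a minimal, dependency-free sketch of those steps under stated assumptions, not the authors' CLAP-Bird implementation. Toy vectors stand in for RoBERTa and HTSAT outputs; `taxonomy_prompt` is a hypothetical text format joining order, family, genus and species; the symmetric InfoNCE loss and the temperature of 0.07 are common CLIP/CLAP defaults, not values reported in the abstract; and `macro_f1` assumes the reported F1-score is macro-averaged within each test set.

```python
import math


def taxonomy_prompt(order, family, genus, species):
    """Hypothetical format: join taxonomic ranks into one string for the text encoder."""
    return f"{order} {family} {genus} {species}"


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(audio_embs, text_embs):
    """Cosine similarity of every audio embedding (rows) with every text embedding (columns)."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) for tj in t] for ai in a]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clap_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (audio, text) pairs; temperature is assumed."""
    sims = cosine_matrix(audio_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


def zero_shot_predict(audio_emb, class_text_embs):
    """Index of the (possibly unseen) class whose side-information embedding is most similar."""
    sims = cosine_matrix([audio_emb], class_text_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes present in y_true or y_pred."""
    scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In this sketch, new zero-shot classes need no retraining: each class's taxonomy prompt would be passed through the text encoder once, and `zero_shot_predict` compares a recording's audio embedding against the resulting list. Species that share an order, family, or genus also share most of their prompt text, which is one plausible reason taxonomic closeness between training and test classes helps.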

Key words: bird audio classification, zero-shot learning, taxonomy, side information for species class, contrastive learning
