
Scientia Silvae Sinicae ›› 2025, Vol. 61 ›› Issue (2): 12-20. doi: 10.11707/j.1001-7488.LYKX20240436

• Special Topic: Smart Forestry •

Zero-Shot Classification of Bird Audio Based on Taxonomy

Shanshan Xie, Junguo Zhang, Jiangjian Xie*, Changchun Zhang

  1. School of Technology, Beijing Forestry University; Key Laboratory of National Forestry and Grassland Administration on Forestry Equipment and Automation; State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083
  • Received: 2024-07-13 Online: 2025-02-25 Published: 2025-03-03
  • Contact: Jiangjian Xie E-mail: xieshanshan@bjfu.edu.cn; shyneforce@bjfu.edu.cn
  • Funding: National Natural Science Foundation of China (62303063, 32371874).

Abstract:

Objective: A bird audio pretraining model built from a large number of audio-text pairs can perform zero-shot classification of audio from species that lack training samples by using side information about the species classes. This reduces the burden of data collection, provides a theoretical basis for research on zero-shot classification of bird audio, and offers a reference for ecological monitoring and analysis of species distribution changes in open environments. Method: Taxonomic information, which reflects the phylogenetic relationships of birds, was used as side information for the sound classes. A pretrained RoBERTa text encoder and a pretrained HTSAT audio encoder were used to extract semantic embeddings of the taxonomic information and acoustic embeddings of the bird audio, respectively. Contrastive learning was used to compute the similarity between the semantic and acoustic embeddings and to construct a contrastive language-audio pretraining model for birds (CLAP-Bird). Zero-shot classification was then performed using the side information of the zero-shot classes and the CLAP-Bird model. Result: The proposed method was trained and evaluated on a large, imbalanced bird audio dataset containing 725 hours of recordings. The average F1_score across five different test sets, each with 8 to 10 classes, was 0.289. Compared with baseline models that used bird scientific names, life history, and basic characteristics as side information for species classes, the proposed model markedly improved zero-shot classification performance for bird audio. Conclusion: As side information for species classes, the taxonomic information of birds provides biological and genetic information about the species, helps the model better capture the relationships among bird vocalizations, and improves the performance of zero-shot learning for bird audio. Moreover, the closer the taxonomic relationship between the training set and the test set, the better the zero-shot classification performance on the test set.
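
The matching step summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the embeddings below are hand-made placeholders standing in for HTSAT audio embeddings and RoBERTa embeddings of taxonomic descriptions, the function names are hypothetical, and the symmetric cross-entropy loss with a temperature of 0.07 is an assumption borrowed from common CLIP/CLAP-style setups, since the abstract does not specify the loss.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(audio_emb: np.ndarray, class_text_emb: np.ndarray) -> np.ndarray:
    """For each audio clip, return the index of the most similar class description."""
    sim = l2_normalize(audio_emb) @ l2_normalize(class_text_emb).T
    return sim.argmax(axis=1)


def symmetric_contrastive_loss(audio_emb: np.ndarray,
                               text_emb: np.ndarray,
                               temperature: float = 0.07) -> float:
    """Assumed CLIP-style loss: row i of audio_emb is paired with row i of text_emb."""
    logits = l2_normalize(audio_emb) @ l2_normalize(text_emb).T / temperature

    def cross_entropy_on_diagonal(z: np.ndarray) -> float:
        z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_prob)))

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


# Placeholder embeddings (hypothetical): three unseen classes, three audio clips.
class_text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0]])
audio_emb = np.array([[0.9, 0.1, 0.0, 0.2],
                      [0.1, 0.2, 0.8, 0.0],
                      [0.0, 0.7, 0.1, 0.1]])

predicted = zero_shot_classify(audio_emb, class_text_emb)
loss_matched = symmetric_contrastive_loss(audio_emb, class_text_emb[predicted])
loss_mismatched = symmetric_contrastive_loss(audio_emb, class_text_emb)
print(predicted, loss_matched, loss_mismatched)
```

In this toy setup, each clip is assigned to the class whose text embedding has the highest cosine similarity, and correctly paired audio-text batches yield a lower loss than mispaired ones. A real pipeline would replace the placeholder arrays with encoder outputs projected into a shared space.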

Key words: bird audio classification, zero-shot learning, taxonomy, side information for species class, contrastive learning
