Construction of an Improved ConvNeXt-Based Model for Recognizing Wildlife Images from Infrared Cameras in the Beijing Area

doi:10.11707/j.1001-7488.LYKX20230276

Abstract

Objective: Infrared cameras produce large volumes of wildlife images, a high proportion of which are invalid, and the image backgrounds are complex. This study proposes a model that recognizes such images automatically and with high accuracy, to support biodiversity research and wildlife conservation more efficiently.

Method: Approximately 5 TB of image data captured over the past 4 years by infrared cameras at the stations of the Beijing Ecological Observatory Network were collected and organized. After manual annotation and data augmentation, a dataset of 4 234 images in 10 categories was built. Based on the ConvNeXt convolutional neural network and the characteristics of the Beijing wildlife image dataset, the BSGG-ConvNeXt model was designed, using BlurPool, SENet, a global response normalization (GRN) layer, and GCNet to improve recognition ability. The effect of training strategies on the recognition accuracy of ConvNeXt was explored on the self-built dataset, and BSGG-ConvNeXt was compared with other classic models to clarify its advantages. The public infrared wildlife datasets Snapshot Serengeti (SS) and Caltech Camera Traps (CCT) were used to verify the generalization ability of the model.

Result: Taking the ConvNeXt-T size of ConvNeXt as an example, its accuracy on the self-built dataset was 74.13%, with 4.47×10⁹ multiply-accumulate operations (MACs). Among the individual improvements, BlurPool raised accuracy by 2.2 percentage points and reduced MACs to 1.07×10⁹; SENet raised accuracy by 3.2 percentage points; GRN with the scaling layer removed raised accuracy to 87.18% and increased the number of parameters to 27.88×10⁶; GCNet raised accuracy to 75.44% without increasing the computational load, but increased the number of parameters to 28.25×10⁶. Combining these improvements and applying the resulting BSGG-ConvNeXt to ConvNeXt-T gives the BSGG-ConvNeXt-T model: the number of parameters increases slightly, but MACs fall to 1.07×10⁹ and accuracy rises to 83.63%, higher than the original model. With pre-trained weights, BSGG-ConvNeXt-T reaches an accuracy of 94.07%, higher than ResNet-50 (76.39%), ResNeXt-50 (87.60%), MobileViT (90.00%), DenseNet (87.66%), RegNet (69.90%), ConvNeXt V2 (91.93%), Swin Transformer (86.23%), and MobileOne (71.53%). Applied to four network sizes of ConvNeXt, BSGG-ConvNeXt outperformed the unimproved model on the self-built dataset in every case. BSGG-ConvNeXt reached a recognition accuracy of 50.28% on the SS dataset and 56.15% on the CCT dataset, both higher than the original model.

Conclusion: The BSGG-ConvNeXt model recognizes wildlife images captured by infrared cameras with higher accuracy, performs well on both the self-built and the public infrared wildlife image datasets, and shows a certain degree of generalization ability.

Key words: wildlife, image recognition, deep learning, convolutional neural network, ConvNeXt
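The abstract names BlurPool as one of the four additions but does not describe it. As a rough illustration only, the sketch below shows the general idea of anti-aliased downsampling on a 1-D signal: apply a small low-pass filter, then subsample. The function names, the binomial kernel [1, 2, 1]/4, the reflect padding, and the 1-D setting are all assumptions made for this example; the paper's 2-D layer and its placement inside ConvNeXt are not specified here.

```python
def blur_pool_1d(signal, stride=2):
    """Anti-aliased downsampling sketch: low-pass filter, then subsample.

    Assumptions for illustration: binomial kernel [1, 2, 1] / 4 and
    reflect padding of one sample on each side.
    """
    kernel = (0.25, 0.5, 0.25)
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples for reflect padding")
    # Reflect padding keeps the blurred output the same length as the input.
    padded = [signal[1]] + list(signal) + [signal[-2]]
    blurred = [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(n)
    ]
    # Subsample only after blurring.
    return blurred[::stride]


def naive_subsample(signal, stride=2):
    """Plain strided subsampling with no low-pass filter."""
    return list(signal)[::stride]


# An alternating signal and the same signal shifted by one sample.
signal = [0, 1, 0, 1, 0, 1, 0, 1]
shifted = signal[1:] + signal[:1]

print(naive_subsample(signal), naive_subsample(shifted))
print(blur_pool_1d(signal), blur_pool_1d(shifted))
```

On this toy input, plain subsampling gives completely different outputs for the original and the shifted signal, while the blurred version gives the same output for both, which is the kind of shift robustness the technique targets.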

CLC number:

TP391.4

Jiandong Qi, Shangzi Zheng, Ziyi Chen, Zhongtian Ma. Wildlife Image Recognition of Infrared Cameras in Beijing Area Based on an Improvement ConvNeXt Model[J]. Scientia Silvae Sinicae, 2024, 60(8): 33-45.



Animal species	Number of images
Hog badger (Arctonyx collaris)	252
Birds, excluding ducks (Aves)	1 071
Wild boar (Sus scrofa)	112
Leopard cat (Prionailurus bengalensis)	119
Deer (Cervus axis)	1 199
Goat (Capra hircus)	332
Feral dog (Canis lupus familiaris)	105
Hare (Lepus sinensis)	126
Squirrel (Sciurus vulgaris)	300
Ducks (mallard)	619
Total	4 234

SS dataset subset		CCT dataset subset
Animal species	Number of images	Animal species	Number of images
Topi (Damaliscus lunatus)	571	Raccoon (Procyon lotor)	1 101
Birds (Aves)	980	Birds (Aves)	982
Giraffe (Giraffa camelopardalis)	1 000	Dog (Canis dingo)	419
Zebra (Equus burchellii)	1 000	Rodents, excluding squirrels (Geomys bursarius)	464
Oryx (Oryx)	1 000	Cat (Prionailurus bengalensis)	543
Buffalo (Bubalus bubalus)	1 000	Deer (Cervus axis)	1 256
Police dog (Canis lupus familiaris)	1 000	Coyote (Canis latrans)	1 720
Elephant (Elephas maximus)	1 000	Cattle (Bos taurus)	332
Guineafowl (Numididae)	1 000	Wild cat (Prionailurus bengalensis)	789
Hyena (Hyaenidae)	1 000	Squirrel (Sciurus carolinensis)	445
Addax (Addax nasomaculatus)	1 000	Skunk (Mephitis mephitis)	180
Impala (Aepyceros melampus)	1 000	Fox (Vulpes vulpes)	239
Gazelle (Gazella)	1 000	Hare (Lepus sinensis)	1 237
Wildebeest (Connochaetes taurinus)	1 000	Opossum (Didelphis virginiana)	1 622
Total	13 551	Total	11 329

Model	Number of input channels per stage	Number of stacked blocks per stage
ConvNeXt-T	(96, 192, 384, 768)	(3, 3, 9, 3)
ConvNeXt-S	(96, 192, 384, 768)	(3, 3, 27, 3)
ConvNeXt-B	(128, 256, 512, 1 024)	(3, 3, 27, 3)
ConvNeXt-L	(192, 384, 768, 1 536)	(3, 3, 27, 3)
ConvNeXt-XL	(256, 512, 1 024, 2 048)	(3, 3, 27, 3)
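The size table above maps naturally onto a small configuration structure. The sketch below is one possible encoding, not the authors' code: the dictionary name `CONVNEXT_CONFIGS`, the keys `dims` and `depths`, and the helper functions are hypothetical names chosen for this example; only the numbers come from the table.

```python
# Channel widths and block counts per stage, copied from the table above.
# The names CONVNEXT_CONFIGS, "dims" and "depths" are illustrative only.
CONVNEXT_CONFIGS = {
    "ConvNeXt-T": {"dims": (96, 192, 384, 768), "depths": (3, 3, 9, 3)},
    "ConvNeXt-S": {"dims": (96, 192, 384, 768), "depths": (3, 3, 27, 3)},
    "ConvNeXt-B": {"dims": (128, 256, 512, 1024), "depths": (3, 3, 27, 3)},
    "ConvNeXt-L": {"dims": (192, 384, 768, 1536), "depths": (3, 3, 27, 3)},
    "ConvNeXt-XL": {"dims": (256, 512, 1024, 2048), "depths": (3, 3, 27, 3)},
}


def total_blocks(name):
    """Total number of stacked blocks across the four stages."""
    return sum(CONVNEXT_CONFIGS[name]["depths"])


def widths_double_each_stage(name):
    """Check that each stage has twice the channels of the previous one."""
    dims = CONVNEXT_CONFIGS[name]["dims"]
    return all(b == 2 * a for a, b in zip(dims, dims[1:]))


for name in CONVNEXT_CONFIGS:
    print(name, total_blocks(name), widths_double_each_stage(name))
```

Read this way, the table shows that ConvNeXt-T and ConvNeXt-S differ only in the depth of the third stage (9 versus 27 blocks), while the larger sizes keep the ConvNeXt-S depths and widen the channels.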

Scheme No.	Model	MACs	Params	Accuracy (%)
Original	ConvNeXt-T	4.47×10⁹	27.83×10⁶	74.13
1	ConvNeXt-T+BP	1.07×10⁹	27.83×10⁶	76.39
2	ConvNeXt-T+SENet	4.47×10⁹	28.24×10⁶	77.34
3	ConvNeXt-T+GRN, scale layer removed	4.47×10⁹	27.88×10⁶	87.18
4	ConvNeXt-T+GCNet	4.47×10⁹	28.25×10⁶	75.44
5	ConvNeXt-T+BSGG-ConvNeXt	1.07×10⁹	28.71×10⁶	83.63
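The per-scheme gains quoted in the abstract can be rechecked directly from the table above. The following is a small arithmetic sketch; the variable and function names are made up for this example, and the numbers are copied from the table rows.

```python
# Baseline ConvNeXt-T values from the "Original" row of the table above.
BASELINE_ACC = 74.13
BASELINE_MACS = 4.47e9

# One entry per scheme: MACs and accuracy (%) as listed in the table.
SCHEMES = {
    "BlurPool": {"macs": 1.07e9, "acc": 76.39},
    "SENet": {"macs": 4.47e9, "acc": 77.34},
    "GRN, scale layer removed": {"macs": 4.47e9, "acc": 87.18},
    "GCNet": {"macs": 4.47e9, "acc": 75.44},
    "BSGG-ConvNeXt": {"macs": 1.07e9, "acc": 83.63},
}


def accuracy_gain(scheme):
    """Accuracy gain over the baseline, in percentage points."""
    return round(SCHEMES[scheme]["acc"] - BASELINE_ACC, 2)


def macs_reduction_factor(scheme):
    """How many times fewer MACs the scheme needs than the baseline."""
    return BASELINE_MACS / SCHEMES[scheme]["macs"]


for name in SCHEMES:
    print(
        f"{name}: {accuracy_gain(name):+.2f} points, "
        f"MACs reduced {macs_reduction_factor(name):.2f}x"
    )
```

From the table values, BlurPool and SENet give gains of 2.26 and 3.21 percentage points, in line with the roughly 2.2 and 3.2 reported in the abstract, and the two schemes that include BlurPool need about 4.2 times fewer MACs than the baseline.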

Model	MACs	Params	Accuracy (%)
ConvNeXt-T	4.47×10⁹	27.83×10⁶	69.40
BSGG-ConvNeXt-T	1.07×10⁹	28.71×10⁶	83.63
ConvNeXt-S	8.7×10⁹	49.46×10⁶	73.43
BSGG-ConvNeXt-S	1.26×10⁹	51.08×10⁶	83.39
ConvNeXt-B	15.38×10⁹	87.68×10⁶	74.02
BSGG-ConvNeXt-B	2.21×10⁹	90.38×10⁶	82.70
ConvNeXt-L	34.4×10⁹	196.25×10⁶	78.13
BSGG-ConvNeXt-L	4.90×10⁹	202.42×10⁶	80.31
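To make the trade-off in the table above easier to read, the sketch below computes, for each network size, the accuracy gain, the MACs reduction factor, and the relative parameter overhead of BSGG-ConvNeXt over the unimproved model. The data layout and the function name `compare` are assumptions for this example; the figures are taken from the table.

```python
# (MACs, parameters, accuracy %) for each size, copied from the table above.
SIZES = {
    "T": {"base": (4.47e9, 27.83e6, 69.40), "bsgg": (1.07e9, 28.71e6, 83.63)},
    "S": {"base": (8.7e9, 49.46e6, 73.43), "bsgg": (1.26e9, 51.08e6, 83.39)},
    "B": {"base": (15.38e9, 87.68e6, 74.02), "bsgg": (2.21e9, 90.38e6, 82.70)},
    "L": {"base": (34.4e9, 196.25e6, 78.13), "bsgg": (4.90e9, 202.42e6, 80.31)},
}


def compare(size):
    """Return accuracy gain, MACs reduction and parameter overhead for a size."""
    macs0, params0, acc0 = SIZES[size]["base"]
    macs1, params1, acc1 = SIZES[size]["bsgg"]
    return {
        "acc_gain_points": acc1 - acc0,
        "macs_reduction_factor": macs0 / macs1,
        "param_overhead_pct": 100.0 * (params1 - params0) / params0,
    }


for size in SIZES:
    result = compare(size)
    print(
        f"ConvNeXt-{size}: {result['acc_gain_points']:+.2f} points, "
        f"{result['macs_reduction_factor']:.2f}x fewer MACs, "
        f"{result['param_overhead_pct']:.2f}% more parameters"
    )
```

On these figures the parameter overhead stays close to 3% at every size, the MACs reduction grows from about 4x to about 7x, and the accuracy gain shrinks as the network gets larger.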