Bibliographic details
Title: |
Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks |
Authors: |
Guralnick, Robert, LaFrance, Raphael, Denslow, Michael, Blickhan, Samantha, Bouslog, Mark, Miller, Sean, Yost, Jenn, Best, Jason, Paul, Deborah L., Ellwood, Elizabeth, Gilbert, Edward, Allen, Julie |
Source: |
Applications in Plant Sciences ; volume 12, issue 1 ; ISSN 2168-0450 |
Publication information: |
Wiley |
Publication year: |
2024 |
Collection: |
Wiley Online Library (Open Access Articles via Crossref) |
Description: |
Premise: Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long‐recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches. Methods: We present two new semi‐automated services. The first detects and classifies typewritten, handwritten, or mixed labels from herbarium sheets. The second uses a workflow tuned for specimen labels to label text using optical character recognition (OCR). The label finder and classifier was built via humans‐in‐the‐loop processes that utilize the community science Notes from Nature platform to develop training and validation data sets to feed into a machine learning pipeline. Results: Our results showcase a >93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre‐processing, multiple OCR engines, and post‐processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields >4‐fold reductions in errors compared to off‐the‐shelf open‐source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool. Discussion: Our work showcases a usable set of tools for herbarium digitization including a custom‐built web application that is freely accessible. Further work to better integrate these services into existing toolkits can support broad community use. |
Document type: |
article in journal/newspaper |
Language: |
English |
DOI: |
10.1002/aps3.11560 |
Availability: |
http://dx.doi.org/10.1002/aps3.11560 |
Rights: |
http://creativecommons.org/licenses/by/4.0/ |
Accession number: |
edsbas.5E74B6BE |
Database: |
BASE |