Herbarium Data Analyst/Coordinator

About the Opportunity
As part of the biodiversity science efforts embedded within the Thriving California Initiative, the California Herbarium Specimen Digitization Project will make hundreds of thousands of specimens from the California Academy of Sciences (CAS) herbarium collections available online. The project uses high-throughput conveyor belt imaging to photograph California herbarium specimens housed at CAS. Specimen images and associated data records will then be uploaded to a community science platform for further transcription and georeferencing of label data. Afterwards, fully transcribed and georeferenced records will be imported into the CAS collections database and linked to their corresponding images. Results from this project will mark a major step forward in democratizing CAS museum collections, providing equitable access to these important specimens for people (botanists, scientists, and the general public) all over the world.

About the Botany Team
We are a team of botanists, scientists, professionals, and enthusiasts who collectively curate the Academy’s collection of over 2.3 million herbarium specimens. This position will broadly support adding collections imagery and label data to the CAS botany database. The work includes coordinating with scanning contractors, ingesting and cleaning label data, working with community science organizers to crowdsource data entry, and supporting OCR, georeferencing, and all related processes and technologies. The role is responsible for making data entry as efficient and accurate as possible by creating processes, working with colleagues, and using scripting/programming to automate those processes as needed.

Key Responsibilities

Work with the Botany Curator, Collection Manager, and Director of Scientific Computing to identify requirements for moving data between contracted imaging and transcription services and the CAS internal database and computational infrastructure. This includes scripted export to and ingest from community science data organizers, and scripted ingest of imagery and transcriptions
Coordinate with contractors to implement workflows and pipelines that are in line with the needs of internal CAS databases and computational infrastructure
Develop, test and modify workflows and pipelines to georeference specimens using transcribed label data
Develop, test and modify workflows and pipelines for high-level quality control, data modification and/or data reshaping as images and associated records move from one place to another; retest and revise these workflows and pipelines regularly, as needed
Coordinate QC and data modification efforts with other digitization technicians
Coordinate with contractors to alter data delivery techniques and/or formats as needed
Coordinate with collection preparators to maximize data collection efficiency
Follow all Academy safety regulations
Other duties as assigned

Qualifications
A qualified person for this position is capable of working with large datasets without seeing each piece of data individually. This person is capable of working with data in multiple formats and can modify data to suit different software and application needs. This person has either a background in the natural sciences with extensive database and programming experience, or has a background in bioinformatics and/or computer/data science with coursework and interest in the natural sciences.

Experience and/or Education:

Undergraduate degree required; Master's degree (or higher) preferred
Experience with building, managing, and/or maintaining SQL databases
Experience working with large datasets, including cleaning/validation/transformation, clustering, and formatting
Working knowledge of Python and preferably at least one other high-level language suitable for data analysis (e.g., R), along with related techniques (regular expressions, parsing, reading in formatted data, etc.)
Comfortable (ideally expert) with the Linux command line and shell scripting (bash, ssh, scp, rsync, awk, etc.)
Comfortable with task automation using scripting and programming tools
Knowledge of data cleaning tools (OpenRefine, Trifacta, etc.) and techniques
Working knowledge of common data formats (JSON, YAML, CSV, TSV) and the issues that arise with them (Unicode, whitespace, etc.)
Knowledge of biological data systems (GBIF, Encyclopedia of Life, NCBI, iNaturalist, etc.) and familiarity with geospatial data
Knowledge of taxonomy and classification, ideally botanical
Experience working as part of a team, with both independent and collaborative goals

Permalink: https://www.aspt.net/news-blog/2023/herbarium-data-analystcoordinator