ento-label-deberta
DeBERTa-v3 models fine-tuned for named-entity recognition (NER) on insect
collection labels. Given a raw label string, the model extracts semantic fields
as verbatim character spans.
Three sizes are included in this repo: small, base, and large
(subdirectories of the same name). ONNX exports are in onnx/small,
onnx/base, and onnx/large.
Entity types
| Label | Description |
| --- | --- |
| country | Country name |
| state | State, province, or region |
| verbatim_locality | Locality description |
| verbatim_date | Collection date as written |
| verbatim_elevation | Elevation as written |
| verbatim_collectors | Collector name(s) |
| verbatim_habitat | Habitat description |
| verbatim_method | Collection method |
| verbatim_latitude | Latitude as written |
| verbatim_longitude | Longitude as written |
Evaluation results (macro F1 per entity)
| Entity | small | base | large |
| --- | --- | --- | --- |
| country | 0.9695 | 0.9749 | 0.9751 |
| state | 0.9046 | 0.9220 | 0.9212 |
| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
| macro avg | 0.8360 | 0.8493 | 0.8377 |
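As a quick sanity check when choosing a size, the macro averages can be recomputed from the per-entity scores, along with the best-scoring size for each entity. The sketch below only copies the numbers from the table; the variable names are illustrative.

```python
# F1 scores copied from the table above, in column order (small, base, large).
sizes = ("small", "base", "large")
f1 = {
    "country": (0.9695, 0.9749, 0.9751),
    "state": (0.9046, 0.9220, 0.9212),
    "verbatim_locality": (0.8282, 0.8499, 0.8573),
    "verbatim_date": (0.9673, 0.9700, 0.9693),
    "verbatim_elevation": (0.9722, 0.9742, 0.9739),
    "verbatim_collectors": (0.4867, 0.5393, 0.5311),
    "verbatim_habitat": (0.7485, 0.7751, 0.7930),
    "verbatim_method": (0.9123, 0.9205, 0.9080),
    "verbatim_latitude": (0.7154, 0.7145, 0.6512),
    "verbatim_longitude": (0.8552, 0.8528, 0.7969),
}

# Macro average: unweighted mean over the 10 entity types, per model size.
macro = {
    size: round(sum(row[i] for row in f1.values()) / len(f1), 4)
    for i, size in enumerate(sizes)
}

# Best-scoring size for each entity type.
best = {
    entity: sizes[max(range(len(sizes)), key=lambda i: row[i])]
    for entity, row in f1.items()
}

print(macro)
print(best)
```

The recomputed averages match the last table row. Note that the small model scores highest on latitude and longitude, while large leads on locality and habitat.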
Usage (PyTorch)
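from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Each size lives in a subdirectory of the repo, so select it with `subfolder`
# (a Hub repo id cannot itself contain a second slash).
repo = "SpeciesFileGroup/ento-label-deberta"
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="base")
model = AutoModelForTokenClassification.from_pretrained(repo, subfolder="base")

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
for r in results:
    print(r["entity_group"], repr(r["word"]))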
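Because the fields are meant to be verbatim character spans, it can be safer to slice the original string with the `start` and `end` offsets than to rely on `word`, which is decoded from tokens and may differ in whitespace. The sketch below assumes each result carries `start` and `end` (the pipeline provides them when a fast tokenizer is in use). `spans_by_field` is a hypothetical helper, and the mock results are hand-written stand-ins, not model predictions.

```python
def spans_by_field(text, results):
    """Group pipeline output into {entity_group: [verbatim substring, ...]}."""
    fields = {}
    for r in results:
        # Slice the original string so punctuation and spacing are preserved.
        span = text[r["start"]:r["end"]]
        fields.setdefault(r["entity_group"], []).append(span)
    return fields


text = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"

# Hand-written stand-in for pipeline output; offsets are illustrative only.
mock_results = [
    {"entity_group": "country", "start": 0, "end": 5},
    {"entity_group": "state", "start": 7, "end": 16},
    {"entity_group": "verbatim_date", "start": 30, "end": 43},
]

print(spans_by_field(text, mock_results))
```

Values are kept as lists because a label may contain more than one span of the same type.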
Usage (ONNX / hugot)
ONNX models are compatible with hugot and ONNX Runtime. Load them from
onnx/small, onnx/base, or onnx/large.
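For ONNX Runtime in Python, inference might look roughly like the sketch below. Several details are assumptions rather than facts about this repo: that a local copy of one of the ONNX directories contains a file named `model.onnx` alongside the tokenizer files, and that the first graph output holds per-token logits. `pick_inputs` and `predict_label_ids` are hypothetical helper names.

```python
def pick_inputs(encoded, input_names):
    """Keep only the tensors that the ONNX graph declares as inputs."""
    return {name: encoded[name] for name in input_names if name in encoded}


def predict_label_ids(model_dir, text):
    """Return (tokens, label ids) for one label string.

    model_dir is assumed to be a local copy of e.g. onnx/base that holds
    model.onnx and the tokenizer files; the file name is an assumption.
    """
    # Imported lazily so pick_inputs stays usable without these packages.
    import numpy as np
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    session = ort.InferenceSession(f"{model_dir}/model.onnx")

    # 128 matches the max sequence length used in training.
    encoded = tokenizer(text, return_tensors="np", truncation=True, max_length=128)

    # Exports differ in whether they accept token_type_ids, so ask the graph.
    input_names = [inp.name for inp in session.get_inputs()]
    feeds = pick_inputs(dict(encoded), input_names)

    logits = session.run(None, feeds)[0]  # assumed shape: (1, seq_len, num_labels)
    label_ids = np.argmax(logits, axis=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())
    return tokens, label_ids
```

The label ids still need to be mapped to entity names through the `id2label` entry of the model's config, and sub-word tokens merged into character spans, which the PyTorch pipeline does automatically.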
Training
Each model was fine-tuned for 5 epochs with the Hugging Face Trainer. Hyperparameters:
| Parameter | small / base | large |
| --- | --- | --- |
| Learning rate | 5e-6 | 2e-6 |
| Batch size | 16 | 16 |
| LR scheduler | linear | linear |
| Warmup ratio | 0.06 | 0.06 |
| Weight decay | 0.01 | 0.01 |
| Max seq length | 128 | 128 |
Training data: ~22 000 insect collection label strings with character-span
annotations for the 10 entity types above.
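For anyone trying to reproduce a comparable run, the hyperparameter table could be expressed as `TrainingArguments` keyword arguments along these lines. This is a sketch, not the original training script: `training_kwargs` and `make_training_args` are hypothetical helpers, and reading the batch size as a per-device value is an assumption.

```python
MAX_SEQ_LENGTH = 128  # applied when tokenizing, not passed to the Trainer


def training_kwargs(size):
    """Return the table's hyperparameters for 'small', 'base', or 'large'."""
    if size not in ("small", "base", "large"):
        raise ValueError(f"unknown model size: {size!r}")
    return {
        "num_train_epochs": 5,
        # The large model uses a lower learning rate than small and base.
        "learning_rate": 2e-6 if size == "large" else 5e-6,
        # Assumption: the table's batch size is the per-device value.
        "per_device_train_batch_size": 16,
        "lr_scheduler_type": "linear",
        "warmup_ratio": 0.06,
        "weight_decay": 0.01,
    }


def make_training_args(size, output_dir):
    """Build TrainingArguments; output_dir is chosen by the caller."""
    # Imported lazily: TrainingArguments needs transformers with PyTorch.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs(size))
```

For example, `make_training_args("large", "out/large")` would yield arguments with a learning rate of 2e-6.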