ESARDA
Scientific paper

NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains

ESARDA Bulletin - The International Journal of Nuclear Safeguards and Non-Proliferation

Details

Identification
ISSN: 1977-5296, DOI: 10.3011/ESARDA.IJNSNP.2021.9
Publication date
1 December 2021
Author
Joint Research Centre

Description

Volume: 63, December 2021, pages 30-40,
Special Issue on Data Analytics for Safeguards and Non-Proliferation

Authors: Lee Burke1, Karl Pazdernik1, Daniel Fortin1, Benjamin Wilson1, Rustam Goychayev1 and John Mattingly2

1Pacific Northwest National Laboratory, 2North Carolina State University

Abstract:

Natural language processing (NLP) tasks (text classification, named entity recognition, etc.) have seen revolutionary improvements over the last few years. This is due to language models such as BERT that achieve deep knowledge transfer by using a large pre-trained model, then fine-tuning the model on specific tasks. The BERT architecture has shown even better performance on domain-specific tasks when the model is pre-trained using domain-relevant texts. Inspired by these recent advancements, we have developed NukeLM, a nuclear-domain language model pre-trained on 1.5 million abstracts from the U.S. Department of Energy Office of Scientific and Technical Information (OSTI) database. This NukeLM model is then fine-tuned for the classification of research articles into either binary classes (related to the nuclear fuel cycle [NFC] or not) or multiple categories related to the subject of the article. We show that continued pre-training of a BERT-style architecture prior to fine-tuning yields greater performance on both article classification tasks. This information is critical for properly triaging manuscripts, a necessary task for better understanding citation networks of researchers who publish in the nuclear space, and for uncovering new areas of research in the nuclear (or nuclear-relevant) domains.
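The pre-train-then-fine-tune pattern the abstract describes can be sketched in miniature. The snippet below is an illustrative sketch only: the `encode` function is a hashed bag-of-words stand-in for the real pre-trained NukeLM encoder, the training texts are invented examples (not OSTI data), and only a binary classification head is trained on frozen encoder outputs; in the actual work, a BERT-style model pre-trained on OSTI abstracts is fine-tuned end-to-end.

```python
import hashlib
import numpy as np

DIM = 64  # embedding size for the toy encoder

def encode(text, dim=DIM):
    """Hypothetical stand-in for a pre-trained encoder. In NukeLM this would
    be a BERT-style model whose pooled output summarizes an abstract; here
    tokens are hashed into a fixed-size bag-of-words vector for illustration."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune(texts, labels, dim=DIM, lr=1.0, epochs=500):
    """Train a binary classification head (logistic regression) on frozen
    encoder outputs -- the lightweight end of the fine-tuning spectrum."""
    X = np.stack([encode(t, dim) for t in texts])
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad = p - y                     # gradient of the cross-entropy loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(w, b, text, dim=DIM):
    """1 = related to the nuclear fuel cycle (NFC), 0 = not."""
    return int(sigmoid(encode(text, dim) @ w + b) > 0.5)

# Toy training set (invented examples, not OSTI data).
texts = [
    "uranium enrichment centrifuge cascade design",
    "spent fuel reprocessing and plutonium separation",
    "deep learning benchmarks for image segmentation",
    "protein folding with molecular dynamics",
]
labels = [1, 1, 0, 0]
w, b = fine_tune(texts, labels)
print(predict(w, b, texts[0]))
```

Full fine-tuning, as in the paper, would also update the encoder's weights rather than freezing them; the head-only variant here just makes the knowledge-transfer idea concrete in a few lines.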

Keywords: nuclear, energy, language, classification

Reference guideline:

Burke, L., Pazdernik, K., Fortin, D., Wilson, B., Goychayev, R., & Mattingly, J. (2021). NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains. ESARDA Bulletin - The International Journal of Nuclear Safeguards and Non-Proliferation, 63, 30-40. https://doi.org/10.3011/ESARDA.IJNSNP.2021.9


Files

NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains
English
(494.33 KB - PDF)