JOSA Term Browser Project

17 May 2016

The JOSA Term Browser project was inspired by the Google Books Ngram Viewer and championed by OSA past president and Optics Express founding editor, Joseph Eberly. Our aim is to provide interested researchers with a simple tool to view trends for meaningful terms that appeared in the journals JOSA, JOSA A, and JOSA B over the past 100 years.

Methodology

The JOSA Term Browser allows users to enter freeform terms and immediately see a plot showing the relative frequencies of those terms across the years. To prepare the dataset, OSA staff primarily used the Python Natural Language Toolkit (NLTK, http://www.nltk.org/) to preprocess text and to isolate one-, two-, and three-word phrases (ngrams) along with their relative frequencies by year.

term    year  word count by year  term count by year  term frequency  jnl
bessel  1982  397417              94                  0.000236527     JOSA
bessel  1983  425003              54                  0.000127058     JOSA
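The term frequency column appears to be the term count divided by the total word count for that year and journal. This reading is inferred from the two sample rows above rather than stated outright, so the short Python check below is a sketch under that assumption; the function name term_frequency is illustrative only.

```python
# Sample rows from the dataset excerpt: (term, year, word_count, term_count, journal)
rows = [
    ("bessel", 1982, 397417, 94, "JOSA"),
    ("bessel", 1983, 425003, 54, "JOSA"),
]

def term_frequency(term_count, word_count):
    """Relative frequency: occurrences of the term per word counted that year."""
    return term_count / word_count

# Assumed relationship: term frequency = term count by year / word count by year
frequencies = {
    (term, year): term_frequency(term_count, word_count)
    for term, year, word_count, term_count, _journal in rows
}

for (term, year), freq in frequencies.items():
    print(term, year, f"{freq:.9f}")
```

Both computed values agree with the published frequencies to nine decimal places, which supports the assumed definition.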

Tableau software was used to build the data visualizations that appear on the JOSA Centennial microsite.

Source text

The titles, abstracts, and body paragraphs for each article published in JOSA, JOSA A, and JOSA B were extracted for this project. Because OSA has converted its journal back file to an XML format, we were able to exclude unwanted noisy components such as reference lists and acknowledgments. We also excluded certain unwanted (also noisy) article types, including book reviews, calls for papers, and retractions.
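As a rough illustration of why an XML back file makes this filtering straightforward, the sketch below keeps only the title, abstract, and body paragraphs and skips excluded article types. The element names, the article-type attribute, and its values are hypothetical placeholders; the actual OSA schema is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names; the real OSA schema is not given in the text.
EXCLUDED_TYPES = {"book-review", "call-for-papers", "retraction"}
KEPT_PATHS = ("front/title", "front/abstract", "body/p")

def extract_text(xml_string):
    """Return title, abstract, and body paragraphs, or None for an excluded article type."""
    article = ET.fromstring(xml_string)
    if article.get("article-type") in EXCLUDED_TYPES:
        return None
    parts = []
    for path in KEPT_PATHS:
        for element in article.findall(path):
            # itertext() also collects text nested inside inline markup
            parts.append(" ".join("".join(element.itertext()).split()))
    return "\n".join(parts)

# Invented sample article, used only to exercise the function
sample = """<article article-type="research-article">
  <front><title>Bessel beams</title><abstract>We study <i>Bessel</i> beams.</abstract></front>
  <body><p>Diffraction-free beams propagate.</p></body>
  <back><ack>We thank our colleagues.</ack><ref-list><ref>Example reference</ref></ref-list></back>
</article>"""

extracted = extract_text(sample)
print(extracted)
```

Because the sketch lists the components to keep rather than the ones to drop, reference lists and acknowledgments are left out without being named.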

Preprocessing the source text

OSA staff used XSLT and Python NLTK scripts to normalize text extracted from journal articles. Normalization involved removing all punctuation, converting all text to lowercase, eliminating any single-character words, converting all text to US ASCII equivalents, and eliminating stopwords, both common stopwords such as we, you, and they and custom stopwords such as fig, introduction, and acknowledgments.
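The normalization steps listed above can be sketched with the Python standard library alone. This is not the XSLT and NLTK pipeline OSA staff ran. The stopword sets are small illustrative samples, the order of the steps is a guess, and replacing punctuation with a space (rather than deleting it) is an assumption that splits hyphenated words.

```python
import re
import unicodedata

# Illustrative stopword lists only; the lists OSA staff actually used are not given here.
COMMON_STOPWORDS = {"we", "you", "they", "the", "of", "and", "in", "is"}
CUSTOM_STOPWORDS = {"fig", "introduction", "acknowledgments"}
STOPWORDS = COMMON_STOPWORDS | CUSTOM_STOPWORDS

def normalize(text):
    """Return a list of cleaned tokens following the steps described in the text."""
    # Decompose accented characters, then drop anything without an ASCII equivalent
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase, then replace punctuation with spaces (assumption: replace, not delete)
    cleaned = re.sub(r"[^a-z0-9\s]", " ", ascii_text.lower())
    # Drop single-character words and stopwords
    return [tok for tok in cleaned.split() if len(tok) > 1 and tok not in STOPWORDS]

tokens = normalize("In Fig. 2, we plot the Bessel function J0(x) of a Fabry-Pérot étalon.")
print(tokens)
# ['plot', 'bessel', 'function', 'j0', 'fabry', 'perot', 'etalon']
```

NLTK ships its own English stopword corpus, which would be a natural replacement for the hand-written COMMON_STOPWORDS set in a fuller version.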

Generating ngrams

Python NLTK has functions for outputting ngrams of any length. We chose to output terms of one, two, and three words, and an ngram was output only if it appeared at least a minimum number of times in a given year: 8 times for one-grams, 6 times for two-grams, and 5 times for three-grams. OSA staff set these thresholds by trial and error.
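A minimal sketch of the counting and filtering step follows, reading the thresholds as minimum yearly counts. It uses only the standard library so that it runs without NLTK installed; nltk.util.ngrams produces the same tuples as the small ngrams helper here. The helper names and the toy token stream are illustrative and do not come from the OSA scripts.

```python
from collections import Counter

# Minimum yearly counts from the text: 8 for one-grams, 6 for two-grams, 5 for three-grams
THRESHOLDS = {1: 8, 2: 6, 3: 5}

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def frequent_ngrams(tokens_for_year):
    """Count one-, two-, and three-grams and keep those that meet their threshold."""
    kept = {}
    for n, minimum in THRESHOLDS.items():
        counts = Counter(ngrams(tokens_for_year, n))
        for gram, count in counts.items():
            if count >= minimum:
                kept[" ".join(gram)] = count
    return kept

# Toy token stream standing in for one year of normalized text
tokens_for_year = ["bessel", "beam", "profile"] * 6 + ["bessel", "function"] * 2
result = frequent_ngrams(tokens_for_year)
print(result)
```

In this toy stream "bessel" occurs 8 times and is kept, while "beam" occurs 6 times and falls below the one-gram threshold even though the two-gram "bessel beam", also with 6 occurrences, is retained. The sketch treats the whole year as one token stream, so ngrams can span article boundaries; whether the original scripts allowed this is not stated.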

Limitations and future steps

The Optical Society is exploring the use of ngram analysis along with other methods for identifying the topical trends in both legacy and current content. The JOSA Term Browser offered through the JOSA Centennial microsite has several limitations. Ngrams are limited to just one, two, or three words. Text used to create ngrams was not tagged with parts of speech, common entities (such as people and places), or terms from OSA's new Optics and Photonics Thesaurus. Such tagging could improve our ability to understand terms in context and allow us to create more accurate and better organized datasets for trend analysis. Future term browsers might include such enhancements and also serve as a discovery portal into the OSA Publishing database.