JOSA Term-Browser Project
17 May 2016
The JOSA Term Browser project was inspired by the Google Books Ngram Viewer and championed by OSA past president and Optics Express founding editor, Joseph Eberly. Our aim is to provide interested researchers with a simple tool to view trends for meaningful terms that appeared in the journals JOSA, JOSA A, and JOSA B over the past 100 years.
Methodology
The JOSA Term Browser allows users to enter freeform terms and immediately see a plot showing the relative frequencies of those terms across the years. To prepare the dataset, OSA staff primarily used the Python Natural Language Toolkit (NLTK, http://www.nltk.org/) to preprocess text and to isolate one-, two-, and three-word phrases (ngrams) along with their relative frequencies by year.
term | term year | word count by year | term count by year | term frequency | jnl |
---|---|---|---|---|---|
bessel | 1982 | 397417 | 94 | 0.000236527 | JOSA |
bessel | 1983 | 425003 | 54 | 0.000127058 | JOSA |
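
The term frequency column in the sample rows above is consistent with dividing the term count for a year by the total word count for that year. The following is a minimal sketch of that arithmetic, not the OSA scripts themselves; the function name `relative_frequency` and the tuple layout are assumptions made for illustration.

```python
def relative_frequency(term_count: int, word_count: int) -> float:
    """Return the share of all words in a year accounted for by one term."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return term_count / word_count


# Sample rows copied from the table: (term, year, word count, term count, journal)
rows = [
    ("bessel", 1982, 397417, 94, "JOSA"),
    ("bessel", 1983, 425003, 54, "JOSA"),
]

for term, year, word_count, term_count, journal in rows:
    freq = relative_frequency(term_count, word_count)
    print(f"{term} {year} {journal} {freq:.9f}")
```

Normalizing by the yearly word count is what makes years with very different publication volumes comparable on one plot.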
Tableau software was used to build the data visualizations that appear on the JOSA Centennial microsite.
Source text
The titles, abstracts, and body paragraphs for each article published in JOSA, JOSA A, and JOSA B were extracted for this project. Because OSA has converted its journal back file to an XML format, we were able to exclude noisy components such as reference lists and acknowledgments. We also excluded article types that add noise rather than meaningful terms, including book reviews, calls for papers, and retractions.
Preprocessing the source text
OSA staff used XSLT and Python NLTK scripts to normalize text extracted from journal articles. Normalization involved removing all punctuation, setting all text in lowercase, eliminating any single-character words, converting all text to US ASCII equivalents, and eliminating stopwords. The stopwords included both common ones, such as we, you, and they, and custom ones, such as fig, introduction, and acknowledgments.
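
The normalization steps described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the XSLT and NLTK scripts OSA staff used: the stopword sets are tiny hypothetical samples, the order of the steps is a guess, and replacing punctuation with a space (rather than deleting it) is a design choice made here so that hyphenated words split into separate tokens.

```python
import re
import unicodedata
from typing import List

# Hypothetical, deliberately tiny stopword lists for illustration only.
COMMON_STOPWORDS = {"we", "you", "they", "the", "of", "and", "in", "is", "a"}
CUSTOM_STOPWORDS = {"fig", "introduction", "acknowledgments"}
STOPWORDS = COMMON_STOPWORDS | CUSTOM_STOPWORDS


def normalize(text: str) -> List[str]:
    """Return the cleaned tokens for one passage of article text."""
    # Convert to ASCII equivalents: decompose accented characters,
    # then drop anything that has no ASCII representation.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    lowered = ascii_text.lower()
    # Replace every character that is not a letter, digit, or whitespace
    # with a space (assumption: punctuation separates tokens).
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = no_punct.split()
    # Drop single-character tokens and stopwords.
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]


sample = "In Fig. 2, we plot the Bessel function J0 of a Fabry-Pérot étalon."
print(normalize(sample))
```

With the sample sentence, the accented characters are reduced to plain letters, "Fig" and "we" are removed as stopwords, and the lone digit "2" is removed as a single-character token.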
Generating ngrams
Python NLTK has functions for outputting ngrams of any length. We chose to output terms with one, two, and three words, and an ngram was output only if it appeared at least a minimum number of times in a given year: 8 times for one-grams, 6 times for two-grams, and 5 times for three-grams. OSA staff set these thresholds by trial and error.
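
As a rough sketch of the counting and thresholding step, the example below uses only the standard library (NLTK's `nltk.util.ngrams` offers comparable ngram generation). Several details are assumptions made for illustration: the thresholds are read as minimum counts per year, the input is taken to be one token list per article so that ngrams never span two articles, and the names `MIN_COUNT`, `ngrams`, and `count_ngrams_by_year` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple

# Minimum yearly count for an ngram of each length (values from the text above).
MIN_COUNT = {1: 8, 2: 6, 3: 5}


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams_by_year(
    articles_by_year: Dict[int, List[List[str]]]
) -> Dict[Tuple[str, int], int]:
    """Map (term, year) to its count, keeping only terms that meet the threshold."""
    kept: Dict[Tuple[str, int], int] = {}
    for year, articles in articles_by_year.items():
        for n, threshold in MIN_COUNT.items():
            counts: Counter = Counter()
            for tokens in articles:
                # Count within each article so ngrams do not cross article boundaries.
                counts.update(" ".join(gram) for gram in ngrams(tokens, n))
            for term, count in counts.items():
                if count >= threshold:
                    kept[(term, year)] = count
    return kept


# Toy input: six articles containing "bessel function", two containing "bessel beam".
toy = {1982: [["bessel", "function"]] * 6 + [["bessel", "beam"]] * 2}
print(count_ngrams_by_year(toy))
```

In the toy input, "bessel" occurs 8 times and "bessel function" 6 times, so both are kept, while "function" (6 occurrences, below the one-gram threshold of 8) and "bessel beam" (2 occurrences) are dropped.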
Limitations and future steps
The Optical Society is exploring the use of ngram analysis along with other methods for identifying the topical trends in both legacy and current content. The JOSA Term Browser offered through the JOSA Centennial microsite has several limitations. Ngrams are limited to just one, two, or three words. Text used to create ngrams was not tagged with parts of speech, common entities (such as people and places), or terms from OSA's new Optics and Photonics Thesaurus. Such tagging could improve our ability to understand terms in context and allow us to create more accurate and better organized datasets for trend analysis. Future term browsers might include such enhancements and also serve as a discovery portal into the OSA Publishing database.