Linguistic wonders

Posted on April 1, 2019 by lembrechtsjonas

So, I had a little question: did my English vocabulary improve after 5 years of paper writing? Good question, I thought, and nothing our good friend R could not help me answer! So I dove into the data, read all 10 of my first-author publications (8 published, 2 in the review stage) into R (after cleaning away references and figures), and boldly asked R: are there more unique words in my more recent work? I went for the easy answer at first: I took a sample of 2000 words (to unify the length of each paper) from every paper and calculated how many unique words could be found in each one of them. The results were… disappointing (Fig. 1):

Figure 1: The number of unique words in a standardized sample of 2000 words (approx. 3/4 of most papers) for each of my first-author publications, chronologically ordered (paper 1 published in 2014, papers 9 and 10 currently in the review stage).
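For the curious, the counting behind Fig. 1 can be sketched in a few lines. This is a minimal Python sketch, not the R code used for the figure. It assumes the 2000 words are drawn at random without replacement (the post does not say how the sample was taken), and the tokenizer and the function names `tokenize` and `unique_words_in_sample` are hypothetical choices made for illustration.

```python
import random
import re


def tokenize(text):
    """Lowercase the text and return its words (runs of letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def unique_words_in_sample(text, sample_size=2000, seed=1):
    """Count the distinct words in a random sample of `sample_size` words.

    Assumption: words are sampled at random without replacement.
    The seed only makes the sketch reproducible.
    """
    words = tokenize(text)
    if len(words) < sample_size:
        raise ValueError("text is shorter than the requested sample size")
    rng = random.Random(seed)
    sample = rng.sample(words, sample_size)
    return len(set(sample))


# Toy usage on a short sentence; a real paper would use sample_size=2000.
toy_text = "The soil is warm, the air is cold, and the soil is dry."
print(unique_words_in_sample(toy_text, sample_size=5))
```

Running this once per paper gives one number per paper, which is what the bars in Fig. 1 represent.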

A linear model on that data was far from significant: no trend at all could be observed here. Perhaps I was not getting more eloquent after all?

But then my wife pointed out that, yes, sampling 2000 words is an attempt to standardize between papers, but the length of each paper will still strongly affect the number of different words used, with longer papers allowing for a more varied word use.

Could that be true?

Figure 2: The number of unique words in the same standardized sample of 2000 words, yet now as a function of the total length of the paper

Oh, yes, it was true! Longer papers indeed – perhaps obviously so – allowed for a higher diversity in words, even within that sample of 2000 words!

So what to do next? I still wanted to know whether, once corrected for that bias, my vocabulary was increasing. After a brief over-dinner consultation with my linguistically trained sister-in-law, I came up with the following:

Figure 3: cumulative unique word count throughout each paper (gray/reddish is old, greener is new, black is the most recent paper)

This graph wonderfully solves the issue, in my opinion. By plotting the cumulative unique word count in the order in which the words appear in the paper, I neatly take into account changes in structure in the writing, as per linguistic advice, while correcting for the length of the paper. Steeper curves would suggest a more elaborate use of the English language, even when they stop earlier due to shorter paper length.
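The curve itself is simple to compute: walk through the paper word by word and keep track of how many different words have been seen so far. Below is a minimal Python sketch of that idea, not the R code behind Fig. 3; the tokenizer and the function name `cumulative_unique_counts` are hypothetical.

```python
import re


def cumulative_unique_counts(text):
    """Return, for each word position, the number of distinct words seen so far."""
    words = re.findall(r"[a-z']+", text.lower())
    seen = set()
    counts = []
    for word in words:
        seen.add(word)            # adding a repeated word leaves the set unchanged
        counts.append(len(seen))  # so the curve stays flat on repeats
    return counts


# Toy usage: the curve rises on every new word and plateaus on repeats.
print(cumulative_unique_counts("the cat and the dog and the bird"))
# prints [1, 2, 3, 3, 4, 4, 4, 5]
```

Plotting these counts against word position, one line per paper, gives curves of the kind shown in Fig. 3: a steeper line means new words keep arriving faster.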

And indeed: all but one (paper 6) of my more recent papers (in green in Fig. 3) show a steeper curve than the older papers. Paper 10 (the black line) is an especially interesting case, as it was a clear low outlier in Fig. 1. This time, it shows a steep curve, together with the other recent papers, revealing a great variety in word use despite its shorter length.

The trend is perhaps not too shocking, but of course none of these papers are ever written by me alone. There is a whole team of professionals behind each of them, giving me advice along the way, and likely suggesting new vocabulary to use, especially early on in my career.

So my vocabulary improved (a bit) over time. But how did the use of specific words change? Can we visualise changes in my topics of interest from the early stages of my PhD to my time as a postdoc now? As I now had all these papers neatly read into R, this could easily be done. Check out the following:

Figure 4: Frequency of words in my 3 most recent papers (from my postdoc, so to speak), compared to word frequency in my 3 first papers (2014-2016). Colors indicate overall frequency of the word in question, the dashed line indicates a constant use in both datasets. Not all words are visualised.
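A comparison like this boils down to computing, for every word, its share of all words in the early papers and its share in the recent papers. The following is a minimal Python sketch under that assumption, not the R code used for Fig. 4; the tokenizer and the function names `relative_frequencies` and `compare_frequencies` are hypothetical, and no stop words are removed here.

```python
import re
from collections import Counter


def relative_frequencies(texts):
    """Return each word's share of all words across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def compare_frequencies(early_texts, recent_texts):
    """Map each word to (early frequency, recent frequency); absent words get 0.0."""
    early = relative_frequencies(early_texts)
    recent = relative_frequencies(recent_texts)
    return {
        word: (early.get(word, 0.0), recent.get(word, 0.0))
        for word in set(early) | set(recent)
    }


# Toy usage: words with a higher second value are used more in the recent texts.
pairs = compare_frequencies(["soil survival survival"], ["soil soil microclimate"])
for word, (early_freq, recent_freq) in sorted(pairs.items()):
    print(word, round(early_freq, 2), round(recent_freq, 2))
```

In a plot with early frequency on one axis and recent frequency on the other, words with equal shares in both sets fall on the dashed line.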

And oh, is that interesting! There seems to be a constant interest in ‘change’, ‘anthropogenic’ and ‘conditions’ (as these are close to the dashed line). But my overall interest clearly shifted from a focus on ‘survival’ (of plants), ‘gaps’ (caused by disturbance), ‘native’ (and ‘non-native’ species) and ‘alpine’ and ‘elevational’-related questions to ‘soil’, ‘air’ and ‘microclimate’, and ‘temporal’ and ‘spatial’ patterns in ‘distributions’ at the ‘local’ ‘scale’.

Take this last one as a ‘spoiler’: of these recent papers, only one is currently published, so you can expect some more cool things about microclimate in the near future, as we are finally opening the black box that is the soil. Please stay tuned if you like those words above the dashed line, because you will see a lot more of them! If you are more of a fan of what happens below the dashed line: don’t worry, there will be more of those as well, yet perhaps less often with me as a first author. That’s why we have students on board now!

Want to answer similar questions? The book ‘Text Mining with R’ by Julia Silge and David Robinson is a great source of code!