Lexical homogenization is not the same as idea homogenization, and it does not surprise me that as time goes on the word choice in a given category of writing (esp an insular one like grant applications) would constrict.
It also wouldn't surprise me a huge amount if the idea space was constricting, but just looking at the cosine distance is not enough to establish that. There probably are meaningful ways to exploit and analyze the representations inside language model transformers to better capture the idea-space geometry, but that's a big research project.
The internet has made it very easy for people to homogenize both words and ideas very rapidly. This is the cost of making knowledge universal I guess. Maybe it’s a good thing if it allows us to progress all on the same page rather than taking the time to understand differences? If progress is good. Whatever progress is. Who knows. Although the ideas aren’t really homogenizing in a lot of cases, it is often causing dichotomies. But really it’s no more than a few factions that reinforce their homogenizations. Ok I guess it’s bad… I am rambling as much as that essay. You have increased your follow count to 6 now, good show.
Just as an observation about this…I’ve increasingly begun to see the term knowledge expand to encompass information as if they are equivalent. That tracks, in my field, with changes in US K-12 education and the problems we now see in college learning behaviors.
I agree with your general ramble, but found this interesting in context.
Yeah, good point, should have said information. Knowledge cannot be transferred directly to people, only information. The person needs to understand the information to convert it into knowledge. An analogy I have heard used before is that data is the primitive, the integral of data is information, and the integral of information is knowledge.
The point of the thing for me was that the Crimson analysis showed no change in rate of decline since 1900. So there is a need for more examples over larger timelines before we start modelling, perhaps.
The analysis also seems sensitive to the mapping of words to categories. Some kind of robustness analysis to this sensitivity would also be interesting to see.
Yea exactly. This is survivorship bias - the prospects of today look at the winners of yesterday to guide their entry on how to communicate, they’re slightly more likely to get funded, etc
Survivorship bias is when excessively high-risk behaviors are made to look good by only showing the winners. I think what you're referring to is evolution.
> Lexical homogenization is not the same as idea homogenization
Actually it IS evidence for idea homogenization. Words represent ideas. Unless you are claiming that the same word is used to represent different ideas, which would be even more confusing than having different words representing the same idea. Therefore fewer unique words => fewer unique ideas.
I would not call it a consequence of idea homogenisation, but it might certainly lead to it. The basic scientific idea IMHO is compatibility of research, which is important at least in a competitive and comparative setting. If people call a measure like 'sensitivity' something different in every adjacent field, it does not help reviewers. If structural elements of a proposal are similar, it helps in quickly understanding the key differences that remain. And this difference should be the actual idea, which is often the smallest part. We actually teach students to emulate style and do the same to write successful grant applications. This sure creates a bubble and in effect hinders outsiders from entering. That we see a gradual assimilation is IMHO rather an effect of available 'training material' and interdisciplinarity. Sure, there are dangers that this might be an indicator for. However, do not misinterpret it to mean that a grant proposal is more novel just because it uses totally different language...
When, for instance, a science is young, the same concept is often explained using many different terms. Over time canonical naming conventions emerge which standardize the settled parts of the field, leaving less settled future frontiers to be discussed productively with the aid of shorthand.
All named theorems do this in math, continually compressing the lexical space as a way of enabling further out ideas to be even expressed. Granted, math may be the exception that proves the rule, but it is an important one.
For instance, the addition of boilerplate to every document would increase their similarity metric. This has certainly happened over that time period with required compliance statements.
I would guess that the increased rate of information transmission across society has contributed to this trend. Copying is a key mechanism in how culture develops and propagates, and information moving more easily makes it easier for different groups to copy the dominant examples for any particular activity or discipline. A lot of information systems and cultural phenomena have winner-take-all / power-law dynamics. Combined with increased information transmission, this would cause the leading styles and practices to become even more dominant.
Tangentially related, but I was thinking about this while watching the Olympics men's figure skating competition recently (given the ongoing drama on the women's side, I'll leave that aside for now).
The quality of skating has drastically improved in just a decade. In 2010 the Olympic champion had no quad jumps. In 2022 the easiest jump from the top 3 competitors in their short program was a triple axel - the rest were all quads.
My theory about this is that the advent of smartphones and ubiquitous video has made it much easier to constantly see what your competitors are doing. This not only pushes you to work harder and try new jumps, but also lets you see what new techniques work for other athletes. A couple of decades ago people wondered if a quad jump was even possible; now people go into training assuming that a triple is no longer the expected limit.
Kotler's The Rise of Superman goes into how some of these meta-performance skills are a big part of what's being transferred more and more effectively. There's young kids doing mind blowing things in extreme sports like skateboarding and BMX that were unimaginable to even the top echelon of the sports a few decades ago.
It's exciting to see transfer learning's effects in near real time, as a lot of that has occurred over my lifetime. I just wish meta learning was more accessible and common knowledge. As it stands, I find myself continually having to do my own research to uncover ways to improve my own improvement.
Skateboarding is a fairly new sport. A few decades ago, there was nothing. Same goes for BMX. Both, at their current level, require technology that was unavailable a few decades ago.
I wonder if, setting quality aside, the kinds of tricks (is "tricks" right?) that skaters do become more similar with technology. That is, skater X was going to do some random trick, until X learned that skaters Y and Z were doing quad jumps, so X figures he must do quad jumps now.
In other words, if we could represent skating routines as vectors, would the average cosine distance between all those vectors be increasing or decreasing?
As others have said, the usage of similar words is no convincing evidence for homogenized ideas. I would like to add that the publication referred to in the article notes that there are many more proposals now than there were in the past.
This can naturally lead to lower average "distances" between proposals. In a simplistic example, let's assume 100 proposals existed in the "good old times" and they were different at random. Let's further assume in the "bad new days" people use those old ideas, change/improve upon them just slightly ("add noise"), but some old ideas are less often picked up than others. Say for example the least attractive old proposal is picked up twice, whereas the most attractive old idea is picked up and changed in 100 new proposals. Then, there is a lower average distance between proposals, all the while the total range of ideas has increased.
Because it's Friday and I am waiting for my oven to finish cooking my food, I wrote a small simulation. It's probably full of mistakes and I may have made terrible mistakes in my assumptions, but I thought it's fun:
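The simulation linked from that comment is not reproduced in this thread. As a separate, minimal sketch of the same thought experiment, the snippet below makes several simplifying assumptions that are not in the original: only 10 old ideas instead of 100 (to keep the pure-Python pairwise loop fast), old ideas modelled as one-hot vectors so that every old pair sits at cosine distance exactly 1, Gaussian noise of 0.05 per coordinate, and popularity rising linearly from 2 to 100 derived proposals. The function names are made up for illustration.

```python
import math
import random
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_distances(vectors):
    """Cosine distance for every unordered pair of vectors."""
    return [cosine_distance(u, v) for u, v in combinations(vectors, 2)]

def simulate(n_old=10, noise=0.05, seed=0):
    rng = random.Random(seed)
    # Assumption: old ideas are one-hot, i.e. maximally different from each other.
    old = [[1.0 if k == i else 0.0 for k in range(n_old)] for i in range(n_old)]
    new = []
    for i, idea in enumerate(old):
        # Least attractive idea is picked up twice, most attractive 100 times.
        copies = round(2 + 98 * i / (n_old - 1))
        for _ in range(copies):
            # Each new proposal is an old idea plus a little noise.
            new.append([x + rng.gauss(0.0, noise) for x in idea])
    return old, new

old, new = simulate()
d_old = pairwise_distances(old)
d_new = pairwise_distances(new)
print("mean distance old:", sum(d_old) / len(d_old), "new:", sum(d_new) / len(d_new))
print("max distance  old:", max(d_old), "new:", max(d_new))
```

In this setup the mean pairwise distance should drop, because many pairs now share a parent idea, while the largest observed distance grows slightly because the noise pushes some pairs further apart. Whether the maximum distance is the right notion of "range" is open to debate.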
It's interesting how your histogram and candle chart show that, while the mean has shifted, there's a sliver of samples with greater cosine distance than anything previously recorded in the dataset. So I guess while the system has become more inclusive of boilerplate language, it's also become more inclusive of far-out novelties. I'd be interested in reading those abstracts.
>As others have said, the usage of similar words is no convincing evidence for homogenized ideas
I disagree. Words are used to convey ideas, so if the space of words is shrinking, one should assume that the space of ideas is shrinking. It's possible for this not to be the case, but if word-space is shrinking then the burden of proof should be on those who claim that idea-space is not shrinking.
Maybe we could use embeddings / nlp analysis to determine whether idea-space is shrinking. Or just get a bunch of people to read abstracts from different time-periods and rate how similar they are to one another in their semantic content.
Words are like letters making up ideas, not ideas themselves. Having more than 26 letters wouldn't make us more expressive, and having fewer (like many extant languages)... wouldn't make us any less.
I think that's possible, but in your model, the average distance has indeed decreased. Yeah, the total range (or maybe the convex hull of the idea space) is bigger, but it's not obvious that the range is what we should think about. If the top 100 proposals are small variations on the "most attractive" old idea, then a lot hangs on whether that idea is really good or not - which in turn suggests that the proposals are probably not providing enough diversity.
Yes, that was indeed the point I was trying to convey! In an even sillier example, assume word vectors X, then calculate "proposal by proposal" similarities (i.e. inverse distances). Then duplicate X and concatenate [X,X], and recalculate "proposal by proposal" distances (now for twice as many proposals)---those distances must now be less on average because each proposal has at least one "zero distance" neighbor. HOWEVER, why would you assert that the overall "idea space" has been reduced?
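The duplication argument can be checked directly. For n proposals, each old pair shows up four times in [X,X] and there are n new zero-distance pairs, so the mean pairwise distance is multiplied by 2(n-1)/(2n-1). A minimal sketch, using made-up count vectors (the values in `X` are purely illustrative):

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(vectors):
    """Average cosine distance over all unordered pairs."""
    dists = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    return sum(dists) / len(dists)

# Made-up word-count vectors for four proposals.
X = [[3, 0, 1], [0, 2, 2], [1, 1, 0], [0, 0, 4]]
XX = X + X  # every proposal now has an exact duplicate

before = mean_pairwise_distance(X)
after = mean_pairwise_distance(XX)
n = len(X)
print("before:", before)
print("after: ", after)
print("ratio: ", after / before, "predicted:", 2 * (n - 1) / (2 * n - 1))
```

The set of distinct ideas is unchanged, yet the average distance falls, which is the point of the comment.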
Here's one metric by which, in your first model, the overall idea space has been reduced: the distribution of models has become more concentrated. That's because 100 proposals are tiny variations around 1 basic one. The same holds in your second model: with word vectors X, if I pick (say) two ideas to fund at random, they will never be the same idea, while with (X, X), that will sometimes happen.
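One way to put a number on that concentration argument is the probability that two proposals drawn at random (without replacement) are the same idea. A small sketch, where the idea labels are hypothetical and each proposal is tagged with the idea it derives from:

```python
from collections import Counter
from fractions import Fraction

def collision_probability(labels):
    """Probability that two proposals drawn without replacement share an idea."""
    n = len(labels)
    counts = Counter(labels)
    same_idea_pairs = sum(c * (c - 1) for c in counts.values())
    return Fraction(same_idea_pairs, n * (n - 1))

X = ["a", "b", "c", "d"]  # four distinct ideas
print(collision_probability(X))      # 0
print(collision_probability(X + X))  # 1/7

# First model: one idea copied 100 times, another copied twice.
skewed = ["popular"] * 100 + ["unpopular"] * 2
print(collision_probability(skewed))
```

This only measures concentration of proposals over ideas. It says nothing about how far apart the ideas themselves are.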
1. Language needs to be somewhat homogeneous. At an extreme, if the languages spoken by two academics are entirely heterogeneous, they (literally) won't understand a word the other is saying!
2. I recently investigated why there were 300 (!!) words I did not know the meaning of in a single George Orwell novel. Google N-Gram Viewer shows around 60-80% of these words to have been in common usage in 1934 when the novel was written, but not in common usage for some decades now [1]. Using these words today would increase lexical diversity, but at the expense of communicative effectiveness!
I'm surprised you were unfamiliar with words like "mauve", "wanton" or "sallow". These are fairly common in today's UK, but maybe less so in the US vernacular?
Jocosely is the only one I had a question over, until I realised it had the root "jocose".
It's because British people have such bad taste they still have mauve stuff. In French, the language it comes from, this color is synonymous with ugliness and the 60s hehe
There's no evidence from this article that the idea space is decreasing. Vocabulary is becoming somewhat more similar: the measurement is on individual words. Suppose an institution decides that the faculty is using overly obscure language (or a grant-making agency decides this) and asks that grant submitters reduce the amount of jargon, include more background, etc. Wouldn't this measurement then show a significant drop in the distance measure? Note that I am not saying that this is what is happening (I seriously doubt it, in fact), but I think too much is being concluded based on this measurement.
Words represent ideas. Fewer words is good evidence that there are fewer ideas.
Sometimes multiple words represent the same idea, so in that case homogenization is "good". It is also possible that multiple ideas are represented by the same word, but that is "bad" in an academic context because it leads to ambiguity. Therefore fewer unique words IS actually good evidence that the idea space is decreasing, unless you are saying academic literature is becoming more and more ambiguous.
Words transmit ideas. The difference is notable in this data set in particular. I posted a longer version elsewhere in the thread, but to summarize:
During this time period the NSF became more strict about explicitly addressing their review criteria. Faculty were trained to use the language explicitly and connect their ideas to the language of the review criteria. In effect, you started to get an ‘API’ where, regardless of what the idea in the grant is, it is clearly and explicitly connected to key NSF terminology. That is a de novo narrowing of the language space no matter the underlying ideas. Its purpose is definitively about improving communication, not narrowing ideas.
From what I can tell the paper didn’t filter anything in the data. Had they filtered “broader impacts” and “intellectual merit”, I would bet hard money they would get different results.
I'd be curious to see what the plot of average annual cosine distance would look like when using different sets of pre-trained embeddings. I suspect the corpus used is biased toward more recent documents. It wouldn't surprise me if there's more variance in the embeddings of documents that look less like those in the training set, e.g. if you were to embed documents written in German you may get some extreme outliers.
The telling part for me is the scale of the graphs from the original paper and the blog post
The paper graph shows a decrease from like .1 to .075 over 30 years
The blog post shows a decrease from .35 to .1 over 100 years
However, the Crimson had a notable drop around 2000 - on the order of the entire decrease in the research study - and then looks fairly stable.
It’s almost like there are policy and effective communication reasons for narrowing your vocabulary to that appropriate for a target audience.
If you look at the graph from the underlying study, there’s a bit of a “shoulder” around 1997. That is telling. 1997 was when the NSF introduced a new, clear and explicit, set of review criteria - broader impacts and intellectual merit [0]. That change alone is likely a cause of significant (Meaningful?) linguistic narrowing. To get NSF funding, researchers now had to explain why their research matters using explicit language that aligned with specific strategic objectives of the funding agency.
Then you add a layer of Goodhart's law. During this time period, two other things were also happening. First, universities increasingly began to rely on external funding, especially public universities. Second, the field of “faculty development” was increasingly formalizing and offering training and support on things like grant writing. Those trainings include a lot of focus on using normalized, almost shibboleth-like language in grant applications, ensuring that there is repetition of key words so that it is easy for the reviewers to establish that a particular grant application addresses the required review criteria.
So the data set used here is part of the problem - if they had used submitted rather than funded proposals they would likely see different results. Not because ideas are narrowing but because grant applications are basically a human-manifested API, and someone tried to actually standardize it.
This. Diversity words are almost always describing the Broader Impact. Meanwhile, the original study is pretending they are for the research ideas of the grant.
This point should get more visibility — embeddings are not made in the abstract; they reflect the lexicon of their training sets, and I strongly suspect word embeddings used here reflect the lexicon and word frequencies (and their meaning/usage context) in modern literature. Using them for text in the past (nearly a century back!) warrants some skepticism.
I'd like to see a study of vocabulary with the number of unique words we have used over time. My sense is that older prose has a much richer and more varied vocabulary.
The studies could be showing that we use a smaller dictionary today.
I had a similar intuition and graphed unique words per year while writing this. I found, surprisingly, that actually the reverse was true. Unique words per year go up, even as diversity goes down. Another finding that may explain this is that the articles get longer as time goes on - so a simple unique word count may just increase as a function of the authors using more words. There is a period in the late 80's to early 90's where average word counts per article nearly double. I'd speculate that this is about the time The Crimson switched to using computers or good word processing or something that made writing articles easier.
A graph that may get to the heart of your question is something like "Unique word percentage over time" or maybe "What percentage of articles use unique words".
> Unique words per year go up, even as diversity goes down. Another finding that may explain this is that the articles get longer as time goes on - so a simple unique word count may just increase as a function of the authors using more words.
You'd probably see the same effect if the articles don't get any longer, but more of them get written every year. Unique word count will always go up with words produced, even if most words produced are formulaic boilerplate.
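A toy illustration of that point: the raw unique-word count keeps climbing as more text is produced, even when almost everything is boilerplate, while the share of unique words falls. The corpus below is made up (one boilerplate sentence repeated, with a single new word per repeat) and `unique_word_growth` is a hypothetical helper name.

```python
def unique_word_growth(tokens, step):
    """Return (tokens_seen, unique_words, unique_ratio) after every `step` tokens."""
    seen = set()
    rows = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        if i % step == 0:
            rows.append((i, len(seen), len(seen) / i))
    return rows

# Made-up corpus: 8 boilerplate words plus one fresh word, repeated 20 times.
boilerplate = "this project has broader impacts and intellectual merit".split()
tokens = []
for k in range(20):
    tokens += boilerplate + [f"topic{k}"]

rows = unique_word_growth(tokens, step=45)
for tokens_seen, unique_words, ratio in rows:
    print(tokens_seen, unique_words, round(ratio, 3))
```

So a rising unique-word count per year is compatible with falling diversity, which is why a ratio or a fixed-size sample is the more informative comparison.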
One nice thing about getting feedback is learning all of the additional stuff I should have included in the blog post. I did look at number of articles per year, and it fluctuates, but there isn't a huge change across the century, and the change goes up and down. Total words, on the other hand, does trend up and goes up faster more recently.
Maybe you should just crop the same number of words from the start of each article, and take the same number of articles from each year by random sampling. That would make things easier to compare.
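A rough sketch of that normalization, assuming the corpus is available as a dict mapping year to a list of article strings (that layout, the function name, and the choice to skip years with too few usable articles are all assumptions for illustration):

```python
import random

def balanced_sample(articles_by_year, n_articles, n_words, seed=0):
    """Crop each article to its first n_words and sample n_articles per year."""
    rng = random.Random(seed)
    balanced = {}
    for year, articles in articles_by_year.items():
        # Keep only articles long enough to be cropped to exactly n_words.
        cropped = [a.split()[:n_words] for a in articles if len(a.split()) >= n_words]
        if len(cropped) < n_articles:
            continue  # too few usable articles this year; skip rather than pad
        balanced[year] = [" ".join(words) for words in rng.sample(cropped, n_articles)]
    return balanced

# Tiny made-up corpus, just to show the shape of the input and output.
corpus = {
    1900: ["the dean spoke at length today", "crew won"],
    2000: [
        "the team won the game again today",
        "students voted on the new plan yesterday",
        "a b c d e",
    ],
}
result = balanced_sample(corpus, n_articles=1, n_words=4)
print(result)
```

Cropping from the start of each article biases the sample toward ledes, so sampling a random window of words might be a reasonable variation.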
The second and third graphs aren't zero-based, which exaggerates the effect.
Also, the author may be over-interpreting the result: it's odd that very different fields are showing roughly the same change in cosine distance, and it is a measurement over individual words. I think a deeper analysis is needed to figure out in more detail what is going on.
From my point of view this is actually good news. Since I'm not a native English speaker, and since this language has become the world's lingua franca, it's very nice to be able to understand an article just because the authors didn't waste time looking for bombastic synonyms just to sound smarter. I applaud this, and projects like Simple English Wikipedia, which allow us non-native English speakers to keep learning.
The rising proportion of non-native speakers in US academia is what I thought of when I saw the article's claim that academic vocabulary is shrinking over time. Papers written by adult language learners tend to cluster around domain terminology + the absolute most common English words + words that happen to be similar to ones from their native language, and there's not a lot of incentive for them to go read Charles Dickens until they're indistinguishable from native speakers because, really, it doesn't matter and nobody cares.
Add in “grant writing specialists” with no domain expertise but lots of “test the NSF as an API” expertise and you have even bigger sources of error in these claims.
I would be interested to know if grant proposals which do _not_ get funded show less language homogenization than ones that do. Although, it still wouldn't tell you if the idea space is being constricted; that seems like a much more difficult thing to measure. It could just be the equivalent of everyone at court speaking the way the king does, because he's the king. Winning grants probably get read, and imitated, more than other grant proposals that don't get funded. Which might also happen with ideas, but I'm not convinced that is what is measured here.
Reminds me of grad school where there was literally only one correct way to write a particular sentence or paragraph in a manuscript according to multiple advisors over the years. Frequently that way is synonymous with what was originally written, but apparently the original doesn't have the correct "feel," or whatever. For a physical sciences paper, where no one is a particularly good writer.
Basically, communities seem to get into vocabulary and sentence structure ruts. Deviation means your reviewers are going to give you more shit, at least indirectly.
This article is almost criminally flawed. The author makes some horrible assumptions and presents shoddy data. Even if we assume that the latent space model is "correct" (we shouldn't), they don't present anything like variance or number of samples in a year. Then they sort of arbitrarily fit a line to data which pretty clearly looks non-linear.
Suppose, for example, that The Crimson (not Harvard btw, it's a student newspaper) runs 10x as many articles this year as last. It's possible you're going to get a huge reduction in cosine distance just by virtue of a few authors producing a lot more content.
At a minimum, we need mean, variance, and number of samples. This doesn't tell you anything about "Harvard", it just tells you about the students who the Crimson chooses to publish. There are lots of structural reasons within Harvard that the Crimson has probably stopped being a unified voice of the student body -- but again, that's not what the article purports to show.
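For concreteness, a per-year summary of the kind being asked for could look like the sketch below. It assumes each document has already been turned into an embedding vector and that vectors are grouped by year in a dict; the function names and the toy 2-d vectors are made up, and this is not the article's actual pipeline.

```python
import math
import statistics
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def yearly_summary(vectors_by_year):
    """Mean, sample variance, and counts of pairwise cosine distance per year."""
    summary = {}
    for year, vectors in sorted(vectors_by_year.items()):
        dists = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
        if not dists:
            continue  # fewer than two documents: no pairs to compare
        summary[year] = {
            "n_docs": len(vectors),
            "n_pairs": len(dists),
            "mean": statistics.mean(dists),
            # Sample variance needs at least two pairwise distances.
            "variance": statistics.variance(dists) if len(dists) > 1 else float("nan"),
        }
    return summary

# Toy vectors standing in for document embeddings.
toy = {
    1900: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    2000: [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3], [1.0, 0.4]],
}
summary = yearly_summary(toy)
for year, stats in summary.items():
    print(year, stats)
```

Note that pairwise distances within a year are not independent of each other, so the variance here is descriptive rather than something to plug straight into a significance test.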
For the title, I did say "At Harvard" and not "In Harvard" or "By Harvard". I think the student newspaper is indeed at Harvard. I'm also pretty clear about what text I'm looking at in the article. In the title I used "Harvard" over "The Crimson" because I figured fewer people would know "The Crimson" compared to "Harvard".
Regarding the flaws in the article (phew, I'm glad I didn't quite reach criminal level) - I'm curious which assumptions you think are horrible or what data shoddy. I don't think, for example, that I'm assuming the latent space model is "correct" as you say. I don't think I really have any significant assumptions about the technique or the meaning behind it. I read about the technique in the linked paper and reproduced it in my blog with a different dataset and found a similar result. It's strange to me that the signal produced by this technique is as consistent as it is across the 120 years of data. Beyond that, I'm pretty explicit that I don't know what it means or why it happens.
Regarding the "arbitrarily fit" line - as I say explicitly in the post, that's a regression plot to illustrate the trend.
Regarding the possibilities that The Crimson has more articles per year - it's true that's possible. It's not reality, they run about (for a generous definition of "about") the same number of articles every year. The articles do get longer over time. Either way, it's not clear to me what impact this should have on average cosine distance.
There are a lot of things that I looked at that didn't make it into the blog post. Without including them, then perhaps it looks like I'm cutting corners. If I did include them then I think the blog post would be shooting off in many directions. For example, I considered that political violence might be related - like maybe, in times where there's lots of political violence elite institutions come together and their language becomes more similar. That didn't really pan out though. I graphed a bunch of things that ultimately I decided didn't contribute very much and did not include.
Another way of thinking about it is in the original article Rasmussen (the original author) says "Look at this elite writing in NSF grants. The cosine distance is decreasing over time." I then say "Here is some elite writing - student newspaper at an elite school. Is the cosine distance decreasing there over time too?" And, it is. That's what the blog post is trying to say.
Now, maybe the latent space is "incorrect" - although Rasmussen and I use different embeddings that find a similar trend. Maybe it's not meaningful to use cosine distance in this context. But, it does seem like something has to cause it. Whatever it is and whatever it means, it doesn't look like the kind of thing that happens entirely by chance because it is consistent in different datasets and over many years.
Thank you for your article! Anecdotally, this is a phenomenon I see in high-pressure service companies, like the McKinseys of this world: there’s a very restricted, idiosyncratic vocabulary used by people working there, driven by the idea that it will promote sales.
Isn't the pre-made model you're using trained almost entirely on recent (last decade or so) text? I didn't dig too far into it but it looks like news, web crawls, twitter, wikipedia, etc.
Without commenting on the overall trend's cause, your diversity hypothesis is bunk and suggests you are looking to make things fit a diversity-related narrative:
- there's (unsurprisingly) no significant diversity-word change from 1900 to 1940 but a very significant distance drop
- there's a big diversity-word change around ~1990 with no concomitant distance change
Your comment is a bit ironic in the sense that I can tell you didn't read the article because you reproduce conclusions from the article. That's okay! Obviously you didn't need to read it to know what I would have said. :)
Let me quote from the end:
"Another argument against connecting distance and diversity is that distance is on a long running decline from 1900 even for the first four decades while diversity words were basically flat. When diversity words pop in the 90's there isn't an immediate reaction in cosine distance, it's only about a decade later, in 2000, that cosine distance takes a steep drop."
That seems awfully similar to the two points you've raised here.
What I do find a bit distasteful is that you jump in with "your diversity hypothesis is bunk" and accuse me of trying to fit a narrative - without even reading what you're commenting on.
Hey, I thought your article was nice. First it had an easy intro to word embeddings and cosine similarity. And second, you followed the investigation and even came up with the idea "against connecting distance and diversity", so it didn't seem you had the conclusion before you started the work.
If someone complains about not mentioning variance - it's still implicitly visible by the cloud of dots representing each year around the regression line.
My first thought was that this is just the output from the loss function for a genetic algorithm. Lots of people write grant applications, some of those get accepted, the next round of people writing applications then look at the success stories from the previous round and emulate them. Rinse and repeat. New members of the grant evaluation board want to conform to the standards of their predecessors, and so look at the accepted applications from yesteryear to see what they should accept, rinse and repeat. It's a pretty classic self-reinforcing feedback loop, and as other commenters have pointed out, constriction in lexical space doesn't mean a constriction in idea space. That said, I really enjoyed the article! Love this kind of data driven analysis.
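A stripped-down sketch of that feedback loop, with selection and imitation only (no crossover, so it is looser than a real genetic algorithm). Every modelling choice here is an assumption made for illustration: writing "style" is a single number, reviewers fund the applications closest to their current preference, the next board adopts the average of last round's winners, and new applicants copy a funded application with a little noise.

```python
import random
import statistics

def run_rounds(n_applicants=200, n_funded=20, n_rounds=10, noise=0.02, seed=0):
    """Return the spread (population std dev) of applicant styles after each round."""
    rng = random.Random(seed)
    styles = [rng.random() for _ in range(n_applicants)]
    target = 0.5  # hypothetical initial reviewer preference
    spread = [statistics.pstdev(styles)]
    for _ in range(n_rounds):
        # Selection: fund the applications closest to the reviewers' preference.
        funded = sorted(styles, key=lambda s: abs(s - target))[:n_funded]
        # The next board conforms to what its predecessors accepted.
        target = statistics.mean(funded)
        # Imitation: each new applicant copies a funded style, plus small noise.
        styles = [rng.choice(funded) + rng.gauss(0.0, noise) for _ in range(n_applicants)]
        spread.append(statistics.pstdev(styles))
    return spread

spread = run_rounds()
for round_number, value in enumerate(spread):
    print(round_number, round(value, 4))
```

The spread of styles collapses quickly and then hovers near the noise level. The model contains no ideas at all, only style, so it is consistent with lexical constriction happening without any change in idea space.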
The author should ask the question: What does my "model" actually assign here? Instead, this very core of the work is ignored and the focus rests completely on the numbers put out of a black box...
If the model simply assigns a closer similarity to more modern words (i.e., it would evaluate older words as "weird"), we would expect exactly this outcome, no?
Maybe we've discovered diversity etc is worth studying, in part because it does have such a profound effect on the world. There's a reason these words and studies are considered "politicized" and that's because understanding them threatens the status quo. Just because something is politicized doesn't mean it isn't worth studying.
> Maybe we've discovered diversity etc is worth studying, in part because it does have such a profound effect on the world
Or alternatively, NSF favors grant applications that mention it and toe the party line, so everyone begins to add more and more garbage to their application to optimize for grant acceptance.
Generally if something is so profound that it disrupts the status quo we quickly see it take over. For example the smartphone. But there still isn't a scientific consensus on diversity despite the well-known liberal lean of college campuses. And companies pay lip service and it almost seems that the most successful companies implement DEI programs _after_ they become successful, implying that it provides no competitive advantage.
It totally doesn't, which is why it's baffling that (either!) author goes into so much detail about "diversity words."
I can put a chart of lexical similarity over a chart of the number of employed airplane mechanics and probably pontificate that by golly, we need to stop fixing airplanes for the sake of intellectual diversity!
"The diversity words are new (new-ish). Anything else newish would be inversely correlated with cosine distance too. For example, I repeated the same experiment with "tech" words like Google, YouTube, iPhone, browser, and so on. Those tech words correlated at -0.6 with cosine distance"
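The quoted point, that any newly appearing vocabulary will correlate negatively with a declining series, is easy to reproduce on synthetic numbers. The two series below are invented and unrelated by construction (they are not the Crimson data): one declines steadily with a small wiggle, the other is zero for the first half and then rises.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

years = list(range(20))
# Invented "distance" series: steady decline plus a small alternating wiggle.
distance = [1.0 - 0.02 * t + 0.01 * (-1) ** t for t in years]
# Invented "new word" frequency: absent at first, then rising.
new_words = [max(0, t - 10) for t in years]

r = pearson(distance, new_words)
print("correlation:", round(r, 3))
```

A strongly negative coefficient comes out even though neither series was generated from the other, so a correlation of this kind cannot by itself separate "diversity words" from any other vocabulary that happens to be recent.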
It would be nice to see a type of text where this measure has gone up over time. I'm suspicious that the decrease is an artifact of the measure, for example it might be that the word embedding is "trained" on current texts and thus noisier on older texts.
The original paper that inspired this article is a hot mess. It’s just one huge correlation/causation confusion of a political argument posed as research. My freshman expository writing professor tore classmates to shreds over more justifiable logical constructs.
It does show lexical homogenization. It does show an increase in their shoddily assembled list of so-called diversity words. The concluded implications and pretty much every other part is hand waving, assumptions and political insinuation.
Like here where the author disclaims the shoddiness of using word frequency as a measure of politicization because the nuance and subtlety of bias make quantitative measurement difficult (he should apply for a grant to research that)… but then draws the entirely unsupported conclusion that politicization is actually worse than their only data implies:
“Note that word counts is a somewhat crude way of measuring politicization. Bias, particularly in the social sciences, is often subtle, and can apply to the kinds of questions that get asked and the standard of evidence used to accept or reject a hypothesis. Thus, the fact that so many grants contain terms that are in most contexts clearly associated with left-wing political causes likely underestimates the degree of politicization in science funding.”
And the author puts that assumption to work a couple of paragraphs and charts later:
“This report presents direct evidence that scientific funding at the federal level has become more politicized and less supportive of novel ideas since 1990.”
The opening paragraphs don’t even indicate a connection beyond “Look at this trend which I assume means Y. Now look at that trend which I assume means Y. Now look at them together. Vaguely similar, eh? Awfully suspicious, eh?”
“Taken together, the results imply that there has been a politicization of scientific funding in the US in recent years and a decrease in the diversity of ideas supported.”
How about a real “diversity word” selection criterion instead of just showing some tenuously relevant clustering? How about contextual analysis or explanation of the terms? Language and public discourse have changed a lot since the 90s. To inform and help validate their word selection for analyzing ~30 years of data, they used the “DEI terms” diversity, equity and inclusion, but the DEI acronym wasn’t used until about 10 years ago. That could be a coincidence, but it could also indicate those terms weren’t always an accepted standard. Is there evidence other terms weren’t used instead? Is there any evidence those are the best terms to use now? Did 90s political grants use euphemisms or code words for political ideas to sound more neutral? Were the terms you selected used by political activists as commonly in 1993 as they are now? Would second-wave feminists and original civil rights movement activists, to use two examples of very political movements that were ubiquitous in academia long before 1990 and have since declined in prevalence, have been equally political but used different words? How many of those words are central to the research topic, and how many are positioning unrelated research as good for society, aware of current trends, or more beneficial than it is? What other words or topics showed similar or different results for comparison? With the so-called inclusion words removed, was the lexical similarity static?
During that same period, a small number of ubiquitous spelling and grammar checkers gained sophistication and adoption; writers who disregarded or disabled them probably retired, and others probably changed how they wrote; the Internet sharpened the curves of trends and memes and also probably exposed people to better writing practices; writing trends have probably come and gone; educational curricula became more standardized; the Internet facilitated access to far more grant/business writing examples; and there may have been particularly influential events or pieces of writing that changed things for other reasons. How did they consider or reason about factors like that? Do Internet-focused terms, which would have gained prevalence in the same time frame, follow a similar trend line? Any medical terms? BPA? GMOs? Genetic testing? Any other points of comparison at all?
Anyone interested in actually advancing human understanding rather than creating political cudgels could poke holes in this garbage all day. I don’t see how anyone could look at that and think “huh — thoughtful analysis” rather than “huh — that’s some elaborate work to give a tenuous air of legitimacy to a claim they didn’t test presented in a manipulative piece which won’t get much traction beyond a politically sympathetic subreddit.”
> Rasmussen connects the decline in lexical diversity with an increase in diversity words - equity, diversity, inclusion, gender, marginalize, underrepresented, and disparity.
> Suppose we were going to rate words on a scale of one to ten in two dimensions. First, we are going to rate how old fashioned a word is. Second, we will rate how funny the word seems to us.
> For example, if we wanted to rate "poppycock" we might say that it's a 9 for "old fashioned" and an 8 for "Funny". Poppycock would be 9, 8. Another word, like snail, isn't especially old fashioned or funny. Though, the word has been around a while - we could call it a 5, 2.
This is a sense of "old-fashioned" that I've never heard of. As far as I'm aware, "old-fashioned" is entirely defined by not being in current use. The age of the word is irrelevant, but if it were relevant, and "poppycock" - a word from the 19th century - were a 9 out of 10, "snail" - a word from Old English - would be something more like a 1500 out of 10.
Well, it's pretty easy to assume that snail is very, very old. As I said above, in my opinion it doesn't sound "old-fashioned" because it's still current, and how old it might be is irrelevant.
But see also my observation about the English female names Gertrude and Etheldreda. Which one is more old-fashioned?
I suppose "old fashioned" is much more of a subjective measurement than an objective one. At first I was going to rate "snail" as not at all old fashioned, since we still use the word. But then, I thought of medieval snail drawings, and felt like the word has some element to it that does harken back to older days. "Citadel" is kind of a similar example. There is a modern hedge fund named Citadel and we still use that word, but it also has a connection to the past. I'd say Citadel is more "old timey" than "snail" even though both words are in modern parlance.
A better way to think about it is that these measurements are my subjective opinion on old-timeyness and silliness, assigned via an undefined and intuitive process.
> like the word has some element to it that does harken back to older days
It's more interesting than you might think! Turns out snail is a diminutive form of snake[1], from a root referring to creeping over the ground.
The other aspect of judgments of old-fashionedness is that things that are really old-fashioned, like being named Etheldreda, will tend to be rated as less old-fashioned than things that are pretty recent but out of current fashion, like being named Gertrude. The old old things are too forgotten to be "old".
My other question would be "why 'cosine similarity' rather than 'correlation'?". Same thing, but people are a lot more familiar with the term 'correlation'.
[1] OE snaca preserved the /k/ sound, but OE snægl voiced it, and the /g/ then predictably turned into a Y sound (compare "yard" / "day"), giving us the I of modern snail.
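On the cosine similarity versus correlation question: the two are closely related but not identical. Pearson correlation is the cosine similarity of the two vectors after each has had its own mean subtracted. A minimal sketch, reusing the (9, 8) and (5, 2) ratings from the quoted example plus one made-up pair of three-dimensional vectors; the function names are just for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity after mean-centering each vector."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)

# Ratings from the quoted example: (old fashioned, funny).
poppycock = [9, 8]
snail = [5, 2]
print(cosine_similarity(poppycock, snail))    # roughly 0.94
print(pearson_correlation(poppycock, snail))  # roughly 1: two points always fit a line

# Made-up vectors pointing in nearly the same direction,
# but with opposite trends across their components.
rising = [10, 11, 12]
falling = [12, 11, 10]
print(cosine_similarity(rising, falling))     # roughly 0.99
print(pearson_correlation(rising, falling))   # roughly -1
```

The mean-centering step is the whole difference. Correlation discards each vector's average level, while cosine similarity keeps it, so the two can disagree sharply on the same data.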
I would probably model "old fashionedness" as a question of "when was this word's popularity peak?" i.e. how long ago (how "old") was the time at which this word was at its most popular (most in "fashion")?
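A minimal sketch of that "years since peak popularity" idea. It assumes you already have a per-year usage frequency for each word; the two series below are invented placeholders, not real frequency data, and `years_since_peak` is a hypothetical helper name.

```python
def years_since_peak(yearly_frequency, current_year):
    """Return how many years ago a word was at its most popular.

    yearly_frequency maps year -> relative frequency of the word.
    If several years tie for the peak, the first one in the mapping is used.
    """
    peak_year = max(yearly_frequency, key=yearly_frequency.get)
    return current_year - peak_year

# Hypothetical frequency series for two unnamed words.
word_a = {1900: 0.2, 1920: 0.9, 1950: 0.4, 2000: 0.1, 2020: 0.05}
word_b = {1900: 0.3, 1920: 0.3, 1950: 0.4, 2000: 0.5, 2020: 0.6}

print(years_since_peak(word_a, 2020))  # peaked a century ago: old-fashioned
print(years_since_peak(word_b, 2020))  # peak is the current year: not old-fashioned
```

Under this model a word that is both ancient and still in steady use, like "snail", would score as not old-fashioned at all, which matches the objection raised above. It would not capture the Etheldreda effect, where something is so long out of use that it no longer registers as "old".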
It also wouldn't surprise me a huge amount if the idea space was constricting, but just looking at the cosine distance is not enough to establish that. There probably are meaningful ways to exploit and analyze the representations inside language model transformers to better capture the idea-space geometry, but that's a big research project.