Analysis of Nietzsche's Ecce Homo using BERTopic

Another way to explore the question of whether Nietzsche's Ecce Homo is his autobiography is to use topic modeling. There are multiple approaches to topic modeling, but the general idea is to find clusters of related words, or topics. If a particular text is an autobiography, or even a biography, you would expect certain word clusters. In the case of an autobiography, you'd expect words connected to childhood, or perhaps words related to whatever the person became famous for, and so on. Of course, there will also be many common words, and they may cluster in ways that are revealing, or they may not. For example, present-tense forms of "to be" might appear in every cluster, but that is not typically informative.

Common but largely uninformative words are called "stop words," and traditionally one eliminates them from the text before performing these sorts of analyses. Off-the-shelf lists of such words are readily available. I decided to roll my own because some traditional stop words, like "I" and "you," are important for my analysis and I did not want them eliminated. I added words to the list only as necessary to improve clustering, and ended up with this list of stop words:

against, all, among, an, and, any, are, as, at, be, because, been, before, but, by, can, come, does, even, ever, for, from, had, has, have, he, here, him, his, how, in, into, is, it, man, men, more, most, much, must, no, of, on, one, only, saw, say, than, that, the, this, those, to, too, was, were, what, when, which, who, whom, with, would
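
To make the effect of such a list concrete, here is a minimal sketch of how it might be applied, assuming plain lowercase tokenization. The helper names tokenize and remove_stop_words and the sample sentence are illustrative assumptions, not the code used for this analysis.

```python
import re
from collections import Counter

# The custom stop-word list from the text.
STOP_WORDS = frozenset("""
against all among an and any are as at be because been before but by can come
does even ever for from had has have he here him his how in into is it man men
more most much must no of on one only saw say than that the this those to too
was were what when which who whom with would
""".split())


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical sample sentence, not a quotation from Ecce Homo.
sample = "I was born in a small town, and you would not have known me then."
kept = remove_stop_words(tokenize(sample))

# "i" and "you" survive because they were deliberately left off the list.
print(kept)
print(Counter(kept).most_common(3))
```

With BERTopic itself, a list like this would more typically be handed to a scikit-learn CountVectorizer through its stop_words argument, with that vectorizer passed to BERTopic as vectorizer_model, so the filtering happens when topic words are extracted rather than in a separate preprocessing pass.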

For my initial foray, I used BERTopic and discovered that when I set ngram_range = (1, 1), the algorithm would reliably return up to five topics. One indicator of a good spread is that the topic clusters are neither too close to nor too distant from each other. Here are the intertopic distances from one particular run: