Words maybe and non
I was having fun chatting to my teenage nephew recently about different operating systems and free software, and ended up revisiting Ken Church's classic Unix™ for Poets, which somehow led me to the burning question: if one were to letter-sort all of /usr/share/dict/words, which contains every Webster's Second International Dictionary (1934) term, what portion of the results are actual words defined in that same input list? I know there are bound to be a few, like "now" for example, but how many really? 🤔
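To make "letter sort" concrete, here is a minimal sketch for a single word. The function name lettersort is my own label for illustration, not a standard tool; it only chains fold(1), sort(1) and tr(1).

```shell
# Hypothetical helper: print the letters of one word in sorted order.
lettersort() {
  printf '%s\n' "$1" |
    # One letter per line.
    fold -w 1 |
    # Sort the letters, ignoring case.
    sort -f |
    # Glue the letters back together.
    tr -d '\n'
}
```

Calling lettersort own prints now, which happens to be a word in its own right, and that is exactly the kind of hit I am after.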
# Count all the words (235886 on macOS latest).
wc -w /usr/share/dict/words
That is a tall stack. There are probably fancier awk(1) or Perl one-liner solutions that avoid the while loop and may even run faster, but for the sake of readability:
# Sample ./filter script:
while read word; do
  echo "${word}" |
    # Break up each word into line separated letters.
    fold -w 1 |
    # Ignore case when sorting.
    sort -f |
    # Pull letters back into words.
    tr -d '\n' |
    # Look for exact matches in the dictionary. Choosing fgrep(1) here, because it
    # is supposed to be "quicker" when not using regular expressions.
    fgrep -w -f - /usr/share/dict/words
done
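For comparison, here is a rough sketch of the awk(1) route, which reads the wordlist into memory once instead of rescanning it for every word. The function name lettersort_matches is made up for this example, and it is not a drop-in replacement for the loop above: it prints the sorted string only when it equals a whole dictionary line, and how it orders upper and lower case copies of the same letter may differ from sort -f.

```shell
# Hypothetical helper: letter-sort every line of the wordlist given as $1
# and print each result that is itself a line of that same wordlist.
lettersort_matches() {
  awk '
    # First pass over the file: remember every entry exactly as spelled.
    NR == FNR { dict[$0] = 1; next }
    # Second pass: sort the letters of each entry and look the result up.
    {
      n = length($0)
      for (i = 1; i <= n; i++) ch[i] = substr($0, i, 1)
      # Insertion sort on the letters, comparing them in lower case.
      for (i = 2; i <= n; i++) {
        c = ch[i]
        j = i - 1
        while (j >= 1 && tolower(ch[j]) > tolower(c)) {
          ch[j + 1] = ch[j]
          j--
        }
        ch[j + 1] = c
      }
      sorted = ""
      for (i = 1; i <= n; i++) sorted = sorted ch[i]
      if (sorted in dict) print sorted
    }
  ' "$1" "$1"
}
```

Something like lettersort_matches /usr/share/dict/words | wc -l would then give a count. I have not timed it against the loop, so treat the speed-up as an expectation rather than a measurement.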
I am on a very basic machine, so as a matter of principle let me also time(1) the process for good measure:
time cat /usr/share/dict/words | ./filter | wc -l
Okay, well, that makes my laptop hurt and seems to be taking forever; save it for later, maybe. But there are more wordlists under /usr/share/dict, such as connectives on the Mac at least, a far smaller set of 150. What if I try those instead? And the answer is a mind-blowing 41 unique entries plus a duplicate one, which is… 42 of course!
Here is the full list in order of appearance: a, in, is, eh, for, it, as, his, no, be, at, by, i, hist, not, aer, or, an, all, him, been, how, no, fi, os, pu, ist, amy, do, first, any, my, now, em, most, how, now (own), begin, ady, know, aery, go.
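To double-check a tally like this, total lines versus distinct lines, a small helper along these lines might do. The name tally is hypothetical, and it assumes one entry per line on standard input, as the filter produces.

```shell
# Hypothetical helper: read one entry per line on stdin and print two
# numbers, the total line count followed by the distinct line count.
tally() {
  awk '
    { total++ }
    # Count a line as distinct only the first time it shows up.
    !($0 in seen) { seen[$0] = 1; unique++ }
    # Adding 0 forces a numeric 0 when the input is empty.
    END { print total + 0, unique + 0 }
  '
}
```

Piping the filter output through it, for example ./filter < /usr/share/dict/connectives | tally, shows both numbers side by side.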