WORDLE – Some Statistics

Wordle, the new internet game craze, has some interesting ideas to offer. Tactics have been analysed, as well as the game code. I’d like to add some statistical data on the set of possible solutions. If you haven’t heard of it, check out the link above; the game is pretty self-explanatory.

In short, there is a list of 2315 words from which a new puzzle will be created every day. I won’t give you the complete list, but here are some properties of the included words.

Composition of Words

First, I will answer the question: how many different letters may a solution have?

Distinct letters   Total abs.   Total %
2                  1            0.04
3                  57           2.46
4                  691          29.85
5                  1566         67.65

Roughly 2/3 of the solutions (67.65%) consist of 5 distinct letters. About 30% contain a single character that appears twice (e.g. LOOSE or CLOCK). A few words, about 1 in 40, consist of only 3 different letters (e.g. SASSY). Moreover, there is one special word that needs only 2 different letters (I won’t tell you).

So, the expectation that you have to deal with a word of five different characters is basically correct. If you keep in mind that there may be a single letter that appears twice, you have more than 97% covered.
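If you want to reproduce this kind of count yourself, the following is a minimal Python sketch. The `words` list is a small hypothetical sample built from words mentioned in this article, not the actual list of 2315 solutions, and the function name `distinct_letter_histogram` is made up for illustration.

```python
from collections import Counter

# Hypothetical sample; replace with your own copy of the solution list.
words = ["loose", "clock", "sassy", "saint", "crane", "coast"]

def distinct_letter_histogram(words):
    """Map 'number of distinct letters' to 'number of words'."""
    return Counter(len(set(word)) for word in words)

hist = distinct_letter_histogram(words)
total = len(words)
for k in sorted(hist):
    share = 100 * hist[k] / total
    print(f"{k} distinct letters: {hist[k]} words ({share:.2f}%)")
```

Using `set(word)` removes repeated letters, so its size is exactly the number of distinct letters in the word.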

Frequency of Letters

Next, I examined the frequency of the letters in the set of possible solutions. There are statistics about the frequency of letters in the English language, but maybe there are differences because of the features of the chosen set. Here are the results:

Letter   Letter Frequency   Word Frequency
A        979                909
B        281                267
C        477                448
D        393                370
E        1233               1056
F        230                207
G        311                300
H        389                379
I        671                647
J        27                 27
K        210                202
L        719                648
M        316                298
N        575                550
O        754                673
P        367                346
Q        29                 29
R        899                837
S        669                618
T        729                667
U        467                457
V        153                149
W        195                194
X        37                 37
Y        425                417
Z        40                 35

The letter frequency shows the occurrences of the single letters, so we are looking at all the letters in all the words. The word frequency shows the number of actual words that contain at least one instance of the letter in question. There is not much difference in the order of frequencies between both methods of counting:

E, A, R, O, T, L, I account for about 50% of all cases. Subsequently including C, N, and S, these 10 letters are much more common than all 16 others and they are included in over half the possible solutions. According to some random Google result, this is very similar to the results in the Oxford Dictionary.

At the bottom of the list, J, Q, X, Z are all very uncommon: together, they sum up to only about 1% of letters, and each of them occurs in fewer than 2% of words. Same in the Oxford Dictionary, so nothing spectacular here.
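The two ways of counting described above can be sketched in a few lines of Python. This is only an illustration on a hypothetical five-word sample, not the real solution list; `letter_and_word_frequency` is a name invented for this example.

```python
from collections import Counter

# Hypothetical sample; the real solution list is not reproduced here.
words = ["loose", "clock", "sassy", "saint", "crane"]

def letter_and_word_frequency(words):
    letter_freq = Counter()  # counts every occurrence of a letter
    word_freq = Counter()    # counts each word at most once per letter
    for word in words:
        letter_freq.update(word)
        word_freq.update(set(word))
    return letter_freq, word_freq

letter_freq, word_freq = letter_and_word_frequency(words)
for letter in sorted(letter_freq):
    print(letter.upper(), letter_freq[letter], word_freq[letter])
```

In this sample, S has a letter frequency of 5 (three of them from SASSY) but a word frequency of 3, which is the same kind of gap you see for E, L or O in the table.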

Good Starting Words

I’ve read a discussion on Review Geek about the best guesses to start with. A programmer named Tyler Glaiel determined RAISE as the best possible first guess for his bot, but admitted there might be room for improvement.

For my analysis of this question, I looked at the frequency of letters as well as their position. The position is very important, as the game gives multiple hints on where a letter does or does not occur in the solution:

  • Obviously, it verifies the correct position of a letter by green colour;
  • by showing the mere occurrence of a letter in a yellow colour, it can be inferred that the letter does not appear in that particular position (otherwise it would be green!); therefore, it verifies a wrong position.
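To make the two kinds of hints concrete, here is a hedged Python sketch of how such feedback could be computed. It is not taken from the game code; the function name `feedback`, the output symbols and the handling of repeated letters (greens are assigned first, then yellows only while unmatched copies of the letter remain) are my assumptions about how the game behaves.

```python
from collections import Counter

def feedback(guess, solution):
    """Return one symbol per position: G (green), Y (yellow), - (grey)."""
    result = ["-"] * len(guess)
    remaining = Counter()  # solution letters not matched by a green

    # First pass: exact position matches.
    for i, (g, s) in enumerate(zip(guess, solution)):
        if g == s:
            result[i] = "G"
        else:
            remaining[s] += 1

    # Second pass: right letter, wrong position (assumed duplicate rule).
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)

print(feedback("saint", "coast"))  # YY--G
print(feedback("loose", "clock"))  # Y-G--
```

In the second example only one O of LOOSE is marked, because CLOCK contains a single O and it is already used up by the green hit.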

I went on and counted the frequencies of the letters A to Z at each position 1 to 5 where they appear in the 2315 five-letter solutions. And there are big differences in comparison to the overall distribution of letters: each position has its own frequency profile.

Here are the five most common letters ranked top down for each of the positions the letters may have in the word. The absolute numbers are added in parentheses:

1st       2nd       3rd       4th       5th
S (366)   A (304)   A (307)   E (318)   E (424)
C (198)   O (279)   I (266)   N (182)   Y (364)
B (173)   R (267)   O (244)   S (171)   T (253)
T (149)   E (242)   E (177)   A (163)   R (212)
P (142)   I (202)   U (165)   L (162)   L (156)

This allows for a much better understanding of the first guess problem:

  • Quite often the solution starts with one of five consonants (S, C, B, T and P together cover about 44% of first letters), followed by one or two vowels;
  • the most frequent letter E dominates the 4th and 5th position;
  • clearly, you can recognize some common or distinctive pairs at the end, including -ER, -NY, -ST, -AL, -AR …
  • the letter Y is very prominent in position 5, probably because of adjectives included in the list.
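A per-position count like the one in the table can be sketched as follows. Again, this is only an illustration: `words` is a tiny hypothetical sample taken from words mentioned in this article, and `positional_frequencies` is a name I made up.

```python
from collections import Counter

# Hypothetical sample; the real solution list is not reproduced here.
words = ["saint", "slant", "shine", "shiny", "crane", "crony"]

def positional_frequencies(words, length=5):
    """Return one Counter per position (index 0 = 1st letter)."""
    counts = [Counter() for _ in range(length)]
    for word in words:
        for pos, letter in enumerate(word):
            counts[pos][letter] += 1
    return counts

counts = positional_frequencies(words)
for pos, counter in enumerate(counts, start=1):
    print(pos, counter.most_common(5))
```

`most_common(5)` gives the five most frequent letters for a position together with their absolute numbers, which is the layout of the table above.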

Rating Frequency against Position

To quantify the quality of a word in the solution list as a first guess, we assume that it is advantageous to get as many letter hits as possible. This might not be true for 2nd and 3rd guesses!

To rate every solution against the ranking of the letters according to their positions, I took every position and gave every letter a value from 26 to 1 (0 for not included at all) according to its ranking at that position. Other ways of scoring are possible, e.g. the relative distribution based on the percentages, but I have chosen the plain rank-based values for now. Then I calculated different averages (arithmetic, geometric and harmonic). There were only slight differences between them.
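The rating scheme described above could look roughly like this in Python. Treat it as a sketch under assumptions, not as the exact procedure used for the lists in this article: the sample `words`, the names `rank_values` and `score`, the alphabetical tie-breaking between equally frequent letters and the choice to return 0 for the geometric and harmonic mean when a letter scores 0 are all mine.

```python
import math
from collections import Counter

# Hypothetical sample; the real solution list is not reproduced here.
words = ["saint", "slant", "shine", "shiny", "crane", "crony"]

def rank_values(words, length=5):
    """Per position: most frequent letter gets 26, the next 25, and so on."""
    tables = []
    for pos in range(length):
        counts = Counter(word[pos] for word in words)
        # Assumption: ties are broken alphabetically.
        ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
        tables.append({ch: 26 - i for i, ch in enumerate(ordered)})
    return tables

def score(word, tables):
    """Return (arithmetic, geometric, harmonic) mean of the letter values."""
    values = [tables[pos].get(ch, 0) for pos, ch in enumerate(word)]
    n = len(values)
    arithmetic = sum(values) / n
    if 0 in values:  # letter never seen at that position
        return arithmetic, 0.0, 0.0
    geometric = math.exp(sum(math.log(v) for v in values) / n)
    harmonic = n / sum(1 / v for v in values)
    return arithmetic, geometric, harmonic

tables = rank_values(words)
ranked = sorted(words, key=lambda w: score(w, tables)[0], reverse=True)
print(ranked[0], score(ranked[0], tables))
```

With values this close together (mostly in the 20s), the three means stay close to each other, which fits the observation that the choice of average made little difference.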

Here is my list of the 10 best first guesses:

  • SAINT – starts with S, tests for high-ranking letters;
  • CRANE – includes R and the common N and E;
  • COAST – tests for important S and T and most vowels;
  • CRONY – high-ranking with the very common -NY ending;
  • BOAST – tests for B in addition to S and T;
  • CAUSE – tests vowels and two most common 1st letters;
  • PAINT – interesting option to go for the slightly less frequent P;
  • SLANT – tests many of the most common consonants at once;
  • SHINE – nice variant to go for H with a good letter combo;
  • SHINY – same as above containing the important Y.

I only included words with 5 different characters. I manually edited the list a bit to reduce spoilers. I didn’t use the extended guess words list, so “solve in one strike” is always possible.

More Options

RAISE, as suggested by Tyler, ranks at about 90 of the 1566 solutions in my statistics – not bad, but not extremely good, either. It tests for S, but R is only rank 11 as first letter and not better than rank 3 overall. The vowels are placed quite well, though. CADET, RAINY, DRONE or PRICE have about the same potential in my ranking.

EARLY could seem like a really good first guess looking at the overall frequencies of letters. However, in my ranking, EARLY only scores at rank 576, a third down the list. It is a good example of how our choice can be affected by taking the positions of letters into account.

Oddballs

Now a look at the bottom of the list – these are 10 of the worst first guesses you can make, in alphabetical order:

AFFIX, AZURE, INBOX, JAZZY, JUMBO, NINJA, OZONE, PIQUE, USHER, ZEBRA

This list is funny considering the subject of our question, but also scary because these are actual solutions – and they are really hard to solve if you walk in the wrong direction initially. Your six guesses may be used up much too fast.

Conclusions

There are no magic words to easily solve the Wordle puzzle but there are lots of little things to improve and optimize the strategy:

  • The most important idea could be to minimize the set of candidates. Testing efficiently means only using guesses with 5 distinct letters, not testing letters twice, and reviewing and re-evaluating every hint after each turn.
  • To reduce uncertainty, it is not always a good tactic to include the letters you already know are part of the solution, unless you play in Hard Mode, where it’s a must.
  • A viable strategy might be to test for vowels early on. Personally, I find it more convincing to verify and dismiss consonants, as vowels are much easier to check (only five at most). Consonants also allow or exclude certain other consonants in close proximity whereas vowels are much more permissive in this regard.
  • Taking positions into account is really important; every yellow letter tells you where the letter does not appear. That information can help reduce possibilities immensely.
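As an illustration of how hints shrink the set of candidates, here is a small Python sketch. It is a simplification under stated assumptions: it only works for guesses with 5 distinct letters (as recommended above), the candidate list is a hypothetical sample, and the names `matches`, `greens`, `yellows` and `greys` are made up for this example.

```python
def matches(word, greens, yellows, greys):
    """Check a candidate against the hints collected so far.

    greens:  dict {position: letter} for confirmed positions
    yellows: set of (position, letter) where the letter is present
             in the solution but NOT at that position
    greys:   set of letters that do not occur at all
    """
    if any(word[pos] != letter for pos, letter in greens.items()):
        return False
    for pos, letter in yellows:
        if letter not in word or word[pos] == letter:
            return False
    return not any(letter in word for letter in greys)

# Hypothetical sample of remaining candidates.
candidates = ["saint", "slant", "shine", "shiny", "crane", "crony"]

# Hints after guessing COAST: only S is yellow (at index 3), the rest grey.
greens = {}
yellows = {(3, "s")}
greys = {"c", "o", "a", "t"}

remaining = [w for w in candidates if matches(w, greens, yellows, greys)]
print(remaining)  # ['shine', 'shiny']
```

Even a guess with no green letter at all cuts this sample from six candidates to two, and the yellow S alone would also rule out any word with S in the fourth position.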

Finally, have a lot of fun and wordle on!