I had an interesting requirement for a website: Prevent a user from entering random text/garbage into a web form. They even provided a list of data that had got through (ex. "XXF", "djfhfhfhf", "asdf", "John Doe", etc).
So I thought about this problem for a while. I did a bit of research to see which direction to look in. There is a lot of material on generating randomness, but most of the focus on detecting randomness is on repeatable processes like random number generators that can be run over and over again to produce a sample set. In our case, we only get one sample, which is simply what the user enters when they submit the form.
In the end, I looked at the samples of random garbage I was given, and thought about what the person probably did when they entered it into our form. I came up with the following rules, taken in combination, to flag a word as random garbage.
- Ratio of vowels-to-consonants and its inverse must not fall below 0.16. This rule catches words that have too many vowels or too many consonants in them to be phonically useful (ex. "akdjfsa", "aeiouux").
Obviously, edge cases involving words which are all vowels or all consonants are also caught (ex. "aaaaaa", "bbb"). The idea behind this rule was that people tend to use just a few keys on the keyboard to generate a random word.
- Ratio of unique-letters-to-total-word-size must not fall below 0.4 (ex. "dfffdf" would fail because only 2 unique characters, 'd' and 'f', are used to make a word of size 6, and 2/6 < 0.4). This rule tests to see if the word was created using the same few keys in combination. In some ways this is an extension of rule 1, to catch cases where an acceptable mix of vowels and consonants was used (ex. "akkkakkaa") but the word doesn't have enough "variety" of letters to make phonic sense.
- All the characters in the word are positioned beside each other on the keyboard. This was the hardest of the rules to implement, since keys on a keyboard don't line up top to bottom. The goal was to catch words generated when a person just mashes their keyboard with one hand. I think this is also the rule carrying the least weight, since lots of valid words would violate it (ex. "fred").
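To make the two ratio rules concrete, here is a rough Python sketch. It is not the code from my site, just an illustration of the idea. The function names are made up for this example, and I'm assuming that only "aeiou" count as vowels (so 'y' is a consonant), that non-letters are ignored, and that empty input is left to ordinary required-field validation. Borderline words may come out differently depending on how you settle those details.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def _letters(word):
    """Lower-cased alphabetic characters of the word; digits and symbols are ignored."""
    return [c for c in word.lower() if c.isalpha()]


def fails_vowel_rule(word, threshold=0.16):
    """Rule 1: flag the word if vowels/consonants or its inverse falls below the threshold."""
    letters = _letters(word)
    if not letters:
        return False  # assumption: empty input is handled elsewhere
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    if vowels == 0 or consonants == 0:
        return True  # all-vowel or all-consonant edge case
    return min(vowels / consonants, consonants / vowels) < threshold


def fails_unique_ratio_rule(word, threshold=0.4):
    """Rule 2: flag the word if unique letters / total letters falls below the threshold."""
    letters = _letters(word)
    if not letters:
        return False
    return len(set(letters)) / len(letters) < threshold


print(fails_vowel_rule("bbb"))              # True: no vowels at all
print(fails_unique_ratio_rule("dfffdf"))    # True: 2/6 is below 0.4
print(fails_vowel_rule("akkkakkaa"))        # False: the vowel/consonant mix is acceptable
print(fails_unique_ratio_rule("akkkakkaa")) # True: only 2 unique letters out of 9
```

Both thresholds are parameters, so they are easy to tune against your own sample data.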
I chose to combine rules 2 and 3 into a single score out of 2, where any word scoring less than 2 was considered to be valid (not random).
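Here is one way the combined score could look in Python. Again, this is a sketch under assumptions rather than my actual implementation: I'm assuming a US QWERTY layout, approximating the row stagger in quarter-key units, and reading "beside each other" as "every consecutive pair of letters sits on the same or touching keys". All names below are invented for the example.

```python
# Letter rows of a US QWERTY keyboard with an approximate horizontal stagger,
# measured in quarter-key units (one key is 4 units wide). These offsets are
# an assumption made for illustration.
ROWS = [("qwertyuiop", 0), ("asdfghjkl", 1), ("zxcvbnm", 3)]

KEY_POS = {
    ch: (row, 4 * col + offset)
    for row, (letters, offset) in enumerate(ROWS)
    for col, ch in enumerate(letters)
}


def keys_adjacent(a, b):
    """True if two letter keys are the same key or physically touch each other."""
    (row_a, x_a), (row_b, x_b) = KEY_POS[a], KEY_POS[b]
    return abs(row_a - row_b) <= 1 and abs(x_a - x_b) <= 4


def fails_adjacency_rule(word):
    """Rule 3: flag the word if every consecutive pair of letters is adjacent."""
    letters = [c for c in word.lower() if c in KEY_POS]
    if len(letters) < 2:
        return False  # assumption: too short to judge
    return all(keys_adjacent(a, b) for a, b in zip(letters, letters[1:]))


def fails_unique_ratio_rule(word, threshold=0.4):
    """Rule 2: flag the word if unique letters / total letters falls below the threshold."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return False
    return len(set(letters)) / len(letters) < threshold


def garbage_score(word):
    """Number of rules (2 and 3) that the word violates, from 0 to 2."""
    return int(fails_unique_ratio_rule(word)) + int(fails_adjacency_rule(word))


def is_garbage(word):
    """A word is treated as random only when it violates both rules."""
    return garbage_score(word) == 2


print(garbage_score("fred"))   # 1: adjacent keys, but plenty of letter variety
print(is_garbage("dfffdf"))    # True: adjacent keys and only 2 unique letters
```

Requiring both rules to fail is what keeps a word like "fred" from being flagged, even though all of its letters are neighbours on the keyboard.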
Obviously these rules are limited in how much randomness they can catch, and they are especially ineffective if the adversary explicitly tries to defeat them by entering an actual name (ex. "Curious George", "Darth Vader"). However, their false positive rate seemed like it should be fairly low (so as not to penalize people with slightly obscure names). In my tests, only about 20% of the random input provided to the tests was caught. When I ran the tests against the first and last names of people in our database, a very small number (< 0.001%) were caught, and upon visual inspection, all of these were random garbage, not actual people's names.
I'm interested in seeing how other people have tackled this problem.
Click here to try some input into the test.