Friday 26 February 2016

Finding anagrams with Mihalis Tsoukalos

Today we will focus on word anagrams. You probably already know that an anagram of a given word is a different word that has exactly the same letters as the original. Word anagrams have many applications, including dictionaries, word processors, computer games, scrabbles, password checking and more. Apart from the practical applications, the whole problem is a good exercise for the mind because it’s important that you see the problem from the right perspective before trying to solve it.

The process is greatly simplified by introducing the concept of word signatures. The letters of the word sorted in alphabetical order compose the signature of a word. Signatures are the building blocks of anagrams because words with the same signature are anagrams of each other! Therefore, hist, this and hits are anagrams; their common signature is “hist”.

If you don’t choose to use word signatures, finding a solution will undoubtedly be very difficult, inefficient, complex and slow. If you attempt to solve the problem by finding all of the possible permutations of the letters in a given word, you might well fail miserably or develop a solution that would appear rather pathetic. A word with ten letters has ten (factorial) possible permutations, which is an enormous number. Checking which of the permutations are valid words would require many look-ups to a dictionary, delaying the process even more. If you want to find the anagrams of multiple words, then the whole process might take hours or even days depending on the total number of words.

The presented code will be in Perl because it is smaller than C; you can try to implement a C or C++ version as an exercise. As you will see at the end of the column, you can even solve the problem using AWK!

From the definition of the signature term, you can understand that you will need to be able to do some kind of sorting in your code; the presented code uses the Perl sort function but you can use any algorithm you want.

The following Perl code does most of the job as it separates the letters of a word and puts them into a new variable named @sign:

for ($i = 0; $i < length($word); $i++)  {

    @sign = (@sign, substr($word, $i, 1)); }

Now you just have to sort the letters of the @sign array. Please note that @sign can contain any letter multiple times; the signature of “panama” should be “aaamnp”.

A simple run of the complete Perl program (wordSignature.pl) produces the following output:

$ ./wordSignature.pl

Give me a string: persistent

The signature of persistent is eeinprsstt.

As wordSignature.pl implements the signature function, you can easily use it and find out if two words have the same signature. Similarly, you can use the signature function to correct or auto correct misspelled words.

To save processing time, you can pre-compute word signatures for the dictionary file usually found inside /usr/share/dict on many Linux systems and store them as (word, anagram) pairs. Then, a grep using the signature of a given word will reveal its anagrams. You can also sort the words in respect of their signatures.

It is now time to look at the AWK implementation. The AWK script takes the same approach, so the AWK code will remind you of the Perl code. The AWK script reads its input from a text file as can be seen in the next output:

$ gawk -f signature.awk file_with_words

The signature of this is hist

The signature of that is ahtt

The script ran using GNU AWK version 4.0.1 that is an improved version of the original AWK implementation.

Not only is the AWK version smaller than Perl, it also processes every line of the input file without requiring any extra code! Nevertheless, the greatest advantage of AWK is that it makes you write clean code as you can recognise by looking at the implementation of the signature function. So, don’t underestimate the power of the AWK programming language.

Remember that sometimes you should simplify the original problem by transforming it into a simpler one. Where anagrams are concerned, the original problem is equal to two sub-problems: calculating a signature and selecting words with the same signature.

You can find the relevant source code at bit.ly/1T0mcpj and at bit.ly/1KYLr5A.



from Linux User & Developer – the Linux and FOSS mag for a GNU generation http://ift.tt/1TbwCoa
via IFTTT

No comments:

Post a Comment

Amazon

Donate

Donate Towards More Raspberry PI's for Projects