TiVo Forum Special Member
Registered: Oct 2000
Location: Reading, UK
Regular expressions are normally used for searching: they match a text string against a pattern and report whether it matches or not.
To compare two strings this way, you would somehow have to convert one string into a regular expression that could then be used to match against the second string. The pattern would need wildcards at every position where text could be missing, plus alternations covering every possible transposition. Unfortunately, as the string gets longer, the number of possible expressions goes up exponentially, and you would need a far more powerful processor to finish a match.
Allowing for misspellings, missing words, and word re-ordering, you really need to build up word tables and compare those. I've used more sophisticated versions of the method in the post above in high-volume commercial address-deduplication and bibliographic-matching systems. That method only allows for misspellings where consonants are doubled up, but more sophisticated systems would use character analysis of unmatched words, Soundex conversion, and thesauruses of common misspellings and abbreviations.
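A minimal Python sketch of the word-table idea: fold doubled-up consonants so those misspellings match, then score two word lists by how many words they share (counting words as a bag, so word re-ordering doesn't matter). The function names and the scoring formula are my own illustration, not from the post.

```python
import re
from collections import Counter

def fold_doubles(word):
    # Collapse a doubled-up consonant into a single one, so e.g.
    # ACCOMMODATION and the misspelling ACCOMODATION both reduce
    # to ACOMODATION and will compare equal.
    return re.sub(r'([B-DF-HJ-NP-TV-Z])\1', r'\1', word)

def similarity(words_a, words_b):
    # Treat each word list as a bag (multiset) so ordering is ignored,
    # and score 2 * shared words / total words: 1.0 means identical bags.
    a = Counter(fold_doubles(w) for w in words_a)
    b = Counter(fold_doubles(w) for w in words_b)
    shared = sum((a & b).values())
    return 2 * shared / (sum(a.values()) + sum(b.values()))
```

For example, similarity(['ACCOMMODATION', 'LONDON'], ['ACCOMODATION', 'LONDON']) scores 1.0 despite the dropped M, because both spellings fold to the same word.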
To parse the text into words, uppercase it and take each consecutive run of the characters A-Z and 0-9 as a word. One exception is that you should treat two words separated by just a hyphen as one word with the hyphen removed. Another is to treat the ampersand character as the word 'AND'. All other characters can be ignored.
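The parsing rules above can be sketched in a few lines of Python with the standard re module (the function name is mine):

```python
import re

def parse_words(text):
    # Uppercase the text first, so words compare case-insensitively.
    text = text.upper()
    # Two words separated by just a hyphen become one word, hyphen removed.
    text = re.sub(r'(\w)-(\w)', r'\1\2', text)
    # Treat the ampersand character as the word AND.
    text = text.replace('&', ' AND ')
    # All other characters are ignored: keep only runs of A-Z and 0-9.
    return re.findall(r'[A-Z0-9]+', text)
```

So parse_words("Smith & Wesson co-op") yields ['SMITH', 'AND', 'WESSON', 'COOP'].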
POST #58