When working with human names, you often want to see whether two name strings refer to the same person. This can be tough for a lot of reasons, but the first is the large variety of formats and abbreviations in use for the same name. That’s where HumanNameParser comes in; it takes names of varying complexity and format, like:
- J. Walter Weatherman
- de la Cruz, Lupe
- George Oscar “Gob” Bluth, Jr.
and parses out the:
- leading initial (like “J.” in “J. Walter Weatherman”)
- first name (or first initial in a name like ‘R. Crumb’)
- nicknames (like “Gob” in “George Oscar “Gob” Bluth, Jr.”)
- middle names
- last name (including compound ones like “van der Sar” and “Ortega y Gasset”), and
- suffix (like ‘Jr.’)
Features
Like nameparse.php (by Keith Beckman), HumanNameParser handles comma-reversed names (‘Smith, John’), names with non-English symbols, names with odd capitalization or punctuation (‘e e cummings’), first names made of initials (‘J.K. Rowling’), etc. (See the testing interface for more). However, it also:
- captures leading initials and nicknames separately, instead of calling them first or middle names.
- is easy to hack:
  - object-oriented PHP
  - uses simple regular expressions for matching
  - includes a suite of test names and a testing interface, as well as PHPUnit tests
  - fully documented for PHPdoc
Usage:
    // 1. include HumanNameParser.php in your script
    require_once('./HumanNameParser/init.php');

    // 2. instantiate the parser, passing the (utf8-encoded) name you want to parse
    $parser = new HumanNameParser_Parser("de la Rúa, C. John Roger, Jr.");

    // 3. Use the relevant 'get' method to retrieve name parts:
    //    'leadingInit', 'first', 'nicknames', 'middle', 'last', and 'suffix'
    echo $parser->getFirst() . ' ' . $parser->getLast(); // prints 'John de la Rúa'

    // You can also get the names as an array
    print_r($parser->getArr()); // array( [leadingInit] => 'C.', [first] => 'John' ... )

    // 4. Use the setter method for new names
    $parser->setName("Angela H. Brooks");
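Since the usual goal is to decide whether two name strings refer to the same person, here is a minimal sketch of one way to compare two parsed results. It assumes arrays keyed the way getArr() is described above ('first', 'last', and so on). The helper names normalizeNamePart and sameFirstAndLast are hypothetical and not part of HumanNameParser, and the matching rule (case-insensitive first and last name only) is just one possible choice.

```php
<?php
// Hypothetical helpers, not part of HumanNameParser.
// They compare two arrays shaped like the output of getArr().

function normalizeNamePart($part)
{
    $part = trim((string) $part);
    // Prefer multibyte-aware lowercasing when the mbstring extension is available.
    return function_exists('mb_strtolower')
        ? mb_strtolower($part, 'UTF-8')
        : strtolower($part);
}

function sameFirstAndLast(array $a, array $b)
{
    foreach (array('first', 'last') as $key) {
        $left  = isset($a[$key]) ? normalizeNamePart($a[$key]) : '';
        $right = isset($b[$key]) ? normalizeNamePart($b[$key]) : '';
        // A missing or empty part never counts as a match.
        if ($left === '' || $left !== $right) {
            return false;
        }
    }
    return true;
}

// Hand-written arrays standing in for two getArr() results.
$fromReversed = array('leadingInit' => 'C.', 'first' => 'John', 'last' => 'de la Rúa', 'suffix' => 'Jr.');
$fromNatural  = array('first' => 'john', 'last' => 'De La Rúa');

var_dump(sameFirstAndLast($fromReversed, $fromNatural)); // bool(true)
```

A stricter rule could also compare 'suffix' or 'middle'; which parts matter depends on how much ambiguity your data can tolerate.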
Testing/hacking
testNames.txt contains the test names and the correct parsing of each one.
The included index.php will run the parser and test it against each name. The list in
testNames.txt is a good way to see how the parser will parse a given name. Lines are formatted like this:
<nameString>|<firstInitial>|<firstName>|<nicknames>|<middleNames>|<lastNames>|<suffix>
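If you want to read that file from your own scripts, a line in this format can be split into named fields. The following is a minimal sketch, assuming every line has exactly seven pipe-separated fields in the order shown; parseTestLine is a hypothetical helper, not something shipped with HumanNameParser, and the sample line is written by hand rather than copied from testNames.txt.

```php
<?php
// Hypothetical reader for one line of testNames.txt.
// Field order follows the line format documented above.

function parseTestLine($line)
{
    $keys = array(
        'nameString', 'firstInitial', 'firstName',
        'nicknames', 'middleNames', 'lastNames', 'suffix',
    );
    // Drop the trailing newline, then split on the pipe delimiter.
    $fields = explode('|', rtrim($line, "\r\n"));
    if (count($fields) !== count($keys)) {
        return null; // malformed line: wrong number of fields
    }
    return array_combine($keys, $fields);
}

// Hand-written sample line for illustration.
$expected = parseTestLine("J. Walter Weatherman|J.|Walter|||Weatherman|\n");
print_r($expected);
```

Empty fields come back as empty strings, so a name with no nickname or suffix still yields all seven keys.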
Issues
- Requires that your PHP has been compiled with the --enable-utf8 and --enable-unicode-properties flags if you use names that require Unicode (here’s some info on how to recompile).
- Can’t recognize ‘Ben’ as a middle name; assumes it’s the first part of a last name like ‘ben Gurion’.
- Doesn’t know which name is the surname or given name, just first and last.
- Doesn’t recognize multiple-word first names like “Billy Joe.”
- Doesn’t match titles (like ‘Mr.’ or ‘Dr.’) for now; I haven’t needed them. But
they could be added easily.
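Regarding the Unicode requirement in the first issue: before recompiling anything, you can check at runtime whether your PHP's regular expression engine already accepts UTF-8 patterns and Unicode property escapes. This is a minimal sketch; pcreSupportsUnicode is a hypothetical name, and it is an assumption that this probe covers exactly the features the parser relies on.

```php
<?php
// Runtime probe (a sketch): does this PHP's PCRE accept the /u modifier
// and Unicode property escapes such as \pL ("any letter")?

function pcreSupportsUnicode()
{
    // The @ suppresses the warning PHP emits when the pattern cannot be compiled;
    // in that case preg_match() returns false rather than 1.
    return @preg_match('/^\pL+$/u', 'Rúa') === 1;
}

echo pcreSupportsUnicode()
    ? "PCRE Unicode support looks available\n"
    : "PCRE Unicode support appears to be missing\n";
```

If the probe reports that support is missing, names with non-ASCII letters are the ones most likely to be parsed incorrectly.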
Credits
Thanks to Keith Beckman for nameparse.php; I expanded a bit on his lists of suffixes and prefixes. Also thanks to Jed Hartman, who as far as I can tell wrote the first one of these at http://alphahelical.com/code/misc/nameparse/.
Download
HumanNameParser currently lives at GitHub; here’s a direct link to the zipped source code.