Better human name parsing for PHP

When working with human names, you often want to know whether two name strings refer to the same person. This is hard for many reasons, the first being the wide variety of formats and abbreviations in use for the same name. That’s where HumanNameParser comes in: it takes names of varying complexity and format, like:

  • J. Walter Weatherman
  • de la Cruz, Lupe
  • George Oscar “Gob” Bluth, Jr.

and parses out the:

  • leading initial (like “J.” in “J. Walter Weatherman”)
  • first name (or first initial in a name like ‘R. Crumb’)
  • nicknames (like “Gob” in “George Oscar “Gob” Bluth, Jr.”)
  • middle names
  • last name (including compound ones like “van der Sar” and “Ortega y Gasset”), and
  • suffix (like ‘Jr.’)

Features

Like nameparse.php (by Keith Beckman), HumanNameParser handles comma-reversed names (‘Smith, John’), names with non-English symbols, names with odd capitalization or punctuation (‘e e cummings’), first names made of initials (‘J.K. Rowling’), and so on. (See the testing interface for more.) However, it also:

  • captures leading initials and nicknames separately, instead of calling them first or middle names.
  • is easy to hack.

Usage

// 1. include HumanNameParser's init.php in your script
require_once('./HumanNameParser/init.php');
 
// 2. instantiate the parser, passing the (utf8-encoded) name you want to parse
$parser = new HumanNameParser_Parser("de la Rúa, C. John Roger, Jr.");
 
// 3. Use the relevant 'get' method to retrieve name parts:
//   'leadingInit', 'first', 'nicknames', 'middle', 'last', and 'suffix'
echo $parser->getFirst() . ' ' . $parser->getLast(); // prints 'John de la Rúa'
 
//   You can also get the names as an array
print_r($parser->getArr()); // array( [leadingInit] => 'C.', [first] => 'John' ... )
 
// 4. Use the setter method for new names
$parser->setName("Angela H. Brooks");
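
Once two names are parsed, the "same person?" question from the introduction becomes a field-by-field comparison. The following is a minimal sketch, not part of HumanNameParser: `looksLikeSamePerson()` and `normalizeNamePart()` are hypothetical helpers, and the input arrays are written by hand in the shape that `getArr()` is described as returning (keys 'leadingInit', 'first', 'nicknames', 'middle', 'last', 'suffix'). It only compares first and last names, ignoring case and trailing periods.

```php
<?php
// Hypothetical helper: lower-cases a name part and drops surrounding
// whitespace and trailing periods. Falls back to strtolower() (ASCII only)
// when the mbstring extension is not installed.
function normalizeNamePart(string $part): string
{
    $part = rtrim(trim($part), '.');
    return function_exists('mb_strtolower')
        ? mb_strtolower($part, 'UTF-8')
        : strtolower($part);
}

// Hypothetical helper: true when both parsed names have a non-empty last
// name and their normalized first and last names are identical.
// Each argument is assumed to be shaped like the output of getArr().
function looksLikeSamePerson(array $a, array $b): bool
{
    $lastA = normalizeNamePart($a['last']);
    $lastB = normalizeNamePart($b['last']);
    if ($lastA === '' || $lastA !== $lastB) {
        return false;
    }
    return normalizeNamePart($a['first']) === normalizeNamePart($b['first']);
}

// Hand-written stand-ins for two getArr() results.
$fromReversed = array('leadingInit' => '', 'first' => 'Lupe', 'nicknames' => '',
                      'middle' => '', 'last' => 'de la Cruz', 'suffix' => '');
$fromShouting = array('leadingInit' => '', 'first' => 'LUPE', 'nicknames' => '',
                      'middle' => '', 'last' => 'De La Cruz', 'suffix' => '');

var_dump(looksLikeSamePerson($fromReversed, $fromShouting)); // bool(true)
```

A real matcher would probably also want to weigh initials, nicknames, and suffixes; this sketch deliberately leaves those out.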

Testing/hacking

testNames.txt contains the test names and correct parsings of each one.
The included index.php will run the parser and test against each name. This list is
a good way to see how the parser will parse a given name. Lines are formatted like this:

<nameString>|<firstInitial>|<firstName>|<nicknames>|<middleNames>|<lastNames>|<suffix>
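
If you want to read these lines in your own scripts, splitting on the pipe character is enough. The sketch below is an illustration rather than the code index.php actually uses: `parseTestLine()` and `TEST_LINE_FIELDS` are hypothetical names, and the sample line is made up to match the format above, not copied from testNames.txt.

```php
<?php
// Field names, in the order given by the line format above.
const TEST_LINE_FIELDS = array(
    'nameString', 'firstInitial', 'firstName',
    'nicknames', 'middleNames', 'lastNames', 'suffix',
);

// Hypothetical helper: turns one pipe-delimited line into an associative
// array keyed by field name. Empty fields come back as empty strings.
function parseTestLine(string $line): array
{
    $parts = explode('|', rtrim($line, "\r\n"));
    if (count($parts) !== count(TEST_LINE_FIELDS)) {
        throw new InvalidArgumentException(
            'Expected ' . count(TEST_LINE_FIELDS) . ' fields, got ' . count($parts)
        );
    }
    return array_combine(TEST_LINE_FIELDS, $parts);
}

// Illustrative line only; the real file may parse this name differently.
$expected = parseTestLine("J. Walter Weatherman|J.|Walter|||Weatherman|\n");
echo $expected['firstName'] . ' ' . $expected['lastNames'] . "\n"; // prints "Walter Weatherman"
```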

Issues

  • If you parse names that need Unicode, your PHP must have been compiled with the --enable-utf8 and --enable-unicode-properties flags set (here’s some info on how to recompile).
  • Can’t recognize ‘Ben’ as a middle name; assumes it’s the first part of a last name like ‘ben Gurion’.
  • Doesn’t know which name is the surname or given name, just first and last.
  • Doesn’t recognize multiple-word first names like “Billy Joe.”
  • Doesn’t match titles (like ‘Mr.’ or ‘Dr.’) for now; I haven’t needed them. But
    they could be added easily.
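
For the Unicode requirement in the first item, one way to check an installation before parsing is to try a small pattern that needs both UTF-8 mode and Unicode properties. This is a sketch under the assumption that the requirement comes down to the regex engine accepting the `/u` modifier and `\p{...}` classes; `pcreSupportsUnicodeProperties()` is a hypothetical helper, not part of HumanNameParser, and the file must be saved as UTF-8.

```php
<?php
// Hypothetical helper: true when preg_match() can compile and match a
// pattern that uses the /u modifier and a Unicode property class.
function pcreSupportsUnicodeProperties(): bool
{
    // preg_match() returns false (and emits a warning, silenced here with @)
    // when the pattern cannot be compiled, and 1 when the subject matches.
    return @preg_match('/^\p{L}+$/u', 'Rúa') === 1;
}

if (pcreSupportsUnicodeProperties()) {
    echo "Unicode-aware patterns are available.\n";
} else {
    echo "No Unicode support in regular expressions; non-ASCII names may not parse.\n";
}
```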

Credits

Thanks to Keith Beckman for nameparse.php; I expanded a bit on his lists of suffixes and prefixes. Also thanks to Jed Hartman, who as far as I can tell wrote the first one of these at http://alphahelical.com/code/misc/nameparse/.

Download

HumanNameParser currently lives at GitHub; here’s a direct link to the zipped source code.