21 Jan 2012
name2gender in python

Given a name or an email address, how can you guess the gender? The answer is simple: it’s all about data.

The US Social Security Administration (http://www.ssa.gov/) provides a service where you can request the number of births per name and per gender since 1880. However, its pages are not easy to work with directly. After searching for a while I found this page, http://www.infochimps.com/datasets/popular-baby-names-by-year-top-1000-us-social-security-administr, which gives a method for scraping the SSA name lists with a bash script.

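#!/bin/bash
# The C-style for loop below is a bash feature, so run this with bash rather than plain sh.
url=http://www.ssa.gov/cgi-bin/popularnames.cgi
mkdir -p names
# The SSA data starts in 1880; fetch the top 1000 names for each year.
for ((yr=1880 ; yr <= 2010 ; yr++)) ; do
	echo $yr
	curl -d "year=$yr&top=1000&number=n" "$url" > names/$yr.html
done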

Alternatively, you can just download the data from the Infochimps website.

However, this is raw data and thus not very helpful. What we need is a data structure that lets us query the gender for a given name (ideally as a distribution).
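
As a rough illustration of the target structure, the sketch below maps each name to a pair of birth counts and derives a distribution from them. The counts and the helper name `to_distribution` are made up for this example; they are not real SSA figures.

```python
# Hypothetical counts for illustration only (not real SSA figures).
counts = {
    'Alex': {'male': 900, 'female': 100},
    'Mary': {'male': 5, 'female': 995},
}

def to_distribution(c):
    # Normalise the two counts so that they sum to 1.
    total = float(c['male'] + c['female'])
    return {'male': c['male'] / total, 'female': c['female'] / total}

print(to_distribution(counts['Alex']))
```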

I used BeautifulSoup to parse the page and extract records.

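import glob
# BeautifulSoup 3 import; with the newer bs4 package this would be
# "from bs4 import BeautifulSoup".
from BeautifulSoup import BeautifulSoup

files = glob.glob('names/*.html')

for f in files:
    with open(f, 'r') as fh:
        html_data = fh.read()

    soup = BeautifulSoup(html_data)
    # The year of birth is read from the element with id "yob".
    year = soup.find(id="yob")['value']
    tables = soup.findAll('table')
    # The third table holds the names; skip its first and last rows.
    trs = tables[2].findAll('tr')
    for tr in trs[1:-1]:
        tds = tr.findAll('td')
        # Output format: name,count,gender,year.
        # The cell positions depend on the layout of the downloaded pages;
        # check them against your own HTML files.
        print("%s,%s,%s,%s" % (
            tds[1].contents[0],
            tds[0].contents[0].replace(',', ''),
            'male',
            year))
        print("%s,%s,%s,%s" % (
            tds[3].contents[0],
            tds[2].contents[0].replace(',', ''),
            'female',
            year))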

I stored the output in names.csv.

Then, to transform this information into a probability distribution per name, I used the following code.

import json 

def prob( m, f ) :
    s = m + f
    return {'male':m/(1.0*s), 'female':f/(1.0*s)}

def load_data( file ) :
    names = {}
    f = open( file, 'r' )
    for l in f :
        d = l.rstrip().split(',')

        name = d[0]
        counter = d[1]
        gender = d[2]
        year = d[3]

        if name not in names :
            names[name] = { 'male':0, 'female':0 }

        if gender == 'male' :
            names[name]['male'] += int(counter)
        else :
            names[name]['female'] += int(counter)

    return names

db = load_data('names.csv')

# Store the full distribution per name, e.g. {'male': 0.9, 'female': 0.1},
# so that the probabilities can be compared when querying.
names = {}
for d in db:
    names[d] = prob(db[d]['male'], db[d]['female'])

print(json.dumps(names))

I saved this to names.json.

Finally, to query the gender of a name, you just compare the probabilities:

import json

# Each entry is expected to look like {'male': p, 'female': p}.
with open('names.json', 'r') as f:
    names = json.load(f)

def check_name(name):
    # Names missing from the data set cannot be classified.
    if name not in names:
        return 'unknown'
    p = names[name]
    if p['male'] > p['female']:
        return 'male'
    elif p['female'] > p['male']:
        return 'female'
    else:
        # Equal probabilities: no majority gender.
        return 'unknown'
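
The introduction also mentions email addresses. One possible way to handle them, sketched here under the assumption that the local part starts with a first name (as in `john.smith@example.com`), is to extract a candidate name and look that up. The helper `name_from_email` is hypothetical and not part of the original code, and the capitalisation assumes that names are stored as `John`, `Mary` and so on.

```python
import re

def name_from_email(email):
    # Hypothetical helper: take the part before '@' and keep the first
    # run of letters at its start.
    local = email.split('@')[0]
    match = re.match(r'[A-Za-z]+', local)
    if match is None:
        return None
    # Assumption: names in the data set are capitalised, e.g. 'John'.
    return match.group(0).capitalize()

print(name_from_email('john.smith42@example.com'))
```

The result can then be passed to `check_name`. Addresses that do not start with a first name (initials, nicknames, role accounts) will most likely come back as 'unknown'.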

The probability approach also gives the flexibility to compute a confidence for the guessed gender of a given name [TODO].
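
One possible way to fill in this TODO, assuming each entry holds the `male` and `female` probabilities, is to report the larger of the two probabilities as the confidence. The function name `gender_with_confidence` and the sample data below are assumptions made for illustration.

```python
# Hypothetical sample in the {'male': p, 'female': p} format.
sample_names = {
    'Alex': {'male': 0.9, 'female': 0.1},
    'Jamie': {'male': 0.5, 'female': 0.5},
}

def gender_with_confidence(name, names):
    # Returns (gender, confidence); confidence is the larger probability.
    if name not in names:
        return ('unknown', 0.0)
    p = names[name]
    if p['male'] == p['female']:
        return ('unknown', p['male'])
    gender = 'male' if p['male'] > p['female'] else 'female'
    return (gender, p[gender])

print(gender_with_confidence('Alex', sample_names))
```

A threshold on the confidence could then be used to fall back to 'unknown' for ambiguous names.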
