21 Jan 2012
name2gender in python

The problem is the following: given a name or an email address, how can you guess
the gender? The answer is simple: it's all about data.

The US Social Security Administration (http://www.ssa.gov/) created a service where you
can request the number of births per name and per gender since 1880. However, their
pages are not easily accessible. After searching for a while I found this page
http://www.infochimps.com/datasets/popular-baby-names-by-year-top-1000-us-social-security-administr
which describes how to scrape the SSA name lists with a bash script.

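#!/bin/bash
# Fetch the top-1000 list for each year into names/<year>.html.
# The C-style for loop needs bash; the data starts in 1880.
url=http://www.ssa.gov/cgi-bin/popularnames.cgi
mkdir -p names
for ((yr=1880 ; yr <= 2010 ; yr++)) ; do
	echo $yr
	curl -d "year=$yr&top=1000&number=n" "$url" > names/$yr.html
done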
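
If you would rather stay in Python for the download step too, the same loop can be sketched with the standard library. This is a minimal sketch that assumes the endpoint still accepts the same form fields as the curl call above. The function names `form_body` and `download` are mine, not part of the original script, and the code targets Python 3.

```python
import os
from urllib.parse import urlencode
from urllib.request import urlopen

URL = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'


def form_body(year):
    """Encode the same POST fields that the curl command sends."""
    fields = {'year': year, 'top': 1000, 'number': 'n'}
    return urlencode(fields).encode('ascii')


def download(first_year=1880, last_year=2010, out_dir='names'):
    """Save one HTML page per year as <out_dir>/<year>.html."""
    os.makedirs(out_dir, exist_ok=True)
    for year in range(first_year, last_year + 1):
        # Passing data= makes urlopen issue a POST, like curl -d.
        with urlopen(URL, data=form_body(year)) as response:
            html = response.read()
        with open(os.path.join(out_dir, '%d.html' % year), 'wb') as out:
            out.write(html)
```

Calling `download()` fetches every year from 1880 to 2010. Nothing is requested until you call it.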

Alternatively, you can just download the data from the Infochimps website.

The problem is that this is raw information and thus not very helpful. What we need is a data structure that lets us query the gender for a given name (ideally as a distribution).

I used BeautifulSoup to parse the pages and extract the records.

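import glob
# BeautifulSoup 4; with BeautifulSoup 3 the import was
# "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup

for path in sorted(glob.glob('names/*.html')):
    with open(path, 'r') as f:
        html_data = f.read()

    soup = BeautifulSoup(html_data, 'html.parser')
    # The requested year is read from the element with id "yob"
    year = soup.find(id="yob")['value']
    tables = soup.find_all('table')
    trs = tables[2].find_all('tr')
    # Skip the first and the last row of the table
    for tr in trs[1:-1]:
        tds = tr.find_all('td')
        # One CSV row per gender: name,count,gender,year
        print("%s,%s,%s,%s" % (tds[1].contents[0], tds[0].contents[0].replace(',', ''), 'male', year))
        print("%s,%s,%s,%s" % (tds[3].contents[0], tds[2].contents[0].replace(',', ''), 'female', year))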

I stored the output in names.csv.

Then, to transform this information into a probability distribution per name, I used the following code.

import json 

def prob( m, f ) :
    s = m + f
    return {'male':m/(1.0*s), 'female':f/(1.0*s)}

def load_data(path):
    # Sum the yearly counts per name and gender over rows of the
    # form name,count,gender,year
    names = {}
    with open(path, 'r') as f:
        for line in f:
            name, counter, gender = line.rstrip().split(',')[:3]

            if name not in names:
                names[name] = {'male': 0, 'female': 0}

            if gender == 'male':
                names[name]['male'] += int(counter)
            else:
                names[name]['female'] += int(counter)

    return names

db = load_data('names.csv')

# Store the distribution itself, {'male': p_male, 'female': p_female},
# so that the lookup can compare the two probabilities later.
names = {}
for name in db:
    names[name] = prob(db[name]['male'], db[name]['female'])

print(json.dumps(names))

I saved this to names.json.

Finally, to query the gender of a name, you just compare the probabilities:

import json

with open('names.json', 'r') as f:
    names = json.load(f)

def check_name(name):
    # Names missing from the lists cannot be classified
    if name not in names:
        return 'unknown'
    p = names[name]
    if p['male'] > p['female']:
        return 'male'
    elif p['female'] > p['male']:
        return 'female'
    else:
        # Both genders are equally likely
        return 'unknown'
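
For the email case you first need a candidate first name to look up. Here is a minimal sketch, assuming addresses that look like first.last@domain and assuming the names in the lookup table are capitalised ("John"). The helper `name_from_email` is hypothetical and not part of the code above. Real addresses are messier than this.

```python
import re


def name_from_email(address):
    """Guess a first-name candidate from the local part of an address.

    Hypothetical helper: it assumes the first alphabetic token of the
    local part is the first name, which is often not the case.
    """
    local_part = address.split('@', 1)[0]
    # Split on anything that is not a letter: "john.smith82" -> john, smith
    tokens = [t for t in re.split(r'[^A-Za-z]+', local_part) if t]
    if not tokens:
        return None
    # Assumption: the lookup table stores names as "John", not "john"
    return tokens[0].capitalize()


print(name_from_email('john.smith82@example.com'))
```

The result can then be passed to the name lookup. A `None` result, or a token that is not in the table, should be treated as 'unknown'.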

The probability approach also gives you the flexibility to compute a confidence for the
gender of a given name.
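
One way to use that confidence is to answer only when the larger probability clears a threshold. The sketch below assumes the table maps each name to a dict with 'male' and 'female' probabilities. The function `gender_with_confidence`, the 0.9 default threshold and the sample numbers are made up for illustration and are not real SSA figures.

```python
def gender_with_confidence(name, names, threshold=0.9):
    """Return (gender, confidence), or ('unknown', confidence) when unsure.

    `names` maps a name to {'male': p_male, 'female': p_female}.
    """
    dist = names.get(name)
    if dist is None:
        return ('unknown', 0.0)
    # The confidence is the probability of the more likely gender
    gender = max(dist, key=dist.get)
    confidence = dist[gender]
    if confidence < threshold:
        return ('unknown', confidence)
    return (gender, confidence)


# Illustrative values only, not taken from the SSA data
sample = {
    'Mary': {'male': 0.004, 'female': 0.996},
    'Alex': {'male': 0.75, 'female': 0.25},
}

print(gender_with_confidence('Mary', sample))
print(gender_with_confidence('Alex', sample))
print(gender_with_confidence('Alex', sample, threshold=0.7))
```

A higher threshold gives fewer but safer answers. The right value depends on how costly a wrong guess is for your application.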
