The problem is the following: given a name or an email address, how can you guess
the gender? The answer is simple: it's all about data.
The US Social Security Administration (http://www.ssa.gov/) provides a service where you
can request the number of births per name and gender since 1880. However, their
pages are not easily accessible. After searching for a while I found this page
http://www.infochimps.com/datasets/popular-baby-names-by-year-top-1000-us-social-security-administr
which gives a method for scraping the SSA name list with a bash script.

    #!/bin/bash
    url=http://www.ssa.gov/cgi-bin/popularnames.cgi
    mkdir -p names
    for ((yr=1880; yr <= 2010; yr++)); do
        echo $yr
        curl -d "year=$yr&top=1000&number=n" $url > names/$yr.html
    done

Alternatively, you can just download the data from the Infochimps website.
The problem is that this is raw information and thus not very helpful. What we need is a data structure that lets us query the gender of a name (ideally as a distribution).
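As a rough sketch of the kind of structure meant here, assuming a plain Python dictionary keyed by name (the names and counts below are made-up placeholders, not SSA figures, and `distribution` is a hypothetical helper):

```python
# Hypothetical illustration: the counts are placeholders, not real SSA data.
counts = {
    "Alex": {"male": 900, "female": 100},
    "Mary": {"male": 5, "female": 995},
}

def distribution(name):
    """Return the gender distribution for a name, or None if it is unknown."""
    entry = counts.get(name)
    if entry is None:
        return None
    total = entry["male"] + entry["female"]
    return {gender: n / total for gender, n in entry.items()}

print(distribution("Alex"))  # {'male': 0.9, 'female': 0.1}
```

Keeping raw counts rather than a single label leaves room to derive both a best guess and how strongly the data supports it.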
I used BeautifulSoup to parse each page and extract one line per name, with its count, gender and year:

    import glob
    from BeautifulSoup import BeautifulSoup

    files = glob.glob('names/*.html')
    for f in files:
        with open(f, 'r') as html_file:
            html_data = html_file.read()
        soup = BeautifulSoup(html_data)
        # The year is read from the element with id "yob".
        year = soup.find(id="yob")['value']
        # The third table on the page holds the name rows.
        tables = soup.findAll('table')
        trs = tables[2].findAll('tr')
        # Skip the first and last rows of the table.
        for tr in trs[1:-1]:
            tds = tr.findAll('td')
            print("%s,%s,%s,%s" % (tds[1].contents[0], tds[0].contents[0].replace(',', ''), 'male', year))
            print("%s,%s,%s,%s" % (tds[3].contents[0], tds[2].contents[0].replace(',', ''), 'female', year))

I stored the output in names.csv.
Then, to transform this information into a probability distribution per name, I used the following code.

    import json

    def prob(m, f):
        # Turn the male/female counts into a probability distribution.
        s = m + f
        return {'male': m / (1.0 * s), 'female': f / (1.0 * s)}

    def load_data(path):
        # Sum the counts per name and gender over all years.
        names = {}
        with open(path, 'r') as f:
            for l in f:
                d = l.rstrip().split(',')
                name = d[0]
                counter = d[1]
                gender = d[2]
                if name not in names:
                    names[name] = {'male': 0, 'female': 0}
                if gender == 'male':
                    names[name]['male'] += int(counter)
                else:
                    names[name]['female'] += int(counter)
        return names

    db = load_data('names.csv')
    # Keep the full distribution per name so it can be compared at query time.
    names = {}
    for d in db:
        names[d] = prob(db[d]['male'], db[d]['female'])
    print(json.dumps(names))

I saved this to names.json.
Finally, to query the gender of a name, you just compare the probabilities:

    import json

    with open('names.json', 'r') as f:
        names = json.load(f)

    def check_name(name):
        if name in names:
            if names[name]['male'] > names[name]['female']:
                return 'male'
            elif names[name]['female'] > names[name]['male']:
                return 'female'
            else:
                return 'unknown'
        else:
            return 'unknown'

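The problem statement also mentions email addresses, which the code above does not cover. One possible approach, offered only as a sketch, is to take the first alphabetic token of the part before the `@` and capitalize it, on the assumption that the names are keyed with a leading capital such as `John`. The helper `first_name_from_email` is hypothetical, and it will simply guess wrong for addresses such as `jsmith@example.com` that do not start with a first name.

```python
import re

def first_name_from_email(email):
    """Guess a first-name candidate from an email address, or return None."""
    local_part = email.split("@", 1)[0]
    # Split on anything that is not a letter (dots, digits, underscores, ...).
    tokens = [t for t in re.split(r"[^A-Za-z]+", local_part) if t]
    if not tokens:
        return None
    # Assumption: names are keyed with a leading capital, e.g. "John".
    return tokens[0].capitalize()

print(first_name_from_email("john.smith@example.com"))  # John
```

The returned candidate can then be passed to `check_name` like any other name.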
The probability approach also gives the flexibility to compute a confidence for the guessed
gender of a given name.
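A minimal sketch of that idea, assuming each entry has the `{'male': p, 'female': q}` shape that `check_name` compares. The function name `guess_with_confidence` and the 0.8 threshold are illustrative choices, not part of the original code.

```python
def guess_with_confidence(probs, threshold=0.8):
    """Return (gender, confidence); gender is 'unknown' below the threshold."""
    # Pick the more probable gender; its probability serves as the confidence.
    gender = max(("male", "female"), key=lambda g: probs[g])
    confidence = probs[gender]
    if confidence < threshold:
        return "unknown", confidence
    return gender, confidence

print(guess_with_confidence({"male": 0.9, "female": 0.1}))    # ('male', 0.9)
print(guess_with_confidence({"male": 0.55, "female": 0.45}))  # ('unknown', 0.55)
```

Raising the threshold trades coverage for accuracy: fewer names get a label, but the labels that remain are backed by a more lopsided distribution.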

