Given a name or an email address, how can you guess the person's gender? The answer is simple: it's all about data.
The US Social Security Administration (http://www.ssa.gov/) created a service where you can request the number of births per name and per gender since 1880. However, their pages are not easily accessible. After searching for a while I found this page, http://www.infochimps.com/datasets/popular-baby-names-by-year-top-1000-us-social-security-administr, which gives a method for scraping the SSA name lists with a bash script.
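#!/bin/bash
# Download the top-1000 name list for every year into names/<year>.html
url=http://www.ssa.gov/cgi-bin/popularnames.cgi
mkdir -p names
# The SSA data starts in 1880
for ((yr=1880 ; yr <= 2010 ; yr++)) ; do
    echo $yr
    curl -d "year=$yr&top=1000&number=n" $url > names/$yr.html
done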
Alternatively, you can just download the data from the Infochimps website.
However, this is raw data and thus not very helpful on its own. What we need is a data structure that lets us query the gender for a name (ideally as a distribution).
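As a minimal sketch of the kind of structure meant here, the snippet below maps each name to a male/female distribution. The counts are made up for illustration and are not real SSA figures, and `to_distribution` is a hypothetical helper name.

```python
# Illustrative counts only, NOT real SSA figures.
counts = {
    'Alex': {'male': 900, 'female': 100},
    'Mary': {'male': 10, 'female': 990},
}

def to_distribution(c):
    """Turn {'male': n, 'female': m} counts into probabilities that sum to 1."""
    total = c['male'] + c['female']
    return {g: c[g] / float(total) for g in ('male', 'female')}

# name -> {'male': p, 'female': q}
distributions = {name: to_distribution(c) for name, c in counts.items()}
```

With a structure like this, querying a name is a single dictionary lookup, for example `distributions['Alex']['male']`.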
I used BeautifulSoup to parse each page and extract the name, count, gender and year for every row:
import glob
from BeautifulSoup import BeautifulSoup

files = glob.glob('names/*.html')
for f in files:
    html_data = open(f, 'r').read()
    soup = BeautifulSoup(html_data)
    # The year of birth is read from the element with id "yob"
    year = soup.find(id="yob")['value']
    tables = soup.findAll('table')
    # The third table holds the names; skip its first and last rows
    trs = tables[2].findAll('tr')
    for tr in trs[1:-1]:
        tds = tr.findAll('td')
        # One CSV line per name: name,count,gender,year
        # (column positions follow the layout of the downloaded pages)
        print("%s,%s,%s,%s" % (
            tds[1].contents[0],
            tds[0].contents[0].replace(',', ''),
            'male',
            year))
        print("%s,%s,%s,%s" % (
            tds[3].contents[0],
            tds[2].contents[0].replace(',', ''),
            'female',
            year))
I stored the output in names.csv.
Then, to transform this information into a probability distribution per name, I used the following code:
import json

def prob(m, f):
    # Turn raw male/female counts into a probability distribution
    s = m + f
    return {'male': m / (1.0 * s), 'female': f / (1.0 * s)}

def load_data(file):
    # Sum the counts per name and gender over all years
    names = {}
    f = open(file, 'r')
    for l in f:
        d = l.rstrip().split(',')
        name = d[0]
        counter = d[1]
        gender = d[2]
        year = d[3]
        if name not in names:
            names[name] = {'male': 0, 'female': 0}
        if gender == 'male':
            names[name]['male'] += int(counter)
        else:
            names[name]['female'] += int(counter)
    f.close()
    return names

db = load_data('names.csv')
names = {}
for d in db:
    # Store the distribution itself, so the lookup can compare probabilities
    names[d] = prob(db[d]['male'], db[d]['female'])
print(json.dumps(names))
I saved this to names.json.
Finally, to query the gender of a name, you just compare the probabilities:
import json

f = open('names.json', 'r')
names = json.loads(f.read())
f.close()

def check_name(name):
    # Compare the male and female probabilities stored for the name
    if name in names:
        if names[name]['male'] > names[name]['female']:
            return 'male'
        elif names[name]['male'] < names[name]['female']:
            return 'female'
        else:
            return 'unknown'
    else:
        return 'unknown'
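The opening question also mentions email addresses. One possible way to handle them, sketched below, is to pull a first-name candidate out of the part before the `@` and look that up like any other name. The function `name_from_email` is hypothetical, and the sketch assumes that the first alphabetic token is the given name and that the names in the list are capitalised (for example "Mary").

```python
import re

def name_from_email(address):
    """Return a capitalised first-name candidate from an email address, or None."""
    local_part = address.split('@', 1)[0]
    # Split on anything that is not a letter: "mary.smith82" -> ["mary", "smith"]
    tokens = [t for t in re.split(r'[^A-Za-z]+', local_part) if t]
    if not tokens:
        return None
    # Assumption: the first alphabetic token is the given name.
    return tokens[0].capitalize()
```

This will miss addresses such as `msmith@example.com`, where the given name is abbreviated, so a failed lookup should simply be treated as "unknown".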
The probability approach also gives the flexibility to compute a confidence for the gender of a given name [TODO].
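A minimal sketch of what such a confidence could look like, assuming each name maps to a `{'male': p, 'female': q}` distribution: take the larger of the two probabilities as the confidence and answer "unknown" below a cut-off. The function name `gender_with_confidence` and the default threshold of 0.9 are illustrative choices, not part of the code above.

```python
def gender_with_confidence(dist, threshold=0.9):
    """Return (gender, confidence) for a {'male': p, 'female': q} distribution.

    The confidence is the probability of the more likely gender. Below the
    threshold (assumed to be above 0.5) the gender is reported as 'unknown'.
    """
    gender = max(dist, key=dist.get)
    confidence = dist[gender]
    if confidence < threshold:
        return 'unknown', confidence
    return gender, confidence
```

A higher threshold trades coverage for accuracy: fewer names get a gender, but the ones that do are less likely to be wrong.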

