{ "info": { "author": "Suriyan Laohaprapanon, Gaurav Sood", "author_email": "suriyant@gmail.com, gsood07@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Utilities" ], "description": "ethnicolr: Predict Race and Ethnicity From Name\r\n----------------------------------------------------\r\n\r\n.. image:: https://travis-ci.org/appeler/ethnicolr.svg?branch=master\r\n :target: https://travis-ci.org/appeler/ethnicolr\r\n.. image:: https://ci.appveyor.com/api/projects/status/qfvbu8h99ymtw2ub?svg=true\r\n :target: https://ci.appveyor.com/project/soodoku/ethnicolr\r\n.. image:: https://img.shields.io/pypi/v/ethnicolr.svg\r\n :target: https://pypi.python.org/pypi/ethnicolr\r\n.. image:: https://anaconda.org/soodoku/ethnicolr/badges/version.svg\r\n :target: https://anaconda.org/soodoku/ethnicolr/\r\n.. image:: https://pepy.tech/badge/ethnicolr\r\n :target: https://pepy.tech/project/ethnicolr\r\n\r\nWe exploit the US census data, the Florida voting registration data, and \r\nthe Wikipedia data collected by Skiena and colleagues, to predict race\r\nand ethnicity based on first and last name or just the last name. The granularity \r\nat which we predict the race depends on the dataset. For instance, \r\nSkiena et al.' 
Wikipedia data is at the ethnic group level, while the \r\ncensus data we use in the model (the raw data has additional categories of \r\nNative Americans and Bi-racial) distinguishes only among Non-Hispanic Whites, \r\nNon-Hispanic Blacks, Asians, and Hispanics.\r\n\r\nDIME Race\r\n-----------\r\nData on race of all the people in the `DIME data `__ \r\nis posted `here `__. The underlying Python scripts \r\nare posted `here `__. \r\n\r\nCaveats and Notes\r\n-----------------------\r\n\r\nIf you picked a random individual with last name 'Smith' from the US in 2010 \r\nand asked us to guess this person's race (measured as crudely as by the census),\r\nthe best guess would be based on what is available from the aggregated Census file. \r\nIt is the Bayes Optimal Solution. So what good are last name only predictive models\r\nfor? A few things: imputing ethnicity at a more granular level,\r\nguessing the race of people in different years (than when the census was conducted, \r\nif some assumptions hold), guessing the race of people in different countries (again if some \r\nassumptions hold), handling names that are slightly different (again with some assumptions), etc. \r\nThe biggest benefit comes when both the first name and the last name are known.\r\n\r\nInstall\r\n----------\r\n\r\n::\r\n\r\n pip install ethnicolr\r\n\r\nOr ::\r\n \r\n conda install ethnicolr\r\n\r\nNote: If you are installing on Windows, Theano installation typically needs admin 
privileges on the shell.\r\n\r\nGeneral API\r\n------------------\r\n\r\nTo see the available command line options for any function, please type in \r\n`` --help``\r\n\r\n::\r\n\r\n # census_ln --help\r\n usage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input\r\n\r\n Appends Census columns by last name\r\n\r\n positional arguments:\r\n input Input file\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n -y {2000,2010}, --year {2000,2010}\r\n Year of Census data (default=2000)\r\n -o OUTPUT, --output OUTPUT\r\n Output file with Census data columns\r\n -l LAST, --last LAST Name or index location of column contains the last\r\n name\r\n\r\n\r\nExamples\r\n----------\r\n\r\nTo append census data from 2010 to a `file without column headers `__ whose first column carries the last name, use ``-l 0``:\r\n\r\n::\r\n\r\n census_ln -y 2010 -o output-census2010.csv -l 0 input-without-header.csv\r\n\r\nTo append census data from 2010 to a `file with column header in the first row `__, specify the name of the column carrying last names with the ``-l`` option, keeping the rest the same:\r\n\r\n::\r\n\r\n census_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv \r\n\r\n\r\nTo predict race/ethnicity using the `Wikipedia full name model `__ when the input file doesn't have any column headers, you must use ``-l`` and ``-f`` to specify the index of the columns carrying the last name and first name respectively (the first column has index 0).\r\n\r\n::\r\n\r\n pred_wiki_name -o output-wiki-pred-race.csv -l 0 -f 1 input-without-header.csv\r\n\r\n\r\nTo predict race/ethnicity using the `Wikipedia full name model `__ for a file with column headers, specify the column names of the last name and first name with the ``-l`` and ``-f`` flags respectively.\r\n\r\n::\r\n\r\n pred_wiki_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv\r\n\r\n\r\nFunctions\r\n----------\r\n\r\nWe expose 6 functions, each of which either 
takes a pandas DataFrame or a CSV. If the CSV doesn't have a header,\r\nwe make some assumptions about where the data is.\r\n\r\n- **census\\_ln**\r\n\r\n - Input: pandas DataFrame or CSV and a string or list of the name or\r\n location of the column containing the last name.\r\n\r\n - What it does:\r\n\r\n - Removes extra space.\r\n - For names in the `census file `__, it appends relevant data.\r\n\r\n - Options:\r\n\r\n - year: 2000 or 2010\r\n - if no year is given, data from the 2000 census is appended\r\n\r\n - Output: Appends the following columns to the pandas DataFrame or CSV:\r\n pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic\r\n\r\n- **pred\\_census\\_ln**\r\n\r\n - Input: pandas DataFrame or CSV and a string or list of the name or\r\n location of the column containing the last name.\r\n\r\n - What it does:\r\n\r\n - Removes extra space.\r\n - Uses the `last name census 2000\r\n model `__\r\n or `last name census 2010\r\n model `__\r\n to predict the race and ethnicity.\r\n\r\n - Options:\r\n\r\n - year: 2000 or 2010\r\n\r\n - Output: Appends the following columns to the pandas DataFrame or CSV:\r\n race (white, black, asian, or hispanic), api (percentage chance asian),\r\n black, hispanic, white.\r\n\r\n- **pred\\_wiki\\_ln**\r\n\r\n - Input: pandas DataFrame or CSV and a string or list of the name or\r\n location of the column containing the last name.\r\n\r\n - What it does:\r\n\r\n - Removes extra space.\r\n - Uses the `last name wiki model `__\r\n to predict the race and ethnicity.\r\n\r\n - Output: Appends the following columns to the pandas DataFrame or CSV:\r\n race (categorical variable --- category with the highest probability), \r\n \"Asian,GreaterEastAsian,EastAsian\", \"Asian,GreaterEastAsian,Japanese\", \r\n \"Asian,IndianSubContinent\", \"GreaterAfrican,Africans\", \"GreaterAfrican,Muslim\",\r\n \"GreaterEuropean,British\",\"GreaterEuropean,EastEuropean\", \r\n \"GreaterEuropean,Jewish\",\"GreaterEuropean,WestEuropean,French\",\r\n 
\"GreaterEuropean,WestEuropean,Germanic\",\"GreaterEuropean,WestEuropean,Hispanic\",\r\n \"GreaterEuropean,WestEuropean,Italian\",\"GreaterEuropean,WestEuropean,Nordic\"\r\n\r\n- **pred\\_wiki\\_name**\r\n\r\n - Input: pandas DataFrame or CSV and a string or list containing the name or\r\n location of the columns containing the first name, last name, middle\r\n name, and suffix, if present. The first name and last name columns are\r\n required. If there are no middle name or suffix columns, it is\r\n assumed that there are no middle names or suffixes.\r\n\r\n - What it does:\r\n\r\n - Removes extra space.\r\n - Uses the `full name wiki\r\n model `__ to predict the\r\n race and ethnicity.\r\n\r\n - Output: Appends the following columns to the pandas DataFrame or CSV:\r\n race (categorical variable --- category with the highest probability), \r\n \"Asian,GreaterEastAsian,EastAsian\", \"Asian,GreaterEastAsian,Japanese\", \r\n \"Asian,IndianSubContinent\", \"GreaterAfrican,Africans\", \"GreaterAfrican,Muslim\",\r\n \"GreaterEuropean,British\",\"GreaterEuropean,EastEuropean\", \r\n \"GreaterEuropean,Jewish\",\"GreaterEuropean,WestEuropean,French\",\r\n \"GreaterEuropean,WestEuropean,Germanic\",\"GreaterEuropean,WestEuropean,Hispanic\",\r\n \"GreaterEuropean,WestEuropean,Italian\",\"GreaterEuropean,WestEuropean,Nordic\"\r\n\r\n- **pred\\_fl\\_reg\\_ln**\r\n\r\n - Input: pandas DataFrame or CSV and a string or list of the name or location\r\n of the column containing the last name.\r\n\r\n - What it does:\r\n\r\n - Removes extra space.\r\n - Uses the `last name FL registration\r\n model `__ to predict the race\r\n and ethnicity.\r\n\r\n - Output: Appends the following columns to the pandas DataFrame or CSV:\r\n race (white, black, asian, or hispanic), asian (percentage chance Asian),\r\n hispanic, nh_black, nh_white\r\n\r\n- **pred\\_fl\\_reg\\_name**\r\n\r\n - Input: pandas DataFrame or CSV and a string or list containing the name or\r\n location of the columns containing 
the first name, last name, middle\r\n name, and suffix, if present. The first name and last name columns are\r\n required. If there are no middle name or suffix columns, it is\r\n assumed that there are no middle names or suffixes.\r\n\r\n - What it does:\r\n\r\n - Removes extra space.\r\n - Uses the `full name FL registration\r\n model `__ to predict the\r\n race and ethnicity.\r\n\r\n - Output: Appends the following columns to the pandas DataFrame or CSV:\r\n race (white, black, asian, or hispanic), asian (percentage chance Asian),\r\n hispanic, nh_black, nh_white\r\n\r\nUsing ethnicolr\r\n----------------\r\n\r\n::\r\n\r\n >>> import pandas as pd\r\n\r\n >>> from ethnicolr import census_ln, pred_census_ln\r\n Using TensorFlow backend.\r\n\r\n >>> names = [{'name': 'smith'},\r\n ... {'name': 'zhang'},\r\n ... {'name': 'jackson'}]\r\n\r\n >>> df = pd.DataFrame(names)\r\n\r\n >>> df\r\n name\r\n 0 smith\r\n 1 zhang\r\n 2 jackson\r\n\r\n >>> census_ln(df, 'name')\r\n name pctwhite pctblack pctapi pctaian pct2prace pcthispanic\r\n 0 smith 73.35 22.22 0.40 0.85 1.63 1.56\r\n 1 zhang 0.61 0.09 98.16 0.02 0.96 0.16\r\n 2 jackson 41.93 53.02 0.31 1.04 2.18 1.53\r\n\r\n >>> census_ln(df, 'name', 2010)\r\n name race pctwhite pctblack pctapi pctaian pct2prace pcthispanic\r\n 0 smith white 70.9 23.11 0.5 0.89 2.19 2.4\r\n 1 zhang api 0.99 0.16 98.06 0.02 0.62 0.15\r\n 2 jackson black 39.89 53.04 0.39 1.06 3.12 2.5\r\n\r\n >>> pred_census_ln(df, 'name')\r\n name race api black hispanic white\r\n 0 smith white 0.002019 0.247235 0.014485 0.736260\r\n 1 zhang api 0.997807 0.000149 0.000470 0.001574\r\n 2 jackson black 0.002797 0.528193 0.014605 0.454405\r\n\r\n >>> help(pred_census_ln)\r\n Help on function pred_census_ln in module ethnicolr.pred_census_ln:\r\n\r\n pred_census_ln(df, namecol, year=2000)\r\n Predict the race/ethnicity by the last name using Census model.\r\n\r\n Using the Census last name model to predict the race/ethnicity of the input\r\n DataFrame.\r\n\r\n Args:\r\n df 
(:obj:`DataFrame`): Pandas DataFrame containing the last name\r\n column.\r\n namecol (str or int): Column's name or location of the name in\r\n DataFrame.\r\n year (int): The year of Census model to be used. (2000 or 2010)\r\n (default is 2000)\r\n\r\n Returns:\r\n DataFrame: Pandas DataFrame with additional columns:\r\n - `race` the predict result\r\n - `black`, `api`, `white`, `hispanic` are the prediction\r\n probability.\r\n\r\nApplication\r\n--------------\r\n\r\nTo illustrate how the package can be used, we impute the race of the campaign contributors recorded by FEC for the years 2000 and 2010 and tally campaign contributions by race.\r\n\r\n- `Contrib 2000/2010 using census_ln `__\r\n- `Contrib 2000/2010 using pred_census_ln `__\r\n- `Contrib 2000/2010 using pred_fl_reg_name `__\r\n\r\nData on race of all the people in the `DIME data `__ is posted `here `__ The underlying python scripts are posted `here `__ \r\n\r\nData\r\n----------\r\n\r\nIn particular, we utilize the last-name--race data from the `2000\r\ncensus `__\r\nand `2010\r\ncensus `__,\r\nthe `Wikipedia data `__ collected by Skiena and colleagues,\r\nand the Florida voter registration data from early 2017.\r\n\r\n- `Census `__\r\n- `The Wikipedia dataset `__\r\n- `Florida voter registration database `__\r\n\r\nAuthors\r\n----------\r\n\r\nSuriyan Laohaprapanon and Gaurav Sood\r\n\r\nContributor Code of Conduct\r\n---------------------------------\r\n\r\nThe project welcomes contributions from everyone! In fact, it depends on\r\nit. 
To maintain this welcoming atmosphere, and to collaborate in a fun\r\nand productive way, we expect contributors to the project to abide by\r\nthe `Contributor Code of\r\nConduct `__.\r\n\r\nLicense\r\n----------\r\n\r\nThe package is released under the `MIT\r\nLicense `__.\r\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/appeler/ethnicolr", "keywords": "race ethnicity names", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "ethnicolr", "package_url": "https://pypi.org/project/ethnicolr/", "platform": "", "project_url": "https://pypi.org/project/ethnicolr/", "project_urls": { "Homepage": "https://github.com/appeler/ethnicolr" }, "release_url": "https://pypi.org/project/ethnicolr/0.2.4/", "requires_dist": null, "requires_python": "", "summary": "Predict Race/Ethnicity Based on Name", "version": "0.2.4" }, "last_serial": 5865051, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "b0e6704274220003d349b65efd66827c", "sha256": "15002c6fc4e8ffbebe6f1de0e654e74305f4a01f597befa5951e7bb7f32dddff" }, "downloads": -1, "filename": "ethnicolr-0.1.2.tar.gz", "has_sig": false, "md5_digest": "b0e6704274220003d349b65efd66827c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17146791, "upload_time": "2017-05-29T04:58:47", "url": "https://files.pythonhosted.org/packages/64/34/26492694a2dab13c4fab12c12daabffff14c64b2bb5a080c3c9fbd7b11fa/ethnicolr-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "0c678deb2f722eec01ce5aa8b370f73e", "sha256": "7c000de0502b162076cec8c0c29b7cccd1f2af28e0e5677c6a8a8569d8725a85" }, "downloads": -1, "filename": "ethnicolr-0.1.3.tar.gz", "has_sig": false, "md5_digest": "0c678deb2f722eec01ce5aa8b370f73e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17146838, "upload_time": "2017-09-17T11:03:22", 
"url": "https://files.pythonhosted.org/packages/f8/89/0896f10f77a37be9b1b5e8ee2cc111c9977aa5d6f6573f2898e07dca975b/ethnicolr-0.1.3.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "b96b89065b7301cbee9161a07d2cb626", "sha256": "35da754a3fa323322c1e27243b4e9b92a5552df21c779fd5b8b45c4ab42ecec9" }, "downloads": -1, "filename": "ethnicolr-0.1.5.tar.gz", "has_sig": false, "md5_digest": "b96b89065b7301cbee9161a07d2cb626", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17155755, "upload_time": "2018-02-07T14:57:58", "url": "https://files.pythonhosted.org/packages/3d/36/f2c938835c821d3ea7d1c0fc7e50341f5784e133eb14b8a010e825d31093/ethnicolr-0.1.5.tar.gz" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "554ec1d0b0a2fd26e54d6d69faeef85a", "sha256": "1372dfc726f389bcb2f3fa5cd00f72cd4a1d115b9e2436982e8a411cbf58b183" }, "downloads": -1, "filename": "ethnicolr-0.1.7.tar.gz", "has_sig": false, "md5_digest": "554ec1d0b0a2fd26e54d6d69faeef85a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16944791, "upload_time": "2018-05-07T14:18:50", "url": "https://files.pythonhosted.org/packages/01/21/9f975639aa45c527cdca801e14d234f715c507d0559f12f144b87813e31e/ethnicolr-0.1.7.tar.gz" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "fd2dab3b638acaba85623af6b5998088", "sha256": "3f932acf5b62f8e1684509d58c411360cb60023cd539eaf251304e0fcd38c996" }, "downloads": -1, "filename": "ethnicolr-0.1.8.tar.gz", "has_sig": false, "md5_digest": "fd2dab3b638acaba85623af6b5998088", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25296448, "upload_time": "2018-11-02T08:19:56", "url": "https://files.pythonhosted.org/packages/b4/3d/c5aa889a6d1938f09138b87297c255505226ee48d7930d3ea5cf4368ccb0/ethnicolr-0.1.8.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "a8e712579e23cc01c400b37738f7cbe5", "sha256": 
"c62ae6c0e168fa07da3ff959a98cf507a28cfcc3b2b03ff759350aa365030e7a" }, "downloads": -1, "filename": "ethnicolr-0.2.0.tar.gz", "has_sig": false, "md5_digest": "a8e712579e23cc01c400b37738f7cbe5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25296695, "upload_time": "2018-11-09T14:59:06", "url": "https://files.pythonhosted.org/packages/97/ec/ed4d5930059c77c0bab4d101580691d137b5b891020ae5a2931c89983a19/ethnicolr-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "0bc1f64163b5efca9232464468345bfa", "sha256": "17ab4e287476d37705c9b8b072cfb2cc40025eb340230772981fcebb042a9563" }, "downloads": -1, "filename": "ethnicolr-0.2.1.tar.gz", "has_sig": false, "md5_digest": "0bc1f64163b5efca9232464468345bfa", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25296563, "upload_time": "2019-06-26T13:17:40", "url": "https://files.pythonhosted.org/packages/0a/92/547262319b2695300275f063b1d3a189263b5f9679544b0d89f5beaa62a2/ethnicolr-0.2.1.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "fbdac820d5d2b728fd17ec5fd424c389", "sha256": "a3e0802f05d50ee69712ac6a49be5c9b5fdeeced9df4dcb753ea3b27c8c4660e" }, "downloads": -1, "filename": "ethnicolr-0.2.4.tar.gz", "has_sig": false, "md5_digest": "fbdac820d5d2b728fd17ec5fd424c389", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25296572, "upload_time": "2019-09-21T03:53:59", "url": "https://files.pythonhosted.org/packages/f4/0c/e91e33a93c97a216e0ee9c7b795d26f3718730301b40adffd018a979ae50/ethnicolr-0.2.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fbdac820d5d2b728fd17ec5fd424c389", "sha256": "a3e0802f05d50ee69712ac6a49be5c9b5fdeeced9df4dcb753ea3b27c8c4660e" }, "downloads": -1, "filename": "ethnicolr-0.2.4.tar.gz", "has_sig": false, "md5_digest": "fbdac820d5d2b728fd17ec5fd424c389", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25296572, 
"upload_time": "2019-09-21T03:53:59", "url": "https://files.pythonhosted.org/packages/f4/0c/e91e33a93c97a216e0ee9c7b795d26f3718730301b40adffd018a979ae50/ethnicolr-0.2.4.tar.gz" } ] }