{
"info": {
"author": "Prompsit Language Engineering",
"author_email": "info@prompsit.com",
"bugtrack_url": null,
"classifiers": [
"Environment :: Console",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: POSIX :: Linux",
"Programming Language :: Python :: 3.6",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Topic :: Software Development :: Libraries :: Python Modules",
"Topic :: Text Processing :: Filters",
"Topic :: Text Processing :: Linguistic"
],
"description": "# binonymizer\n\nBinonymizer is a tool in Python that aims at tagging personal data1 in a parallel corpus.\n\nFor example, for a input like:\n\n```\nURL1 URL2 My name is Marta and my email is fake@email.com Mi nombre es Marta y mi email es fake@email.com\n```\n\nBinonymizer's output will be:\n\n```\nURL1 URL2 My name is Marta and my email is fake@email.com Mi nombre es Marta y mi email es fake@email.com\n```\n## Detectable entity tipes\n\nCurrently, the Binonymizer is able to detect and tag the following types of entities:\n\n* PER: person names\n* ORG: organism and company names\n* EMAIL: email addresses\n* PHONE: phone numbers\n* ADDRESS: addresses\n* ID: personal card IDs (such as spanish DNIs)\n* MISC: other personal data, or when the type it's uncertain \n* OTHER: other\n\n## Installation & Requirements\n\nBinonymizer works with Python 3.6, and can be installed with `pip`:\n\n```\npython3.6 -m pip install binonymizer\n```\n\nAfter installation, two binary files (`binonymizer` and `binonymizer-lite`) will be located in your `python/installation/prefix`/bin directory.\n\nLanguage-dependant packages and models are automatically downloaded and installed on runtime, if needed.\n\n### Extra instructions for basque\n\nIn case you plan to binonymize basque data, you need to download `binonymizer` from [github](http://github.com/bitextor/binonymizer), and run the following steps:\n\n```bash\ncd binonymizer\ngit submodule sync\ngit submodule update --init --recursive --remote\ncd prompsit_python_bindings\npython3.6 setup.py install\n```\nPlease note that you need to have access to Prompsit's private repository. [Contact us](mailto:help@prompsit.com) if you need further details.\n\n## Usage\n\nBinonymizer can be run with:\n\n```bash\nbinonymizer [-h] --format {tmx,cols} [--tmp_dir TMP_DIR]\n [-b BLOCK_SIZE] [-p PROCESSES] [-q] [--debug]\n [--logfile LOGFILE] [-v]\n input [output] srclang trglang\n```\n\n\n### Parameters\n* positional arguments:\n * input: File to be anonymized (See format below)\n * output: File with anonymization annotations (default: standard output)\n * srclang: Source language code of the input\n * trglang: Target language code of the input\n* optional arguments:\n * -h, --help: show this help message and exit\n* Mandatory:\n * --format {tmx,cols}: Input file format. Values: cols, tmx (\"cols\" format: URL1 URL2 SOURCE_SENTENCE TARGET_SENTENCE [extra columns] tab-separated)\n* Optional:\n * --tmp_dir TMP_DIR: Temporary directory where creating the temporary files of this program (default: default system temp dir, defined by the environment variable TMPDIR in Unix)\n * -b BLOCK_SIZE, --block_size BLOCK_SIZE: Sentence pairs per block (default: 10000)\n * -p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)\n* Logging:\n * -q, --quiet: Silent logging mode (default: False)\n * --debug: Debug logging mode (default: False)\n * --logfile LOGFILE: Store log to a file (default: standard error output)\n * -v, --version: show version of this script and exit\n\n### Example\n```bash\nbinonymizer corpus.en-es.raw corpus.en-es.anon en es --format cols --tmp_dir /tmpdir -b50000 -p31 \n```\nThis will read the corpus \"corpus.en-es.raw\", which is in a column-based format, extracting personal data and writing the tagged output in \"corpus.en-es.anon\". Binonymizer will run in blocks of 50000 sentences, using 31 cores, and writing temporary files in /tmpdir\n\n\n## Lite version\n\nAlthough `binonymizer` makes use of parallelization by distributing workload to the available cores, some users might prefer to implement their own parallelization strategies. For that reason, a single-thread version of the script is provided: `binonymizer_lite`. The usage is exactly the same as for the full version, but omitting the blocksize (-b) and processes (-p) parameter.\n\n\n## TO DO\n* Fully support TMX input/output\n* Address recognition\n* GPU support\n* Automate Prompsit-python-bindings submodule ( git submodule update --remote , python3.6 setup.py install)\n\n\n\n1: See EC definition of \"personal information\": https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en\n\n\n",
"description_content_type": "text/markdown",
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/bitextor/binonymizer",
"keywords": "",
"license": "GNU General Public License v3.0",
"maintainer": "Marta Ba\u00f1\u00f3n",
"maintainer_email": "mbanon@prompsit.com",
"name": "binonymizer",
"package_url": "https://pypi.org/project/binonymizer/",
"platform": "",
"project_url": "https://pypi.org/project/binonymizer/",
"project_urls": {
"Binonymizer on GitHub": "https://github.com/bitextor/binonymizer",
"Homepage": "https://github.com/bitextor/binonymizer",
"Paracrawl": "https://paracrawl.eu/",
"Prompsit Language Engineering": "http://www.prompsit.com"
},
"release_url": "https://pypi.org/project/binonymizer/0.1.1/",
"requires_dist": [
"certifi (==2018.11.29)",
"chardet (==3.0.4)",
"cymem (==2.0.2)",
"cytoolz (==0.9.0.1)",
"dill (==0.2.9)",
"idna (==2.8)",
"jpype1",
"msgpack (==0.5.6)",
"msgpack-numpy (==0.4.3.2)",
"murmurhash (==1.0.1)",
"numpy (==1.16.1)",
"plac (==0.9.6)",
"preshed (==2.0.1)",
"regex (==2018.1.10)",
"requests (==2.21.0)",
"semver (==2.8.1)",
"six (==1.12.0)",
"spacy (==2.0.18)",
"thinc (==6.12.1)",
"toolz (==0.9.0)",
"tqdm (==4.31.1)",
"ujson (==1.35)",
"urllib3 (==1.24.1)",
"wrapt (==1.10.11)"
],
"requires_python": "",
"summary": "Binonymizer is a tool in Python that aims at tagging personal data in a parallel corpus.",
"version": "0.1.1"
},
"last_serial": 4874177,
"releases": {
"0.1": [
{
"comment_text": "",
"digests": {
"md5": "0813dad3dca4a89417db8e88b8e16e5d",
"sha256": "64245fdb79c3a64e8441a683f200a2678606b1b319c39719363c43cdf65e33f0"
},
"downloads": -1,
"filename": "binonymizer-0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0813dad3dca4a89417db8e88b8e16e5d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 33495,
"upload_time": "2019-02-27T11:53:31",
"url": "https://files.pythonhosted.org/packages/6c/c7/5ce95de6af4e872c1eb44fb7882661d76d31c02b22d1de921fce587c2e9c/binonymizer-0.1-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "df20034ded2df41f0ab21e937a43bd85",
"sha256": "9598dba282d18cd7d6bd5fdf34e687cca3e28c6493c23b25bc81287f03dfd40a"
},
"downloads": -1,
"filename": "binonymizer-0.1.tar.gz",
"has_sig": false,
"md5_digest": "df20034ded2df41f0ab21e937a43bd85",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 14746,
"upload_time": "2019-02-27T11:53:34",
"url": "https://files.pythonhosted.org/packages/de/c5/27af046286fbc286675bc4a421ee60e9a715b9ce0fbffc39afed2a059b25/binonymizer-0.1.tar.gz"
}
],
"0.1.1": [
{
"comment_text": "",
"digests": {
"md5": "6d9411a6444e307518ace451a41c3d65",
"sha256": "5f6e3cb226daae912843ca82a19860b24b633dc0dbebb67ae4af4717816e4658"
},
"downloads": -1,
"filename": "binonymizer-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6d9411a6444e307518ace451a41c3d65",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 33745,
"upload_time": "2019-02-27T13:37:52",
"url": "https://files.pythonhosted.org/packages/43/c2/22b06d02e1187ce12716b5b339ad23606abf15e85fde9c5a706ba96381d2/binonymizer-0.1.1-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "9a9f1ec55072fbaf5dabc84dd0017fe8",
"sha256": "8c72e8c1191564ea98bd65e41f86896d278929a3a024b6a84f21dcde12b2a76f"
},
"downloads": -1,
"filename": "binonymizer-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "9a9f1ec55072fbaf5dabc84dd0017fe8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 15008,
"upload_time": "2019-02-27T13:37:53",
"url": "https://files.pythonhosted.org/packages/50/65/bf7e08f216262b6ea4fcbdcf3b873c38e35f53c35c140f0331b880291cd2/binonymizer-0.1.1.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "6d9411a6444e307518ace451a41c3d65",
"sha256": "5f6e3cb226daae912843ca82a19860b24b633dc0dbebb67ae4af4717816e4658"
},
"downloads": -1,
"filename": "binonymizer-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6d9411a6444e307518ace451a41c3d65",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 33745,
"upload_time": "2019-02-27T13:37:52",
"url": "https://files.pythonhosted.org/packages/43/c2/22b06d02e1187ce12716b5b339ad23606abf15e85fde9c5a706ba96381d2/binonymizer-0.1.1-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "9a9f1ec55072fbaf5dabc84dd0017fe8",
"sha256": "8c72e8c1191564ea98bd65e41f86896d278929a3a024b6a84f21dcde12b2a76f"
},
"downloads": -1,
"filename": "binonymizer-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "9a9f1ec55072fbaf5dabc84dd0017fe8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 15008,
"upload_time": "2019-02-27T13:37:53",
"url": "https://files.pythonhosted.org/packages/50/65/bf7e08f216262b6ea4fcbdcf3b873c38e35f53c35c140f0331b880291cd2/binonymizer-0.1.1.tar.gz"
}
]
}