{ "info": { "author": "Prompsit Language Engineering", "author_email": "info@prompsit.com", "bugtrack_url": null, "classifiers": [ "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Filters", "Topic :: Text Processing :: Linguistic" ], "description": "# binonymizer\n\nBinonymizer is a tool in Python that aims at tagging personal data1 in a parallel corpus.\n\nFor example, for an input like:\n\n```\nURL1 URL2 My name is Marta and my email is fake@email.com Mi nombre es Marta y mi email es fake@email.com\n```\n\nBinonymizer's output will be:\n\n```\nURL1 URL2 My name is Marta and my email is fake@email.com Mi nombre es Marta y mi email es fake@email.com\n```\n## Detectable entity types\n\nCurrently, Binonymizer is able to detect and tag the following types of entities:\n\n* PER: person names\n* ORG: organization and company names\n* EMAIL: email addresses\n* PHONE: phone numbers\n* ADDRESS: addresses\n* ID: personal card IDs (such as Spanish DNIs)\n* MISC: other personal data, or when the type is uncertain\n* OTHER: other\n\n## Installation & Requirements\n\nBinonymizer works with Python 3.6, and can be installed with `pip`:\n\n```\npython3.6 -m pip install binonymizer\n```\n\nAfter installation, two binary files (`binonymizer` and `binonymizer-lite`) will be located in your `python/installation/prefix`/bin directory.\n\nLanguage-dependent packages and models are automatically downloaded and installed at runtime, if needed.\n\n### Extra instructions for Basque\n\nIf you plan to binonymize Basque data, you need to download `binonymizer` from [GitHub](http://github.com/bitextor/binonymizer), and run the following steps:\n\n```bash\ncd 
binonymizer\ngit submodule sync\ngit submodule update --init --recursive --remote\ncd prompsit_python_bindings\npython3.6 setup.py install\n```\nPlease note that you need to have access to Prompsit's private repository. [Contact us](mailto:help@prompsit.com) if you need further details.\n\n## Usage\n\nBinonymizer can be run with:\n\n```bash\nbinonymizer [-h] --format {tmx,cols} [--tmp_dir TMP_DIR]\n [-b BLOCK_SIZE] [-p PROCESSES] [-q] [--debug]\n [--logfile LOGFILE] [-v]\n input [output] srclang trglang\n```\n\n\n### Parameters\n* positional arguments:\n * input: File to be anonymized (see format below)\n * output: File with anonymization annotations (default: standard output)\n * srclang: Source language code of the input\n * trglang: Target language code of the input\n* optional arguments:\n * -h, --help: show this help message and exit\n* Mandatory:\n * --format {tmx,cols}: Input file format. Values: cols, tmx (\"cols\" format: URL1 URL2 SOURCE_SENTENCE TARGET_SENTENCE [extra columns] tab-separated)\n* Optional:\n * --tmp_dir TMP_DIR: Directory in which this program creates its temporary files (default: the system temp dir, defined by the environment variable TMPDIR in Unix)\n * -b BLOCK_SIZE, --block_size BLOCK_SIZE: Sentence pairs per block (default: 10000)\n * -p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)\n* Logging:\n * -q, --quiet: Silent logging mode (default: False)\n * --debug: Debug logging mode (default: False)\n * --logfile LOGFILE: Store log to a file (default: standard error output)\n * -v, --version: show version of this script and exit\n\n### Example\n```bash\nbinonymizer corpus.en-es.raw corpus.en-es.anon en es --format cols --tmp_dir /tmpdir -b50000 -p31 \n```\nThis will read the corpus \"corpus.en-es.raw\", which is in a column-based format, extracting personal data and writing the tagged output to \"corpus.en-es.anon\". 
Binonymizer will run in blocks of 50000 sentences, using 31 cores, and writing temporary files in /tmpdir.\n\n\n## Lite version\n\nAlthough `binonymizer` makes use of parallelization by distributing workload to the available cores, some users might prefer to implement their own parallelization strategies. For that reason, a single-threaded version of the script is provided: `binonymizer_lite`. The usage is exactly the same as for the full version, but omitting the block size (-b) and processes (-p) parameters.\n\n\n## TO DO\n* Fully support TMX input/output\n* Address recognition\n* GPU support\n* Automate Prompsit-python-bindings submodule (git submodule update --remote, python3.6 setup.py install)\n\n\n\n1: See EC definition of \"personal information\": https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/bitextor/binonymizer", "keywords": "", "license": "GNU General Public License v3.0", "maintainer": "Marta Ba\u00f1\u00f3n", "maintainer_email": "mbanon@prompsit.com", "name": "binonymizer", "package_url": "https://pypi.org/project/binonymizer/", "platform": "", "project_url": "https://pypi.org/project/binonymizer/", "project_urls": { "Binonymizer on GitHub": "https://github.com/bitextor/binonymizer", "Homepage": "https://github.com/bitextor/binonymizer", "Paracrawl": "https://paracrawl.eu/", "Prompsit Language Engineering": "http://www.prompsit.com" }, "release_url": "https://pypi.org/project/binonymizer/0.1.1/", "requires_dist": [ "certifi (==2018.11.29)", "chardet (==3.0.4)", "cymem (==2.0.2)", "cytoolz (==0.9.0.1)", "dill (==0.2.9)", "idna (==2.8)", "jpype1", "msgpack (==0.5.6)", "msgpack-numpy (==0.4.3.2)", "murmurhash (==1.0.1)", "numpy (==1.16.1)", "plac (==0.9.6)", "preshed (==2.0.1)", "regex (==2018.1.10)", "requests (==2.21.0)", "semver 
(==2.8.1)", "six (==1.12.0)", "spacy (==2.0.18)", "thinc (==6.12.1)", "toolz (==0.9.0)", "tqdm (==4.31.1)", "ujson (==1.35)", "urllib3 (==1.24.1)", "wrapt (==1.10.11)" ], "requires_python": "", "summary": "Binonymizer is a tool in Python that aims at tagging personal data in a parallel corpus.", "version": "0.1.1" }, "last_serial": 4874177, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "0813dad3dca4a89417db8e88b8e16e5d", "sha256": "64245fdb79c3a64e8441a683f200a2678606b1b319c39719363c43cdf65e33f0" }, "downloads": -1, "filename": "binonymizer-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "0813dad3dca4a89417db8e88b8e16e5d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 33495, "upload_time": "2019-02-27T11:53:31", "url": "https://files.pythonhosted.org/packages/6c/c7/5ce95de6af4e872c1eb44fb7882661d76d31c02b22d1de921fce587c2e9c/binonymizer-0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "df20034ded2df41f0ab21e937a43bd85", "sha256": "9598dba282d18cd7d6bd5fdf34e687cca3e28c6493c23b25bc81287f03dfd40a" }, "downloads": -1, "filename": "binonymizer-0.1.tar.gz", "has_sig": false, "md5_digest": "df20034ded2df41f0ab21e937a43bd85", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14746, "upload_time": "2019-02-27T11:53:34", "url": "https://files.pythonhosted.org/packages/de/c5/27af046286fbc286675bc4a421ee60e9a715b9ce0fbffc39afed2a059b25/binonymizer-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "6d9411a6444e307518ace451a41c3d65", "sha256": "5f6e3cb226daae912843ca82a19860b24b633dc0dbebb67ae4af4717816e4658" }, "downloads": -1, "filename": "binonymizer-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "6d9411a6444e307518ace451a41c3d65", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 33745, "upload_time": "2019-02-27T13:37:52", "url": 
"https://files.pythonhosted.org/packages/43/c2/22b06d02e1187ce12716b5b339ad23606abf15e85fde9c5a706ba96381d2/binonymizer-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9a9f1ec55072fbaf5dabc84dd0017fe8", "sha256": "8c72e8c1191564ea98bd65e41f86896d278929a3a024b6a84f21dcde12b2a76f" }, "downloads": -1, "filename": "binonymizer-0.1.1.tar.gz", "has_sig": false, "md5_digest": "9a9f1ec55072fbaf5dabc84dd0017fe8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15008, "upload_time": "2019-02-27T13:37:53", "url": "https://files.pythonhosted.org/packages/50/65/bf7e08f216262b6ea4fcbdcf3b873c38e35f53c35c140f0331b880291cd2/binonymizer-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "6d9411a6444e307518ace451a41c3d65", "sha256": "5f6e3cb226daae912843ca82a19860b24b633dc0dbebb67ae4af4717816e4658" }, "downloads": -1, "filename": "binonymizer-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "6d9411a6444e307518ace451a41c3d65", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 33745, "upload_time": "2019-02-27T13:37:52", "url": "https://files.pythonhosted.org/packages/43/c2/22b06d02e1187ce12716b5b339ad23606abf15e85fde9c5a706ba96381d2/binonymizer-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9a9f1ec55072fbaf5dabc84dd0017fe8", "sha256": "8c72e8c1191564ea98bd65e41f86896d278929a3a024b6a84f21dcde12b2a76f" }, "downloads": -1, "filename": "binonymizer-0.1.1.tar.gz", "has_sig": false, "md5_digest": "9a9f1ec55072fbaf5dabc84dd0017fe8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15008, "upload_time": "2019-02-27T13:37:53", "url": "https://files.pythonhosted.org/packages/50/65/bf7e08f216262b6ea4fcbdcf3b873c38e35f53c35c140f0331b880291cd2/binonymizer-0.1.1.tar.gz" } ] }