{
"info": {
"author": "bruno cuconato",
"author_email": "bcclaro+corpushash@gmail.com",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 3 - Alpha",
"Intended Audience :: Developers",
"License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Topic :: Software Development :: Build Tools"
],
"description": "##########\r\ncorpushash\r\n##########\r\n\r\nThe ``corpushash`` library enables the performance of common NLP tasks on\r\nsensitive documents without disclosing their contents. This is done by\r\nhashing every token in the corpus along with a salt.\r\n\r\nprotocol\r\n========\r\n\r\n1. the client --- in her own secure environment --- takes her classified \r\ndocuments and does the linguistic pre-processing (removal of stopwords, \r\ntokenization, etc.)\r\n\r\n2. the client then hashes the tokens: she creates random salts for every \r\ndifferent token and hashes the concatenation of token+salt. the client keeps a \r\ndictionary of the salts used for every token.\r\n\r\n3. the client sends the hashed tokens to the analyst for linguistic processing. \r\nas there is a biunivocal correspondence between every hashed token and \r\nplaintext token, NLP can occur in the same way as with any plaintext corpus.\r\n\r\n4. once the NLP is over, the analyst sends the results to the client, who then \r\nuses the dictionary to recover the plaintext tokens and thus the results' \r\nmeaning.\r\n\r\n\r\nthe library\r\n===========\r\n\r\nThe library requires as input:\r\n\r\n- a tokenized ``corpus`` as a nested list, whose elements are themselves\r\n nested lists of the tokens of each document in the corpus.\r\n\r\n each list corresponds to a document structure: its chapters,\r\n paragraphs, sentences. you decide how the nested list is to be\r\n created or structured, as long as the input is a nested list with\r\n strings as their bottom-most elements..\r\n\r\n- ``corpus_path``, a path to a directory where the output files are to\r\n be stored.\r\n\r\nThe output includes:\r\n\r\n- a ``.json`` file for every document in the ``corpus``, named sequentially as \r\n positive integers, e.g., the first document being ``0.json``, stored in \r\n ``corpus_path/public/$(timestamp-of-hash)/``.\r\n\r\n- two ``.json`` dictionaries stored in ``corpus_path/private``. they are\r\n used to decode the ``.json`` files or the NLP results.\r\n\r\ninstall\r\n=======\r\n\r\nthe package demands python \u2a7e3, but has no external dependencies.\r\n\r\nusing pip:\r\n\r\n.. code-block:: shell\r\n\r\n pip3 install corpushash\r\n\r\nor from repository (most up-to-date):\r\n\r\n.. code-block:: bash\r\n\r\n pip3 install git+https://github.com/NAMD/corpushash.git\r\n\r\nor manually:\r\n\r\n.. code-block:: shell\r\n\r\n git clone https://github.com/NAMD/corpushash.git\r\n cd corpushash\r\n python3 setup.py install\r\n\r\nusage\r\n=====\r\n\r\nthis will hash each word in the first four verses of the `zen of python \r\n`_ to the same .json document:\r\n\r\n.. code-block:: python\r\n\r\n import corpushash as ch\r\n example_corpus = [[\r\n ['Beautiful', 'is', 'better', 'than', 'ugly'],\r\n ['Explicit', 'is', 'better', 'than', 'implicit'],\r\n ['Simple', 'is', 'better', 'than', 'complex'],\r\n ['Complex', 'is', 'better', 'than', 'complicated'],\r\n ['Flat', 'is', 'better', 'than', 'nested']\r\n ]]\r\n hashed_corpus = ch.CorpusHash(example_corpus, 'output_directory')\r\n\r\nthis will hash each word in the first four verses of the zen of python to **four** \r\ndifferent .json documents, as if they were different documents:\r\n\r\n.. code-block:: python\r\n\r\n import corpushash as ch\r\n example_corpus = [\r\n ['Beautiful', 'is', 'better', 'than', 'ugly'],\r\n ['Explicit', 'is', 'better', 'than', 'implicit'],\r\n ['Simple', 'is', 'better', 'than', 'complex'],\r\n ['Complex', 'is', 'better', 'than', 'complicated'],\r\n ['Flat', 'is', 'better', 'than', 'nested']\r\n ]\r\n hashed_corpus = ch.CorpusHash(example_corpus, 'output_directory')\r\n\r\nso be careful when constructing your nested lists! check the tutorial at \r\n``notebooks/tutorial.ipynb``.\r\n\r\nnotes\r\n=====\r\n\r\n- probability of collision is extremely low (check the `preprint `_), but \r\n still we check for them, so they are not an issue.\r\n\r\n- hashing the tokens is not the same as encrypting them. as the same token \r\n always maps to the same hash, the resulting hashed corpus is subject to \r\n frequency analysis. even if a pre-processed text is almost uncomprehensible to \r\n a reader (specially if stopwords are removed), there probably is still a \r\n degree of trust in the analyst. she is usually someone who has no incentive \r\n to attempt a decipherment of the text or someone who has a lesser (but by no \r\n means inexisting) security clearance. this vulnerability will be investigated \r\n in the future.\r\n\r\n- memory complexity is estimated to be at most double the size of the biggest \r\n document in the corpus.\r\n\r\ncredits\r\n=======\r\n\r\n@odanoburu & @fccoelho\r\n\r\nlicense\r\n=======\r\n\r\nLGPL 3, check the ``LICENSE.md`` file for full content.",
"description_content_type": null,
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/NAMD/corpushash",
"keywords": "python 3,nlp,hash,hashing",
"license": "",
"maintainer": "",
"maintainer_email": "",
"name": "corpushash",
"package_url": "https://pypi.org/project/corpushash/",
"platform": "",
"project_url": "https://pypi.org/project/corpushash/",
"project_urls": {
"Homepage": "https://github.com/NAMD/corpushash"
},
"release_url": "https://pypi.org/project/corpushash/0.1.0/",
"requires_dist": null,
"requires_python": "",
"summary": "Cryptographic hasher of text document corpora",
"version": "0.1.0"
},
"last_serial": 2901250,
"releases": {
"0.1.0": [
{
"comment_text": "",
"digests": {
"md5": "a23f267cb01970ce9af0e0ce85a51eaa",
"sha256": "a4e2f8c303b6bf351eb9bdfe3ae9294dea32a02199948a1a6cf80f280305ed47"
},
"downloads": -1,
"filename": "corpushash-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "a23f267cb01970ce9af0e0ce85a51eaa",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1327093,
"upload_time": "2017-05-24T16:50:04",
"url": "https://files.pythonhosted.org/packages/bf/24/b9d8a9f156ffac5d37254a99f6398c3df670b0f530ad8fd539fec486d7f6/corpushash-0.1.0.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "a23f267cb01970ce9af0e0ce85a51eaa",
"sha256": "a4e2f8c303b6bf351eb9bdfe3ae9294dea32a02199948a1a6cf80f280305ed47"
},
"downloads": -1,
"filename": "corpushash-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "a23f267cb01970ce9af0e0ce85a51eaa",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1327093,
"upload_time": "2017-05-24T16:50:04",
"url": "https://files.pythonhosted.org/packages/bf/24/b9d8a9f156ffac5d37254a99f6398c3df670b0f530ad8fd539fec486d7f6/corpushash-0.1.0.tar.gz"
}
]
}