{ "info": { "author": "Edward Lu", "author_email": "maxminicherrycc@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# piidetect\nA package to build an end-to-end ML pipeline to detect personally identifiable information (PII) from text. This \npackage is still in early stage development. More documentations and tests are coming soon. \n\n# installation\n```\npip install piidetect\n```\n\n## Create fake PII\nfakepii.py is the module to create random text mixed with different types of PII.\n\n### Use in Python\n\nCreating fake text in Python\n```\nfrom piidetect.fakepii import Fake_PII\nfake_ = Fake_PII()\nfake_.create_fake_profile(10)\ntrain_labels, train_text, train_PII = fake_.create_pii_text_train(n_text = 5)\n```\n\nThis package also has some helper functions to create fake pii with text and dump it to disk. \n\n```\nfrom piidetect.fakepii import Fake_PII, write_to_disk_train, write_to_disk_test\n\nwrite_to_disk_train(10)\nwrite_to_disk_test(20)\n```\nThe file name for training data will be \"train_text_with_pii_\" + convert_datetime_underscore(datetime.now()) + \".csv\"\nThe file name for testing data will be \"test_text_with_pii_\" + convert_datetime_underscore(datetime.now()) + \".csv\"\n\nThe dumped data will contain three columns: \"Text\", \"Labels\", \"PII\".\nThe Text column contains the text mixed with PII.\nThe Labels column contains the PII type of the text. If there is no PII in the text, then it is \"None\".\nThe PII column contains the True PII. \n\n### Command line usage\nYou can just download the fakePII.py to your local directory to use with command line. \nHere are some examples for command line usage.\n```\n# creating 1000 training data and 100 testing data. \npython fakePII.py -train 1000 -test 100\n# creating 100 testing data\npython fakePII.py -test 100\n# create 1000 training data\npython fakePII.py -train 1000 \n```\n\nIn the training text, a normal text is repeated used to insert different PIIs into\nit. In the testing text, a normal text is not intentionally repeated to insert different PIIs. \n\n\n## Word embedding training\nThis package wraps the word embedding algorithm **word2vec, doc2vec and fasttext** for detecting PII. \n\nThis word_embedding will allow continued training on the pre-trained model by assigning\nthe model to the **pre_trained** option in class initialization. \n\nAfter training the model, it will dump the word2vec model to the path assigned to \n**dump_file** option (can not dump to a path if the directory does not exist)\n\nIf the **pre_train** is None, then the model will be trained. \n\nIf the **pre_train** model is not None, then the default is to continue training on the new model\nunless option **continue_train_pre_train** is specified as False. The False option will just assign \nthe pre_train model to be the model without training on the text. \n\nIf **re_train_new_sentences** is True, which is the default setting, the model will be re-trained on the new sentences. \nThis will create word embedding for words not in the original vocabulary.\nThis will increase the model inference time since it invovles model training. \n\nFor using word2vec to predict PII data, it is recommended to update the model with new sentences. \nFor fastttext, it is not necessary since it will infer from the character n-grams. The fasttext training\nis much longer than word2vec. \n\n**size**: vector dimension for word. Must be the same as the pre_train model is that is specified.\n\n**min_count**: Ignores all words with total frequency lower than this. Use 1 for PII detection.\n\n**workers**: number of CPU cores for training\n\n\n```\nfrom piidetect.pipeline import word_embedding\nmodel = word_embedding(algo_name = \"word2vec\",size = 100, min_count = 1, workers =2)\nmodel.fit(data['Text'])\n```\n## How to use piidetect to build a pipeline for PII detection. \nBefore you start to train an end-to-end PII detector, you need to create binary labels \nfor ML models.\n```\nfrom piidetect.pipeline import binary_pii\ndata['Target'] = data['Labels'].apply(binary_pii)\n```\n\nThis is an example in building an end-to-end PII detection with logistic regression. \n```\nfrom piidetect.pipeline import word_embedding, text_clean\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.linear_model import LogisticRegression\n\nlogit_clf_word2vec = LogisticRegression(solver = \"lbfgs\", max_iter = 10000)\n\nword2vec_pipe = Pipeline([('text_cleaning', text_clean()),\n (\"word_embedding\", word_embedding(algo_name = \"word2vec\", workers =2)),\n (\"logit_clf_word2vec\",logit_clf_word2vec)\n ])\n\nword2vec_pipe.fit(data[\"Text\"],data['Target'] )\n```\nYou can also use RandomizedSearchCV to hyperparameter selection. (This is going to run for a long time.)\n```\nfrom sklearn.model_selection import RandomizedSearchCV\nfrom piidetect.pipeline import word_embedding, text_clean\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.linear_model import LogisticRegression\n\n\nlogit_clf_word2vec = LogisticRegression(solver = \"lbfgs\", max_iter = 10000)\n\npipe = Pipeline([('text_cleaning', text_clean()),\n (\"word_embedding\", word_embedding( workers =2)),\n (\"logit_clf_word2vec\",logit_clf_word2vec)\n ])\n\n\nparam_grid = {\n 'word_embedding__algo_name':['word2vec', 'doc2vec','fasttext'],\n 'word_embedding__size':[100,200,300], \n 'logit_clf_word2vec__C': uniform(0,10),\n 'logit_clf_word2vec__class_weight':[{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, {0: 0.7, 1: 0.3},None]\n}\n\npipe_cv = RandomizedSearchCV(estimator = pipe,param_distributions = param_grid,\\\n cv =10, error_score = 0,n_iter = 10 , scoring = 'f1'\\\n ,return_train_score=True, n_jobs = 1)\n```\n\n\nYou can dump the pipeline to disk after training. The compress = 1 will save the pipeline into one file. \nFor a model with size = 300 with word2vec, the model can be around 1GB. \n\n```\nfrom sklearn.externals import joblib\njoblib.dump(pipe_cv.best_estimator_, 'pipe_cv.pkl', compress = 1)\n\n```\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/edwardcooper/piidetect", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "piidetect", "package_url": "https://pypi.org/project/piidetect/", "platform": "", "project_url": "https://pypi.org/project/piidetect/", "project_urls": { "Homepage": "https://github.com/edwardcooper/piidetect" }, "release_url": "https://pypi.org/project/piidetect/0.0.0.2/", "requires_dist": [ "gensim", "numpy", "pandas", "faker", "tqdm", "sklearn" ], "requires_python": "", "summary": "A package to build an end-to-end ML pipeline to detect personally identifiable information from text.", "version": "0.0.0.2" }, "last_serial": 4717056, "releases": { "0.0.0.1": [ { "comment_text": "", "digests": { "md5": "af25fbfac561151416d8312a0d449c52", "sha256": "3b067c26335fd33b90e1479f8448f68beeffb34cf78a41fe154505a7993fe9ba" }, "downloads": -1, "filename": "piidetect-0.0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "af25fbfac561151416d8312a0d449c52", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8820, "upload_time": "2019-01-20T00:33:12", "url": "https://files.pythonhosted.org/packages/8c/b9/46c7e50deca7abd6c02640f36c66045a3a04c7bf418b75a6e0be33da075f/piidetect-0.0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1159679982a4a9180798899d52a591ae", "sha256": "bbe74eda97e5871bb2e15245413d1c186125c45273d9bb6012a2959629b61b34" }, "downloads": -1, "filename": "piidetect-0.0.0.1.tar.gz", "has_sig": false, "md5_digest": "1159679982a4a9180798899d52a591ae", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8424, "upload_time": "2019-01-20T00:33:13", "url": "https://files.pythonhosted.org/packages/f8/b3/83475ef6b28506602b0bd71a64e30f2fa26f55743ac74206a066a88318cb/piidetect-0.0.0.1.tar.gz" } ], "0.0.0.2": [ { "comment_text": "", "digests": { "md5": "325cabef9488340efbec7d98060b6a28", "sha256": "66d23a4a5b8f801ba89639f064ed668fd35dc5f51dd93c86f7ffe8d499b0acad" }, "downloads": -1, "filename": "piidetect-0.0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "325cabef9488340efbec7d98060b6a28", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8861, "upload_time": "2019-01-20T02:10:36", "url": "https://files.pythonhosted.org/packages/e8/ea/08f8b330956fed9908e51e318317c88b86abe564a71bf73a014aaa352157/piidetect-0.0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a76f1d2fc90312cd188729b00c17a234", "sha256": "23746fef9d84f166ee86e2ce77e9bab6fea4d86368616977c6fa08302eed980d" }, "downloads": -1, "filename": "piidetect-0.0.0.2.tar.gz", "has_sig": false, "md5_digest": "a76f1d2fc90312cd188729b00c17a234", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8502, "upload_time": "2019-01-20T02:10:38", "url": "https://files.pythonhosted.org/packages/75/ac/c995d9a9a482579eb24ec067e664bd4f9a2aec189a690d3f1b6339020098/piidetect-0.0.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "325cabef9488340efbec7d98060b6a28", "sha256": "66d23a4a5b8f801ba89639f064ed668fd35dc5f51dd93c86f7ffe8d499b0acad" }, "downloads": -1, "filename": "piidetect-0.0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "325cabef9488340efbec7d98060b6a28", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8861, "upload_time": "2019-01-20T02:10:36", "url": "https://files.pythonhosted.org/packages/e8/ea/08f8b330956fed9908e51e318317c88b86abe564a71bf73a014aaa352157/piidetect-0.0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a76f1d2fc90312cd188729b00c17a234", "sha256": "23746fef9d84f166ee86e2ce77e9bab6fea4d86368616977c6fa08302eed980d" }, "downloads": -1, "filename": "piidetect-0.0.0.2.tar.gz", "has_sig": false, "md5_digest": "a76f1d2fc90312cd188729b00c17a234", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8502, "upload_time": "2019-01-20T02:10:38", "url": "https://files.pythonhosted.org/packages/75/ac/c995d9a9a482579eb24ec067e664bd4f9a2aec189a690d3f1b6339020098/piidetect-0.0.0.2.tar.gz" } ] }