{ "info": { "author": "Yam", "author_email": "haoshaochun@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# pnlp\nThis is a pre-processing tool for NLP.\n\n## Features\n\n- a flexible pipeline for text IO\n- a flexible tool for cleaning text, extracting matches, and measuring several kinds of length\n- some magic usage in pre-processing\n\n## Install\n\n`pip install pnlp`\n\n## Usage\n\n### Iopipe\n\n```bash\ntree tests/piop_data/\n\u251c\u2500\u2500 a.md\n\u251c\u2500\u2500 b.txt\n\u251c\u2500\u2500 c.data\n\u251c\u2500\u2500 first\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 fa.md\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 fb.txt\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 fc.data\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 second\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 sa.md\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 sb.txt\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 sc.data\n\u251c\u2500\u2500 json.json\n\u251c\u2500\u2500 outfile.file\n\u251c\u2500\u2500 outjson.json\n\u2514\u2500\u2500 yml.yml\n```\n\n```python\nimport os\nfrom pnlp import piop\n\nDATA_PATH = \"./pnlp/tests/piop_data/\"\npattern = '*.md' # could also be '*.txt', 'f*.*', etc.\n\n# Get lines of all files in one directory with line index and file name\nfor line in piop.Reader(DATA_PATH, pattern):\n print(line.lid, line.fname, line.text)\n\"\"\"\n0 a.md line 1 in a.\n1 a.md line 2 in a.\n2 a.md line 3 in a.\n0 fa.md line 1 in fa.\n1 fa.md line 2 in fa\n...\n\"\"\"\n\n# Get lines of one file with line index and file name\nfor line in piop.Reader(os.path.join(DATA_PATH, \"a.md\")):\n print(line.lid, line.fname, line.text)\n\"\"\"\n0 a.md line 1 in a.\n1 a.md line 2 in a.\n2 a.md line 3 in a.\n\"\"\"\n\n# Get all filepaths in one directory\nfor path in piop.Reader.gen_files(DATA_PATH, pattern):\n 
print(path)\n\"\"\"\npnlp/tests/piop_data/a.md\npnlp/tests/piop_data/first/fa.md\npnlp/tests/piop_data/first/second/sa.md\n\"\"\"\n\n# Get the content (article) of all files in one directory with file name\npaths = piop.Reader.gen_files(DATA_PATH, pattern)\narticles = piop.Reader.gen_articles(paths)\nfor article in articles:\n print(article.fname)\n print(article.f.read())\n\"\"\"\na.md\nline 1 in a.\nline 2 in a.\nline 3 in a.\n...\n\"\"\"\n\n# Get lines of all files in one directory with line index and file name\n# the same as piop.Reader(DATA_PATH, pattern)\npaths = piop.Reader.gen_files(DATA_PATH, pattern)\narticles = piop.Reader.gen_articles(paths)\nfor line in piop.Reader.gen_flines(articles):\n print(line.lid, line.fname, line.text)\n```\n\n### Text\n\n#### Clean and Extract\n\n```python\nimport re\nfrom pnlp import ptxt\n\ntext = \"\u8fd9\u662fhttps://www.yam.gift\u957f\u5ea6\u6d4b\u8bd5\uff0c\u300a \u300b*)FSJfdsjf\ud83d\ude01![](http://xx.jpg)\u3002233.\"\npattern = re.compile(r'\\w+')\n\n# pattern is re.Pattern or str type\n# Default is '', meaning no pattern is used (actually re.compile(r'.+'))\n# If pattern is a string, a built-in pattern will be used; there are 11 types:\n#\t'chi': Chinese character\n#\t'pun': Punctuation\n#\t'whi': White space\n#\t'nwh': Non-white space\n#\t'wnb': Word and number\n#\t'nwn': Non-word and non-number\n#\t'eng': English character\n#\t'num': Number\n#\t'pic': Pictures\n#\t'lnk': Links\n#\t'emj': Emojis\n\npt = ptxt.Text(text, pattern)\n# pt.extract will return matches and their locations\nprint(pt.extract)\n\"\"\"\n{'mats': ['\u8fd9\u662f', '\u957f\u5ea6\u6d4b\u8bd5'], 'locs': [(0, 2), (22, 26)]}\n\"\"\"\nprint(pt.extract.mats, pt.extract.locs)\n\"\"\"\n['\u8fd9\u662f', '\u957f\u5ea6\u6d4b\u8bd5'] [(0, 2), (22, 26)]\n\"\"\"\n# pt.clean will return cleaned text using the pattern\nprint(pt.clean)\n\"\"\"\nhttps://www.yam.gift\uff0c\u300a \u300b*)FSJfdsjf\ud83d\ude01![](http://xx.jpg)\u3002233.\n\"\"\"\n```\n\n#### 
Length\n\n```python\nfrom pnlp import ptxt\n\ntext = \"\u8fd9\u662fhttps://www.yam.gift\u957f\u5ea6\u6d4b\u8bd5\uff0c\u300a \u300b*)FSJfdsjf\ud83d\ude01![](http://xx.jpg)\u3002233.\"\n\npt = ptxt.Text(text)\n# Note that even if a pattern is used, the length is always computed on the raw text.\n# Length is counted by character, not by whole word or number.\nprint(\"Length of all characters: \", pt.len_all)\nprint(\"Length of all non-white characters: \", pt.len_nwh)\nprint(\"Length of all Chinese characters: \", pt.len_chi)\nprint(\"Length of all words and numbers: \", pt.len_wnb)\nprint(\"Length of all punctuations: \", pt.len_pun)\nprint(\"Length of all English characters: \", pt.len_eng)\nprint(\"Length of all numbers: \", pt.len_num)\n\"\"\"\nLength of all characters: 64\nLength of all non-white characters: 63\nLength of all Chinese characters: 6\nLength of all words and numbers: 41\nLength of all punctuations: 14\nLength of all English characters: 32\nLength of all numbers: 3\n\"\"\"\n```\n\n### Magic\n\n```python\nfrom pnlp import pmag\n\n# Nested dict\ndict1 = pmag.MagicDict()\ndict1['a']['b']['c'] = 2\nprint(dict1)\n\"\"\"\n{'a': {'b': {'c': 2}}}\n\"\"\"\n\n# Preserve all keys that share a value when a dict is reversed.\ndx = {1: 'a',\n 2: 'a',\n 3: 'a',\n 4: 'b' }\nprint(pmag.MagicDict.reverse(dx))\n\"\"\"\n{'a': [1, 2, 3], 'b': 4}\n\"\"\"\n```\n\n## Test\n\nClone the repo and enter the tests directory: \n\n```bash\ncd ./pnlp/tests\npytest\n```\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/hscspring/pnlp", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "pnlp", "package_url": "https://pypi.org/project/pnlp/", "platform": "", "project_url": "https://pypi.org/project/pnlp/", "project_urls": { "Homepage": "https://github.com/hscspring/pnlp" }, "release_url": "https://pypi.org/project/pnlp/0.11/", 
"requires_dist": [ "addict", "pyyaml", "smart-open" ], "requires_python": "", "summary": "A pre-processing tool for NLP.", "version": "0.11" }, "last_serial": 5237646, "releases": { "0.0.2": [ { "comment_text": "", "digests": { "md5": "7d4a6255953a91754e1836ae70bc4554", "sha256": "2d3087e0f38096935f87659068356d0cc8160554b5aaf64e16dfbc0cc2b64ab8" }, "downloads": -1, "filename": "pnlp-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "7d4a6255953a91754e1836ae70bc4554", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 3409, "upload_time": "2019-04-19T10:41:38", "url": "https://files.pythonhosted.org/packages/b0/6b/71cd4f29f73e70a5f7d64849537f898c1dfa27ed5d590c4c66edb3068f4f/pnlp-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3fa247c80a9e5e6a550fe78bfae2eeab", "sha256": "b121d21d21e183b0efd237c258d7ea95a8ad61450e05f18cd2a7658ce4b76f40" }, "downloads": -1, "filename": "pnlp-0.0.2.tar.gz", "has_sig": false, "md5_digest": "3fa247c80a9e5e6a550fe78bfae2eeab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1990, "upload_time": "2019-04-19T10:41:40", "url": "https://files.pythonhosted.org/packages/2a/fb/ab406c15eb7ee5cbcdb577cbfccd9bab5f81df6993c16f8b4e605a20fcd3/pnlp-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "e8cc609cdadca01f37a406a3ebe50360", "sha256": "658a80a3b8def33d365e0dd9a0a3f8712e18811f76a6b461a5fe69bddd903b89" }, "downloads": -1, "filename": "pnlp-0.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "e8cc609cdadca01f37a406a3ebe50360", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7381, "upload_time": "2019-04-24T04:24:09", "url": "https://files.pythonhosted.org/packages/a3/48/b2d731a4c3b3b1b4faa10c6e7ef65b9145606c989fbf63af7472736f6303/pnlp-0.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b5d730413e1a80a91a0eae9096cac3bc", "sha256": 
"4d9ba49c92af62883bee5458fdfa0e067edf19a490aca47bccd7303266d8ba96" }, "downloads": -1, "filename": "pnlp-0.0.3.tar.gz", "has_sig": false, "md5_digest": "b5d730413e1a80a91a0eae9096cac3bc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4680, "upload_time": "2019-04-24T04:24:12", "url": "https://files.pythonhosted.org/packages/98/f2/c53df7c78cfc694a2144adc7fafcf1ff96e061e9171662ed6845e58cd73f/pnlp-0.0.3.tar.gz" } ], "0.1": [ { "comment_text": "", "digests": { "md5": "f32479607462bcba6256b21bea984bbb", "sha256": "047c6eb174a8bee14fd63ad72b38588a4e04e9c51e8698d87def656833a5fb31" }, "downloads": -1, "filename": "pnlp-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "f32479607462bcba6256b21bea984bbb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7367, "upload_time": "2019-04-24T09:25:22", "url": "https://files.pythonhosted.org/packages/a2/ee/47bd70f9af790f6d7d0dc563f41d9c72ffbd05ff888f330b74c25a5afa13/pnlp-0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f55a01734655dbf292dee7af89dae0da", "sha256": "d1bcddbc09f1f6c69816cce186e6e9375a439060cbe447bd1105f2e6c7a8f384" }, "downloads": -1, "filename": "pnlp-0.1.tar.gz", "has_sig": false, "md5_digest": "f55a01734655dbf292dee7af89dae0da", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6095, "upload_time": "2019-04-24T09:25:24", "url": "https://files.pythonhosted.org/packages/17/69/10175c45a91ece368a2ce264f0c7f8067945ce88020a742bb8209f634cd6/pnlp-0.1.tar.gz" } ], "0.11": [ { "comment_text": "", "digests": { "md5": "1dba47d12521050d9098fc7c1cb42f81", "sha256": "5ce67b4b9e4b3ee9f6d25458bf0df400dddd16c1d2a6713e5d020d67ed3b1546" }, "downloads": -1, "filename": "pnlp-0.11-py3-none-any.whl", "has_sig": false, "md5_digest": "1dba47d12521050d9098fc7c1cb42f81", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7410, "upload_time": "2019-05-07T10:42:37", "url": 
"https://files.pythonhosted.org/packages/dc/e0/134e114e60a785c84b0478fc7f2ca93085a77043df17fa256f4f3dbe1e79/pnlp-0.11-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d13ce69ab4b17063d5c9ec99f3971cd1", "sha256": "ca25f73983a6468000e73030be542637f7f8f9201c0ee39568955385670726dd" }, "downloads": -1, "filename": "pnlp-0.11.tar.gz", "has_sig": false, "md5_digest": "d13ce69ab4b17063d5c9ec99f3971cd1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6177, "upload_time": "2019-05-07T10:42:40", "url": "https://files.pythonhosted.org/packages/12/5b/a036fdcfcadee0c1953bf81ed5dbd311725dfe42b85f6c83170c1316b753/pnlp-0.11.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1dba47d12521050d9098fc7c1cb42f81", "sha256": "5ce67b4b9e4b3ee9f6d25458bf0df400dddd16c1d2a6713e5d020d67ed3b1546" }, "downloads": -1, "filename": "pnlp-0.11-py3-none-any.whl", "has_sig": false, "md5_digest": "1dba47d12521050d9098fc7c1cb42f81", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7410, "upload_time": "2019-05-07T10:42:37", "url": "https://files.pythonhosted.org/packages/dc/e0/134e114e60a785c84b0478fc7f2ca93085a77043df17fa256f4f3dbe1e79/pnlp-0.11-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d13ce69ab4b17063d5c9ec99f3971cd1", "sha256": "ca25f73983a6468000e73030be542637f7f8f9201c0ee39568955385670726dd" }, "downloads": -1, "filename": "pnlp-0.11.tar.gz", "has_sig": false, "md5_digest": "d13ce69ab4b17063d5c9ec99f3971cd1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6177, "upload_time": "2019-05-07T10:42:40", "url": "https://files.pythonhosted.org/packages/12/5b/a036fdcfcadee0c1953bf81ed5dbd311725dfe42b85f6c83170c1316b753/pnlp-0.11.tar.gz" } ] }