{ "info": { "author": "James Mishra", "author_email": "j@jamesmishra.com", "bugtrack_url": null, "classifiers": [ "Environment :: Console", "Intended Audience :: Developers", "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX :: Linux", "Programming Language :: Cython", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "cypunct: Fast-ish unicode string splitting\n******************************************\nCypunct is designed to solve the problem of quickly splitting a Unicode\nstring based on a set of characters.\n\nCypunct is designed to work on Python 2.6, 2.7, and 3.3+. Because\nCypunct is a Cython extension, it will (probably) only work in the CPython\nruntime.\n\nFor Python versions 2.6 and 2.7, Cypunct will only run if these\nCPython runtimes are compiled with the flag\n``--enable-unicode=ucs4``. Cypunct will throw an exception\nif your Python 2 runtime was not compiled with UCS-4.\n\nInstallation\n============\nInstallation is easiest with pip. Just run\n\n.. code:: bash\n\n pip install cypunct\n\n\nUsage\n=====\nCypunct takes a Unicode string and a ``frozenset`` of delimiter characters,\nand splits the string based on that set. Every delimiter character\nshould be a single Unicode code point -- ``len(char)`` should be 1.\n\nA simple example, where we provide a small ``frozenset``, is below.\n\n.. code:: python\n\n >>> from cypunct import split\n >>> split(\"James Mishra is the... best human ever, or so I think.\", frozenset({' ', '.', ','}))\n ['James', 'Mishra', 'is', 'the', 'best', 'human', 'ever', 'or', 'so', 'I', 'think', '']\n\nHowever, if you only need to split on whitespace characters, ``str.split()`` offers much\nbetter performance. 
If you only need to split on one character, ``str.split(char)``\nwill also be much faster.\n\nCypunct really shines when you need to split on many possible characters,\nsuch as an entire `Unicode character category `_.\n\nThe example below splits on all Unicode punctuation, and nothing else.\n\n.. code:: python\n\n >>> from cypunct.unicode_classes import P\n >>> split(\"James Mishra is the... best human ever, or so I think.\", P)\n ['James Mishra is the', ' best human ever', ' or so I think', '']\n \nThe following Unicode classes are available as sets:\n\n\n======== ===========\nCategory Description\n======== ===========\n**C**    **Other**\nCc       Other, Control\nCf       Other, Format\nCo       Other, Private Use\nCs       Other, Surrogate\n**L**    **Letter**\nLl       Letter, Lowercase\nLm       Letter, Modifier\nLo       Letter, Other\nLt       Letter, Titlecase\nLu       Letter, Uppercase\n**M**    **Mark**\nMc       Mark, Spacing Combining\nMe       Mark, Enclosing\nMn       Mark, Nonspacing\n**N**    **Number**\nNd       Number, Decimal Digit\nNl       Number, Letter\nNo       Number, Other\n**P**    **Punctuation**\nPc       Punctuation, Connector\nPd       Punctuation, Dash\nPe       Punctuation, Close\nPf       Punctuation, Final Quote\nPi       Punctuation, Initial Quote\nPo       Punctuation, Other\nPs       Punctuation, Open\n**S**    **Symbol**\nSc       Symbol, Currency\nSk       Symbol, Modifier\nSm       Symbol, Math\nSo       Symbol, Other\n**Z**    **Separator**\nZl       Separator, Line\nZp       Separator, Paragraph\nZs       Separator, Space\n======== ===========\n\n\n``cypunct.unicode_classes.COMMON_SEPARATORS`` is the union of the ``C``, ``P``, ``S``, and ``Z``\n``frozenset`` objects. I have found it personally useful when splitting text for natural\nlanguage processing applications.\n\nIf you don't specify a ``frozenset`` for Cypunct to use, then Cypunct will\ndefault to ``COMMON_SEPARATORS``.\n \nUpdating Unicode data\n=====================\nCurrently, ``cypunct.unicode_classes`` is a Python module autogenerated from a\n``UnicodeData.txt`` file. 
The autogeneration script exists in\n`make_punctuation_file.py `_.\n\nMost Cypunct users will not need to concern themselves with this, but this is important\nto know if you are experiencing Unicode bugs or want to contribute to Cypunct.\n\nThe current ``UnicodeData.txt`` is from ftp://ftp.unicode.org/Public/10.0.0/ucd/UnicodeData.txt.\n\nFrequently Asked Questions (FAQ)\n================================\n**Q: I got an installation error involving\n\"pkg_resources.VersionConflict (setuptools xx.xx.xx\".\nHow do I fix this?**\n\nYou have a very old version of setuptools, and we won't be able to\ncompile our Cython extension with it. Run\n``pip install --upgrade setuptools`` and try installing Cypunct again.\n\n**Q: Wouldn't this be way faster if it were written in pure C?**\n\nYes, it would. I'm too lazy to hand-code a C CPython extension, but it's on my todo list.\nRight now, Cypunct is *\"fast enough\"*, and I can move on to other things in my\ndaily life.\n\nHowever, if you want to take on the challenge of rewriting Cypunct in C with\nthe exact same functionality as the current Cython version, I'll send you $100 USD.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jamesmishra/cypunct", "keywords": "unicode string splitting", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "cypunct", "package_url": "https://pypi.org/project/cypunct/", "platform": "", "project_url": "https://pypi.org/project/cypunct/", "project_urls": { "Homepage": "https://github.com/jamesmishra/cypunct" }, "release_url": "https://pypi.org/project/cypunct/0.1.1/", "requires_dist": null, "requires_python": "", "summary": "Cypunct is a Cython package to split Unicode strings based on a given frozenset of Unicode code points.", "version": "0.1.1" }, "last_serial": 2998004, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": 
"70463a1222548a58f851c38debb1f325", "sha256": "df23f7c8b176b23d833445ceac91ad882c1366f6dea84679270f57d8ce835901" }, "downloads": -1, "filename": "cypunct-0.1.0.tar.gz", "has_sig": false, "md5_digest": "70463a1222548a58f851c38debb1f325", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 185555, "upload_time": "2017-07-03T22:52:38", "url": "https://files.pythonhosted.org/packages/ea/fa/dee18d5e00eb55b842fd24f79f8960e98aa84afb57a1e333382191857e7f/cypunct-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "16121e30b3385ed5135dbd68f2eda173", "sha256": "6f3999419b1a6541c223991b64f7255e3c519e568aaf93cd04109c1df3240056" }, "downloads": -1, "filename": "cypunct-0.1.1.tar.gz", "has_sig": false, "md5_digest": "16121e30b3385ed5135dbd68f2eda173", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 185556, "upload_time": "2017-07-03T23:03:44", "url": "https://files.pythonhosted.org/packages/e3/26/23e1e676fc8b1c86cdf5360969942cd1bbd2ad86606fee1911d5b4edcdfc/cypunct-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "16121e30b3385ed5135dbd68f2eda173", "sha256": "6f3999419b1a6541c223991b64f7255e3c519e568aaf93cd04109c1df3240056" }, "downloads": -1, "filename": "cypunct-0.1.1.tar.gz", "has_sig": false, "md5_digest": "16121e30b3385ed5135dbd68f2eda173", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 185556, "upload_time": "2017-07-03T23:03:44", "url": "https://files.pythonhosted.org/packages/e3/26/23e1e676fc8b1c86cdf5360969942cd1bbd2ad86606fee1911d5b4edcdfc/cypunct-0.1.1.tar.gz" } ] }