{
"info": {
"author": "Weihan Jiang",
"author_email": "weihan.github@gmail.com",
"bugtrack_url": null,
"classifiers": [],
"description": "Python Word Segmentation\r\n========================\r\n \r\nWordSegmentation is an Apache2 licensed module for English word\r\nsegmentation, written in pure-Python, and based on a trillion-word corpus.\r\n \r\nInspired by Grant Jenks https://pypi.python.org/pypi/wordsegment.\r\nBased on word weighing algorithm from the chapter \"Natural Language Corpus Data\" by Peter Norvig\r\nfrom the book \"Beautiful Data\" (Segaran and Hammerbacher, 2009).\r\n \r\nData files are derived from the Google Web Trillion Word Corpus, as described\r\nby Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium.\r\n\r\nFeatures\r\n--------\r\n \r\n- Pure-Python\r\n- Segmentation Algorithm Using Divide and Conquer so that there is NO max length limit set to input text.\r\n- Segmentation Algotrithm used Dynamic Programming to achieve a polynomial time complexity.\r\n- Used Google Trillion Corpus to do scoring for the word segmentation.\r\n- Developed on Python 2.7\r\n- Tested on CPython 2.6, 2.7, 3.4.\r\n\r\nQuickstart\r\n----------\r\n \r\nInstalling WordSegment is simple with `pip `_::\r\n \r\n $ pip install wordsegmentation\r\n\r\nDependency required `networkx `_::\r\n\r\n $ pip install networkx\r\n\r\nTutorial\r\n--------\r\n \r\nIn your own Python programs, you'll mostly want to use `segment` to divide a phrase into a list of its parts::\r\n \r\n >>> from wordsegmentation import WordSegment\r\n >>> ws = WordSegment()\r\n \r\n >>> ws.segment('universityofwashington')\r\n ['university', 'of', 'washington']\r\n >>> ws.segment('thisisatest')\r\n ['this', 'is', 'a', 'test'] \r\n >>> ws.segment('thisisanURLcontaining123345and-&**^&butitstillworks')\r\n ['this', 'is', 'an', 'url', 'containing', '123345', 'and', '-&**^&', 'but', 'it', 'still', 'works']\r\n >>> ws.segment('NoMatterHowLongThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextMightBe')\r\n ['no', 'matter', 'how', 'long', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'might', 'be']\r\n\r\nBug Report\r\n------------\r\n\r\n`Weihan@Github `_\r\n\r\nweihan.github AT gmail.com\r\n\r\nTech Details\r\n------------\r\n \r\nIn the code, the segmentation algorithm consists of the following steps,\r\n\r\n1. divide and conquer -- safely divide the input string into substring. This way we solved the length limit which will dramatically slow down the performance. For example, \"facebook123helloworld\" will be treated as 3 sub-problems -- \"facebook\", \"123\", and \"helloworld\". \r\n2. for each sub-string. I used dynamic programming to calculate and get the optimal words.\r\n3. combine the sub-problems, and return the result for the original string.\r\n\r\nSegmentation algorithm used in this module, has achieved a time-complexity of O(n^2). \r\nBy comparison to existing segmentation algorithms, this module does better on following aspects,\r\n\r\n1. can handle very long input. There is no arbitary max lenght limit set to input string.\r\n2. segmentation finished in polynomial time via dynamic programming.\r\n3. by default, the algorithm uses a filtered Google corpus, which contains only English words that could be found in dictionary.\r\n \r\nAn extreme example is to solve the classic `English Scriptio_continua segmentation problem `_ as shown below::\r\n >>>ws.segment('MARGARETAREYOUGRIEVINGOVERGOLDENGROVEUNLEAVINGLEAVESLIKETHETHINGSOFMANYOUWITHYOURFRESHTHOUGHTSCAREFORCANYOUAHASTHEHEARTGROWSOLDERITWILLCOMETOSUCHSIGHTSCOLDERBYANDBYNORSPAREASIGHTHOUGHWORLDSOFWANWOODLEAFMEALLIEANDYETYOUWILLWEEPANDKNOWWHYNOWNOMATTERCHILDTHENAMESORROWSSPRINGSARETHESAMENORMOUTHHADNONORMINDEXPRESSEDWHATHEARTHEARDOFGHOSTGUESSEDITISTHEBLIGHTMANWASBORNFORITISMARGARETYOUMOURNFOR')\r\n\r\n\r\nOur algorithm solved this issue in polynomial time, and the output is::\r\n \r\n [\u2018margaret', 'are', 'you', 'grieving', 'over', 'golden', 'grove', 'un', 'leaving', 'leaves', 'like', 'the', 'things', 'of', 'man', 'you', 'with', 'your', 'fresh', 'thoughts', 'care', 'for', 'can', 'you', 'a', 'has', 'the', 'he', 'art', 'grows', 'older', 'it', 'will', 'come', 'to', 'such', 'sights', 'colder', 'by', 'and', 'by', 'nor', 'spa', 're', 'a', 'sigh', 'though', 'worlds', 'of', 'wan', 'wood', 'leaf', 'me', 'allie', 'and', 'yet', 'you', 'will', 'weep', 'and', 'know', 'why', 'now', 'no', 'matter', 'child', 'the', 'name', 'sorrows', 'springs', 'are', 'the', 'same', 'nor', 'mouth', 'had', 'non', 'or', 'mind', 'expressed', 'what', 'he', 'art', 'heard', 'of', 'ghost', 'guessed', 'it', 'is', 'the', 'blight', 'man', 'was', 'born', 'for', 'it', 'is', 'margaret', 'you', 'mourn', 'for']",
"description_content_type": null,
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "",
"keywords": "word segmentation; word segmentation python; word segment; URL segment; domain segmentation; long sentence segment; long text algorithm polynomial",
"license": "Copyright 2015 Weihan Jiang",
"maintainer": "Weihan Jiang",
"maintainer_email": "weihan.github@gmail.com",
"name": "wordsegmentation",
"package_url": "https://pypi.org/project/wordsegmentation/",
"platform": "",
"project_url": "https://pypi.org/project/wordsegmentation/",
"project_urls": null,
"release_url": "https://pypi.org/project/wordsegmentation/0.3.5/",
"requires_dist": [
"networkx"
],
"requires_python": "",
"summary": "English word segmentation.",
"version": "0.3.5"
},
"last_serial": 1792364,
"releases": {
"0.0.2": [
{
"comment_text": "",
"digests": {
"md5": "f9e617f1fca5681cf45d96efdeb9337a",
"sha256": "0bb1543d37b2130f76561e657205ff25021dd9d205ce0f9d105b853d381fca62"
},
"downloads": -1,
"filename": "wordsegmentation-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "f9e617f1fca5681cf45d96efdeb9337a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4890959,
"upload_time": "2015-10-25T23:25:42",
"url": "https://files.pythonhosted.org/packages/b2/a6/7cd76401fdfa872ce4bdc57e94136924ad0f851f9f976c5ac11acba0aa26/wordsegmentation-0.0.2.tar.gz"
}
],
"0.3": [
{
"comment_text": "",
"digests": {
"md5": "05350b554bea45998357f393d7e22581",
"sha256": "d23435a67bcb8a8aefedb582996e7f19c077544ca306dff0f7bddd31ac1e569a"
},
"downloads": -1,
"filename": "wordsegmentation-0.3-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "05350b554bea45998357f393d7e22581",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 5947585,
"upload_time": "2015-10-25T23:18:46",
"url": "https://files.pythonhosted.org/packages/56/57/3b5c3d10de1520da6e31be9f49d1b156ec2a723f8e7f8f4f8ca7c81fe0b0/wordsegmentation-0.3-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "6cf9a4ed8cd34cada0067c44339f4003",
"sha256": "f6bdbe497cc06a5c5e960a265f5255f311eaa6da8b3d8aa7a5028651d1c32a46"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.tar.gz",
"has_sig": false,
"md5_digest": "6cf9a4ed8cd34cada0067c44339f4003",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 5941725,
"upload_time": "2015-10-25T23:18:57",
"url": "https://files.pythonhosted.org/packages/6a/96/1442581eeae46e34fe409becabdad3eda93461c45217e8a76dcae6de5738/wordsegmentation-0.3.tar.gz"
}
],
"0.3.2": [
{
"comment_text": "",
"digests": {
"md5": "0a2b04c29fe5ed63d2566d65491e3507",
"sha256": "c5ec0f7edb76959f794d5a439ab6e72814bf5f455ad570f4951536810e05e5e1"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.2.tar.gz",
"has_sig": false,
"md5_digest": "0a2b04c29fe5ed63d2566d65491e3507",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4890964,
"upload_time": "2015-10-25T23:27:09",
"url": "https://files.pythonhosted.org/packages/29/df/c3d06385842c4f51129dbc25e7a645028938bbb493aac0e712ecf78e7adf/wordsegmentation-0.3.2.tar.gz"
}
],
"0.3.3": [
{
"comment_text": "",
"digests": {
"md5": "d836486e7c9ac416e6079987a9b187d0",
"sha256": "15aed3932525c1f1e227e186abb866226979cf7e8762aaab8a3c8a3f174651a9"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.3-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "d836486e7c9ac416e6079987a9b187d0",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 8183218,
"upload_time": "2015-10-25T23:34:07",
"url": "https://files.pythonhosted.org/packages/93/86/d98802c0e369179d154a0239df49bf8804d013ff7e9e60127e6d8f8288ff/wordsegmentation-0.3.3-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "cf0085a69366f4f64ddf5a98b09f76a9",
"sha256": "293e82cc9b7133aef43d774a97fb0bab33970aa852ff22ba00bc453f47e14b87"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.3.tar.gz",
"has_sig": false,
"md5_digest": "cf0085a69366f4f64ddf5a98b09f76a9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4890970,
"upload_time": "2015-10-25T23:34:15",
"url": "https://files.pythonhosted.org/packages/20/9c/d992ea16a1a4553181780e836ffc0df3d70e426e4918309ca0589743b4dc/wordsegmentation-0.3.3.tar.gz"
}
],
"0.3.4": [
{
"comment_text": "",
"digests": {
"md5": "bb63ee2d162087bf929e3c65dc2f370c",
"sha256": "d7c409c5528d5eaa6e0f296050e8cb7d43c5e460275496a41acd7d293cee0979"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.4-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "bb63ee2d162087bf929e3c65dc2f370c",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 8185392,
"upload_time": "2015-10-25T23:51:40",
"url": "https://files.pythonhosted.org/packages/8c/42/61db9985247c761a01441bab8208a575a2f30a03f0c5ad3bc77b4a769034/wordsegmentation-0.3.4-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "6797cc812e2eefec5c1df599bf586dc0",
"sha256": "5249899ffbb9e42e297a2459ac30cb52fbc51c83dff68ffe4cb8dd8b49c19a8e"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.4.tar.gz",
"has_sig": false,
"md5_digest": "6797cc812e2eefec5c1df599bf586dc0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4891055,
"upload_time": "2015-10-25T23:51:49",
"url": "https://files.pythonhosted.org/packages/ff/4f/f4c8da60216d08fde4845e9435652df256eb1111553d84f945d245971c69/wordsegmentation-0.3.4.tar.gz"
}
],
"0.3.5": [
{
"comment_text": "",
"digests": {
"md5": "f07aafe599a8776d8ec43fe2f4673c8b",
"sha256": "3c55f55941678627a879678a22c01687db27915e73e53a0d2763901f679d9150"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.5-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "f07aafe599a8776d8ec43fe2f4673c8b",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 8184946,
"upload_time": "2015-10-26T00:10:14",
"url": "https://files.pythonhosted.org/packages/97/cf/83b624050e4f9d7d3d99fdf15c2ec38af149b4e205bab0ecf28d6772b528/wordsegmentation-0.3.5-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "102ada6b311e0e817a9435ccb4cbab62",
"sha256": "1ba8cc24567816cee2df32844aed6d55e1863b7d3eb548ce5188ea23b7d9caf4"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.5.tar.gz",
"has_sig": false,
"md5_digest": "102ada6b311e0e817a9435ccb4cbab62",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4890852,
"upload_time": "2015-10-26T00:10:32",
"url": "https://files.pythonhosted.org/packages/54/3d/eaf59a59d8a27b54332cecbebfc34cc6c885f29e39b8c9ebc9d04e71045c/wordsegmentation-0.3.5.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "f07aafe599a8776d8ec43fe2f4673c8b",
"sha256": "3c55f55941678627a879678a22c01687db27915e73e53a0d2763901f679d9150"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.5-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "f07aafe599a8776d8ec43fe2f4673c8b",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 8184946,
"upload_time": "2015-10-26T00:10:14",
"url": "https://files.pythonhosted.org/packages/97/cf/83b624050e4f9d7d3d99fdf15c2ec38af149b4e205bab0ecf28d6772b528/wordsegmentation-0.3.5-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "102ada6b311e0e817a9435ccb4cbab62",
"sha256": "1ba8cc24567816cee2df32844aed6d55e1863b7d3eb548ce5188ea23b7d9caf4"
},
"downloads": -1,
"filename": "wordsegmentation-0.3.5.tar.gz",
"has_sig": false,
"md5_digest": "102ada6b311e0e817a9435ccb4cbab62",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4890852,
"upload_time": "2015-10-26T00:10:32",
"url": "https://files.pythonhosted.org/packages/54/3d/eaf59a59d8a27b54332cecbebfc34cc6c885f29e39b8c9ebc9d04e71045c/wordsegmentation-0.3.5.tar.gz"
}
]
}