{ "info": { "author": "Kyle Gorman", "author_email": "kylebgorman@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "Detector Morse\n==============\n\nDetector Morse is a program for sentence boundary detection (henceforth, SBD), also known as sentence segmentation. Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:\n\n Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain\n steady at about 1,200 cars in 1990.\n\nThis sentence contains 4 periods, but only the last denotes a sentence boundary. The first one in `U.S.` is unambiguously part of an acronym, not a sentence boundary; the same is true of expressions like `$12.53`. But the periods at the end of `Inc.` and `U.S.` could easily denote a sentence boundary. Humans use the local context to determine that neither period denote sentence boundaries (e.g. the selectional properties of the verb _expect_ are not met if there is a sentence bounary immediately after `U.S.`). Detector Morse uses artisinal, handcrafted contextual features and low-impact, leave-no-trace machine learning methods to automatically detect sentence boundaries.\n\nSBD is one of the earliest pieces of many natural language processing pipelines. Since errors at this step are likely to propagate, SBD is an important---albeit overlooked---problem in natural language processing.\n\nDetector Morse has been tested on CPython 3.4 and PyPy3 (2.3.1, corresponding to Python 3.2); the latter is much faster. Detector Morse depends on the Python module `nlup` (which in turn relies on `jsonpickle`) to (de)serialize models. For the versions used, see `requirements.txt`.\n\nInstallation\n============\n\n```\npip install detectormorse\n```\n\nUsage\n=====\n\n```\nDetector Morse, by Kyle Gorman\n \nusage: python -m detectormorse [-h] [-v | -V] (-t TRAIN | -r [READ])\n (-s SEGMENT | -w WRITE | -e EVALUATE)\n [-E EPOCHS] [-C] [--preserve-whitespace]\n\nDetector Morse\n\noptional arguments:\n -h, --help show this help message and exit\n -v, --verbose enable verbose output\n -V, --really-verbose enable even more verbose output\n -t TRAIN, --train TRAIN\n training data\n -r [READ], --read [READ]\n read in a serialized model from a path or read the\n default model if no path is specified\n -s SEGMENT, --segment SEGMENT\n segment sentences\n -w WRITE, --write WRITE\n write out serialized model\n -e EVALUATE, --evaluate EVALUATE\n evaluate on segmented data\n -E EPOCHS, --epochs EPOCHS\n # of epochs (default: 20)\n -C, --nocase disable case features\n --preserve-whitespace\n preserve whitespace when segmenting\n```\n\nFiles used for training (`-t`/`--train`) and evaluation (`-e`/`--evaluate`) should contain one sentence per line; newline characters are ignored otherwise.\n\nWhen segmenting a file (`-s`/`--segment`), DetectorMorse simply inserts a newline after predicted sentence boundaries that aren't already marked by one. All other newline characters are passed through, unmolested.\n\nThe included `DM-wsj.json.gz` is a segmenter model trained on the Wall St. Journal portion of the Penn Treebank. This model can be loaded by using `detector.default_model()` or by specifying `-r` with no path at the command line.\n\nMethod\n======\n\nSee [this blog post](http://www.wellformedness.com/blog/simpler-sentence-boundary-detection/).\n\nCaveats\n=======\n\nDetectorMorse processes text by reading the entire file into memory. This means it will not work with files that won't fit into the available RAM. The easiest way to get around this is to import the `Detector` instance in your own Python script.\n\nExciting extras!\n================\n\nI've included a Perl script `untokenize.pl` which attempts to invert the Penn Treebank tokenization process. Tokenization is an inherently \"lossy\" procedure, so there is no guarantee that the output is exactly how it appeared in the WSJ. But, the rules appear to be correct and produce sane text, and I have used it for all experiments. **Update (2015-02-10): I've removed this script; I just use the Stanford tokenizer for this purpose, now.**", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "DetectorMorse", "package_url": "https://pypi.org/project/DetectorMorse/", "platform": "", "project_url": "https://pypi.org/project/DetectorMorse/", "project_urls": null, "release_url": "https://pypi.org/project/DetectorMorse/0.4.1/", "requires_dist": null, "requires_python": "", "summary": "DetectorMorse, a sentence splitter", "version": "0.4.1" }, "last_serial": 4576271, "releases": { "0.4.0": [ { "comment_text": "", "digests": { "md5": "a550868d552da42420ee2ca3d44635ba", "sha256": "30a10182a666d1a4c102c44d8503d209dbb1046b289909a265af7219556cbaa9" }, "downloads": -1, "filename": "DetectorMorse-0.4.0.tar.gz", "has_sig": false, "md5_digest": "a550868d552da42420ee2ca3d44635ba", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 521269, "upload_time": "2018-10-16T18:07:48", "url": "https://files.pythonhosted.org/packages/4c/9f/1c6e8b10b8a5e61daaa664b07be1f4b1c5e745b66e860e361bab7a4fbd14/DetectorMorse-0.4.0.tar.gz" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "263d08c208eb94f9ab4f27de2444a362", "sha256": "8404509be1f6ed3bba4baff2394257a080eae855f8188abb077854b684d8107d" }, "downloads": -1, "filename": "DetectorMorse-0.4.1.tar.gz", "has_sig": false, "md5_digest": "263d08c208eb94f9ab4f27de2444a362", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 336958, "upload_time": "2018-12-08T23:21:43", "url": "https://files.pythonhosted.org/packages/c6/d6/950d6fffeebd24718f444a927259e743f148bc2a63141bf040f650e2b4d2/DetectorMorse-0.4.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "263d08c208eb94f9ab4f27de2444a362", "sha256": "8404509be1f6ed3bba4baff2394257a080eae855f8188abb077854b684d8107d" }, "downloads": -1, "filename": "DetectorMorse-0.4.1.tar.gz", "has_sig": false, "md5_digest": "263d08c208eb94f9ab4f27de2444a362", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 336958, "upload_time": "2018-12-08T23:21:43", "url": "https://files.pythonhosted.org/packages/c6/d6/950d6fffeebd24718f444a927259e743f148bc2a63141bf040f650e2b4d2/DetectorMorse-0.4.1.tar.gz" } ] }