{ "info": { "author": "Dan Gillick", "author_email": "dgillick@cs.berkeley.edu", "bugtrack_url": null, "classifiers": [], "description": "Improved Sentence Boundary Detection\r\n====================================\r\n\r\nOverview\r\n--------\r\n\r\nConsider the following text:\r\n\r\n\"On Jan. 20, former Sen. Barack Obama became the 44th President of the\r\nU.S. Millions attended the Inauguration.\"\r\n\r\nThe periods are potentially ambiguous, signifying either the end of a\r\nsentence, an abbreviation, or both. The sentence boundary detection\r\n(SBD) task involves disambiguating the periods, and in particular,\r\nclassifying each period as end-of-sentence or not. In the example, of the\r\nthree periods that follow abbreviations, only the one at the end of U.S.\r\nshould be classified as end-of-sentence.\r\n\r\nChances are, if you are using some SBD system, it has an error rate of\r\n1%-3% on English newswire text. The system described here achieves the\r\nbest known error rate on the Wall Street Journal corpus (0.25%), and\r\ncomparable error rates on the Brown corpus (mixed genre) and other test\r\ncorpora.\r\n\r\nBackground\r\n----------\r\n\r\nSBD is fundamental to many natural language processing problems, but\r\nonly a few papers describe solutions. A variety of rule-based systems\r\nare floating around, and a few semi-statistical systems are available if\r\nyou know where to look. The most widely cited are:\r\n\r\n- Alembic (Aberdeen, et al. 
1995): Abbreviation list and ~100\r\n  hand-crafted regular expressions.\r\n- Satz (Palmer & Hearst at Berkeley, 1997): Part of speech features and\r\n  abbreviation lists as input to a classifier (neural nets and decision\r\n  trees have similar performance).\r\n- mxTerminator (Reynar & Ratnaparkhi, 1997): Maximum entropy\r\n  classification with simple lexical features.\r\n- Mikheev (Mikheev, 2002): Observes that perfect labels for\r\n  abbreviations and names give almost perfect SBD results. Creates\r\n  heuristics for marking these, unsupervised, from held-out data.\r\n- Punkt (Strunk and Kiss, 2006): Unsupervised method uses heuristics to\r\n  identify abbreviations and sentence starters.\r\n\r\nI have not been able to find publicly available copies of any of these\r\nsystems, with the exception of Punkt, which ships with NLTK.\r\nNonetheless, here are some error rates reported on what I believe to be\r\nthe same subset of the WSJ corpus (sections 03-16).\r\n\r\n- Alembic: 0.9%\r\n- Satz: 1.5%; 1.0% with extra hand-written lists of abbreviations and\r\n  non-names.\r\n- mxTerminator: 2.0%; 1.2% with extra abbreviation list.\r\n- Mikheev: 1.4%; 0.45% with abbreviation list (assembled automatically\r\n  but carefully tuned; test-set-dependent parameters are a concern)\r\n- Punkt: 1.65% (though if you use the model that ships with NLTK,\r\n  you'll get over 3%)\r\n\r\nAll of these systems use lists of abbreviations in some capacity, which\r\nI think is a mistake. Some abbreviations almost never end a sentence\r\n(Mr.), which makes list-building appealing. But many abbreviations are\r\nmore ambiguous (U.S., U.N.), which complicates the decision.\r\n\r\n--------------\r\n\r\nWhile 1%-3% is a low error rate, this is often not good enough. In\r\nautomatic document summarization, for example, including a sentence\r\nfragment usually renders the resulting summary unintelligible. With\r\n10-sentence summaries, 1 in 10 is ruined by an SBD system with 99%\r\naccuracy. 
Improving the accuracy to 99.75% means only 1 in 40 is ruined.\r\nImproved sentence boundary detection is also likely to help with\r\nlanguage modeling and text alignment.\r\n\r\n--------------\r\n\r\nI built a supervised system that classifies sentence boundaries without\r\nany heuristics or hand-generated lists. It uses the same training data\r\nas mxTerminator, and supports either Naive Bayes or SVM models (SVM Light).\r\n\r\n+-------------------------------------+---------+---------------+\r\n| Corpus                              | SVM     | Naive Bayes   |\r\n+=====================================+=========+===============+\r\n| WSJ                                 | 0.25%   | 0.35%         |\r\n+-------------------------------------+---------+---------------+\r\n| Brown                               | 0.36%   | 0.45%         |\r\n+-------------------------------------+---------+---------------+\r\n| Complete Works of Edgar Allan Poe   | 0.52%   | 0.44%         |\r\n+-------------------------------------+---------+---------------+\r\n\r\nI've packaged this code, written in Python, for general use. Word-level\r\ntokenization, which is particularly important for good sentence boundary\r\ndetection, is included.\r\n\r\nNote that the included models use all of the labeled data listed here,\r\nmeaning that the expected results are somewhat better than the numbers\r\nreported above. Including the Brown data as training improves the WSJ\r\nresult to 0.22% and the Poe result to 0.37% (using the SVM).\r\n\r\nPerformance\r\n-----------\r\n\r\nA few other notes on performance. The standard WSJ test corpus includes\r\n26977 possible sentence boundaries. About 70% are in fact sentence\r\nboundaries. Classification with the included SVM model will give 59\r\nerrors. Of these, 24 (41%) involve the word \"U.S.\", a particularly\r\ninteresting case. In training, \"U.S.\" appears 2029 times, and 90 of\r\nthese are sentence boundaries. Further complicating the situation,\r\n\"U.S.\" often appears in a context like \"U.S. 
Security Council\" or \"U.S.\r\nGovernment\", and both \"Security\" and \"Government\" are viable sentence\r\nstarters.\r\n\r\nOther confusing cases include \"U.N.\", \"U.K.\", and state abbreviations\r\nlike \"N.Y.\", which have characteristics similar to those of \"U.S.\" but\r\nappear somewhat less frequently.\r\n\r\nSetup\r\n-----\r\n\r\nSee SETUP.md for setup instructions and notes.", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/hinstitute/splitta/tarball/0.1.0", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/hinstitute/splitta", "keywords": "splitta,setence bounadry detection,sbd", "license": "(c) 2009 Dan Gillick", "maintainer": "", "maintainer_email": "", "name": "splitta", "package_url": "https://pypi.org/project/splitta/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/splitta/", "project_urls": { "Download": "https://github.com/hinstitute/splitta/tarball/0.1.0", "Homepage": "https://github.com/hinstitute/splitta" }, "release_url": "https://pypi.org/project/splitta/0.1.0/", "requires_dist": null, "requires_python": null, "summary": "Sentence boundary detection", "version": "0.1.0" }, "last_serial": 1349492, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "219e00ebfe0d946c53cb0bfaf7e36490", "sha256": "578c601fca7a93ddbbf07a06298c44057248dbd786e2479ed3c9e6a71548503a" }, "downloads": -1, "filename": "splitta-0.1.0.tar.gz", "has_sig": false, "md5_digest": "219e00ebfe0d946c53cb0bfaf7e36490", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6943603, "upload_time": "2014-12-18T16:41:41", "url": "https://files.pythonhosted.org/packages/cf/d2/9771eb65f1dc3925dbcfc7c4b2adaefa38e1549e4e4e75409df316f8c453/splitta-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "219e00ebfe0d946c53cb0bfaf7e36490", "sha256": "578c601fca7a93ddbbf07a06298c44057248dbd786e2479ed3c9e6a71548503a" }, 
"downloads": -1, "filename": "splitta-0.1.0.tar.gz", "has_sig": false, "md5_digest": "219e00ebfe0d946c53cb0bfaf7e36490", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6943603, "upload_time": "2014-12-18T16:41:41", "url": "https://files.pythonhosted.org/packages/cf/d2/9771eb65f1dc3925dbcfc7c4b2adaefa38e1549e4e4e75409df316f8c453/splitta-0.1.0.tar.gz" } ] }