{ "info": { "author": "Chris Spencer", "author_email": "chrisspen@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# Punctuator\n\n[![](https://img.shields.io/pypi/v/punctuator.svg)](https://pypi.python.org/pypi/punctuator)\n[![Build Status](https://img.shields.io/travis/chrisspen/punctuator2.svg?branch=master)](https://travis-ci.org/chrisspen/punctuator)\n\nThis is a fork of [Ottokar Tilk's punctuator2](https://github.com/ottokart/punctuator2) cleaned up into a formal Python3 package with testing.\n\n**[DEMO](http://bark.phon.ioc.ee/punctuator)** and **[DEMO2](http://bark.phon.ioc.ee/punctuator/game)**\n\nA bidirectional recurrent neural network model with an attention mechanism for restoring missing inter-word punctuation in unsegmented text.\n\nThe model can be trained in two stages (the second stage is optional):\n\n1. The first stage is trained on punctuation-annotated text. Here the model learns to restore punctuation based on textual features only.\n2. The optional second stage can be trained on punctuation *and* pause annotated text. In this stage the model learns to combine pause durations with textual features and adapts to the target domain. If pauses are omitted, only adaptation is performed. The second stage with pause durations can be used, for example, to restore punctuation in the output of an automatic speech recognition system.\n\n# Installation\n\nTo install:\n\n virtualenv -p python3.7 .env\n . .env/bin/activate\n pip install punctuator\n\nAdditionally, you'll need a trained model. 
You can create your own following the instructions below, or you can use a pre-trained model from [here](https://drive.google.com/drive/folders/0B7BsN5f2F1fZQnFsbzJ3TWxxMms?usp=sharing).\n\nPlace these models in the `PUNCTUATOR_DATA_DIR` directory, which defaults to `~/.punctuator`.\n\nFor example, to download `Demo-Europarl-EN.pcl`, activate your virtual environment and run:\n\n . .env/bin/activate\n mkdir -p ~/.punctuator\n cd ~/.punctuator\n gdown https://drive.google.com/uc?id=0B7BsN5f2F1fZd1Q0aXlrUDhDbnM\n\nTo download other model files, find the Google Drive ID via the share link and substitute it into the command above.\n\n# Usage\n\nTo use from the command line:\n\n cat input.txt | python punctuator.py model.pcl output.txt\n\nTo use from Python:\n\n from punctuator import Punctuator\n p = Punctuator('model.pcl')\n print(p.punctuate('some text'))\n\n# How well does it work?\n\n* A working demo can be seen here: http://bark.phon.ioc.ee/punctuator\n* You can try to compete with this model here: http://bark.phon.ioc.ee/punctuator/game\n\nRemember that all the scores given below are on _unsegmented_ text and we did not use prosodic features, so, among other things, the model has to detect sentence boundaries in addition to the boundary type (?QUESTIONMARK, .PERIOD or !EXCLAMATIONMARK) based entirely on textual features. The scores are computed on the test set.\n\nTraining speed with default settings, an optimal Theano installation and a modern GPU should be around 10000 words per second.\n\nPretrained models can be downloaded [here](https://drive.google.com/drive/folders/0B7BsN5f2F1fZQnFsbzJ3TWxxMms?usp=sharing) (Demo + 2 models from the Interspeech paper).\n\n## English TED talks\nTraining set size: 2.1M words. First stage only. 
More details can be found in [this paper](http://www.isca-speech.org/archive/Interspeech_2016/pdfs/1517.PDF).\nFor comparison, our [previous model](https://github.com/ottokart/punctuator) got an overall F1-score of 50.8.\n\nPUNCTUATION | PRECISION | RECALL | F-SCORE\n--- | --- | --- | ---\n,COMMA | 64.4 | 45.2 | 53.1\n?QUESTIONMARK | 67.5 | 58.7 | 62.8\n.PERIOD | 72.3 | 71.5 | 71.9\n_Overall_ | _68.9_ | _58.1_ | _63.1_\n\n## English Europarl v7\nTraining set size: 40M words. First stage only. Details in [./example](https://github.com/ottokart/punctuator2/tree/master/example).\n\nYou can try to compete with this model [here](http://bark.phon.ioc.ee/punctuator/game).\n\nPUNCTUATION | PRECISION | RECALL | F-SCORE\n--- | --- | --- | ---\n?QUESTIONMARK | 77.7 | 73.2 | 75.4\n!EXCLAMATIONMARK | 50.0 | 0.1 | 0.1\n,COMMA | 68.9 | 72.0 | 70.4\n-DASH | 55.9 | 8.8 | 15.2\n:COLON | 60.9 | 23.8 | 34.2\n;SEMICOLON | 44.7 | 1.1 | 2.2\n.PERIOD | 84.7 | 84.1 | 84.4\n_Overall_ | _75.7_ | _73.9_ | _74.8_\n\n# Requirements\n* Python 3 (3.6 or 3.7)\n* NumPy\n* Theano\n\n# Requirements for data:\n\n* Cleaned text files for training and validation of the first phase model. Each punctuation symbol token must be surrounded by spaces.\n\n Example:\n ```to be ,COMMA or not to be ,COMMA that is the question .PERIOD```\n* *(Optional)* Pause annotated text files for training and validation of the second phase model. These should be cleaned in the same way as the first phase data. Pause durations in seconds should be marked after each word with a special tag ``. A punctuation mark, if any, must come after the pause tag.\n\n Example:\n ```to be ,COMMA or not to be ,COMMA that is the question .PERIOD```\n\n Second phase data can also omit pause annotations, in which case only target domain adaptation is performed.\n\nMake sure that the first words of sentences are not capitalized. This would give the model unfair hints about period locations. 
Also, the text files you use for training and validation must be large enough (at least minibatch_size x sequence_length of words, which is 128x50=6400 words with default settings), otherwise you might get an error.\n\n# Configuration\nVocabulary size, punctuation tokens and their mappings, and converted data location can be configured in the header of data.py.\nSome model hyperparameters can be configured in the headers of main.py and main2.py. Learning rate and hidden layer size can be passed as arguments.\n\n# Usage\n\nThe first step is data conversion. Assuming that preprocessed and cleaned *.train.txt, *.dev.txt and *.test.txt files are located in ``, the conversion can be initiated with:\n\n`python data.py `\n\nIf you have second stage data as well, then:\n\n`python data.py `\n\n\n\nThe first stage can be trained with:\n\n`python main.py `\n\ne.g. `python main.py 256 0.02` works well.\n\n\n\nThe second stage can be trained with:\n\n`python main2.py `\n\n\n\nPreprocessed text can be punctuated with, e.g.:\n\n`cat data.dev.txt | python punctuator.py `\n\nor, if pause annotations are present in data.dev.txt and you have a second stage model trained on pause annotated data, then:\n\n`cat data.dev.txt | python punctuator.py 1`\n\nPunctuation tokens in data.dev.txt don't have to be removed - the punctuator.py script ignores them.\n\n\nError statistics in this example can be computed with:\n\n`python error_calculator.py data.dev.txt `\n\n\nYou can play with a trained model with (assumes the input text is preprocessed in the same way as the training data):\n\n`python play_with_model.py `\n\nor with:\n\n`python play_with_model.py 1`\n\nif you want to see which words the model sees as UNKs (OOVs).\n\n# Development\n\nRun all tests with:\n\n export TESTNAME=; tox\n\nRun a specific test in a specific environment with:\n\n export TESTNAME=.test_punctuate; tox -e py37", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, 
"last_month": -1, "last_week": -1 }, "home_page": "https://github.com/chrisspen/punctuator2", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "punctuator", "package_url": "https://pypi.org/project/punctuator/", "platform": "", "project_url": "https://pypi.org/project/punctuator/", "project_urls": { "Homepage": "https://github.com/chrisspen/punctuator2" }, "release_url": "https://pypi.org/project/punctuator/0.9.5/", "requires_dist": null, "requires_python": "", "summary": "Adds punctuation to a block of text.", "version": "0.9.5" }, "last_serial": 5926110, "releases": { "0.9.2": [ { "comment_text": "", "digests": { "md5": "00078ae74efef14cb98e69cbda9a6b0a", "sha256": "a0e28acc9120f6ac8b592bd2d97cb0c5cf123b6094b63653d81afe200b744b94" }, "downloads": -1, "filename": "punctuator-0.9.2.tar.gz", "has_sig": false, "md5_digest": "00078ae74efef14cb98e69cbda9a6b0a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21974, "upload_time": "2019-09-27T00:25:56", "url": "https://files.pythonhosted.org/packages/e4/37/2c6cfbf956a9f5a92ac9b76011bc25ab07b0ec101ae87f8fbbae9566e9b8/punctuator-0.9.2.tar.gz" } ], "0.9.3": [ { "comment_text": "", "digests": { "md5": "a98cd6e964c854aca4848efc667e9ac2", "sha256": "b123b851f36cb65e829b67b0a51fbf1dc84943e503dd01faf0b90cc3a2a5a771" }, "downloads": -1, "filename": "punctuator-0.9.3.tar.gz", "has_sig": false, "md5_digest": "a98cd6e964c854aca4848efc667e9ac2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22292, "upload_time": "2019-09-29T05:25:44", "url": "https://files.pythonhosted.org/packages/b0/85/3cb5fc510a99ea3dbef58fba6898d784bb3cacbbe3cf89b21a451c12b764/punctuator-0.9.3.tar.gz" } ], "0.9.4": [ { "comment_text": "", "digests": { "md5": "c0263c4fe5a0dda12b336876211dbaf6", "sha256": "0c92416d48bed3b32f8d45f6b1438372692ec01fa1ed12722df78d75d4dfacf3" }, "downloads": -1, "filename": "punctuator-0.9.4.tar.gz", "has_sig": false, 
"md5_digest": "c0263c4fe5a0dda12b336876211dbaf6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22295, "upload_time": "2019-10-04T00:47:25", "url": "https://files.pythonhosted.org/packages/87/ab/d424d01d5e9269ab15592c2b4cf2d8cade523ecc3a0f038a7fcd19b886a2/punctuator-0.9.4.tar.gz" } ], "0.9.5": [ { "comment_text": "", "digests": { "md5": "218d6220343596e879957c2a69e23d8a", "sha256": "119f4c75d4ce00dd3cfcb5b99fd5cc6b9f7331f4bb5884911f96791caf5058ce" }, "downloads": -1, "filename": "punctuator-0.9.5.tar.gz", "has_sig": false, "md5_digest": "218d6220343596e879957c2a69e23d8a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22287, "upload_time": "2019-10-04T00:54:54", "url": "https://files.pythonhosted.org/packages/8e/1a/1b86aac001c282d60829ebb8b0b2e5d3b37d081df0bdc1ca57ef93d31a96/punctuator-0.9.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "218d6220343596e879957c2a69e23d8a", "sha256": "119f4c75d4ce00dd3cfcb5b99fd5cc6b9f7331f4bb5884911f96791caf5058ce" }, "downloads": -1, "filename": "punctuator-0.9.5.tar.gz", "has_sig": false, "md5_digest": "218d6220343596e879957c2a69e23d8a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22287, "upload_time": "2019-10-04T00:54:54", "url": "https://files.pythonhosted.org/packages/8e/1a/1b86aac001c282d60829ebb8b0b2e5d3b37d081df0bdc1ca57ef93d31a96/punctuator-0.9.5.tar.gz" } ] }