{
"info": {
"author": "Alberto Pettarin",
"author_email": "alberto@albertopettarin.it",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 2 - Pre-Alpha",
"Environment :: Console",
"Intended Audience :: Developers",
"Intended Audience :: Education",
"Intended Audience :: End Users/Desktop",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: GNU Affero General Public License v3",
"Natural Language :: English",
"Operating System :: MacOS :: MacOS X",
"Operating System :: Microsoft :: Windows",
"Operating System :: POSIX :: Linux",
"Programming Language :: C",
"Programming Language :: Python",
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Topic :: Education",
"Topic :: Multimedia",
"Topic :: Multimedia :: Sound/Audio",
"Topic :: Multimedia :: Sound/Audio :: Analysis",
"Topic :: Multimedia :: Sound/Audio :: Speech",
"Topic :: Printing",
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Mathematics",
"Topic :: Software Development :: Libraries :: Python Modules",
"Topic :: Text Processing",
"Topic :: Text Processing :: Linguistic",
"Topic :: Text Processing :: Markup",
"Topic :: Text Processing :: Markup :: HTML",
"Topic :: Text Processing :: Markup :: XML",
"Topic :: Utilities"
],
"description": "lachesis\n========\n\n**lachesis** automates the segmentation of a transcript into closed\ncaptions\n\n- Version: 0.0.3\n- Date: 2017-01-26\n- Developed by: `Alberto Pettarin `__\n- License: the GNU Affero General Public License Version 3 (AGPL v3)\n- Contact: info@readbeyond.it\n\n**DO NOT USE THIS PACKAGE IN PRODUCTION UNTIL IT REACHES v1.0.0 !!!**\n\nGoal\n----\n\n**lachesis** automates the segmentation of a transcript into closed\ncaptions (CCs).\n\nThe general idea is that writing a transcription (raw text) is easier\nand faster than writing CCs, especially if you need to respect\nconstraints like a certain minimum/maximum number of characters per\nline, a maximum number of lines per CC, etc.\n\nYou can transcribe your video into raw text and ``lachesis`` will take\non the job of segmenting the text into CCs for you. Once you have the\nCCs, you can use a `forced\naligner `__ like\n`aeneas `__ to align them with\nthe audio of your video, obtaining a subtitle file (SRT, TTML, VTT,\netc.).\n\nWith ``lachesis`` and a forced aligner, the manual labor for producing\nCCs for a video is reduced to (a) transcribing the video in raw text\nform, and (b) checking the final CCs and audio alignment. Instead of\ntranscribing from scratch, you can even start by checking/editing a\nrough transcription made by an automated speech recognition engine, like\nthe \"automatic CCs\" from YouTube, speeding the process up further.\n\nThe \"magic\" behind ``lachesis`` consists of combining machine learning\ntechniques like `conditional random\nfields `__ (CRF)\nand classical NLP tools like `POS\ntagging `__ and\n`sentence\nsegmentation `__ to\nsplit the text into CC lines. The machine learning models are trained\non existing, manually-edited, high-quality CCs, like those of\n`TED `__/`TEDx `__\ntalks on YouTube. 
The NLP tools come from the well-established, free NLP\nlibraries for Python listed below.\n\nIn summary, ``lachesis`` contains the following major functions:\n\n- download closed captions from YouTube;\n- parse closed caption TTML files (downloaded from YouTube);\n- add POS tags to a given text or closed caption file;\n- segment a given text into sentences;\n- segment a given text into closed captions (several algorithms are\n available);\n- train and use machine learning models to segment raw text into CC\n lines.\n\nInstallation\n------------\n\n**DO NOT USE THIS PACKAGE IN PRODUCTION UNTIL IT REACHES v1.0.0 !!!**\n\n.. code:: bash\n\n pip install lachesis\n\nInstalling dependencies\n~~~~~~~~~~~~~~~~~~~~~~~\n\nYou might need additional packages, depending on how you plan to use\n``lachesis``:\n\n- ``lxml >= 3.6.0`` for reading or downloading TTML files;\n- ``youtube-dl >= 2017.1.16`` for downloading TTML files;\n- ``python-crfsuite >= 0.9.1`` for training and using CRF-based\n splitters.\n\nBy design choice, none of the above dependencies is installed by\n``pip install lachesis``. If you want to install them all, you can use:\n\n.. code:: bash\n\n pip install lachesis[full]\n\nAlternatively, manually install only the dependencies you need. (You can\ndo it before or after installing ``lachesis``, the order does not\nmatter.)\n\nInstalling NLP Libraries\n~~~~~~~~~~~~~~~~~~~~~~~~\n\nIn addition to the dependencies listed above, to perform POS tagging and\nsentence segmentation ``lachesis`` can use one or more of the following\nlibraries:\n\n- ``Pattern`` (install with ``pip install pattern``, `see\n here `__)\n- ``NLTK`` (install with ``pip install nltk``, `see\n here `__)\n- ``spaCy`` (install with ``pip install spacy``, `see\n here `__)\n- ``UDPipe`` (install with ``pip install ufal.udpipe``, `see\n here `__)\n\nIf you want to install them all, you can use:\n\n.. 
code:: bash\n\n pip install lachesis[nlp]\n\nor ``[fullnlp]`` if you also want ``[full]`` as above.\n\nEach NLP library also needs language models which you need to\ndownload/install separately. Consult the documentation of your NLP\nlibrary for details.\n\n``lachesis`` expects the following directories in your home directory\n(you can symlink them, if you installed each NLP library in a different\nplace):\n\n- ``~/lachesis_data/nltk_data`` for ``NLTK`` (`see\n here `__);\n- ``~/lachesis_data/spacy_data`` for ``spaCy`` (`see\n here `__);\n- ``~/lachesis_data/udpipe_data`` for ``UDPipe`` (`see\n here `__).\n\nThe NLP library ``Pattern`` does not need a separate download of its\nlanguage models, as they are bundled in the file you download when\ninstalling through ``pip install pattern``.\n\nThe following table summarizes the languages supported by each library\nin their standard language models pack. (Additional languages might be\nsupported by third party projects/downloads or added over time.)\n\n+-----------------------+-----------+--------+---------+----------+\n| Language / Library | Pattern | NLTK | spaCy | UDPipe |\n+=======================+===========+========+=========+==========+\n| Arabic | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Basque | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Bulgarian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Croatian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Czech | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Danish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Dutch | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| English | \u2713 | \u2713 | \u2713 | \u2713 
|\n+-----------------------+-----------+--------+---------+----------+\n| Estonian | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Finnish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| French | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| German | \u2713 | \u2713 | \u2713 | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Gothic | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Greek | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Greek (ancient) | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Hebrew | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Hindi | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Hungarian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Indonesian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Irish | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Italian | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Latin | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Norwegian | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Old Church Slavonic | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Persian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Polish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Portuguese | | \u2713 | | \u2713 
|\n+-----------------------+-----------+--------+---------+----------+\n| Romanian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Slovenian | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Spanish | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Swedish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Tamil | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Turkish | | \u2713 | | |\n+-----------------------+-----------+--------+---------+----------+\n\nUsage\n-----\n\nDownload closed captions from YouTube\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from lachesis.downloaders import Downloader\n from lachesis.language import Language\n\n # set URL of the video and language of the CCs\n url = u\"http://www.youtube.com/watch?v=NSL_xx2Qnyc\"\n language = Language.ENGLISH\n\n # download automatic CC, do not save to file\n options = { \"auto\": True }\n doc = Downloader.download_closed_captions(url, language, options)\n print(doc)\n\n # download manually-edited CC, saving the raw TTML file to disk\n options = { \"auto\": False, \"output_file_path\": \"/tmp/ccs.ttml\" }\n doc = Downloader.download_closed_captions(url, language, options)\n print(doc)\n\nParse an existing TTML file downloaded from YouTube\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: python\n\n from lachesis.downloaders import Downloader\n\n # parse a given TTML file downloaded from YouTube\n ifp = \"/tmp/ccs.ttml\"\n doc = Downloader.read_closed_captions(ifp, options={u\"downloader\": u\"youtube\"})\n print(doc.language)\n\n # print several representations of the CCs\n print(doc.raw_string) # multi-line string, similar to SRT but w/o ids or times\n print(doc.raw_flat_clean_string) # single-line string, w/o CC line marks\n print(doc.raw.string(flat=True, eol=u\"|\")) # single-line string, CC lines separated by '|' characters\n\nTokenize, split sentences, and tag POS\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from lachesis.elements import Document\n from lachesis.language import Language\n from lachesis.nlpwrappers import NLPEngine\n\n # work on this Unicode string\n s = u\"Hello, World. This is a second sentence, with a comma too! And a third sentence.\"\n\n # but you can also pass a list with pre-split sentences\n # s = [u\"Hello World.\", u\"This is a second sentence.\", u\"Third one, bla bla\"]\n\n # create a Document object from the Unicode string\n doc = Document(raw=s, language=Language.ENGLISH)\n\n # tokenize, split sentences, and tag POS\n # the best available NLP library will be chosen\n nlp1 = NLPEngine()\n nlp1.analyze(doc)\n\n # the text has been divided into tokens, grouped into sentences\n for sentence in doc.sentences:\n print(sentence) # raw\n print(sentence.string(tagged=True)) # tagged\n print(sentence.string(raw=True, eol=u\"|\", eos=u\"\")) # raw w/o CC line and sentence marks\n\n # explicitly specify the NLP library NLTK,\n # other options include: \"pattern\", \"spacy\", \"udpipe\"\n nlp2 = NLPEngine()\n nlp2.analyze(doc, wrapper=u\"nltk\")\n ...\n\n # if you need to analyze many documents,\n # preload (and keep in cache) an NLP library,\n # even different ones for different languages\n nlp3 = NLPEngine(preload=[\n (u\"en\", u\"spacy\"),\n (u\"de\", u\"nltk\"),\n (u\"it\", u\"pattern\"),\n (u\"fr\", u\"udpipe\")\n 
])\n nlp3.analyze(doc)\n ...\n\nSplit into closed captions\n~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from lachesis.elements import Document\n from lachesis.language import Language\n from lachesis.nlpwrappers import NLPEngine\n from lachesis.splitters import CRFSplitter\n from lachesis.splitters import GreedySplitter\n\n # create a document from a raw string\n s = u\"Hello, World. This is a second sentence, with a comma too! And a third sentence.\"\n doc = Document(raw=s, language=Language.ENGLISH)\n\n # analyze it using the NLP library Pattern\n nlpe = NLPEngine()\n nlpe.analyze(doc, wrapper=u\"pattern\")\n\n # feed the document into the CRF splitter (max 42 chars/line, max 2 lines/cc)\n spl = CRFSplitter(doc.language, 42, 2)\n spl.split(doc)\n\n # print the segmented CCs\n for cc in doc.ccs:\n for line in cc.elements:\n print(line)\n print(u\"\")\n\n # the default location for CRF model files is ~/lachesis_data/crf_data/\n # but you can also specify a different path\n spl = CRFSplitter(doc.language, 42, 2, model_file_path=\"/tmp/yourmodel.crfsuite\")\n spl.split(doc)\n\n # if you do not have pycrfsuite installed\n # or the CRF model file for the document language,\n # you can use the GreedySplitter\n gs = GreedySplitter(doc.language, 42, 2)\n gs.split(doc)\n\nTrain a CRF model to segment raw text into CC lines\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: bash\n\n $ # /tmp/ccs/train contains several TTML files to learn from\n $ # you can download them from YouTube using lachesis (see above)\n $ ls /tmp/ccs/train\n 0001.ttml\n 0002.ttml\n ...\n\n $ # extract features and labels from them:\n $ python -m lachesis.ml.crf dump eng /tmp/ccs/train/ /tmp/ccs/train.pickle\n ...\n\n $ # train the CRF model:\n $ python -m lachesis.ml.crf train eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite\n ...\n\n $ # evaluate the model on the training set\n $ python -m lachesis.ml.crf test eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite\n ...\n\n $ # you might want to evaluate on a test set, disjoint from the training set,\n $ # that is, the test set contains CCs not seen during training:\n $ ls /tmp/ccs/test\n 1001.ttml\n 1002.ttml\n ...\n $ python -m lachesis.ml.crf dump eng /tmp/ccs/test/ /tmp/ccs/test.pickle\n $ python -m lachesis.ml.crf test eng /tmp/ccs/test.pickle /tmp/ccs/model.crfsuite\n ...\n $ # now you can build a CRFSplitter\n $ # with model_file_path=\"/tmp/ccs/model.crfsuite\" as shown above\n\nTODO: decide and document where pre-trained model files can be\ndownloaded\n\nLicense\n-------\n\n**lachesis** is released under the terms of the GNU Affero General\nPublic License Version 3. See the `LICENSE `__ file for\ndetails.",
"description_content_type": null,
"docs_url": null,
"download_url": "UNKNOWN",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/readbeyond/lachesis",
"keywords": "ReadBeyond Sync,ReadBeyond,SBV,SRT,SSV,SUB,TSV,TTML,VTT,aeneas,captioning,captions,closed captions,forced alignment,lachesis,media overlay,speech to text,subtitles,sync,synchronization,transcript,video captions",
"license": "GNU Affero General Public License v3 (AGPL v3)",
"maintainer": null,
"maintainer_email": null,
"name": "lachesis",
"package_url": "https://pypi.org/project/lachesis/",
"platform": "UNKNOWN",
"project_url": "https://pypi.org/project/lachesis/",
"project_urls": {
"Download": "UNKNOWN",
"Homepage": "https://github.com/readbeyond/lachesis"
},
"release_url": "https://pypi.org/project/lachesis/0.0.3.0/",
"requires_dist": null,
"requires_python": null,
"summary": "lachesis automates the segmentation of a transcript into closed captions",
"version": "0.0.3.0"
},
"last_serial": 2600532,
"releases": {
"0.0.1.0": [
{
"comment_text": "",
"digests": {
"md5": "dd3c0fcf47181bfbb752245e3d6b3c92",
"sha256": "7fba2dad4ac30f2909d3d76e2ae7e932bf258dd6b0771663b05ee6bc793644a4"
},
"downloads": -1,
"filename": "lachesis-0.0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "dd3c0fcf47181bfbb752245e3d6b3c92",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 32962,
"upload_time": "2017-01-18T14:19:29",
"url": "https://files.pythonhosted.org/packages/58/20/96cd95cfc072164e96bd9d97411ae0e621fc2df3de4458425a5a0180097d/lachesis-0.0.1.0.tar.gz"
}
],
"0.0.3.0": [
{
"comment_text": "",
"digests": {
"md5": "f9e332a6964f8cf4b5fe055ade146ad5",
"sha256": "c1b9d6f6f6dd96582dc1e35cf3c59ed63dfd5223719ccdaed600eb6a311fcb39"
},
"downloads": -1,
"filename": "lachesis-0.0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "f9e332a6964f8cf4b5fe055ade146ad5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 48202,
"upload_time": "2017-01-26T21:14:44",
"url": "https://files.pythonhosted.org/packages/93/db/79b429b28a32f2485faf17f6adfdfce042ead6f305c02b4378644baece14/lachesis-0.0.3.0.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "f9e332a6964f8cf4b5fe055ade146ad5",
"sha256": "c1b9d6f6f6dd96582dc1e35cf3c59ed63dfd5223719ccdaed600eb6a311fcb39"
},
"downloads": -1,
"filename": "lachesis-0.0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "f9e332a6964f8cf4b5fe055ade146ad5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 48202,
"upload_time": "2017-01-26T21:14:44",
"url": "https://files.pythonhosted.org/packages/93/db/79b429b28a32f2485faf17f6adfdfce042ead6f305c02b4378644baece14/lachesis-0.0.3.0.tar.gz"
}
]
}