{
"info": {
"author": "Alberto Pettarin",
"author_email": "alberto@albertopettarin.it",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 2 - Pre-Alpha",
"Environment :: Console",
"Intended Audience :: Developers",
"Intended Audience :: Education",
"Intended Audience :: End Users/Desktop",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: GNU Affero General Public License v3",
"Natural Language :: English",
"Operating System :: MacOS :: MacOS X",
"Operating System :: Microsoft :: Windows",
"Operating System :: POSIX :: Linux",
"Programming Language :: C",
"Programming Language :: Python",
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Topic :: Education",
"Topic :: Multimedia",
"Topic :: Multimedia :: Sound/Audio",
"Topic :: Multimedia :: Sound/Audio :: Analysis",
"Topic :: Multimedia :: Sound/Audio :: Speech",
"Topic :: Printing",
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Mathematics",
"Topic :: Software Development :: Libraries :: Python Modules",
"Topic :: Text Processing",
"Topic :: Text Processing :: Linguistic",
"Topic :: Text Processing :: Markup",
"Topic :: Text Processing :: Markup :: HTML",
"Topic :: Text Processing :: Markup :: XML",
"Topic :: Utilities"
],
"description": "lachesis\n========\n\n**lachesis** automates the segmentation of a transcript into closed\ncaptions\n\n- Version: 0.0.3\n- Date: 2017-01-26\n- Developed by: `Alberto Pettarin `__\n- License: the GNU Affero General Public License Version 3 (AGPL v3)\n- Contact: info@readbeyond.it\n\n**DO NOT USE THIS PACKAGE IN PRODUCTION UNTIL IT REACHES v1.0.0 !!!**\n\nGoal\n----\n\n**lachesis** automates the segmentation of a transcript into closed\ncaptions (CCs).\n\nThe general idea is that writing a transcription (raw text) is easier\nand faster than writing CCs, especially if you need to respect\nconstraints like a certain minimum/maximum number of characters per\nline, a maximum number of lines per CC, etc.\n\nYou can transcribe your video into raw text and ``lachesis`` will take\non the job of segmenting the text into CCs for you. Once you have the\nCCs, you can use a `forced\naligner `__ like\n`aeneas `__ to align them with\nthe audio of your video, obtaining a subtitle file (SRT, TTML, VTT,\netc.).\n\nWith ``lachesis`` and a forced aligner, the manual labor for producing\nCCs for a video is reduced to (a) transcribing the video in raw text\nform, and (b) checking the final CCs and audio alignment. Instead of\ntranscribing from scratch, you can even start by checking/editing a\nrough transcription made by an automated speech recognition engine, like\nthe \"automatic CCs\" from YouTube, speeding the process up further.\n\nThe \"magic\" behind ``lachesis`` consists of combining machine learning\ntechniques like `conditional random\nfields `__ (CRF)\nand classical NLP tools like `POS\ntagging `__ and\n`sentence\nsegmentation `__ to\nsplit the text into CC lines. The machine learning models are trained\non existing, manually-edited, high-quality CCs, like those of\n`TED `__/`TEDx `__\ntalks on YouTube. 
The NLP tools come from the well-established, free NLP\nlibraries for Python listed below.\n\nIn summary, ``lachesis`` contains the following major functions:\n\n- download closed captions from YouTube;\n- parse closed caption TTML files (downloaded from YouTube);\n- add POS tags to a given text or closed caption file;\n- segment a given text into sentences;\n- segment a given text into closed captions (several algorithms are\n available);\n- train and use machine learning models to segment raw text into CC\n lines.\n\nInstallation\n------------\n\n**DO NOT USE THIS PACKAGE IN PRODUCTION UNTIL IT REACHES v1.0.0 !!!**\n\n.. code:: bash\n\n pip install lachesis\n\nInstalling dependencies\n~~~~~~~~~~~~~~~~~~~~~~~\n\nYou might need additional packages, depending on how you plan to use\n``lachesis``:\n\n- ``lxml >= 3.6.0`` for reading or downloading TTML files;\n- ``youtube-dl >= 2017.1.16`` for downloading TTML files;\n- ``python-crfsuite >= 0.9.1`` for training and using CRF-based\n splitters.\n\nBy design choice, none of the above dependencies is installed by\n``pip install lachesis``. If you want to install them all, you can use:\n\n.. code:: bash\n\n pip install lachesis[full]\n\nAlternatively, manually install only the dependencies you need. (You can\ndo it before or after installing ``lachesis``, the order does not\nmatter.)\n\nInstalling NLP Libraries\n~~~~~~~~~~~~~~~~~~~~~~~~\n\nIn addition to the dependencies listed above, to perform POS tagging and\nsentence segmentation ``lachesis`` can use one or more of the following\nlibraries:\n\n- ``Pattern`` (install with ``pip install pattern``, `see\n here `__)\n- ``NLTK`` (install with ``pip install nltk``, `see\n here `__)\n- ``spaCy`` (install with ``pip install spacy``, `see\n here `__)\n- ``UDPipe`` (install with ``pip install ufal.udpipe``, `see\n here `__)\n\nIf you want to install them all, you can use:\n\n.. 
code:: bash\n\n pip install lachesis[nlp]\n\nor ``[fullnlp]`` if you also want ``[full]`` as above.\n\nEach NLP library also needs language models which you need to\ndownload/install separately. Consult the documentation of your NLP\nlibrary for details.\n\n``lachesis`` expects the following directories in your home directory\n(you can symlink them, if you installed each NLP library in a different\nplace):\n\n- ``~/lachesis_data/nltk_data`` for ``NLTK`` (`see\n here `__);\n- ``~/lachesis_data/spacy_data`` for ``spaCy`` (`see\n here `__);\n- ``~/lachesis_data/udpipe_data`` for ``UDPipe`` (`see\n here `__).\n\nThe NLP library ``Pattern`` does not need a separate download of its\nlanguage models, as they are bundled in the file you download when\ninstalling through ``pip install pattern``.\n\nThe following table summarizes the languages supported by each library\nin their standard language models pack. (Additional languages might be\nsupported by third party projects/downloads or added over time.)\n\n+-----------------------+-----------+--------+---------+----------+\n| Language / Library | Pattern | NLTK | spaCy | UDPipe |\n+=======================+===========+========+=========+==========+\n| Arabic | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Basque | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Bulgarian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Croatian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Czech | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Danish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Dutch | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| English | \u2713 | \u2713 | \u2713 | \u2713 
|\n+-----------------------+-----------+--------+---------+----------+\n| Estonian | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Finnish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| French | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| German | \u2713 | \u2713 | \u2713 | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Gothic | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Greek | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Greek (ancient) | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Hebrew | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Hindi | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Hungarian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Indonesian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Irish | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Italian | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Latin | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Norwegian | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Old Church Slavonic | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Persian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Polish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Portuguese | | \u2713 | | \u2713 
|\n+-----------------------+-----------+--------+---------+----------+\n| Romanian | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Slovenian | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Spanish | \u2713 | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Swedish | | \u2713 | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Tamil | | | | \u2713 |\n+-----------------------+-----------+--------+---------+----------+\n| Turkish | | \u2713 | | |\n+-----------------------+-----------+--------+---------+----------+\n\nUsage\n-----\n\nDownload closed captions from YouTube\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from lachesis.downloaders import Downloader\n from lachesis.language import Language\n\n # set URL of the video and language of the CCs\n url = u\"http://www.youtube.com/watch?v=NSL_xx2Qnyc\"\n language = Language.ENGLISH\n\n # download automatic CC, do not save to file\n options = { \"auto\": True }\n doc = Downloader.download_closed_captions(url, language, options)\n print(doc)\n\n # download manually-edited CC, saving the raw TTML file to disk\n options = { \"auto\": False, \"output_file_path\": \"/tmp/ccs.ttml\" }\n doc = Downloader.download_closed_captions(url, language, options)\n print(doc)\n\nParse an existing TTML file downloaded from YouTube\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: python\n\n from lachesis.downloaders import Downloader\n\n # parse a given TTML file downloaded from YouTube\n ifp = \"/tmp/ccs.ttml\"\n doc = Downloader.read_closed_captions(ifp, options={u\"downloader\": u\"youtube\"})\n print(doc.language)\n\n # print several representations of the CCs\n print(doc.raw_string) # multi-line string, similar to SRT but w/o ids or times\n print(doc.raw_flat_clean_string) # single-line string, w/o CC line marks\n print(doc.raw.string(flat=True, eol=u\"|\")) # single-line string, CC lines separated by '|' characters\n\nTokenize, split sentences, and tag POS\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from lachesis.elements import Document\n from lachesis.language import Language\n from lachesis.nlpwrappers import NLPEngine\n\n # work on this Unicode string\n s = u\"Hello, World. This is a second sentence, with a comma too! And a third sentence.\"\n\n # but you can also pass a list with pre-split sentences\n # s = [u\"Hello World.\", u\"This is a second sentence.\", u\"Third one, bla bla\"]\n\n # create a Document object from the Unicode string\n doc = Document(raw=s, language=Language.ENGLISH)\n\n # tokenize, split sentences, and tag POS\n # the best available NLP library will be chosen\n nlp1 = NLPEngine()\n nlp1.analyze(doc)\n\n # the text has been divided into tokens, grouped into sentences\n for sentence in doc.sentences:\n print(sentence) # raw\n print(sentence.string(tagged=True)) # tagged\n print(sentence.string(raw=True, eol=u\"|\", eos=u\"\")) # raw w/o CC line and sentence marks\n\n # explicitly specify the NLP library NLTK,\n # other options include: \"pattern\", \"spacy\", \"udpipe\"\n nlp2 = NLPEngine()\n nlp2.analyze(doc, wrapper=u\"nltk\")\n ...\n\n # if you need to analyze many documents,\n # preload (and keep in cache) an NLP library,\n # even different ones for different languages\n nlp3 = NLPEngine(preload=[\n (u\"en\", u\"spacy\"),\n (u\"de\", u\"nltk\"),\n (u\"it\", u\"pattern\"),\n (u\"fr\", u\"udpipe\")\n 
])\n nlp3.analyze(doc)\n ...\n\nSplit into closed captions\n~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from lachesis.elements import Document\n from lachesis.language import Language\n from lachesis.nlpwrappers import NLPEngine\n from lachesis.splitters import CRFSplitter\n from lachesis.splitters import GreedySplitter\n\n # create a document from a raw string\n s = u\"Hello, World. This is a second sentence, with a comma too! And a third sentence.\"\n doc = Document(raw=s, language=Language.ENGLISH)\n\n # analyze it using the NLP library Pattern\n nlpe = NLPEngine()\n nlpe.analyze(doc, wrapper=u\"pattern\")\n\n # feed the document into the CRF splitter (max 42 chars/line, max 2 lines/cc)\n spl = CRFSplitter(doc.language, 42, 2)\n spl.split(doc)\n\n # print the segmented CCs\n for cc in doc.ccs:\n for line in cc.elements:\n print(line)\n print(u\"\")\n\n # the default location for CRF model files is ~/lachesis_data/crf_data/\n # but you can also specify a different path\n spl = CRFSplitter(doc.language, 42, 2, model_file_path=\"/tmp/yourmodel.crfsuite\")\n spl.split(doc)\n\n # if you do not have pycrfsuite installed\n # or the CRF model file for the document language,\n # you can use the GreedySplitter\n gs = GreedySplitter(doc.language, 42, 2)\n gs.split(doc)\n\nTrain a CRF model to segment raw text into CC lines\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: bash\n\n $ # /tmp/ccs/train contains several TTML files to learn from\n $ # you can download them from YouTube using lachesis (see above)\n $ ls /tmp/ccs/train\n 0001.ttml\n 0002.ttml\n ...\n\n $ # extract features and labels from them:\n $ python -m lachesis.ml.crf dump eng /tmp/ccs/train/ /tmp/ccs/train.pickle\n ...\n\n $ # train the CRF model:\n $ python -m lachesis.ml.crf train eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite\n ...\n\n $ # evaluate the model on the training set\n $ python -m lachesis.ml.crf test eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite\n ...\n\n $ # you might want to evaluate on a test set, disjoint from the training set,\n $ # that is, the test set contains CCs not seen during training:\n $ ls /tmp/ccs/test\n 1001.ttml\n 1002.ttml\n ...\n $ python -m lachesis.ml.crf dump eng /tmp/ccs/test/ /tmp/ccs/test.pickle\n $ python -m lachesis.ml.crf test eng /tmp/ccs/test.pickle /tmp/ccs/model.crfsuite\n ...\n $ # now you can build a CRFSplitter\n $ # with model_file_path=\"/tmp/ccs/model.crfsuite\" as shown above\n\nTODO: decide and document where pre-trained model files can be\ndownloaded\n\nLicense\n-------\n\n**lachesis** is released under the terms of the GNU Affero General\nPublic License Version 3. See the `LICENSE `__ file for\ndetails.",
"description_content_type": null,
"docs_url": null,
"download_url": "UNKNOWN",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/readbeyond/lachesis",
"keywords": "ReadBeyond Sync,ReadBeyond,SBV,SRT,SSV,SUB,TSV,TTML,VTT,aeneas,captioning,captions,closed captions,forced alignment,lachesis,media overlay,speech to text,subtitles,sync,synchronization,transcript,video captions",
"license": "GNU Affero General Public License v3 (AGPL v3)",
"maintainer": null,
"maintainer_email": null,
"name": "lachesis",
"package_url": "https://pypi.org/project/lachesis/",
"platform": "UNKNOWN",
"project_url": "https://pypi.org/project/lachesis/",
"project_urls": {
"Download": "UNKNOWN",
"Homepage": "https://github.com/readbeyond/lachesis"
},
"release_url": "https://pypi.org/project/lachesis/0.0.3.0/",
"requires_dist": null,
"requires_python": null,
"summary": "lachesis automates the segmentation of a transcript into closed captions",
"version": "0.0.3.0"
},
"last_serial": 2600532,
"releases": {
"0.0.1.0": [
{
"comment_text": "",
"digests": {
"md5": "dd3c0fcf47181bfbb752245e3d6b3c92",
"sha256": "7fba2dad4ac30f2909d3d76e2ae7e932bf258dd6b0771663b05ee6bc793644a4"
},
"downloads": -1,
"filename": "lachesis-0.0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "dd3c0fcf47181bfbb752245e3d6b3c92",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 32962,
"upload_time": "2017-01-18T14:19:29",
"url": "https://files.pythonhosted.org/packages/58/20/96cd95cfc072164e96bd9d97411ae0e621fc2df3de4458425a5a0180097d/lachesis-0.0.1.0.tar.gz"
}
],
"0.0.3.0": [
{
"comment_text": "",
"digests": {
"md5": "f9e332a6964f8cf4b5fe055ade146ad5",
"sha256": "c1b9d6f6f6dd96582dc1e35cf3c59ed63dfd5223719ccdaed600eb6a311fcb39"
},
"downloads": -1,
"filename": "lachesis-0.0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "f9e332a6964f8cf4b5fe055ade146ad5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 48202,
"upload_time": "2017-01-26T21:14:44",
"url": "https://files.pythonhosted.org/packages/93/db/79b429b28a32f2485faf17f6adfdfce042ead6f305c02b4378644baece14/lachesis-0.0.3.0.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "f9e332a6964f8cf4b5fe055ade146ad5",
"sha256": "c1b9d6f6f6dd96582dc1e35cf3c59ed63dfd5223719ccdaed600eb6a311fcb39"
},
"downloads": -1,
"filename": "lachesis-0.0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "f9e332a6964f8cf4b5fe055ade146ad5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 48202,
"upload_time": "2017-01-26T21:14:44",
"url": "https://files.pythonhosted.org/packages/93/db/79b429b28a32f2485faf17f6adfdfce042ead6f305c02b4378644baece14/lachesis-0.0.3.0.tar.gz"
}
]
}