{ "info": { "author": "Jonathan Raiman", "author_email": "jraiman at mit dot edu", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Science/Research", "Operating System :: OS Independent", "Programming Language :: Python :: 3.3", "Topic :: Text Processing :: Linguistic" ], "description": "epub conversion\n---------------\n\nCreate text corpora from epubs and wiki dumps.\nThis Python package provides a Converter that turns epub and XML (wiki dump) files into text, lines, or Python generators.\n\nUsage:\n------\n\n### Epub usage\n\n#### Book by book\n\nTo convert epubs to text files, first create a converter object:\n\n\tfrom epub_conversion import Converter\n\n\tconverter = Converter(\"my_ebooks_folder/\")\n\nThen use this converter to concatenate all the text within the ebooks into a single large text file:\n\n\tconverter.convert(\"my_succinct_text_file.gz\")\n\n#### Line by line\n\nYou can also proceed line by line:\n\n\tfrom epub_conversion.utils import open_book, convert_epub_to_lines\n\n\tbook = open_book(\"twilight.epub\")\n\n\tlines = convert_epub_to_lines(book)\n\n### Wikidump usage\n\n#### Redirections\n\nSuppose you are interested in all redirections in a Wikipedia dump file\nthat is still compressed. You can open the dump as follows:\n\n\n\timport epub_conversion.wiki_decoder\n\n\twiki = epub_conversion.wiki_decoder.almost_smart_open(\"enwiki.bz2\")\n\n\nTaking this dump as our **input**, let us now use a generator to collect all pairs of `title` and `redirection title` in this dump:\n\n\tredirections = {redirect_from: redirect_to\n\t\tfor redirect_from, redirect_to in epub_conversion.wiki_decoder.get_redirection_list(wiki)\n\t}\n\n#### Page text\n\nSuppose you are interested only in the lines within each page's text section:\n\n\n\tfor line in epub_conversion.wiki_decoder.convert_wiki_to_lines(wiki):\n\t\tprocess_line(line)  # process_line is your own function\n\n\nSee Also:\n---------\n\n* [Wikipedia NER](https://github.com/JonathanRaiman/wikipedia_ner) a Python module that uses `epub_conversion` to process Wikipedia dumps and output only the 
lines that contain page to page links, with the link anchor texts extracted, and all markup removed.", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/JonathanRaiman/epub_conversion", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/JonathanRaiman/epub_conversion", "keywords": "XML,epub,tokenization,NLP", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "epub-conversion", "package_url": "https://pypi.org/project/epub-conversion/", "platform": "any", "project_url": "https://pypi.org/project/epub-conversion/", "project_urls": { "Download": "https://github.com/JonathanRaiman/epub_conversion", "Homepage": "https://github.com/JonathanRaiman/epub_conversion" }, "release_url": "https://pypi.org/project/epub-conversion/1.0.7/", "requires_dist": null, "requires_python": null, "summary": "Python package for converting xml and epubs to text files", "version": "1.0.7" }, "last_serial": 1625535, "releases": { "1.0.1": [ { "comment_text": "", "digests": { "md5": "86420bccbab1e17daab03b77e9ed8e3d", "sha256": "6fa2efd73846de0e5ec04744934b9b9b851185a1b809627da31b16054b2f5842" }, "downloads": -1, "filename": "epub-conversion-1.0.1.tar.gz", "has_sig": false, "md5_digest": "86420bccbab1e17daab03b77e9ed8e3d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4811, "upload_time": "2014-10-27T18:25:52", "url": "https://files.pythonhosted.org/packages/22/c3/f4f994d0012a4dd587db28c3f09e88bb1baf11608a46864cf31887c02559/epub-conversion-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "df640e1d847b451b20ae443e3f7d8400", "sha256": "ef5b1f16d9f083245f785262e3fa894a910f18631f8dde9ec994efab03c0f02d" }, "downloads": -1, "filename": "epub-conversion-1.0.2.tar.gz", "has_sig": false, "md5_digest": "df640e1d847b451b20ae443e3f7d8400", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4805, 
"upload_time": "2014-12-22T11:34:53", "url": "https://files.pythonhosted.org/packages/34/2e/ef61576a6e43f6ba0156dab7298503ffe0216539851876e055ea33bfbfae/epub-conversion-1.0.2.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "2267d45605163060b926d376b2bb67aa", "sha256": "92805283d1033956809064694703dce6bad7de505a40666d100851b22058cf62" }, "downloads": -1, "filename": "epub-conversion-1.0.4.tar.gz", "has_sig": false, "md5_digest": "2267d45605163060b926d376b2bb67aa", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4803, "upload_time": "2014-12-22T12:05:45", "url": "https://files.pythonhosted.org/packages/07/ee/5b85d529917ab3753b0372d9cbbafe529074ac0e7c6164f591bdd48b599b/epub-conversion-1.0.4.tar.gz" } ], "1.0.5": [ { "comment_text": "", "digests": { "md5": "e6eca75350c77f660d5e8284dcb7423d", "sha256": "cf039be8abf112b2eb8faee252afe783c6ca913f60f79546f03b3cf4bb64c893" }, "downloads": -1, "filename": "epub-conversion-1.0.5.tar.gz", "has_sig": false, "md5_digest": "e6eca75350c77f660d5e8284dcb7423d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5715, "upload_time": "2014-12-28T11:39:05", "url": "https://files.pythonhosted.org/packages/48/20/42693b1e6e981f9bd640cf4c0386eadf25ea0dc5107bdff53368084240a8/epub-conversion-1.0.5.tar.gz" } ], "1.0.6": [ { "comment_text": "", "digests": { "md5": "e0d88766780146c2c24d11d80ead1b3e", "sha256": "91e9b169fbd035d7ed0628a9652d1f5e7e6495996efb3f83a5ed141f09594310" }, "downloads": -1, "filename": "epub-conversion-1.0.6.tar.gz", "has_sig": false, "md5_digest": "e0d88766780146c2c24d11d80ead1b3e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5727, "upload_time": "2014-12-28T11:42:17", "url": "https://files.pythonhosted.org/packages/3b/13/4c972b720edc161be18d7c97e36bb060806ca8e126b7ac142b43a5f3bad9/epub-conversion-1.0.6.tar.gz" } ], "1.0.7": [ { "comment_text": "", "digests": { "md5": 
"856600f2879ac1041ec071a557ff9a90", "sha256": "db0fb6c6878ffbee84e14428eb2a5a69cddfe257c29435a3cc9eefdec743b89b" }, "downloads": -1, "filename": "epub-conversion-1.0.7.tar.gz", "has_sig": false, "md5_digest": "856600f2879ac1041ec071a557ff9a90", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5715, "upload_time": "2015-07-09T00:50:36", "url": "https://files.pythonhosted.org/packages/a7/75/d312eb095d498777cdfc818f938c8b44ef49c70dc84aea6b7aacbadbb505/epub-conversion-1.0.7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "856600f2879ac1041ec071a557ff9a90", "sha256": "db0fb6c6878ffbee84e14428eb2a5a69cddfe257c29435a3cc9eefdec743b89b" }, "downloads": -1, "filename": "epub-conversion-1.0.7.tar.gz", "has_sig": false, "md5_digest": "856600f2879ac1041ec071a557ff9a90", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5715, "upload_time": "2015-07-09T00:50:36", "url": "https://files.pythonhosted.org/packages/a7/75/d312eb095d498777cdfc818f938c8b44ef49c70dc84aea6b7aacbadbb505/epub-conversion-1.0.7.tar.gz" } ] }