{ "info": { "author": "", "author_email": "mohsenikiasari@ce.sharif.edu", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "![](https://i.ibb.co/sCJXhmz/header-sp.png)\n![](https://img.shields.io/apm/l/vim-mode.svg)\n\n\n# gutenberg-cleaner\n\nA Python package for cleaning Project Gutenberg books and datasets.\n\n### Prerequisites\nThe nltk package.\n\n### Installing\n```\n[sudo] pip install gutenberg-cleaner\n```\n\n## How to use it?\n\nIt provides two functions, \"simple_cleaner\" and \"super_cleaner\".\n### simple_cleaner:\nRemoves only the lines that are part of the Project Gutenberg header or footer.\nIt does not go deeper into the text to remove other things such as titles or footnotes.\n```\nsimple_cleaner(book: str) -> str\n```\n### super_cleaner:\nCleans the book thoroughly (titles, footnotes, images, book information, etc.). It may delete some good lines too.\n```\nsuper_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str\n```\nmin_token: The minimum number of tokens in a paragraph that is not \"dialog\" or \"quote\"; -1 means don't tokenize the text (faster, but the cleaning is less thorough).\nmax_token: The maximum number of tokens in a paragraph.\n\nDeleted paragraphs are marked with: [deleted]\n\n\n## Author\n\n* **Peyman Mohseni kiasari**\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kiasar/gutenberg_cleaner", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "gutenberg-cleaner", "package_url": "https://pypi.org/project/gutenberg-cleaner/", "platform": "", "project_url": "https://pypi.org/project/gutenberg-cleaner/", "project_urls": {
"Homepage": "https://github.com/kiasar/gutenberg_cleaner" }, "release_url": "https://pypi.org/project/gutenberg-cleaner/0.1.6/", "requires_dist": [ "nltk" ], "requires_python": "", "summary": "cleans gutenberg dataset books", "version": "0.1.6" }, "last_serial": 5318636, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "8adb8f795c396309ba81d7a017c15c9e", "sha256": "f6f7eddca60e70c81a15e7aa3bb159dbc150ec416900b28dbf726615cc0c3132" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "8adb8f795c396309ba81d7a017c15c9e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 3649, "upload_time": "2019-05-25T17:30:27", "url": "https://files.pythonhosted.org/packages/94/b7/e19865e9b0d782129102b65cd3f927d5102b164fc60322e1eed94a65c655/gutenberg_cleaner-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "35b92d87d3e3945cc8292ce21852206e", "sha256": "2cdd9b8123cf3f523ed4cd871cf111f92e0f19271691c580821996e6bf1ded5c" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.0.tar.gz", "has_sig": false, "md5_digest": "35b92d87d3e3945cc8292ce21852206e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2256, "upload_time": "2019-05-25T17:30:29", "url": "https://files.pythonhosted.org/packages/6e/e0/7b7e74ac20e41bfb97e06b187f0c9baa6fe5ee856ad83d419bb04c6da83b/gutenberg_cleaner-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "0b1c39946a7a50f00369873ac2349d3a", "sha256": "73d0b5f78e701fd4dba209ddb36ef6dfe644165fca94532d4dc5a6fc2a8aa604" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "0b1c39946a7a50f00369873ac2349d3a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 2889, "upload_time": "2019-05-25T18:39:01", "url": 
"https://files.pythonhosted.org/packages/3b/43/b0c20dd3e3e13dd61c68080e8e8a324d83761b85b2b64ce8aebadd911c7a/gutenberg_cleaner-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cce9d5f51a304bbb775d86cfb6296636", "sha256": "62d760e4fb50c02e056043d8478762f0e4d3cc1de82e4ed64288e619d277aadc" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.1.tar.gz", "has_sig": false, "md5_digest": "cce9d5f51a304bbb775d86cfb6296636", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1768, "upload_time": "2019-05-25T18:39:03", "url": "https://files.pythonhosted.org/packages/8c/21/b66e193e83dfdaaf7f3e5d3b95a384d93e96904fc993160df2ca651c2a70/gutenberg_cleaner-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "73ca0b57b0cbe8789d17cb97cf7fadc3", "sha256": "0166df88c800346df7948c56d48803a2976f4b8b2a2ddabcb04ab9fecb496dfc" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "73ca0b57b0cbe8789d17cb97cf7fadc3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7434, "upload_time": "2019-05-25T18:54:10", "url": "https://files.pythonhosted.org/packages/4d/25/a13a1f8c6d5e13b0d0761be0babe4faed6b19d6a3d4c830cf73806ece1e9/gutenberg_cleaner-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ed253acc11f8bd9c99c75137813e7647", "sha256": "f721cf6e2d38e49d58abc04917b0047c60d45cf9a67f4d41540bd2fa6ca3111f" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.2.tar.gz", "has_sig": false, "md5_digest": "ed253acc11f8bd9c99c75137813e7647", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5039, "upload_time": "2019-05-25T18:54:13", "url": "https://files.pythonhosted.org/packages/cd/b1/207f2b6c9c1820405e0a229806a98b24e2c3100fc0d9252bf3005fa01d22/gutenberg_cleaner-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "550ed3243a566446527e1d946516d917", "sha256": 
"fa4bb8fb3e0668bd07b72b9cc929762a1f475a4864080510a91d5dd8b99053bf" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "550ed3243a566446527e1d946516d917", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7352, "upload_time": "2019-05-25T19:11:02", "url": "https://files.pythonhosted.org/packages/45/f1/84ea5148324a415f0695aa8953c36e2130b5c44ff27d395dee41d553ee6c/gutenberg_cleaner-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e6e50908c08b136af5a74d63a866b077", "sha256": "b4daacf5e6348163e634be882840673e9876a01cb1b081534f2210f94f1826bf" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.3.tar.gz", "has_sig": false, "md5_digest": "e6e50908c08b136af5a74d63a866b077", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4964, "upload_time": "2019-05-25T19:11:05", "url": "https://files.pythonhosted.org/packages/2c/26/0a692c6f694817ff19ff96cb445003a8b570300012cff4b79c682beab5c6/gutenberg_cleaner-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "bfefbc1de34ad683ece5365b5d0d3ea4", "sha256": "a4a3e5cbb2a1e4121bf5b44c5fdbe1edfba24edb507f21425b86b9f173717f87" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.4-py3-none-any.whl", "has_sig": false, "md5_digest": "bfefbc1de34ad683ece5365b5d0d3ea4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7342, "upload_time": "2019-05-26T09:06:35", "url": "https://files.pythonhosted.org/packages/99/a5/81e3b16954903631874c19ef4a62b85ce8411d323e7cf0ae7539809136ea/gutenberg_cleaner-0.1.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "863b75caacea8886fbc057f3b19ee086", "sha256": "e57fb57516f6d4d8b572c8d8c8b510c39bd11fe2aa899cbb8936880e44ef0731" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.4.tar.gz", "has_sig": false, "md5_digest": "863b75caacea8886fbc057f3b19ee086", "packagetype": "sdist", 
"python_version": "source", "requires_python": null, "size": 4956, "upload_time": "2019-05-26T09:06:37", "url": "https://files.pythonhosted.org/packages/b4/fd/7e1e6c2a3827c768e1632bd49ace5033ec119584f8535e27ba3daed0e0f3/gutenberg_cleaner-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "86cc01f336bfb5593a419e1c04fba92e", "sha256": "1c8da1792017917e5fa1647e524cf2a7d017e82a44afd8e57548d9e58ee5af56" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.5-py3-none-any.whl", "has_sig": false, "md5_digest": "86cc01f336bfb5593a419e1c04fba92e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7396, "upload_time": "2019-05-26T09:38:03", "url": "https://files.pythonhosted.org/packages/1c/12/366e1457f29a2214d04cc3ba7596e95dbccdd2c0bd9ae3fbabbe19b6a30d/gutenberg_cleaner-0.1.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fb5d12e4c9675170395e746d57864e55", "sha256": "fdc7db72b12120cbc21980cd7007ab7e031652cba1b8fbcb51d4931910f7c961" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.5.tar.gz", "has_sig": false, "md5_digest": "fb5d12e4c9675170395e746d57864e55", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5008, "upload_time": "2019-05-26T09:38:04", "url": "https://files.pythonhosted.org/packages/13/87/b15d8760dc0acb793505d5924be99071a4d7a498404df00a6981c0517a7d/gutenberg_cleaner-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "8636a4b12ef512f9d208dcbefc902cfb", "sha256": "6d0c2cd095087ada346f6836df99bc7aa01af2833cb322fc0414e821995e8e01" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.6-py3-none-any.whl", "has_sig": false, "md5_digest": "8636a4b12ef512f9d208dcbefc902cfb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7391, "upload_time": "2019-05-26T11:28:39", "url": 
"https://files.pythonhosted.org/packages/d4/11/3b83da7620e9c05f48fb4791ef712791fabeda09c78c5200d0860ce1e97e/gutenberg_cleaner-0.1.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2d896ec3cbe0c612b9432df9bddcdbb4", "sha256": "1f54ea893d5c31a42cdd9fccf083956ac1a1f9f722b1385569f1d7bca319395d" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.6.tar.gz", "has_sig": false, "md5_digest": "2d896ec3cbe0c612b9432df9bddcdbb4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5008, "upload_time": "2019-05-26T11:28:41", "url": "https://files.pythonhosted.org/packages/34/c5/c73ebc4def0f0ea222a25143dce37bfb677abd98ccbcb92de141980a1ff1/gutenberg_cleaner-0.1.6.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "8636a4b12ef512f9d208dcbefc902cfb", "sha256": "6d0c2cd095087ada346f6836df99bc7aa01af2833cb322fc0414e821995e8e01" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.6-py3-none-any.whl", "has_sig": false, "md5_digest": "8636a4b12ef512f9d208dcbefc902cfb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7391, "upload_time": "2019-05-26T11:28:39", "url": "https://files.pythonhosted.org/packages/d4/11/3b83da7620e9c05f48fb4791ef712791fabeda09c78c5200d0860ce1e97e/gutenberg_cleaner-0.1.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2d896ec3cbe0c612b9432df9bddcdbb4", "sha256": "1f54ea893d5c31a42cdd9fccf083956ac1a1f9f722b1385569f1d7bca319395d" }, "downloads": -1, "filename": "gutenberg_cleaner-0.1.6.tar.gz", "has_sig": false, "md5_digest": "2d896ec3cbe0c612b9432df9bddcdbb4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5008, "upload_time": "2019-05-26T11:28:41", "url": "https://files.pythonhosted.org/packages/34/c5/c73ebc4def0f0ea222a25143dce37bfb677abd98ccbcb92de141980a1ff1/gutenberg_cleaner-0.1.6.tar.gz" } ] }