{ "info": { "author": "Alexandre Kabbach", "author_email": "akb@3azouz.net", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Linguistic" ], "description": "# WiToKit\n[![GitHub release][release-image]][release-url]\n[![PyPI release][pypi-image]][pypi-url]\n[![Build][travis-image]][travis-url]\n[![MIT License][license-image]][license-url]\n\nWelcome to `WiToKit`, a Python toolkit to download and generate preprocessed Wikipedia dumps for all languages.\n\nWiToKit can be used to convert a Wikipedia archive into a single .txt file, one (tokenized) sentence per line.\n\n*Note: WiToKit currently only supports `xx-pages-articles.xml.xx.bz2` Wikipedia archives corresponding to articles, templates, media/file descriptions, and primary meta-pages.*\n\n## Install\n\n```bash\npip3 install witokit\n```\n\nOn Python 3.5 you may need to pass the `--process-dependency-links` flag:\n```bash\npip3 install witokit --process-dependency-links\n```\n\n## Use\n\n### Download\nTo download a .bz2-compressed Wikipedia XML dump, do:\n```bash\nwitokit download \\\n --lang lang_wp_code \\\n --date wiki_date \\\n --output /abs/path/to/output/dir/where/to/store/bz2/archives \\\n --num-threads num_cpu_threads\n```\n\nFor example, to download the latest English Wikipedia, do:\n```bash\nwitokit download --lang en --date latest --output /abs/path/to/output/dir --num-threads 2\n```\n\nThe `--lang` parameter expects the WP (language) 
code corresponding\nto the desired Wikipedia archive.\nCheck out the full list of Wikipedias with their corresponding WP codes [here](https://en.wikipedia.org/wiki/List_of_Wikipedias).\n\nThe `--date` parameter expects a string corresponding to one of the dates\nfound under the Wikimedia dump site corresponding to a given Wikipedia dump\n(e.g. https://dumps.wikimedia.org/enwiki/ for the English Wikipedia).\n\n**Important:** Keep `--num-threads` <= 3 to avoid rejection from Wikimedia servers.\n\n### Extract\nTo extract the content of the downloaded .bz2 archives, do:\n\n```bash\nwitokit extract \\\n --input /abs/path/to/downloaded/wikipedia/bz2/archives \\\n --num-threads num_cpu_threads\n```\n\n### Process\nTo preprocess the content of the extracted XML archives and output a single .txt file, tokenized, one sentence per line:\n```bash\n# --lower: if set, will lowercase text\nwitokit process \\\n --input /abs/path/to/wikipedia/extracted/xml/archives \\\n --output /abs/path/to/single/output/txt/file \\\n --lower \\\n --num-threads num_cpu_threads\n```\n\nPreprocessing for all languages is performed with [Polyglot](https://github.com/aboSamoor/polyglot).\n\n### Sample\nYou can also use WiToKit to sample the content of a preprocessed .txt file, using:\n```bash\n# --percent: percentage of total lines to keep\nwitokit sample \\\n --input /abs/path/to/witokit/preprocessed/txt/file \\\n --percent \\\n --balance # if set, will balance sampling, otherwise, will take top n sentences 
only\n```\n\n[release-image]:https://img.shields.io/github/release/akb89/witokit.svg?style=flat-square\n[release-url]:https://github.com/akb89/witokit/releases/latest\n[pypi-image]:https://img.shields.io/pypi/v/witokit.svg?style=flat-square\n[pypi-url]:https://pypi.org/project/witokit/\n[travis-image]:https://img.shields.io/travis/akb89/witokit.svg?style=flat-square\n[travis-url]:https://travis-ci.org/akb89/witokit\n[license-image]:http://img.shields.io/badge/license-MIT-000000.svg?style=flat-square\n[license-url]:LICENSE.txt\n[req-url]:https://requires.io/github/akb89/witokit/requirements/?branch=master\n[req-image]:https://img.shields.io/requires/github/akb89/witokit.svg?style=flat-square", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://pypi.org/project/witokit/#files", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/akb89/witokit", "keywords": "wikipedia,dump,tokenization,nlp", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "witokit", "package_url": "https://pypi.org/project/witokit/", "platform": "any", "project_url": "https://pypi.org/project/witokit/", "project_urls": { "Download": "https://pypi.org/project/witokit/#files", "Homepage": "https://github.com/akb89/witokit" }, "release_url": "https://pypi.org/project/witokit/1.1.0/", "requires_dist": null, "requires_python": "", "summary": "A python module to generate a tokenized dump of Wikipedia for NLP", "version": "1.1.0" }, "last_serial": 5821402, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "2b89662b144d28a9b980cea8144b1d6d", "sha256": "24e0b9c1c7562ff591e7032bfcf65e237e92a37955bf58acb744f72b283a3d0f" }, "downloads": -1, "filename": "witokit-0.1.0.tar.gz", "has_sig": false, "md5_digest": "2b89662b144d28a9b980cea8144b1d6d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7257, "upload_time": "2018-11-16T09:56:15", "url": 
"https://files.pythonhosted.org/packages/4d/d9/54d727ecbfc5a66f8f053f953c15309a39c0ac1aa4f01747a4f84999d981/witokit-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "04965cdb9f557779a3a46e8f69a40d5e", "sha256": "d83000a10d735b932673b85ce87b70b8ffa07e03470f97f7d6e98de3e439575c" }, "downloads": -1, "filename": "witokit-0.1.1.tar.gz", "has_sig": false, "md5_digest": "04965cdb9f557779a3a46e8f69a40d5e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7318, "upload_time": "2018-11-16T10:13:41", "url": "https://files.pythonhosted.org/packages/d4/68/7ddeee65fd37acc6b727d03876d11e7d3e288ba47f5e06a52ca8d64e94e6/witokit-0.1.1.tar.gz" } ], "0.1.10": [ { "comment_text": "", "digests": { "md5": "6d3ab41c46c21c144fecf0a3a15d3a14", "sha256": "3814f63e2f230f65f4d9945b390010ca8e7e99b96af0dd1a5c245d2bf58eefed" }, "downloads": -1, "filename": "witokit-0.1.10.tar.gz", "has_sig": false, "md5_digest": "6d3ab41c46c21c144fecf0a3a15d3a14", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7483, "upload_time": "2018-11-24T13:34:19", "url": "https://files.pythonhosted.org/packages/f7/12/6412630150b77cc24393555464c05fcf72e27be0447690d967ee1f5b1710/witokit-0.1.10.tar.gz" } ], "0.1.12": [ { "comment_text": "", "digests": { "md5": "22a7b184dd43631bac5cdac563a26d06", "sha256": "77f9e80fa7bc6676b586bcf7134f20cff34ba1d10221522eb92cbedfaeb3ed1a" }, "downloads": -1, "filename": "witokit-0.1.12.tar.gz", "has_sig": false, "md5_digest": "22a7b184dd43631bac5cdac563a26d06", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7533, "upload_time": "2018-11-24T14:09:34", "url": "https://files.pythonhosted.org/packages/94/0a/3715e1954d058f1db06c3321c5c754a4956d1175c11db592f91d832878e7/witokit-0.1.12.tar.gz" } ], "0.1.13": [ { "comment_text": "", "digests": { "md5": "82663d507c057ccd8170d835118a5bd0", "sha256": "a5102f91bca2df2e7a019f8a4cc88a4a74861743dbf81381854db8633ee8833b" }, 
"downloads": -1, "filename": "witokit-0.1.13.tar.gz", "has_sig": false, "md5_digest": "82663d507c057ccd8170d835118a5bd0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7623, "upload_time": "2018-11-24T16:50:11", "url": "https://files.pythonhosted.org/packages/c5/b4/d6e7f713c23be15fbb8e1193d90d36e56084c26e50d9f4df328352be453f/witokit-0.1.13.tar.gz" } ], "0.1.14": [ { "comment_text": "", "digests": { "md5": "7fdc7a3b658addf3daec4552cf6dc760", "sha256": "2fce53e61a7fd8e1bee1beee65a2f2f46b39f4f866c701f54ad89ade041a82e1" }, "downloads": -1, "filename": "witokit-0.1.14.tar.gz", "has_sig": false, "md5_digest": "7fdc7a3b658addf3daec4552cf6dc760", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7686, "upload_time": "2018-12-12T08:19:06", "url": "https://files.pythonhosted.org/packages/a5/2f/2183a1ec6c218df99b61cf6c5e559997421795fffa4922b4b7da90c7b4b9/witokit-0.1.14.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "52da975f8a314ee5b48b3b75bdec7db3", "sha256": "386a2faa321c26e815915de688f73cfb0e74745ea5e8dc3ea0e5808d9a1feb75" }, "downloads": -1, "filename": "witokit-0.1.2.tar.gz", "has_sig": false, "md5_digest": "52da975f8a314ee5b48b3b75bdec7db3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7307, "upload_time": "2018-11-16T12:15:56", "url": "https://files.pythonhosted.org/packages/1b/42/a23798b9c3a746b932f56755ef29d23d836851350581f0452209ed0840ce/witokit-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "1679f5a1fd44fb75b7015769b0902b37", "sha256": "15fdf87be96c13affb54b99cdb94f938d0983ba8d457e3d5939c73951fb2c54d" }, "downloads": -1, "filename": "witokit-0.1.3.tar.gz", "has_sig": false, "md5_digest": "1679f5a1fd44fb75b7015769b0902b37", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7311, "upload_time": "2018-11-16T12:30:43", "url": 
"https://files.pythonhosted.org/packages/76/2a/6aac9f231e3314df83136037bf3a2a66267954df0d27161e20fd4b594452/witokit-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "7f2392bcac1e9a63e3c1a2265d45893a", "sha256": "23daf8eec32c60904df1b3caca3a9345719aa49dac000c7643163f581dbdad89" }, "downloads": -1, "filename": "witokit-0.1.4.tar.gz", "has_sig": false, "md5_digest": "7f2392bcac1e9a63e3c1a2265d45893a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7353, "upload_time": "2018-11-16T12:40:44", "url": "https://files.pythonhosted.org/packages/2b/83/ecb582b9b07890639ec5135f0454a75538376ebbf0383a8f14c556bc2aa4/witokit-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "da6e061081fcb3646222c410c0a42f74", "sha256": "30c51fa4b770eaf03016cfd4bf1e2c658e087577c69ce6d50121b2df42bfb911" }, "downloads": -1, "filename": "witokit-0.1.5.tar.gz", "has_sig": false, "md5_digest": "da6e061081fcb3646222c410c0a42f74", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7350, "upload_time": "2018-11-16T14:44:07", "url": "https://files.pythonhosted.org/packages/08/87/b2fac40a673f33d49ef154367e98d1bae1aab4bc39e2711098b9c49807cb/witokit-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "474fb9f04d4a9971b572f0d1d208edc1", "sha256": "b3c5c50188be7277fe2e46e454d71f18dabe370bd2b55e5d1dd848d2251f8a30" }, "downloads": -1, "filename": "witokit-0.1.6.tar.gz", "has_sig": false, "md5_digest": "474fb9f04d4a9971b572f0d1d208edc1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7373, "upload_time": "2018-11-17T14:36:18", "url": "https://files.pythonhosted.org/packages/99/09/0930775e9b8155fcc900bbaaf90d742aa471bb73acc5388fee9a97aa031d/witokit-0.1.6.tar.gz" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "dfd3987fd414d38dd1f1ac45d371b9df", "sha256": "d1032225a12e69a34f18ff23669580f01c3b6db53ab520fb93eadfeb30269bbe" }, "downloads": 
-1, "filename": "witokit-0.1.7.tar.gz", "has_sig": false, "md5_digest": "dfd3987fd414d38dd1f1ac45d371b9df", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7360, "upload_time": "2018-11-17T15:04:07", "url": "https://files.pythonhosted.org/packages/e1/38/4d41ea40b5c3201a71aa95b8241730022e055a8773f852de0a944928d993/witokit-0.1.7.tar.gz" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "12d5c1e84a531f0aaaf1c7323e7d6dee", "sha256": "0cf306b96e70c228ff610ddc5155135d1ee9a408220d2d76add3e17495125d25" }, "downloads": -1, "filename": "witokit-0.1.8.tar.gz", "has_sig": false, "md5_digest": "12d5c1e84a531f0aaaf1c7323e7d6dee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7434, "upload_time": "2018-11-18T07:55:59", "url": "https://files.pythonhosted.org/packages/d3/a1/99cce243713d4198f1442e7f67316bca70ba8f287a3e8ac09f06fd838790/witokit-0.1.8.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "d02da243d6125118fcdd72cea912c42d", "sha256": "f0537faa2f1cd6a7e505bb0f84275ac948210148c90a355404d87e1735eb4e57" }, "downloads": -1, "filename": "witokit-0.2.0.tar.gz", "has_sig": false, "md5_digest": "d02da243d6125118fcdd72cea912c42d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7680, "upload_time": "2019-01-28T08:36:01", "url": "https://files.pythonhosted.org/packages/6a/f8/31618e4a91f525ee348131cf11c46e4060cb596b0a7b8f814119fe0eb1cb/witokit-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "86ee5f7400d87a1a88cd11ec9537d312", "sha256": "455a071417084e4424f605f61e0878bac7ef8b39f9bff1782f912a9d1e35f804" }, "downloads": -1, "filename": "witokit-0.2.1.tar.gz", "has_sig": false, "md5_digest": "86ee5f7400d87a1a88cd11ec9537d312", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7681, "upload_time": "2019-01-28T09:48:09", "url": 
"https://files.pythonhosted.org/packages/6b/aa/a31f01bf5b8bceb0a6b0ba832a4cca2dca384a401d23bd70983ef7bddc20/witokit-0.2.1.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "57e26e7fb74f1ac91c3914d1cd4a01b9", "sha256": "f644a0916b10a05d2b4e2352e1037efe016ed15a480a1f2ffc846ef2cb34af60" }, "downloads": -1, "filename": "witokit-0.4.0.tar.gz", "has_sig": false, "md5_digest": "57e26e7fb74f1ac91c3914d1cd4a01b9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8083, "upload_time": "2019-01-29T09:12:13", "url": "https://files.pythonhosted.org/packages/23/45/f32042e0c36421052a4e39f4a853beb7489f2f54ab74f0be874b4e12d993/witokit-0.4.0.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "a54a126001f643da84d263b0ea21e9ee", "sha256": "a1ec0f20a810828d74549da16dae6b025702b897d7a201c17675540196199611" }, "downloads": -1, "filename": "witokit-1.0.0.tar.gz", "has_sig": false, "md5_digest": "a54a126001f643da84d263b0ea21e9ee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8240, "upload_time": "2019-09-12T12:10:49", "url": "https://files.pythonhosted.org/packages/df/34/724425efedd1d8722945e91ddbb5117ad5c3897df9f850bf283796d3e28e/witokit-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "44695adf8416bd7b0f732b6e711e4e10", "sha256": "828dbcacd6dcb0eb7b65f4736d25d8a037cb8f6eef944d028caf24c7345f02c7" }, "downloads": -1, "filename": "witokit-1.0.1.tar.gz", "has_sig": false, "md5_digest": "44695adf8416bd7b0f732b6e711e4e10", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8250, "upload_time": "2019-09-12T12:33:44", "url": "https://files.pythonhosted.org/packages/b5/20/667de638510b690bf67849840faf4768fe2f9c81e43db828210e0ad3b69e/witokit-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "c792ed26c7921d20e41f5e59c7a1c6e7", "sha256": "12a7dd633463939c67b3b050a66eefe7ee52e25d5c8e287e1385f43d5ca822a6" }, "downloads": 
-1, "filename": "witokit-1.0.2.tar.gz", "has_sig": false, "md5_digest": "c792ed26c7921d20e41f5e59c7a1c6e7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8226, "upload_time": "2019-09-12T12:48:44", "url": "https://files.pythonhosted.org/packages/94/f4/2a6cca8c123d5ccdb2288ee8a60f02875c4e32999fb9ae6809dca059bf9f/witokit-1.0.2.tar.gz" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "4b978064d5f0d24e66edfc8b570c3393", "sha256": "c0126f1fde980ded5e1359977387b09a45be66c645a931c350988c948433ebb0" }, "downloads": -1, "filename": "witokit-1.1.0.tar.gz", "has_sig": false, "md5_digest": "4b978064d5f0d24e66edfc8b570c3393", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8239, "upload_time": "2019-09-12T16:24:16", "url": "https://files.pythonhosted.org/packages/59/69/338c656703434d75a92b37220729f822d17a523f121173c1b65388866406/witokit-1.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4b978064d5f0d24e66edfc8b570c3393", "sha256": "c0126f1fde980ded5e1359977387b09a45be66c645a931c350988c948433ebb0" }, "downloads": -1, "filename": "witokit-1.1.0.tar.gz", "has_sig": false, "md5_digest": "4b978064d5f0d24e66edfc8b570c3393", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8239, "upload_time": "2019-09-12T16:24:16", "url": "https://files.pythonhosted.org/packages/59/69/338c656703434d75a92b37220729f822d17a523f121173c1b65388866406/witokit-1.1.0.tar.gz" } ] }