{ "info": { "author": "Maciej Brencz", "author_email": "maciej.brencz@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Text Processing :: Markup :: XML" ], "description": "# mediawiki-dump\n[![Build Status](https://travis-ci.org/macbre/mediawiki-dump.svg?branch=master)](https://travis-ci.org/macbre/mediawiki-dump)\n\n```\npip install mediawiki_dump\n```\n\n[Python3 package](https://pypi.org/project/mediawiki_dump/) for working with [MediaWiki XML content dumps](https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Backup_the_content_of_the_wiki_(XML_dump)).\n\nWikipedia (bz2 compressed) and Wikia (7zip) content dumps are supported.\n\n## Dependencies\n\nIn order to read 7zip archives (used by Wikia's XML dumps) you need to install [`libarchive`](http://libarchive.org/):\n\n```\nsudo apt install libarchive-dev\n```\n\n## API\n\n### Tokenizer\n\nAllows you to clean up the wikitext:\n\n```python\nfrom mediawiki_dump.tokenizer import clean\nclean('[[Foo|bar]] is a link')\n'bar is a link'\n```\n\nAnd then tokenize the text:\n\n```python\nfrom mediawiki_dump.tokenizer import tokenize\ntokenize('11. 
juni 2007 var\u00f0 kunngj\u00f8rt, at Sv\u00ednoyar kommuna ver\u00f0ur l\u00f8gd saman vi\u00f0 Klaksv\u00edkar kommunu eftir komandi bygdar\u00e1\u00f0sval.')\n['juni', 'var\u00f0', 'kunngj\u00f8rt', 'at', 'Sv\u00ednoyar', 'kommuna', 'ver\u00f0ur', 'l\u00f8gd', 'saman', 'vi\u00f0', 'Klaksv\u00edkar', 'kommunu', 'eftir', 'komandi', 'bygdar\u00e1\u00f0sval']\n```\n\n### Dump reader\n\nFetch and parse dumps (using a local file cache):\n\n```python\nfrom mediawiki_dump.dumps import WikipediaDump\nfrom mediawiki_dump.reader import DumpReader\n\ndump = WikipediaDump('fo')\npages = DumpReader().read(dump)\n\n[page.title for page in pages][:10]\n\n['Main Page', 'Br\u00fakari:Jon Harald S\u00f8by', 'Fors\u00ed\u00f0a', 'Ormurin Langi', 'Regin smi\u00f0ur', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirl\u00fdsingin um mannar\u00e6ttindi', 'B\u00f3lkur:Kv\u00e6\u00f0i', 'B\u00f3lkur:Yrking', 'Kjak:Fors\u00ed\u00f0a']\n```\n\nThe `read` method yields a `DumpEntry` object for each revision.\n\nBy using the `DumpReaderArticles` class, you can read article pages only:\n\n```python\nimport logging; logging.basicConfig(level=logging.INFO)\n\nfrom mediawiki_dump.dumps import WikipediaDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = WikipediaDump('fo')\nreader = DumpReaderArticles()\npages = reader.read(dump)\n\nprint([page.title for page in pages][:25])\n\nprint(reader.get_dump_language()) # fo\n```\n\nWill give you:\n\n```\nINFO:DumpReaderArticles:Parsing XML dump...\nINFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...\nINFO:WikipediaDump:Fetching fo dump from ...\nINFO:WikipediaDump:HTTP 200 (14105 kB will be fetched)\nINFO:WikipediaDump:Cache set\n...\n['WIKIng', 'F\u00f8royar', 'Bor\u00f0oy', 'Eysturoy', 'Fugloy', 'Fors\u00ed\u00f0a', 'L\u00f8gmenn \u00ed F\u00f8royum', 'GNU Free Documentation License', 'GFDL', 'Opi\u00f0 innihald', 'Wikipedia', 'Alfr\u00f8\u00f0i', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. 
juni', '22. juni', '23. juni', 'L\u00edvfr\u00f8\u00f0i', '24. juni', '25. juni', '26. juni', '27. juni']\n```\n\n## Reading Wikia's dumps\n\n```python\nimport logging; logging.basicConfig(level=logging.INFO)\n\nfrom mediawiki_dump.dumps import WikiaDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = WikiaDump('plnordycka')\npages = DumpReaderArticles().read(dump)\n\nprint([page.title for page in pages][:25])\n```\n\nWill give you:\n\n```\nINFO:DumpReaderArticles:Parsing XML dump...\nINFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...\nINFO:WikiaDump:Fetching plnordycka dump from ...\nINFO:WikiaDump:HTTP 200 (129 kB will be fetched)\nINFO:WikiaDump:Cache set\nINFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump\n...\nINFO:DumpReaderArticles:Parsing completed, entries found: 615\n['Nordycka Wiki', 'Strona g\u0142\u00f3wna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsj\u00f6n', 'Wyspy Owcze', 'N\u00f3lsoy', 'Sandoy', 'V\u00e1gar', 'M\u00f8rk', 'Eysturoy', 'Rakfisk', 'H\u00e1karl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na p\u00f3\u0142noc', 'Svalbard', 'Hamfer\u00f0', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']\n```\n\n## Fetching full history\n\nPass `full_history` to the `BaseDump` constructor to fetch the XML content dump with full history:\n\n```python\nimport logging; logging.basicConfig(level=logging.INFO)\n\nfrom mediawiki_dump.dumps import WikiaDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = WikiaDump('macbre', full_history=True) # fetch full history, including old revisions\npages = DumpReaderArticles().read(dump)\n\nprint('\\n'.join([repr(page) for page in pages]))\n```\n\nWill give you:\n\n```\nINFO:DumpReaderArticles:Parsing completed, entries found: 384\n\n\n\n\n\n\n\n\n\n\n\n...\n\n\n\n\n\n\n\n\n```\n\n## Reading dumps of selected articles\n\nYou can use the [`mwclient` Python 
library](https://mwclient.readthedocs.io/en/latest/index.html)\nand fetch \"live\" dumps of selected articles from any MediaWiki-powered site.\n\n```python\nimport mwclient\nsite = mwclient.Site('vim.fandom.com', path='/')\n\nfrom mediawiki_dump.dumps import MediaWikiClientDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])\n\npages = DumpReaderArticles().read(dump)\n\nprint('\\n'.join([repr(page) for page in pages]))\n```\n\nWill give you:\n\n```\n\n\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/macbre/mediawiki_dump", "keywords": "dump fandom mediawiki wikipedia wikia", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "mediawiki-dump", "package_url": "https://pypi.org/project/mediawiki-dump/", "platform": "", "project_url": "https://pypi.org/project/mediawiki-dump/", "project_urls": { "Homepage": "https://github.com/macbre/mediawiki_dump" }, "release_url": "https://pypi.org/project/mediawiki-dump/0.6.7/", "requires_dist": null, "requires_python": "", "summary": "Python package for working with MediaWiki XML content dumps", "version": "0.6.7" }, "last_serial": 5595548, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "7e7237daf035c606aa70d4d005755a66", "sha256": "19bde3f8559f300e38fe9efb951762f114d462e2f7cf9cc9320cfb014cc9d164" }, "downloads": -1, "filename": "mediawiki_dump-0.2.tar.gz", "has_sig": false, "md5_digest": "7e7237daf035c606aa70d4d005755a66", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5706, "upload_time": "2018-10-26T19:52:54", "url": "https://files.pythonhosted.org/packages/30/1a/d213615c64218d458a42f4889a128ac2d8e00a0497d462d0d965585fbc56/mediawiki_dump-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "ffd6242bd8020f33d529dc7ef96e7939", 
"sha256": "5ad07d89a24c001cb17b3fea2542c1c615b158fa690b33ff986e64a4035d6e2f" }, "downloads": -1, "filename": "mediawiki_dump-0.3.tar.gz", "has_sig": false, "md5_digest": "ffd6242bd8020f33d529dc7ef96e7939", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8286, "upload_time": "2018-10-30T20:18:37", "url": "https://files.pythonhosted.org/packages/bc/8a/1ac5528ca475dcc9943952e0e02f9a63481a5ff11186d64e826e64bece72/mediawiki_dump-0.3.tar.gz" } ], "0.4": [ { "comment_text": "", "digests": { "md5": "845eb4de982d48f54cc0cc7deda91e9f", "sha256": "f2480c411fe31adc7ebf71d7f611cdfabc68c31fad12863acffed2e8aa0b8326" }, "downloads": -1, "filename": "mediawiki_dump-0.4.tar.gz", "has_sig": false, "md5_digest": "845eb4de982d48f54cc0cc7deda91e9f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9654, "upload_time": "2018-11-12T16:47:41", "url": "https://files.pythonhosted.org/packages/3b/3e/b94af7b09b959a23c4e3c90da1a88c10b20cabdf32bc31c31da3067de520/mediawiki_dump-0.4.tar.gz" } ], "0.5": [ { "comment_text": "", "digests": { "md5": "a52ef03754beeb0d1a378dcb5afba4db", "sha256": "e5fac7fbbcd545c597ef31ab753cd24ba9dd382df6dd99ae5696c4d26621d7ac" }, "downloads": -1, "filename": "mediawiki_dump-0.5.tar.gz", "has_sig": false, "md5_digest": "a52ef03754beeb0d1a378dcb5afba4db", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9779, "upload_time": "2018-11-22T19:06:16", "url": "https://files.pythonhosted.org/packages/a2/e9/53323f26fd66b3c6c7e89a644c64064cead02226afbb01727537d7911e13/mediawiki_dump-0.5.tar.gz" } ], "0.6": [ { "comment_text": "", "digests": { "md5": "70a06d7635ae4b3f0e950edd63964315", "sha256": "29f41073160ad20571f80acb1130872af05edca41954e745045b55fad6de2eeb" }, "downloads": -1, "filename": "mediawiki_dump-0.6.tar.gz", "has_sig": false, "md5_digest": "70a06d7635ae4b3f0e950edd63964315", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 
9798, "upload_time": "2018-11-24T14:06:03", "url": "https://files.pythonhosted.org/packages/0a/f6/aa190eb62058d3c1cb1dbe96298f130bb0badef71e9800f7f0add8628ccc/mediawiki_dump-0.6.tar.gz" } ], "0.6.1": [ { "comment_text": "", "digests": { "md5": "f05f102aad0f4d22216bb7a05da3bc3b", "sha256": "54610f72f8972232a8d139d74bfdc67dbc5b52d68db972b28b87b53018ea54c1" }, "downloads": -1, "filename": "mediawiki_dump-0.6.1.tar.gz", "has_sig": false, "md5_digest": "f05f102aad0f4d22216bb7a05da3bc3b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9809, "upload_time": "2018-11-24T22:02:34", "url": "https://files.pythonhosted.org/packages/77/5e/1c470ee6a72bef0d2fc629fc1cf85b41aec0c614b4172383060d59807a66/mediawiki_dump-0.6.1.tar.gz" } ], "0.6.2": [ { "comment_text": "", "digests": { "md5": "15e25ec233fedd5075d48e87f729eac2", "sha256": "938889e6a5d9977220a0ca4bb79732a6db18b5027e6e08a56fd496916d16c174" }, "downloads": -1, "filename": "mediawiki_dump-0.6.2.tar.gz", "has_sig": false, "md5_digest": "15e25ec233fedd5075d48e87f729eac2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9817, "upload_time": "2018-11-24T22:11:02", "url": "https://files.pythonhosted.org/packages/44/64/4d3de191e91772dd8369a7805639a55bcc46cecdc54a2ea24332e6c008a9/mediawiki_dump-0.6.2.tar.gz" } ], "0.6.3": [ { "comment_text": "", "digests": { "md5": "e4453d2a97edd4d976d8908629a59d2a", "sha256": "56115de8104f0f1589918c680bbf43b710634996d9192bb7cb4448bea8cc7cc8" }, "downloads": -1, "filename": "mediawiki_dump-0.6.3.tar.gz", "has_sig": false, "md5_digest": "e4453d2a97edd4d976d8908629a59d2a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9834, "upload_time": "2019-03-25T15:39:39", "url": "https://files.pythonhosted.org/packages/a8/69/b29e570b42d8bf4441e6962746816f6d54e12b40be71ff99189b4f793632/mediawiki_dump-0.6.3.tar.gz" } ], "0.6.4": [ { "comment_text": "", "digests": { "md5": 
"0cc23b2f290753845ecadd7e87671400", "sha256": "f33ef157ae8999be73cd3a4f77526b2b4bbbf903a80657fbabe5145589ae86b1" }, "downloads": -1, "filename": "mediawiki_dump-0.6.4.tar.gz", "has_sig": false, "md5_digest": "0cc23b2f290753845ecadd7e87671400", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10169, "upload_time": "2019-04-01T19:17:38", "url": "https://files.pythonhosted.org/packages/c8/31/dc3213bcbfd6e79514d1c1c82a2d70218d82fe5bfddd83366a8410257977/mediawiki_dump-0.6.4.tar.gz" } ], "0.6.5": [ { "comment_text": "", "digests": { "md5": "0a594bc1a8ddf1da09aa1484dba48f1f", "sha256": "af211bea992f6d5fea422829e5b05586f1992584672b714a591989f74ab1bc58" }, "downloads": -1, "filename": "mediawiki_dump-0.6.5.tar.gz", "has_sig": false, "md5_digest": "0a594bc1a8ddf1da09aa1484dba48f1f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10444, "upload_time": "2019-06-13T20:52:51", "url": "https://files.pythonhosted.org/packages/36/22/db3e4882c9dc00dc60341e017267f3c51159165e210f0fb4f912342cf129/mediawiki_dump-0.6.5.tar.gz" } ], "0.6.6": [ { "comment_text": "", "digests": { "md5": "d0b5e385efeb656c92110f4080680135", "sha256": "2e82f943705f13508fd6eeeeca582af3af219ded7749cb55d6d60c3e47c68215" }, "downloads": -1, "filename": "mediawiki_dump-0.6.6.tar.gz", "has_sig": false, "md5_digest": "d0b5e385efeb656c92110f4080680135", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12673, "upload_time": "2019-07-22T20:09:16", "url": "https://files.pythonhosted.org/packages/db/31/06f3363555dcab39eab01ef3477258fc377015c74ef29aad7978aadddd0d/mediawiki_dump-0.6.6.tar.gz" } ], "0.6.7": [ { "comment_text": "", "digests": { "md5": "df8b8cfe77d278e2594a511e2334dc69", "sha256": "e0509b8e783cbbd9cd5798d8b8974525ed55cd6e3dfa6d00c9cf3394df0e3e26" }, "downloads": -1, "filename": "mediawiki_dump-0.6.7.tar.gz", "has_sig": false, "md5_digest": "df8b8cfe77d278e2594a511e2334dc69", "packagetype": "sdist", 
"python_version": "source", "requires_python": null, "size": 12687, "upload_time": "2019-07-28T12:33:16", "url": "https://files.pythonhosted.org/packages/b5/47/a21d6b3c9c987fd29202609eea2526e9c7d147f86a6169c2dc733b9164fd/mediawiki_dump-0.6.7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "df8b8cfe77d278e2594a511e2334dc69", "sha256": "e0509b8e783cbbd9cd5798d8b8974525ed55cd6e3dfa6d00c9cf3394df0e3e26" }, "downloads": -1, "filename": "mediawiki_dump-0.6.7.tar.gz", "has_sig": false, "md5_digest": "df8b8cfe77d278e2594a511e2334dc69", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12687, "upload_time": "2019-07-28T12:33:16", "url": "https://files.pythonhosted.org/packages/b5/47/a21d6b3c9c987fd29202609eea2526e9c7d147f86a6169c2dc733b9164fd/mediawiki_dump-0.6.7.tar.gz" } ] }