{ "info": { "author": "Baptiste Fontaine", "author_email": "b@ptistefontaine.fr", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3.7" ], "description": "# WPyDumps\n\n**WpyDumps** is a Python module to work with [dumps of Wikipedia][dumps].\n\nIt allows one to parse and extract relevant information from dump files without\nuncompressing them on disk.\n\nOnly the \u201cAll pages with complete edit history\u201d dump is supported.\n\nThis is quite experimental for now.\n\n[dumps]: https://dumps.wikimedia.org\n\n## Usage\nThe parser uses [SAX][] to read the files as a stream. It takes a reader or a\nfilename and a page callback function. It parses the file and calls that\nfunction with each page.\n\nPages are represented as `wpydumps.model.Page` objects. They include the pages\u2019\ndetails as well as their revisions (`wpydumps.model.Revision`). Each revision\nholds a reference to its contributor (`wpydumps.model.Contributor`).\n\n```python3\nimport wpydumps.parser as p\n\ndef simple_page_callback(page):\n    print(page.title)\n\n# parse from a local archive\np.parse_pages_from_archive_filename(\"myfile.7z\", simple_page_callback)\n\n# parse from an uncompressed file\nwith open(\"myfile\") as f:\n    p.parse_pages_from_reader(f, simple_page_callback)\n```\n\nThe text of each revision is dropped by default. You can disable this behavior\nby passing `keep_revisions_text=True` to the parser function. 
Revisions always\nhave `text_length` and `diff_length` attributes, both of type `int`.\n\n[SAX]: https://docs.python.org/3.6/library/xml.sax.html\n\n### Examples\n```python3\nfrom wpydumps.parser import parse_pages_from_archive_filename\n\ndef page_callback(page):\n    pass  # do something with the page\n\n# use the appropriate filename\nparse_pages_from_archive_filename(\n    \"frwiki-20190901-pages-meta-history1.xml-p3p1630.7z\",\n    page_callback)\n```\n#### Print all pages and their number of revisions\n```python3\ndef page_callback(page):\n    print(page.title, len(page.revisions))\n```\n#### Print all pages and their number of contributors\n```python3\ndef page_callback(page):\n    contributors = set()\n    for rev in page.revisions:\n        contributors.add(rev.contributor.username or rev.contributor.ip)\n\n    print(\"%s: %d contributors\" % (page.title, len(contributors)))\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/bfontaine/wpydumps", "keywords": "", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "wpydumps", "package_url": "https://pypi.org/project/wpydumps/", "platform": "", "project_url": "https://pypi.org/project/wpydumps/", "project_urls": { "Homepage": "https://github.com/bfontaine/wpydumps" }, "release_url": "https://pypi.org/project/wpydumps/0.0.1/", "requires_dist": [ "libarchive" ], "requires_python": "", "summary": "Work with Wikipedia dumps", "version": "0.0.1" }, "last_serial": 5933132, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "4af881ea56c15cf7c7fb0e64df20ec28", "sha256": "04607a350b0cf73ad553cde22d5485fff4231b6e47febda1bb8b3ff5fe6ca9cd" }, "downloads": -1, "filename": "wpydumps-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "4af881ea56c15cf7c7fb0e64df20ec28", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5759, "upload_time": 
"2019-10-05T21:19:18", "url": "https://files.pythonhosted.org/packages/73/55/db90d7fcc7c4c47560b68e3638f119be43b088df4b16391722175417a659/wpydumps-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cf8f62ec50e209ffe3c191839e96716f", "sha256": "f9cb21d80a5497d3635fab89fdd35edf90e83582ad42db46134280ddd0ab4e37" }, "downloads": -1, "filename": "wpydumps-0.0.1.tar.gz", "has_sig": false, "md5_digest": "cf8f62ec50e209ffe3c191839e96716f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4284, "upload_time": "2019-10-05T21:19:21", "url": "https://files.pythonhosted.org/packages/10/b1/680c43931406d5f6212792a179a97a3952aafc46b64c22b38eb83d241fd5/wpydumps-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4af881ea56c15cf7c7fb0e64df20ec28", "sha256": "04607a350b0cf73ad553cde22d5485fff4231b6e47febda1bb8b3ff5fe6ca9cd" }, "downloads": -1, "filename": "wpydumps-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "4af881ea56c15cf7c7fb0e64df20ec28", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5759, "upload_time": "2019-10-05T21:19:18", "url": "https://files.pythonhosted.org/packages/73/55/db90d7fcc7c4c47560b68e3638f119be43b088df4b16391722175417a659/wpydumps-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cf8f62ec50e209ffe3c191839e96716f", "sha256": "f9cb21d80a5497d3635fab89fdd35edf90e83582ad42db46134280ddd0ab4e37" }, "downloads": -1, "filename": "wpydumps-0.0.1.tar.gz", "has_sig": false, "md5_digest": "cf8f62ec50e209ffe3c191839e96716f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4284, "upload_time": "2019-10-05T21:19:21", "url": "https://files.pythonhosted.org/packages/10/b1/680c43931406d5f6212792a179a97a3952aafc46b64c22b38eb83d241fd5/wpydumps-0.0.1.tar.gz" } ] }