{ "info": { "author": "Ilya Kreymer", "author_email": "ikreymer@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Web Environment", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Utilities" ], "description": "CDXJ Indexer\n~~~~~~~~~~~~\n\nA command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files.\nThe indexer is a new tool redesigned for fast and flexible indexing. (Based on the indexing functionality from pywb.)\n\nInstall with ``pip install cdxj-indexer`` or install locally with ``python setup.py install``\n\n\nThe indexer supports the classic CDX index format as well as the more flexible CDXJ. With CDXJ, the indexer supports custom fields and ``request`` record access for WARC files. See the examples below and the command line ``-h`` option for the latest features. (This is a work in progress.)\n\n\nUsage examples\n~~~~~~~~~~~~~~~~~~~~\n\nGenerate CDXJ index:\n\n.. code:: console\n\n > cdxj-indexer /path/to/archive-file.warc.gz\n com,example)/ 20170730223850 {\"url\": \"http://example.com/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK\", \"length\": \"1219\", \"offset\": \"771\", \"filename\": \"example-20170730223917.warc.gz\"}\n\n\nCDX Index (11 field):\n\n.. code:: console\n\n > cdxj-indexer -11 /path/to/archive-file.warc.gz\n CDX N b a m s k r M S V g\n com,example)/ 20170730223850 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 example-20170730223917.warc.gz\n\n\nMore advanced use cases: add additional HTTP headers as fields. 
The ``http:`` prefix specifies the current record's headers, while ``req.http:`` specifies the corresponding request record's headers. The following adds the Date and Referer headers and the request method to the index:\n\n.. code:: console\n\n > cdxj-indexer -f req.http:method,http:date,req.http:referer /path/to/archive-file.warc.gz\n com,example)/ 20170801032435 {\"url\": \"http://example.com/\", \"mime\": \"text/html\", \"status\": \"200\", \"digest\": \"A6DESOVDZ3WLYF57CS5E4RIC4ARPWRK7\", \"length\": \"1207\", \"offset\": \"834\", \"filename\": \"temp-20170801032445.warc.gz\", \"req.http:method\": \"GET\", \"http:date\": \"Tue, 01 Aug 2017 03:24:35 GMT\", \"referrer\": \"https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/\"}\n org,iana)/domains/example 20170801032437 {\"url\": \"http://www.iana.org/domains/example\", \"mime\": \"text/html\", \"status\": \"302\", \"digest\": \"RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2\", \"length\": \"675\", \"offset\": \"2652\", \"filename\": \"temp-20170801032445.warc.gz\", \"req.http:method\": \"GET\", \"http:date\": \"Tue, 01 Aug 2017 02:35:05 GMT\", \"referrer\": \"http://example.com/\"}\n\n\nThe CDXJ Indexer extends the ``Indexer`` functionality in warcio and should be straightforward to extend.\n\n\n\n\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/webrecorder/cdxj_indexer", "keywords": "", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "cdxj-indexer", "package_url": "https://pypi.org/project/cdxj-indexer/", "platform": "", "project_url": "https://pypi.org/project/cdxj-indexer/", "project_urls": { "Homepage": "https://github.com/webrecorder/cdxj_indexer" }, "release_url": "https://pypi.org/project/cdxj-indexer/1.0/", "requires_dist": [ "surt", "warcio" ], "requires_python": "", "summary": "CDXJ Indexer for WARC and ARC files", "version": "1.0" }, 
"last_serial": 3117800, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "71ea63820023558670dbd14335cc9927", "sha256": "d86a1ea9617c21c14a0f3ff9b452cffa70f39189e77b4bc0efcd40fec43786dc" }, "downloads": -1, "filename": "cdxj_indexer-1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "71ea63820023558670dbd14335cc9927", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 8668, "upload_time": "2017-08-23T15:11:29", "url": "https://files.pythonhosted.org/packages/fe/96/2ff539146b9c16370f88b240cd2ddc9333bb5126f7e81e94740c9690630c/cdxj_indexer-1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a6197cf358db2d3da6ccb39fdce1c47b", "sha256": "98fce07f2e6262e815880958efa03578c8c174f4528c7476accb0257b1533638" }, "downloads": -1, "filename": "cdxj_indexer-1.0.tar.gz", "has_sig": false, "md5_digest": "a6197cf358db2d3da6ccb39fdce1c47b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7413, "upload_time": "2017-08-23T15:11:47", "url": "https://files.pythonhosted.org/packages/07/ff/17a3e0a56c4cec9b3e2d984831162b0a208e7483178f3a8f43cb3b43b81a/cdxj_indexer-1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "71ea63820023558670dbd14335cc9927", "sha256": "d86a1ea9617c21c14a0f3ff9b452cffa70f39189e77b4bc0efcd40fec43786dc" }, "downloads": -1, "filename": "cdxj_indexer-1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "71ea63820023558670dbd14335cc9927", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 8668, "upload_time": "2017-08-23T15:11:29", "url": "https://files.pythonhosted.org/packages/fe/96/2ff539146b9c16370f88b240cd2ddc9333bb5126f7e81e94740c9690630c/cdxj_indexer-1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a6197cf358db2d3da6ccb39fdce1c47b", "sha256": "98fce07f2e6262e815880958efa03578c8c174f4528c7476accb0257b1533638" }, "downloads": -1, "filename": 
"cdxj_indexer-1.0.tar.gz", "has_sig": false, "md5_digest": "a6197cf358db2d3da6ccb39fdce1c47b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7413, "upload_time": "2017-08-23T15:11:47", "url": "https://files.pythonhosted.org/packages/07/ff/17a3e0a56c4cec9b3e2d984831162b0a208e7483178f3a8f43cb3b43b81a/cdxj_indexer-1.0.tar.gz" } ] }