{ "info": { "author": "Ilya Kreymer", "author_email": "ikreymer@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Web Environment", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Utilities" ], "description": "WARCIT\n======\n\n``warcit`` is a command-line tool to convert on-disk directories of web documents (commonly HTML, web assets and any other data files) into ISO standard web archive (WARC) files.\n\nConverting to WARC improves durability by using a standardized format, and allows any web files stored on disk to be uploaded into `Webrecorder `_, or replayed locally with `Webrecorder Player `_ or `pywb `_\n\n(Many other tools also operate on WARC files, see: `Awesome Web Archiving -- Tools and Software `_)\n\nWARCIT supports converting individual files, directories (including any nested directories) as well as ZIP files into WARCs.\n\n\nBasic Usage\n-----------\n\n``warcit ...``\n\nSee ``warcit -h`` for a complete list of flags and options.\n\n\nFor example, the following commands download a simple web site via ``wget`` (for simplicity, this retrieves one level deep only), then use ``warcit`` to convert it to ``www.iana.org.warc.gz``::\n\n wget -l 1 -r www.iana.org/\n warcit http://www.iana.org/ ./www.iana.org/\n\nThe WARC ``www.iana.org.warc.gz`` should now have been created!\n\n\nMime Type Detection and Overrides\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nBy default, ``warcit`` uses the default Python ``mimetypes`` library to determine a mime type based on a file extension.\n\nHowever, it also supports using `python-magic `_ (libmagic) if available and custom mime overrides configured via the command line.\n\nThe mime detection is as follows:\n\n1) 
If the filename matches an override specified via ``--mime-overrides``, use that as the mime type.\n\n2) If ``mimetypes.guess_type()`` returns a valid mime type, use that as the mime type.\n\n3) If the ``--use-magic`` flag is specified, use the ``magic`` api to determine the mime type (``warcit`` will error if ``magic`` is not available when using this flag).\n\n4) Default to ``text/html`` if all previous attempts did not yield a mime type.\n\n\nThe ``--mime-overrides`` flag can be used to specify wildcard queries (applied to the full url) and corresponding mime types as a comma-delimited property list::\n\n warcit '--mime-overrides=*.html=text/html; charset=\"utf-8\",image.ico=image/png' http://www.iana.org/ ./www.iana.org/\n\nWhen a url ending in ``*.html`` or ``*.ico`` is encountered, the specified mime type will be used for the ``Content-Type`` header, bypassing any auto-detection.\n\nCharset Detection\n~~~~~~~~~~~~~~~~~\n\nCharset detection is disabled by default, but can be enabled with the ``--charset auto`` flag.\n\nDetection is done using the `cchardet `_ native chardet library.\n\nA specific charset can also be specified, e.g. ``--charset utf-8`` will add ``; charset=utf-8`` to all ``text/*`` resources.\n\nIf detection does not produce a result, or if the result is ``ascii``, no charset is added to the ``Content-Type``.\n\n\nZIP Files\n~~~~~~~~~\n\n``warcit`` also supports converting ZIP files to WARCs, including portions of zip files.\n\nFor example, if a zip file contains::\n\n my_zip_file.zip\n |\n +-- www.example.com/\n |\n +-- another.example.com/\n |\n +-- some_other_data/\n\nIt is possible to specify the two paths in the zip file to be converted to a WARC separately::\n\n warcit --name my-warc.gz http:// my_zip_file.zip/www.example.com/ my_zip_file.zip/another.example.com/\n\nThis should result in a new WARC ``my-warc.gz`` containing the specified zip file paths. 
The ``some_other_data`` path is not processed.\n\n\nWARC Structure and Format\n-------------------------\n\nThe tool produces ISO standard WARC 1.0 files.\n\nA ``warcinfo`` record is added at the beginning of the WARC, unless the ``--no-warcinfo`` flag is specified.\n\nThe warcinfo record contains the full command line and warcit version::\n\n WARC/1.0\n WARC-Type: warcinfo\n WARC-Record-ID: ...\n WARC-Filename: example.com.warc.gz\n WARC-Date: 2017-12-05T18:30:58Z\n Content-Type: application/warc-fields\n Content-Length: ...\n\n software: warcit 0.2.0\n format: WARC File Format 1.0\n cmdline: warcit --fixed-dt 2011-02 http://example.com/ ./path/to/somefile.html\n\n\nEach file specified or found in the directory is stored as a WARC ``resource`` record.\n\nBy default, warcit uses the file-modified date as the ``WARC-Date`` of each url.\nThis setting can be overridden with a fixed datetime by specifying the ``--fixed-dt`` flag.\nThe datetime can be specified as ``--fixed-dt YYYY-MM-DDTHH:mm:ss`` or ``--fixed-dt YYYYMMDDHHmmss`` or as a partial date,\ne.g. 
``--fixed-dt YYYY-MM``\n\n\nThe actual WARC creation time and path to the source file on disk are also stored, using the ``WARC-Created-Date``\nand ``WARC-Source-URI`` extension headers, respectively.\n\nFor example, when running ``warcit --fixed-dt 2011-02 http://example.com/ ./path/to/somefile.html``, the resulting WARC Record might look as follows::\n\n WARC/1.0\n WARC-Date: 2011-02-01T00:00:00Z\n WARC-Created-Date: 2017-12-05T18:30:58Z\n WARC-Source-URI: file://./path/to/somefile.html\n WARC-Type: resource\n WARC-Record-ID: ...\n WARC-Target-URI: http://www.example.com/to/somefile.html\n Content-Type: text/html\n Content-Length: ...\n\n ...\n\nAdditionally, warcit adds ``revisit`` records for top-level directories if index files are present.\nIndex files can be specified via the ``--index-files`` flag, the default being ``--index-files=index.html,index.htm``\n\nFor example, when running:\n``warcit http://example.com/ ./path/`` and there exists a file: ``./path/subdir/index.html``, warcit will create:\n\n- a ``resource`` record for ``http://example.com/path/subdir/index.html``\n\n- a ``revisit`` record for ``http://example.com/path/subdir/`` pointing to ``http://example.com/path/subdir/index.html``\n\n\n\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/webrecorder/warcit", "keywords": "", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "warcit", "package_url": "https://pypi.org/project/warcit/", "platform": "", "project_url": "https://pypi.org/project/warcit/", "project_urls": { "Homepage": "https://github.com/webrecorder/warcit" }, "release_url": "https://pypi.org/project/warcit/0.3.1/", "requires_dist": [ "cchardet", "warcio (>=1.6.1)" ], "requires_python": "", "summary": "Convert Directories, Files and Zip Files to Web Archives (WARC)", "version": "0.3.1" }, "last_serial": 4361875, "releases": { 
"0.2.0": [ { "comment_text": "", "digests": { "md5": "15c4eccd532d2f4465f7b2587f78e620", "sha256": "ec903eae4a2db1934beca8043804c503de7b45b7201f6a3e3abad318817abbc2" }, "downloads": -1, "filename": "warcit-0.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "15c4eccd532d2f4465f7b2587f78e620", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 9495, "upload_time": "2017-12-05T05:50:52", "url": "https://files.pythonhosted.org/packages/84/9c/b6518e7682e8a4cc3e526782a67859b2c8cddffaefda8ddcf7aa3dc8240e/warcit-0.2.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "57b100b589b2fd8ef6ac71f82087315b", "sha256": "0fe89cb0f6540994cbf5ecb6bb25a6632089ea853364dc57671287daae17268b" }, "downloads": -1, "filename": "warcit-0.2.0.tar.gz", "has_sig": false, "md5_digest": "57b100b589b2fd8ef6ac71f82087315b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9119, "upload_time": "2017-12-05T05:50:31", "url": "https://files.pythonhosted.org/packages/25/12/004c04543720ccc1716096142627c0d0b26c01c02a6d1f3db0064c607d29/warcit-0.2.0.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "d8aab699027eaea6e0042984b6551c0b", "sha256": "95e479f8ae3ac946562c93855563f71eab6f50e6d00eee91b644dd471a819abc" }, "downloads": -1, "filename": "warcit-0.3.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d8aab699027eaea6e0042984b6551c0b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 11566, "upload_time": "2017-12-07T00:08:26", "url": "https://files.pythonhosted.org/packages/61/12/ed54f4acea6f9b9a685fce6b4a65d40f40a217ae8fef99c17420f82a8cda/warcit-0.3.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "68860fbf04c38c86e517d08cd450a6a7", "sha256": "4d9fe814915f2eea5d531435b5f0d693eaf8395afde38c088a93d3c008c6a3b0" }, "downloads": -1, "filename": "warcit-0.3.0.tar.gz", "has_sig": false, "md5_digest": 
"68860fbf04c38c86e517d08cd450a6a7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11448, "upload_time": "2017-12-07T00:08:28", "url": "https://files.pythonhosted.org/packages/5f/3c/7e894a6f014dc0329045f6e945669c10d25972f20f061428b1b6f27c0662/warcit-0.3.0.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "2c216b95c4d290f83f12df092f3032f9", "sha256": "703176b8bf44736bcef534c25bc8936ad7de598b48396682fe4f3d1eab3e6303" }, "downloads": -1, "filename": "warcit-0.3.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "2c216b95c4d290f83f12df092f3032f9", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 14422, "upload_time": "2018-10-10T22:31:28", "url": "https://files.pythonhosted.org/packages/d7/55/5f16456a71dbb225055254fecfc29d3144573f5b3676b00a36b8e52782c1/warcit-0.3.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d4c02f6963494f321abf1e98ceddf8a0", "sha256": "88716dab7b5a263b1986ce3acce4ebd0fddf95c5cadd9673d8d080ea61cc1cce" }, "downloads": -1, "filename": "warcit-0.3.1.tar.gz", "has_sig": false, "md5_digest": "d4c02f6963494f321abf1e98ceddf8a0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14446, "upload_time": "2018-10-10T22:31:29", "url": "https://files.pythonhosted.org/packages/a8/45/00785ba3aab76459254f254792ca0be4e448b04c840e1943fe4f9d7c0c68/warcit-0.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2c216b95c4d290f83f12df092f3032f9", "sha256": "703176b8bf44736bcef534c25bc8936ad7de598b48396682fe4f3d1eab3e6303" }, "downloads": -1, "filename": "warcit-0.3.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "2c216b95c4d290f83f12df092f3032f9", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 14422, "upload_time": "2018-10-10T22:31:28", "url": 
"https://files.pythonhosted.org/packages/d7/55/5f16456a71dbb225055254fecfc29d3144573f5b3676b00a36b8e52782c1/warcit-0.3.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d4c02f6963494f321abf1e98ceddf8a0", "sha256": "88716dab7b5a263b1986ce3acce4ebd0fddf95c5cadd9673d8d080ea61cc1cce" }, "downloads": -1, "filename": "warcit-0.3.1.tar.gz", "has_sig": false, "md5_digest": "d4c02f6963494f321abf1e98ceddf8a0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14446, "upload_time": "2018-10-10T22:31:29", "url": "https://files.pythonhosted.org/packages/a8/45/00785ba3aab76459254f254792ca0be4e448b04c840e1943fe4f9d7c0c68/warcit-0.3.1.tar.gz" } ] }