{ "info": { "author": "Brad Solomon", "author_email": "brad.solomon.1124@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3 :: Only", "Programming Language :: Python :: Implementation :: CPython" ], "description": "# guessenc\n\n[![Build](https://img.shields.io/circleci/project/github/bsolomon1124/guessenc.svg)](https://circleci.com/gh/bsolomon1124/guessenc/tree/master)\n[![License](https://img.shields.io/pypi/l/guessenc)](https://github.com/bsolomon1124/guessenc/blob/master/LICENSE)\n[![PyPI](https://img.shields.io/pypi/v/guessenc.svg)](https://pypi.org/project/guessenc/)\n[![Status](https://img.shields.io/pypi/status/guessenc.svg)](https://pypi.org/project/guessenc/)\n[![Python](https://img.shields.io/pypi/pyversions/guessenc.svg)](https://pypi.org/project/guessenc)\n\nInfer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.\n\n## Basic Usage\n\nThe main function exported by `guessenc` is `infer_encoding()`.\n\n```python\n>>> import requests\n>>> from guessenc import infer_encoding\n\n>>> resp = requests.get(\"http://www.fatehwatan.ps/page-183525.html\")\n>>> resp.raise_for_status()\n>>> infer_encoding(resp.content, resp.headers)\n(, 'cp1256')\n```\n\nThis tells us that the detected encoding is cp1256, and that it was retrieved from a HTML tag with ``http-equiv='Content-Type'``.\n\nDetail on the signature of `infer_encoding()`:\n\n```python\ndef infer_encoding(\n content: Optional[bytes] = None,\n headers: Optional[Mapping[str, str]] = None\n) -> Pair:\n ...\n```\n\nThe `content` represents the page HTML, such as `response.content`.\n\nThe `headers` represents the HTTP response headers, such as `response.headers`.\nIf provided, this should be a data structure supporting a case-insensitive lookup, such as `requests.structures.CaseInsensitiveDict`\nor `multidict.CIMultiDict`.\n\nBoth parameters are optional.\n\nThe return type is a `tuple`.\n\nThe first element of the tuple is a member of the `Source` enum (see [Search Process](#search-process) below). The source indicates where\nthe detected encoding comes from.\n\nThe second element of the tuple is either a `str`, which is the canonical name of the detected encoding, or `None` if no encoding is found.\n\n## Where Do Other Libraries Fall Short?\n\nThe `requests` library \"[follows] RFC 2616 to the letter\" in using the HTTP headers to determine the encoding of the response content. This\nmeans, among other things, using `ISO-8859-1` as a fallback if no charset is given, despite the fact that UTF-8 has [absolutely\ndwarfed](https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg) all other encodings in usage on web pages.\n\n```python\n# requests/adapters.py\nresponse.encoding = get_encoding_from_headers(response.headers)\n```\n\nIf `requests` does not find an HTTP `Content-Type` header at all, it will fall back to detection via `chardet` rather than looking in the\nHTML tags for meaningful information. There's nothing at all _wrong_ with this; it just means that the `requests` maintainers have chosen to\nfocus on the power of `requests` [as an HTTP library, not an HTML library](https://github.com/psf/requests/issues/2266). If you want more fine-grained control over encoding detection,\ntry `infer_encoding()`.\n\nThis is not to single out `requests` either; there are other libraries that do the same dance with encoding detection;\n[`aiohttp`](https://github.com/aio-libs/aiohttp/blob/master/aiohttp/client_reqrep.py) checks the `Content-Type` header, or otherwise\ndefaults to UTF-8 without looking anywhere else.\n\n## Search Process\n\nThe function `guessenc.infer_encoding()` looks in a handful of places to extract an encoding, in this order, and stops when it finds one:\n\n1. In the `charset` value from the [`Content-Type` HTTP entity header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type).\n2. In the `charset` value from a `` HTML tag.\n3. In the `charset` value from a `` tag with `http-equiv=\"Content-Type\"`.\n4. Using the [`chardet`](https://chardet.readthedocs.io/en/latest/) library.\n\nEach of the above \"sources\" is signified by a corresponding member of the `Source` enum:\n\n```python\nclass Source(enum.Enum):\n \"\"\"Indicates where our detected encoding came from.\"\"\"\n\n CHARSET_HEADER = 0\n META_CHARSET = 1\n META_HTTP_EQUIV = 2\n CHARDET = 3\n COULD_NOT_DETECT = 4\n```\n\nIf none of the 4 sources from the list above return a viable encoding, this is indicated by `Source.COULD_NOT_DETECT`.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/bsolomon1124/guessenc", "keywords": "encoding http html chardet detection", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "guessenc", "package_url": "https://pypi.org/project/guessenc/", "platform": "", "project_url": "https://pypi.org/project/guessenc/", "project_urls": { "Homepage": "https://github.com/bsolomon1124/guessenc" }, "release_url": "https://pypi.org/project/guessenc/0.3/", "requires_dist": [ "lxml", "chardet" ], "requires_python": ">=3.5", "summary": "Infer HTML encoding from response headers & content", "version": "0.3" }, "last_serial": 5886267, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "cd25c6b09bf03b05b18777e5a8d3ca34", "sha256": "0d7c5bece71fd2b8d888f049e7f15f2e24ee913d7756dc75eed7ceb795d92f33" }, "downloads": -1, "filename": "guessenc-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "cd25c6b09bf03b05b18777e5a8d3ca34", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 5441, "upload_time": "2019-09-25T15:52:13", "url": "https://files.pythonhosted.org/packages/b6/ed/a0b52222b1d259aa8540a679c6840b6157b88d97689c5508e6205a21ea90/guessenc-0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d5a837589ee33e3924d6d993759b50dc", "sha256": "7ef23651794938742fa2701296b0d76fc74b048c71c0e1a3f4505dd84ef54932" }, "downloads": -1, "filename": "guessenc-0.1.tar.gz", "has_sig": false, "md5_digest": "d5a837589ee33e3924d6d993759b50dc", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 330004, "upload_time": "2019-09-25T15:52:16", "url": "https://files.pythonhosted.org/packages/ca/f3/c11a87290cdd389e8c6c02a98d32d1ff07f9e4b05c963f73a03abaf2c813/guessenc-0.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "74043884412d2c91da1722dd35aa08a3", "sha256": "54ea8fa775e1a904289f8ed4de1be2ce1e5c42c82f00358bc67eb3dfde41cbac" }, "downloads": -1, "filename": "guessenc-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "74043884412d2c91da1722dd35aa08a3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 5575, "upload_time": "2019-09-25T16:10:24", "url": "https://files.pythonhosted.org/packages/a1/80/52fcb7448beef3ef3b65aeddca11be53bb86cabfb201f287c01a43c50ef2/guessenc-0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "dcd9309d11cfc256283404c93f14e085", "sha256": "59afc94ba514251918a77f980585763d99c5c1bc20301da5b402598c53472e3c" }, "downloads": -1, "filename": "guessenc-0.2.tar.gz", "has_sig": false, "md5_digest": "dcd9309d11cfc256283404c93f14e085", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 330306, "upload_time": "2019-09-25T16:10:26", "url": "https://files.pythonhosted.org/packages/de/09/7555fa6fdb093acff3c84d5384044fcf92155ab27e29097b9807f5204005/guessenc-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "f96cf457be86908169781dd4dd95e82e", "sha256": "010e2a3e39a6b90a40dd704c6e62b6988e5841992f5947d2637c9f2ffb5719ac" }, "downloads": -1, "filename": "guessenc-0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "f96cf457be86908169781dd4dd95e82e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 5577, "upload_time": "2019-09-25T16:35:03", "url": "https://files.pythonhosted.org/packages/0c/19/df737e83c1d7f47438d69bbba36120824be606246e974738d5fa35640da2/guessenc-0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1460a5fa4fecd47992c81028b2ad77d3", "sha256": "e081977d3ae2ed55835ca0d92f86d8987fd5b0e4372241cbbde1e5abf0315190" }, "downloads": -1, "filename": "guessenc-0.3.tar.gz", "has_sig": false, "md5_digest": "1460a5fa4fecd47992c81028b2ad77d3", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 330302, "upload_time": "2019-09-25T16:35:04", "url": "https://files.pythonhosted.org/packages/1b/51/70a2c4fd02c29dc330a7cfb793fd81a3f8ad238a6e0d6ddb13218c97dc16/guessenc-0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f96cf457be86908169781dd4dd95e82e", "sha256": "010e2a3e39a6b90a40dd704c6e62b6988e5841992f5947d2637c9f2ffb5719ac" }, "downloads": -1, "filename": "guessenc-0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "f96cf457be86908169781dd4dd95e82e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 5577, "upload_time": "2019-09-25T16:35:03", "url": "https://files.pythonhosted.org/packages/0c/19/df737e83c1d7f47438d69bbba36120824be606246e974738d5fa35640da2/guessenc-0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1460a5fa4fecd47992c81028b2ad77d3", "sha256": "e081977d3ae2ed55835ca0d92f86d8987fd5b0e4372241cbbde1e5abf0315190" }, "downloads": -1, "filename": "guessenc-0.3.tar.gz", "has_sig": false, "md5_digest": "1460a5fa4fecd47992c81028b2ad77d3", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 330302, "upload_time": "2019-09-25T16:35:04", "url": "https://files.pythonhosted.org/packages/1b/51/70a2c4fd02c29dc330a7cfb793fd81a3f8ad238a6e0d6ddb13218c97dc16/guessenc-0.3.tar.gz" } ] }