{ "info": { "author": "Wouter Bolsterlee", "author_email": "uws@xs4all.nl", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Programming Language :: Python", "Topic :: System :: Recovery Tools", "Topic :: Text Processing" ], "description": "=================================================\nPython string codec for MySQL's *latin1* encoding\n=================================================\n\n:Author: Wouter Bolsterlee\n:License: 3-clause BSD\n:URL: https://github.com/wbolster/mysql-latin1-codec\n\nOverview\n========\n\nThis project provides a Python string codec for MySQL's *latin1* encoding, and\nan accompanying *iconv*-like command line script for use in shell pipes.\n\n\nRationale\n=========\n\n* MySQL defaults to using the *latin1* encoding for all its textual data, but\n its *latin1* encoding is not actually *latin1* but a MySQL-specific variant.\n\n* Due to improperly written applications or wrongly configured databases, many\n existing databases keep data in MySQL *latin1* columns, even if that data is\n not actually *latin1* data.\n\n* MySQL will *not* complain about this, so this often goes unnoticed. The\n problems only appear when you try to access the database from another\n application, or when you try to ``grep`` through database dumps produced by\n ``mysqldump``.\n\n* Many libraries do not support this encoding properly, and using the real\n *latin1* encoding leads to corruption when processing data, especially when\n the database contains text that is not in a West-European language.\n\n* This string codec makes it possible to reconstruct the exact bytes as they\n were stored in MySQL. This is a valuable tool for fixing text encoding issues\n with such databases. For convenience, a command line interface that behaves\n like *iconv* is also included.\n\n\nVersion history\n===============\n\n* Version 2.0 (2014-01-31)\n\n * Added explicit ``register()`` function instead of having side-effects upon\n module import.\n\n* Version 1.0 (2013-12-04)\n \n * Initial release\n\n\nInstallation\n============\n\n::\n\n $ pip install mysql-latin1-codec\n\nThe package supports both Python 2 and Python 3.\n\n\nUsage\n=====\n\nYou can use this project in two ways: as a stand-alone command line tool and as\na Python module.\n\nCommand line tool\n-----------------\n\nThe command line tool behaves like *iconv*::\n\n $ python -m mysql_latin1_codec --help\n usage: mysql_latin1_codec.py [-h] [-f encoding] [-t encoding] [-o filename]\n [-c]\n [inputs [inputs ...]]\n\n iconv-like tool to encode or decode data using MySQL's \"latin1\" dialect,\n which, despite the name, is a mix of cp1252 and ISO-8859-1.\n\n positional arguments:\n inputs Input file(s) (defaults to stdin)\n\n optional arguments:\n -h, --help show this help message and exit\n -f encoding, --from encoding\n Source encoding (uses MySQL's \"latin1\" if omitted)\n -t encoding, --to encoding\n Target encoding (uses MySQL's \"latin1\" if omitted)\n -o filename, --output filename\n Output file (defaults to stdout)\n -c, --skip-invalid Omit invalid characters from output (not the default)\n\n\nPython API\n----------\n\nIn Python code, simply import the module named ``mysql_latin1_codec`` and call\nthe ``register()`` function. A string codec named ``mysql_latin1`` will be\nregistered in Python's codec registry::\n\n import mysql_latin1_codec\n\n mysql_latin1_codec.register()\n\nYou can use it using the normal ``.decode()`` and ``.encode()`` methods on\n(byte)strings, and you can also specify it as the ``encoding`` argument to\nvarious I/O functions like ``io.open()``. Example::\n\n\n # String encoding/decoding round-trip\n s1 = u'foobar'\n s2 == text.encode('mysql_latin1').decode('mysql_latin1')\n assert s1 == s2\n\n # Reading files\n import io\n with io.open('/path/to/file', 'r', encoding='mysql_latin1') as fp:\n for line in fp:\n pass\n\n\nPractical examples\n==================\n\nThe example below \u2018fixes\u2019 dumps that contain doubly encoded UTF-8 data, i.e.\nreal UTF-8 data stored in a MySQL *latin1* table. By default ``mysqldump`` makes\nUTF-8 dumps, but if MySQL thinks the data is *latin1*, it will convert it again,\nresulting in double encoded data. ::\n\n $ cat backup-of-broken-database-produced-by-mysqldump.sql \\\n | python -m mysql_latin1_codec -f UTF-8 \\\n | iconv -c -f UTF-8 -t UTF-8 \\\n > legible-text-in-utf8.sql\n\nThe *iconv* pipe in this example removes invalid UTF-8 sequences, while keeping\nthe valid parts as-is. MySQL truncates values whose size exceeds the column's\nmaximum size, but if MySQL doesn't know that it is handling UTF-8 data (because\nthe database schema and the broken application did not tell it to do so) it\ntruncates the byte sequence, not the character sequence. This may result in\nincomplete UTF-8 sequences when a multi-byte sequence is truncated somewhere in\nthe middle. Since those characters cannot be recovered anyway, removing them is\nthe right solution in this case.\n\nIn code you can do something similar to the example above::\n\n original = b'...' # byte string containing doubly-encoded UTF-8 data\n s = original.decode('UTF-8').encode('mysql_latin1').decode('UTF-8', 'replace')\n\nAnother example to \u2018fix\u2019 a dump that contains GB2312 (Simplified Chinese) data\nstored in a MySQL *latin1* column, again misinterpreted and encoded to UTF-8 by\n``mysqldump``::\n\n $ cat mojibake-crap.sql \\\n | python -m mysql_latin1_codec -f UTF-8 \\\n | iconv -f GB2312 -t UTF-8 \\\n > legible-text-in-utf8.sql\n\n\nTechnical background\n====================\n\nHow MySQL defines *latin1*\n--------------------------\n\nThe character set that MySQL uses when *latin1* is specified, is not actually\nthe well-known *latin1* character set, officially known as ISO-8859-1. What\nMySQL calls *latin1* is actually a custom encoding based on *cp-1252* (also\nknown as *windows-1252*).\n\nThe MySQL documentation on `West European Character Sets 9\u00a7 10.1.14.2)\n`_ contains:\n\n ``latin1`` is the default character set. MySQL's ``latin1`` is the same as\n the Windows ``cp1252`` character set. THis means it is the same as official\n ``ISO 8859-1`` or IANA (Internet Assigned Numbers Authority) ``latin1``,\n except that IANA ``latin1`` treats the code points between ``0x80`` and\n ``0x9f`` as \u201cundefined\u201d, whereas ``cp1252``, and therefore MySQL's\n ``latin``, assign characters for those positions. For example, ``0x80`` is\n the Euro sign. For the \u201cundefined\u201d entries in ``cp1252``, MySQL translates\n ``0x81`` to Unicode ``0x0081``, ``0x8d`` to ``0x008d``, ``0x8ff`` to\n ``0x008f``, ``0x90`` to ``0x0090``, and ``0x9d`` to ``0x009d``.\n\nSome more details can be found in the MySQL source code in the file\n``strings/ctype-latin1.c``::\n\n WL#1494 notes:\n\n We'll use cp1252 instead of iso-8859-1.\n cp1252 contains printable characters in the range 0x80-0x9F.\n In ISO 8859-1, these code points have no associated printable\n characters. Therefore, by converting from CP1252 to ISO 8859-1,\n one would lose the euro (for instance). Since most people are\n unaware of the difference, and since we don't really want a\n \"Windows ANSI\" to differ from a \"Unix ANSI\", we will:\n\n - continue to pretend the latin1 character set is ISO 8859-1\n - actually allow the storage of euro etc. so it's actually cp1252\n\n Also we'll map these five undefined cp1252 character:\n 0x81, 0x8D, 0x8F, 0x90, 0x9D\n into corresponding control characters:\n U+0081, U+008D, U+008F, U+0090, U+009D.\n like ISO-8859-1 does. Otherwise, loading \"mysqldump\"\n output doesn't reproduce these undefined characters.\n\nAs you can see, this encoding is significantly different from ISO-8859-1 (the\nreal *latin1*), but MySQL misleadingly labels it as *latin* anyway.\n\n\nWhy this can be a problem\n-------------------------\n\nMySQL's *latin1* encoding allows for arbitrary data to be stored in database\ncolumns, without any validation. This means *latin1* text columns can store any\nbyte sequence, for example UTF-8 encoded text (which uses a variable number of\nbytes per character) or even JPEG images (which is not text at all).\n\nThis is of course not the proper use of *latin1* columns. Even in this modern\nUnicode-aware world, in which all properly written software that handles text\nshould use UTF-8 (or another Unicode encoding), it is quite common to stumble\nupon wrongly configured databases or badly written software. Most applications\nuse the same (incorrect) assumptions for both storing and retrieving data, so in\nmany setups this will still \u2018just work\u2019, and the problem can go unnoticed for a\nlong time.\n\nWhat makes this problem worse, is that MySQL defaults to using the *latin1*\ncharacter encoding, mostly for historical and backward-compatibility reasons.\nThis means many databases in the real world are (perhaps mistakingly) configured\nto store data in columns that use MySQL's *latin1* encoding, even though the\nactual data stores in those columns is not encoding using *latin1* at all.\n\nThis can lead to a variety of problems, such as encoding or decoding errors,\ndouble encoded text, malfunctioning string operations, or incorrect truncation\nwhich can lead to data corruption. In many cases this manifests itself as\n`mojibake `_ text. This may be caused by\na misinterpretation of the characters that the bytes represent, or by double\nencodings, e.g. UTF-8 in a *latin1* column that was converted to UTF-8 again by\na backup script.\n\nMany tools, like Python's built-in text codecs and the *iconv* (both the command\nline tool and the C library) cannot convert data encoding using this custom\nMySQL encoding. This makes it quite hard to \u2018recover\u2019 e.g. UTF-8 data that was\nstored in a *latin1* column, and subsequently dumped using *mysqldump*, even if\nyou know what you're doing and which actual encoding was used.\n\nWhen invoked on the command line, this script converts the dump file(s)\nspecified on the command line (or standard input if no files were given). The\ndata is interpreted as UTF-8 and encoded as MySQL's *latin1* and written to the\nstandard output. The output is the raw data, which likely needs further\nprocessing, e.g. using iconv to \"reinterpret\" the data correctly (e.g. as\nUTF-8).\n\n\nI have no idea what you are talking about!\n==========================================\n\nNo worries, that's okay.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/wbolster/mysql-latin1-codec", "keywords": null, "license": "BSD License", "maintainer": null, "maintainer_email": null, "name": "mysql-latin1-codec", "package_url": "https://pypi.org/project/mysql-latin1-codec/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/mysql-latin1-codec/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/wbolster/mysql-latin1-codec" }, "release_url": "https://pypi.org/project/mysql-latin1-codec/2.0/", "requires_dist": null, "requires_python": null, "summary": "Python string codec for MySQL's latin1 encoding", "version": "2.0" }, "last_serial": 987572, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "15dd2f41a2c392400c4e95a478ae3ad1", "sha256": "19d3388fea03dd393e12b223936128108f188e2b56e29135006edf5c0da9cabe" }, "downloads": -1, "filename": "mysql-latin1-codec-1.0.tar.gz", "has_sig": false, "md5_digest": "15dd2f41a2c392400c4e95a478ae3ad1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5910, "upload_time": "2013-12-06T12:10:14", "url": "https://files.pythonhosted.org/packages/cb/db/cd2aac2faed2b893f3a00eca30e8ca262c7142e727b0c0f6a768234133fe/mysql-latin1-codec-1.0.tar.gz" } ], "2.0": [ { "comment_text": "", "digests": { "md5": "ecb565946614a65938283db17449a773", "sha256": "009bf64f9ee0f33c5fb220c3571c5b8f62b22b7822a1512382a1b9944eed2442" }, "downloads": -1, "filename": "mysql-latin1-codec-2.0.tar.gz", "has_sig": false, "md5_digest": "ecb565946614a65938283db17449a773", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7343, "upload_time": "2014-01-31T20:20:34", "url": "https://files.pythonhosted.org/packages/25/c4/7221f3bb37e1a055a948515062309b773606d883069201da82ce34753550/mysql-latin1-codec-2.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ecb565946614a65938283db17449a773", "sha256": "009bf64f9ee0f33c5fb220c3571c5b8f62b22b7822a1512382a1b9944eed2442" }, "downloads": -1, "filename": "mysql-latin1-codec-2.0.tar.gz", "has_sig": false, "md5_digest": "ecb565946614a65938283db17449a773", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7343, "upload_time": "2014-01-31T20:20:34", "url": "https://files.pythonhosted.org/packages/25/c4/7221f3bb37e1a055a948515062309b773606d883069201da82ce34753550/mysql-latin1-codec-2.0.tar.gz" } ] }