{ "info": { "author": "Guido Serra aka Zeph", "author_email": "zeph@fsfe.org", "bugtrack_url": null, "classifiers": [], "description": "========\nSqlHBase\n========\n\nSqlHBase is an HBase ingestion tool for MySQL generated dumps\n\nThe aim of this tool is to provide a 1:1 mapping of a MySQL table\ninto an HBase table, mapped on Hive (schema is handled too)\n\nTo run this requires a working HBase with Thrift enabled,\nand a Hive instance, with metastore properly configured and\nThrift enabled as well. If u need I/O performance, I recommend to\nlook into Pig or Jython, or directly a native Map Reduce job.\n\nSQOOP was discarded as an option, as it doesn't cope with dump files\nand it does not compute the difference between dumps before ingestion.\n\nSqlHBase does a 2 level ingestion process, described below.\n\n\"INSERT INTO `table_name` VALUE (), ()\" statements are hashed\nand stored (dropping anything at the left side of the first open\nround bracket) as a single row into a staging table on HBase (the\nmd5 hash of the row is the row_key on HBase).\nWhen multiple dumps of the same table/database are inserted, this\nprevents (or at least reduce) the duplication of data on HBase side.\n\nMySQL by default chunks rows as tuples, up to 16Mb, in a single\nINSERT statement. Given that, we basically have a list of tuples:\n\n [(1, \"c1\", \"c2\", \"c3\"), (2, \"c1\", \"c2\", \"c3\"), ... ]\n\nInitial attempt of parsing/splitting such a string with a regexp\nfailed, of course. Since a column value could contain ANYTHING,\neven round brackets and quotes. This kind of language is not\nrecognizable by a Finite State Automata, so something else had to\nbe implemented, to keep track of the nested brackets for example.\nA PDA (push down automata) would have helped but... as u can\nlook above, the syntax is exactly the one from a list of tuples\nin python.... an eval() is all we needed in such a case.\n(and it is also, I guess, optimized on C level by the interpreter)\n\nTo be taken in consideration that the IDs of the rows are integers\nwhile HBase wants a string... plus, we need to do some zero padding\ndue to the fact that HBase does lexicographic sorting of its keys.\n\nThere are tons of threads on forums about how bad is to use a\nmonotonically incrementing key on HBase, but... this is what we needed.\n\n[...]\n\nA 2-level Ingestion Process\n===========================\n\nA staging, -> (bin/sqlhbase-mysqlimport)\n--------------------------------------------\nwithout any kind of interpretation of the content of the MySQL dump\nfile apart of the splitting between schema data and raw data (INSERTs).\n2 tables are created _\"namespace\"_creates, _\"namespace\"_values\nThe first table contains an entry/row for each dumpfile ingested,\nhaving as a rowkey the timestamp of the day at the bottom of the dumpfile\n(or a command line provided one, in case that information is missing).\nSuch row contains the list of hashes that for a table (see below),\na create statement for each table, and a create statement for each view,\nplus some statistics related to the time of parsing of the file,\nand the amount of rows it was containing, and the overall md5 hash.\n\nA publishing, -> (bin/sqlhbase-populate)\n-----------------------------------------\ngiven a namespace (as of initial import) and a timestamp (from a list):\n - the content of the table CREATE statement gets interpreted, the data\n types mapped from MySQL to HIVE, and the table created on HIVE.\n - if not existing, the table gets created fully, reading each 16Mb chunk\n - the table gets created with such convention: \"namespace\"_\"table_name\"\n - if the table exists, and it contains data, we compute the difference\n between the 2 lists of hashes that were created at ingestion time\n -- then we check what has already been ingested in the range of row ids\n which is contained in the mysql chunk (we took the assumption that\n mysql is sequentially dumping a table, hopefully)\n -- if a row id which is in the sequence in the database is not in the\n sequence from the chunk we are ingesting, than we might have a DELETE\n (DELETE that we do not execute on HBase due to HBASE-5154, HBASE-5241)\n -- if a row id is also in our chunk, we check each column for changes\n -- duplicated columns are removed from the list that is going to be sent\n to the server, this to avoid waste of bandwidth consumption\n - at this stage, we get a copy of the data on the next known ingestion\n date (dates are known from the list of dumps in the meta table)\n -- if data are found, each row gets diffed with the data to be ingested\n that are left from the previous cleaning... if there are real changes\n those are kept and will be sent to the HBase server for writing\n (timestamps are verified at this stage, to avoid to resend data\n that have already been written previously)\n\nFIXME: ingesting data, skipping a day, will need proper recalculation\n of the difference of the hashes list...\n ingesting data, from a backup that was not previously ingested\n (while we kept ingesting data in the tables) will cause some\n redundant data duplicated in HBase, simply cause we do not dare\n to delete the duplicate that are \"in the future\"\n\n ...anyway, it is pretty easy to delete a table and reconstruct it\n having all the history into the staging level of HBase\n\nLast but not least, we do parse VIEWs and apply them on HIVE\n... be careful about https://issues.apache.org/jira/browse/HIVE-2055 !!!", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://www.rocket-internet.de", "keywords": null, "license": "GPL 3", "maintainer": null, "maintainer_email": null, "name": "SqlHBase", "package_url": "https://pypi.org/project/SqlHBase/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/SqlHBase/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://www.rocket-internet.de" }, "release_url": "https://pypi.org/project/SqlHBase/0.4/", "requires_dist": null, "requires_python": null, "summary": "MySQLDump to HBase, ETL scripts", "version": "0.4" }, "last_serial": 785714, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "acc14faef4f3b3a591d5e8516dbf2d89", "sha256": "f3faee7cf809a69213918bed1f27dad50934347c0e4da8a7e5dad56414bab140" }, "downloads": -1, "filename": "SqlHBase-0.1.tar.gz", "has_sig": false, "md5_digest": "acc14faef4f3b3a591d5e8516dbf2d89", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15222, "upload_time": "2013-02-28T17:37:43", "url": "https://files.pythonhosted.org/packages/92/d1/13dabbc40da3626d0352afc3de061998f23cf58aa1ac537cc703dfd2e95c/SqlHBase-0.1.tar.gz" } ], "0.1dev": [], "0.2": [ { "comment_text": "", "digests": { "md5": "e27c1b8d498339f3cf565da0eedef846", "sha256": "36fc3028ff657158338092da840d4e875af2e0044395fe8336311698130c76f3" }, "downloads": -1, "filename": "SqlHBase-0.2.tar.gz", "has_sig": false, "md5_digest": "e27c1b8d498339f3cf565da0eedef846", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18243, "upload_time": "2013-03-04T09:51:01", "url": "https://files.pythonhosted.org/packages/be/99/61951d6088948176187406ee9c8ee6040589e5e217a3c003356a990a9138/SqlHBase-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "f24dd75db796b4bc08dc04f78958aef9", "sha256": "f65b7eb9ff2b5d9e801e899fb4f6784df308eebe30e45029d56d1abd30e8db8b" }, "downloads": -1, "filename": "SqlHBase-0.3.tar.gz", "has_sig": false, "md5_digest": "f24dd75db796b4bc08dc04f78958aef9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30751, "upload_time": "2013-03-04T09:54:07", "url": "https://files.pythonhosted.org/packages/53/28/a65d8b5b11f5064e8e191df905343415f96fca1d901d1bfc890741481997/SqlHBase-0.3.tar.gz" } ], "0.4": [ { "comment_text": "", "digests": { "md5": "0bbd8b2ab80513b94ac160f92547396c", "sha256": "86d6847a86f9fb7eccc372eb4519b32c91a45960a3cc561a56e579a5a34ee9d0" }, "downloads": -1, "filename": "SqlHBase-0.4.tar.gz", "has_sig": false, "md5_digest": "0bbd8b2ab80513b94ac160f92547396c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30766, "upload_time": "2013-03-07T10:16:04", "url": "https://files.pythonhosted.org/packages/ec/ff/d063962a4abf1b85fb0c1363277f32c077d49b1bab83e0c4415d283b7671/SqlHBase-0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0bbd8b2ab80513b94ac160f92547396c", "sha256": "86d6847a86f9fb7eccc372eb4519b32c91a45960a3cc561a56e579a5a34ee9d0" }, "downloads": -1, "filename": "SqlHBase-0.4.tar.gz", "has_sig": false, "md5_digest": "0bbd8b2ab80513b94ac160f92547396c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30766, "upload_time": "2013-03-07T10:16:04", "url": "https://files.pythonhosted.org/packages/ec/ff/d063962a4abf1b85fb0c1363277f32c077d49b1bab83e0c4415d283b7671/SqlHBase-0.4.tar.gz" } ] }