{ "info": { "author": "Niels Zeilemaker, Giovanni Lanzani", "author_email": "niels@zeilemaker.nl, giovanni@lanzani.nl", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.2", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Database :: Front-Ends" ], "description": "ParallelConnection\n==================\n\nThis class manages multiple database connections, handles the parallel\naccess to it, and hides the complexity this entails. The execution of\nqueries is distributed by running it for each connection in parallel.\nThe result (as retrieved by fetchall() and fetchone()) is the union of\nthe parallelized query results from each connection.\n\nThe use case we had in mind when we created this class was having\nsharded tables (distributed across n database instances) that we needed\nto query concurrently and merging the results.\n\nSee below for an architecture overview for possible inspiration.\n\nUsage\n-----\n\nThe package is in principle database independent, as long as you expect\na connection object to respond to (part of) these methods:\n\n.. code:: python\n\n cursor\n cursor.execute\n cursor.fetchone\n cursor.fetchall\n cursor.mogrify\n cursor.commit\n cursor.close\n putconn\n\nTo use in its simplest form, do (example using psycopg2 and assuming\nthat dsns is a list containing your databases' connection strings)\n\n.. code:: python\n\n n = 20 # maximum connections per pool\n pools = [psycopg2.ThreadedConnectionPool(1, n, dsn=d) for d in dsns]\n connections = [p.getconn() for p in pools]\n pdb = ParallelConnection(connections)\n\nThe ``pdb`` object works for the rest like a normal, single, database\nconnection object, but it merges results return by each database. You\ncan therefore use it like so:\n\n.. code:: python\n\n c = pdb.cursor(cursor_factory=psycopg2.extras.RealDictCursor)\n c.execute(\"SELECT * FROM my_shrd_tbl WHERE shrd_column = 1543\", parameters)\n results = c.fetchall()\n c.close\n\nResults will fetch everything from all database. In case your query has\na where in the sharded column (``shrd_column``) the results from all but\none databases will be empty. This is fine, as the package handles it for\nyou. When it gets more interesting is a query like\n\n.. code:: python\n\n c = pdb.cursor(cursor_factory=psycopg2.extras.RealDictCursor)\n c.execute(\"SELECT * FROM my_shrd_tbl WHERE not_shrd_column = 543\", parameters)\n results = c.fetchall()\n c.close\n\nIn this case the query does *not* have a ``WHERE`` on a sharded column,\nso the package will fetch results from each database and merge them. Why\nthat may be of interest for you, will be shown below.\n\nIf you are executing a query on a non-sharded table you should use a\nnormal connection object.\n\nArchitectural motivation\n------------------------\n\nWe found ourselves having long running queries that where aggregating\nrecords on a particular column (let's call it ``shrd_column``). To\nreduce the run-time we decided to split only the table(s) containing\n``shrd_column`` between multiple databases, and have each database have\na copy of all non-sharded tables.\n\nThen each query grouping on ``shrd_column`` can be basically be executed\nindependently in each databases. The results still need to be merged\nthough, so that's why we build this package (we call it package even if\nNiels insists on calling it \"just a class\").\n\nFAQ\n---\n\n**Q.** Why don't you use things as\n`pg\\_shard `__?\n\n**A.** Because pg\\_shard doesn't handle ``JOIN`` on the distributed\nquery, which we want to do. Our package has the additional advantage\nthat all the databases are completely unaware of each other. It all\nhappens on the application layer and on the ingestion layer.\n\n--------------\n\n**Q.** What about if a machine goes down, etc.?\n\n**A.** Just use two machines with a load balancer in front of them.\n\n--------------\n\n**Q.** What about ``INSERT``?\n\n**A.** Yeah, we don't do that sort of things (it's read-only application\nfor explorative purposes). But feel to see if it works, and fix it plus\nsubmitting a PR if it doesn't.\n\n--------------\n\n**Q.** What about how to shard the data? This package does nothing,\nNiels is right!\n\n**A.** We trust you are savvy enough to do that by yourself before\ningestion. We could help you with that though, just drop\n`us `__ a line.\n\n--------------\n\n**Q.** I want to know more!\n\n**A.** That's technically not a question, but you can begin by watching\nNiels present the project at PyData Paris 2015. It's on\n`Youtube `__.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/godatadriven/ParallelConnection", "keywords": "development databases parallel", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "parallel-connection", "package_url": "https://pypi.org/project/parallel-connection/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/parallel-connection/", "project_urls": { "Homepage": "https://github.com/godatadriven/ParallelConnection" }, "release_url": "https://pypi.org/project/parallel-connection/0.1.2/", "requires_dist": [ "nose; extra == 'test'" ], "requires_python": "", "summary": "A package to query Postgres databases in parallel", "version": "0.1.2" }, "last_serial": 1972540, "releases": { "0.0.1": [], "0.1.1": [ { "comment_text": "", "digests": { "md5": "c3dacba9bdfd02aed4f842f3c3227e11", "sha256": "bea320e2d4c17d2bae4ebb00e2444a27230c44168053cc60f34a2b5f4188e2e8" }, "downloads": -1, "filename": "parallel_connection-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c3dacba9bdfd02aed4f842f3c3227e11", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 7391, "upload_time": "2015-08-21T12:35:04", "url": "https://files.pythonhosted.org/packages/e5/5d/6c0712917449662f006ded7df0c36bc9acd0df98e68105bb52158468e53a/parallel_connection-0.1.1-py2.py3-none-any.whl" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "c3832447c54575484fee62bacddfa15f", "sha256": "be75a7ca03db93a9be7385a4a1d02297d7c67c6767f1be0eaf3a6ea2fbc9482c" }, "downloads": -1, "filename": "parallel_connection-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c3832447c54575484fee62bacddfa15f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 7363, "upload_time": "2016-02-23T21:12:49", "url": "https://files.pythonhosted.org/packages/a1/92/7535166d52a65e6adfbedfc2e68aa6778eb2642f6cbf7cdfc23b6aefc734/parallel_connection-0.1.2-py2.py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c3832447c54575484fee62bacddfa15f", "sha256": "be75a7ca03db93a9be7385a4a1d02297d7c67c6767f1be0eaf3a6ea2fbc9482c" }, "downloads": -1, "filename": "parallel_connection-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c3832447c54575484fee62bacddfa15f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 7363, "upload_time": "2016-02-23T21:12:49", "url": "https://files.pythonhosted.org/packages/a1/92/7535166d52a65e6adfbedfc2e68aa6778eb2642f6cbf7cdfc23b6aefc734/parallel_connection-0.1.2-py2.py3-none-any.whl" } ] }