{ "info": { "author": "David J. C. Beach", "author_email": "info@djcinnovations.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Operating System :: Microsoft :: Windows", "Operating System :: Unix", "Programming Language :: Python", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: Implementation :: CPython", "Topic :: Database" ], "description": "Introduction\r\n============\r\n\r\nMongoDB is a document-oriented database, organized for quick access to records\r\n(or rows) of data. When doing analytics on a large data set, it is often\r\ndesirable to have it in a column-oriented format. Columns of data may be\r\nthought of as mathematical vectors, and a wealth of techniques exist for\r\ngathering statistics about data that is stored in vector form.\r\n\r\nFor *small to medium* sized collections, it is possible to materialize several\r\ncolumns of data in the memory of a modern PC. For example, an array of 100\r\nmillion double-precision numbers consumes 800 million bytes, or about 0.75 GB.\r\nFor larger problems, it's still possible to materialize a substantial portion\r\nof the data, or to work with data in multiple segments. (Very large problems\r\nrequire more powerful weapons, such as map/reduce.)\r\n\r\nExtracting column data from MongoDB using Python is fairly straightforward. In\r\nPyMongo, ``collection.find()`` generates a sequence of dictionary objects.\r\nWhen dealing with millions of records, the trick is not to keep these\r\ndictionaries in memory, as they tend to be large. Fortunately, it's easy to\r\nmove the data in to arrays as it is loaded.\r\n\r\nFirst, let's create 3.5 million rows of test data:\r\n\r\n.. code-block:: python\r\n\r\n #!/usr/bin/env python\r\n import random\r\n import pymongo\r\n\r\n NUM_BATCHES = 3500\r\n BATCH_SIZE = 1000\r\n # 3500 batches * 1000 per batch = 3.5 million records\r\n\r\n c = pymongo.MongoClient()\r\n collection = c.mydb.collection\r\n\r\n for i in xrange(NUM_BATCHES):\r\n stuff = [ ]\r\n for j in xrange(BATCH_SIZE):\r\n record = dict(x1=random.uniform(0, 1),\r\n x2=random.uniform(0, 2),\r\n x3=random.uniform(0, 3),\r\n x4=random.uniform(0, 4),\r\n x5=random.uniform(0, 5)\r\n )\r\n stuff.append(record)\r\n collection.insert(stuff)\r\n\r\nHere's an example that uses numpy arrays:\r\n\r\n.. code-block:: python\r\n\r\n #!/usr/bin/env python\r\n import numpy\r\n import pymongo\r\n\r\n c = pymongo.MongoClient()\r\n collection = c.mydb.collection\r\n num = collection.count()\r\n arrays = [ numpy.zeros(num) for i in range(5) ]\r\n\r\n for i, record in enumerate(collection.find()):\r\n for x in range(5):\r\n arrays[x][i] = record[\"x%i\" % x+1]\r\n\r\n for array in arrays: # prove that we did something...\r\n print numpy.mean(array)\r\n\r\nWith 3.5 million records, this query takes 85 seconds on an EC2 Large instance\r\nrunning Ubuntu 10.10 64-bit, and takes 88 seconds on my MacBook Pro (2.66 GHz\r\nIntel Core 2 Duo with 8 GB RAM).\r\n\r\nThese timings might seem impressive, given that they're loading 200,000+ values\r\nper second. However, closer examination reveals that much of that time is\r\nspent by pymongo as it reads each query result and transforms the BSON result\r\nto a Python dictionary. (If you watch the CPU usage, you'll see Python is\r\nusing 90% or more of the CPU.)\r\n\r\nMonary\r\n======\r\n\r\nIt is possible to get (much) more speed from the query if we bypass the PyMongo\r\ndriver. To demonstrate this, I've developed *monary*, a simple C library and\r\naccompanying Python wrapper which make use of MongoDB C driver. The code is\r\ndesigned to accept a list of desired fields, and to load exactly those fields\r\nfrom the BSON results into some provided array storage.\r\n\r\nHere's an example of the same query using monary:\r\n\r\n.. code-block:: python\r\n\r\n #!/usr/bin/env python\r\n\r\n from monary import Monary\r\n import numpy\r\n\r\n with Monary(\"127.0.0.1\") as monary:\r\n arrays = monary.query(\r\n \"mydb\", # database name\r\n \"collection\", # collection name\r\n {}, # query spec\r\n [\"x1\", \"x2\", \"x3\", \"x4\", \"x5\"], # field names (in Mongo record)\r\n [\"float64\"] * 5 # Monary field types (see below)\r\n )\r\n\r\n for array in arrays: # prove that we did something...\r\n print numpy.mean(array)\r\n\r\nMonary is able to perform the same query in 4 seconds flat, for a rate of about\r\n4 million values per second (20 times faster!) Here's a quick summary of how\r\nthis Monary query stacks up against PyMongo:\r\n\r\n* PyMongo Insert -- EC2: 102 s -- Mac: 76 s\r\n* PyMongo Query -- EC2: 85 s -- Mac: 88 s\r\n* Monary Query -- EC2: 5.4 s -- Mac: 3.8 s\r\n\r\nOf course, this test has created some fairly ideal circumstances: It's\r\nquerying for every record in the collection, the records contain only the\r\nqueried data (plus ObjectIDs), and the database is running locally. The\r\nperformance may degrade if we used a remote server, if the records were larger,\r\nor if queried for a only subset of the records (requiring either that more\r\nrecords be scanned, or that an index be used).\r\n\r\nMonary now knows about the following types:\r\n\r\n* id (Mongo's 12-byte ObjectId)\r\n* int8\r\n* int16\r\n* int32\r\n* int64\r\n* float32\r\n* float64\r\n* bool\r\n* date (stored as int64, milliseconds since epoch)\r\n\r\nMonary's source code is available on bitbucket. It includes a copy of the\r\nMongo C driver, and requires compilation and installation, which can be done\r\nvia the included \"setup.py\" file. (The installation script works, but is in a\r\nsomewhat rough state. Any help from a distutils guru would be greatly\r\nappreciated!) To run Monary from Python, you will need to have the pymongo and\r\nnumpy packages installed.\r\n\r\nMonary has been slowly gaining functionality (including the recent additions of\r\nmore numeric types and the date type). Here are some planned future\r\nimprovements:\r\n\r\n* Support for string / binary types\r\n \r\n (I hope to develop Monary to support some reasonable mapping of most BSON\r\n types onto array storage.)\r\n\r\n* Support for fetching nested fields (e.g. \"x.y\")\r\n\r\n* Remove dependencies on PyMongo and NumPy (possibly)\r\n\r\n (Currently these must be installed in order to use Monary.)", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://bitbucket.org/djcbeach/monary/", "keywords": "monary pymongo mongo mongodb numpy array", "license": "Apache 2", "maintainer": "", "maintainer_email": "", "name": "Monary", "package_url": "https://pypi.org/project/Monary/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/Monary/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://bitbucket.org/djcbeach/monary/" }, "release_url": "https://pypi.org/project/Monary/0.5.0/", "requires_dist": null, "requires_python": null, "summary": "Monary performs high-performance column queries on MongoDB", "version": "0.5.0" }, "last_serial": 2280777, "releases": { "0.4.0": [], "0.4.0.post1": [ { "comment_text": "", "digests": { "md5": "929122d8050e5f711d8235b91a849a35", "sha256": "576db023c8cff19952c04460d1bca59f481d39b966272d281209fa1ba8af8d91" }, "downloads": -1, "filename": "Monary-0.4.0.post1.tar.gz", "has_sig": false, "md5_digest": "929122d8050e5f711d8235b91a849a35", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33770, "upload_time": "2015-04-30T21:52:49", "url": "https://files.pythonhosted.org/packages/42/d7/dcd17c167f21c5f1890d8039ea74c9a6fb16a1656ee6d6a07ae409cd0f86/Monary-0.4.0.post1.tar.gz" } ], "0.4.0.post2": [ { "comment_text": "", "digests": { "md5": "c9b145bb9c7b2aa4efaac7d851a074e1", "sha256": "4df96c5f84968694575cff3c9d30818ce02665449c28039d2596c141303a5574" }, "downloads": -1, "filename": "Monary-0.4.0.post2.tar.gz", "has_sig": false, "md5_digest": "c9b145bb9c7b2aa4efaac7d851a074e1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33777, "upload_time": "2015-04-30T22:19:44", "url": "https://files.pythonhosted.org/packages/88/8c/ba8cb641dfc500e511356ddc0861d3e1bc1ee24698bffd4306b7d03b5d9a/Monary-0.4.0.post2.tar.gz" } ], "0.5.0": [ { "comment_text": "", "digests": { "md5": "0baf204f6bea7f0406d47f1bf57b03bd", "sha256": "a69a3f9ddd493958432b0a57f37d6efed757bb0493ecc8d6c271685cc6505268" }, "downloads": -1, "filename": "Monary-0.5.0.tar.gz", "has_sig": false, "md5_digest": "0baf204f6bea7f0406d47f1bf57b03bd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33797, "upload_time": "2016-08-14T16:15:23", "url": "https://files.pythonhosted.org/packages/35/b6/230a3ec114337e324f372106b83a88efe2043f9adda551292ff57cc1262d/Monary-0.5.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0baf204f6bea7f0406d47f1bf57b03bd", "sha256": "a69a3f9ddd493958432b0a57f37d6efed757bb0493ecc8d6c271685cc6505268" }, "downloads": -1, "filename": "Monary-0.5.0.tar.gz", "has_sig": false, "md5_digest": "0baf204f6bea7f0406d47f1bf57b03bd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33797, "upload_time": "2016-08-14T16:15:23", "url": "https://files.pythonhosted.org/packages/35/b6/230a3ec114337e324f372106b83a88efe2043f9adda551292ff57cc1262d/Monary-0.5.0.tar.gz" } ] }