{ "info": { "author": "Thomas Levine", "author_email": "_@thomaslevine.com", "bugtrack_url": null, "classifiers": [], "description": "lcw: Estimate the number of lines in a file.\n==============================================\nlcw is like ``wc -l`` but faster, less precise, and equally accurate. ::\n\n usage: lcw [-h] [--sample-size N] [--page-size PAGE_SIZE] [--pattern PATTERN]\n [--regex]\n file [file ...]\n\n Estimate how many lines are in a file.\n\n positional arguments:\n file\n\n optional arguments:\n -h, --help show this help message and exit\n --sample-size N, -n N\n How many pages to count (default: 1000)\n --page-size PAGE_SIZE, -p PAGE_SIZE\n Size of an observation (default: 16384)\n --pattern PATTERN, -e PATTERN\n The pattern to match (default: b'\\n')\n --regex, -r Use regular expressions (statistically unsound)\n (default: False)\n\nSpeed\n--------\nIt's faster than ``wc -l`` on big files. ::\n\n $ wc -c big-file.csv\n 1071895374 big-file.csv\n\n $ time lcw big-file.csv\n 2386238 \u00b1 22903 lines (99% confidence)\n\n real 0m0.172s\n user 0m0.140s\n sys 0m0.027s\n\n $ time wc -l big-file.csv\n 2388430 big-file.csv\n\n real 0m1.379s\n user 0m1.170s\n sys 0m0.197s\n\nMath\n------\nlcw uses elementary statistics to perform unbiased estimates of the\nnumber of lines in a file. It takes a random sample of \"pages\" within\nthe file and counts how many newlines are in each page.\n\nIt multiplies the average count by the number of pages in the file\nin order to get its best guess at the number of lines in the file\n(the maximum likelihood estimate) and then computes a 99% normal\nconfidence interval, applying a finite population correction for the\nestimate the standard deviation of sample totals.\n\nTuning\n--------\nIt is best to use the page size that your storage medium uses;\nmodern storage media read entire pages at once, so using a page size\nthat is too small will be bad for performance.\n\nThe sample size is set with ``-n``, and typical rules of thumb say\nthat this should be at least 20 for the confidence level to be valid.\nThe page size is set with ``-p`` and should be something like\n2048, 4096, 8192, or 16384.\n\nMatching things other than newline\n-----------------------------------\nYou can count occurrences of a string other than newline; specify\nthe string with ``-e``. It will be interpreted as a regular expression\nif you pass ``-r``. The statistical estimates do not account for the\nvariable length of regular expression matches, so you are better off\nusing plain strings if you care about accuracy.\n\nFuture work\n--------------\nI have been thinking about how to quickly sample from lots of files.\nThings like lcw can help us with samples within files, but it can be\ncould be part of a broader survey plan, with cluster sampling or\nstratification on directories or filenames and with multistage sampling,\nusing pilot tests to estimate the costs of the sampling of different\nfiles.\n\nlcw presently uses a simple random sample. Because data in text files\noften vary with their position within the file, (Later lines often\ncorrespond to later dates.) systematic sampling would be appropriate.\n\nOr, does this already exist so I don't have to write it?\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://dada.pink/lcw/", "keywords": null, "license": "LGPL", "maintainer": null, "maintainer_email": null, "name": "lcw", "package_url": "https://pypi.org/project/lcw/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/lcw/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://dada.pink/lcw/" }, "release_url": "https://pypi.org/project/lcw/0.0.5/", "requires_dist": null, "requires_python": null, "summary": "Estimate the number of lines in a file.", "version": "0.0.5" }, "last_serial": 1680191, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "d06e37f377d3370eb4546511e4a7f867", "sha256": "b131b335eb54b42353c1ee0815ea60c68bc289da9647baf343c569e38f30ced5" }, "downloads": -1, "filename": "lcw-0.0.1.tar.gz", "has_sig": false, "md5_digest": "d06e37f377d3370eb4546511e4a7f867", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2783, "upload_time": "2015-08-17T05:46:50", "url": "https://files.pythonhosted.org/packages/cf/d3/c1d54222ad2f61137d6ecfa9601dbceeceec8f7882e86b45dc425eaccb64/lcw-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "d6e60d1fb8d4e4dca7eb353672c7d745", "sha256": "67739de8a31423d67d7effcf5e69751bccf59ca739483f0203c434bc6888f365" }, "downloads": -1, "filename": "lcw-0.0.2.tar.gz", "has_sig": false, "md5_digest": "d6e60d1fb8d4e4dca7eb353672c7d745", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3564, "upload_time": "2015-08-17T06:59:10", "url": "https://files.pythonhosted.org/packages/a2/5c/3c954f05133b1fa7ace4fd7890aa3c1d15a654bed60de4cdb08bde2677df/lcw-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "edff184154e920ecafe80e2d91a87a0c", "sha256": "d0838e3ede795ffc43eb1e0316b8cdf219163bff97869c057b558e78a5d0552f" }, "downloads": -1, "filename": "lcw-0.0.3.tar.gz", "has_sig": false, "md5_digest": "edff184154e920ecafe80e2d91a87a0c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3565, "upload_time": "2015-08-17T07:01:04", "url": "https://files.pythonhosted.org/packages/e6/ce/a07c54d894cb981486242ad5c940aecd9e6398f028e689422b6a9a1cdc0f/lcw-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "d7a21b19cfa30c73ffbbaf41fa3e0077", "sha256": "6b282c308ac7e4003910e80802dc2d038027cdf51c3b9317571e102af3bdd8d9" }, "downloads": -1, "filename": "lcw-0.0.4.tar.gz", "has_sig": false, "md5_digest": "d7a21b19cfa30c73ffbbaf41fa3e0077", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3555, "upload_time": "2015-08-17T07:07:29", "url": "https://files.pythonhosted.org/packages/1b/55/27e091b7504a40aaa4b655bea010a247e9ff7344285a1c9b61dbeb341b08/lcw-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "3bae78f0eb19536c35e6084fff96de80", "sha256": "bcb44e82b27150a3e9fdee400ffa435a4c97124fbaeb97a6b62bd3c9814932dc" }, "downloads": -1, "filename": "lcw-0.0.5.tar.gz", "has_sig": false, "md5_digest": "3bae78f0eb19536c35e6084fff96de80", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3577, "upload_time": "2015-08-17T07:10:14", "url": "https://files.pythonhosted.org/packages/2b/c0/7c5b2e662049e789f261f018574a7f2ecf64aef83f756767be9fa404b74e/lcw-0.0.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3bae78f0eb19536c35e6084fff96de80", "sha256": "bcb44e82b27150a3e9fdee400ffa435a4c97124fbaeb97a6b62bd3c9814932dc" }, "downloads": -1, "filename": "lcw-0.0.5.tar.gz", "has_sig": false, "md5_digest": "3bae78f0eb19536c35e6084fff96de80", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3577, "upload_time": "2015-08-17T07:10:14", "url": "https://files.pythonhosted.org/packages/2b/c0/7c5b2e662049e789f261f018574a7f2ecf64aef83f756767be9fa404b74e/lcw-0.0.5.tar.gz" } ] }