{ "info": { "author": "Thomas Levine", "author_email": "_@thomaslevine.com", "bugtrack_url": null, "classifiers": [], "description": "Sample lines from a file that has already been written.\n\nInstall\n----------\nInstall like so. ::\n\n pip install sample-lines\n\nHow to\n----------\nSee the help for documentation. ::\n\n sample-lines -h\n usage: Randomly select lines from a file. [-h] [--sample-size N]\n [--method {simple-random,systematic}]\n [--repeat REPEAT]\n file\n\n positional arguments:\n file\n\n optional arguments:\n -h, --help show this help message and exit\n --sample-size N, -n N\n Number of lines to emit\n --method {simple-random,systematic}, -m {simple-random,systematic}\n Sampling method\n --repeat REPEAT, -r REPEAT\n Number of repetitions for systematic sampling\n\nSamples are with replacement and weighted by line length. The probability\nof selecting a line is proportional the length of the previous line.\nThis allows us to\nsample very quickly, but it makes this approach appropriate only if your\nfile has reasonably consistent line lengths or at least if there is no\nperiodic variation in line length.\n\nHow fast\n----------\nConsider this 1-gigabyte CSV file. ::\n\n $ wc big-file.csv\n 2388430 27673790 1071895374 big-file.csv\n\nRunning ``wc`` took three seconds. ::\n\n time wc big-file.csv\n 2388430 27673790 1071895374 big-file.csv\n\n real 0m3.789s\n user 0m3.560s\n sys 0m0.190s\n\nHere's how long it takes to parse the whole file. ::\n\n $ time python3 -c 'for line in open(\"big-file.csv\"): pass'\n\n real 0m2.892s\n user 0m2.641s\n sys 0m0.245s\n\n``sample-lines`` is much faster. Here's a simple random sample of 40 lines, ::\n\n $ time sample-lines -n 40 -m simple-random big-file.csv > /dev/null\n\n real 0m0.136s\n user 0m0.113s\n sys 0m0.018s\n\na systematic sample of 40 lines, ::\n\n $ time sample-lines -n 40 -m systematic -r 4 big-file.csv > /dev/null\n\n real 0m0.148s\n user 0m0.122s\n sys 0m0.019s\n\nand repeated systematic sample, with 4 repeats and 10 lines each, for\na total of 40 lines. ::\n\n $ time sample-lines -n 10 -m systematic -r 4 big-file.csv > /dev/null\n\n real 0m0.175s\n user 0m0.140s\n sys 0m0.025s\n\nMost of the time in the above examples was spent loading Python and the\nvarious modules; printing the help takes almost as long as running the sample.\n\n::\n\n $ time sample-lines -h > /dev/null\n\n real 0m0.157s\n user 0m0.129s\n sys 0m0.021s\n\nSo even a pretty big sample is still fast to run. ::\n\n $ time sample-lines -n 2000 -m systematic -r 50 big-file.csv > /dev/null\n\n real 0m2.695s\n user 0m2.435s\n sys 0m0.231s\n\nAlternatives\n--------------\nUse `sample `_\nif you want to sample from a stream.\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://dada.pink/sample-lines/", "keywords": null, "license": "LGPL", "maintainer": null, "maintainer_email": null, "name": "sample-lines", "package_url": "https://pypi.org/project/sample-lines/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/sample-lines/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://dada.pink/sample-lines/" }, "release_url": "https://pypi.org/project/sample-lines/0.0.4/", "requires_dist": null, "requires_python": null, "summary": "Sample lines from a file.", "version": "0.0.4" }, "last_serial": 1678900, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "5b24fa6ab2ad908ea0780da824850cc7", "sha256": "69f636e9e7737dd763cd13d5501498886c20fa4ccd734e4357e088c6860272e4" }, "downloads": -1, "filename": "sample-lines-0.0.1.tar.gz", "has_sig": false, "md5_digest": "5b24fa6ab2ad908ea0780da824850cc7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1616, "upload_time": "2015-08-15T20:29:26", "url": "https://files.pythonhosted.org/packages/95/28/490a335aa5bab7355f29fa31ff2ce94c19b83ed17d805a8c7ddc66003845/sample-lines-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "abbf2ab198554a307d3e66bfc877912a", "sha256": "19f910626ad6139054a1143fb2031bce58a7ee78251bc5bad9b031a718517dbe" }, "downloads": -1, "filename": "sample-lines-0.0.2.tar.gz", "has_sig": false, "md5_digest": "abbf2ab198554a307d3e66bfc877912a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2512, "upload_time": "2015-08-15T21:00:33", "url": "https://files.pythonhosted.org/packages/0c/71/fc8080b76e9ee82002ab82278304af70efc37787f66ff86ca50bf2e90899/sample-lines-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "3a8cdd742cca978f57356ad1e3e771e1", "sha256": "99a409d85d7f965474c9558d46bcc3f6c027e96096e4be3ffc0cdccf57fd61b4" }, "downloads": -1, "filename": "sample-lines-0.0.3.tar.gz", "has_sig": false, "md5_digest": "3a8cdd742cca978f57356ad1e3e771e1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2523, "upload_time": "2015-08-15T21:06:36", "url": "https://files.pythonhosted.org/packages/c7/89/1734fe9794f56593621e66b9327b403916f2ae3212f18df9582b2ed796bf/sample-lines-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "81bfe730fe306ee32753c09fc46a969d", "sha256": "45fc4c35a9fea5d044d3dc3bd16de5bc03f38f5568ed33d58cb81850066f9afa" }, "downloads": -1, "filename": "sample-lines-0.0.4.tar.gz", "has_sig": false, "md5_digest": "81bfe730fe306ee32753c09fc46a969d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2602, "upload_time": "2015-08-15T21:17:23", "url": "https://files.pythonhosted.org/packages/78/f7/e5c40e9ea630d3a1de53ff5478536941ac2566d03be206af0c605537113d/sample-lines-0.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "81bfe730fe306ee32753c09fc46a969d", "sha256": "45fc4c35a9fea5d044d3dc3bd16de5bc03f38f5568ed33d58cb81850066f9afa" }, "downloads": -1, "filename": "sample-lines-0.0.4.tar.gz", "has_sig": false, "md5_digest": "81bfe730fe306ee32753c09fc46a969d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2602, "upload_time": "2015-08-15T21:17:23", "url": "https://files.pythonhosted.org/packages/78/f7/e5c40e9ea630d3a1de53ff5478536941ac2566d03be206af0c605537113d/sample-lines-0.0.4.tar.gz" } ] }