{ "info": { "author": "eLife Sciences Publications, Ltd", "author_email": "", "bugtrack_url": null, "classifiers": [], "description": "# ScienceBeam Utils\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\nProvides utility functions related to the [ScienceBeam](https://github.com/elifesciences/sciencebeam) project.\n\nPlease refer to the [development documentation](https://github.com/elifesciences/sciencebeam-utils/blob/develop/doc/development.md)\nif you wish to contribute to the project.\n\nMost tools are not yet documented. Please feel free to browse the code or tests, or raise an issue.\n\n## Pre-requisites\n\n- Python 2.7 or 3 (Apache Beam may not fully support Python 3 yet)\n- [Apache Beam](https://beam.apache.org/)\n\n[Apache Beam](https://beam.apache.org/) may be used for preprocessing, and also for its transparent FileSystems API, which makes it easy to access files in the cloud.\n\n## Install\n\n```bash\npip install apache_beam[gcp]\n```\n\n```bash\npip install sciencebeam-utils\n```\n\n## CLI Tools\n\n### Find File Pairs\n\nThe preferred input layout is one directory per manuscript, containing a gzipped PDF (`.pdf.gz`) and a gzipped XML file (`.nxml.gz`), e.g.:\n\n- manuscript_1/\n - manuscript_1.pdf.gz\n - manuscript_1.nxml.gz\n- manuscript_2/\n - manuscript_2.pdf.gz\n - manuscript_2.nxml.gz\n\nUsing compressed files is optional but recommended to reduce file storage cost.\n\nThe parent directory per manuscript is optional. 
If there is no parent directory, the name before the extension must be identical for both files (which is recommended in general).\n\nRun:\n\n```bash\npython -m sciencebeam_utils.tools.find_file_pairs \\\n--data-path \\\n--source-pattern *.pdf.gz --xml-pattern *.nxml.gz \\\n--out \n```\n\ne.g.:\n\n```bash\npython -m sciencebeam_utils.tools.find_file_pairs \\\n--data-path gs://some-bucket/some-dataset \\\n--source-pattern *.pdf.gz --xml-pattern *.nxml.gz \\\n--out gs://some-bucket/some-dataset/file-list.tsv\n```\n\nThat will create the TSV (tab-separated) file `file-list.tsv` with the following columns:\n\n- _source_url_\n- _xml_url_\n\nThat file could also be generated using any other preferred method.\n\n### Split File List\n\nTo split the file list into _training_, _validation_ and _test_ datasets, the following script can be used:\n\n```bash\npython -m sciencebeam_utils.tools.split_csv_dataset \\\n--input \\\n--train 0.5 --validation 0.2 --test 0.3 --random --fill\n```\n\ne.g.:\n\n```bash\npython -m sciencebeam_utils.tools.split_csv_dataset \\\n--input gs://some-bucket/some-dataset/file-list.tsv \\\n--train 0.5 --validation 0.2 --test 0.3 --random --fill\n```\n\nThat will create three separate files in the same directory:\n\n- `file-list-train.tsv`\n- `file-list-validation.tsv`\n- `file-list-test.tsv`\n\nThe file pairs will be randomly selected (_--random_) and one group will also include all remaining file pairs that would otherwise be left out due to rounding (_--fill_).\n\nAs with the previous step, you may decide to use your own process instead.\n\nNote: these files should not be changed once you have started using them.\n\n### Get Output Files\n\nSince ScienceBeam is intended to convert files, there will be output files. To make the output filenames explicit,\nthe output files are also kept in a file list. 
This tool will generate the file list (it doesn't matter whether the files actually exist for this purpose).\n\ne.g.\n\n```bash\npython -m sciencebeam_utils.tools.get_output_files \\\n --source-file-list path/to/source/file-list-train.tsv \\\n --source-file-column=source_url \\\n --output-file-suffix=.xml \\\n --output-file-list path/to/results/file-list.lst\n```\n\nBy adding the `--check` argument, it will check whether the output files exist (see below).\n\n### Check File List\n\nAfter generating an output file list, this tool can be used to check whether the output files exist or are complete.\n\ne.g.\n\n```bash\npython -m sciencebeam_utils.tools.check_file_list \\\n --file-list path/to/results/file-list.lst \\\n --file-column=source_url \\\n --limit=100\n```\n\nThis will check the first 100 output files and report on them. The command will fail if none of the output files exist.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/elifesciences/sciencebeam-utils", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "sciencebeam-utils", "package_url": "https://pypi.org/project/sciencebeam-utils/", "platform": "", "project_url": "https://pypi.org/project/sciencebeam-utils/", "project_urls": { "Homepage": "https://github.com/elifesciences/sciencebeam-utils" }, "release_url": "https://pypi.org/project/sciencebeam-utils/0.1.0/", "requires_dist": [ "backports.tempfile (>=1.0)", "backports.csv (>=1.0)", "lxml (>=4.2.0)", "numpy (>=1.14.2)", "six (>=1.11.0)", "tqdm (>=4.0.0)" ], "requires_python": "", "summary": "ScienceBeam Utils", "version": "0.1.0" }, "last_serial": 5710703, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "fccedbb56fde87845dd38bdb4b84244d", "sha256": "bfb5f64bdd06656ba7c41984bc193ab43698637f714732305bbff8f16aa22def" }, "downloads": -1, "filename": 
"sciencebeam_utils-0.0.1-py2-none-any.whl", "has_sig": false, "md5_digest": "fccedbb56fde87845dd38bdb4b84244d", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 35699, "upload_time": "2018-08-23T09:51:21", "url": "https://files.pythonhosted.org/packages/c3/eb/5c8df6ad9c944220e2143f874098640f03565a4cdb1fd795f626cd17bfa0/sciencebeam_utils-0.0.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a6a01434a98552eb9f8220084ea7ce06", "sha256": "a50cd0c2a02971557375ac22a303b8afa90c6cc0b36282e600a665599208e346" }, "downloads": -1, "filename": "sciencebeam_utils-0.0.1.tar.gz", "has_sig": false, "md5_digest": "a6a01434a98552eb9f8220084ea7ce06", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22129, "upload_time": "2018-08-23T09:51:22", "url": "https://files.pythonhosted.org/packages/97/1e/4fbea1d7d4e036bb30c22e8f99adb53626f2c7bf543fe0314bfdcdc53ca5/sciencebeam_utils-0.0.1.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "3c1f4a49c86260f583a71014b56f2996", "sha256": "17a70c452ea90ec6473fc45c92ffcf525bdbb29abc01cb6d4ca26f441283ef82" }, "downloads": -1, "filename": "sciencebeam_utils-0.0.4-py2-none-any.whl", "has_sig": false, "md5_digest": "3c1f4a49c86260f583a71014b56f2996", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 39129, "upload_time": "2019-05-09T14:56:59", "url": "https://files.pythonhosted.org/packages/79/fa/987da7ce548b488f83ee318d201ddb124eecda5997cea2cef1a3e9a2c29b/sciencebeam_utils-0.0.4-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cc8444ca5faa3a79a5c4b551489cc31b", "sha256": "a7400d66ca846451288693c54178f2d5f6c6823fbfe4126b2e0740a351bcd948" }, "downloads": -1, "filename": "sciencebeam_utils-0.0.4.tar.gz", "has_sig": false, "md5_digest": "cc8444ca5faa3a79a5c4b551489cc31b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24797, "upload_time": "2019-05-09T14:57:00", 
"url": "https://files.pythonhosted.org/packages/83/7a/1ea187a13fe7afddfab0d754c6c81bb219406bb94484a45f5948f2b8d7f4/sciencebeam_utils-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "3a8cea644c549f5f38ff9fdc295589ec", "sha256": "7cb847fd3735355226c8caadcd19eb2c7c734adea4c7510ad85d75379ae6a336" }, "downloads": -1, "filename": "sciencebeam_utils-0.0.5-py2-none-any.whl", "has_sig": false, "md5_digest": "3a8cea644c549f5f38ff9fdc295589ec", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 39340, "upload_time": "2019-05-09T15:39:29", "url": "https://files.pythonhosted.org/packages/ce/45/5ad195276a2f46a51884dcff9917cb103822041b826f03f1de6d9e0fb6c8/sciencebeam_utils-0.0.5-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "99be451a8f14abe043bcc30834fd1eea", "sha256": "8ba7da917c1349f04f497af9b7e3396c18fe5a07490c1bd1c0679423df52eb85" }, "downloads": -1, "filename": "sciencebeam_utils-0.0.5.tar.gz", "has_sig": false, "md5_digest": "99be451a8f14abe043bcc30834fd1eea", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24978, "upload_time": "2019-05-09T15:39:30", "url": "https://files.pythonhosted.org/packages/0d/2f/4bf410165628370d004e03a19611d81142a4ffcf91dbf0ea9f042c461a86/sciencebeam_utils-0.0.5.tar.gz" } ], "0.0.6": [ { "comment_text": "", "digests": { "md5": "b70dc4e0d0b27ca070087f9fd140c662", "sha256": "a52067fdb937e8a480d6e5d8a78f7529e4b07e97100d0ce777be74c26b278b15" }, "downloads": -1, "filename": "sciencebeam_utils-0.0.6-py2-none-any.whl", "has_sig": false, "md5_digest": "b70dc4e0d0b27ca070087f9fd140c662", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 41536, "upload_time": "2019-08-02T14:21:52", "url": "https://files.pythonhosted.org/packages/34/ea/15f4287c3244eece28f391995aad86df32d40f2e3af9188d5b3231d1dbc4/sciencebeam_utils-0.0.6-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": 
"4318352f9ab08201b8e992c91b746410", "sha256": "314165da535fa0f53f89c4318fbe115c652e9d9645eb38850ac0a58e09d7c915" }, "downloads": -1, "filename": "sciencebeam_utils-0.0.6.tar.gz", "has_sig": false, "md5_digest": "4318352f9ab08201b8e992c91b746410", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27368, "upload_time": "2019-08-02T14:21:53", "url": "https://files.pythonhosted.org/packages/a2/0d/678a6b121a0621bbf4f5be51c1f2d10dc7e51e16d69f3fd745a5cc511336/sciencebeam_utils-0.0.6.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "fbe2b9b1c4784d11dfcaf860a56fdabc", "sha256": "856660bfb4ec1663c32b7ed8b26923aea336669d5b818fc8d84c6965450c4262" }, "downloads": -1, "filename": "sciencebeam_utils-0.1.0-py2-none-any.whl", "has_sig": false, "md5_digest": "fbe2b9b1c4784d11dfcaf860a56fdabc", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 42750, "upload_time": "2019-08-21T16:44:58", "url": "https://files.pythonhosted.org/packages/a6/08/ccc18ae2e60586c2b1e353d0cd4c9ea8f84e8d0406d74ca2c4bc0887a9b1/sciencebeam_utils-0.1.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "df316cd9eb0bcdf7ddd5c0eaa74a53b1", "sha256": "f08c6eb19695f8b8142f548e5277da6dc8c1de1f2d4e10c213ac22f4196fa9c2" }, "downloads": -1, "filename": "sciencebeam_utils-0.1.0.tar.gz", "has_sig": false, "md5_digest": "df316cd9eb0bcdf7ddd5c0eaa74a53b1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27950, "upload_time": "2019-08-21T16:44:59", "url": "https://files.pythonhosted.org/packages/a8/d7/de7f0eefe8f4f3255d4c4e3172b93714f223c094605d1b209bdf4d7bd1e0/sciencebeam_utils-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fbe2b9b1c4784d11dfcaf860a56fdabc", "sha256": "856660bfb4ec1663c32b7ed8b26923aea336669d5b818fc8d84c6965450c4262" }, "downloads": -1, "filename": "sciencebeam_utils-0.1.0-py2-none-any.whl", "has_sig": false, "md5_digest": 
"fbe2b9b1c4784d11dfcaf860a56fdabc", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 42750, "upload_time": "2019-08-21T16:44:58", "url": "https://files.pythonhosted.org/packages/a6/08/ccc18ae2e60586c2b1e353d0cd4c9ea8f84e8d0406d74ca2c4bc0887a9b1/sciencebeam_utils-0.1.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "df316cd9eb0bcdf7ddd5c0eaa74a53b1", "sha256": "f08c6eb19695f8b8142f548e5277da6dc8c1de1f2d4e10c213ac22f4196fa9c2" }, "downloads": -1, "filename": "sciencebeam_utils-0.1.0.tar.gz", "has_sig": false, "md5_digest": "df316cd9eb0bcdf7ddd5c0eaa74a53b1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27950, "upload_time": "2019-08-21T16:44:59", "url": "https://files.pythonhosted.org/packages/a8/d7/de7f0eefe8f4f3255d4c4e3172b93714f223c094605d1b209bdf4d7bd1e0/sciencebeam_utils-0.1.0.tar.gz" } ] }