{ "info": { "author": "Dylan Culfogienis", "author_email": "dtc9bb@virginia.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# igem-wikiscraper\n\nA quick-and-dirty Python CLI tool for scraping iGEM wikis. Can be used to pull out any relevant HTML-based information. Note that this tool will not function on, for example, lab manuals involving PDFs, pages that dynamically load in content on user interaction or page scroll, or pages that redirect to other websites (not that this is allowed by iGEM).\n\n## Use\n\nThis tool currently only functions with Python3. Python2 compatibility is on the backburner. Any instructions assume you have Python3 installed and on PATH.\n\n1. Install via `pip` or download from [releases](https://github.com/Virginia-iGEM/igem-wikiscraper/releases)\n2. Create a new folder to hold all your information and scrapes. Add `data` and `output` subdirectories\n3. Copy `config.json` from [here](https://raw.githubusercontent.com/Virginia-iGEM/igem-wikiscraper/master/igemwikiscraper/config.json) into this new folder\n4. Retrieve data from http://igem.org/Team_List.\n5. Modify `config.json` as you see fit. Descriptions of config options are available [below](#configuration).\n6. Use either `wikiscraper` or `wikiscraper-gui`. See below.\n\n### GUI\n\nA GUI tool is available. This tool can be accessed via `wikiscraper-gui` once installed with `pip install igemwikiscraper`. Steps 1-5 above should be followed before using this GUI.\n\n### Terminal\n\nThese commands should work for Ubuntu. For Windows, substitute `pip` for `pip3`. For Mac or other UNIXes, you're on your own.\n\n```bash\npip3 install igemwikiscraper\nmkdir igem-scrapes\ncd igem-scrapes\nwget https://github.com/Virginia-iGEM/igem-wikiscraper/blob/master/igemwikiscraper/config.json\nnano config.json # modify as needed\nmkdir {data,output}\ncd data\nwget https://raw.githubusercontent.com/Virginia-iGEM/igem-wikiscraper/master/data/2018__team_list__2018-07-02.csv\ncd ..\nwikiscraper data/2018__team_list__2018-07-02.csv\n```\n\nOutput file can be found under `output/` or as configured.\n\n\n## Brief Description of Tool Function\n\nThe tool retrieves raw HTML from URLs generated from a list of team names and years. This raw HTML is then passed through BeautifulSoup, a combination HTML Parser and HTML Selector built specifically for webscraping. \n\nSpecific HTML tags are removed in preprocessing (such as script and style tags, which we don't need), then a set of tags is selected with an emulated JQuery selector (with the `htmlselector` option). This set of tags is then \"strained\" by their content, removing entries that are too short, do not contain sentences, and which contain any JavaScript or CSS motifs.\n\nAll of the children of selected tags are then unwrapped, discarding their attributes and keeping only their content in (what I believe is) a pre-ordered tree traversal. \n\nThis process occurs for each page specified by `subpages`, leaving us with a list of lists of strings for each team. This 2d list is flattened to 1d by joining the sublists with newlines, before being appended to team information and written out as one row in a .csv file specified by `outputfile`. This is repeated for each team in the teamlist, building up a list of team information paired with useful scraped data.\n\n## Configuration\n\nNote: Some of these options can be set when using the CLI. Enter `wikiscraper -h` for a list of options which can be set through the CLI. This is convenient for options like `-vv` which you may not want to see on every scrape attempt.\n\n- data: Options for what kind of data we take in and look for\n - subpages: Which pages on the wiki to look at. `\"\"` denotes the index. Note that each page is lead with a forward slash; every page must be lead with a forward slash. Examples: `/Team`, `/Safety`\n - filedelimiter: What kind of csv delimiter is the team name list using? The iGEM Team List page seems to generate CSV's with comma delimiters.\n - start: Which team in the list to start with. Set to negative to remove limit.\n - end: Which team in the list to end with. Set to negative to remove limit.\n- scraper: Options for how the scraper filters and interprets data\n - htmlselector: JQuery-style selector for html elements. `#content` seems to include all hand-written team info without including any extra junk. Upon inspecting many iGEM wikis, we found all content _should_ be wrapped in `#mw-content-text`, but because there are _so many_ (about 10% of) iGEM wikis with broken HTML, mostly in the form of unpaired HTML tags, we have to be a bit generous with our selectors.\n - gracetime: Number of seconds to wait between HTML GET requests. Strongly reccommend this to be kept at 1 or above; any lower is considered impolite by web scraping standards, may break the iGEM Wiki, and may be considered a crime (a Denial of Service Attack, albeit a poor one).\n - stripwhitespace: Removes leading and trailing whitespace. Cleans up output a little.\n - collapsenewlines: Removes strings of newlines in retrieved strings\n - excisescripts: Removes any script tags when preprocessing HTML\n - excisestyles: Removes any style tags when preprocessing HTML\n - use: Enables/disables filters listed below.\n - space_count: Filters strings by the number of spaces they contain. If a string has less than this number of spaces, it will be excluded.\n - period_count: Filters string by number of periods contained. If a string has less than this number of periods, it will be excluded.\n - negative_regex: Regex with double-escaped backslashes. If a string matches this regex, it will be excluded.\n- output: Options for the output file\n - outputfile: Path to the file we output our results to.\n - filedelimiter: Delimits columns in the output .csv file. Default is a vertical pipe, `|`, useful for plain HTML content.\n - filequotechar: Used to encapsulate escaped strings in .csv files. Doublequotes `\"` are fairly standard.\n - verbose: Set to `0` to print progress reports only. Set to `1` to print useful information above each wiki page scraped. Set to `2` to print all content scraped.\n\n## Todo\n\n- Expand collapsenewlines option to also collapse strings of newlines and tabs into just one newline\n- Make it so that options under `scraper.use` actually enable/disable the filter\n- Replace `excisescripts` and `excisestyles` flags with an HTML selector that excises arbitrary HTML elements\n- Either enable output to multiple files or ensure multiple inputs lead to a sane single output\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Virginia-iGEM/igem-wikiscraper", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "igemwikiscraper", "package_url": "https://pypi.org/project/igemwikiscraper/", "platform": "", "project_url": "https://pypi.org/project/igemwikiscraper/", "project_urls": { "Homepage": "https://github.com/Virginia-iGEM/igem-wikiscraper" }, "release_url": "https://pypi.org/project/igemwikiscraper/0.5.1/", "requires_dist": [ "requests", "beautifulsoup4", "lxml", "gooey (>=1.0.1)" ], "requires_python": "", "summary": "GUI+CLI+Library webscraper built specifically for iGEM team wiki pages", "version": "0.5.1" }, "last_serial": 4125276, "releases": { "0.4.3": [ { "comment_text": "", "digests": { "md5": "c9090fede2d14878b56c0431c436e934", "sha256": "1b33578ca6e07278c0ad450b0e5118b83ec379a8de8fdf11221219a88413a9b6" }, "downloads": -1, "filename": "igemwikiscraper-0.4.3-py3-none-any.whl", "has_sig": false, "md5_digest": "c9090fede2d14878b56c0431c436e934", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8883, "upload_time": "2018-08-01T03:53:33", "url": "https://files.pythonhosted.org/packages/aa/12/3dc6bc364cf4f86482c60c62ac9c3aca6231fb400a8878edcfd37f2c13fe/igemwikiscraper-0.4.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "713ff29001a5e0b2670ca036b5c11c98", "sha256": "7a8cd5d5ce587b66492f356cb07a89b6b9f0fb2fcb9eed9c6c2ea688c6354474" }, "downloads": -1, "filename": "igemwikiscraper-0.4.3.tar.gz", "has_sig": false, "md5_digest": "713ff29001a5e0b2670ca036b5c11c98", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7992, "upload_time": "2018-08-01T03:53:34", "url": "https://files.pythonhosted.org/packages/83/9c/41832f3981159cc698fbff5f8cd3732720fd2fb33c3fcbae7409f48ca1ba/igemwikiscraper-0.4.3.tar.gz" } ], "0.5.0": [ { "comment_text": "", "digests": { "md5": "d18a9b2d57c503baa6f30dd6c1e6cf0f", "sha256": "e5967d7cac093fb35a9e570ea9c826bd33725ca16a4c50ac3fb9bd9f4b6893f6" }, "downloads": -1, "filename": "igemwikiscraper-0.5.0-py3-none-any.whl", "has_sig": false, "md5_digest": "d18a9b2d57c503baa6f30dd6c1e6cf0f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8832, "upload_time": "2018-08-01T04:26:12", "url": "https://files.pythonhosted.org/packages/c7/1b/eb9ed2006e10ed8796f3499e7ae2e59528acf111650239e010924a7668bd/igemwikiscraper-0.5.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2f6fc17724a0234d44d63d66229336c1", "sha256": "eeaed44cbedb263a9ba4a61d4b35574cab8de5066aab8327db5aa0c92f1cc54d" }, "downloads": -1, "filename": "igemwikiscraper-0.5.0.tar.gz", "has_sig": false, "md5_digest": "2f6fc17724a0234d44d63d66229336c1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7919, "upload_time": "2018-08-01T04:26:14", "url": "https://files.pythonhosted.org/packages/91/8f/157908dc7313de8d236c67ab2998c9a0c0e8bae53b8e20ecc67f649461cd/igemwikiscraper-0.5.0.tar.gz" } ], "0.5.1": [ { "comment_text": "", "digests": { "md5": "3f1f765848bfc712a7b059ffcaaaf240", "sha256": "4a967a75c06b06fbc4e989131cc38168a8d1491c833b2c0cf6baeae7739ae78e" }, "downloads": -1, "filename": "igemwikiscraper-0.5.1-py3-none-any.whl", "has_sig": false, "md5_digest": "3f1f765848bfc712a7b059ffcaaaf240", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8839, "upload_time": "2018-08-01T15:58:12", "url": "https://files.pythonhosted.org/packages/77/bb/9d22d874c023a1f8786421a3ffc3e0d39bb9c5006311d68cb8d0f747356a/igemwikiscraper-0.5.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0fb999c015fac78c4d90bed57e1a6629", "sha256": "35024c41b602ab05f8c317dad5a2622b801a8ae0947133300e025dfdf325b6d7" }, "downloads": -1, "filename": "igemwikiscraper-0.5.1.tar.gz", "has_sig": false, "md5_digest": "0fb999c015fac78c4d90bed57e1a6629", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7930, "upload_time": "2018-08-01T15:58:13", "url": "https://files.pythonhosted.org/packages/64/3a/c1b6ec342109e6a2e8225d9d0163f6a199c13aaf00832f39f46ae2fe2cf1/igemwikiscraper-0.5.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3f1f765848bfc712a7b059ffcaaaf240", "sha256": "4a967a75c06b06fbc4e989131cc38168a8d1491c833b2c0cf6baeae7739ae78e" }, "downloads": -1, "filename": "igemwikiscraper-0.5.1-py3-none-any.whl", "has_sig": false, "md5_digest": "3f1f765848bfc712a7b059ffcaaaf240", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8839, "upload_time": "2018-08-01T15:58:12", "url": "https://files.pythonhosted.org/packages/77/bb/9d22d874c023a1f8786421a3ffc3e0d39bb9c5006311d68cb8d0f747356a/igemwikiscraper-0.5.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0fb999c015fac78c4d90bed57e1a6629", "sha256": "35024c41b602ab05f8c317dad5a2622b801a8ae0947133300e025dfdf325b6d7" }, "downloads": -1, "filename": "igemwikiscraper-0.5.1.tar.gz", "has_sig": false, "md5_digest": "0fb999c015fac78c4d90bed57e1a6629", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7930, "upload_time": "2018-08-01T15:58:13", "url": "https://files.pythonhosted.org/packages/64/3a/c1b6ec342109e6a2e8225d9d0163f6a199c13aaf00832f39f46ae2fe2cf1/igemwikiscraper-0.5.1.tar.gz" } ] }