{ "info": { "author": "Dylan Jay", "author_email": "software@pretaweb.com", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "Crawling - html to import\n=========================\n\n`transmogrify.webcrawler` will crawl html to extract pages and files as a source for your transmogrifier pipeline.\n`transmogrify.webcrawler.typerecognitor` aids in setting '_type' based on the crawled mimetype.\n`transmogrify.webcrawler.cache` helps speed up crawling and reduce memory usage by storing items locally.\n\nThese blueprints are designed to work with the `funnelweb` pipeline but can be used independently.\n\n\n\ntransmogrify.webcrawler\n=======================\n\nA source blueprint for crawling content from a site or local html files.\n\nWebcrawler imports HTML either from a live website, for a folder on disk, or a folder\non disk with html which used to come from a live website and may still have absolute\nlinks refering to that website.\n\nTo crawl a live website supply the crawler with a base http url to start crawling with.\nThis url must be the url which all the other urls you want from the site start with.\n\nFor example ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = http://www.whitehouse.gov\n max = 50\n\nwill restrict the crawler to the first 50 pages.\n\nYou can also crawl a local directory of html with relative links by just using a file: style url ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = file:///mydirectory\n\nor if the local directory contains html saved from a website and might have absolute urls in it\nthe you can set this as the cache. 
The crawler will always look up the cache first ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = http://therealsite.com --crawler:cache=mydirectory\n\nThe following will not crawl anything larger than 400000 bytes ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = http://www.whitehouse.gov\n maxsize=400000\n\nTo skip crawling links by regular expression ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url=http://www.whitehouse.gov\n ignore = \\.mp3\n \\.mp4\n\nIf webcrawler is having trouble parsing the html of some pages you can preprocess\nthe html before it is parsed, e.g. ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n patterns = ()\n subs = \\1\\2\n\nIf you'd like to skip processing links with certain mimetypes you can use the\ndrop section's condition option. This TALES expression determines what will be processed further;\nsee http://pypi.python.org/pypi/collective.transmogrifier/#condition-section\n::\n\n [drop]\n blueprint = collective.transmogrifier.sections.condition\n condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']\n\n\nOptions:\n\n:url:\n - the top url to crawl\n\n:ignore:\n - list of regexes for urls not to crawl\n\n:cache:\n - local directory to read crawled items from instead of accessing the site directly\n\n:patterns:\n - Regular expressions to substitute before the html is parsed. Newline separated\n\n:subs:\n - Text to replace each item in patterns. Must be the same number of lines as patterns. 
Due to the way buildout handles empty lines, to replace a pattern with nothing (eg to remove the pattern), use ```` as a substitution.\n\n:maxsize:\n - don't crawl anything larger than this (in bytes)\n\n:max:\n - Limit crawling to this number of pages\n\n:start-urls:\n - a list of urls to initially crawl\n\n:ignore-robots:\n - if set, will ignore the robots.txt directives and crawl everything\n\nWebCrawler will emit items like ::\n\n item = dict(_site_url = \"Original site_url used\",\n _path = \"The url crawled without _site_url\",\n _content = \"The raw content returned by the url\",\n _content_info = \"Headers returned with content\",\n _backlinks = names,\n _sortorder = \"An integer representing the order the url was found within the page/site\",\n )\n\n\ntransmogrify.webcrawler.cache\n=============================\n\nA blueprint that saves crawled content into a directory structure.\n\nOptions:\n\n:path-key:\n Allows you to override the field the path is stored in. Defaults to '_path'\n\n:output:\n Directory to store cached content in\n\n\ntransmogrify.webcrawler.typerecognitor\n======================================\n\nA blueprint for assigning the content type based on the mime-type as given by the\nwebcrawler.\n\nChangelog\n=========\n\n1.2.1 (2013-01-10)\n------------------\n\n- setuptools-git wasn't installed so the release was missing files [djay]\n\n1.2 (2012-12-28)\n----------------\n- fix cache check to prevent overwriting cache [djay]\n- turn redirects into Link objects [djay]\n- summary stats of which mimetypes were crawled [djay]\n- fixed bug where redirected pages weren't getting uploaded [djay]\n- fixed bugs with storing default pages in cache [djay]\n- fixed bug with space chars in urls [ivanteoh]\n- better handling of charset detection [djay]\n\n\n1.1 (2012-04-17)\n----------------\n\n- add start-urls option [djay]\n- add ignore-robots option [djay]\n- fixed bug in http-equiv refresh handling [djay]\n- fixes to disk caching [djay]\n- better 
logging [djay]\n- default maxsize is unlimited [djay]\n- Provide ability for the reformat function to substitute patterns with \n empty strings (nothing). Buildout does not support empty lines within\n configuration, so if a substitution is ```` this becomes an empty\n string. [davidjb]\n- Provide a logger in the LXMLPage class so the reformat function can \n succeed [davidjb]\n- Reformat spacing in webcrawler reformat function [davidjb] \n\n\n1.0 (2011-06-29)\n----------------\n- many fixes for importing from a local directory w/ many languages [simahawk]\n- fix UnicodeEncodeError when file name/language is not English [simahawk]\n- fix iterating over non-sequence [simahawk]\n- fix missing import for MyStringIO [simahawk]\n\n1.0b7 (2011-02-17)\n------------------\n- fix bug in cache check [djay]\n\n1.0b6 (2011-02-12)\n------------------\n- only open cache files when needed so we don't run out of handles [djay]\n- follow http-equiv refresh links [djay]\n\n1.0b5 (2011-02-06)\n------------------\n- files use file pointers to reduce memory usage [djay]\n- cache saves .metadata files to record and play back headers [djay]\n\n1.0b4 (2010-12-13)\n------------------\n- improve logging [djay]\n- fix encoding bug caused by cache [djay]\n\n1.0b3 (2010-11-10)\n------------------\n\n- Fixed bug in cache that caused many links to be ignored in some cases [djay]\n- Fixed up documentation [djay]\n\n1.0b2 (2010-11-09)\n------------------\n\n- Stopped localhost output when no output set [djay]\n\n1.0b1 (2010-11-08)\n------------------\n\n- change site_url to just url. 
[djay]\n\n- rename maxpage to maxsize [djay]\n\n- fix file: style urls [djay]\n\n- Added cache option to replace base_alias [djay]\n\n- fix _origin key set by webcrawler; instead of the url it is now the path, as expected by further blueprints\n [Vitaliy Podoba]\n\n- add _orig_path to pipeline item to keep the original path for any further purposes\n [Vitaliy Podoba]\n\n- make all urls absolute, taking into account base tags, inside the webcrawler blueprint\n [Vitaliy Podoba] \n\n\n0.1 (2008-09-25)\n----------------\n\n- renamed package from pretaweb.blueprints to transmogrify.webcrawler.\n [djay]\n\n- enhanced import view [djay]", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/collective/transmogrify.webcrawler", "keywords": "transmogrifier blueprint funnelweb source plone import conversion microsoft office", "license": "GPL", "maintainer": null, "maintainer_email": null, "name": "transmogrify.webcrawler", "package_url": "https://pypi.org/project/transmogrify.webcrawler/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/transmogrify.webcrawler/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://github.com/collective/transmogrify.webcrawler" }, "release_url": "https://pypi.org/project/transmogrify.webcrawler/1.2.1/", "requires_dist": null, "requires_python": null, "summary": "Crawling and feeding html content into a transmogrifier pipeline", "version": "1.2.1" }, "last_serial": 800897, "releases": { "0.2": [], "1.0": [ { "comment_text": "", "digests": { "md5": "9c611ccd9c3beaafcb6959f2a5d3ec43", "sha256": "a7bf5e2ccc19526cfa74284e845092ccae54fc07aa8c19e193759e9054d5e942" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0.zip", "has_sig": false, "md5_digest": "9c611ccd9c3beaafcb6959f2a5d3ec43", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 536301, "upload_time": 
"2011-06-29T16:42:26", "url": "https://files.pythonhosted.org/packages/93/65/5de7badf3e1d6c0c69bce8a0b4b9780bbf962276f58563d4cf34c252cbac/transmogrify.webcrawler-1.0.zip" } ], "1.0b1": [ { "comment_text": "", "digests": { "md5": "b58aff6badf076352e49b504f628859e", "sha256": "6d4f33760b61da1d21a8993d6bf8ab9904ac0a2bdaa5d77b29c80e7b1b95a470" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0b1.zip", "has_sig": false, "md5_digest": "b58aff6badf076352e49b504f628859e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 530844, "upload_time": "2010-11-07T15:58:09", "url": "https://files.pythonhosted.org/packages/1d/65/dd333fef4adec7e48e863e9aafbd71671a8d841e69ff72c00fd68ef9497e/transmogrify.webcrawler-1.0b1.zip" } ], "1.0b2": [ { "comment_text": "", "digests": { "md5": "91c90bfdea1d39984e81eb15a399dabb", "sha256": "2d8a6608259821ec2ca21ab63b01f53f87e9b8b3e44c79802928f13a7dcb62a7" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0b2.zip", "has_sig": false, "md5_digest": "91c90bfdea1d39984e81eb15a399dabb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 531415, "upload_time": "2010-11-08T17:06:06", "url": "https://files.pythonhosted.org/packages/89/c7/35a7ccd03f728aaaa4f568d44af3c07ed52dfcb29897d2c90d9500815110/transmogrify.webcrawler-1.0b2.zip" } ], "1.0b3": [ { "comment_text": "", "digests": { "md5": "fde2dfd665c4d651c5c5c475365f6ac3", "sha256": "12bc7251d6bbc78d76be978ed726af253e161dc78126f7c34aad4bea3977cf03" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0b3.zip", "has_sig": false, "md5_digest": "fde2dfd665c4d651c5c5c475365f6ac3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 534426, "upload_time": "2010-11-09T16:31:14", "url": "https://files.pythonhosted.org/packages/69/2b/cf020c875af5e9fb11db06df3e48fac229d86ffcefaa0c1b2e60444bda8d/transmogrify.webcrawler-1.0b3.zip" } ], "1.0b4": [ { "comment_text": "", "digests": { "md5": 
"34aef3928f048688ab006d93e4973905", "sha256": "d217b1bdfef028cfc5fb03a8dfb659a625dd0a008b33d1ce23ab6342490fc007" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0b4.zip", "has_sig": false, "md5_digest": "34aef3928f048688ab006d93e4973905", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 534469, "upload_time": "2010-12-13T16:41:50", "url": "https://files.pythonhosted.org/packages/8d/16/f3b1710e95d29633db27f0f9e2da429cb294f353a4dcd0613216f300797b/transmogrify.webcrawler-1.0b4.zip" } ], "1.0b5": [ { "comment_text": "", "digests": { "md5": "36b6658f97642a3200dd4541741d564f", "sha256": "5b9c70cd06c55edb39ba3c55a9dd119cd6496c3eb907413cc6635cedf095a82c" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0b5.zip", "has_sig": false, "md5_digest": "36b6658f97642a3200dd4541741d564f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 535075, "upload_time": "2011-02-06T17:21:50", "url": "https://files.pythonhosted.org/packages/5b/2e/01dfa4baf6ef837f8d1f84becd1f419f28f6ea3e69074d4f0aef51ec6dda/transmogrify.webcrawler-1.0b5.zip" } ], "1.0b6": [ { "comment_text": "", "digests": { "md5": "82b9400424a40e48fbfa9c618117e5f1", "sha256": "d545850298644cb2dfd8f182156bb8985683238ef692f6f38ef3bdb12c948db8" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0b6.zip", "has_sig": false, "md5_digest": "82b9400424a40e48fbfa9c618117e5f1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 535809, "upload_time": "2011-02-12T01:52:54", "url": "https://files.pythonhosted.org/packages/dd/bf/4ac0c0b4c114acf19fc3054da1648b1b43b96011c057de578279eff84174/transmogrify.webcrawler-1.0b6.zip" } ], "1.0b6dev": [], "1.0b7": [ { "comment_text": "", "digests": { "md5": "d9971be4d9e21377f5a89790365a184d", "sha256": "93272e059c9aa899919479bec8c16d5271e752076ab7247f397f4967586eaa4d" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.0b7.zip", "has_sig": false, "md5_digest": 
"d9971be4d9e21377f5a89790365a184d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 536217, "upload_time": "2011-02-17T13:37:09", "url": "https://files.pythonhosted.org/packages/04/f7/6344afa489403231f19438a7b8fbe40d6965de6f74973f36f55544a8331d/transmogrify.webcrawler-1.0b7.zip" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "c3110296c20b6ac1c2d5632f203df8b4", "sha256": "c9f8c4c10032e1681909aaa0f4d8705918c0dcd1558d27157e72ac46a9d1418c" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.1.zip", "has_sig": false, "md5_digest": "c3110296c20b6ac1c2d5632f203df8b4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 537893, "upload_time": "2012-04-17T16:23:02", "url": "https://files.pythonhosted.org/packages/97/9f/e599c01c5ebac581f04fede3d4a63663dde44da7a90aba024a5b03e00d65/transmogrify.webcrawler-1.1.zip" } ], "1.2": [ { "comment_text": "", "digests": { "md5": "51653dde0f186051a165c388ddeea10a", "sha256": "f7ae89bf5349e0e871a0efbd6982d6408c8549e6b7478ba84e600698a52c9cb5" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.2.tar.gz", "has_sig": false, "md5_digest": "51653dde0f186051a165c388ddeea10a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 512206, "upload_time": "2012-12-28T04:40:11", "url": "https://files.pythonhosted.org/packages/5d/52/24f11a4799fac7a304d92fc40e0189f895d75bcd2d6b768a915df99bace5/transmogrify.webcrawler-1.2.tar.gz" } ], "1.2.1": [ { "comment_text": "", "digests": { "md5": "70c89a205a04d1c199dd3b612e044253", "sha256": "03fd249e315d2b94f79d49d7d00516f6a658e84a7fce8c8a1546465dac63fb2a" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.2.1.tar.gz", "has_sig": false, "md5_digest": "70c89a205a04d1c199dd3b612e044253", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 512357, "upload_time": "2013-01-09T21:45:30", "url": 
"https://files.pythonhosted.org/packages/6a/2c/106607699781bdbdc2db99ddd63660bcd4b8075d3056563991f904fc65ac/transmogrify.webcrawler-1.2.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "70c89a205a04d1c199dd3b612e044253", "sha256": "03fd249e315d2b94f79d49d7d00516f6a658e84a7fce8c8a1546465dac63fb2a" }, "downloads": -1, "filename": "transmogrify.webcrawler-1.2.1.tar.gz", "has_sig": false, "md5_digest": "70c89a205a04d1c199dd3b612e044253", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 512357, "upload_time": "2013-01-09T21:45:30", "url": "https://files.pythonhosted.org/packages/6a/2c/106607699781bdbdc2db99ddd63660bcd4b8075d3056563991f904fc65ac/transmogrify.webcrawler-1.2.1.tar.gz" } ] }