{
"info": {
"author": "Dylan Jay",
"author_email": "software@pretaweb.com",
"bugtrack_url": null,
"classifiers": [
"Programming Language :: Python",
"Topic :: Software Development :: Libraries :: Python Modules"
],
"description": "Crawling - html to import\n=========================\n\n`transmogrify.webcrawler` will crawl html to extract pages and files as a source for your transmogrifier pipeline.\n`transmogrify.webcrawler.typerecognitor` aids in setting '_type' based on the crawled mimetype.\n`transmogrify.webcrawler.cache` helps speed up crawling and reduce memory usage by storing items locally.\n\nThese blueprints are designed to work with the `funnelweb` pipeline but can be used independently.\n\n\n\ntransmogrify.webcrawler\n=======================\n\nA source blueprint for crawling content from a site or local html files.\n\nWebcrawler imports HTML either from a live website, for a folder on disk, or a folder\non disk with html which used to come from a live website and may still have absolute\nlinks refering to that website.\n\nTo crawl a live website supply the crawler with a base http url to start crawling with.\nThis url must be the url which all the other urls you want from the site start with.\n\nFor example ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = http://www.whitehouse.gov\n max = 50\n\nwill restrict the crawler to the first 50 pages.\n\nYou can also crawl a local directory of html with relative links by just using a file: style url ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = file:///mydirectory\n\nor if the local directory contains html saved from a website and might have absolute urls in it\nthe you can set this as the cache. 
The crawler will always look up the cache first ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = http://therealsite.com --crawler:cache=mydirectory\n\nThe following will not crawl anything larger than 4Mb ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url = http://www.whitehouse.gov\n maxsize=400000\n\nTo skip crawling links by regular expression ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n url=http://www.whitehouse.gov\n ignore = \\.mp3\n \\.mp4\n\nIf webcrawler is having trouble parsing the html of some pages you can preprocess\nthe html before it is parsed. e.g. ::\n\n [crawler]\n blueprint = transmogrify.webcrawler\n patterns = ()\n subs = \\1\\2\n\nIf you'd like to skip processing links with certain mimetypes you can use the\ndrop:condition. This TALES expression determines what will be processed further.\nSee http://pypi.python.org/pypi/collective.transmogrifier/#condition-section\n::\n\n [drop]\n blueprint = collective.transmogrifier.sections.condition\n condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']\n\n\nOptions:\n\n:url:\n - the top url to crawl\n\n:ignore:\n - list of regex for urls to not crawl\n\n:cache:\n - local directory to read crawled items from instead of accessing the site directly\n\n:patterns:\n - Regular expressions to substitute before html is parsed. Newline separated\n\n:subs:\n - Text to replace each item in patterns. Must be the same number of lines as patterns. 
Due to the way buildout handles empty lines, to replace a pattern with nothing (e.g. to remove the pattern), use ```` as a substitution.\n\n:maxsize:\n - don't crawl anything larger than this\n\n:max:\n - Limit crawling to this number of pages\n\n:start-urls:\n - a list of urls to initially crawl\n\n:ignore-robots:\n - if set, will ignore the robots.txt directives and crawl everything\n\nWebCrawler will emit items like ::\n\n item = dict(_site_url = \"Original site_url used\",\n _path = \"The url crawled without _site_url\",\n _content = \"The raw content returned by the url\",\n _content_info = \"Headers returned with content\",\n _backlinks = names,\n _sortorder = \"An integer representing the order the url was found within the page/site\"\n )\n\n\ntransmogrify.webcrawler.cache\n=============================\n\nA blueprint that saves crawled content into a directory structure.\n\nOptions:\n\n:path-key:\n Allows you to override the field the path is stored in. Defaults to '_path'\n\n:output:\n Directory to store cached content in\n\n\ntransmogrify.webcrawler.typerecognitor\n======================================\n\nA blueprint for assigning content type based on the mime-type as given by the\nwebcrawler.\n\nChangelog\n=========\n\n1.2.1 (2013-01-10)\n------------------\n\n- setuptools-git wasn't installed so release was missing files [djay]\n\n1.2 (2012-12-28)\n----------------\n- fix cache check to prevent overwriting cache [djay]\n- turn redirects into Link objects [djay]\n- summary stats of which mimetypes were crawled [djay]\n- fixed bug where redirected pages weren't getting uploaded [djay]\n- fixed bugs with storing default pages in cache [djay]\n- fixed bug with space chars in urls [ivanteoh]\n- better handling of charset detection [djay]\n\n\n1.1 (2012-04-17)\n----------------\n\n- add start-urls option [djay]\n- add ignore_robots option [djay]\n- fixed bug in http-equiv refresh handling [djay]\n- fixes to disk caching [djay]\n- better 
logging [djay]\n- default maxsize is unlimited [djay]\n- Provide ability for the reformat function to substitute patterns with\n empty strings (nothing). Buildout does not support empty lines within\n configuration, so if a substitution is this becomes an empty\n string. [davidjb]\n- Provide a logger in the LXMLPage class so the reformat function can\n succeed [davidjb]\n- Reformat spacing in webcrawler reformat function [davidjb]\n\n\n1.0 (2011-06-29)\n----------------\n- many fixes for importing from local directory w/ many languages [simahawk]\n- fix UnicodeEncodeError when file name/language is not English [simahawk]\n- fix iterating over non-sequence [simahawk]\n- fix missing import for MyStringIO [simahawk]\n\n1.0b7 (2011-02-17)\n------------------\n- fix bug in cache check [djay]\n\n1.0b6 (2011-02-12)\n------------------\n- only open cache files when needed so we don't run out of handles [djay]\n- follow http-equiv refresh links [djay]\n\n1.0b5 (2011-02-06)\n------------------\n- files use file pointers to reduce memory usage [djay]\n- cache saves .metadata files to record and play back headers [djay]\n\n1.0b4 (2010-12-13)\n------------------\n- improve logging [djay]\n- fix encoding bug caused by cache [djay]\n\n1.0b3 (2010-11-10)\n------------------\n\n- Fixed bug in cache that caused many links to be ignored in some cases [djay]\n- Fix documentation up [djay]\n\n1.0b2 (2010-11-09)\n------------------\n\n- Stopped localhost output when no output set [djay]\n\n1.0b1 (2010-11-08)\n------------------\n\n- change site_url to just url. 
[djay]\n\n- rename maxpage to maxsize [djay]\n\n- fix file: style urls [djay]\n\n- Added cache option to replace base_alias [djay]\n\n- fix _origin key set by webcrawler, instead of url now it is path as expected by further blue\n [Vitaliy Podoba]\n\n- add _orig_path to pipeline item to keep original path for any further purposes, we will need\n [Vitaliy Podoba]\n\n- make all urls absolute taking into account base tags inside webcrawler blueprint\n [Vitaliy Podoba]\n\n\n0.1 (2008-09-25)\n----------------\n\n- renamed package from pretaweb.blueprints to transmogrify.webcrawler.\n [djay]\n\n- enhanced import view [djay]",
"description_content_type": null,
"docs_url": null,
"download_url": "UNKNOWN",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "http://github.com/collective/transmogrify.webcrawler",
"keywords": "transmogrifier blueprint funnelweb source plone import conversion microsoft office",
"license": "GPL",
"maintainer": null,
"maintainer_email": null,
"name": "transmogrify.webcrawler",
"package_url": "https://pypi.org/project/transmogrify.webcrawler/",
"platform": "UNKNOWN",
"project_url": "https://pypi.org/project/transmogrify.webcrawler/",
"project_urls": {
"Download": "UNKNOWN",
"Homepage": "http://github.com/collective/transmogrify.webcrawler"
},
"release_url": "https://pypi.org/project/transmogrify.webcrawler/1.2.1/",
"requires_dist": null,
"requires_python": null,
"summary": "Crawling and feeding html content into a transmogrifier pipeline",
"version": "1.2.1"
},
"last_serial": 800897,
"releases": {
"0.2": [],
"1.0": [
{
"comment_text": "",
"digests": {
"md5": "9c611ccd9c3beaafcb6959f2a5d3ec43",
"sha256": "a7bf5e2ccc19526cfa74284e845092ccae54fc07aa8c19e193759e9054d5e942"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0.zip",
"has_sig": false,
"md5_digest": "9c611ccd9c3beaafcb6959f2a5d3ec43",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 536301,
"upload_time": "2011-06-29T16:42:26",
"url": "https://files.pythonhosted.org/packages/93/65/5de7badf3e1d6c0c69bce8a0b4b9780bbf962276f58563d4cf34c252cbac/transmogrify.webcrawler-1.0.zip"
}
],
"1.0b1": [
{
"comment_text": "",
"digests": {
"md5": "b58aff6badf076352e49b504f628859e",
"sha256": "6d4f33760b61da1d21a8993d6bf8ab9904ac0a2bdaa5d77b29c80e7b1b95a470"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0b1.zip",
"has_sig": false,
"md5_digest": "b58aff6badf076352e49b504f628859e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 530844,
"upload_time": "2010-11-07T15:58:09",
"url": "https://files.pythonhosted.org/packages/1d/65/dd333fef4adec7e48e863e9aafbd71671a8d841e69ff72c00fd68ef9497e/transmogrify.webcrawler-1.0b1.zip"
}
],
"1.0b2": [
{
"comment_text": "",
"digests": {
"md5": "91c90bfdea1d39984e81eb15a399dabb",
"sha256": "2d8a6608259821ec2ca21ab63b01f53f87e9b8b3e44c79802928f13a7dcb62a7"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0b2.zip",
"has_sig": false,
"md5_digest": "91c90bfdea1d39984e81eb15a399dabb",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 531415,
"upload_time": "2010-11-08T17:06:06",
"url": "https://files.pythonhosted.org/packages/89/c7/35a7ccd03f728aaaa4f568d44af3c07ed52dfcb29897d2c90d9500815110/transmogrify.webcrawler-1.0b2.zip"
}
],
"1.0b3": [
{
"comment_text": "",
"digests": {
"md5": "fde2dfd665c4d651c5c5c475365f6ac3",
"sha256": "12bc7251d6bbc78d76be978ed726af253e161dc78126f7c34aad4bea3977cf03"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0b3.zip",
"has_sig": false,
"md5_digest": "fde2dfd665c4d651c5c5c475365f6ac3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 534426,
"upload_time": "2010-11-09T16:31:14",
"url": "https://files.pythonhosted.org/packages/69/2b/cf020c875af5e9fb11db06df3e48fac229d86ffcefaa0c1b2e60444bda8d/transmogrify.webcrawler-1.0b3.zip"
}
],
"1.0b4": [
{
"comment_text": "",
"digests": {
"md5": "34aef3928f048688ab006d93e4973905",
"sha256": "d217b1bdfef028cfc5fb03a8dfb659a625dd0a008b33d1ce23ab6342490fc007"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0b4.zip",
"has_sig": false,
"md5_digest": "34aef3928f048688ab006d93e4973905",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 534469,
"upload_time": "2010-12-13T16:41:50",
"url": "https://files.pythonhosted.org/packages/8d/16/f3b1710e95d29633db27f0f9e2da429cb294f353a4dcd0613216f300797b/transmogrify.webcrawler-1.0b4.zip"
}
],
"1.0b5": [
{
"comment_text": "",
"digests": {
"md5": "36b6658f97642a3200dd4541741d564f",
"sha256": "5b9c70cd06c55edb39ba3c55a9dd119cd6496c3eb907413cc6635cedf095a82c"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0b5.zip",
"has_sig": false,
"md5_digest": "36b6658f97642a3200dd4541741d564f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 535075,
"upload_time": "2011-02-06T17:21:50",
"url": "https://files.pythonhosted.org/packages/5b/2e/01dfa4baf6ef837f8d1f84becd1f419f28f6ea3e69074d4f0aef51ec6dda/transmogrify.webcrawler-1.0b5.zip"
}
],
"1.0b6": [
{
"comment_text": "",
"digests": {
"md5": "82b9400424a40e48fbfa9c618117e5f1",
"sha256": "d545850298644cb2dfd8f182156bb8985683238ef692f6f38ef3bdb12c948db8"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0b6.zip",
"has_sig": false,
"md5_digest": "82b9400424a40e48fbfa9c618117e5f1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 535809,
"upload_time": "2011-02-12T01:52:54",
"url": "https://files.pythonhosted.org/packages/dd/bf/4ac0c0b4c114acf19fc3054da1648b1b43b96011c057de578279eff84174/transmogrify.webcrawler-1.0b6.zip"
}
],
"1.0b6dev": [],
"1.0b7": [
{
"comment_text": "",
"digests": {
"md5": "d9971be4d9e21377f5a89790365a184d",
"sha256": "93272e059c9aa899919479bec8c16d5271e752076ab7247f397f4967586eaa4d"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.0b7.zip",
"has_sig": false,
"md5_digest": "d9971be4d9e21377f5a89790365a184d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 536217,
"upload_time": "2011-02-17T13:37:09",
"url": "https://files.pythonhosted.org/packages/04/f7/6344afa489403231f19438a7b8fbe40d6965de6f74973f36f55544a8331d/transmogrify.webcrawler-1.0b7.zip"
}
],
"1.1": [
{
"comment_text": "",
"digests": {
"md5": "c3110296c20b6ac1c2d5632f203df8b4",
"sha256": "c9f8c4c10032e1681909aaa0f4d8705918c0dcd1558d27157e72ac46a9d1418c"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.1.zip",
"has_sig": false,
"md5_digest": "c3110296c20b6ac1c2d5632f203df8b4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 537893,
"upload_time": "2012-04-17T16:23:02",
"url": "https://files.pythonhosted.org/packages/97/9f/e599c01c5ebac581f04fede3d4a63663dde44da7a90aba024a5b03e00d65/transmogrify.webcrawler-1.1.zip"
}
],
"1.2": [
{
"comment_text": "",
"digests": {
"md5": "51653dde0f186051a165c388ddeea10a",
"sha256": "f7ae89bf5349e0e871a0efbd6982d6408c8549e6b7478ba84e600698a52c9cb5"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.2.tar.gz",
"has_sig": false,
"md5_digest": "51653dde0f186051a165c388ddeea10a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 512206,
"upload_time": "2012-12-28T04:40:11",
"url": "https://files.pythonhosted.org/packages/5d/52/24f11a4799fac7a304d92fc40e0189f895d75bcd2d6b768a915df99bace5/transmogrify.webcrawler-1.2.tar.gz"
}
],
"1.2.1": [
{
"comment_text": "",
"digests": {
"md5": "70c89a205a04d1c199dd3b612e044253",
"sha256": "03fd249e315d2b94f79d49d7d00516f6a658e84a7fce8c8a1546465dac63fb2a"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "70c89a205a04d1c199dd3b612e044253",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 512357,
"upload_time": "2013-01-09T21:45:30",
"url": "https://files.pythonhosted.org/packages/6a/2c/106607699781bdbdc2db99ddd63660bcd4b8075d3056563991f904fc65ac/transmogrify.webcrawler-1.2.1.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "70c89a205a04d1c199dd3b612e044253",
"sha256": "03fd249e315d2b94f79d49d7d00516f6a658e84a7fce8c8a1546465dac63fb2a"
},
"downloads": -1,
"filename": "transmogrify.webcrawler-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "70c89a205a04d1c199dd3b612e044253",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 512357,
"upload_time": "2013-01-09T21:45:30",
"url": "https://files.pythonhosted.org/packages/6a/2c/106607699781bdbdc2db99ddd63660bcd4b8075d3056563991f904fc65ac/transmogrify.webcrawler-1.2.1.tar.gz"
}
]
}