{ "info": { "author": "Dylan Jay", "author_email": "software@pretaweb.com", "bugtrack_url": null, "classifiers": [ "Framework :: Buildout", "Intended Audience :: Developers", "License :: OSI Approved :: Zope Public License", "Topic :: Software Development :: Build Tools" ], "description": "FunnelWeb - Content conversion made easy\n****************************************\n\nEasily convert content from existing html into Plone.\n\n:Code repository: http://github.com/collective/funnelweb\n:Questions and comments to: http://github.com/collective/funnelweb/issues\n:Report bugs at:\n http://github.com/collective/funnelweb/issues\n https://github.com/collective/transmogrify.webcrawler/issues\n https://github.com/collective/transmogrify.htmlcontentextractor/issues\n https://github.com/collective/transmogrify.siteanalyser/issues\n https://github.com/collective/transmogrify.ploneremote/issues\n\n.. contents::\n\nIntroduction\n------------\n\nFunnelWeb is a webcrawler which extracts website content such as titles, descriptions,\nimages and content blocks from existing websites. It filters this content and uploads\nit into a new website which uses the `Plone`_ CMS. It gives you many options for adjusting\nhow content is migrated. It is an invaluable tool when you want to migrate a site which doesn't\nuse a CMS or there isn't a tool can migrate content directly from the sites database.\n\nFor those familar with `collective.transmogrifier`_, funnelweb is a prebuilt pipeline that combines\nblueprints from four different packages (`transmogrify.webcrawler`_, `transmogrify.htmlcontentextractor`_\n`transmogrify.siteanalyser`_ and `transmogrify.ploneremote`_). Due to the flexible nature of the\n`collective.transmogrifier`_ framework underneath advanced users can add further steps to their conversion\nprocess.\n\nThe work performed by the funnelweb script can be broken down into four sections:\n\n1. 
Crawling the site, including caching locally so subsequent crawls are quicker, and filtering out\n unwanted content (`transmogrify.webcrawler`_)\n2. Removing boilerplate/templates (automatically or via rules) so just content remains (`transmogrify.htmlcontentextractor`_)\n3. Analysing the site structure to improve the content quality, including working out titles, default\n views, types of objects to create, what to show in navigation etc. (`transmogrify.siteanalyser`_)\n4. Uploading to a CMS such as Plone, or saving cleaned HTML to a local directory (`transmogrify.ploneremote`_)\n\nFunnelWeb uses the `mr.migrator`_ framework which allows its funnelweb `collective.transmogrifier`_ pipeline to be run:\n\n1. Within Plone itself (see `mr.migrator`_ for how to install).\n\n2. As a command line script which can be installed via zc.buildout. Content is uploaded\n into `Plone`_ via its web services API.\n\n\nInstallation for commandline\n----------------------------\n\nYou can install via easy_install ::\n\n $> easy_install funnelweb\n\nThis can be run by ::\n\n $> funnelweb\n\nOr funnelweb can be installed via a buildout recipe (see `zc.buildout`_) ::\n\n [buildout]\n parts += funnelweb\n\n [funnelweb]\n recipe = funnelweb\n\n $> buildout init\n $> bin/buildout\n\n\nThis can be run by ::\n\n $> bin/funnelweb\n\nThe examples here will assume installation via buildout.\n\n.. _`zc.buildout`: http://www.buildout.org\n\nConfiguration for commandline\n-----------------------------\n\nFunnelweb is organised as a series of steps through which crawled items pass before eventually being\nuploaded. Each step has one or more configuration options so you can customise the import process\nfor your needs. 
Almost all imports will require some level of configuration.\n\nFunnelweb gives you three methods to configure your pipeline.\n\nUsing a local pipeline configuration\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nYou can create your own pipeline.cfg that overrides and extends the default funnelweb\npipeline.\n\nFor example, create a file called pipeline.cfg with the following ::\n\n [transmogrifier]\n include = funnelweb.remote\n\n [crawler]\n url=http://collective-docs.readthedocs.org/en/latest/\n\nThis will override the crawler blueprint setting \"url\". You can run this by ::\n\n $> bin/funnelweb --pipeline=pipeline.cfg\n\nYou can view the funnelweb.remote pipeline and all its options via the following command ::\n\n $> bin/funnelweb --show-pipeline\n\nYou can also save this pipeline and customise it for your own needs ::\n\n $> bin/funnelweb --show-pipeline > pipeline.cfg\n $> {edit} pipeline.cfg\n $> bin/funnelweb --pipeline=pipeline.cfg\n\n\n\nCommandline arguments\n~~~~~~~~~~~~~~~~~~~~~\n\nAny argument from the pipeline can be overridden via the command line,\ne.g. ::\n\n $> bin/funnelweb --crawler:url=http://www.whitehouse.gov\n\nAll arguments are of the form --(step:argument)=value.\nThe first part of each configuration key is the step, e.g. `crawler`. The second part is the particular\nconfiguration option for that step, e.g. `url`. This is then followed by = and the value or values.\n\nSome options require multiple lines within a buildout part. These can be overridden\nvia the commandline by repeating the same argument, e.g. ::\n\n $> bin/funnelweb --crawler:ignore=\\.mp3 --crawler:ignore=\\.pdf\n\n\nYou can see a list of all the arguments via ::\n\n $> bin/funnelweb --help\n\n\nBuildout Override\n~~~~~~~~~~~~~~~~~\n\nAny command-line override can also be \"baked\" into the funnelweb script, e.g. 
::\n\n [buildout]\n parts += funnelweb\n\n [funnelweb]\n recipe = funnelweb\n crawler-url=http://www.whitehouse.gov\n pipeline=pipeline.cfg\n\n\nAny parameter of the form ::\n\n [step]\n blah = blah\n\nwill become in buildout ::\n\n [funnelweb]\n recipe = funnelweb\n step-blah = blah\n\nand on the command line ::\n\n bin/funnelweb --step:blah=blah\n\n\nRecommended Usage\n-----------------\n\nBelow is an outline of how you might typically use funnelweb.\n\n1. First set up buildout to make a command line funnelweb\n2. Create a pipeline.cfg including funnelweb.remote (see `Using a local pipeline configuration`_)\n3. Bake the pipeline file into buildout (see `Buildout Override`_)\n4. Test crawl your site and store it into the cache (see `Crawling - HTML to import`_)\n5. You might need to set some crawler:ignore rules\n6. Crawl the whole site into your cache (see `Crawling - HTML to import`_)\n7. Crawl the first 10 pages using --crawler:max=10\n8. Use `Templates`_ in debug mode to find the Title, Description and Text of your pages\n9. `Upload to Plone`_ to test\n10. If the structure and URLs are what you expect, use `Site Analysis`_\n11. Repeat, crawling more pages\n\n\nConfiguration Options\n---------------------\n\nThe full list of steps that can be configured, along with the transmogrifier\nblueprint for each:\n\n1. Crawling\n\n:crawler: `transmogrify.webcrawler`_\n:cache: `transmogrify.webcrawler.cache`_\n:typeguess: `transmogrify.webcrawler.typerecognitor`_\n:drop: `collective.transmogrifier.sections.condition`_\n\n2. Templates\n\n:template1: `transmogrify.htmlcontentextractor`_\n:template2: `transmogrify.htmlcontentextractor`_\n:template3: `transmogrify.htmlcontentextractor`_\n:template4: `transmogrify.htmlcontentextractor`_\n:templateauto: `transmogrify.htmlcontentextractor.auto`_\n\n3. 
Site Analysis\n\n:sitemapper: `transmogrify.siteanalyser.sitemapper`_\n:indexguess: `transmogrify.siteanalyser.defaultpage`_\n:titleguess: `transmogrify.siteanalyser.title`_\n:attachmentguess: `transmogrify.siteanalyser.attach`_\n:hideguess: `transmogrify.siteanalyser.hidefromnav`_\n:urltidy: `transmogrify.siteanalyser.urltidy`_\n:addfolders: `transmogrify.pathsorter`_\n:changetype: `collective.transmogrifier.sections.inserter`_\n\n4. Uploading\n\n:ploneupload: `transmogrify.ploneremote.remoteconstructor`_\n:ploneupdate: `transmogrify.ploneremote.remoteschemaupdater`_\n:plonehide: `transmogrify.ploneremote.remotenavigationexcluder`_\n:publish: `collective.transmogrifier.sections.inserter`_\n:plonepublish: `transmogrify.ploneremote.remoteworkflowupdater`_\n:plonealias: `transmogrify.ploneremote.remoteredirector`_\n:ploneprune: `transmogrify.ploneremote.remoteprune`_\n:localupload: `transmogrify.webcrawler.cache`_\n\n.. _transmogrify.webcrawler: http://pypi.python.org/pypi/transmogrify.webcrawler#transmogrify-webcrawler\n.. _transmogrify.webcrawler.cache: http://pypi.python.org/pypi/transmogrify.webcrawler#transmogrify-webcrawler-cache\n.. _transmogrify.webcrawler.typerecognitor: http://pypi.python.org/pypi/transmogrify.webcrawler#transmogrify-webcrawler-typerecognitor\n.. _collective.transmogrifier.sections.condition: http://pypi.python.org/pypi/collective.transmogrifier#condition\n\n.. _transmogrify.htmlcontentextractor: http://pypi.python.org/pypi/transmogrify.htmlcontentextractor#transmogrify-htmlcontentextractor\n.. _transmogrify.htmlcontentextractor.auto: http://pypi.python.org/pypi/transmogrify.htmlcontentextractor#transmogrify-htmlcontentextractor.auto\n\n.. _transmogrify.siteanalyser: http://pypi.python.org/pypi/transmogrify.siteanalyser\n.. _transmogrify.siteanalyser.sitemapper: http://pypi.python.org/pypi/transmogrify.siteanalyser#transmogrify-siteanalyser-sitemapper\n.. 
_`transmogrify.siteanalyser.defaultpage`: http://pypi.python.org/pypi/transmogrify.siteanalyser#transmogrify-siteanalyser-defaultpage\n.. _`transmogrify.siteanalyser.title`: http://pypi.python.org/pypi/transmogrify.siteanalyser#transmogrify-siteanalyser-title\n.. _`transmogrify.siteanalyser.attach`: http://pypi.python.org/pypi/transmogrify.siteanalyser#transmogrify-siteanalyser-attach\n.. _`transmogrify.siteanalyser.hidefromnav`: http://pypi.python.org/pypi/transmogrify.siteanalyser#transmogrify-siteanalyser-hidefromnav\n.. _`transmogrify.siteanalyser.urltidy`: http://pypi.python.org/pypi/transmogrify.siteanalyser#transmogrify-siteanalyser-urltidy\n.. _`transmogrify.pathsorter`: http://pypi.python.org/pypi/transmogrify.siteanalyser#transmogrify-pathsorter\n.. _collective.transmogrifier.sections.inserter: http://pypi.python.org/pypi/collective.transmogrifier#inserter\n\n.. _`transmogrify.ploneremote`: http://pypi.python.org/pypi/transmogrify.ploneremote\n.. _`transmogrify.ploneremote.remoteconstructor`: http://pypi.python.org/pypi/transmogrify.ploneremote#transmogrify-ploneremote-remoteconstructor\n.. _`transmogrify.ploneremote.remoteschemaupdater`: http://pypi.python.org/pypi/transmogrify.ploneremote#transmogrify-ploneremote-remoteschemaupdater\n.. _`transmogrify.ploneremote.remotenavigationexcluder`: http://pypi.python.org/pypi/transmogrify.ploneremote#transmogrify-ploneremote-remotenavigationexcluder\n.. _`transmogrify.ploneremote.remoteworkflowupdater`: http://pypi.python.org/pypi/transmogrify.ploneremote#transmogrify-ploneremote-remoteworkflowupdater\n.. _`transmogrify.ploneremote.remoteredirector`: http://pypi.python.org/pypi/transmogrify.ploneremote#transmogrify-ploneremote-remoteredirector\n.. 
_`transmogrify.ploneremote.remoteprune`: http://pypi.python.org/pypi/transmogrify.ploneremote#transmogrify-ploneremote-remoteprune\n\nOr you can use the commandline help to view the list of available options ::\n\n $> bin/funnelweb --help\n\n\nThe most common configuration options for these steps are detailed below.\n\nCrawling - HTML to import\n~~~~~~~~~~~~~~~~~~~~~~~~~\n\nFunnelweb imports HTML either from a live website, from a folder on disk, or from a folder\non disk with HTML which was retrieved from a live website and may still have absolute\nlinks referring to that website.\n\nFunnelweb can only import things it can crawl, i.e. content that is linked from\nHTML. If your site contains javascript links or password protected content, then\nyou may have to perform some extra steps to get funnelweb to crawl your\ncontent.\n\nTo crawl a live website, supply the crawler with a base HTTP URL to start crawling from.\nThis URL must be the URL which all the other URLs you want from the site start with.\n\nFor example ::\n\n $> bin/funnelweb --crawler:url=http://www.whitehouse.gov --crawler:max=50 --ploneupload:target=http://admin:admin@localhost:8080/Plone\n\nwill restrict the crawler to the first 50 pages and then convert the content\ninto a local Plone site.\n\nThe site you crawl will be cached locally, so if you run funnelweb again it will run much quicker. If you'd like\nto disable the local caching use ::\n\n $> bin/funnelweb --cache:output=\n \nIf you'd like to reset the cache, refreshing its data, set the crawler's cache to nothing ::\n\n $> bin/funnelweb --crawler:cache=\n\nBy default the cache is stored in ``var/funnelwebcache/{site url}/``. 
You can set this to another directory using::\n\n $> bin/funnelweb --cache:output=my_new_dir\n\n\nYou can also crawl a local directory of HTML with relative links by just using a ``file://`` style URL ::\n\n $> bin/funnelweb --crawler:url=file:///mydirectory\n\nor if the local directory contains HTML saved from a website and might have absolute URLs in it,\nthen you can set this as the cache. The crawler will always look up the cache first ::\n\n $> bin/funnelweb --crawler:url=http://therealsite.com --crawler:cache=mydirectory\n\nThe following will not crawl anything larger than 400,000 bytes ::\n\n $> bin/funnelweb --crawler:maxsize=400000\n\nTo skip crawling links by regular expression ::\n\n [funnelweb]\n recipe = funnelweb\n crawler-url=http://www.whitehouse.gov\n crawler-ignore = \\.mp3\n \\.mp4 \n\nIf funnelweb is having trouble parsing the HTML of some pages, you can preprocess\nthe HTML before it is parsed, e.g. ::\n\n [funnelweb]\n recipe = funnelweb\n crawler-patterns = ()\n crawler-subs = \\1\\2\n\nIf you'd like to skip processing links with certain mimetypes you can use the\n``drop:condition`` option. This TALES expression determines what will be processed further ::\n\n [funnelweb]\n recipe = funnelweb\n drop-condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']\n\n\nTemplates\n~~~~~~~~~\n\nFunnelweb has a built-in clustering algorithm that tries to automatically extract the content from the HTML template.\nThis is slow and not always effective. 
Often you will need to input your own template extraction rules.\n\nIf you'd like to turn off the automatic templates ::\n\n $> bin/funnelweb --templateauto:condition=python:False\n\n\nRules are in the form of ::\n\n (title|description|text|anything) = (text|html|optional) XPath\n\nFor example ::\n\n [funnelweb]\n recipe = funnelweb\n crawler-url=http://www.whitehouse.gov\n ploneupload-target=http://admin:admin@localhost:8080/Plone\n template1-title = text //div[@class='body']//h1[1]\n template1-_delete1 = optional //div[@class='body']//a[@class='headerlink']\n template1-_delete2 = optional //div[contains(@class,'admonition-description')]\n template1-description = text //div[contains(@class,'admonition-description')]//p[@class='last']\n template1-text = html //div[@class='body']\n\nNote that for a single template, e.g. template1, ALL of the XPaths need to match, otherwise\nthat template will be skipped and the next template tried. If you'd like to make it\nso that a single XPath isn't necessary for the template to match, then use the keyword `optional` or `optionaltext`\ninstead of `text` or `html` before the XPath.\n\n\nIn the default pipeline there are four templates called `template1`, `template2`, `template3` and `template4`.\n\nWhen an XPath is applied within a single template, the HTML it matches will be removed from the page.\nAnother rule in that same template can't match the same HTML fragment.\n\nThis means that matching a content part which is not useful in Plone (e.g. redundant text, title or description) is an\neffective way to remove that HTML from the content.\n\nTo help debug your template rules you can set debug mode ::\n\n $> bin/funnelweb --template1:debug --template2:debug\n\nSetting debug mode on templateauto will give you details about the rules it uses. 
::\n\n $> bin/funnelweb --templateauto:debug\n ...\n DEBUG:templateauto:'icft.html' discovered rules by clustering on 'http://...'\n Rules:\n\ttext= html //div[@id = \"dal_content\"]//div[@class = \"content\"]//p\n\ttitle= text //div[@id = \"dal_content\"]//div[@class = \"content\"]//h3\n Text:\n\tTITLE: ...\n\tMAIN-10: ...\n\tMAIN-10: ...\n\tMAIN-10: ...\n\n\nFor more information about XPath see\n\n- http://www.w3schools.com/xpath/default.asp\n- http://blog.browsermob.com/2009/04/test-your-selenium-xpath-easily-with-firebug/\n\n\nSite Analysis\n~~~~~~~~~~~~~\n\nIn order to provide a cleaner-looking Plone site, there are several options to analyse\nthe entire crawled site and clean it up. These are turned off by default.\n\nTo determine if an item is a default page for a container (it has many links\nto items in that container, even if not contained in that folder), and then move\nit to that folder, use ::\n\n $> bin/funnelweb --indexguess:condition=python:True\n\nYou can automatically find better page titles by analysing backlink text ::\n\n [funnelweb]\n recipe = funnelweb\n titleguess-condition = python:True\n titleguess-ignore =\n\tclick\n\tread more\n\tclose\n\tClose\n\thttp:\n\thttps:\n\tfile:\n\timg\n\n\nThe following will find items only referenced by one page and move them into\na new folder with the page as the default view. 
::\n\n $> bin/funnelweb --attachmentguess:condition=python:True\n\nor the following will only move attachments that are images and use ``index-html`` as the new\nname for the default page of the newly created folder ::\n\n [funnelweb]\n recipe = funnelweb\n attachmentguess-condition = python: subitem.get('_type') in ['Image']\n attachmentguess-defaultpage = index-html\n\nThe following will tidy up the URLs based on a TALES expression ::\n\n $> bin/funnelweb --urltidy:link_expr=\"python:item['_path'].endswith('.html') and item['_path'][:-5] or item['_path']\"\n\nIf you'd like to move content around before it's uploaded you can use the urltidy step as well, e.g. ::\n\n $> bin/funnelweb --urltidy:link_expr=python:item['_path'].startswith('/news') and '/otn/news'+item['_path'][5:] or item['_path']\n\nIf you want to hide content from navigation you can use `hideguess` ::\n\n $> bin/funnelweb --hideguess:condition=python:item['_path']=='musthide'\n\n\n\nUpload to Plone\n~~~~~~~~~~~~~~~\n\nUploading happens via remote XML-RPC calls, so it can be done against a live running site anywhere.\n\nTo set where the site will be uploaded to, use ::\n\n $> bin/funnelweb --ploneupload:target=http://username:password@myhost.com/myfolder\n\nCurrently only basic authentication via setting the username and password in the URL is supported. If no target\nis set then the site will be crawled but not uploaded.\n\nIf you'd like to change the type of what's uploaded ::\n\n $> bin/funnelweb --changetype:value=python:{'Folder':'HelpCenterReferenceManualSection','Document':'HelpCenterLeafPage'}.get(item['_type'],item['_type'])\n\nThis will set a new value for the type of the item. 
You could make this conditional, e.g. ::\n\n $> bin/funnelweb --changetype:condition=python:item['_path'].startswith('/news')\n \nor by using a more complex expression for the new type ::\n\n $> bin/funnelweb --changetype:value=python:item['_path'].startswith('/news') and 'NewNewsType' or item['_type']\n\n\nBy default, funnelweb will automatically create Plone aliases based on the original crawled URLs, so that any old links\nwill automatically be redirected to the new cleaned-up URLs. You can disable this by ::\n\n $> bin/funnelweb --plonealias:target=\n\nYou can change which items get published to which state by setting the following ::\n\n [funnelweb]\n recipe = funnelweb\n publish-value = python:[\"publish\"]\n publish-condition = python:item.get('_type') != 'Image' and not options.get('disabled')\n\nFunnelweb will hide certain items from Plone's navigation if that item was only ever linked\nto from within the content area. You can disable this behavior by ::\n\n $> bin/funnelweb --plonehide:target=\n\nYou can get a local file representation of what will be uploaded by using the following ::\n\n $> bin/funnelweb --localupload:output=var/mylocaldir\n \nExamples\n--------\n\nFeel free to fork and add your own examples for extracting content from common sites or\nCMSs.\n\nRead The Docs\n~~~~~~~~~~~~~\n\nAs an example, the following buildout will create a funnelweb script that will\nconvert regular Sphinx documentation into remote Plone content\ninside a PloneHelpCenter ::\n\n [transmogrifier]\n include = funnelweb.remote\n\n [crawler]\n url=http://collective-docs.readthedocs.org/en/latest/\n ignore=\n cgi-bin\n javascript:\n _static\n _sources\n genindex\\.html\n search\\.html\n searchindex\\.js\n\n [template1]\n title = text //div[@class='body']//h1[1]\n description = optional //div[contains(@class,'admonition-description')]/p[@class='last']/text()\n text = html //div[@class='body']\n # Fields with '_' won't be uploaded to Plone so will be effectively removed\n 
_permalink = text //div[@class='body']//a[@class='headerlink']\n _label = optional //p[contains(@class,'admonition-title')]\n _remove_useless_links = optional //div[@id = 'indices-and-tables']\n\n # Images will get titles from backlink text\n [titleguess]\n condition = python:True\n\n # Pages linked to content will be moved together\n [indexguess]\n condition = python:True\n\n # Hide the images folder from navigation\n [hideguess]\n condition = python:item.get(\"_path\",\"\").startswith('_images') and item.get('_type')=='Folder'\n\n # Upload as PHC instead of Folders and Pages\n [changetype]\n value=python:{'Folder':'HelpCenterReferenceManualSection','Document':'HelpCenterLeafPage'}.get(item['_type'],item['_type'])\n\n # Save locally for debugging purposes\n [localupload]\n output=manual\n\n # Check all folderish content for any items on the remote site\n # which are not present locally, including the base folder\n [ploneprune]\n condition=python:item.get('_type') in ['HelpCenterReferenceManualSection','HelpCenterReferenceManual'] or item['_path'] == ''\n\nJoomla\n~~~~~~\n\n#TODO\n\nWordpress\n~~~~~~~~~\n\n#TODO\n\nDrupal\n~~~~~~\n\n#TODO\n\nOthers\n~~~~~~\n\nAdd your own examples here\n\nControlling Logging\n-------------------\n\nYou can show additional debug output for any particular step by setting a debug commandline switch.\nFor instance, to see additional details about template matching failures ::\n\n $> bin/funnelweb --template1:debug\n \n \n\nWorking directly with transmogrifier (advanced)\n-----------------------------------------------\n\nYou might need to insert further transformation steps for your particular\nconversion use case. To do this, you can extend funnelweb's underlying\ntransmogrifier pipeline. 
Funnelweb uses a transmogrifier pipeline to perform the needed transformations, and all\ncommandline and recipe options refer to options in the pipeline.\n\n\nYou can view the pipeline and all its options via the following command ::\n\n $> bin/funnelweb --show-pipeline\n\nYou can also save this pipeline and customise it for your own needs ::\n\n $> bin/funnelweb --show-pipeline > pipeline.cfg\n $> {edit} pipeline.cfg\n $> bin/funnelweb --pipeline=pipeline.cfg\n\nCustomising the pipeline allows you to add your own transformations which\nhaven't been pre-considered by the standard funnelweb tool.\n\nSee the transmogrifier documentation for how to add your own blueprints, or how to add blueprints that\nalready exist to your custom pipeline.\n\nUsing external blueprints\n~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIf you have decided you need to customise your pipeline and you want to install transformation\nsteps that use blueprints not already included in funnelweb or transmogrifier, you can include\nthem using the ``eggs`` option in a funnelweb buildout part ::\n\n [funnelweb]\n recipe = funnelweb\n eggs = myblueprintpackage\n pipeline = mypipeline.cfg\n\nHowever, this only works if your blueprint package includes the following setuptools entrypoint\nin its ``setup.py`` ::\n\n entry_points=\"\"\"\n [z3c.autoinclude.plugin]\n target = transmogrify\n \"\"\",\n )\n\n.. NOTE:: Some transmogrifier blueprints assume they are running inside a Plone\n process, such as those in `plone.app.transmogrifier` (see http://pypi.python.org/pypi/plone.app.transmogrifier). Funnelweb\n doesn't run inside a Plone process so these blueprints won't work. 
If\n you want to upload content into Plone, you can instead use\n `transmogrify.ploneremote`_ which provides alternative implementations\n which will upload content remotely via XML-RPC.\n `transmogrify.ploneremote`_ is already included in funnelweb as it is\n what funnelweb's default pipeline uses.\n\nAttributes available in funnelweb pipeline\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nWhen using the default blueprints in funnelweb, the following are some of the attributes that\nwill become attached to the items that each blueprint has access to. These can be used in the various\ncondition statements etc. as well as in your own blueprints.\n\n``_site_url``\n The base of the URL as passed into the webcrawler\n\n``_path``\n The remainder of the URL. ``_site_url`` + ``_path`` = URL\n\n``_mimetype``\n The mimetype as returned by the crawler\n\n``_content``\n The content of the item crawled, including image, file or HTML data.\n\n``_orig_path``\n The original path of the item that was crawled. This is useful for setting redirects so\n you don't get 404 errors after migrating content.\n\n``_sort_order``\n An integer representing the order in which this item was crawled. 
Helps to determine\n what order items should be sorted in folders created on the server, if your site\n has navigation which has links ordered top to bottom.\n\n``_type``\n The type of object to be created, as returned by the \"typeguess\" step\n\n``title``, ``description``, ``text``, etc.\n The template steps will typically create fields with content in them taken from ``_content``\n\n``_template``\n The template steps will leave the HTML that wasn't separated out into different fields in this\n attribute.\n\n``_defaultpage``\n Set on a Folder item where you want to tell the uploading steps to set the contained item\n named in ``_defaultpage`` to be the default page shown on that folder instead of a content listing.\n\n``_transitions``\n Specify the workflow action you'd like to make on an item after it's uploaded or updated.\n\n``_origin``\n This is used internally with the `transmogrify.siteanalyser.relinker` blueprint as a way to\n tell it that you have changed the ``_path`` and you now want the relinker to find any links that\n refer to ``_origin`` and point them to ``_path``.\n\nThe Funnelweb Pipeline\n~~~~~~~~~~~~~~~~~~~~~~\n\nSee http://github.com/collective/funnelweb/blob/master/funnelweb/remote.cfg\nor type ::\n\n $> bin/funnelweb --show-pipeline\n\n\n \nContributing\n------------\n\n- Code repository: http://github.com/collective/funnelweb\n- Questions and comments to http://github.com/collective/funnelweb/issues\n- Report bugs at http://github.com/collective/funnelweb/issues\n\nThe code of funnelweb itself is fairly minimal. It just sets up and runs a transmogrifier pipeline.\nThe hard work is actually done by five packages which each contain one or more transmogrifier\nblueprints. 
These are:\n\nWebcrawler\n http://pypi.python.org/pypi/transmogrify.webcrawler\n https://github.com/djay/transmogrify.webcrawler\n\nHTMLContentExtractor\n http://pypi.python.org/pypi/transmogrify.htmlcontentextractor\n https://github.com/djay/transmogrify.htmlcontentextractor\n \nSiteAnalyser\n http://pypi.python.org/pypi/transmogrify.siteanalyser\n https://github.com/djay/transmogrify.siteanalyser\n \nPathSorter\n http://pypi.python.org/pypi/transmogrify.pathsorter \n https://github.com/djay/transmogrify.pathsorter \n \nPloneRemote\n http://pypi.python.org/pypi/transmogrify.ploneremote\n https://github.com/djay/transmogrify.ploneremote\n \nEach has its own issue tracker and we will accept pull requests for new functionality or bug\nfixes. The current state of documentation and testing is not yet at a high level.\n\n\nHistory\n-------\n\n- 2008 Built to import large corporate intranet\n- 2009 released pretaweb.funnelweb (deprecated). Built into Plone UI > Actions > Import\n- 2010 Split blueprints into transmogrify.* release on pypi\n- 2010 collective.developermanual sphinx to Plone uses funnelweb blueprints\n- 2010 funnelweb Recipe + Script released\n- 2011 split runner out into mr.migrator\n\n\n\n\n.. _`collective.transmogrifier`: http://pypi.python.org/pypi/collective.transmogrifier\n.. _`Plone`: http://plone.org\n.. 
_`mr.migrator`: http://pypi.python.org/pypi/mr.migrator\n\n\nContributors\n************\n\n\"Dylan Jay\", Author\n\"Vitaliy Podoba\", Contributor\n\"Rok Garbas\", Contributor\n\"Mikko Ohtamaa\", Contributor\n\"Tim Knapp\", Contributor\n\n\nChange history\n**************\n\n1.1.1 (2012-06-28)\n------------------\n- fix an issue in setup.py that broke buildout\n\n\n1.1 (2012-04-28)\n----------------\n\n- set default pipeline so can be used without buildout [djay]\n- better documentation [djay]\n- change recommended way to use funnelweb [djay]\n- new sitemapper step [djay]\n- can now crawl get requests [djay]\n- various fixes in transmogrify.* dependencies. see change logs [djay]\n\n1.0 (2011-06-29)\n----------------\n\n- fix default cmd line pipeline to funnelweb.remote\n- remove runner code and checkin ttw.cfg\n- funnelweb now depends on mr.migrator\n- improve urltidy to handle .asp .php\n- include index.asp and index.php as default pages\n- handle override pipeline in buildout\n- add --show-pipeline command\n- fix handling of --pipeline\n\n\n\n1.0b7 (2011-02-12)\n------------------\n- fix bug in commandline overrides\n- only open cache files when needed so don't run out of handles\n- follow http-equiv refresh links\n- don't strip html head\n\n1.0b6 (2011-02-06)\n------------------\n\n- turn off templateauto by default\n- added hideguess step. 
currently just manual setting of what to hide\n- multiline value overrides possible from commandline\n- files use file pointers to reduce memory usage\n- cache saves .metadata files to record and playback headers\n- ploneremote: fix bug in debug output\n- show error if text is None\n- fix bug with bad chars in rewritten links\n- fix bug in losing items\n- templates: handle '/text()' in xpaths\n- templates: new 'optionaltext' rule format\n- set default page on the import root\n\n\n1.0b5 (2010-12-13)\n------------------\n\n- fix ordering of commandline help\n\n- fix help for @debug\n\n1.0b4 (2010-12-13)\n------------------\n\n- fix encoding problems caused by cache\n\n- better debugging\n\n- commandline to turn on debug info\n\n- script install uses buildout part name\n\n- extra documentation\n\n- commandline help\n\n\n1.0b3 (2010-11-20)\n------------------\n\n- fixed --pipeline option\n\n- fixed eggs= options\n\n- added prune support\n\n- removed transmogrify.htmltesting as a dependency\n\n- improved documentation\n [Jean Jordaan]\n\n- moved main repository to github collective https://github.com/collective/funnelweb\n\n\n\n1.0b2 (2010-11-09)\n------------------\n\n- Removed z3c.recipe.scripts as a dependency since it creates version conflicts with older zope installs.\n [\"Dylan Jay\"]\n\n- Make default cache be put in domain specific directory\n [\"Dylan Jay\"]\n\n- Put conditions on site analyser and turn off by default\n\n1.0b1 (2010-11-08)\n------------------\n\n- Initial release tying together new and original funnelweb recipes into\n documented commandline/buildout interface\n [\"Dylan Jay\"]\n\nDownload\n********", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://pypi.python.org/pypi/funnelweb", "keywords": "buildout crawler spider plone", "license": "GPL", "maintainer": null, "maintainer_email": null, "name": "funnelweb", "package_url": 