{ "info": { "author": "keul", "author_email": "luca@keul.it", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "License :: OSI Approved :: GNU General Public License (GPL)", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Topic :: Internet :: WWW/HTTP", "Topic :: System :: Shells", "Topic :: Utilities" ], "description": ".. contents::\n\nIntroduction\n============\n\nLet's say that you want to access a slow streaming site to see something (obviously: something not\nprotected by copyright).\n\nThe streaming site uses URLs in this format:\n\n http://legal-streaming-site.org/program-name/season5/episode4/\n\nEvery page contains some HTML code like the following::\n\n ....\n 
\n ...\n \n ...\n\nLet's say this is the URL for episode 4 of the fifth season of your program.\nYou know that this program has 6 seasons with 22 episodes each.\n\nAs said before, this site is very slow, so you prefer to download episodes in the background\nand watch them later.\n\nTo download them you need to inspect the HTML inside the page and get some resources\n(commonly: an FLV file).\nThe best would be to download *all* episodes in a single (long-running) operation instead of\ndoing it manually.\n\n**Allanon** will help you with exactly such tasks.\nYou simply need to provide it with:\n\n* a simple URL or a *dynamic URL pattern*\n* a *query selector* for resources inside the page\n\nQuick example (you can keep it on a single line)::\n\n $ allanon --search=\"#movie-container embed\" \\\n > \"http://legal-streaming-site.org/program-name/season{1:6}/episode{1:22}\"\n\nDocumentation\n=============\n\nInstallation\n------------\n\nYou can use `distribute`__ or `pip`__ to install the utility in your Python environment.\n\n__ http://pypi.python.org/pypi/distribute\n__ http://pypi.python.org/pypi/pip\n\n::\n\n $ easy_install Allanon\n\nor alternatively::\n\n $ pip install Allanon\n\nInvocation\n----------\n\nAfter installing you will be able to run the ``allanon`` script from the command line.\nFor example, run the following to access the utility help::\n\n $ allanon --help\n\nBasic usage (you probably don't need Allanon at all for this)\n-------------------------------------------------------------\n\nThe ``allanon`` script accepts a URL (or a list of URLs) to be downloaded::\n\n $ allanon http://myhost/folder/image1.jpg http://myhost/folder/image2.jpg ...\n\nEvery command line URL given to Allanon can be a simple URL or a *URL model* like the following::\n\n $ allanon \"http://myhost/folder/image{1:50}.jpg\"\n\nThis will crawl 50 different URLs automatically. 
\n\nMain usage (things become interesting now)\n------------------------------------------\n\nThe ``allanon`` script takes an additional ``--search`` parameter (see the first example given\nabove).\nWhen you provide it, you mean:\n\n \"*I don't want to download those URLs directly, but those URLs contain links to\n files that I really want*\".\n\nThe search parameter format must be CSS 3 compatible, like the one supported by the famous\n`jQuery library`__, and it's based on the `pyquery`__ library.\nSee its documentation for more details about what you can look for.\n\n__ http://api.jquery.com/category/selectors/\n__ http://packages.python.org/pyquery/\n\nExtreme usage\n-------------\n\nThe ``--search`` parameter can be provided multiple times::\n\n $ allanon --search=\"ul.image-repos a\" \\\n > --search=\"div.image-containers img\" \\\n > \"http://image-repository-sites.org/category{1:30}.html\"\n\nWhen you provide (for example) two different search parameters, you mean:\n\n \"*I don't want to download resources at given URLs. 
Those URLs contain links to secondary pages,\n and inside those pages there are links to resources I want to download*\"\n\nFilters are applied in the given order, so:\n\n* Allanon will search inside 30 pages named *category1.html*, *category2.html*, ...\n* inside those pages, Allanon will look for all links inside ``ul`` tags with CSS class\n *image-repos* and recursively crawl them.\n* inside those pages, Allanon will look for images inside ``div`` tags with class *image-containers*.\n* images will be downloaded.\n\nPotentially you can continue this way, providing a third level of filters, and so on.\n\nNaming and storing downloaded resources\n---------------------------------------\n\nBy default Allanon downloads all files into the current directory, so a filename conflict\nis possible.\nYou can control how and where files are downloaded by changing the filename dynamically with the\n``--filename`` option and/or changing the directory where files are stored with the\n``--directory`` option.\n\nAn example::\n\n $ allanon --filename=\"%HOST-%INDEX-section%1-version%3-%FULLNAME\" \\\n > \"http://foo.org/pdf-repo-{1:10}/file{1:50}.pdf?version={0:3}\"\n\nAs you can see, ``--filename`` accepts some *markers* that can be used to better organize\nresources:\n\n``%HOST``\n Will be replaced with the hostname used in the URL.\n``%INDEX``\n Is a progressive number from 1 to the number of downloaded resources.\n``%X``\n When using dynamic URL models you can refer to the current number of a URL\n section.\n \n In this case \"%1\" is the current \"pdf-repo-*x*\" number and \"%3\" is the \"version\"\n parameter value.\n``%FULLNAME``\n The original filename (the one used if ``--filename`` is not provided).\n \n You can also use ``%NAME`` and ``%EXTENSION`` to get only the name of the file\n (without extension) or simply the extension.\n\nThe ``--directory`` option can be a simple directory name or a directory path (in unix-like\nformat, for example \"``foo/bar/baz``\").\n\nAn example::\n\n $ allanon 
--directory=\"/home/keul/%HOST/%1\" \\\n > \"http://foo.org/pdf-repo-{1:10}/file{1:50}.pdf\" \\\n > \"http://baz.net/pdf-repo-{1:10}/file{1:50}.pdf\"\n\nThe ``--directory`` option also supports some of the markers: you can use ``%HOST``, ``%INDEX`` and ``%X``\nwith the same meaning given above.\n\nTODO\n====\n\nThis utility is in alpha stage: a lot of things can go wrong when downloading, and many features\nare missing:\n\n* verbosity controls\n* bandwidth control\n* multi-thread (let's look at `grequests`__)\n* Python 3\n\n__ https://github.com/kennethreitz/grequests\n\nIf you find other bugs or want to ask for missing features, use the `product's issue tracker`__.\n\n__ https://github.com/keul/Allanon/issues\n\n\nChangelog\n=========\n\n0.2 (2014-01-02)\n----------------\n\n- Do not crawl or download when on error pages\n- Handle duplicate filenames when downloading resources:\n added the ``--check-duplicate`` option\n- Application specific user agent header (configurable\n through ``--user-agent`` option)\n- The ``--directory`` option can be a path and so create\n intermediate directories, and accepts markers\n- More efficient memory usage\n- Show progress bar when getting resources\n (now requires `progress`__)\n- Fixed problem when getting quoted filename from response\n header\n- Added the ``--timeout`` option\n- Added the ``--sleep`` option\n\n__ https://pypi.python.org/pypi/progress\n\n0.1 (2013-01-05)\n----------------\n\n- first release", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/keul/Allanon", "keywords": "crawler robot spider downloader parser", "license": "GPL", "maintainer": null, "maintainer_email": null, "name": "Allanon", "package_url": "https://pypi.org/project/Allanon/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/Allanon/", "project_urls": { "Download": "UNKNOWN", "Homepage": 
"https://github.com/keul/Allanon" }, "release_url": "https://pypi.org/project/Allanon/0.2/", "requires_dist": null, "requires_python": null, "summary": "A Web crawler that visit a predictable set of URLs, and automatically download resources you want from them", "version": "0.2" }, "last_serial": 958991, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "c1ac97d8dbe1e7f3ab6211bf1f2d29ea", "sha256": "ade95756bee811f691da3781a6a3c5e40bfa41bf9eb4de63fd8c2ba405febe40" }, "downloads": -1, "filename": "Allanon-0.1.zip", "has_sig": false, "md5_digest": "c1ac97d8dbe1e7f3ab6211bf1f2d29ea", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30153, "upload_time": "2013-01-05T18:13:29", "url": "https://files.pythonhosted.org/packages/82/fa/f7a53a652f06b62aabd9c1c4176d77a10be779dcd56b04e42091696e6f51/Allanon-0.1.zip" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "14619dbf30151b42c1aa13e7eca550f5", "sha256": "f533fa7b898e1672ce2625d2722fdab1603f115fcb2bfb628e2e78e9381396a1" }, "downloads": -1, "filename": "Allanon-0.2.zip", "has_sig": false, "md5_digest": "14619dbf30151b42c1aa13e7eca550f5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 34868, "upload_time": "2014-01-02T21:44:45", "url": "https://files.pythonhosted.org/packages/a9/c1/e0e038858fdbe4709461811dda106f8e6d8b3d633809ebe54046791e72ae/Allanon-0.2.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "14619dbf30151b42c1aa13e7eca550f5", "sha256": "f533fa7b898e1672ce2625d2722fdab1603f115fcb2bfb628e2e78e9381396a1" }, "downloads": -1, "filename": "Allanon-0.2.zip", "has_sig": false, "md5_digest": "14619dbf30151b42c1aa13e7eca550f5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 34868, "upload_time": "2014-01-02T21:44:45", "url": "https://files.pythonhosted.org/packages/a9/c1/e0e038858fdbe4709461811dda106f8e6d8b3d633809ebe54046791e72ae/Allanon-0.2.zip" } ] }