{ "info": { "author": "Learning Equality", "author_email": "ivan@learningequalty.org", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5" ], "description": "BasicCrawler\n============\n\nBasic web crawler that automates website exploration and producing web\nresource trees.\n\nTODO\n----\n\nVersion 0.2 TODO\n----------------\n\n- Finish \"is file\" logic to check content-type before downloading to\n avoid large downloads\n\n - infer file from extension in URL??\n\n- Make a single IGNORE\\_URLS list that accepts:\n- full URLs (strings)\n- compiled RE objects\n- functions for deciding what to ignore (anything callable)\n\n- path to url / vice versa (and possibly elsewhere): consider\n ``urllib.parse.urlparse``? [e.g. ``url.startswith(source_domain)`` could be\n ``source_domain in url.domain`` to make it more flexible with\n subdomains]\n- Additional valid domains can be specified but ``url_to_path_list``\n assumes adding CHANNEL\\_ROOT\\_DOMAIN [we may wish to expand all links\n based on parent URL]\n- refactor and remove need for MAIN\\_SOURCE\\_DOMAIN and use only\n SOURCE\\_DOMAINS instead\n\nFeature ideas\n-------------\n\n- Asynchronous download (not necessary but might be good for\n performance on large sites)\n- don't block for HTTP\n- allow multiple workers getting from queue\n\n- content\\_selector hints for default ``on_page`` handler to follow\n links only within a certain subset of the HTML tree. 
Can have:\n\n - site-wide selector at class level\n - pass in additional ``content_selector`` from referring page via\n context dict\n\n- Automatically detect standard embed tags (audio, video, pdfs) and add\n links to web resource tree in default ``on_page`` handler.\n\nUsage\n-----\n\nThe goal of the ``BasicCrawler`` class is to help with the initial\nexploration of the source website. It is your responsibility to write a\nsubclass that uses the HTML, URL structure, and content to guide the\ncrawling and produce the web resource tree.\n\nThe workflow is as follows:\n\n1. Create your subclass\n\n- set the following attributes\n\n - ``MAIN_SOURCE_DOMAIN`` e.g. ``'https://learningequality.org'``\n - ``START_PAGE`` e.g. ``'https://learningequality.org/'``\n\n2. Run for the first time by calling ``crawler.crawl()`` or as a command\n line script\n\n- The BasicCrawler has basic logic for visiting pages and will print\n out a summary of the auto-inferred site structure findings and\n recommendations based on the URL structure observed during the\n initial crawl.\n- Based on the number of times a link appears on different pages of the\n site, the crawler will suggest candidates for global navigation\n links. Most websites have an /about page, /contact us, and other such\n non-content-containing pages, which we do not want to include in the\n web resource tree. You should inspect these suggestions and decide\n which should be ignored (i.e. not crawled or included in the\n web\\_resource\\_tree output). To ignore URLs you can edit the\n attributes:\n\n - ``IGNORE_URLS`` (list of strings): crawler will ignore these URLs\n - ``IGNORE_URL_PATTERNS`` (list of RE objects): regular expressions\n that do the same thing. Edit your crawler subclass's code and append\n to ``IGNORE_URLS`` and ``IGNORE_URL_PATTERNS`` the URLs you want\n to skip (anything that is not likely to contain content).\n\n3. 
Run the crawler again; this time there should be less noise in the\n output.\n\n- Note the suggestion for different paths that you might want to handle\n specially (e.g. ``/course``, ``/lesson``, ``/content``, etc.). You can\n define methods on your subclass to handle each of these URL types:\n\n ::\n\n def on_course(self, url, page, context):\n # what do you want the crawler to do when it visits the course with `url`\n # in the `context` (used for extra metadata; contains reference to parent)\n # The BeautifulSoup parsed contents of the `url` are provided as `page`.\n pass\n\n def on_lesson(self, url, page, context):\n # what do you want the crawler to do when it visits the lesson\n pass\n\n def on_content(self, url, page, context):\n # what do you want the crawler to do when it visits the content url\n pass\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/learningequality/BasicCrawler", "keywords": "basiccrawler", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "basiccrawler", "package_url": "https://pypi.org/project/basiccrawler/", "platform": "", "project_url": "https://pypi.org/project/basiccrawler/", "project_urls": { "Homepage": "https://github.com/learningequality/BasicCrawler" }, "release_url": "https://pypi.org/project/basiccrawler/0.1.2/", "requires_dist": null, "requires_python": "", "summary": "Basic web crawler that automates website exploration and producing web resource trees.", "version": "0.1.2" }, "last_serial": 3368345, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "a76b1ae29a26f5e7c9611ada4c62ec7d", "sha256": "46200abbd594894915756291548c34cd57f2983ce2ee55d6709d3e6138510cc5" }, "downloads": -1, "filename": "basiccrawler-0.1.0.tar.gz", "has_sig": false, "md5_digest": "a76b1ae29a26f5e7c9611ada4c62ec7d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14731, 
"upload_time": "2017-11-21T17:18:09", "url": "https://files.pythonhosted.org/packages/67/a0/6d159230c76de46e082c09b6c0731c518cdec035baf940dde5321319f1f8/basiccrawler-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "d25fb1960374dc569e35dbebf70c1f7a", "sha256": "510328ef70a219e0d50a45fa82ae620c2bace84031f3cf61334724d74f3e041f" }, "downloads": -1, "filename": "basiccrawler-0.1.1.tar.gz", "has_sig": false, "md5_digest": "d25fb1960374dc569e35dbebf70c1f7a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14782, "upload_time": "2017-11-21T20:55:36", "url": "https://files.pythonhosted.org/packages/02/78/80a3b9a853c95128da4dd0db7ca71385b1db17427f7c1552cbb7d650185a/basiccrawler-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "e264c49cd719b72fdd2cac5f1b549182", "sha256": "060968468028acece57a29372513f8699353c8b30d3f16e6dfa6fd72bf1d5214" }, "downloads": -1, "filename": "basiccrawler-0.1.2.tar.gz", "has_sig": false, "md5_digest": "e264c49cd719b72fdd2cac5f1b549182", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15728, "upload_time": "2017-11-27T16:32:00", "url": "https://files.pythonhosted.org/packages/53/2f/e31ac0f0037541af9e3398f6884090dd96fad479efc1a2c753b68f3fc22b/basiccrawler-0.1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e264c49cd719b72fdd2cac5f1b549182", "sha256": "060968468028acece57a29372513f8699353c8b30d3f16e6dfa6fd72bf1d5214" }, "downloads": -1, "filename": "basiccrawler-0.1.2.tar.gz", "has_sig": false, "md5_digest": "e264c49cd719b72fdd2cac5f1b549182", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15728, "upload_time": "2017-11-27T16:32:00", "url": "https://files.pythonhosted.org/packages/53/2f/e31ac0f0037541af9e3398f6884090dd96fad479efc1a2c753b68f3fc22b/basiccrawler-0.1.2.tar.gz" } ] }