{ "info": { "author": "cardsurf", "author_email": "cardsurf@email.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: GNU General Public License v2 (GPLv2)", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Topic :: Internet :: WWW/HTTP", "Topic :: Internet :: WWW/HTTP :: Dynamic Content" ], "description": "xcrawler\n========\nA multi-threaded, open source web crawler\n\nFeatures\n---------\n* Use multiple threads to visit web pages\n* Extract web page data using XPath expressions or CSS selectors\n* Extract urls from a web page and visit extracted urls\n* Write extracted data to an output file\n* Set HTTP session parameters such as: cookies, SSL certificates, proxies\n* Set HTTP request parameters such as: header, body, authentication\n* Download files from the urls\n* Supports Python 2 and Python 3\n\nInstallation\n------------\n::\n\n pip install xcrawler\n\n| \n| When installing ``lxml`` library on Windows you may encounter ``Microsoft Visual C++ is required`` errors.\n| To install ``lxml`` library on Windows:\n\n#. Download and install Microsoft Windows SDK:\n\n * For Python 2.6, 2.7, 3.0, 3.1, 3.2: `Microsoft Windows SDK for .NET Framework 3.5 SP1 `_\n * For Python 3.3, 3.4: `Microsoft Windows SDK for .NET Framework 4.0 `_\n\n#. Click the Start Menu, search for and open the command prompt:\n\n * For Python 2.6, 2.7, 3.0, 3.1, 3.2: ``CMD Shell``\n * For Python 3.3, 3.4: ``Windows SDK 7.1 Command Prompt``\n\n#. 
Install ``lxml``\n\n::\n\n setenv /x86 /release && SET DISTUTILS_USE_SDK=1 && set STATICBUILD=true && pip install lxml\n\nUsage\n-----\n| Data and urls are extracted from a web page by a page scraper.\n| To extract data and urls from a web page use the following methods:\n| ``extract`` returns data extracted from a web page\n| ``visit`` returns next Pages to be visited\n| \n| A crawler can be configured before crawling web pages. A user can configure crawler settings such as:\n| * the number of threads used to visit web pages\n| * the name of an output file\n| * the request timeout\n| To run the crawler call:\n| ``crawler.run()``\n| \n| Examples of how to use xcrawler can be found at: https://github.com/cardsurf/xcrawler/tree/master/examples\n| \n\nXPath Example\n-------------\n.. code:: python\n\n from xcrawler import XCrawler, Page, PageScraper\n\n\n class Scraper(PageScraper):\n def extract(self, page):\n topics = page.xpath(\"//a[@class='question-hyperlink']/text()\")\n return topics\n\n\n start_pages = [ Page(\"http://stackoverflow.com/questions/16622802/center-image-within-div\", Scraper()) ]\n crawler = XCrawler(start_pages)\n crawler.config.output_file_name = \"stackoverflow_example_crawler_output.csv\"\n crawler.run()\n\nCSS Example\n-------------\n.. 
code:: python\n\n from xcrawler import XCrawler, Page, PageScraper\n\n\n class StackOverflowItem:\n def __init__(self):\n self.title = None\n self.votes = None\n self.tags = None\n self.url = None\n\n\n class UrlsScraper(PageScraper):\n def visit(self, page):\n hrefs = page.css_attr(\".question-summary h3 a\", \"href\")\n urls = page.to_urls(hrefs)\n return [Page(url, QuestionScraper()) for url in urls]\n\n\n class QuestionScraper(PageScraper):\n def extract(self, page):\n item = StackOverflowItem()\n item.title = page.css_text(\"h1 a\")[0]\n item.votes = page.css_text(\".question .vote-count-post\")[0].strip()\n item.tags = page.css_text(\".question .post-tag\")[0]\n item.url = page.url\n return item\n\n\n start_pages = [ Page(\"http://stackoverflow.com/questions?sort=votes\", UrlsScraper()) ]\n crawler = XCrawler(start_pages)\n crawler.config.output_file_name = \"stackoverflow_css_crawler_output.csv\"\n crawler.config.number_of_threads = 3\n crawler.run()\n\nFile Example\n-------------\n.. code:: python\n\n from xcrawler import XCrawler, Page, PageScraper\n\n\n class WikimediaItem:\n def __init__(self):\n self.name = None\n self.base64 = None\n\n\n class EncodedScraper(PageScraper):\n def extract(self, page):\n url = page.xpath(\"//div[@class='fullImageLink']/a/@href\")[0]\n item = WikimediaItem()\n item.name = url.split(\"/\")[-1]\n item.base64 = page.file(url)\n return item\n\n\n start_pages = [ Page(\"https://commons.wikimedia.org/wiki/File:Records.svg\", EncodedScraper()) ]\n crawler = XCrawler(start_pages)\n crawler.config.output_file_name = \"wikimedia_file_example_output.csv\"\n crawler.run()\n\nSession Example\n----------------\n.. 
code:: python\n\n from xcrawler import XCrawler, Page, PageScraper\n from requests.auth import HTTPBasicAuth\n\n\n class Scraper(PageScraper):\n def extract(self, page):\n return page.__str__()\n\n\n start_pages = [ Page(\"http://192.168.1.1/\", Scraper()) ]\n crawler = XCrawler(start_pages)\n crawler.config.output_file_name = \"router_session_example_output.csv\"\n crawler.config.session.headers = {\"User-Agent\": \"Custom User Agent\",\n \"Accept-Language\": \"fr\"}\n crawler.config.session.auth = HTTPBasicAuth('admin', 'admin')\n crawler.run()\n\nRequest Example\n----------------\n.. code:: python\n\n from xcrawler import XCrawler, Page, PageScraper\n\n\n class Scraper(PageScraper):\n def extract(self, page):\n return page.__str__()\n\n\n start_page = Page(\"http://192.168.5.5\", Scraper())\n start_page.request.cookies = {\"theme\": \"classic\"}\n crawler = XCrawler([start_page])\n crawler.config.request_timeout = (5, 5)\n crawler.config.output_file_name = \"router_request_example_output.csv\"\n crawler.run()\n\nDocumentation\n--------------\n| For more information about xcrawler see the source code and Python Docstrings: `source code `_\n| The documentation can also be accessed at runtime with Python's built-in ``help`` function:\n\n.. 
code:: python\n\n >>> import xcrawler\n >>> help(xcrawler.Config)\n # Information about the Config class\n >>> help(xcrawler.PageScraper.extract)\n # Information about the extract method of the PageScraper class\n\nLicence\n-------\nGNU GPL v2.0", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/cardsurf/xcrawler", "keywords": "crawler,spider,scraper", "license": "GNU GPL v2.0", "maintainer": "", "maintainer_email": "", "name": "xcrawler", "package_url": "https://pypi.org/project/xcrawler/", "platform": "Any", "project_url": "https://pypi.org/project/xcrawler/", "project_urls": { "Homepage": "https://github.com/cardsurf/xcrawler" }, "release_url": "https://pypi.org/project/xcrawler/1.3.0/", "requires_dist": null, "requires_python": "", "summary": "A multi-threaded, open source web crawler", "version": "1.3.0" }, "last_serial": 1910351, "releases": { "1.1.0": [ { "comment_text": "", "digests": { "md5": "a856999ff97fe28ee39c1bd9c669ccfc", "sha256": "7be5885e847a90b620b44fa464f0605e268b8bae646dfc3cec54c49924d7cffb" }, "downloads": -1, "filename": "xcrawler-1.1.0.zip", "has_sig": false, "md5_digest": "a856999ff97fe28ee39c1bd9c669ccfc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30251, "upload_time": "2015-09-29T11:15:22", "url": "https://files.pythonhosted.org/packages/25/02/4d7af4f5ab913539cc64bb82dc903c8d784414fcb8ccbf7a5ec5d6babdd9/xcrawler-1.1.0.zip" } ], "1.2.0": [ { "comment_text": "", "digests": { "md5": "f5f45e88212a69f8918a25e8cf01d992", "sha256": "f665a36cdef3103dbe46b75533a44ec5aa06b11e78b42cf80e15921f0b9742e5" }, "downloads": -1, "filename": "xcrawler-1.2.0.zip", "has_sig": false, "md5_digest": "f5f45e88212a69f8918a25e8cf01d992", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 71169, "upload_time": "2015-10-14T10:34:43", "url": 
"https://files.pythonhosted.org/packages/3d/72/e167eb7c69a6d24a0f77199324164fc71317d688f775c3a88122f518c79a/xcrawler-1.2.0.zip" } ], "1.3.0": [ { "comment_text": "", "digests": { "md5": "e0585a2c4d97eb13d70e631a706d2719", "sha256": "2536c3a903384fc727f35b4a73ab6a1b9e153a9e5586eabfa0b13bee55a7e2c4" }, "downloads": -1, "filename": "xcrawler-1.3.0.tar.gz", "has_sig": false, "md5_digest": "e0585a2c4d97eb13d70e631a706d2719", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 29971, "upload_time": "2016-01-18T19:00:28", "url": "https://files.pythonhosted.org/packages/e6/22/c42c3907f45bc36c3e8c04f8ec95eed62659e7c720e4db0933ad5e120a4d/xcrawler-1.3.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e0585a2c4d97eb13d70e631a706d2719", "sha256": "2536c3a903384fc727f35b4a73ab6a1b9e153a9e5586eabfa0b13bee55a7e2c4" }, "downloads": -1, "filename": "xcrawler-1.3.0.tar.gz", "has_sig": false, "md5_digest": "e0585a2c4d97eb13d70e631a706d2719", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 29971, "upload_time": "2016-01-18T19:00:28", "url": "https://files.pythonhosted.org/packages/e6/22/c42c3907f45bc36c3e8c04f8ec95eed62659e7c720e4db0933ad5e120a4d/xcrawler-1.3.0.tar.gz" } ] }