{
    "info": {
        "author": "cardsurf",
        "author_email": "cardsurf@email.com",
        "bugtrack_url": null,
        "classifiers": [
            "License :: OSI Approved :: GNU General Public License v2 (GPLv2)",
            "Operating System :: OS Independent",
            "Programming Language :: Python",
            "Programming Language :: Python :: 2",
            "Programming Language :: Python :: 2.7",
            "Programming Language :: Python :: 3",
            "Programming Language :: Python :: 3.4",
            "Topic :: Internet :: WWW/HTTP",
            "Topic :: Internet :: WWW/HTTP :: Dynamic Content"
        ],
        "description": "xcrawler\n========\nA multi-threaded, open source web crawler\n\nFeatures\n---------\n* Use multiple threads to visit web pages\n* Extract web page data using XPath expressions or CSS selectors\n* Extract urls from a web page and visit extracted urls\n* Write extracted data to an output file\n* Set HTTP session parameters such as: cookies, SSL certificates, proxies\n* Set HTTP request parameters such as: header, body, authentication\n* Download files from the urls\n* Supports Python 2 and Python 3\n\nInstallation\n------------\n::\n\n    pip install xcrawler\n\n| \n| When installing ``lxml`` library on Windows you may encounter ``Microsoft Visual C++ is required`` errors.\n| To install ``lxml`` library on Windows:\n\n#. Download and install Microsoft Windows SDK:\n\n   * For Python 2.6, 2.7, 3.0, 3.1, 3.2: `Microsoft Windows SDK for .NET Framework 3.5 SP1 <http://www.microsoft.com/en-us/download/confirmation.aspx?id=3138>`_\n   * For Python 3.3, 3.4: `Microsoft Windows SDK for .NET Framework 4.0 <http://www.microsoft.com/en-us/download/confirmation.aspx?id=8279>`_\n\n#. Click the Start Menu, search for and open the command prompt:\n\n   * For Python 2.6, 2.7, 3.0, 3.1, 3.2: ``CMD Shell``\n   * For Python 3.3, 3.4: ``Windows SDK 7.1 Command Prompt``\n\n#. Install ``lxml``\n\n::\n\n    setenv /x86 /release && SET DISTUTILS_USE_SDK=1 && set STATICBUILD=true && pip install lxml\n\nUsage\n-----\n| Data and urls are extracted from a web page by a page scraper.\n| To extract data and urls from a web page use the following methods:\n| ``extract`` returns data extracted from a web page\n| ``visit`` returns next Pages to be visited\n| \n| A crawler can be configured before crawling web pages. A user can configure such settings of the crawler as:\n| * the number of threads used to visit web pages\n| * the name of an output file\n| * the request timeout\n| To run the crawler call:\n| ``crawler.run()``\n| \n| Examples how to use xcrawler can be found at: https://github.com/cardsurf/xcrawler/tree/master/examples\n| \n\nXPath Example\n-------------\n.. code:: python\n\n    from xcrawler import XCrawler, Page, PageScraper\n\n\n    class Scraper(PageScraper):\n        def extract(self, page):\n            topics = page.xpath(\"//a[@class='question-hyperlink']/text()\")\n            return topics\n\n\n    start_pages = [ Page(\"http://stackoverflow.com/questions/16622802/center-image-within-div\", Scraper()) ]\n    crawler = XCrawler(start_pages)\n    crawler.config.output_file_name = \"stackoverflow_example_crawler_output.csv\"\n    crawler.run()\n\nCSS Example\n-------------\n.. code:: python\n\n    from xcrawler import XCrawler, Page, PageScraper\n\n\n    class StackOverflowItem:\n        def __init__(self):\n            self.title = None\n            self.votes = None\n            self.tags = None\n            self.url = None\n\n\n    class UrlsScraper(PageScraper):\n        def visit(self, page):\n            hrefs = page.css_attr(\".question-summary h3 a\", \"href\")\n            urls = page.to_urls(hrefs)\n            return [Page(url, QuestionScraper()) for url in urls]\n\n\n    class QuestionScraper(PageScraper):\n        def extract(self, page):\n            item = StackOverflowItem()\n            item.title = page.css_text(\"h1 a\")[0]\n            item.votes = page.css_text(\".question .vote-count-post\")[0].strip()\n            item.tags = page.css_text(\".question .post-tag\")[0]\n            item.url = page.url\n            return item\n\n\n    start_pages = [ Page(\"http://stackoverflow.com/questions?sort=votes\", UrlsScraper()) ]\n    crawler = XCrawler(start_pages)\n    crawler.config.output_file_name = \"stackoverflow_css_crawler_output.csv\"\n    crawler.config.number_of_threads = 3\n    crawler.run()\n\nFile Example\n-------------\n.. code:: python\n\n    from xcrawler import XCrawler, Page, PageScraper\n\n\n    class WikimediaItem:\n        def __init__(self):\n            self.name = None\n            self.base64 = None\n\n\n    class EncodedScraper(PageScraper):\n        def extract(self, page):\n            url = page.xpath(\"//div[@class='fullImageLink']/a/@href\")[0]\n            item = WikimediaItem()\n            item.name = url.split(\"/\")[-1]\n            item.base64 = page.file(url)\n            return item\n\n\n    start_pages = [ Page(\"https://commons.wikimedia.org/wiki/File:Records.svg\", EncodedScraper()) ]\n    crawler = XCrawler(start_pages)\n    crawler.config.output_file_name = \"wikimedia_file_example_output.csv\"\n    crawler.run()\n\nSession Example\n----------------\n.. code:: python\n\n    from xcrawler import XCrawler, Page, PageScraper\n    from requests.auth import HTTPBasicAuth\n\n\n    class Scraper(PageScraper):\n        def extract(self, page):\n            return page.__str__()\n\n\n    start_pages = [ Page(\"http://192.168.1.1/\", Scraper()) ]\n    crawler = XCrawler(start_pages)\n    crawler.config.output_file_name = \"router_session_example_output.csv\"\n    crawler.config.session.headers = {\"User-Agent\": \"Custom User Agent\",\n                                      \"Accept-Language\": \"fr\"}\n    crawler.config.session.auth = HTTPBasicAuth('admin', 'admin')\n    crawler.run()\n\nRequest Example\n----------------\n.. code:: python\n\n    from xcrawler import XCrawler, Page, PageScraper\n\n\n    class Scraper(PageScraper):\n        def extract(self, page):\n            return page.__str__()\n\n\n    start_page = Page(\"http://192.168.5.5\", Scraper())\n    start_page.request.cookies = {\"theme\": \"classic\"}\n    crawler = XCrawler([start_page])\n    crawler.config.request_timeout = (5, 5)\n    crawler.config.output_file_name = \"router_request_example_output.csv\"\n    crawler.run()\n\nDocumentation\n--------------\n| For more information about xcrawler see the source code and Python Docstrings: `source code <https://github.com/cardsurf/xcrawler/tree/master/xcrawler/core/>`_\n| The documentation can also be accessed at runtime with Python's built-in ``help`` function:\n\n.. code:: python\n\n    >>> import xcrawler\n    >>> help(xcrawler.Config)\n        # Information about the Config class\n    >>> help(xcrawler.PageScraper.extract)\n        # Information about the extract method of the PageScraper class\n\nLicence\n-------\nGNU GPL v2.0",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/cardsurf/xcrawler",
        "keywords": "crawler,spider,scraper",
        "license": "GNU GPL v2.0",
        "maintainer": "",
        "maintainer_email": "",
        "name": "xcrawler",
        "package_url": "https://pypi.org/project/xcrawler/",
        "platform": "Any",
        "project_url": "https://pypi.org/project/xcrawler/",
        "project_urls": {
            "Homepage": "https://github.com/cardsurf/xcrawler"
        },
        "release_url": "https://pypi.org/project/xcrawler/1.3.0/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "A multi-threaded, open source web crawler",
        "version": "1.3.0"
    },
    "last_serial": 1910351,
    "releases": {
        "1.1.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "a856999ff97fe28ee39c1bd9c669ccfc",
                    "sha256": "7be5885e847a90b620b44fa464f0605e268b8bae646dfc3cec54c49924d7cffb"
                },
                "downloads": -1,
                "filename": "xcrawler-1.1.0.zip",
                "has_sig": false,
                "md5_digest": "a856999ff97fe28ee39c1bd9c669ccfc",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 30251,
                "upload_time": "2015-09-29T11:15:22",
                "url": "https://files.pythonhosted.org/packages/25/02/4d7af4f5ab913539cc64bb82dc903c8d784414fcb8ccbf7a5ec5d6babdd9/xcrawler-1.1.0.zip"
            }
        ],
        "1.2.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "f5f45e88212a69f8918a25e8cf01d992",
                    "sha256": "f665a36cdef3103dbe46b75533a44ec5aa06b11e78b42cf80e15921f0b9742e5"
                },
                "downloads": -1,
                "filename": "xcrawler-1.2.0.zip",
                "has_sig": false,
                "md5_digest": "f5f45e88212a69f8918a25e8cf01d992",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 71169,
                "upload_time": "2015-10-14T10:34:43",
                "url": "https://files.pythonhosted.org/packages/3d/72/e167eb7c69a6d24a0f77199324164fc71317d688f775c3a88122f518c79a/xcrawler-1.2.0.zip"
            }
        ],
        "1.3.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "e0585a2c4d97eb13d70e631a706d2719",
                    "sha256": "2536c3a903384fc727f35b4a73ab6a1b9e153a9e5586eabfa0b13bee55a7e2c4"
                },
                "downloads": -1,
                "filename": "xcrawler-1.3.0.tar.gz",
                "has_sig": false,
                "md5_digest": "e0585a2c4d97eb13d70e631a706d2719",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 29971,
                "upload_time": "2016-01-18T19:00:28",
                "url": "https://files.pythonhosted.org/packages/e6/22/c42c3907f45bc36c3e8c04f8ec95eed62659e7c720e4db0933ad5e120a4d/xcrawler-1.3.0.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "e0585a2c4d97eb13d70e631a706d2719",
                "sha256": "2536c3a903384fc727f35b4a73ab6a1b9e153a9e5586eabfa0b13bee55a7e2c4"
            },
            "downloads": -1,
            "filename": "xcrawler-1.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "e0585a2c4d97eb13d70e631a706d2719",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 29971,
            "upload_time": "2016-01-18T19:00:28",
            "url": "https://files.pythonhosted.org/packages/e6/22/c42c3907f45bc36c3e8c04f8ec95eed62659e7c720e4db0933ad5e120a4d/xcrawler-1.3.0.tar.gz"
        }
    ]
}