{ "info": { "author": "Jesus Andres Castaneda Sosa", "author_email": "jesus.cast.sosa@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Topic :: Software Development :: Build Tools" ], "description": "Scraptor\n=======\nScraptor is a pretentious - pretentious because it cannot even do half of the features it aims (yet) - scraping framework that wants to scale and wants to grow. Scraptor is a child T-Rex scrapper and is still learning a lot. Maybe one day scraptor will live up to his goals.\n\nSyntax\n=======\nScraptor defines data as sets of fields. In order to specify a field you use the decorator @field and specify a callback function that handles the result before it is saved. A field can take several parameters. The syntax for defining a field is:\n```python\n@field(css_selector, name, attr)\ndef callback(field_value):\n\t# Do something with field_value before saving\n\treturn field_value\n# 'css_selector' and 'name' are required, 'attr' is optional\n```\nThe following field deletes the characters 'http' and 'https' from links\n```python\n@field('a', name = \"link\", attr = 'href')\ndef clean(link):\n\treturn link.replace(\"http://\",\"\").replace(\"https://\",\"\")\n```\nIn case the attr is ommitted, the field returns the text value of the element\n```python\n@field('p', name='paragraph')\ndef censor(text):\n\treplacement_dictionary = [ (\"fuck\", \"great\"), (\"shit\",\"nice\") ]\n\tfor word in replacement_dictionary:\n\t\ttext.replace(word[0], word[1])\n\treturn text\n```\nAfter defining all the fields you call run with the url to scrape and the css selector (nodeOfType) that defines a container node. If nodeOfType is ommitted the container node is the whole document.\n```python\nrun(url = \"https://twitter.com/i/moments\", nodeOfType = \".MomentCapsuleSummary\")\n```\n\nExample\n=======\nThe following example extracts the url of the image and the title of twitters moments. It is saved as example_links.py\n```python\nfrom scraptor import *\n\n@field(\".MomentCapsuleDetails-title\", name=\"title\")\ndef y(x):\treturn x\n\n@field(\".MomentMediaItem-entity--image\", name=\"imagesURL\", attr = \"src\")\ndef y(x):\treturn x\n\nrun(url = \"https://twitter.com/i/moments\", nodeOfType = \".MomentCapsuleSummary\")\n\n# RESULT EXAMPLE - RUN on monday November 23rd, 2015\n# {'imagesURL': u'https://pbs.twimg.com/media/CUhQSWoWEAA1tis.jpg:large', 'title': u'\"Anti-Muslim is Anti-American\" column sparks controversy'}\n# {'imagesURL': u'https://pbs.twimg.com/media/CUDBMH2WwAEF75C.jpg:large', 'title': u'LeBron & Steph continue NBA domination'}\n# {'imagesURL': u'https://pbs.twimg.com/media/CUhYzbRU8AAlaoT.png:large', 'title': u'When Slack goes down'}\n# {'imagesURL': u'https://pbs.twimg.com/media/CUdO5giUcAE8oMT.jpg:large', 'title': u'Celebrities only black people know'}\n# {'imagesURL': u'https://pbs.twimg.com/media/CUghf-tWsAQ5ftS.jpg:large', 'title': u\"New Game of Thrones poster teases Jon Snow's fate\"}\n# {'imagesURL': u'https://o.twimg.com/2/proxy.jpg?t=HBiTAWh0dHBzOi8vdi5jZG4udmluZS5jby9yL3ZpZGVvcy9FQkM1Q0FERUFGMTE0OTkwNDIzMjA3MDE4MDg2NF8zOGI3OGNhZWZhMC4xLjEuOTU5NzYzNDQ2MjUwNTExMzc0Ny5tcDQuanBnP3ZlcnNpb25JZD01eU54dXFnX2NrbHhoWW8zamlGRzd5UHEuWHhCVXYyMBTABxTABwAWABIA&s=xlxoIi9Ri3VEJqq8cHVbcS04UE2-2lu32hf-r4rilsU', 'title': u'Mouth-watering Thanksgiving spreads'}\n# {'imagesURL': u'https://o.twimg.com/2/proxy.jpg?t=HBiUAWh0dHBzOi8vdi5jZG4udmluZS5jby9yL3ZpZGVvcy8zQTVBMEVDMjlFMTI3NjA1NDA3MTQ0MjM5NTEzNl80N2MzMjAzMjVhNi4zLjAuMTgwNjI0NjIyNDA1Njc2NDMxMjMubXA0LmpwZz92ZXJzaW9uSWQ9UUsycUZsbUM4NkFZVGdidHd0OE9KYUoya2R1ODBkQnkUwAcUwAcAFgASAA&s=PS2LPX-HQMWYau5Rvj5SXvdMuGVFp0Q1ILd8Ead3QZo', 'title': u'Show us your fat pets'}\n# {'imagesURL': u'https://pbs.twimg.com/tweet_video_thumb/CUf9-rSW4AA3DWC.png', 'title': u'Happy Doctor Who Day, Whovians'}\n```\n\nTODO\n=======\nImplementation of the following:\n\nClass | Descrition\n------------------------ | ------------------------\nclass Storage | Backend for saving. Currently aiming towards Firebase, and files of type CSV, XML, HTML, and JSON.\nclass Formats | Used by storage\nclass Paginations | Decision tree for finding pagination dom elements or use actions to continue scraping.\nclass Instructions | Maybe a cli ?\nclass ImageStorages | Only aiming at Imgurl", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jesuscast/scraptor", "keywords": "scraping development", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "scraptor", "package_url": "https://pypi.org/project/scraptor/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/scraptor/", "project_urls": { "Homepage": "https://github.com/jesuscast/scraptor" }, "release_url": "https://pypi.org/project/scraptor/0.5.0/", "requires_dist": [ "selenium", "requests" ], "requires_python": "", "summary": "Scraptor scraping micro framework", "version": "0.5.0" }, "last_serial": 1962650, "releases": { "0.2.0": [ { "comment_text": "", "digests": { "md5": "25951ed0e61bb86fa1219df59e227c47", "sha256": "7d5d024ba8dcf4bc16f40340f504e3326fbf3b73e711fa938cf9a902fc469bd0" }, "downloads": -1, "filename": "scraptor-0.2.0-py2-none-any.whl", "has_sig": false, "md5_digest": "25951ed0e61bb86fa1219df59e227c47", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 8375, "upload_time": "2015-11-25T11:38:49", "url": "https://files.pythonhosted.org/packages/0b/3d/9c384f1b8a155c11a4452f25ef46aff9da0b28ed05df989628f7922f4e1b/scraptor-0.2.0-py2-none-any.whl" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "f44e1ba2fd34a9dcce2c428ec666bd13", "sha256": "2c8eb673c3352f7624cc4a7848bba0abe929e4fefd58b6044fd3ed041bd8a621" }, "downloads": -1, "filename": "scraptor-0.2.1-py2-none-any.whl", "has_sig": false, "md5_digest": "f44e1ba2fd34a9dcce2c428ec666bd13", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 8362, "upload_time": "2015-11-25T11:51:50", "url": "https://files.pythonhosted.org/packages/2f/28/49d9282b951b369117e52cb3296d5cc7ae8c2d6e5d5d07a367b469e4e44b/scraptor-0.2.1-py2-none-any.whl" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "9b4a90469b772a66bdbc6e7966167739", "sha256": "95ae8184fbcf9e86517d918432bff6c9ca8329fc963d573cd8c51b2d2c43f0e0" }, "downloads": -1, "filename": "scraptor-0.2.2-py2-none-any.whl", "has_sig": false, "md5_digest": "9b4a90469b772a66bdbc6e7966167739", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 8354, "upload_time": "2015-11-25T13:17:59", "url": "https://files.pythonhosted.org/packages/6e/61/8dc29fe4e818b9d943f72bd1a91101a988c75979a1cff54adf2b9bbc4112/scraptor-0.2.2-py2-none-any.whl" } ], "0.5.0": [ { "comment_text": "", "digests": { "md5": "179085b8a34d2705031505e1c4519b7e", "sha256": "5d6e77c14219d4e5f8605e1fb7756970c8cc8e6200d8d5ce984a1dbf5a83a321" }, "downloads": -1, "filename": "scraptor-0.5.0-py2-none-any.whl", "has_sig": false, "md5_digest": "179085b8a34d2705031505e1c4519b7e", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 11658, "upload_time": "2016-02-18T06:19:58", "url": "https://files.pythonhosted.org/packages/01/62/724795ac34598463ae41dfe0cafead1759ce70cc27b4f8f07dd6f2d63ddf/scraptor-0.5.0-py2-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "179085b8a34d2705031505e1c4519b7e", "sha256": "5d6e77c14219d4e5f8605e1fb7756970c8cc8e6200d8d5ce984a1dbf5a83a321" }, "downloads": -1, "filename": "scraptor-0.5.0-py2-none-any.whl", "has_sig": false, "md5_digest": "179085b8a34d2705031505e1c4519b7e", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 11658, "upload_time": "2016-02-18T06:19:58", "url": "https://files.pythonhosted.org/packages/01/62/724795ac34598463ae41dfe0cafead1759ce70cc27b4f8f07dd6f2d63ddf/scraptor-0.5.0-py2-none-any.whl" } ] }