{ "info": { "author": "Learning Equality", "author_email": "dev@learningequality.org", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5" ], "description": "\n# WebMixer\n\nA library for scraping urls\n\n## The Basic Scraper\n\nAll `webmixer.scrapers.pages` and `webmixer.scrapers.tags` classes inherit from `webmixer.base.BasicScraper`, which means they all have the following attributes and functions:\n\n### Attributes\n* directory (str): Directory to write files to\n* color (str): Color for error messages (default: 'rgb(153, 97, 137)')\n* locale (str): Language to use when writing error messages (default: 'en')\n\t* Note: must be listed in `webmixer.messages.MESSAGES`\n* default_ext (str): Extension to default to for extracted files\n\n### Functions\n#### create_tag(tag)\nArgs:\n* tag (str): tag name to create (e.g. 'p')\n\nReturns a BeautifulSoup tag\nExample:\n```\nimage_tag = create_tag('img')\n```\n___\n\n#### get_filename(link, default_ext=None)\n\nArgs:\n* link (str): URL that has been scraped\n* default_ext (optional str): if the link doesn't have an extension, use this extension'\n\nReturns a filename (str) to use for extracted files\nExample:\n```\nvideo_filename = get_filename('', default_ext='.mp4')\n```\n___\n\n#### mark_tag_to_skip(tag)\nMark tag to skip during further scraping operations\nArgs:\n* tag (str): tag to mark\n\nExample:\n```\nProcess img tag here...\n\nmark_tag_to_skip(img)\n```\n___\n\n#### write_url(link, url=None, default_ext=None, filename=None, directory=None)\nDownloads a url and writes it to a zip\nArgs:\n* filepath (str): path to local file\n* directory (str): directory to write to zip\n* url (optional str): URL used for handling relative URLs\n* default_ext (optional str): if the link doesn't have an extension, use this extension\n* filename (optional str): name for file to write to zip\n* directory (optional str): directory to write file to zip\n\nReturns filepath within zip\nExample:\n```\nwrite_url('', url='https://domain.com/', default_ext='.mp4', filename='video', directory='media') # 'media/video.mp4'\n```\n___\n\n#### write_contents(filename, contents, directory=None)\nWrites contents to the zip with a given filename\nArgs:\n* filename (str): filename for contents\n* contents (bytes): contents to write to zip\n* directory (str): directory to write to zip\n\nReturns filepath within zip\nExample:\n```\nwrite_contents('myfile.pdf', , directory='docs') # docs/myfile.pdf\n```\n___\n\n#### write_file(filepath, directory=None)\nWrites a local file to the zip\nArgs:\n* filepath (str): path to local file\n* directory (str): directory to write to zip\n\nReturns filepath within zip\nExample:\n```\nwrite_file('path/to/myfile.mp3', directory='music') # music/myfile.mp3\n```\n___\n\n\n#### create_broken_link_message(link)\nGenerates a tag with broken link message\nArgs:\n* link (str): link to copy/paste\n\nReturns a div tag with a link to copy/paste into browser\nExample:\n```\niframe.replaceWith(create_broken_link_message(''))\n# iframe ->
copy link...
\n```\n___\n\n#### create_copy_link_message(link, partially_scrapable=False)\nGenerates a tag with 'copy link into browser' message\nArgs:\n* link (str): link to copy/paste\n* partially_scrapable (bool): link was mostly scraped, but doesn't include everything from original site\n\nReturns a div tag with a link to copy/paste into browser\nExample:\n```\niframe.replaceWith(create_copy_link_message(''))\n# iframe ->
copy link...
\n```\n___\n\n### Exceptions\n`webmixer.exceptions` can be useful for handling errors from a variety of sources. If you are scraping a more specialized source, there may be some exceptions that are exclusive to that source. You can then raise the following exceptions to correctly manage that source:\n\n#### BrokenSourceException\nUsed when the link is completely broken (e.g. site no longer exists)\n\n#### UnscrapableSourceException\nUsed when the link is working, but cannot be supported on Kolibri (e.g. Flash content)\n\nFor instance, the `webmixer.scrapers.pages.gdrive.GoogleDriveScraper` may throw a `FileNotDownloadableError` error. In order to handle this correctly, it will raise an `UnscrapableSourceException`\n```\ntry:\n\t...\nexcept FileNotDownloadableError as e:\n\traise UnscrapableSourceException(e)\n```\n\n\n## Page Scrapers\nThere are several page scrapers that are available for use in scraping html pages. These will download urls to their respective file types\n\n### Built-in Scrapers\nHere is a list of the basic scraper classes, which are also listed under `webmixer.scrapers.pages.base.COMMON_SCRAPERS`:\n* WebVideoScraper\n* PDFScraper\n* EPubScraper\n* ImageScraper\n* FlashScraper\n* VideoScraper\n* AudioScraper\n\n\n### Using Page Scrapers\nWhen you create a scraper object, you may specify the following:\n* url (str): URL that tag can be found at (used to handle relative URLs) __required__\n* zipper (optional `ricecooker.utils.html_writer`): Zip to write to\n* triaged (optional [str]): List of already parsed URLs\n\nTo scrape the page, you may use any of the following writing options:\n\n__to_zip__: Writes a file to self.zipper, which is useful when scraping embedded sources from an html page\nArgs:\n* filename (optional str): name of file to write to\nReturns path to file from within zip\n\nHere are the default extensions for each `webmixer.scrapers.pages.base.Scraper`:\n| Scraper | Extension |\n|--|--|\n| HTMLPageScraper |.html |\n|PDFScraper|.pdf|\n| EPubScraper | .epub |\n| AudioScraper | .mp3 |\n| VideoScraper | .mp4 |\n| WebVideoScraper | .mp4 |\n|ImageScraper| .png |\n|FlashScraper| _error_ |\n\nFor example:\n```\nfrom webmixer.scrapers.base import ImageScraper\nimage= \nimage['src'] = ImageScraper('').to_zip() # Sets 'src' to zipped image filepath\n```\n___\n\n__to_tag__: Writes file to zip and generates a tag based on what kind of scraper it is. This is useful when you are replacing iframes with native html elements\nArgs:\n* filename (optional str): name of file to write to\nReturns tag\n\nHere are the return tag types for each `webmixer.scrapers.pages.base.Scraper`:\n| Scraper | Tag |\n|--|--|\n| HTMLPageScraper | None|\n|PDFScraper|\\|\n| EPubScraper | None |\n| AudioScraper | \\ |\n| VideoScraper | \\ |\n| WebVideoScraper | \\ |\n|ImageScraper|\\|\n|FlashScraper|_error_|\n\nFor example:\n```\nfrom webmixer.scrapers.base import PDFScraper\niframe= \niframe.replaceWith(PDFScraper('').to_tag()) # Replaces iframe with tag\n```\n\n___\n\n__to_file__: Writes to a file. This is useful for downloading URLs as files to your local machine.\nArgs:\n* filename (optional str): name of file to write to\n* directory (optional str): directory to write to\n* overwrite (bool): overwrite file if it exists\nReturns a filepath to the downloaded file\n\n`to_file` uses the `download_file` method to write the file to a `write_to_path`\n\nHere are the return file types for each `webmixer.scrapers.pages.base.Scraper`:\n| Scraper | Extension |\n|--|--|\n| HTMLPageScraper | .zip - generated by `ricecooker.utils.html_writer` |\n|PDFScraper|.pdf |\n| EPubScraper | .epub |\n| AudioScraper | .mp3 |\n| VideoScraper | .mp4 |\n| WebVideoScraper | .mp4 |\n|ImageScraper|_error - content kind not supported_ |\n|FlashScraper|_error_ |\n\nFor example:\n```\nfrom webmixer.scrapers.base import HTMLPageScraper\nnew_html_zip_path = HTMLPageScraper('').to_file() # Returns newly scraped html .zip file\n```\n\n### Custom Scrapers\nGiven how diverse the internet is, you may need to implement your own scraper to handle individual sources. You __must__ implement a `test` classmethod in order to use your scraper.\n\n__If you would like to share a custom scraper, please feel free to open a pull request with a new file under `webmixer.scrapers.pages`__\n\n#### Attributes\nAll scrapers have the following attributes:\n* dl_directory (str): Directory to write `to_file` downloaded file to (default: 'downloads')\n* directory (str): Directory to write files to\n* color (str): Color for error messages (default: 'rgb(153, 97, 137)')\n* locale (str): Language to use when writing error messages (default: 'en')\n\t* Note: must be listed in `webmixer.messages.MESSAGES`\n* default_ext (str): Extension to default to for extracted files\n* kind (`le_utils.constants.content_kind`): Content kind to write to\n\n`webmixer.scrapers.pages.base.HTMLPageScraper` has these additional attributes:\n* partially_scrapable (bool): Not all content can be viewed from within Kolibri (default: False)\n* scrape_subpages (bool): Determines whether to scrape any subpages within this page (default: True)\n* main_area_selector (optional tuple): Main element selector to replace everything in body tag\n* omit_list (optional list): list of selectors to remove from page contents (e.g. [('a', {'class': 'link'})])\n* loadjs (bool): Determines whether to load js when loading the page (default: True)\n* scrapers ([`webmixer.scrapers.pages.Scraper`]): List of additional scrapers to use on this page\n* extra_tags ([`webmixer.scrapers.tags.Tag`]): List of additional tags to scrape\n\nFor example, the following code will remove links, scrape Wikipedia pages, and sets all images to 'myimg.png':\n```\nfrom webmixer.scrpaers.tags import ImageTag\nfrom webmixer.scrapers.pages.base import HTMLPageScraper\nfrom webmixer.scrapers.pages.wikipedia import WikipediaScraper\n\nclass MyCustomTag(ImageTag):\n\tdef process(self):\n\t\tself.tag['src'] = self.write_file('myimg.png')\n\nclass MyCustomScraper(HTMLPageScraper):\n\tomit_list = [('a',)] \t\t # Remove links\n\textra_tags = [MyCustomTag] # Use MyCustomTag to set images to 'myimg.png'\n\tscrapers = [WikipediaScraper] # Scrape any Wikipedia pages\n\n\t@classmethod # Required test classmethod\n\tdef test(self, url):\n\t\treturn 'my-domain.com' in url\n```\n\n#### Functions\n__@classmethod test(url)__: Required method to determine if this is the correct scraper for this URL\nArgs:\n * url (str): url to test\nReturns True if scraper is meant to scrape URL\nExample:\n```\n@classmethod\ndef test(self, url):\n\treturn 'somedomain' in url\n```\n\n___\n\n__preprocess(contents)__: Process contents before main scraping method\nArgs:\n\tcontents (BeautifulSoup): contents to preprocess\nExample:\n```\n# Delete the first image on the page before scraping all the images\ndef preprocess(self, contents):\n\tcontents.find('img').decompose()\n```\n___\n\n__postprocess(contents)__: Process contents after main scraping method\nArgs:\n\tcontents (BeautifulSoup): contents to postprocess\nExample:\n```\n# Append a link at the end of the tag\ndef postprocess(self, contents):\n\tlink = self.create_tag('a')\n\tlink.string = 'New Link'\n\tcontents.body.append(link)\n```\n\n\n## Tags\nThere are several tags that are available for use in scraping html pages. These will handle downloading any referenced files.\n\n### Using Tags\nTo create a tag, you may specify the following:\n* tag (BeautifulSoup.tag): tag to parse __required__\n* url (str): url that tag can be found at (used to handle relative URLs) __required__\n* attribute (optional str): attribute to find link at (e.g. 'src' or 'data-src')\n* scrape_subpages (optional bool): parse linked pages referenced by this tag (default: True)\n* extra_scrapers (optional [`webmixer.scrapers.base.BasicScrapers`]): list of scrapers to try to scrape linked pages\n* color (optional str): color for injected error messages (default: 'rgb(153, 97, 137)')\n\nTo scrape the tag, use the `scrape` method. This will process the tag so that it can be usable from within an html zip. Here is a simple scraping example:\n```\nfrom webmixer.scrapers.tags import ImageTag\nimage_tag = \nimage_scraper = ImageTag(image_tag, '')\nimage_scraper.scrape() # image_tag['src'] will point to downloaded image file in zip\n```\n\n\n\n### Built-in Tags\nHere is a list of the available tags, which are also listed under `webmixer.scrapers.tags.COMMON_TAGS`\n\n* ImageTag (img)\n* AudioTag (audio)\n* VideoTag (video)\n* EmbedTag (embed)\n* LinkTag (a) _Scrapes linked pages referenced by 'href' attribute_\n* IframeTag (iframe) _Scrapes embedded pages referenced byon 'src' attribute_\n* StyleTag (style) _Scrapes sheets referenced by 'href' attribute_\n* ScriptTag (script) _Scrapes scripts referenced by 'src' attribute_\n\n\n### Custom Tags\nDepending on the source you are trying to scrape, you may need more specific methods for scraping a page. To create a custom tag, you will need to subclass `webmixer.scrapers.tags.BasicScraperTag`\n\n#### Attributes\nAll tags have the following attributes:\n* selector (tuple): BeautifulSoup selector to find tag (e.g. ('a', {'class': 'link'}))\n* default_ext (str): Extension to use if link doesn't have an extension\n* directory (str): Directory to write tag files to\n* attributes (dict): Any attributes to assign to a tag\n* default_attribute (str): Attribute that references files (default: 'src')\n* scrape_subpages (bool): Determines whether to scrape any linked pages (default: True)\n* extra_scrapers ([`webmixer.scrapers.base.BasicScrapers`]): List of additional scrapers to use for scraping linked pages\n* color (str): Color for error messages (default: 'rgb(153, 97, 137)')\n* locale (str): Language to use when writing error messages (default: 'en')\n\t* Note: must be listed in `webmixer.messages.MESSAGES`\n\nExample:\n```\nfrom webmixer.scrapers.tags import BasicScraperTag\n\nclass MyVideoTag(BasicScraperTag):\n\tselector = ('video', {'class': 'video-class'}) # Select video.video-class\n\tdirectory = 'media'\t\t\t\t\t\t\t\t# Files will be written to media folder\n\tattributes = {\t\t\t\t\t\t\t\t\t# Videos will have width 100%\n\t\t'width': '100%'\n\t}\n```\n\n\n#### Built-in functions\nFor more custom scraping logic, you may also override the following methods:\n\n__process()__: Makes the tag usable from within an html zip by downloading any referenced files\nExample:\n```\nclass MyVideoTag(BasicScraperTag):\n\tdef process(self):\n\t\t# Scrape all of the tags\n\t for source in self.tag.find_all('source'):\n\t BasicScraperTag(source, self.zipper, self.url).scrape()\n```\n___\n\n__handle_error()__: Determines how to handle cases where the link is broken\nExample:\n```\nclass MyVideoTag(BasicScraperTag):\n\tdef handle_error(self):\n\t\tself.tag.decompose() # Just remove the element if it doesn't work\n```\n___\n\n__handle_unscrapable()__: Determines how to handle cases where the link is not scrapable\nExample:\n```\nclass MyVideoTag(BasicScraperTag):\n\tdef handle_unscrapable(self):\n\t\tself.tag.replaceWith(self.create_copy_link_message(self.link))\n```\n\n\n## Helper Functions\n### webmixer.utils.guess_scraper\nIf you would like to determine which scraper to use based on a URL, you can use the `webmixer.utils.guess_scraper` method. This will accept the following arguments:\n* url (str): URL to scrape\n* scrapers ([`webmixer.scrapers.base.BasicScrapers`]): list of other scrapers to test URL against\n* allow_defualt (optional bool): use generic default scraper in case nothing matches (default: False)\n\n_You can also pass in additional arguments to scrapers with `kwargs`_\n\nSo a simple usage of `guess_scraper` might be:\n```\nfrom webmixer.utils import guess_scraper\nscraper = guess_scraper('', scrapers=[MyCustomScraper])\n```\n\n\n=======\nHistory\n=======\n\n0.0.0 (2019-07-30)\n------------------\n* First release on PyPI.\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/learningequality/webmixer/releases", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/learningequality/webmixer", "keywords": "scrapers webmixer web-mixer mixer scraper", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "webmixer", "package_url": "https://pypi.org/project/webmixer/", "platform": "", "project_url": "https://pypi.org/project/webmixer/", "project_urls": { "Download": "https://github.com/learningequality/webmixer/releases", "Homepage": "https://github.com/learningequality/webmixer" }, "release_url": "https://pypi.org/project/webmixer/0.0.0/", "requires_dist": [ "requests (>=2.11.1)", "beautifulsoup4 (>=4.6.3)", "ricecooker (>=0.6.31)", "le-utils (>=0.1.19)", "cssutils (>=1.0.2)", "google-api-python-client (>=1.7.9)", "google-auth-oauthlib (>=0.4.0)", "oauthlib (>=3.0.1)", "requests-oauthlib (>=1.2.0)", "youtube-dl (>=2019.7.2)" ], "requires_python": "", "summary": "Library for scraping urls and downloading them as files", "version": "0.0.0" }, "last_serial": 5614649, "releases": { "0.0.0": [ { "comment_text": "", "digests": { "md5": "de66b6a9b04b07f5b2cddfea87993fb0", "sha256": "8bc4f1c57ed4064fce1a4ee4287d4d668720067a8220e31a779761d67be73023" }, "downloads": -1, "filename": "webmixer-0.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "de66b6a9b04b07f5b2cddfea87993fb0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 30365, "upload_time": "2019-07-31T00:34:56", "url": "https://files.pythonhosted.org/packages/36/73/419c91c2bf50fd39a0b87ea0a7a44d4491b0ac6a67b6a27b0f4710eb5861/webmixer-0.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "45a9e4125c6796686b315225be197f85", "sha256": "49e0fb24c93ea74d1e221b491d66d042b6e785fbb5036fc4c493ac213e5f467a" }, "downloads": -1, "filename": "webmixer-0.0.0.tar.gz", "has_sig": false, "md5_digest": "45a9e4125c6796686b315225be197f85", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27917, "upload_time": "2019-07-31T00:34:59", "url": "https://files.pythonhosted.org/packages/48/bb/dee7e7ca81cd28ac4f57e1c54a987281180f0748b42d8b3ed9d524d83f09/webmixer-0.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "de66b6a9b04b07f5b2cddfea87993fb0", "sha256": "8bc4f1c57ed4064fce1a4ee4287d4d668720067a8220e31a779761d67be73023" }, "downloads": -1, "filename": "webmixer-0.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "de66b6a9b04b07f5b2cddfea87993fb0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 30365, "upload_time": "2019-07-31T00:34:56", "url": "https://files.pythonhosted.org/packages/36/73/419c91c2bf50fd39a0b87ea0a7a44d4491b0ac6a67b6a27b0f4710eb5861/webmixer-0.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "45a9e4125c6796686b315225be197f85", "sha256": "49e0fb24c93ea74d1e221b491d66d042b6e785fbb5036fc4c493ac213e5f467a" }, "downloads": -1, "filename": "webmixer-0.0.0.tar.gz", "has_sig": false, "md5_digest": "45a9e4125c6796686b315225be197f85", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27917, "upload_time": "2019-07-31T00:34:59", "url": "https://files.pythonhosted.org/packages/48/bb/dee7e7ca81cd28ac4f57e1c54a987281180f0748b42d8b3ed9d524d83f09/webmixer-0.0.0.tar.gz" } ] }