{ "info": { "author": "Bob Jordan", "author_email": "bmjjr@bomquote.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy" ], "description": "\n\n.. image:: https://raw.githubusercontent.com/bmjjr/transistor/master/img/transistor_logo.png?token=AAgJc9an2d8HwNRHty-6vMZ94VfUGGSIks5b8VHbwA%3D%3D\n\n**Web data collection and storage for intelligent use cases.**\n\n.. image:: https://img.shields.io/badge/Python-3.6%20%7C%203.7-blue.svg\n :target: https://github.com/bomquote/transistor\n.. image:: https://img.shields.io/badge/pypi%20package-0.2.2-blue.svg\n :target: https://pypi.org/project/transistor/0.2.2/\n.. image:: https://img.shields.io/badge/Status-Beta-blue.svg\n :target: https://github.com/bomquote/transistor\n.. image:: https://img.shields.io/badge/license-MIT-lightgrey.svg\n :target: https://github.com/bomquote/transistor/blob/master/LICENSE\n.. image:: https://ci.appveyor.com/api/projects/status/xfg2yedwyrbyxysy/branch/master?svg=true\n :target: https://ci.appveyor.com/project/bomquote/transistor\n.. image:: https://pyup.io/repos/github/bomquote/transistor/shield.svg?t=1542037265283\n :target: https://pyup.io/account/repos/github/bomquote/transistor/\n :alt: Updates\n.. image:: https://api.codeclimate.com/v1/badges/0c34950c38db4f38aea6/maintainability\n :target: https://codeclimate.com/github/bomquote/transistor/maintainability\n :alt: Maintainability\n.. image:: https://codecov.io/gh/bomquote/transistor/branch/master/graph/badge.svg\n :target: https://codecov.io/gh/bomquote/transistor\n\n\n=============\n*transistor*\n=============\n\nAbout\n-----\n\nThe web is full of data. 
Transistor is a web scraping framework for collecting, storing, and using targeted data from structured web pages.\n\nTransistor's current strengths are in being able to:\n - provide an interface to use the `Splash `_ headless browser / javascript rendering service.\n - include *optional* support for using the scrapinghub.com `Crawlera `_ 'smart' proxy service.\n - ingest keyword search data from a spreadsheet and automatically transform keywords into a queue of tasks.\n - scale one worker into an arbitrary number of workers combined into a ``WorkGroup``.\n - coordinate an arbitrary number of ``WorkGroup`` objects searching an arbitrary number of websites, into one scrape job with a ``WorkGroupManager``.\n - send out all the ``WorkGroups`` concurrently, using gevent based asynchronous I/O.\n - return data from each website for each search term 'task' in our list, for easy website-to-website comparison.\n\nSuitable use cases include:\n - comparing attributes like stock status and price, for a list of ``book titles`` or ``part numbers``, across multiple websites.\n\nDevelopment of Transistor is sponsored by `BOM Quote Manufacturing `_.\n\n**Primary goals**:\n\n1. Enable scraping targeted data from a wide range of websites, including sites rendered with Javascript.\n2. Navigate websites which present logins, custom forms, and other blockers to data collection, like captchas.\n3. Provide asynchronous I/O for task execution, using `gevent `_.\n4. Easily integrate within a web app like `Flask `_, `Django `_, or other python based `web frameworks `_.\n5. Provide spreadsheet based data ingest and export options, like importing a list of search terms from excel, ods, or csv, and exporting data to each as well.\n6. Utilize quick and easy integrated task work queues which can be automatically filled with data search terms by a simple spreadsheet import.\n7. 
Integrate with more robust task queues like `Celery `_ while using `rabbitmq `_ or `redis `_ as a message broker as desired.\n8. Provide hooks for users to persist data via any method they choose, while also supporting our own opinionated choice, which is a `PostgreSQL `_ database along with `newt.db `_.\n9. Contain useful abstractions, classes, and interfaces for scraping and crawling with machine learning assistance (wip, timeline tbd).\n10. Further support data science use cases of the persisted data, where convenient and useful for us to provide in this library (wip, timeline tbd).\n11. Provide a command line interface (low priority wip, timeline tbd).\n\nQuickstart\n----------\n\nFirst, install ``Transistor`` from pypi:\n\n.. code-block:: rest\n\n pip install transistor\n\nIf you have previously installed ``Transistor``, please ensure you are using the latest version:\n\n.. code-block:: rest\n\n pip install --upgrade transistor\n\nNext, setup Splash, following the Quickstart instructions. Finally, follow the minimal abbreviated Quickstart example ``books_to_scrape`` as detailed below.\n\nThis example is explained in more detail in the source code found in the ``examples/books_to_scrape`` folder, including fully implementing object persistence with ``newt.db``.\n\nQuickstart: Setup Splash\n-------------------------\nSuccessful scraping is now a complex affair. Most websites with useful data will rate limit, inspect headers, present captchas, and use javascript that must be rendered to get the data you want.\n\nThis rules out using simple python requests scripts for most serious use. 
So, setup becomes much more complicated.\n\nTo deal with this, we are going to use `Splash `_,\n\"A Lightweight, scriptable browser as a service with an HTTP API\".\n\nTransistor also supports the **optional** use of a *smart proxy service* from `scrapinghub `_ called `Crawlera `_.\nThe Crawlera smart proxy service helps us:\n\n- avoid getting our own server IP banned\n- enable regional browsing, which is important to us, because data can differ per region on the websites we want to scrape, and we are interested in those differences\n\nThe minimum monthly cost for the smallest size Crawlera `C10` plan is $25 USD/month. This level is useful but can easily be overly restrictive. The next level up is $100/month.\n\nThe easiest way to get setup with Splash is to use `Aquarium `_ and that is what we are going to do. Using Aquarium requires Docker and Docker Compose.\n\n**Windows Setup**\n\nOn Windows, the easiest way to get started with Docker is to use `Chocolatey `_ to install docker-for-windows. Using Chocolatey requires\n`installing Chocolatey `_.\n\nThen, to install Docker-for-windows with Chocolatey:\n\n.. code-block:: rest\n\n C:\\> choco install docker-for-windows\n\nYou will likely need to restart your Windows box after installing docker-for-windows, even if it doesn't tell you to do so.\n\n**All Platforms**\n\nInstall Docker for your platform. For Aquarium, follow the `installation instructions `_.\n\nAfter setting up Splash with Aquarium, ensure you set the following environment variables:\n\n.. code-block:: python\n\n SPLASH_USERNAME = ''\n SPLASH_PASSWORD = ''\n\nFinally, to run the Splash service, cd to the Aquarium repo on your hard drive, and then run ``docker-compose up`` in your command prompt.\n\n**Troubleshooting Aquarium and Splash service**:\n\n1. Ensure you are in the ``aquarium`` folder when you run the ``docker-compose up`` command.\n2. You may have initial problems if you did not share your hard drive with Docker.\n3. 
Share your hard drive with docker (google is your friend to figure out how to do this).\n4. Try to run the ``docker-compose up`` command again.\n5. Note, upon computer/server restart, you need to ensure the Splash service is started, either daemonized or with ``docker-compose up``.\n\nAt this point, you should have a Splash service running in your command prompt.\n\n**Crawlera**\n\nUsing Crawlera is optional and not required for this ``books_to_scrape`` quickstart.\n\nBut if you want to use Crawlera with Transistor, first register for the service and buy a subscription at `scrapinghub.com `_.\n\nAfter registering for Crawlera, create accounts in scrapinghub.com for each region you would like to present a proxied ip address from. For our case, we are set up to handle three regions: ALL (global), China, and USA.\n\nFinally, you should set environment variables on your computer/server with the API key for each region you need, like below:\n\n.. code-block:: python\n\n CRAWLERA_ALL = ''\n CRAWLERA_CN = ''\n CRAWLERA_USA = ''\n\nQuickstart: ``books_to_scrape`` example\n---------------------------------------\n\nSee ``examples/books_to_scrape`` for a fully working example with more detailed notes in the source code. We'll go through an abbreviated setup here, without many of the longer notes and database/persistence parts that you can find in the ``examples`` folder source code.\n\nIn this abbreviated example, we will create a ``Spider`` to crawl the books.toscrape.com website to search for 20 different book titles, which are ingested from an excel spreadsheet. After we find the book titles, we will export the targeted data to a separate csv file.\n\nThe ``books_to_scrape`` example assumes we have a column of 20 book titles in an excel file, with a column heading in the spreadsheet named *item*. We plan to scrape the domain ``books.toscrape.com`` to find the book titles. 
For the book titles we find, we will scrape the sale price and stock status.\n\nFirst, let's setup a custom scraper Spider by subclassing ``SplashScraper``. This will enable it to use the Splash headless browser.\n\nNext, create a few custom methods to parse the html found by the ``SplashScraper`` and saved in the ``self.page`` attribute, with beautifulsoup4.\n\n.. code-block:: python\n\n from transistor.scrapers import SplashScraper\n\n class BooksToScrapeScraper(SplashScraper):\n \"\"\"\n Given a book title, scrape books.toscrape.com/index.html\n for the book cost and stock status.\n \"\"\"\n\n def __init__(self, book_title: str, script=None, **kwargs):\n \"\"\"\n Create the instance with a few custom attributes and\n set the baseurl\n \"\"\"\n super().__init__(script=script, **kwargs)\n self.baseurl = 'http://books.toscrape.com/'\n self.book_title = book_title\n self.price = None\n self.stock = None\n\n def start_http_session(self, url=None, timeout=(3.05, 10.05)):\n \"\"\"\n Starts the scrape session. Normally, you can just call\n super().start_http_session(). In this case, we also want to start out\n with a call to self._find_title() to kickoff the crawl.\n \"\"\"\n super().start_http_session(url=url, timeout=timeout)\n return self._find_title()\n\n # now, define your custom books.toscrape.com scraper logic below\n\n def _find_title(self):\n \"\"\"\n Search for the book title in the current page. 
If it isn't found, crawl\n to the next page.\n \"\"\"\n if self.page:\n title = self.page.find(\"a\", title=self.book_title)\n if title:\n return self._find_price_and_stock(title)\n else:\n return self._crawl()\n return None\n\n def _next_page(self):\n \"\"\"\n Find the url to the next page from the pagination link.\n \"\"\"\n if self.page:\n # guard against the last page, which has no 'next' pagination link\n next_li = self.page.find('li', class_='next')\n next_page = next_li.find('a') if next_li else None\n if next_page:\n if next_page['href'].startswith('catalogue'):\n return self.baseurl + next_page['href']\n else:\n return self.baseurl + '/catalogue/' + next_page['href']\n return None\n\n def _crawl(self):\n \"\"\"\n Navigate to the next page url using the SplashScraper.open() method and\n then call _find_title again, to see if we found our tasked title.\n \"\"\"\n next_url = self._next_page()\n if next_url:\n self.open(url=next_url)\n return self._find_title()\n return print('Crawled all pages. Title not found.')\n\n def _find_price_and_stock(self, title):\n \"\"\"\n The tasked title has been found and so now find the price and stock and\n assign them to class attributes self.price and self.stock for now.\n \"\"\"\n price_div = title.find_parent(\n \"h3\").find_next_sibling(\n 'div', class_='product_price')\n\n self.price = price_div.find('p', class_='price_color').text\n self.stock = price_div.find('p', class_='instock availability').text.translate(\n {ord(c): None for c in '\\n\\t\\r'}).strip()\n print('Found the Title, Price, and Stock.')\n\nNext, we need to setup two more subclasses from the base classes ``SplashScraperItems`` and ``ItemLoader``. This will allow us to export the data from the ``SplashScraper`` spider to the csv spreadsheet.\n\nSpecifically, we are interested in exporting the ``book_title``, ``stock`` and ``price`` attributes. See more detail in the ``examples/books_to_scrape/persistence/serialization.py`` file.\n\n.. 
code-block:: python\n\n from transistor.persistence.item import Field\n from transistor.persistence import SplashScraperItems\n from transistor.persistence.loader import ItemLoader\n\n\n class BookItems(SplashScraperItems):\n # -- names of your customized scraper class attributes go here -- #\n\n book_title = Field() # the book_title which we searched\n price = Field() # the self.price attribute\n stock = Field() # the self.stock attribute\n\n\n def serialize_price(value):\n \"\"\"\n A simple serializer used in BookItemsLoader to ensure 'UK' is\n prefixed on the `price` Field, for the data returned in the scrape.\n :param value: the scraped value for the `price` Field\n \"\"\"\n if value:\n return f\"UK {str(value)}\"\n\n class BookItemsLoader(ItemLoader):\n def write(self):\n \"\"\"\n Write your scraper's exported custom data attributes to the\n BookItems class. Call super() to also capture attributes\n built-in from the Base ItemLoader class.\n\n Last, ensure you assign the attributes from `self.items` to\n `self.spider.` and finally you must return\n self.items in this method.\n \"\"\"\n\n # now, define your custom items\n self.items['book_title'] = self.spider.book_title\n self.items['stock'] = self.spider.stock\n # set the value with self.serialize_field(field, name, value) as needed,\n # for example, `serialize_price` below turns '\u00a350.10' into 'UK \u00a350.10'\n # the '\u00a350.10' is the original scraped value from the website stored in\n # self.spider.price, but we think it is more clear as 'UK \u00a350.10'\n self.items['price'] = self.serialize_field(\n field=Field(serializer=serialize_price),\n name='price',\n value=self.spider.price)\n\n # call super() to write the built-in SplashScraper Items from ItemLoader\n super().write()\n\n return self.items\n\nFinally, to run the scrape, we will need to create a main.py file. This is all we need for the minimal example to scrape and export targeted data to csv.\n\nSo, at this point, we've:\n\n1. 
Setup a custom scraper ``BooksToScrapeScraper`` by subclassing ``SplashScraper``.\n2. Setup ``BookItems`` by subclassing ``SplashScraperItems``.\n3. Setup ``BookItemsLoader`` by subclassing ``ItemLoader``.\n4. Wrote a simple ``serializer`` with the ``serialize_price`` function, which prefixes 'UK' to the returned `price` attribute data.\n\nNext, we are ready to setup a ``main.py`` file as the final entry point to run our first scrape and export the data to a csv file.\n\nThe first thing we need to do is perform some imports.\n\n.. code-block:: python\n\n # -*- coding: utf-8 -*-\n # in main.py, monkey patching for gevent must be done first\n from gevent import monkey\n monkey.patch_all()\n\n from transistor import StatefulBook, WorkGroup, BaseWorkGroupManager\n from transistor.persistence.exporters import CsvItemExporter\n from examples.books_to_scrape.scraper import BooksToScrapeScraper\n from examples.books_to_scrape.persistence.serialization import BookItems, BookItemsLoader\n\n\nSecond, setup a ``StatefulBook`` which will read the ``book_titles.xlsx`` file and transform the book titles from the spreadsheet \"titles\" column into task queues for our ``WorkGroups``.\n\n.. code-block:: python\n\n filepath = 'your/path/to/book_titles.xlsx'\n trackers = ['books.toscrape.com']\n tasks = StatefulBook(filepath, trackers, keywords=\"titles\")\n\nThird, setup a list of exporters which can then be passed to whichever ``WorkGroup`` objects you want to use them with. In this case, we are just going to use the built-in ``CsvItemExporter``, but we could also use additional exporters to do multiple exports at the same time, if desired.\n\n.. code-block:: python\n\n exporters=[\n CsvItemExporter(\n fields_to_export=['book_title', 'stock', 'price'],\n file=open('c:/book_data.csv', 'a+b'))\n ]\n\nFourth, setup the ``WorkGroup`` in a list we'll call *groups*. We use a list here because you can setup as many ``WorkGroup`` objects with unique target websites, and as many individual workers, as you need:\n\n.. 
code-block:: python\n\n groups = [\n WorkGroup(\n name='books.toscrape.com',\n url='http://books.toscrape.com/',\n spider=BooksToScrapeScraper,\n items=BookItems,\n loader=BookItemsLoader,\n exporters=exporters,\n workers=20, # this creates 20 Spiders and assigns each a book as a task\n kwargs={'timeout': (3.0, 20.0)})\n ]\n\nFifth, setup the ``WorkGroupManager`` and prepare the file to call the ``manager.main()`` method to start the scrape job:\n\n.. code-block:: python\n\n # If you want to execute all the scrapers at the same time, ensure the pool is\n # marginally larger than the sum of the total number of workers assigned in the\n # list of WorkGroup objects. However, sometimes you may want to constrain your pool\n # to a specific number less than your scrapers. That's also OK. This is useful\n # when, for example, Crawlera's C10 plan only allows 10 concurrent workers. Set pool=10.\n manager = BaseWorkGroupManager(job_id='books_scrape', book=tasks, groups=groups, pool=25)\n\n if __name__ == \"__main__\":\n manager.main() # call manager.main() to start the job.\n\nFinally, run ``python main.py`` and then **profit**. After a brief Spider runtime to crawl the books.toscrape.com website and write the data, you should have a newly exported csv file in the filepath you setup, 'c:/book_data.csv' in our example above.\n\nTo summarize what we did in ``main.py``:\n\nWe setup a ``BaseWorkGroupManager`` and wrapped our spider ``BooksToScrapeScraper`` inside a list of ``WorkGroup`` objects called *groups*. 
Then we passed the *groups* list to the ``BaseWorkGroupManager``.\n\n- Passing a list of ``WorkGroup`` objects allows the ``WorkGroupManager`` to run multiple jobs targeting different websites, concurrently.\n- In this simple example, we are only scraping ``books.toscrape.com``, but if we wanted to also scrape ``books.toscrape.com.cn``, then we'd setup two ``BaseGroup`` objects and wrap them each in their own ``WorkGroup``, one for each domain.\n\n\nNOTE-1: A more robust use case will also subclass the ``BaseWorker`` class, because it provides several methods as hooks for data persistence and post-scrape manipulation.\nOne may also consider subclassing the ``WorkGroupManager`` class and overriding its ``monitor`` method. This is another hook point to have access to the ``BaseWorker`` object before it shuts down for good.\n\nRefer to the full example in the ``examples/books_to_scrape/workgroup.py`` file for an example of customizing ``BaseWorker`` and ``WorkGroupManager`` methods. In the example, we show how to save data to postgresql with newt.db, but you can use whichever db you choose.\n\nNOTE-2: If you do try to follow the more detailed example in ``examples/books_to_scrape``, including data persistence with postgresql and newt.db, you may need to set the environment variable:\n\n.. 
code-block:: python\n\n TRANSISTOR_DEBUG = 1\n\nWhether or not you actually need to set this ``TRANSISTOR_DEBUG`` environment variable will depend on how you setup your settings.py and newt_db.py files.\nIf you copy the files verbatim as shown in the ``examples/books_to_scrape`` folder, then you will need to set it.\n\nDirectly Using A SplashScraper\n--------------------------------\n\nPerhaps you just want to do a quick one-off scrape?\n\nIt is possible to just use your custom scraper subclassed from ``SplashScraper`` directly, without going through all the work to setup a ``StatefulBook``, ``BaseWorker``, ``BaseGroup``, ``WorkGroup``, and ``WorkGroupManager``.\n\nJust fire it up in a python repl like below and ensure the ``start_http_session`` method is run, which can generally be done by setting ``autorun=True``.\n\n.. code-block:: python\n\n >>> from my_custom_scrapers.component.mousekey import MouseKeyScraper\n >>> ms = MouseKeyScraper(part_number='C1210C106K4RACTU', autorun=True)\n\nAfter the scrape completes, various methods and attributes from ``SplashScraper`` and ``SplashBrowser`` are available, plus the custom attributes and methods from your own subclassed scraper:\n\n.. code-block:: python\n\n >>> print(ms.stock())\n '4,000'\n >>> print(ms.pricing())\n '{\"1\" : \"USD $0.379\", \"10\" : \"USD $0.349\"}'\n\n\nArchitecture Summary\n--------------------\n\nTransistor provides useful layers and objects in the following categories:\n\n**Layers & Services**\n\n1. **javascript rendering service / headless browser layer**:\n\n- Transistor uses `Splash `_ implemented with the `Aquarium `_ cookiecutter docker template.\n- Splash provides a programmable headless browser to render javascript, and Aquarium provides robust concurrency with multiple Splash instances that are load balanced with `HAProxy `_.\n- Transistor provides integration with Splash through our ``SplashBrowser`` class found in ``transistor/browsers/splash_browser.py``.\n\n2. 
**smart proxy service**:\n\n- Transistor supports use of `Crawlera `_, which is a paid *smart proxy service* providing robust protection against getting our own ip banned while scraping sites that actively present challenges to web data collection.\n- Crawlera use is optional. It has a minimum monthly cost of $25 USD for the starter package, and the next level up is currently $100 USD/month.\n- when using Crawlera, the concurrency provided by gevent for asynchronous I/O, along with Splash running with Aquarium, is absolutely required, because a single request with Splash + Crawlera is quite slow, taking up to **15 minutes** or more to successfully return a result.\n\n**Spiders**\n\n1. **browsers**\n\n- see: ``transistor/browsers``\n- wrap the `python-requests `_ and `beautifulsoup4 `_ libraries to serve our various scraping browser needs.\n- browser API is generally created by subclassing and overriding the well known `mechanicalsoup `_ library to work with Splash and/or Splash + Crawlera.\n- if Javascript support is not needed for a simple scrape, it is nice to just use mechanicalsoup's ``StatefulBrowser`` class directly as a Scraper, as shown in ``examples/cny_exchange_rate.py``.\n- a ``Browser`` object is generally instantiated inside of a ``Scraper`` object, where it handles items like fetching the page, parsing headers, creating a ``self.page`` object to parse with beautifulsoup4, handling failures with automatic retries, and setting class attributes accessible to our ``Scraper`` object.\n\n2. **scrapers**\n\n- see ``transistor/scrapers``\n- instantiates a browser to grab the ``page`` object, implements various html filter methods on ``page`` to return the target data, and can use the Splash headless browser/javascript rendering service to navigate links, fill out forms, and submit data.\n- for a Splash or Splash + Crawlera based scraper ``Spider``, the ``SplashScraper`` base class provides a minimal required Lua script and all required connection logic. 
However, more complex use cases will require providing your own custom modified Lua script.\n- the scraper design is built around gevent based asynchronous I/O, and this design allows sending out an arbitrarily large number of scraper workers, with each scraper worker assigned a specific scrape task to complete.\n- the current core design, in allowing an arbitrarily large number of scraper workers to be sent out, is not necessarily an optimal design to 'crawl' pages in search of targeted data. Where it shines is when you need to use a webpage search function on an arbitrarily large list of search tasks, await the search results for each task, and finally return a scraped result for each task.\n\n3. **crawlers** (wip, on the to-do list)\n\n- see ``transistor/crawlers`` (not yet implemented)\n- this crawling ``Spider`` will be supported through a base class called ``SplashCrawler``.\n- while it is straightforward to use the current Transistor scraper ``SplashScraper`` design to do basic crawling (see ``examples/books_to_scrape/scraper.py`` for an example), the current way to do this with Transistor is not optimal for crawling. So we'll implement modified designs for crawling spiders.\n- specifics TBD, may be fully custom or else may reuse some good architecture parts of `scrapy `_, although if we do that, it will be done so we don't need a scrapy dependency, and further, it will use gevent for asynchronous I/O.\n\n\n**Program I/O**\n\n1. 
**schedulers**:\n\n*BOOKS*\n\n- see ``transistor/schedulers/books``\n- a ``StatefulBook`` object provides an interface to work with spreadsheet based data.\n- for example, a book facilitates importing a column of keyword search term data, like 'book titles' or 'electronic component part numbers', from a designated column in an .xlsx file.\n- after importing the keyword search terms, the book will transform each search term into a task contained in a ``TaskTracker`` object.\n- each ``TaskTracker`` will contain a queue of tasks to be assigned by the ``WorkGroupManager``, and will ultimately allow an arbitrarily large number of ``WorkGroups`` of ``BaseWorkers`` to execute the tasks, concurrently.\n\n*RabbitMQ & Redis*\n\n- see ``transistor/schedulers/brokers``\n- provides the ``ExchangeQueue`` class in ``transistor.schedulers.brokers.queues``, which can be passed to the ``tasks`` parameter of ``BaseWorkGroupManager``.\n- Just pass the appropriate connection string to ``ExchangeQueue`` and ``BaseWorkGroupManager`` and you can use either RabbitMQ or Redis as a message broker, thanks to `kombu `_.\n- in this case, the ``BaseWorkGroupManager`` also acts as an AMQP ``consumer`` which can receive messages from the RabbitMQ message broker.\n\n\n2. 
**workers**:\n\n- a ``BaseWorker`` object encapsulates a ``Spider`` object like the ``SplashScraper`` or ``SplashCrawler`` objects, which has been customized by the end user to navigate and extract the targeted data from a structured web page.\n- a ``BaseGroup`` object can then be created, to encapsulate the ``BaseWorker`` object which contains the ``Spider`` object.\n- The purpose of this ``BaseGroup`` object is to enable concurrency and scale by being able to spin up an arbitrarily large number of ``BaseWorker`` objects, each assigned a different scrape task for execution.\n- the ``BaseGroup`` object can then receive tasks to execute, like individual book titles or electronic component part numbers to search, delegated by a ``WorkGroupManager`` class.\n- each ``BaseWorker`` in the ``BaseGroup`` also processes web request results, as they are returned from its wrapped ``SplashScraper`` object. ``BaseWorker`` methods include hooks for exporting data to multiple formats like csv/xml or saving it to the db of your choice.\n- each ``BaseGroup`` should be wrapped in a ``WorkGroup`` which is passed to the ``WorkGroupManager``. Objects which the ``BaseWorker`` will use to process the ``Spider`` after it returns from the scrape should also be specified in ``WorkGroup``, like ``Items``, ``ItemLoader``, and ``Exporter``.\n\n3. 
**managers**:\n\n- the overall purpose of the ``WorkGroupManager`` object is to provide yet more scale and concurrency through asynchronous I/O.\n- The ``WorkGroupManager`` can spin up an arbitrarily large number of ``WorkGroup`` objects while assigning each ``BaseWorker/Spider`` in each of the ``WorkGroup`` objects, individual scrape tasks.\n- This design approach is most useful when you have a finite pipeline of scrape tasks for which you want to search and compare the same terms across multiple different websites, with each website targeted by one ``WorkGroup``.\n- for example, we may have a list of 50 electronic component part numbers, which we want to search in ten different regional websites. The ``WorkGroupManager`` can spin up a ``WorkGroup`` for each of the ten websites, assign 50 workers to each ``WorkGroup``, and send out 500 ``BaseWorkers`` each with 1 task to fill, concurrently.\n- to further describe the ``WorkGroupManager``, it is a middle-layer between ``StatefulBook`` and ``BaseGroup``. It ingests ``TaskTracker`` objects from the ``StatefulBook`` object. It is also involved in switching states for ``TaskTracker`` objects, which is useful to track task states like completed, in progress, or failed (this last detail is a work-in-progress).\n\n**Persistence**\n\n1. **exporters**\n\n- see ``transistor/persistence/exporters``\n- export data from a ``Spider`` to various formats, including *csv*, *xml*, *json*, *pickle*, and *pretty print* to a *file* object.\n\n\n**Object Storage, Search, and Retrieval**\n\nTransistor can be used with whichever database or persistence model you choose to implement. But, it will offer some open-source code in support of the options below:\n\n1. **SQLAlchemy**\n\n- we use `SQL Alchemy `_ extensively and may include some contributed code as we find appropriate or useful to keep in the Transistor repository. At least, an example for reference will be included in the `examples` folder.\n\n\n2. 
**object-relational database** using `PostgreSQL `_ with `newt.db `_.\n\n- persist and store your custom python objects containing your web scraped data, directly in a PostgreSQL database, while also converting your python objects to JSON, *automatically* indexing them for super-quick searches, and making them available to be used from within your application or externally.\n- leverage PostgreSQL's strong JSON support as a document database while also enabling \"ease of working with your data as ordinary objects in memory\".\n- this is accomplished with `newt.db `_ which turns `PostgreSQL `_ into an object-relational database while leveraging PostgreSQL's well integrated JSON support.\n- newt.db is itself a wrapper built over the battle tested `ZODB `_ python object database and `RelStorage `_ which integrates ZODB with PostgreSQL.\n- more on newt.db here [1]_ and here [2]_\n\n.. [1] `Why Postgres Should Be Your Document Database (blog.jetbrains.com) `_\n.. [2] `Newt DB, the amphibious database (newtdb.org) `_.\n\n\n\n\nDatabase Setup\n---------------\nTransistor maintainers prefer to use PostgreSQL with newt.db. Below is a quick setup walkthrough.\n\nAfter you have a valid PostgreSQL installation, you should install newt.db:\n\n.. code-block:: rest\n\n pip install newt.db\n\nAfter installation of newt.db, you need to provide a URI connection string for newt.db to connect to PostgreSQL. An example setup might use two files for this, with a URI as shown\nin ``examples/books_to_scrape/settings.py`` and a second file to setup newt.db, ``examples/books_to_scrape/newt_db.py``, as shown below:\n\n1. ``examples/books_to_scrape/settings.py``\n\n- not recreated here, check the source file\n\n2. ``examples/books_to_scrape/newt_db.py``:\n\n.. 
code-block:: python\n\n import os\n import newt.db\n from examples.books_to_scrape.settings import DevConfig, ProdConfig, TestConfig\n from transistor.utility.utils import get_debug_flag\n\n def get_config():\n if 'APPVEYOR' in os.environ:\n return TestConfig\n return DevConfig if get_debug_flag() else ProdConfig\n\n CONFIG = get_config()\n ndb = newt.db.connection(CONFIG.NEWT_DB_URI)\n\nNext, we need to store our first two python objects in newt.db, which are:\n\n1. A list collection object, so we have a place to store our scrapes.\n2. An object to hold our list collection object, so that we can have a list of lists.\n\n.. code-block:: python\n\n from transistor.persistence.newt_db.collections import SpiderList, SpiderLists\n\nNow, from your python repl:\n\n.. code-block:: python\n\n >>> from transistor.newt_db import ndb\n\n >>> ndb.root.spiders = SpiderLists() # Assigning SpiderLists() is only required during initial setup, or else when/if you change the SpiderLists() object, for example, to provide more functionality to the class.\n >>> ndb.root.spiders.add('first-scrape', SpiderList()) # You will add a new SpiderList() anytime you need a new list container. Like, every single scrape you save. See the ``process_exports`` method in ``examples/books_to_scrape/workgroup.py``.\n >>> ndb.commit() # you must explicitly commit() after each change to newt.db.\n\nAt this point, you are ready-to-go with newt.db and PostgreSQL.\n\nLater, when you have a scraper object instance, such as ``BooksToScrapeScraper()``, which has finished its web scrape cycle, it will be stored in the ``SpiderList()`` named ``first-scrape`` like so:\n\n.. code-block:: python\n\n >>> ndb.root.spiders['first-scrape'].add(BooksToScrapeScraper(name=\"books.toscrape.com\", book_title=\"Soumission\"))\n\n\nMore on StatefulBook\n--------------------\n\nPractical use requires multiple methods of input and output. 
``StatefulBook`` provides a method for reading an excel file with one column of search terms, *part numbers* in the example below, which we would like to search across multiple websites that sell such components:\n\n.. code-block:: python\n\n    >>> from transistor import StatefulBook\n\n    >>> filepath = '/path/to/your/file.xlsx'\n    >>> trackers = ['mousekey.cn', 'mousekey.com', 'digidog.com.cn', 'digidog.com']\n\nThis will create four separate task trackers, one for each of the four websites to search with the part numbers:\n\n.. code-block:: python\n\n    >>> book = StatefulBook(filepath, trackers, keywords=\"part_numbers\")\n\n    >>> book.to_do()\n\nOutput: a deque containing four ``TaskTracker`` objects, one per website tracker.\n\nSo now, each website we intend to scrape has its own task queue. To work with an individual tracker and see what is in its individual to_do work queue:\n\n.. code-block:: python\n\n    >>> for tracker in book.to_do():\n    ...     if tracker.name == 'mousekey.cn':\n    ...         ms_tracker = tracker\n\n    >>> ms_tracker.to_do()\n\n    deque(['050R30-76B', '1050170001', '12401598E4#2A', '525591052', '687710152002', 'ZL38063LDG1'])\n\n\n\nTesting\n-------------\n\nThe easiest way to test your scraper logic is to download the webpage html and then pass the html file in, along with a test dict.\nBelow is an example:\n\n..
code-block:: python\n\n    from pathlib import Path\n\n    data_folder = Path(\"c:/Users//repos//tests/scrapers/component/mousekey\")\n    file_to_open = data_folder / \"mousekey.cn.html\"\n    with open(file_to_open, encoding='utf-8') as f:\n        page = f.read()\n    test_dict = {\"_test_true\": True, \"_test_page_text\": page, \"_test_status_code\": 200, \"autostart\": True}\n\n    from my_custom_scrapers.component.mousekey import MouseKeyScraper\n\n    ms = MouseKeyScraper(part_number='GRM1555C1H180JA01D', **test_dict)\n\n    assert ms.stock() == '17,090'\n    assert ms.pricing() == '{\"1\": \"CNY \u00a50.7888\", \"10\": \"CNY \u00a50.25984\", \"100\": \"CNY \u00a50.1102\", ' \\\n                           '\"500\": \"CNY \u00a50.07888\", \"10,000\": \"CNY \u00a50.03944\"}'\n\n", "description_content_type": "text/x-rst", "docs_url": null, "download_url": "https://github.com/bomquote/transistor/archive/v0.2.2.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/bomquote/transistor", "keywords": "scraping,crawling,spiders,requests,beautifulsoup4,mechanicalsoup,framework,headless-browser", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "transistor", "package_url": "https://pypi.org/project/transistor/", "platform": "", "project_url": "https://pypi.org/project/transistor/", "project_urls": { "Download": "https://github.com/bomquote/transistor/archive/v0.2.2.tar.gz", "Homepage": "https://github.com/bomquote/transistor" }, "release_url": "https://pypi.org/project/transistor/0.2.2/", "requires_dist": [ "mechanicalsoup (>=0.11.0)", "requests (>=2.20.1)", "urllib3 (>=1.24.1)", "keyring (>=16.1.1)", "lxml (>=4.2.5)", "lz4 (>=2.1.2)", "pyexcel (>=0.5.9.1)", "pyexcel-io (>=0.5.10)", "pyexcel-ods3 (>=0.5.3)", "pyexcel-webio (>=0.1.4)", "pyexcel-xls (>=0.5.8)", "pyexcel-xlsx (>=0.5.6)", "cookiecutter (>=1.6.0)", "cssselect (>=1.0.3)", "w3lib (>=1.19.0)", "pycryptodome (>=3.7.2)", "gevent (>=1.3.7)", "newt.db (>=0.9.0); extra == 'newt.db'", "zodbpickle (>=1.0.2); 
extra == 'newt.db'", "persistent (>=4.4.3); extra == 'newt.db'", "zodb (>=5.5.1); extra == 'newt.db'" ], "requires_python": ">=3.6.0", "summary": "A web scraping framework for intelligent use cases.", "version": "0.2.2" }, "last_serial": 4554751, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "58757302ebba29fc665a499043580e2a", "sha256": "359adc143fe812ea394d4cfa9ca3e141231d5f7e38c69b38951c9c4ba889086d" }, "downloads": -1, "filename": "transistor-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "58757302ebba29fc665a499043580e2a", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 12129, "upload_time": "2018-11-12T18:44:59", "url": "https://files.pythonhosted.org/packages/a5/21/14ae5d27eea492427572d539cfa58e4625cecd9b8aa6b4ebc30846c71494/transistor-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "085f5dce58fafc3b1e05cc35b07aa558", "sha256": "8205c52052a12ae2f2f9334961460fe1c87eefb46e7eea48053532f12919a89d" }, "downloads": -1, "filename": "transistor-0.1.0.tar.gz", "has_sig": false, "md5_digest": "085f5dce58fafc3b1e05cc35b07aa558", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 37368, "upload_time": "2018-11-12T18:45:02", "url": "https://files.pythonhosted.org/packages/5c/e7/d2c80e0f46f16dddd8485a3d21bbd7e511182d742d879b6498ab2544909f/transistor-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "cf24426ed95ce24b1b5aff07882e3f98", "sha256": "62dccf301e4b8605b3102cbe8429b320c69dbd5d8a3b5121fc4c0c385824c012" }, "downloads": -1, "filename": "transistor-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "cf24426ed95ce24b1b5aff07882e3f98", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5.0", "size": 89483, "upload_time": "2018-11-17T08:21:28", "url": 
"https://files.pythonhosted.org/packages/0f/15/bfccc83a4f0dff849bdefccc1112b0a71e72c97cff23b99d496c5660362d/transistor-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "124529868ff8ae70bfc25f3f4d0bd3c8", "sha256": "77a7d12256446eeeda17e7173bf0f81fca415422a8abd5628904e7a7a2d6a9d8" }, "downloads": -1, "filename": "transistor-0.1.1.tar.gz", "has_sig": false, "md5_digest": "124529868ff8ae70bfc25f3f4d0bd3c8", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5.0", "size": 94418, "upload_time": "2018-11-17T08:21:29", "url": "https://files.pythonhosted.org/packages/a7/93/f523e6cd0b60a1786ae7e3530e3e7859c8c11d96403761ca5a855063c865/transistor-0.1.1.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "29ef59ce21754160fdad3081375b132b", "sha256": "dc2850d8dec7f7adc961ceb96ebb9dd76f36b65fbd358a11979ca7b7e5b385be" }, "downloads": -1, "filename": "transistor-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "29ef59ce21754160fdad3081375b132b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.0", "size": 121700, "upload_time": "2018-11-29T05:42:08", "url": "https://files.pythonhosted.org/packages/54/77/5e4c72fe04d76aa8c7466933de6cc1d42f964ee248e13b5bab9899efd963/transistor-0.2.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9f2941e081c7b626fbad17159a5ca1dc", "sha256": "7b1bb95b3430b840d09935b5532f9f616f5f15bab967c659f63d32335379f160" }, "downloads": -1, "filename": "transistor-0.2.0.tar.gz", "has_sig": false, "md5_digest": "9f2941e081c7b626fbad17159a5ca1dc", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 116285, "upload_time": "2018-11-29T05:42:10", "url": "https://files.pythonhosted.org/packages/7a/7d/4a77277e85f19d36c78942751b5ae7c9aec3482a27771fd7374de3673b24/transistor-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "1ce0461cb8bbcfd1be455bdc89b7d44d", "sha256": 
"a1ad49027b507237f83d10fde6ca6c273172a5b8a86c5a7eef7e4e998da839c4" }, "downloads": -1, "filename": "transistor-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "1ce0461cb8bbcfd1be455bdc89b7d44d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.0", "size": 121839, "upload_time": "2018-11-29T06:43:49", "url": "https://files.pythonhosted.org/packages/f6/aa/03c8fa58c8fdfa20782366751c3c58eb2015762a04c6167b29a6f53b6099/transistor-0.2.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "58902dc6d8a073dcd68d9b251652bee4", "sha256": "492cc2c93346a36e4ec59ebd4a4d1f1a56dc4549d97e8ba52494196b3b612624" }, "downloads": -1, "filename": "transistor-0.2.1.tar.gz", "has_sig": false, "md5_digest": "58902dc6d8a073dcd68d9b251652bee4", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 116858, "upload_time": "2018-11-29T06:43:51", "url": "https://files.pythonhosted.org/packages/de/0d/7c4d0553f300d24013be09135cf22eeeb70955e10ec8f2cbed3d78f65e8e/transistor-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "ba7af5ece9ce177a38e52307467deb24", "sha256": "85958f109dc058c174bfa31939545809da49ab5bed0a837226129c07d95f1eeb" }, "downloads": -1, "filename": "transistor-0.2.2-py3-none-any.whl", "has_sig": false, "md5_digest": "ba7af5ece9ce177a38e52307467deb24", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.0", "size": 140813, "upload_time": "2018-12-03T08:14:43", "url": "https://files.pythonhosted.org/packages/6a/90/64666315ba09744d67ab9450108364baa97410d93dbbe70ee371001c9a67/transistor-0.2.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d6dfa88219954bd432ae643b750bc23d", "sha256": "426f56ef5cd78603345d9cb53655d23c12754e5f832ca55e3f4ab67db1b2e25e" }, "downloads": -1, "filename": "transistor-0.2.2.tar.gz", "has_sig": false, "md5_digest": "d6dfa88219954bd432ae643b750bc23d", "packagetype": "sdist", "python_version": "source", 
"requires_python": ">=3.6.0", "size": 123261, "upload_time": "2018-12-03T08:14:45", "url": "https://files.pythonhosted.org/packages/88/ad/14d2280c10781e1addd1269f657d3890efa68719d19020b3f1dbec164cd9/transistor-0.2.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ba7af5ece9ce177a38e52307467deb24", "sha256": "85958f109dc058c174bfa31939545809da49ab5bed0a837226129c07d95f1eeb" }, "downloads": -1, "filename": "transistor-0.2.2-py3-none-any.whl", "has_sig": false, "md5_digest": "ba7af5ece9ce177a38e52307467deb24", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.0", "size": 140813, "upload_time": "2018-12-03T08:14:43", "url": "https://files.pythonhosted.org/packages/6a/90/64666315ba09744d67ab9450108364baa97410d93dbbe70ee371001c9a67/transistor-0.2.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d6dfa88219954bd432ae643b750bc23d", "sha256": "426f56ef5cd78603345d9cb53655d23c12754e5f832ca55e3f4ab67db1b2e25e" }, "downloads": -1, "filename": "transistor-0.2.2.tar.gz", "has_sig": false, "md5_digest": "d6dfa88219954bd432ae643b750bc23d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 123261, "upload_time": "2018-12-03T08:14:45", "url": "https://files.pythonhosted.org/packages/88/ad/14d2280c10781e1addd1269f657d3890efa68719d19020b3f1dbec164cd9/transistor-0.2.2.tar.gz" } ] }