{ "info": { "author": "Gavin Zhang", "author_email": "gavinz0228@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX", "Programming Language :: Python :: 3" ], "description": "# pifetcher\nA scalable headless data-fetching library written in Python that uses a message queue service to parse web data quickly and easily in a distributed way.\n\n## To install\n```\npip install pifetcher\n```\n\n## PYPI Link [https://pypi.org/project/pifetcher/](https://pypi.org/project/pifetcher/)\n\n## dependencies:\n- selenium\n- BeautifulSoup4\n- boto3 (optional, but used by default)\n- ChromeDriver for Chrome 76 (by default)\n- Chrome executable v76 (by default)\n\n## features:\n\n- event-callback-based interaction between user-defined logic and the pre-designed fetch worker\n- works are processed in batches; the library user can capture the event fired when a batch of works has finished\n- easy to use: you only need to inherit the FetchWorker class and implement the basic callback functions\n- designed around a message queue, which enables more than one worker to fetch data so the application can scale\n\n## how to use:\n\n1. set up the work queue component on the host computer (AWS Simple Queue Service by default), including credentials and region\n[AWS BOTO3 initial set up docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html)\n\n2. 
configure a fetcher by creating a field mapping config file. For example, a mapping config file for fetching amazon.com item pricing data:\n\n```javascript\n{\n \"price\": {\n \"type\": \"text\",\n \"selector\": \"#priceblock_ourprice\",\n \"attribute\":\".text\"\n },\n \"id\": {\n \"type\": \"text\",\n \"selector\": \"#ASIN\",\n \"attribute\": \"value\"\n },\n \"title\": {\n \"type\": \"text\",\n \"selector\": \"#productTitle\",\n \"attribute\":\".text\"\n }\n}\n```\n3. create a pifetcherConfig.json file, and add the fetcher mapping file created in the previous step to fetcher -> mappingConfigs with its name and file path\n\nnumWorksPerTime : the number of messages the worker tries to fetch from the queue per work cycle\npollingIntervalOnActive : time interval before fetching the next message when the worker status is active (meaning it fetched at least one message in the last worker cycle)\npollingIntervalOnIdle : time interval before fetching the next message when the worker status is idle (meaning it fetched no message in the last worker cycle)\n\n```javascript\n{\n \"browser\":{\n \"browser_options\":[\"--window-size=1920,1080\", \"--disable-extensions\", \"--proxy-server='direct://'\", \"--proxy-bypass-list=*\", \"--start-maximized\",\"--ignore-certificate-errors\", \"--headless\"],\n \"win-driver_path\":\"chromedriver-win-76.exe\",\n \"win-binary_location\": \"\",\n \"mac-driver_path\":\"chromedriver-mac-76\",\n \"mac-binary_location\": \"\"\n\n },\n \"queue\":\n {\n \"numWorksPerTime\": 1,\n \"queueType\":\"AWSSimpleQueueService\",\n \"queueName\":\"datafetch.fifo\",\n \"pollingIntervalOnActive\": 0.2,\n \"pollingIntervalOnIdle\": 60\n },\n \"logger\":\n {\n \"output\":\"console\"\n },\n \"fetcher\":\n {\n \"mappingConfigs\":{\n \"amazon\":\"amazon.json\"\n }\n\n }\n}\n```\n4. 
use the fetch worker\n- import the fetch worker class and the config class\n```python\nfrom pifetcher.core import Config\nfrom pifetcher.core import FetchWorker\n```\n- load pifetcherConfig.json into the Config class\n```python\nConfig.use('pifetcherConfig.json')\n```\n\n- implement the event functions with your own logic\non_save_result: called when a data object has been successfully parsed\non_empty_result_error: called after parsing an empty object; you may want to stop or pause the process to investigate the problem before continuing parsing\non_batch_start: called when the worker receives a batch start signal; you may implement your logic for adding fetching tasks to the queue here\non_batch_finish: called when the worker receives a batch finish signal\nexample:\n```python\n# TestWorker is an example subclass name; the callbacks live on a class that inherits FetchWorker\nclass TestWorker(FetchWorker):\n    def on_save_result(self, result, batch_id, work):\n        print(result, batch_id, work)\n\n    def on_empty_result_error(self):\n        # stop the worker so the empty result can be investigated\n        self.stop()\n\n    def on_batch_start(self, batch_id):\n        # queue one fetching task that uses the 'amazon' mapping config\n        work = {}\n        work['url'] = 'an amazon url'\n        work['fetcherName'] = 'amazon'\n        self.add_works([work])\n\n    def on_batch_finish(self, batch_id):\n        print(f\"all works with the batchId {batch_id} have been processed\")\n```\n5. 
run the worker and send a StartProcess signal to the queue to start the process\n\n- start the worker to receive and process works\n\n```python\ntw = TestWorker()\ntw.do_works()\n```\n\n- to send a start signal to the queue\nIf you want to send out the start signal from one of the workers, you can call this method\n```python\ntw.send_start_signal()\n```\n\nBut if you want to start the batch process from another system, you can use the code below\n```python\nimport json\nimport time\nimport uuid\nimport boto3\nsqs = boto3.resource('sqs')\nqueue = sqs.get_queue_by_name(QueueName='datafetch.fifo')\ncontent = {\"type\": \"BatchStart\", \"batchId\": str(uuid.uuid4()), \"content\": {}}\nqueue.send_message(MessageBody=json.dumps(content), MessageGroupId=\"FetchWork\", MessageDeduplicationId=str(time.time()).replace(\".\", \"\"))\n```\n\nCommand to kill all chromedriver processes on Windows\n```\ntaskkill /f /im chromedriver-win-76.exe\n```\n\n# How to optimize the number of polls the worker sends to the queue\n\nWhen no message is fetched in a worker cycle, the worker enters the idle state. In the idle state, it waits a longer time interval before trying to fetch the next message. This sleep interval is defined in the config file at the location:\n```javascript\n \"pollingIntervalOnIdle\": 60\n```\n\nAfter the worker receives at least one message in a worker cycle, its status is set to ACTIVE. In this state, it waits a shorter time interval before trying to fetch the next message. 
This sleep interval is defined in the config file at the location:\n```javascript\n \"pollingIntervalOnActive\": 0.2,\n```\n\n\n### To do list items:\n- fix browser driver issues\n- simplify initial setup process\n\n### Completed items:\n- use a better strategy to reduce the number of requests a worker has to send\n- put all constants in the config file (checked)\n- complete the type conversions for different data types (checked)\n- add message type (work initiation message type) (checked)\n- logging (checked)\n- data fetching using AWS SQS\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gavinz0228/pifetcher", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "pifetcher", "package_url": "https://pypi.org/project/pifetcher/", "platform": "", "project_url": "https://pypi.org/project/pifetcher/", "project_urls": { "Homepage": "https://github.com/gavinz0228/pifetcher" }, "release_url": "https://pypi.org/project/pifetcher/0.0.3.4/", "requires_dist": [ "bs4", "selenium", "boto3" ], "requires_python": "", "summary": "A scalable headless data fetching library written with python and message queue service to enable quickly and easily prasing web data in a distributive way.", "version": "0.0.3.4" }, "last_serial": 5877394, "releases": { "0.0.2.5": [ { "comment_text": "", "digests": { "md5": "63000b608275be44a3a49766cded06f0", "sha256": "e96b04735ecd3f268a50b93d2d5709d91add070238985a3354631c86c879347c" }, "downloads": -1, "filename": "pifetcher-0.0.2.5-py3-none-any.whl", "has_sig": false, "md5_digest": "63000b608275be44a3a49766cded06f0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21774204, "upload_time": "2019-08-26T03:34:53", "url": 
"https://files.pythonhosted.org/packages/c6/ae/acc3e4f8bdee1a303c74c3da7499faa397d9a8e25e3147bbc03aa6d9d10b/pifetcher-0.0.2.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d23adfcfae6b888d9a8d48b6818c60be", "sha256": "fead05dfafc8d71f116625d733eee65692d57bb216edde3ba1b4a13a6ef713f0" }, "downloads": -1, "filename": "pifetcher-0.0.2.5.tar.gz", "has_sig": false, "md5_digest": "d23adfcfae6b888d9a8d48b6818c60be", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21695238, "upload_time": "2019-08-26T03:35:03", "url": "https://files.pythonhosted.org/packages/de/fd/2e59f4f7c79aa65ccc86e574c8aa5f557cb3cb89ccd1cef519a40a546ac9/pifetcher-0.0.2.5.tar.gz" } ], "0.0.2.6": [ { "comment_text": "", "digests": { "md5": "2f416d98cfea40a92a37cddb68c24ff2", "sha256": "b74a29e8b3c4ed38fad9fc63ecea52281f2081de85e0eedd6775f1f945a4a9fe" }, "downloads": -1, "filename": "pifetcher-0.0.2.6-py3-none-any.whl", "has_sig": false, "md5_digest": "2f416d98cfea40a92a37cddb68c24ff2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21774833, "upload_time": "2019-08-27T01:52:13", "url": "https://files.pythonhosted.org/packages/17/62/6aaf168f78b57a70f6e6ae26ed3f6e16c9529c8d0c2fc17ec07c9bf8cf20/pifetcher-0.0.2.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "31cb02c4153b70b11a49e018b7ab3d43", "sha256": "8beab05f872fdec2829073bc7e1f47f09b30057866473d463247d6f4bf3340ef" }, "downloads": -1, "filename": "pifetcher-0.0.2.6.tar.gz", "has_sig": false, "md5_digest": "31cb02c4153b70b11a49e018b7ab3d43", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21696135, "upload_time": "2019-08-27T01:52:34", "url": "https://files.pythonhosted.org/packages/f9/82/8e7cc967bc069c96b5694ff4ab29490c02d80df164938031e9a14d6e73d6/pifetcher-0.0.2.6.tar.gz" } ], "0.0.2.7": [ { "comment_text": "", "digests": { "md5": "5ebe13a643eced0c3b892a8c474bd891", "sha256": 
"8d549eb6030b5566caf3f3589ccadbe3772212967cf1f4822dd8cfd4614cfa40" }, "downloads": -1, "filename": "pifetcher-0.0.2.7-py3-none-any.whl", "has_sig": false, "md5_digest": "5ebe13a643eced0c3b892a8c474bd891", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775413, "upload_time": "2019-08-28T03:45:04", "url": "https://files.pythonhosted.org/packages/02/78/63133dd9692ddf9bd19e5d5279b22c33c237b42a49c490ba3253fe7ccc32/pifetcher-0.0.2.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a34d8fcb3cf1c4b8622df26b29df201f", "sha256": "7eff53e824a5162306e7621e324df1ec8fe2cbf3f0d90e3ad2b920dcbc89e25c" }, "downloads": -1, "filename": "pifetcher-0.0.2.7.tar.gz", "has_sig": false, "md5_digest": "a34d8fcb3cf1c4b8622df26b29df201f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21696896, "upload_time": "2019-08-28T03:45:20", "url": "https://files.pythonhosted.org/packages/0f/a1/da4ed3de3d039d8c3fcd42947311448e2ecbb7cb7d16f8ebe0aa33fac4d6/pifetcher-0.0.2.7.tar.gz" } ], "0.0.2.8": [ { "comment_text": "", "digests": { "md5": "66238f644c3bb2f32f612868aea66799", "sha256": "b67e7f6c86fcf321d820ca8834d615a9a6ce98efdbe582a84bf5b8bd41bc0bbf" }, "downloads": -1, "filename": "pifetcher-0.0.2.8-py3-none-any.whl", "has_sig": false, "md5_digest": "66238f644c3bb2f32f612868aea66799", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775431, "upload_time": "2019-08-28T04:22:33", "url": "https://files.pythonhosted.org/packages/17/87/5522bc87543b2181be2e7ebd3b86d5e02dda188461c3f1589f537632cc27/pifetcher-0.0.2.8-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "90af58eaf2d3fe06a105a85e4567eaa3", "sha256": "704b6634e53461497778a10230c3deb4e89595ed41ecebf53882f215b5b2ca04" }, "downloads": -1, "filename": "pifetcher-0.0.2.8.tar.gz", "has_sig": false, "md5_digest": "90af58eaf2d3fe06a105a85e4567eaa3", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 21696883, "upload_time": "2019-08-28T04:22:42", "url": "https://files.pythonhosted.org/packages/67/4c/843480a226274e4b97d83b830a6c189e783c0e24e0b379ee356af9ac1be9/pifetcher-0.0.2.8.tar.gz" } ], "0.0.2.9": [ { "comment_text": "", "digests": { "md5": "9ee16c8825aef009a8b61ba1aba9266d", "sha256": "394a515afe480e15c9737491c335ee0a2d40154961067a0477b3f20e9e2c87ea" }, "downloads": -1, "filename": "pifetcher-0.0.2.9-py3-none-any.whl", "has_sig": false, "md5_digest": "9ee16c8825aef009a8b61ba1aba9266d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775421, "upload_time": "2019-08-28T04:31:58", "url": "https://files.pythonhosted.org/packages/82/45/1eac5f9f6bfd5518424bad15e617284dc9f6cd0e2285d7c60ef80ef611bd/pifetcher-0.0.2.9-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d2f9f81932ee37e0b33bba2776d25b68", "sha256": "d75f1c80d0813884c5294211c9d923fa662d5ab06c40d0e11fa6e4b23e3d579c" }, "downloads": -1, "filename": "pifetcher-0.0.2.9.tar.gz", "has_sig": false, "md5_digest": "d2f9f81932ee37e0b33bba2776d25b68", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21696890, "upload_time": "2019-08-28T04:32:07", "url": "https://files.pythonhosted.org/packages/60/87/7383acda652fb0c24bb80ef2edb774bf640833402cb8f72798488a81065e/pifetcher-0.0.2.9.tar.gz" } ], "0.0.3.0": [ { "comment_text": "", "digests": { "md5": "6a9587bfa5004e63e6f9dab864d6680b", "sha256": "ce7a6dbc9d5883de2367a60e36d7d4b02cb42afab8e22e33a41f0a1e038a9756" }, "downloads": -1, "filename": "pifetcher-0.0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "6a9587bfa5004e63e6f9dab864d6680b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775425, "upload_time": "2019-09-23T02:32:26", "url": "https://files.pythonhosted.org/packages/ed/44/401153647df10b7c1e5f176530800e1400019b63b4919bf78ac34a4b35b4/pifetcher-0.0.3.0-py3-none-any.whl" }, { "comment_text": 
"", "digests": { "md5": "43712db4c8f152af7a52ca17079b07db", "sha256": "ab445a7889a25d96de470a4d9c09e588134236b8808e5dc4e81f137d92259ee8" }, "downloads": -1, "filename": "pifetcher-0.0.3.0.tar.gz", "has_sig": false, "md5_digest": "43712db4c8f152af7a52ca17079b07db", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21696882, "upload_time": "2019-09-23T02:32:35", "url": "https://files.pythonhosted.org/packages/9e/a4/fe15bc96f0bbaf478609903a2b7707412ceef71ede2b3aacb7f81bbfa574/pifetcher-0.0.3.0.tar.gz" } ], "0.0.3.1": [ { "comment_text": "", "digests": { "md5": "1a8043638998a38c3996caa987c458a2", "sha256": "e658d5648afca3b0ceb6facd5214859f6c51f3cbad9fff05e1da5b248cad5765" }, "downloads": -1, "filename": "pifetcher-0.0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "1a8043638998a38c3996caa987c458a2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775434, "upload_time": "2019-09-23T03:20:53", "url": "https://files.pythonhosted.org/packages/54/f2/da821b0dd902b0ca6c323d6474891d3ac51f04bfb29b92cd229219ee84ee/pifetcher-0.0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "066403c895b92cbc62fe8d1ee93b3e4a", "sha256": "4a30fdf1f871d5d87174c847bd6e240f3442145be95be3813435179b4dee2d09" }, "downloads": -1, "filename": "pifetcher-0.0.3.1.tar.gz", "has_sig": false, "md5_digest": "066403c895b92cbc62fe8d1ee93b3e4a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21696879, "upload_time": "2019-09-23T03:21:02", "url": "https://files.pythonhosted.org/packages/47/b9/0284108c76fa90aaf12eaea4e801cf611b9218fe5209757a66fa173fe3fa/pifetcher-0.0.3.1.tar.gz" } ], "0.0.3.2": [ { "comment_text": "", "digests": { "md5": "9284351104e00cd26dd8d8dec47d5669", "sha256": "28b6f7b1d564415151cf68bb586ee2c23beda3599b44b7c85a55971b39bb760c" }, "downloads": -1, "filename": "pifetcher-0.0.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": 
"9284351104e00cd26dd8d8dec47d5669", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775433, "upload_time": "2019-09-24T02:42:15", "url": "https://files.pythonhosted.org/packages/67/23/18dc7042fad54929eaa6e7e5e343873b7a094d14f047c65cdcc1721d2cf0/pifetcher-0.0.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cd6f3725724ac2c9964de4db577bfff2", "sha256": "801288e18d0bf4ee0378defabfbf82803e45fdf1416e90ffc7afce548be76473" }, "downloads": -1, "filename": "pifetcher-0.0.3.2.tar.gz", "has_sig": false, "md5_digest": "cd6f3725724ac2c9964de4db577bfff2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21697026, "upload_time": "2019-09-24T02:42:29", "url": "https://files.pythonhosted.org/packages/68/b4/e6f87993b06e7af19a7cb366ae05da7cc11b07c1debe7bb255d14f151d5a/pifetcher-0.0.3.2.tar.gz" } ], "0.0.3.3": [ { "comment_text": "", "digests": { "md5": "dcd2a2c73dc3a3aba04b51608be50fdd", "sha256": "960e81d304e496e570cf5a87c642a02aef11e5082fd11b6f53fec40669d7b678" }, "downloads": -1, "filename": "pifetcher-0.0.3.3-py3-none-any.whl", "has_sig": false, "md5_digest": "dcd2a2c73dc3a3aba04b51608be50fdd", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775435, "upload_time": "2019-09-24T03:00:09", "url": "https://files.pythonhosted.org/packages/e1/ea/b03ae89d9939d5cb1be0bcf14122d69ad74b4b228131bebc99aeaf64440b/pifetcher-0.0.3.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ba61f624838cfa9d5dd49b42b3a8ccf7", "sha256": "d4c75ed7206c813e5227abe993e8b5a783cf221b423462d196f623a6780a0552" }, "downloads": -1, "filename": "pifetcher-0.0.3.3.tar.gz", "has_sig": false, "md5_digest": "ba61f624838cfa9d5dd49b42b3a8ccf7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21697055, "upload_time": "2019-09-24T03:00:34", "url": 
"https://files.pythonhosted.org/packages/c5/32/8568c8fc1be2a0011734d37bc3364088d67001b01c7662ffc1dcc840aa1c/pifetcher-0.0.3.3.tar.gz" } ], "0.0.3.4": [ { "comment_text": "", "digests": { "md5": "e4e5790d6810b4b5912752ddc755f9c2", "sha256": "c16eea265125f9753459985678c27cd42cec5adf254516d25cc996c14e8c342a" }, "downloads": -1, "filename": "pifetcher-0.0.3.4-py3-none-any.whl", "has_sig": false, "md5_digest": "e4e5790d6810b4b5912752ddc755f9c2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775503, "upload_time": "2019-09-24T04:06:57", "url": "https://files.pythonhosted.org/packages/6b/dd/6e74333ea1e937e8f5b4b2e53a3be709670809b6b6b68899d909612fabd7/pifetcher-0.0.3.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5f945dcd661485c07e268661113fc0e2", "sha256": "63711c03c351d4b8c2b8d55297e7398c7629f1cde71e1f8f53bcd8cb94e5c5ee" }, "downloads": -1, "filename": "pifetcher-0.0.3.4.tar.gz", "has_sig": false, "md5_digest": "5f945dcd661485c07e268661113fc0e2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21696974, "upload_time": "2019-09-24T04:07:11", "url": "https://files.pythonhosted.org/packages/ea/68/ff80505adc5d8c400f45ff7362f01afeb0098364dda431a6f04f4841f5e0/pifetcher-0.0.3.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e4e5790d6810b4b5912752ddc755f9c2", "sha256": "c16eea265125f9753459985678c27cd42cec5adf254516d25cc996c14e8c342a" }, "downloads": -1, "filename": "pifetcher-0.0.3.4-py3-none-any.whl", "has_sig": false, "md5_digest": "e4e5790d6810b4b5912752ddc755f9c2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21775503, "upload_time": "2019-09-24T04:06:57", "url": "https://files.pythonhosted.org/packages/6b/dd/6e74333ea1e937e8f5b4b2e53a3be709670809b6b6b68899d909612fabd7/pifetcher-0.0.3.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5f945dcd661485c07e268661113fc0e2", "sha256": 
"63711c03c351d4b8c2b8d55297e7398c7629f1cde71e1f8f53bcd8cb94e5c5ee" }, "downloads": -1, "filename": "pifetcher-0.0.3.4.tar.gz", "has_sig": false, "md5_digest": "5f945dcd661485c07e268661113fc0e2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21696974, "upload_time": "2019-09-24T04:07:11", "url": "https://files.pythonhosted.org/packages/ea/68/ff80505adc5d8c400f45ff7362f01afeb0098364dda431a6f04f4841f5e0/pifetcher-0.0.3.4.tar.gz" } ] }