{ "info": { "author": "Florents Tselai", "author_email": "florents.tselai@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# [`PyPocketExplore`](http://tselai.com/pypocketexplore-collecting-exploring-predicting-pocket-items-machine-learning.html) - Unofficial API to [Pocket Explore](https://getpocket.com/explore/) data\n\n[`PyPocketExplore`](http://tselai.com/pypocketexplore-collecting-exploring-predicting-pocket-items-machine-learning.html) \nis a CLI-based and web-based API to access [Pocket Explore](https://getpocket.com/explore/) data.\nIt can be used to collect data about the most popular Pocket items for different topics.\n\nAn example usage would be crawling the data and using it as a training set to predict the number of Pocket saves for a web page.\n\n## Usage\n\nThe easiest way to install the package is through PyPI.\nThis should get you up-and-running pretty quickly.\n```shell\n$ pip install PyPocketExplore\n```\n\nThrough the CLI there are two modes: `topic` and `batch`\n\nWith the first one (`pypocketexplore topic`) you can download items from specific topics and output them to a nicely formatted JSON file.\n\n```bash\nUsage: pypocketexplore topic [OPTIONS] [LABEL]...\n\n Download items for specific topics\n\nOptions:\n --limit INTEGER Limit items to download\n --out TEXT JSON output filepath\n --nlp If set, also downloads the page and applies NLP (through\n NLTK)\n```\n\nFor example, this command\n```bash\n$ pypocketexplore topic python data sex books --nlp --out life_topics.json\n```\nwill go through the corresponding pages: \n`https://getpocket.com/explore/python, https://getpocket.com/explore/data, https://getpocket.com/explore/sex, https://getpocket.com/explore/books`\none-by-one and then:\n\n* scrape and extract the immediately available data for each item (`item_id, title, save count, excerpt and url`)\n* run each item url through the awesome [Newspaper](http://newspaper.readthedocs.io/en/latest/) library (in parallel)\n* apply
NLP to each item's text\n* save the results to `life_topics.json`\n\nIn the end you'll have a **rich dataset full of text to play with and of course a popularity metric** - pretty cool to experiment with.\nYou can check it out [here](https://tselai.com/data/life_topics.json)\n\nFor each topic on *Pocket Explore*, there is a set of `related topics` which one can crawl through pretty easily\nin a recursive way.\nFor example, after scraping `https://getpocket.com/explore/python` one can then scrape the \n`related topics: programming javascript google windows java linux data science python 3 developer`.\n\nThis essentially means that one can crawl through the whole graph of topics by following the `related topics` as edges. \nTo do this one of course needs a set of *seed topics* to initiate the crawling process.\nTo get these seeds, the `pypocketexplore batch` mode fetches the taxonomy labels provided by [IBM Watson](https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/categories.html)\nand then walks through the graph.\n(I guess Pocket uses IBM Watson to label its items, so this kind of reverse-engineering makes sense. Sorry, Pocket guys.)\n\n```bash\nUsage: pypocketexplore batch [OPTIONS]\n\n Download items for all topics recursively. 
USE WITH CAUTION!\n\nOptions:\n --n INTEGER Max number of total items to download\n --limit INTEGER Limit items to download per topic\n --out TEXT JSON output filepath\n --nlp If set, also downloads the page and applies NLP (through\n NLTK)\n --mongo TEXT Mongo DB URI to save items\n --help Show this message and exit.\n```\n\n**CAUTION**\nThis mode with all goodies enabled will take a few days to run and will collect around 300k unique items\nacross 8k topics.\nI have tried to space the requests to Pocket's servers and handle rate limit errors, \nbut one can never be sure with such things.\n\n## Web API\n\nTo have access to a standalone web API you need to clone the repo locally first.\n```shell\n$ git clone git@github.com:Florents-Tselai/PyPocketExplore.git\n$ cd PyPocketExplore\n$ pip install -r requirements.txt\n```\n\nTo run this API application, use the `flask` command as described in the [Flask Quickstart](http://flask.pocoo.org/docs/0.12/quickstart/)\n\n```shell\n$ cd PyPocketExplore\n$ export FLASK_APP=./PyPocketExplore/pypocketexplore/api/api.py\n$ export FLASK_DEBUG=1 ## if you run in debug mode.\n$ flask run\n * Running on http://localhost:5000/\n```\n\n## Web API Documentation\n\n### Topic\n\n* `GET /api/topic/{topic}` - Get topic data\n\nExample topics: `python, finance, business` and more\n\nExample `GET /api/topic/python`\n\n#### Response\n\n```json\n[\n {\n \"excerpt\": \"For part 1, see here. All the software written for this project is in Python. 
I\u2019m not an expert python programmer, far from it but the huge number of available libraries and the fact that I can make some sense of it all without having spent a lifetime in Python made this a fairly obvious choice.\",\n \"image\": \"https://d33ypg4xwx0n86.cloudfront.net/direct?url=https%3A%2F%2Fjacquesmattheij.com%2Fusb-microscope.jpg&resize=w750\",\n \"item_id\": \"1731527024\",\n \"saves_count\": 223,\n \"title\": \"Sorting 2 Tons of Lego, The software Side \u00b7 Jacques Mattheij\",\n \"topic\": \"python\",\n \"url\": \"https://jacquesmattheij.com/sorting-lego-the-software-side\"\n },\n \n {\n \"excerpt\": \"There are lots of free resources for learning Python available now. I wrote about some of them way back in 2013, but there\u2019s even more now then there was then! In this article, I want to share these resources with you.\",\n \"image\": \"https://d33ypg4xwx0n86.cloudfront.net/direct?url=https%3A%2F%2Fdz2cdn1.dzone.com%2Fstorage%2Farticle-thumb%2F5158392-thumb.jpg&resize=w750\",\n \"item_id\": \"1727350036\",\n \"saves_count\": 59,\n \"title\": \"Free Python Resources\",\n \"topic\": \"python\",\n \"url\": \"https://dzone.com/articles/free-python-resources\"\n },\n \n {\n \"excerpt\": \"A surprisingly versatile Swiss Army knife\u200a\u2014\u200awith very long blades!TL;DRWe (an investment bank in the Eurozone) are deploying Jupyter and the Python scientific stack in a corporate environment to provide employees and contractors with an interactive computing environment with to help them leve\",\n \"image\": \"https://d33ypg4xwx0n86.cloudfront.net/direct?url=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AmeN9gfB_nuwmGGwLQzhVQA.png&resize=w750\",\n \"item_id\": \"1726489646\",\n \"saves_count\": 41,\n \"title\": \"Jupyter & Python in the corporate LAN\",\n \"topic\": \"python\",\n \"url\": \"https://medium.com/@olivier.borderies/jupyter-python-in-the-corporate-lan-109e2ffde897\"\n },\n 
...\n]\n```\n\n\nLicense\n-------\n\nCopyright (c) 2017 [Florents Tselai](https://tselai.com)\nLicensed under the [MIT license](http://opensource.org/licenses/MIT).\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://tselai.com/pypocketexplore-collecting-exploring-predicting-pocket-items-machine-learning", "keywords": "", "license": "The MIT License (MIT) Copyright \u00a9 2017 Florents Tselai.", "maintainer": "", "maintainer_email": "", "name": "PyPocketExplore", "package_url": "https://pypi.org/project/PyPocketExplore/", "platform": "", "project_url": "https://pypi.org/project/PyPocketExplore/", "project_urls": { "Homepage": "http://tselai.com/pypocketexplore-collecting-exploring-predicting-pocket-items-machine-learning" }, "release_url": "https://pypi.org/project/PyPocketExplore/1.0.0/", "requires_dist": null, "requires_python": "", "summary": "PyPocketExplore - Unofficial API to Pocket Explore data", "version": "1.0.0" }, "last_serial": 3100469, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "cb4cc1cd467e7393e7f8eba937d50637", "sha256": "5c5b66f2bff2c3db133cd2f433711b5e9af8ca62d7458ad7d79354af9aab3363" }, "downloads": -1, "filename": "PyPocketExplore-1.0.0.tar.gz", "has_sig": false, "md5_digest": "cb4cc1cd467e7393e7f8eba937d50637", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8275, "upload_time": "2017-08-16T11:29:10", "url": "https://files.pythonhosted.org/packages/aa/db/5bb6f7ba2de734981fd26ccf6cedc4cc1368a33c349eef980bd17af2f860/PyPocketExplore-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cb4cc1cd467e7393e7f8eba937d50637", "sha256": "5c5b66f2bff2c3db133cd2f433711b5e9af8ca62d7458ad7d79354af9aab3363" }, "downloads": -1, "filename": "PyPocketExplore-1.0.0.tar.gz", "has_sig": false, "md5_digest": "cb4cc1cd467e7393e7f8eba937d50637", "packagetype": "sdist", 
"python_version": "source", "requires_python": null, "size": 8275, "upload_time": "2017-08-16T11:29:10", "url": "https://files.pythonhosted.org/packages/aa/db/5bb6f7ba2de734981fd26ccf6cedc4cc1368a33c349eef980bd17af2f860/PyPocketExplore-1.0.0.tar.gz" } ] }