{ "info": { "author": "['Ahmet Taspinar', 'Lasse Schuirmann', 'Sriram Kumar']", "author_email": "srirambsk1996@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "Synopsis\n========\n\nA simple script to scrape for Tweets using the Python package requests\nto retrieve the content and BeautifulSoup4 to parse the retrieved\ncontent.\n\n1. Motivation\n=============\n\nTwitter has provided `REST\nAPI's `__ which can be used by\ndevelopers to access and read Twitter data. They have also provided a\n`Streaming API `__ which can\nbe used to access Twitter Data in real-time.\n\nMost of the software written to access Twitter data provides a library\nthat functions as a wrapper around Twitter's Search and Streaming APIs\nand is therefore limited by the limitations of those APIs.\n\nWith Twitter's Search API you can only send 180 requests every 15\nminutes. With a maximum number of 100 tweets per request this means you\ncan mine for 4 x 180 x 100 = 72,000 tweets per hour. By using\nTwitterScraper you are not limited by this number but by your internet\nspeed/bandwidth and the number of instances of TwitterScraper you are\nwilling to start.\n\nOne of the bigger disadvantages of the Search API is that you can only\naccess Tweets written in the **past 7 days**. This is a major bottleneck\nfor anyone looking for older data to build a model from. With\nTwitterScraper there is no such limitation.\n\nPer Tweet it scrapes the following information: \n + Tweet-id \n + Tweet-url \n + Tweet text \n + Tweet html \n + Tweet timestamp \n + Tweet Epoch timestamp\n + Tweet No. of likes\n + Tweet No. of replies\n + Tweet No. 
of retweets\n + Username\n + User Full Name\n + User ID\n + Tweet is a retweet (only when scraping for user profiles)\n + Username retweeter (only when scraping for user profiles)\n + Userid retweeter (only when scraping for user profiles)\n + Retweet ID (only when scraping for user profiles)\n\nIn addition it can scrape for the following user information: \n + Date user joined\n + User location (if filled in)\n + User blog (if filled in)\n + User No. of tweets\n + User No. of following\n + User No. of followers\n + User No. of likes\n + User No. of lists\n + User is verified\n\n\n2. Installation and Usage\n=========================\n\nTo install **sriram-twitter-scraper**:\n\n.. code:: bash\n\n (sudo) pip install sriram-twitter-scraper\n\nor clone the repository and, in the folder containing setup.py, run:\n\n.. code:: bash\n\n python setup.py install\n\n2.2 The CLI\n-----------\n\nYou can use the command line application to get your tweets stored to\nJSON right away. Twitterscraper takes several arguments:\n\n- ``-h`` or ``--help`` Prints the help message and exits.\n\n- ``-l`` or ``--limit`` TwitterScraper stops scraping when *at least*\n the number of tweets indicated with ``--limit`` is scraped. Since\n tweets are retrieved in batches of 20, this will always be a multiple\n of 20. Omit the limit to retrieve all tweets. You can abort the\n scraping at any time by pressing Ctrl+C; the scraped tweets will be stored safely\n in your JSON file.\n\n- ``--lang`` Retrieves tweets written in a specific language. Currently\n 30+ languages are supported. For a full list of the languages print\n out the help message.\n\n- ``-bd`` or ``--begindate`` Set the date from which TwitterScraper\n should start scraping for your query. Format is YYYY-MM-DD. The\n default value is set to 2006-03-21. This does not work in combination with ``--user``.\n\n- ``-ed`` or ``--enddate`` Set the end date at which TwitterScraper should\n stop scraping for your query. 
Format is YYYY-MM-DD. The\n default value is set to today. This does not work in combination with ``--user``.\n\n- ``-u`` or ``--user`` Scrapes the tweets from that user's profile page.\n This also includes all retweets by that user. See section 2.2.3 in the examples below \n for more information.\n\n- ``--profiles`` : Twitterscraper will, in addition to the tweets, also scrape for the profile \n information of the users who have written these tweets. The results will be saved in the \n file userprofiles_.\n\n- ``-p`` or ``--poolsize`` Set the number of parallel processes\n TwitterScraper should initiate while scraping for your query. Default\n value is set to 20. Depending on the computational power you have,\n you can increase this number. It is advised to keep this number below\n the number of days you are scraping. For example, if you are\n scraping from 2017-01-10 to 2017-01-20, you can set this number to a\n maximum of 10. If you are scraping from 2016-01-01 to 2016-12-31, you\n can increase this number to a maximum of 150, if you have the\n computational resources. Does not work in combination with ``--user``.\n\n- ``-o`` or ``--output`` Gives the name of the output file. If no\n output filename is given, the default filename 'tweets.json' or 'tweets.csv' \n will be used.\n\n- ``-c`` or ``--csv`` Write the result to a CSV file instead of a JSON file.\n\n- ``-d`` or ``--dump``: With this argument, the scraped tweets will be\n printed to the screen instead of an output file. If you are using this\n argument, the ``--output`` argument does not need to be used.\n\n- ``-ow`` or ``--overwrite``: With this argument, if the output file already exists\n it will be overwritten. 
If this argument is not set (default), twitterscraper will \n exit with a warning that the output file already exists.\n\n\n2.2.1 Examples of simple queries\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nBelow are examples of how twitterscraper can be used:\n\n``twitterscraper Trump --limit 1000 --output=tweets.json``\n\n``twitterscraper Trump -l 1000 -o tweets.json``\n\n``twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``\n\n\n\n2.2.2 Examples of advanced queries\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nYou can use any advanced query Twitter supports. An advanced query\nshould be placed within quotes, so that twitterscraper can recognize it\nas one single query.\n\nHere are some examples:\n\n- search for the occurrence of 'Bitcoin' or 'BTC':\n ``twitterscraper \"Bitcoin OR BTC\" -o bitcoin_tweets.json -l 1000``\n- search for the occurrence of 'Bitcoin' and 'BTC':\n ``twitterscraper \"Bitcoin AND BTC\" -o bitcoin_tweets.json -l 1000``\n- search for tweets from a specific user:\n ``twitterscraper \"Blockchain from:VitalikButerin\" -o blockchain_tweets.json -l 1000``\n- search for tweets to a specific user:\n ``twitterscraper \"Blockchain to:VitalikButerin\" -o blockchain_tweets.json -l 1000``\n- search for tweets written from a location:\n ``twitterscraper \"Blockchain near:Seattle within:15mi\" -o blockchain_tweets.json -l 1000``\n\nYou can construct an advanced query on `Twitter Advanced Search `__ or use one of the operators shown on `this page `__.\nAlso see `Twitter's Standard operators `__\n\n\n\n2.2.3 Examples of scraping user pages\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nYou can also scrape all tweets written or retweeted by a specific user. \nThis can be done by adding the boolean ``-u / --user`` argument. \nIf this argument is used, the search term should be equal to the username. 
\n\nHere is an example of scraping a specific user:\n\n``twitterscraper realDonaldTrump --user -o tweets_username.json``\n\nThis does not work in combination with ``-p``, ``-bd``, or ``-ed``.\n\nThe main difference with the example \"search for tweets from a specific user\" in section 2.2.2 is that this method really scrapes\nall tweets from a profile page (including retweets). \nThe example in 2.2.2 scrapes the results from the search page (excluding retweets). \n\n\n2.3 From within Python\n----------------------\n\nYou can easily use TwitterScraper from within Python:\n\n::\n\n    from twitterscraper import query_tweets\n\n    if __name__ == '__main__':\n        # Run the query once and reuse the result\n        list_of_tweets = query_tweets(\"Trump OR Clinton\", 10)\n\n        # Print the retrieved tweets to the screen:\n        for tweet in list_of_tweets:\n            print(tweet)\n\n        # Or save the text of the retrieved tweets to a file\n        # (assumes each tweet object exposes a text attribute, matching the text field in the output):\n        with open(\"output.txt\", \"w\", encoding=\"utf-8\") as file:\n            for tweet in list_of_tweets:\n                file.write(tweet.text + \"\\n\")\n\n\n2.4 Scraping for retweets\n-------------------------\n\nA regular search within Twitter will not show you any retweets. \nTwitterscraper therefore does not contain any retweets in the output. \n\nTo give an example: If user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet, \na search for ``#trump2020`` will only show the original tweet. \n\nThe only way you can scrape for retweets is if you scrape for all tweets of a specific user with the ``-u / --user`` argument. \n\n\n2.5 Scraping for User Profile information\n-----------------------------------------\nBy adding the argument ``--profiles``, twitterscraper will, in addition to the tweets, also scrape for the profile information of the users who have written these tweets.\nThe results will be saved in the file \"userprofiles_\".\n\nTry not to use this argument too much. 
If you have already scraped profile information for a set of users, there is no need to do it again :)\nIt is also possible to scrape for profile information without scraping for tweets. \nExamples of this can be found in the examples folder. \n\n\n3. Output\n=========\n\nAll of the retrieved Tweets are stored in the indicated output file. The\ncontents of the output file will look like:\n\n::\n\n [{\"fullname\": \"Rupert Meehl\", \"id\": \"892397793071050752\", \"likes\": \"1\", \"replies\": \"0\", \"retweets\": \"0\", \"text\": \"Latest: Trump now at lowest Approval and highest Disapproval ratings yet. Oh, we're winning bigly here ...\\n\\nhttps://projects.fivethirtyeight.com/trump-approval-ratings/?ex_cid=rrpromo\\u00a0\\u2026\", \"timestamp\": \"2017-08-01T14:53:08\", \"user\": \"Rupert_Meehl\"}, {\"fullname\": \"Barry Shapiro\", \"id\": \"892397794375327744\", \"likes\": \"0\", \"replies\": \"0\", \"retweets\": \"0\", \"text\": \"A former GOP Rep quoted this line, which pretty much sums up Donald Trump. https://twitter.com/davidfrum/status/863017301595107329\\u00a0\\u2026\", \"timestamp\": \"2017-08-01T14:53:08\", \"user\": \"barryshap\"}, (...)\n ]\n\n3.1 Opening the output file\n---------------------------\n\nIn order to correctly handle all possible characters in the tweets\n(think of Japanese or Arabic characters), the output is saved as utf-8\nencoded bytes. That is why you could see text like\n\"\\u30b1 \\u30f3 \\u3055 \\u307e \\u30fe ...\" in the output file.\n\nWhat you should do is open the file with the proper encoding:\n\n.. 
figure:: https://user-images.githubusercontent.com/4409108/30702318-f05bc196-9eec-11e7-8234-a07aabec294f.PNG\n\n Example of output with Japanese characters\n\n3.1.2 Opening into a pandas dataframe\n-------------------------------------\n\nAfter the file has been opened, it can easily be converted into a pandas DataFrame\n\n:: \n\n import pandas as pd\n df = pd.read_json('tweets.json', encoding='utf-8')\n\n\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/sriramkumar1996/twitterscraper", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "sriram-twitter-scraper", "package_url": "https://pypi.org/project/sriram-twitter-scraper/", "platform": "", "project_url": "https://pypi.org/project/sriram-twitter-scraper/", "project_urls": { "Homepage": "https://github.com/sriramkumar1996/twitterscraper" }, "release_url": "https://pypi.org/project/sriram-twitter-scraper/1.3.2/", "requires_dist": [ "coala-utils (~=0.5.0)", "bs4", "lxml", "requests", "billiard" ], "requires_python": "", "summary": "Tool for scraping Tweets", "version": "1.3.2" }, "last_serial": 5942426, "releases": { "1.3.1": [ { "comment_text": "", "digests": { "md5": "f2421d2091fe32ed9282dfa4c2d530f3", "sha256": "95f6af5927bfdfeab522233c6f484f2e4e5392d6d7f837f35ff82d71e1647819" }, "downloads": -1, "filename": "sriram_twitter_scraper-1.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "f2421d2091fe32ed9282dfa4c2d530f3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 15506, "upload_time": "2019-10-08T01:48:53", "url": "https://files.pythonhosted.org/packages/b9/4e/e2466db3f284eb0f9c1b88724c6d7ef4a8929cf747caca99d19e743b3b7a/sriram_twitter_scraper-1.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "37f5a8d137e17d9a8cc299fb272b3f0c", "sha256": 
"7b1f4e32e4548576e113bf044923fe7252addb259d190977875a85c6eeedcc11" }, "downloads": -1, "filename": "sriram-twitter-scraper-1.3.1.tar.gz", "has_sig": false, "md5_digest": "37f5a8d137e17d9a8cc299fb272b3f0c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13974, "upload_time": "2019-10-08T01:48:58", "url": "https://files.pythonhosted.org/packages/f3/a0/36716f8329d248b6ccff013d81147cf3590436c2d20fcc453aa4754be352/sriram-twitter-scraper-1.3.1.tar.gz" } ], "1.3.2": [ { "comment_text": "", "digests": { "md5": "267379c3854418fb78ab46d0f856d192", "sha256": "8a56e30b090d676ed1f269f197cbdb4c06c57a94d792de15efbf5c1b70f3847d" }, "downloads": -1, "filename": "sriram_twitter_scraper-1.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "267379c3854418fb78ab46d0f856d192", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 15509, "upload_time": "2019-10-08T01:48:56", "url": "https://files.pythonhosted.org/packages/07/8d/fd699c04a5a4494c3fe26a1508db9868798e404766b398bb08a2923e9f31/sriram_twitter_scraper-1.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9515e70ca09961541d1563d58fecf41d", "sha256": "847e73e94380009410ddd6987de7d472da6bd983a58052b82e2b93b17753b013" }, "downloads": -1, "filename": "sriram-twitter-scraper-1.3.2.tar.gz", "has_sig": false, "md5_digest": "9515e70ca09961541d1563d58fecf41d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13974, "upload_time": "2019-10-08T01:49:00", "url": "https://files.pythonhosted.org/packages/1b/3c/adfef7de73cd947d8c5e55e55d33dbf332685451b7bfd071618ea90ab6bb/sriram-twitter-scraper-1.3.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "267379c3854418fb78ab46d0f856d192", "sha256": "8a56e30b090d676ed1f269f197cbdb4c06c57a94d792de15efbf5c1b70f3847d" }, "downloads": -1, "filename": "sriram_twitter_scraper-1.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": 
"267379c3854418fb78ab46d0f856d192", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 15509, "upload_time": "2019-10-08T01:48:56", "url": "https://files.pythonhosted.org/packages/07/8d/fd699c04a5a4494c3fe26a1508db9868798e404766b398bb08a2923e9f31/sriram_twitter_scraper-1.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9515e70ca09961541d1563d58fecf41d", "sha256": "847e73e94380009410ddd6987de7d472da6bd983a58052b82e2b93b17753b013" }, "downloads": -1, "filename": "sriram-twitter-scraper-1.3.2.tar.gz", "has_sig": false, "md5_digest": "9515e70ca09961541d1563d58fecf41d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13974, "upload_time": "2019-10-08T01:49:00", "url": "https://files.pythonhosted.org/packages/1b/3c/adfef7de73cd947d8c5e55e55d33dbf332685451b7bfd071618ea90ab6bb/sriram-twitter-scraper-1.3.2.tar.gz" } ] }