{ "info": { "author": "Dillon Mabry", "author_email": "rapid.dev.solutions@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# Youtube Sentiment Helper\n[![Build Status](https://travis-ci.org/dillonmabry/youtube-sentiment-helper.svg?branch=master)](https://travis-ci.org/dillonmabry/youtube-sentiment-helper)\n[![Python 3.4](https://img.shields.io/badge/python-3.4-blue.svg)](https://www.python.org/downloads/release/python-340/)\n[![Python 3.5](https://img.shields.io/badge/python-3.5-blue.svg)](https://www.python.org/downloads/release/python-350/)\n[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nDetermine sentiment of Youtube video per comment based analysis using Sci-kit by analyzing video comments based on positive/negative sentiment. \nHelper tool to make requests to a machine learning model in order to determine sentiment using the Youtube API.\n\n## Install Instructions\n```\npip install .\n```\nor PyPI (https://pypi.org/project/youtube-sentiment/)\n```\npip install youtube-sentiment\n```\n## How to Use\nCurrent usage:\n```\nimport youtube_sentiment as yt\nyt.video_summary(, , , ) \n```\nor\n```\npython main.py \n```\nChoices for model selection are found under the included models for setup also under project path `./models`\n## Tests\n```\npython setup.py test\n```\n## To-Do\n- [X] Create API to use Youtube API V3 via REST to get comments for videos\n- [X] Create initial Python package\n- [X] Analyze existing sentiment analysis models to select and use\n- [X] Improve/enhance existing sentiment learning model\n- [ ] Create deep model for sentiment\n- [X] Utilize sentiment analysis to analyze Youtube video and provide analytics\n- [X] Finalize Python package for project\n- [ ] Fix any new bugs\n- [ ] Create web based portal\n\n## Models Available\n - lr_sentiment_basic (Basic Vectorizer/Logistic Regression model, 2 MB)\n - lr_sentiment_cv (Hypertuned TFIDF/Logistic Regression model with clean dataset, 60 MB)\n - *To-be-added* cnn_sentiment (Convolutional Neural Net model)\n - *To-be-added* cnn_sentiment (LTSM Neural Net model)\n\n## Traditional ML Model Creation\n\n*Why use Twitter sentiment as training?*\n\nTwitter comments/replies/tweets are the closest existing training set to Youtube comments that are the simplest to setup. A deep autoencoder could be used to generate comments for a larger dataset (over 100k) with Youtube-esque comments but then the reliability of classifying the data would be very tricky.\n\n**TLDR: It is the simplest and most effective to bootstrap for a traditional model**\n\n```python\n# Develop sentiment analysis classifier using traditional ML models\n# Pipeline modeling using the following guide: \n# https://ryan-cranfill.github.io/sentiment-pipeline-sklearn-1/\n# Data processing and cleaning guide:\n# https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90\n\n# Imports\nimport numpy as np\nimport time\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport re\nfrom bs4 import BeautifulSoup\nimport nltk\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score, log_loss, confusion_matrix, auc, roc_curve\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import FeatureUnion, Pipeline\nfrom sklearn.externals import joblib\nfrom sklearn.model_selection import train_test_split\n```\n\n\n```python\n# Dataset of 1.6m Twitter tweets\ncolumns = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']\ntrain = pd.read_csv('stanford_twitter_train.csv', encoding='latin-1', header=None, names=columns)\ntest = pd.read_csv('stanford_twitter_test.csv', encoding='latin-1', header=None, names=columns)\n```\n\n\n```python\n## Local helpers\n\n# AUC visualization\ndef show_roc(model, test, test_labels):\n # Predict\n probs = model.predict_proba(test)\n preds = probs[:,1]\n fpr, tpr, threshold = roc_curve(test_labels, preds)\n roc_auc = auc(fpr, tpr)\n # Chart\n plt.title('Receiver Operating Characteristic')\n plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)\n plt.legend(loc = 'lower right')\n plt.plot([0, 1], [0, 1],'r--')\n plt.xlim([0, 1])\n plt.ylim([0, 1])\n plt.ylabel('True Positive Rate')\n plt.xlabel('False Positive Rate')\n plt.show()\n\n# Tweet cleanser\ntok = nltk.tokenize.WordPunctTokenizer()\npat1 = r'@[A-Za-z0-9_]+'\npat2 = r'https?://[^ ]+'\ncombined_pat = r'|'.join((pat1, pat2))\nwww_pat = r'www.[^ ]+'\nnegations_dic = {\"isn't\":\"is not\", \"aren't\":\"are not\", \"wasn't\":\"was not\", \"weren't\":\"were not\",\n \"haven't\":\"have not\",\"hasn't\":\"has not\",\"hadn't\":\"had not\",\"won't\":\"will not\",\n \"wouldn't\":\"would not\", \"don't\":\"do not\", \"doesn't\":\"does not\",\"didn't\":\"did not\",\n \"can't\":\"can not\",\"couldn't\":\"could not\",\"shouldn't\":\"should not\",\"mightn't\":\"might not\",\n \"mustn't\":\"must not\"}\nneg_pattern = re.compile(r'\\b(' + '|'.join(negations_dic.keys()) + r')\\b')\ndef clean_tweet(text):\n soup = BeautifulSoup(text, 'lxml')\n souped = soup.get_text()\n try:\n bom_removed = souped.decode(\"utf-8-sig\").replace(u\"\\ufffd\", \"?\")\n except:\n bom_removed = souped\n stripped = re.sub(combined_pat, '', bom_removed)\n stripped = re.sub(www_pat, '', stripped)\n lower_case = stripped.lower()\n neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)\n letters_only = re.sub(\"[^a-zA-Z]\", \" \", neg_handled)\n # During the letters_only process two lines above, it has created unnecessay white spaces,\n # I will tokenize and join together to remove unneccessary white spaces\n words = [x for x in tok.tokenize(letters_only) if len(x) > 1]\n return (\" \".join(words)).strip()\n```\n\n\n```python\n# Data cleaning\ncleaned_tweets = []\nfor tweet in train['text']: \n cleaned_tweets.append(clean_tweet(tweet))\ncleaned_df = pd.DataFrame(cleaned_tweets, columns=['text'])\ncleaned_df['target'] = train.sentiment\ncleaned_df.target[cleaned_df.target == 4] = 1 # rename 4 to 1 as positive label\ncleaned_df = cleaned_df[cleaned_df.target != 2] # remove neutral labels\ncleaned_df = cleaned_df.dropna() # drop null records\ncleaned_df.to_csv('stanford_clean_twitter_train.csv',encoding='utf-8')\n```\n\n```python\n# Starting point from import\ncsv = 'stanford_clean_twitter_train.csv'\ndf = pd.read_csv(csv,index_col=0)\n```\n\n```python\n# Random shuffle and ensure no null records\ndf = df.sample(frac=1).reset_index(drop=True)\ndf = df.dropna() # drop null records\n```\n\n\n```python\nX, y = df.text[0:200000], df.target[0:200000] # Max data size 200k for memory purposes\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.10)\n```\n\n\n```python\n# Dataset shapes post-split\nprint(np.shape(X_train))\nprint(np.shape(X_test))\nprint(np.unique(y_train))\n```\n\n (180000,)\n (20000,)\n [0 1]\n \n\n\n```python\n# NLTK Twitter tokenizer best used for short comment-type text sets\nimport nltk\ntokenizer = nltk.casual.TweetTokenizer(preserve_case=False)\n```\n\n\n```python\n# Hyperparameter tuning (Simple model)\n#cvect = CountVectorizer(tokenizer=tokenizer.tokenize)\ntfidf = TfidfVectorizer()\nclf = LogisticRegression()\n\npipeline = Pipeline([\n ('tfidf', tfidf),\n ('clf', clf)\n ])\n\nparameters = {\n 'tfidf__ngram_range': [(1,1), (1,2), (1,3)], # ngram range of tokenizer\n 'tfidf__norm': ['l1', 'l2', None], # term vector normalization\n 'tfidf__max_df': [0.25, 0.5, 1.0], # maximum document frequency for the CountVectorizer\n 'clf__C': np.logspace(-2, 0, 3) # C value for the LogisticRegression\n}\n\ngrid = GridSearchCV(pipeline, parameters, cv=3, verbose=1)\nprint(\"Performing grid search...\")\nprint(\"pipeline:\", [name for name, _ in pipeline.steps])\nt0 = time.time()\ngrid.fit(X_train, y_train)\nprint(\"done in %0.3fs\" % (time.time() - t0))\nprint()\n\nprint(\"Best score: %0.3f\" % grid.best_score_)\nprint(\"Best parameters set:\")\nbest_parameters = grid.best_estimator_.get_params()\nfor param_name in sorted(parameters.keys()):\n print(\"\\t%s: %r\" % (param_name, best_parameters[param_name]))\n```\n\n Performing grid search...\n pipeline: ['tfidf', 'clf']\n Fitting 3 folds for each of 81 candidates, totalling 243 fits\n \n\n [Parallel(n_jobs=1)]: Done 243 out of 243 | elapsed: 52.7min finished\n \n\n done in 3186.295s\n \n Best score: 0.803\n Best parameters set:\n \tclf__C: 0.01\n \ttfidf__max_df: 0.25\n \ttfidf__ngram_range: (1, 3)\n \ttfidf__norm: None\n \n\n\n```python\n# Dump model from grid search cv\njoblib.dump(grid.best_estimator_, 'lr_sentiment_cv.pkl', compress=1)\n```\n\n\n\n\n ['lr_sentiment_cv.pkl']\n\n\n\n\n```python\n# Starting point 2: Post-model load comparison\nlra = joblib.load('./Models/Stanford_Twitter_Models/lr_sentiment_cv.pkl') \nlrb = joblib.load('./Models/Twitter_Simple_Models/lr_sentiment_basic.pkl') \n```\n\n\n```python\n# Model performance indicators for basic model\ny_pred_basic = lrb.predict(X_test)\nprint(confusion_matrix(y_test, y_pred_basic))\nshow_roc(lrb, X_test, y_test) # AUC\n```\n\n [[7562 2347]\n [2181 7910]]\n \n\n\n![basic_auc](https://user-images.githubusercontent.com/10522556/47269973-06dd1280-d533-11e8-8686-284702733082.png)\n\n\n\n```python\n# Model performance indicators for hypertuned model\ny_pred_hyper = lra.predict(X_test)\nprint(confusion_matrix(y_test, y_pred_hyper))\nshow_roc(lra, X_test, y_test) # AUC\n```\n\n [[7861 2048]\n [1863 8228]]\n \n\n\n![cv_auc](https://user-images.githubusercontent.com/10522556/47269972-06dd1280-d533-11e8-99d6-a2b211f73185.png)\n\n\n\n```python\nprint(lrb.predict([\"terrible idea why was this even made\"]))\nprint(lrb.predict([\"that was the best movie ever\"]))\n```\n\n [0]\n [1]", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dillonmabry/youtube-sentiment-helper", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "youtube-sentiment", "package_url": "https://pypi.org/project/youtube-sentiment/", "platform": "", "project_url": "https://pypi.org/project/youtube-sentiment/", "project_urls": { "Homepage": "https://github.com/dillonmabry/youtube-sentiment-helper" }, "release_url": "https://pypi.org/project/youtube-sentiment/0.3.0/", "requires_dist": null, "requires_python": "", "summary": "Analyze Youtube videos for general sentiment analysis", "version": "0.3.0" }, "last_serial": 4459745, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "ec700534dd80f37734bd509c61edc2f2", "sha256": "b2daa5dee42d20da305fe2e75d10144ded63c447f6c97b71ef2f3ef95612de84" }, "downloads": -1, "filename": "youtube_sentiment-0.1.tar.gz", "has_sig": false, "md5_digest": "ec700534dd80f37734bd509c61edc2f2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58791331, "upload_time": "2018-10-30T02:49:16", "url": "https://files.pythonhosted.org/packages/98/86/1d7e316e10f7eff63645080420c54251aa53dd3151f129f00f286aaa724f/youtube_sentiment-0.1.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "82689fa8e081b37a15677f30fca9c37e", "sha256": "37650c118cc808945e90445620c6dc28bbf491f748d0615aa8be3772de1a843f" }, "downloads": -1, "filename": "youtube_sentiment-0.2.1.tar.gz", "has_sig": false, "md5_digest": "82689fa8e081b37a15677f30fca9c37e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58796955, "upload_time": "2018-10-30T23:42:58", "url": "https://files.pythonhosted.org/packages/0b/b5/e51426b729729ff7a9b131244dd1b7e086410f7877ad66a9364f2036facf/youtube_sentiment-0.2.1.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "356986d8de0f3fb95666fecfaffa1536", "sha256": "428475343dcfb115ba9d1417bb1955f37fd5f3ca6c22cfd06eec25fe2f298ba0" }, "downloads": -1, "filename": "youtube_sentiment-0.3.0.tar.gz", "has_sig": false, "md5_digest": "356986d8de0f3fb95666fecfaffa1536", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58797200, "upload_time": "2018-11-07T02:19:26", "url": "https://files.pythonhosted.org/packages/bf/01/52bfd759ae90a800ad538f6f8b0d43da4b1d8d6dc70de04fabbcfe69e68d/youtube_sentiment-0.3.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "356986d8de0f3fb95666fecfaffa1536", "sha256": "428475343dcfb115ba9d1417bb1955f37fd5f3ca6c22cfd06eec25fe2f298ba0" }, "downloads": -1, "filename": "youtube_sentiment-0.3.0.tar.gz", "has_sig": false, "md5_digest": "356986d8de0f3fb95666fecfaffa1536", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58797200, "upload_time": "2018-11-07T02:19:26", "url": "https://files.pythonhosted.org/packages/bf/01/52bfd759ae90a800ad538f6f8b0d43da4b1d8d6dc70de04fabbcfe69e68d/youtube_sentiment-0.3.0.tar.gz" } ] }