{
    "info": {
        "author": "Dillon Mabry",
        "author_email": "rapid.dev.solutions@gmail.com",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "# Youtube Sentiment Helper\n[![Build Status](https://travis-ci.org/dillonmabry/youtube-sentiment-helper.svg?branch=master)](https://travis-ci.org/dillonmabry/youtube-sentiment-helper)\n[![Python 3.4](https://img.shields.io/badge/python-3.4-blue.svg)](https://www.python.org/downloads/release/python-340/)\n[![Python 3.5](https://img.shields.io/badge/python-3.5-blue.svg)](https://www.python.org/downloads/release/python-350/)\n[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nDetermine the sentiment of a Youtube video by classifying each of its comments as positive or negative with a scikit-learn model. \nHelper tool that fetches comments through the Youtube API and sends them to a machine learning model to determine sentiment.\n\n## Install Instructions\n```\npip install .\n```\nor from PyPI (https://pypi.org/project/youtube-sentiment/)\n```\npip install youtube-sentiment\n```\n## How to Use\nCurrent usage:\n```\nimport youtube_sentiment as yt\nyt.video_summary(<Youtube API Key>, <Youtube Video ID>, <Max Pages of Comments>, <Sentiment Model>) \n```\nor\n```\npython main.py <Youtube API Key> <Youtube Video ID> <Max Pages of Comments> <Sentiment Model>\n```\nChoices for model selection are the included models, which are also found under the project path `./models`\n## Tests\n```\npython setup.py test\n```\n## To-Do\n- [X] Create API to use Youtube API V3 via REST to get comments for videos\n- [X] Create initial Python package\n- [X] Analyze existing sentiment analysis models to select and use\n- [X] Improve/enhance existing sentiment learning model\n- [ ] Create deep model for sentiment\n- [X] Utilize sentiment analysis to analyze Youtube video and provide analytics\n- [X] Finalize Python package for project\n- [ ] Fix any new bugs\n- [ ] Create web based portal\n\n## 
Models Available\n - lr_sentiment_basic (Basic Vectorizer/Logistic Regression model, 2 MB)\n - lr_sentiment_cv (Hypertuned TFIDF/Logistic Regression model with clean dataset, 60 MB)\n - *To-be-added* cnn_sentiment (Convolutional Neural Net model)\n - *To-be-added* cnn_sentiment (LSTM Neural Net model)\n\n## Traditional ML Model Creation\n\n*Why use Twitter sentiment as training data?*\n\nTwitter comments, replies, and tweets are the closest existing training set to Youtube comments and are the simplest to set up. A deep autoencoder could be used to generate Youtube-esque comments for a larger dataset (over 100k), but reliably classifying the generated data would be very tricky.\n\n**TLDR: It is the simplest and most effective dataset for bootstrapping a traditional model**\n\n```python\n# Develop sentiment analysis classifier using traditional ML models\n# Pipeline modeling using the following guide: \n# https://ryan-cranfill.github.io/sentiment-pipeline-sklearn-1/\n# Data processing and cleaning guide:\n# https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90\n\n# Imports\nimport numpy as np\nimport time\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport re\nfrom bs4 import BeautifulSoup\nimport nltk\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score, log_loss, confusion_matrix, auc, roc_curve\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import FeatureUnion, Pipeline\nimport joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23\nfrom sklearn.model_selection import train_test_split\n```\n\n\n```python\n# Dataset of 1.6m Twitter tweets\ncolumns = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']\ntrain = pd.read_csv('stanford_twitter_train.csv', encoding='latin-1', header=None, names=columns)\ntest = pd.read_csv('stanford_twitter_test.csv', encoding='latin-1', header=None, names=columns)\n```\n\n\n```python\n## 
Local helpers\n\n# AUC visualization\ndef show_roc(model, test, test_labels):\n    # Predict\n    probs = model.predict_proba(test)\n    preds = probs[:,1]\n    fpr, tpr, threshold = roc_curve(test_labels, preds)\n    roc_auc = auc(fpr, tpr)\n    # Chart\n    plt.title('Receiver Operating Characteristic')\n    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)\n    plt.legend(loc = 'lower right')\n    plt.plot([0, 1], [0, 1],'r--')\n    plt.xlim([0, 1])\n    plt.ylim([0, 1])\n    plt.ylabel('True Positive Rate')\n    plt.xlabel('False Positive Rate')\n    plt.show()\n\n# Tweet cleanser\ntok = nltk.tokenize.WordPunctTokenizer()\npat1 = r'@[A-Za-z0-9_]+'\npat2 = r'https?://[^ ]+'\ncombined_pat = r'|'.join((pat1, pat2))\nwww_pat = r'www\\.[^ ]+'  # escape the dot so it matches a literal '.'\nnegations_dic = {\"isn't\":\"is not\", \"aren't\":\"are not\", \"wasn't\":\"was not\", \"weren't\":\"were not\",\n                \"haven't\":\"have not\",\"hasn't\":\"has not\",\"hadn't\":\"had not\",\"won't\":\"will not\",\n                \"wouldn't\":\"would not\", \"don't\":\"do not\", \"doesn't\":\"does not\",\"didn't\":\"did not\",\n                \"can't\":\"can not\",\"couldn't\":\"could not\",\"shouldn't\":\"should not\",\"mightn't\":\"might not\",\n                \"mustn't\":\"must not\"}\nneg_pattern = re.compile(r'\\b(' + '|'.join(negations_dic.keys()) + r')\\b')\ndef clean_tweet(text):\n    soup = BeautifulSoup(text, 'lxml')\n    souped = soup.get_text()\n    try:\n        bom_removed = souped.decode(\"utf-8-sig\").replace(u\"\\ufffd\", \"?\")\n    except AttributeError:  # Python 3 str has no decode(); keep the text unchanged\n        bom_removed = souped\n    stripped = re.sub(combined_pat, '', bom_removed)\n    stripped = re.sub(www_pat, '', stripped)\n    lower_case = stripped.lower()\n    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)\n    letters_only = re.sub(\"[^a-zA-Z]\", \" \", neg_handled)\n    # The letters_only substitution above creates unnecessary white spaces,\n    # so tokenize and join the words back together to remove 
unnecessary white spaces\n    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]\n    return (\" \".join(words)).strip()\n```\n\n\n```python\n# Data cleaning\ncleaned_tweets = []\nfor tweet in train['text']:\n    cleaned_tweets.append(clean_tweet(tweet))\ncleaned_df = pd.DataFrame(cleaned_tweets, columns=['text'])\ncleaned_df['target'] = train.sentiment\ncleaned_df.loc[cleaned_df.target == 4, 'target'] = 1 # rename 4 to 1 as positive label\ncleaned_df = cleaned_df[cleaned_df.target != 2] # remove neutral labels\ncleaned_df = cleaned_df.dropna() # drop null records\ncleaned_df.to_csv('stanford_clean_twitter_train.csv',encoding='utf-8')\n```\n\n```python\n# Starting point from import\ncsv = 'stanford_clean_twitter_train.csv'\ndf = pd.read_csv(csv,index_col=0)\n```\n\n```python\n# Random shuffle and ensure no null records\ndf = df.sample(frac=1).reset_index(drop=True)\ndf = df.dropna() # drop null records\n```\n\n\n```python\nX, y = df.text[0:200000], df.target[0:200000] # Max data size 200k for memory purposes\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.10)\n```\n\n\n```python\n# Dataset shapes post-split\nprint(np.shape(X_train))\nprint(np.shape(X_test))\nprint(np.unique(y_train))\n```\n\n    (180000,)\n    (20000,)\n    [0 1]\n    \n\n\n```python\n# NLTK Twitter tokenizer best used for short comment-type text sets\nimport nltk\ntokenizer = nltk.casual.TweetTokenizer(preserve_case=False)\n```\n\n\n```python\n# Hyperparameter tuning (Simple model)\n#cvect = CountVectorizer(tokenizer=tokenizer.tokenize)\ntfidf = TfidfVectorizer()\nclf = LogisticRegression()\n\npipeline = Pipeline([\n        ('tfidf', tfidf),\n        ('clf', clf)\n    ])\n\nparameters = {\n    'tfidf__ngram_range': [(1,1), (1,2), (1,3)], # ngram range of tokenizer\n    'tfidf__norm': ['l1', 'l2', None], # term vector normalization\n    'tfidf__max_df': [0.25, 0.5, 1.0], # maximum 
document frequency for the TfidfVectorizer\n    'clf__C': np.logspace(-2, 0, 3) # C value for the LogisticRegression\n}\n\ngrid = GridSearchCV(pipeline, parameters, cv=3, verbose=1)\nprint(\"Performing grid search...\")\nprint(\"pipeline:\", [name for name, _ in pipeline.steps])\nt0 = time.time()\ngrid.fit(X_train, y_train)\nprint(\"done in %0.3fs\" % (time.time() - t0))\nprint()\n\nprint(\"Best score: %0.3f\" % grid.best_score_)\nprint(\"Best parameters set:\")\nbest_parameters = grid.best_estimator_.get_params()\nfor param_name in sorted(parameters.keys()):\n    print(\"\\t%s: %r\" % (param_name, best_parameters[param_name]))\n```\n\n    Performing grid search...\n    pipeline: ['tfidf', 'clf']\n    Fitting 3 folds for each of 81 candidates, totalling 243 fits\n    \n\n    [Parallel(n_jobs=1)]: Done 243 out of 243 | elapsed: 52.7min finished\n    \n\n    done in 3186.295s\n    \n    Best score: 0.803\n    Best parameters set:\n    \tclf__C: 0.01\n    \ttfidf__max_df: 0.25\n    \ttfidf__ngram_range: (1, 3)\n    \ttfidf__norm: None\n    \n\n\n```python\n# Dump model from grid search cv\njoblib.dump(grid.best_estimator_, 'lr_sentiment_cv.pkl', compress=1)\n```\n\n\n\n\n    ['lr_sentiment_cv.pkl']\n\n\n\n\n```python\n# Starting point 2: Post-model load comparison\nlra = joblib.load('./Models/Stanford_Twitter_Models/lr_sentiment_cv.pkl') \nlrb = joblib.load('./Models/Twitter_Simple_Models/lr_sentiment_basic.pkl') \n```\n\n\n```python\n# Model performance indicators for basic model\ny_pred_basic = lrb.predict(X_test)\nprint(confusion_matrix(y_test, y_pred_basic))\nshow_roc(lrb, X_test, y_test) # AUC\n```\n\n    [[7562 2347]\n     [2181 7910]]\n    \n\n\n![basic_auc](https://user-images.githubusercontent.com/10522556/47269973-06dd1280-d533-11e8-8686-284702733082.png)\n\n\n\n```python\n# Model performance indicators for hypertuned model\ny_pred_hyper = lra.predict(X_test)\nprint(confusion_matrix(y_test, y_pred_hyper))\nshow_roc(lra, X_test, y_test) # AUC\n```\n\n    
[[7861 2048]\n     [1863 8228]]\n    \n\n\n![cv_auc](https://user-images.githubusercontent.com/10522556/47269972-06dd1280-d533-11e8-99d6-a2b211f73185.png)\n\n\n\n```python\nprint(lrb.predict([\"terrible idea why was this even made\"]))\nprint(lrb.predict([\"that was the best movie ever\"]))\n```\n\n    [0]\n    [1]",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/dillonmabry/youtube-sentiment-helper",
        "keywords": "",
        "license": "MIT",
        "maintainer": "",
        "maintainer_email": "",
        "name": "youtube-sentiment",
        "package_url": "https://pypi.org/project/youtube-sentiment/",
        "platform": "",
        "project_url": "https://pypi.org/project/youtube-sentiment/",
        "project_urls": {
            "Homepage": "https://github.com/dillonmabry/youtube-sentiment-helper"
        },
        "release_url": "https://pypi.org/project/youtube-sentiment/0.3.0/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "Analyze Youtube videos for general sentiment analysis",
        "version": "0.3.0"
    },
    "last_serial": 4459745,
    "releases": {
        "0.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "ec700534dd80f37734bd509c61edc2f2",
                    "sha256": "b2daa5dee42d20da305fe2e75d10144ded63c447f6c97b71ef2f3ef95612de84"
                },
                "downloads": -1,
                "filename": "youtube_sentiment-0.1.tar.gz",
                "has_sig": false,
                "md5_digest": "ec700534dd80f37734bd509c61edc2f2",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 58791331,
                "upload_time": "2018-10-30T02:49:16",
                "url": "https://files.pythonhosted.org/packages/98/86/1d7e316e10f7eff63645080420c54251aa53dd3151f129f00f286aaa724f/youtube_sentiment-0.1.tar.gz"
            }
        ],
        "0.2.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "82689fa8e081b37a15677f30fca9c37e",
                    "sha256": "37650c118cc808945e90445620c6dc28bbf491f748d0615aa8be3772de1a843f"
                },
                "downloads": -1,
                "filename": "youtube_sentiment-0.2.1.tar.gz",
                "has_sig": false,
                "md5_digest": "82689fa8e081b37a15677f30fca9c37e",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 58796955,
                "upload_time": "2018-10-30T23:42:58",
                "url": "https://files.pythonhosted.org/packages/0b/b5/e51426b729729ff7a9b131244dd1b7e086410f7877ad66a9364f2036facf/youtube_sentiment-0.2.1.tar.gz"
            }
        ],
        "0.3.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "356986d8de0f3fb95666fecfaffa1536",
                    "sha256": "428475343dcfb115ba9d1417bb1955f37fd5f3ca6c22cfd06eec25fe2f298ba0"
                },
                "downloads": -1,
                "filename": "youtube_sentiment-0.3.0.tar.gz",
                "has_sig": false,
                "md5_digest": "356986d8de0f3fb95666fecfaffa1536",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 58797200,
                "upload_time": "2018-11-07T02:19:26",
                "url": "https://files.pythonhosted.org/packages/bf/01/52bfd759ae90a800ad538f6f8b0d43da4b1d8d6dc70de04fabbcfe69e68d/youtube_sentiment-0.3.0.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "356986d8de0f3fb95666fecfaffa1536",
                "sha256": "428475343dcfb115ba9d1417bb1955f37fd5f3ca6c22cfd06eec25fe2f298ba0"
            },
            "downloads": -1,
            "filename": "youtube_sentiment-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "356986d8de0f3fb95666fecfaffa1536",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 58797200,
            "upload_time": "2018-11-07T02:19:26",
            "url": "https://files.pythonhosted.org/packages/bf/01/52bfd759ae90a800ad538f6f8b0d43da4b1d8d6dc70de04fabbcfe69e68d/youtube_sentiment-0.3.0.tar.gz"
        }
    ]
}