In this blog post, I try to analyse textual data and derive its sentiment in R.
This is part of my series documenting my small experiments with R and solving data-analysis problems. These experiments might be redundant and may already have been written and blogged about by various people, but this is more of a personal diary and a personal learning process. And if, in the process, anyone gets inspired or learns something new, that's the best thing that could happen. If someone more knowledgeable than me stumbles upon this blog and thinks there is a much better way to do things, or that I have erred somewhere, please feel free to share feedback and help everyone grow.

Link to the GitHub repository: Twitter-Sentiment-IPhoneXSMAX
Recently Apple launched its new flagship iPhone models, the iPhone Xs and iPhone Xs Max. I was really impressed by the new Apple Watch but not so much by the new phones (please don't hate me). I was wondering what other people must be thinking about the new phone, so I decided to try a sentiment analysis to see how people are reacting to the new iPhone launch.
Obviously, the best place to get what people are saying in textual form is Twitter. So I decided to analyse 15,000 random tweets posted between 10th September 2018 and 24th September 2018 to try and gauge the sentiment of people towards the new iPhone launch based on their tweets.
Natural language text as data is a very difficult animal to tackle.
To sidestep most of these issues, today I'll restrict myself to English-language tweets only and assume the spellings are correct.
For the natural language processing, I'm going to use an amazing package called *syuzhet*, developed by Matthew Jockers. This library allows us to assess the sentiment of each word or sentence (in this case, our individual tweets). For the other text-mining functions, I will use another amazing package, *tm*, which gives us a lot of different text-mining functions to work with.
The basic steps of text mining, the way I understand them, are these:
Extract data
Cleaning the data
Create the corpus of the entire collection
Remove all the punctuations, stop words & Stem the document
Create the DTM — Document Term Matrix
Check for Sparsity & Remove Sparse Terms
Convert to DataFrame
Get the Sentiment Value
At this stage, we can also apply machine learning methods.
1. Extract the Data from Twitter:
Here I'm using the library twitteR and the function setup_twitter_oauth, which takes the following parameters: consumer_key, consumer_secret, access_token and access_secret.
To create your own consumer_key, consumer_secret, access_token and access_secret, go to http://apps.twitter.com or just google it; it's pretty easy. I've edited out my personal credentials for obvious reasons.
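The authentication step might look like the sketch below; the placeholder strings are of course hypothetical and must be replaced with your own credentials:

```r
# Load the package and authenticate with the Twitter API.
# The values below are placeholders; substitute your own credentials.
library(twitteR)

consumer_key    <- "XXXXXXXXXXXXXXXX"
consumer_secret <- "XXXXXXXXXXXXXXXX"
access_token    <- "XXXXXXXXXXXXXXXX"
access_secret   <- "XXXXXXXXXXXXXXXX"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
```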

Once the authentication is set, we are ready to extract the tweets. Extracting tweets is very simple: you call the function searchTwitter and supply the requisite parameters.
For this blog I extracted 15,000 tweets containing the string "iPhoneXSMAX", posted between 10th September and 24th September 2018. I only extracted tweets that were in the English language and converted the extracted tweets into a data frame.
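A sketch of that call, assuming the authenticated session from the previous step (the variable names `tweets` and `tweets_df` are my own):

```r
# Search for up to 15,000 English tweets mentioning iPhoneXSMAX
# posted within the two-week window, then flatten to a data frame.
tweets <- searchTwitter("iPhoneXSMAX",
                        n     = 15000,
                        lang  = "en",
                        since = "2018-09-10",
                        until = "2018-09-24")
tweets_df <- twListToDF(tweets)
```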


2. Cleaning the Data:
As you can see above, the extracted data has a lot of columns which are not needed in today's experiment, so I drop all the unwanted columns, clean the tweets of special characters and convert the text to UTF-8 encoding.
Also, a lot of the tweets are duplicates, as the same tweets are repeated multiple times by various bots. We will delete the duplicate tweets as well. Please note these are not retweets.
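The cleaning and deduplication could be sketched like this, assuming the data frame is called `tweets_df` (the exact regular expression for special characters is my own choice, not necessarily the one used originally):

```r
# Keep only the text column, fix the encoding, strip special
# characters, then drop exact duplicates (bot reposts, not retweets).
tweets_df <- tweets_df[, "text", drop = FALSE]
tweets_df$text <- iconv(tweets_df$text, to = "UTF-8", sub = "")
tweets_df$text <- gsub("[^[:alnum:][:space:]#@']", " ", tweets_df$text)
tweets_df <- tweets_df[!duplicated(tweets_df$text), , drop = FALSE]
nrow(tweets_df)  # number of unique tweets remaining
```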



After deduplication we are down to 7,460 tweets.
3. Create the corpus of the entire collection
4. Remove all the punctuations, stop words & Stem the document
The next step is to build the corpus from the collection of strings (tweet texts) that we have, and to further clean the data by removing punctuation, stop words and extra white space, and then stemming the document.
Stemming the document means reducing each word to its root. The tm package provides the stemDocument() function to get to a word's root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.
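Put together, steps 3 and 4 might look like this, assuming the cleaned tweets sit in `tweets_df$text` (note that stemDocument() relies on the SnowballC package being installed):

```r
# Build a corpus from the tweet texts, then lower-case, remove
# punctuation and stop words, collapse whitespace, and stem.
library(tm)
library(SnowballC)  # stemDocument() uses the Snowball stemmer

corpus <- VCorpus(VectorSource(tweets_df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
```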


5. Create the DTM — Document Term Matrix
As per Wikipedia: A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
A DTM, in other words, arranges the text so that each tweet becomes a row and each term a column, with the cells holding term counts. It is extremely important to be able to convert the text into DTM form; otherwise we would not be able to do any analysis or processing of the text.
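Building the DTM from the corpus of the previous step is a one-liner in tm:

```r
# One row per tweet, one column per term, cells hold term counts.
dtm <- DocumentTermMatrix(corpus)
dtm                      # prints dimensions and overall sparsity
inspect(dtm[1:5, 1:5])   # peek at the first few documents and terms
```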

6. Check for Sparsity & Remove Sparse Terms
7. Convert to DataFrame
8. Get the Sentiment Value
Sparsity here refers to a threshold on a term's relative document frequency, expressed as a proportion. As the help page for removeSparseTerms() states (although not very clearly), the closer the sparse argument is to 1.0, the more terms are retained: a term is removed if the proportion of documents it is absent from exceeds the threshold. (Note that the argument cannot take the values 0 or 1.0, only values in between.)
In effect, we remove terms that appear only rarely across the tweets. We then create the new sparse matrix and convert it into a data frame.
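For example, with a hypothetical threshold of 0.995 (the post's actual value may differ), terms absent from more than 99.5% of tweets are dropped:

```r
# Keep terms that appear in at least ~0.5% of tweets, then
# convert the reduced matrix to a data frame with valid column names.
sparse_dtm  <- removeSparseTerms(dtm, sparse = 0.995)
tweet_terms <- as.data.frame(as.matrix(sparse_dtm))
colnames(tweet_terms) <- make.names(colnames(tweet_terms))
```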
Now that we have processed the text, we will send it to the get_sentiment function of the syuzhet package. This function assesses the sentiment of each word or sentence and returns the sentiment value in numerical form. If the value is less than zero we will consider it a negative sentiment, if it is above zero a positive sentiment, and neutral if the value is equal to zero.
We add this sentiment value to the sparse data frame that we created in the earlier step and print the results.
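A sketch of this final step, assuming the tweet texts are in `tweets_df$text` and the sparse data frame is `tweet_terms` (both names are my own):

```r
# Score each tweet, label it by the sign of the score,
# and tabulate the percentage share of each label.
library(syuzhet)

score <- get_sentiment(tweets_df$text)
tweet_terms$sentiment <- ifelse(score < 0, "Negative",
                         ifelse(score > 0, "Positive", "Neutral"))
round(prop.table(table(tweet_terms$sentiment)) * 100, 1)
```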

So what's the public sentiment of the 7,460 tweets on the iPhone Xs Max?

Results: It's largely positive; 59% of the tweets were positive towards the new launch, approximately 30% were neutral and only 11% were negative towards the new product.