End-to-end Natural Language Processing (NLP)
  1. text cleaning: removing punctuation, numbers, stopwords, HTML tags, and URLs; stemming
  2. text tokenizing and creating a bag-of-words model
  3. word scoring: binary, count, frequency, Term frequency–Inverse document frequency (TF-IDF)
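
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in this repository: the function names (`clean_text`, `build_vocabulary`, `score_documents`), the tiny stopword set, and the two sample messages are all hypothetical. Stemming is left out; a library stemmer such as NLTK's `PorterStemmer` could be applied to each token after cleaning. The TF-IDF formula shown is the plain textbook variant (term frequency times log(N / document frequency)); libraries often use smoothed variants that give different numbers.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; real pipelines use a fuller list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "this"}


def clean_text(text):
    """Strip HTML tags, URLs, punctuation, and numbers; lowercase; drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # punctuation and digits
    tokens = text.lower().split()              # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(docs):
    """Map each distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def score_documents(docs, vocab, mode="count"):
    """Turn tokenized documents into bag-of-words rows under one scoring mode."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    matrix = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            j = vocab[tok]
            if mode == "binary":
                row[j] = 1.0
            elif mode == "count":
                row[j] = float(count)
            elif mode == "freq":
                row[j] = count / len(doc)
            elif mode == "tfidf":
                row[j] = (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            else:
                raise ValueError(f"unknown mode: {mode}")
        matrix.append(row)
    return matrix


# Hypothetical sample messages, for illustration only.
raw = [
    "<b>Free</b> prize!!! Visit http://spam.example now",
    "See you at lunch, free at 12?",
]
docs = [clean_text(r) for r in raw]
vocab = build_vocabulary(docs)
for mode in ("binary", "count", "freq", "tfidf"):
    print(mode, score_documents(docs, vocab, mode))
```

In this sketch, a word that occurs in every document (here "free") gets a TF-IDF score of zero, which is the intended effect of the inverse-document-frequency term: words that do not distinguish documents are down-weighted.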

Examples:
  1. UCI Spam Collection data  https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
  2. UCI Yelp Restaurant Review data  https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#
  3. UCI Amazon Product Review data  https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#
  4. Kaggle IMDB Sentiment data  https://www.kaggle.com/c/word2vec-nlp-tutorial
  5. Kaggle Yelp Business Rating data  https://www.kaggle.com/c/yelp-recsys-2013
  6. Kaggle Toxic Comment Classification Challenge data  https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
  7. CrowdFlower Twitter Airline Sentiment data  https://www.crowdflower.com/data-for-everyone/
  8. CrowdFlower Twitter Global Warming Sentiment data  https://www.crowdflower.com/data-for-everyone/
  9. CrowdFlower Corporate Messaging data  https://www.crowdflower.com/data-for-everyone/
  10. CrowdFlower Coachella 2015 Twitter Sentiment data  https://www.crowdflower.com/data-for-everyone/