End-to-end Natural Language Processing (NLP)
1. text cleaning: removing punctuation, numbers, stopwords, HTML tags and URLs; stemming
2. text tokenizing and creating a bag-of-words model
3. word scoring: binary, count, frequency, term frequency–inverse document frequency (TF-IDF)
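The three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the code used for the examples in this repository: the stopword list, the naive_stem helper, the function names and the sample texts are all assumptions made for the sketch, and the TF-IDF formula shown (term frequency times a smoothed log IDF) is only one of several common variants. A real pipeline would more likely use a full stopword list and a proper stemmer such as Porter's.

```python
import html
import math
import re
import string
from collections import Counter

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "at", "for", "in", "is", "now", "of", "the", "to"}
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Step 1 and the tokenizing half of step 2: clean, tokenize, stem."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\d+", " ", text)                     # numbers
    text = text.translate(PUNCT_TO_SPACE).lower()        # punctuation, case
    return [naive_stem(t) for t in text.split() if t not in STOPWORDS]


def build_vocabulary(docs):
    """Step 2: map each distinct token to a column of the bag-of-words matrix."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}


def score(docs, vocab, mode="count"):
    """Step 3: one row per document, scored as binary, count, freq or tfidf."""
    if mode not in {"binary", "count", "freq", "tfidf"}:
        raise ValueError(f"unknown mode: {mode}")
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))  # documents containing t
    matrix = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            j = vocab[tok]
            if mode == "binary":
                row[j] = 1.0
            elif mode == "count":
                row[j] = float(count)
            elif mode == "freq":
                row[j] = count / len(doc)
            else:
                # Smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant).
                idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1
                row[j] = (count / len(doc)) * idf
        matrix.append(row)
    return matrix


# Made-up sample texts, not taken from any of the datasets listed below.
texts = [
    "<p>Win a FREE prize now!!! Visit http://example.com</p>",
    "Are we meeting for lunch at 12?",
    "Free lunch, free prize",
]
docs = [preprocess(t) for t in texts]
vocab = build_vocabulary(docs)
for mode in ("binary", "count", "freq", "tfidf"):
    print(mode, score(docs, vocab, mode)[2])
```

Each scoring mode fills the same document-by-vocabulary matrix and differs only in the value written to a cell, so the mode can be swapped without touching the cleaning or tokenizing steps.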
Examples:
1. UCI Spam Collection data https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
2. UCI Yelp Restaurant Review data https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#
3. UCI Amazon Product Review data https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#
4. Kaggle IMDB Sentiment data https://www.kaggle.com/c/word2vec-nlp-tutorial
5. Kaggle Yelp Business Rating data https://www.kaggle.com/c/yelp-recsys-2013
6. Kaggle Toxic Comment Classification Challenge https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
7. CrowdFlower Twitter Airline Sentiment data https://www.crowdflower.com/data-for-everyone/
8. CrowdFlower Twitter Global Warming Sentiment data https://www.crowdflower.com/data-for-everyone/
9. CrowdFlower Corporate Messaging data https://www.crowdflower.com/data-for-everyone/
10. CrowdFlower Coachella 2015 Twitter sentiment data https://www.crowdflower.com/data-for-everyone/