...
- Vocab count / Bag of Words (BOW) - no contextual info kept
    - Use `CountVectorizer` from sklearn, or TF-IDF (better)
    - Remove stop words
- One-hot encoding - no contextual info kept
- Frequency count - no contextual info kept
- TF-IDF - re-weights raw counts so that words common across all documents count less; still no contextual info kept (see the sklearn sketch below)
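A minimal sketch of the count-based representations above, using sklearn (the toy corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

# Bag of Words: raw term counts, with English stop words removed
bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())  # the learned vocabulary (column order)
print(X_bow.toarray())              # one row per document, one count per word

# TF-IDF: same counts, re-weighted so that terms appearing in many
# documents contribute less
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```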
- Word Embeddings: dense vectors that capture the semantics of a word - words used in similar contexts get similar vectors.
    - Resource on implementation of various embeddings
    - Learn word embeddings using an n-gram language model (PyTorch, Keras). Here the embeddings are learned during the training phase, hence they are specific to the training set (see the PyTorch sketch after this list).
    - Word2Vec (pre-trained word embeddings from Google) - based on word distributions and local context (window size).
    - GloVe (pre-trained, from Stanford) - based on global co-occurrence statistics.
    - BERT embeddings - contextual: the same word gets a different vector depending on the sentence it appears in.
    - GPT-3 embeddings
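A minimal sketch of learning embeddings via an n-gram language model in PyTorch (the vocabulary, context size, and single training example are all made up; a real run loops over n-grams extracted from the corpus):

```python
import torch
import torch.nn as nn

# Hypothetical tiny vocabulary; in practice it comes from your corpus
vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
word_to_ix = {w: i for i, w in enumerate(vocab)}
CONTEXT_SIZE = 2   # two context words -> trigram model
EMBED_DIM = 10

class NGramLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(context_size * embed_dim, vocab_size)

    def forward(self, context_ids):
        # look up the context word vectors and concatenate them
        embeds = self.embeddings(context_ids).view(1, -1)
        return self.linear(embeds)  # logits over the next word

model = NGramLM(len(vocab), EMBED_DIM, CONTEXT_SIZE)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on a single ("the cat" -> "sat") example
context = torch.tensor([word_to_ix["the"], word_to_ix["cat"]])
target = torch.tensor([word_to_ix["sat"]])
optimizer.zero_grad()
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()

# After training, model.embeddings.weight holds the learned vectors
```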
- Using word embeddings:
    - Learn them from scratch (not recommended)
    - Reuse pre-trained ones (check what dataset the embeddings have been trained on - it should match your domain)
    - Reuse + fine-tune (start from pre-trained vectors, let them update during training; see the sketch below)
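A sketch of the reuse and reuse + fine-tune options, loading GloVe vectors through gensim's downloader and copying them into a Keras `Embedding` layer (the model name is as published in gensim-data; the five-word vocabulary is made up):

```python
import numpy as np
import gensim.downloader as api
import tensorflow as tf

# Pre-trained GloVe vectors (50-d, trained on Wikipedia + Gigaword)
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("paris", topn=3))  # sanity check: plain reuse

# Made-up vocabulary; in practice it comes from your tokenizer
vocab = ["the", "cat", "sat", "on", "mat"]
embedding_matrix = np.stack([glove[w] for w in vocab])

embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(vocab),
    output_dim=glove.vector_size,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # reuse as-is; set True for reuse + fine-tune
)
```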
Models:
...