...
- Vocab count / Bag of Words (BOW) - no contextual info kept
    - Use `CountVectorizer` from sklearn, or TF-IDF (better)
    - Remove stop words
- One-hot encoding - no contextual info kept
- Frequency count - no contextual info kept
- TF-IDF - re-weights raw counts so that words common across all documents count less; still no contextual info kept (see the sklearn sketch below)
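A minimal sketch of the count-based representations above, using sklearn (the toy corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

# Bag of Words: raw term counts, with English stop words removed
bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())  # the learned vocabulary (column order)
print(X_bow.toarray())              # one row per document, one count per word

# TF-IDF: same counts, re-weighted so that terms appearing in many
# documents contribute less
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```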
- Word Embeddings: dense vectors that capture the semantics of a word - words used in similar contexts get similar vectors.
    - Resource on implementation of various embeddings
    - Learn word embeddings using an n-gram language model (PyTorch, Keras). Here the embeddings are learned during the training phase, hence they are specific to the training set (see the PyTorch sketch after this list).
    - Word2Vec (pre-trained word embeddings from Google) - based on word distributions and local context (window size).
    - GloVe (pre-trained, from Stanford) - based on global co-occurrence statistics.
    - BERT embeddings - contextual: the same word gets a different vector depending on the sentence it appears in.
    - GPT-3 embeddings
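A minimal sketch of learning embeddings via an n-gram language model in PyTorch (the vocabulary, context size, and single training example are all made up; a real run loops over n-grams extracted from the corpus):

```python
import torch
import torch.nn as nn

# Hypothetical tiny vocabulary; in practice it comes from your corpus
vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
word_to_ix = {w: i for i, w in enumerate(vocab)}
CONTEXT_SIZE = 2   # two context words -> trigram model
EMBED_DIM = 10

class NGramLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(context_size * embed_dim, vocab_size)

    def forward(self, context_ids):
        # look up the context word vectors and concatenate them
        embeds = self.embeddings(context_ids).view(1, -1)
        return self.linear(embeds)  # logits over the next word

model = NGramLM(len(vocab), EMBED_DIM, CONTEXT_SIZE)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on a single ("the cat" -> "sat") example
context = torch.tensor([word_to_ix["the"], word_to_ix["cat"]])
target = torch.tensor([word_to_ix["sat"]])
optimizer.zero_grad()
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()

# After training, model.embeddings.weight holds the learned vectors
```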
- Using word embeddings:
    - Learn them from scratch (not recommended)
    - Reuse pre-trained ones (check what dataset the embeddings have been trained on - it should match your domain)
    - Reuse + fine-tune (start from pre-trained vectors, let them update during training; see the sketch below)
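A sketch of the reuse and reuse + fine-tune options, loading GloVe vectors through gensim's downloader and copying them into a Keras `Embedding` layer (the model name is as published in gensim-data; the five-word vocabulary is made up):

```python
import numpy as np
import gensim.downloader as api
import tensorflow as tf

# Pre-trained GloVe vectors (50-d, trained on Wikipedia + Gigaword)
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("paris", topn=3))  # sanity check: plain reuse

# Made-up vocabulary; in practice it comes from your tokenizer
vocab = ["the", "cat", "sat", "on", "mat"]
embedding_matrix = np.stack([glove[w] for w in vocab])

embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(vocab),
    output_dim=glove.vector_size,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # reuse as-is; set True for reuse + fine-tune
)
```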
Models:
...