...
- Vocab count / Bag of Words (BoW)
- The simplest technique: count all the words in the corpus and assign an index to each.
- No contextual information is preserved. Context is very important in language, and this key part is lost when employing this technique.
- Use CountVectorizer from sklearn, or TF-IDF (better) - see the sketch after this list.
- Remove stop words
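- A minimal sketch of the BoW idea with sklearn's CountVectorizer (the toy sentences are made up; stop words are removed as suggested above):

```python
# Bag of Words with scikit-learn's CountVectorizer (toy example).
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat sat on the mat",
    "The dog sat on the log",
]

vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(sentences)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())        # each word gets an index
print(bow.toarray())                             # counts per sentence; word order / context is lost
```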
- One-hot encoding
- Each word is represented as an n-dimensional vector, where n is the total number of words in the vocabulary. The index of the particular word has value 1 and the rest of the index values are 0.
- Context is lost.
- Easy to manipulate and process because of the 1s and 0s (a small sketch follows this list).
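- A minimal sketch of one-hot word vectors with NumPy (the small vocabulary is made up for illustration):

```python
# One-hot encoding of words over a toy vocabulary.
import numpy as np

vocab = ["cat", "dog", "sat", "on", "the", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))        # n-dimensional vector of zeros (n = vocabulary size)
    vec[word_to_index[word]] = 1      # 1 only at the word's own index
    return vec

print(one_hot("dog"))                 # [0. 1. 0. 0. 0. 0.]
```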
- Frequency count
- The frequency of each word is preserved, in addition to whether the word is present in a sentence or not.
- Context is lost, but frequency is preserved.
- The idea here is that the more frequent a word is, the less significant it is (see the sketch after this list).
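- A minimal sketch of a word-frequency count using only the standard library (the example sentence is made up):

```python
# Word-frequency counting with collections.Counter (toy example).
from collections import Counter

sentence = "the cat sat on the mat the cat slept"
counts = Counter(sentence.split())

print(counts)          # Counter({'the': 3, 'cat': 2, 'sat': 1, ...})
print(counts["the"])   # 3 -> very frequent, hence (by the idea above) less significant
```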
- TF-IDF
- Term Frequency - Inverse Document Frequency.
- This is the most widely used form of vectorization for simple NLP tasks. It uses the word frequency within each document and across documents.
- No contextual information is preserved, but word importance is highlighted by this method.
- This method has been shown to give good results for topic classification and spam filtering (identifying spam words). A minimal sketch follows below.
- Blog on BoW and TF-IDF
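- A minimal sketch of TF-IDF with sklearn's TfidfVectorizer (the toy documents are made up; terms concentrated in few documents get higher weight):

```python
# TF-IDF vectorization with scikit-learn (toy spam-vs-normal documents).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "win a free prize now",
    "meeting rescheduled to friday",
    "free prize claim now win",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)           # sparse (n_docs x n_terms) weight matrix

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))                  # words frequent in a doc but rare overall score higher
```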
- Word embeddings: preserve contextual information and capture the semantics of a word.
- Resource on implementation of various embeddings
- Learn word embeddings using an n-gram model (PyTorch, Keras). Here the word embeddings are learned in the training phase, hence the embeddings are specific to the training set. (Considers text sequences.) A minimal sketch follows this list.
- Word2Vec (pre-trained word embeddings from Google) - based on word distributions and local context (window size). (Considers text sequences.)
- GloVe (pre-trained, from Stanford) - based on global context, i.e. global word co-occurrence statistics. (Considers text sequences.)
- BERT embeddings (an advanced technique using a Transformer).
- GPT-3 embeddings.
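- A minimal sketch of learning embeddings in the training phase with a trigram (n-gram) objective in PyTorch; the tiny corpus, embedding size, and training settings are made up for illustration:

```python
# Learning word embeddings via an n-gram (trigram) language-model objective in PyTorch.
import torch
import torch.nn as nn

corpus = "the cat sat on the mat and the dog sat on the log".split()
vocab = sorted(set(corpus))
word_to_ix = {w: i for i, w in enumerate(vocab)}

# (two previous words as context, next word as target)
trigrams = [([corpus[i], corpus[i + 1]], corpus[i + 2]) for i in range(len(corpus) - 2)]

class NGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=10, context_size=2):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # learned during training
        self.linear = nn.Linear(context_size * embedding_dim, vocab_size)

    def forward(self, context_ixs):
        embeds = self.embeddings(context_ixs).view(1, -1)
        return self.linear(embeds)               # logits over the vocabulary

model = NGramModel(len(vocab))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):
    for context, target in trigrams:
        context_ixs = torch.tensor([word_to_ix[w] for w in context])
        target_ix = torch.tensor([word_to_ix[target]])
        optimizer.zero_grad()
        loss = loss_fn(model(context_ixs), target_ix)
        loss.backward()
        optimizer.step()

# The learned embedding for one word; note it is specific to this training set.
print(model.embeddings(torch.tensor(word_to_ix["cat"])))
```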
- Using word embeddings:
- Learn it (not recommended)
- Reuse it (recommended - although check what dataset the embeddings have been trained on)
- Reuse + fine-tune (recommended)
- Fine-tuning usually means using the lower / first few layers with the same weights as the pre-trained model, and freezing those lower layers during the training phase (a sketch follows this list).
- The final few layers (usually the last 3-4) are trained on the dataset at hand. That way the weights of the final layers are learned specifically for the task / data at hand.
- The lower layers carry rich general knowledge from being trained on a huge and varied dataset (via the pre-trained model), while the final layers near the output have weights tuned to the specific task.
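- A minimal sketch of the reuse + fine-tune idea in Keras: a frozen Embedding layer loaded with pre-trained vectors (here a random stand-in matrix, since no real Word2Vec/GloVe file is assumed; the vocabulary size, embedding dimension, and sequence length are placeholders), followed by trainable final layers:

```python
# Reuse + fine-tune: frozen pre-trained embedding layer + trainable task-specific head.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embedding_dim, seq_len = 10_000, 100, 50
# Stand-in for a real pre-trained embedding matrix (e.g. Word2Vec or GloVe vectors).
pretrained_matrix = np.random.rand(vocab_size, embedding_dim).astype("float32")

embedding_layer = layers.Embedding(vocab_size, embedding_dim, trainable=False)  # frozen lower layer

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    embedding_layer,
    layers.GlobalAveragePooling1D(),
    layers.Dense(32, activation="relu"),         # final layers: trained on the task at hand
    layers.Dense(1, activation="sigmoid"),
])

embedding_layer.set_weights([pretrained_matrix]) # reuse the pre-trained vectors, kept frozen

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, ...) would then only update the trainable (final) layers.
```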
...