...

  1. Vocab count / Bag of Words (BoW)
    1. The simplest technique: count all the words in the corpus and give each word an index.
    2. No contextual information is preserved. Context is very important in language, and this key part is lost when employing this technique.
    3. Use CountVectorizer from sklearn, or TF-IDF (better); see the CountVectorizer sketch after this list.
    4. Remove stop words.
  2. One-hot encoding
    1. Each word is represented as an n-dimensional vector, where n is the total number of words. The index of that particular word has the value 1 and all other positions are 0.
    2. Context is lost.
    3. Easy to manipulate and process because of the 1s and 0s.
  3. Frequency count
    1. The frequency of each word is preserved, not just whether the word is present in a sentence.
    2. Context is lost, but frequency is preserved.
    3. The idea here is that the more frequent the word, the less significant it is.
  4. TF-IDF
    1. Term Frequency (TF) times Inverse Document Frequency (IDF).
    2. This is the most commonly used form of vectorization in simple NLP tasks. It uses the word frequency within each document and across documents; see the TF-IDF sketch after this list.
    3. No contextual information is preserved, but word importance is highlighted in this method.
    4. This method has been shown to give good results for topic classification and spam filtering (identifying spam words).
    5. Blog on BoW and TF-IDF
  5. Word Embeddings: preserve contextual information and capture the semantics of a word.
    1. Resource on implementation of various embeddings
    2. Learn word embeddings using an n-gram model (PyTorch, Keras). Here the word embeddings are learned in the training phase, hence the embeddings are specific to the training set. (Considers text sequences; see the learned-embedding sketch after this list.)
    3. Word2Vec (pre-trained word embeddings from Google) - based on word distributions and local context (window size). (Considers text sequences; see the pre-trained-embedding sketch after this list.)
    4. GloVe (pre-trained, from Stanford) - based on global context. (Considers text sequences.)
    5. BERT embeddings (an advanced technique using the transformer architecture)
    6. GPT-3 embeddings
  6. Using word embeddings:
    1. Learn it (not recommended)
    2. Reuse it (recommended - although check what dataset the embeddings have been trained on)
    3. Reuse + fine-tune (recommended); see the fine-tuning sketch after this list.
      1. Fine-tuning usually means reusing the lower / first few layers with the same weights as the pre-trained model, and freezing those lower layers during the training phase.
      2. The final few layers (usually the last 3-4 layers) are trained with the dataset at hand. That way the weights of the final layers are learned specifically for the task / data at hand.
      3. The lower layers carry rich general knowledge from being trained on a huge and varied dataset (via the pre-trained model), while the final layers near the output have weights tuned to the specific task.
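
A minimal sketch of the count-based techniques above (items 1-3), using sklearn's CountVectorizer. The two example sentences are made up for illustration, and get_feature_names_out assumes a recent sklearn (1.0+); binary=True records only word presence (a one-hot-style 0/1 per document) while the default keeps raw frequency counts.

# Bag of Words / frequency counts with sklearn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# stop_words="english" drops common words; binary=True would record only
# presence (0/1) instead of raw counts.
vectorizer = CountVectorizer(stop_words="english", binary=False)
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # vocabulary, one index per word
print(X.toarray())                           # per-document word counts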
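A hedged TF-IDF sketch for item 4: TfidfVectorizer gives high weight to words that are frequent in one document but rare across documents. The documents, labels, and the tiny LogisticRegression spam filter on top are purely illustrative.

# TF-IDF features plus a toy spam filter; the documents and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "free prize click now",         # spam-like
    "meeting agenda for monday",    # normal
    "win free money click here",    # spam-like
    "please review the agenda",     # normal
]
labels = [1, 0, 1, 0]               # 1 = spam, 0 = not spam

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # high weight = frequent here, rare elsewhere

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["free money prize"])))   # likely [1]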
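For item 5.2 (embeddings learned in the training phase), a minimal PyTorch sketch: the nn.Embedding layer starts random and is updated together with the rest of the model, so the vectors it learns are specific to the training data. The vocabulary size, dimensions, and the tiny classifier are placeholders.

# Embeddings learned jointly with a toy classifier.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 50

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # learned with the task
        self.fc = nn.Linear(embed_dim, 2)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        vectors = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        pooled = vectors.mean(dim=1)         # average over the sequence
        return self.fc(pooled)

model = TinyClassifier()
logits = model(torch.randint(0, vocab_size, (4, 8)))  # 4 sentences of 8 tokens
print(logits.shape)                                    # torch.Size([4, 2])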
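For items 5.3-5.4 (reusing pre-trained Word2Vec / GloVe vectors), a sketch using gensim's downloader. The model name "glove-wiki-gigaword-100" is one entry in the gensim-data catalogue and is assumed to be available; it downloads on first use.

# Load pre-trained GloVe vectors via gensim's downloader.
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")      # 100-dimensional GloVe vectors

print(kv["language"].shape)                   # (100,) - the vector for one word
print(kv.most_similar("language", topn=3))    # nearest words in embedding space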
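For item 6.3 (reuse + fine-tune), a minimal PyTorch sketch of the freezing idea: the lower layers keep their weights fixed via requires_grad = False, and only the task-specific head is handed to the optimizer. The architecture and sizes are placeholders standing in for a real pre-trained model, not an actual checkpoint.

# Freeze lower (pretend pre-trained) layers, train only the final layers.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Embedding(1000, 50),     # pretend these lower layers come pre-trained
    nn.Flatten(),
    nn.Linear(50 * 8, 64),
    nn.ReLU(),
    nn.Linear(64, 2),           # task-specific head, trained from scratch
)

# Freeze the lower layers so their general-purpose weights stay fixed.
for layer in list(model.children())[:3]:
    for p in layer.parameters():
        p.requires_grad = False

# The optimizer only sees the parameters that are still trainable.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")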

...