...

Deep Learning - Text Analysis (Minu Mathew)

Natural language has no fixed structure, but computers work best with structured input. So the first step is to introduce some structure.

Regular Expressions :

Good for quick string comparisons and transformations.
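
As a small illustration of the comparisons and transformations mentioned above, here is a sketch using Python's standard `re` module. The sample string and the patterns are assumptions made up for this example; in particular, the e-mail pattern is deliberately loose and is not a full address validator.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Contact: Alice <alice@example.com>, Bob <bob@example.org>  on 2021-03-15."

# Comparison: does the string contain an ISO-style date (YYYY-MM-DD)?
date_pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
has_date = date_pattern.search(text) is not None

# Extraction: pull out everything that looks like an e-mail address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Transformation: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", text).strip()

print(has_date)
print(emails)
print(cleaned)
```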

Tokenization, normalization and stemming - methods to add some structure
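
A minimal sketch of the three steps, using only the standard library. The function names are hypothetical, and `naive_stem` is a toy suffix stripper written for this example, not the Porter algorithm or any library's stemmer; a real pipeline would more likely use an NLP library.

```python
import re

def tokenize(text):
    """Split text into word tokens (runs of letters, digits, or apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    """Lowercase every token so 'Cats' and 'cats' compare equal."""
    return [t.lower() for t in tokens]

def naive_stem(token):
    """Strip a few common English suffixes. Illustrative only."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

sentence = "The cats were jumping and the dog jumped."
tokens = normalize(tokenize(sentence))
stems = [naive_stem(t) for t in tokens]

print(tokens)
print(stems)
```

After these steps "jumping" and "jumped" map to the same stem, which is the kind of structure the later numeric encodings rely on.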

Dimensionality reduction

Capture the most important structure. 

Convert a high-dimensional space to a low-dimensional one by keeping only the most important directions (the leading eigenvectors of the data's covariance matrix, as in PCA). Highly correlated dimensions are collapsed into a single dimension.
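
To make the idea concrete, the sketch below reduces two highly correlated features to one dimension. It assumes a PCA-style approach: it finds the dominant eigenvector of the 2x2 covariance matrix by power iteration, in pure Python. The function name and the data are made up for this example; in practice a numerical library would do this step.

```python
import math

def top_principal_component(points, iterations=100):
    """Return the dominant eigenvector of the 2x2 covariance matrix of
    `points` (a list of (x, y) pairs), together with the centered points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration: repeatedly multiply by the matrix and renormalize.
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm
    return (vx, vy), centered

# Two highly correlated features: y is roughly 2 * x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
component, centered = top_principal_component(data)

# Project each centered point onto the component: 2-D becomes 1-D.
reduced = [x * component[0] + y * component[1] for x, y in centered]

print(component)
print(reduced)
```

The single retained direction points roughly along (1, 2), so one number per point captures almost all of the variation in the original two features.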

Methods to transform text to numeric form :

  • Vocab count / Bag of Words (BOW) - no contextual info kept
    • Remove stop words
  • One-hot encoding
  • Frequency count - no contextual info kept
  • TF-IDF - no contextual info kept
  • Word Embeddings : preserve contextual information and capture the semantics of a word.
    • Learn word embeddings from n-grams (PyTorch, Keras)
    • Word2Vec (pre-trained word embeddings from Google) - based on word distributions and local context (window size).
    • GloVe (pre-trained, from Stanford) - based on global (corpus-wide co-occurrence) context
    • BERT
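
The count-based encodings in the list above can be sketched in a few lines of standard-library Python. The three documents, the stop-word list, and the function names are assumptions invented for this example, and the TF-IDF formula shown (relative term frequency times the log of inverse document frequency) is only one of several common variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
stop_words = {"the", "on", "and"}

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# Vocabulary: every distinct remaining word, in sorted order.
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's vocabulary position."""
    return [1 if i == index[word] else 0 for i in range(len(vocab))]

def bag_of_words(doc):
    """Per-document frequency count over the vocabulary (word order is lost)."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tf_idf(doc):
    """Term frequency times log inverse document frequency."""
    counts = Counter(doc)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for d in tokenized if w in d)  # documents containing w
        vec.append(tf * math.log(n_docs / df))
    return vec

bow = [bag_of_words(d) for d in tokenized]
weights = [tf_idf(d) for d in tokenized]

print(vocab)
print(bow[0])
print(one_hot("dog"))
print(weights[0])
```

Note that "cat" and "cats" end up as unrelated vocabulary entries here, and no vector says anything about neighbouring words. That is the missing contextual information that word embeddings are meant to supply.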



Models :

RNN

LSTM

CNN

Transformer architecture :

BERT model

XLNet (by CMU and Google Brain) - BERT and GPT-3 work better in general

GPT-3 model:


ML Ops (Kastan Day, Todd Nicholson)

...