# anlp21

Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

Notebook | Description |
---|---|
1.words/EvaluateTokenizationForSentiment | The impact of tokenization choices on sentiment classification. |
1.words/ExploreTokenization | Different methods for tokenizing texts (whitespace, NLTK, spaCy, regex). |
1.words/TokenizePrintedBooks | Design a better tokenizer for printed books. |
1.words/Text_Complexity | Implement type-token ratio and Flesch-Kincaid Grade Level scores for text. |
2.compare/ChiSquare, Mann-Whitney Tests | Explore two tests for finding distinctive terms. |
2.compare/Log-odds ratio with priors | Implement the log-odds ratio with an informative (and uninformative) Dirichlet prior. |
3.dictionaries/DictionaryTimeSeries | Plot sentiment over time using human-defined dictionaries. |
3.dictionaries/Empath | Explore using Empath dictionaries to characterize texts. |
4.embeddings/DistributionalSimilarity | Use the distributional hypothesis to build high-dimensional, sparse representations for words. |
4.embeddings/WordEmbeddings | Explore word embeddings using Gensim. |
4.embeddings/Semaxis | Implement SemAxis for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold). |
4.embeddings/BERT | Explore the basics of token representations in BERT and use them to find token nearest neighbors. |
4.embeddings/SequenceEmbeddings | Use sequence embeddings to find TV episode summaries most similar to a short description. |
5.eda/WordSenseClustering | Infer distinct word senses using KMeans clustering over BERT representations. |
5.eda/Haiku KMeans | Explore text representation in clustering by trying to group haiku and non-haiku poems into two distinct clusters. |
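To give a flavor of the exercises, here is a minimal sketch of the type-token ratio covered in 1.words/Text_Complexity. This is an illustration only, not the course's reference implementation; the function name is our own.

```python
def type_token_ratio(tokens):
    """Ratio of distinct word types to total tokens.

    Higher values indicate greater lexical diversity; the measure is
    sensitive to text length, so it is best compared across texts of
    similar size.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# "the" repeats, so there are 5 types among 6 tokens.
tokens = "the cat sat on the mat".split()
print(type_token_ratio(tokens))  # 5/6 ≈ 0.833
```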
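The core idea behind 4.embeddings/Semaxis can be sketched in a few lines: define an axis as the difference between two pole vectors and score a word by its cosine similarity to that axis. The toy 2-d vectors below are made up for illustration; the notebook works with real pretrained embeddings.

```python
import numpy as np


def semaxis_score(word_vec, pos_vec, neg_vec):
    """Cosine similarity between a word vector and the axis pos - neg.

    Positive scores place the word nearer the positive pole,
    negative scores nearer the negative pole.
    """
    axis = pos_vec - neg_vec
    return float(
        np.dot(word_vec, axis)
        / (np.linalg.norm(word_vec) * np.linalg.norm(axis))
    )


# Hypothetical 2-d "embeddings" standing in for real ones.
good = np.array([1.0, 0.2])
bad = np.array([-1.0, 0.1])
word = np.array([0.8, 0.3])
print(semaxis_score(word, good, bad))  # positive: leans toward the "good" pole
```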