BERTopic Explained

90% of the world's data is unstructured. It is built by humans, for humans. That's great for human consumption, but it is very hard to organize at the scale of data we deal with in today's information age. Organization is complicated because unstructured text data is not intended to be understood by machines, and having humans process this abundance of data is wildly expensive and *very slow*.

Fortunately, there is light at the end of the tunnel. More and more of this unstructured text is becoming accessible to, and understood by, machines. We can now search text based on *meaning*, identify the sentiment of text, extract entities, and much more.

Transformers are behind much of this. These transformers are (unfortunately) not Michael Bay's Autobots and Decepticons and (fortunately) not buzzing electrical boxes. Our NLP transformers lie somewhere in the middle: they're not sentient Autobots (yet), but they can understand language in a way that existed only in sci-fi until a few short years ago.

Machines with a human-like comprehension of language are pretty helpful for organizing masses of unstructured text data. In machine learning, we refer to this task as *topic modeling*, the automatic clustering of data into particular topics.

BERTopic takes advantage of the superior language capabilities of these (not yet sentient) transformer models and uses some other ML magic like UMAP and HDBSCAN (more on these later) to produce one of the most advanced techniques in language topic modeling today.

🌲 Pinecone article: https://www.pinecone.io/learn/bertopic
🔗 Code notebooks: https://github.com/pinecone-io/exampl...
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5

🎉 Subscribe for Article and Video Updates!
/ subscribe   / membership
👾 Discord: / discord

00:00 Intro
01:40 In this video
02:58 BERTopic Getting Started
08:48 BERTopic Components
15:21 Transformer Embedding
18:33 Dimensionality Reduction
25:07 UMAP
31:48 Clustering
37:22 c-TF-IDF
40:49 Custom BERTopic
44:04 Final Thoughts
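
The c-TF-IDF step named in the chapter list is the part that turns clusters into readable topics, and a rough idea of it fits in a few lines of plain Python. This is a minimal sketch, not BERTopic's own code: it assumes the commonly documented weighting, score = tf(term, cluster) * log(1 + A / f(term)), where A is the average number of words per cluster and f(term) is the term's count across all clusters. The helper names `c_tf_idf` and `top_terms` and the toy clusters are made up for illustration, and tokenization is a plain whitespace split; the library differs in details such as tokenization and normalization.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster (hypothetical helper, simplified c-TF-IDF).

    clusters: dict mapping a cluster label to a list of document strings.
    Returns: dict mapping each label to a dict of term -> score.
    """
    # Treat every cluster as one big document and count its terms.
    tf = {
        label: Counter(word for doc in docs for word in doc.lower().split())
        for label, docs in clusters.items()
    }

    # Count of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(tf)

    # Assumed weighting: tf * log(1 + A / f(term)).
    return {
        label: {
            term: freq * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
        for label, counts in tf.items()
    }


def top_terms(scores, n=3):
    """Return the n best-scoring terms per cluster (ties broken alphabetically)."""
    return {
        label: [
            term
            for term, _ in sorted(
                term_scores.items(), key=lambda kv: (-kv[1], kv[0])
            )[:n]
        ]
        for label, term_scores in scores.items()
    }


if __name__ == "__main__":
    # Toy clusters, standing in for the output of the clustering step.
    toy_clusters = {
        "space": ["the rocket launch", "the rocket orbit"],
        "food": ["the pasta recipe", "the pasta sauce"],
    }
    print(top_terms(c_tf_idf(toy_clusters), n=2))
```

Words that appear in every cluster (like "the" here) get a smaller log factor, so terms concentrated in one cluster rise to the top of that cluster's list.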