The author of this post, Roberto Salazar, attended Jeff Heer's course Techniques and Frameworks for Data Exploration through Sphere.
When conducting a study, not all available data comes directly from numbers and quantitative sources. For example, data might come from documents, news, articles, or blog posts. With text data, we can fit, test, and validate predictive and classification models, perform automated analysis, and develop visualization charts.
Visualizing text helps viewers understand what is happening in a single document, a small set of related documents, or an entire corpus of documents. As a result, we can quickly grasp their nature, group them into clusters, and compare them. In the same way, visualizing the contexts of a set of data allows for finding correlations between multiple attributes (e.g., time, authors, structures, etc.) and hidden trends and patterns.
Visualizing text data comes with multiple challenges. Therefore, understanding the right text data visualization tools is critical. For example, what is the best tool for an intended analysis? Networks? Clusters? Histograms with word and entity frequencies? Analysts must clearly define the questions they wish to explore and the strategy to do so, including data cleansing and data transformation techniques to obtain reliable results.
This article shows four ways to visualize text data. We'll use a set of text data available online.
This Medium Articles Dataset contains information about randomly chosen Medium articles published from seven of its most popular publications:
Though we do not know why this set was created — what the creator was looking for or the criteria they may have used — we can use it as an example to show potential learnings from data visualization.
Even before digging into the contents, we can obtain meaningful insights from the metadata of the text source. This metadata could reveal significant trends or patterns supporting further conclusions from the analysis. Let's start with the data set structure.
The raw data set contains the titles of 6,508 Medium articles, along with other features such as URL, subtitle, claps, etc. But for this analysis, we are only interested in analyzing the titles. As observed in Image 1, only some articles have a subtitle and include an image.
Next, let us group the articles by publication and date to see what we can learn.
Image 2 shows that writers published much of their work in June and October and that The Startup had the most articles written in 2019. However, it is interesting that writers published only a few pieces during the first five months. Unfortunately, the dataset owner did not specify the criteria used to sample the articles, resulting in unanswered questions and potential gaps in the metadata analysis.
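The grouping behind a chart like this can be sketched with the Python standard library alone. This is a minimal sketch, not the code used for Image 2: the sample rows, the `articles` variable, and the function name are hypothetical, and in practice the rows would be read from the dataset file.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows (publication, publication date); not real data.
articles = [
    ("The Startup", date(2019, 6, 3)),
    ("The Startup", date(2019, 6, 17)),
    ("Towards Data Science", date(2019, 10, 2)),
    ("The Startup", date(2019, 10, 21)),
]

def count_by_publication_and_month(rows):
    """Count articles for every (publication, 'YYYY-MM') pair."""
    return Counter((pub, d.strftime("%Y-%m")) for pub, d in rows)

monthly_counts = count_by_publication_and_month(articles)
for (pub, month), n in sorted(monthly_counts.items()):
    print(f"{month}  {pub}: {n}")
```

Each count can then be drawn as one bar, with the month on the horizontal axis and one color per publication.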
Now, let us look at the publications' distribution and count from this dataset sample.
Images 3 and 4 show that this dataset is heavily unbalanced towards The Startup and Towards Data Science publications. For some tasks, such as developing machine learning classifiers and predictors (i.e., mathematical models), balanced datasets are preferred because they tend to produce more accurate and reliable models. For other tasks, an unbalanced dataset could reveal interesting trends or findings within the data, encouraging the analyst to find out why a given label, class, group, or cluster represents a minority of the population under study.
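One simple way to quantify how unbalanced a labeled sample is: compute each label's share, and the ratio between the largest and smallest class. The sketch below assumes a made-up list of labels; the counts are illustrative only and are not taken from the dataset.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of the sample that belongs to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical labels; the proportions are invented for illustration.
labels = (
    ["The Startup"] * 6
    + ["Towards Data Science"] * 3
    + ["Other publication"] * 1
)
shares = class_shares(labels)
ratio = imbalance_ratio(labels)
```

A ratio close to 1 indicates a balanced sample; the larger the ratio, the more a classifier trained on it may favor the majority class.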
Choosing the best visualization tool for analyzing text data is a critical task. Each visualization tool serves a particular purpose and provides unique insights about the text being analyzed. For this case study, let us explore four visualization tools: histograms of word and bigram frequencies, named entity recognition, word networks, and topic modeling.
When analyzing text data, one of the most common approaches is considering each unique word (i.e., gram) as a token. Therefore, we start with a list of all the unique words in the corpus of texts or documents. Then we use text cleansing techniques to remove the most common words and punctuation marks required for correct grammatical structure.
These common words are called stop words; typical English examples include the, a, an, and, of, and to.
Once we cleanse the raw text, we plot a histogram of word frequencies, as shown in Image 5. The prominence of words such as data, learning, machine, science, artificial intelligence (AI), and Python suggests that the most common topics addressed in the articles are related to data science. However, more analysis is needed to determine which publications address such topics.
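The cleansing and counting steps described above can be sketched as follows. This is a minimal illustration, not the article's original code: the stop-word set is a small hand-picked assumption (real analyses usually rely on a fuller list from an NLP library), and the three titles are invented.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word set, far from complete.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is",
              "with", "how", "what"}

def tokenize(title):
    """Lowercase a title and keep only runs of letters.

    Punctuation and digits are dropped, so "don't" becomes "don" and "t".
    """
    return re.findall(r"[a-z]+", title.lower())

def word_frequencies(titles, stop_words=STOP_WORDS):
    """Count every non-stop-word token across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(t for t in tokenize(title) if t not in stop_words)
    return counts

# Hypothetical titles, for illustration only.
titles = [
    "An Introduction to Machine Learning with Python",
    "The Future of Data Science",
    "How to Learn Data Science in a Year",
]
freq = word_frequencies(titles)
print(freq.most_common(5))
```

The pairs returned by `most_common` are what a histogram like Image 5 would plot: one bar per word, with height equal to its count.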
In addition to obtaining the most repeated grams, it is useful to obtain the most repeated bigrams, that is, pairs of adjacent words. In some texts, analyzing single words without their preceding and following words might result in losing context. For example, the word data by itself can be used in multiple contexts. But analyzing the word along with its preceding or following word (e.g., data extraction, data loading, historical data, big data) provides a better understanding of the context.
Image 6 shows the most repeated bigrams in the articles' titles: machine learning and data science. Analyzing the most common bigrams reinforces that the articles' most-addressed topics relate to data science.
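Counting bigrams follows the same pattern as counting single words, except that each token is paired with the one after it. In this sketch, pairs are formed only within a title (never across two titles) and stop words are kept; both choices are assumptions, since the article does not say how its bigrams were built. The titles are invented.

```python
import re
from collections import Counter

def bigrams(tokens):
    """Return consecutive token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))

def bigram_frequencies(titles):
    """Count bigrams title by title so pairs never span two titles."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z]+", title.lower())
        counts.update(bigrams(tokens))
    return counts

# Hypothetical titles, for illustration only.
sample_titles = [
    "Machine Learning for Beginners",
    "Why Machine Learning Matters",
    "Big Data and Data Science",
]
bigram_counts = bigram_frequencies(sample_titles)
print(bigram_counts.most_common(3))
```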
Named Entity Recognition (NER) is a subdomain of Natural Language Processing that uses information extraction algorithms to locate named entities in unstructured text and classify them into predefined labels or groups. With the results from an NER analysis, we can leverage visualization tools to develop graphs and plots that help identify trends, patterns, and insights. Let's conduct an NER analysis of the articles' titles to get some findings.
Image 7 shows a rendered visualization of an NER analysis in JupyterLab, a browser-based notebook environment commonly used with Python, for the first articles' titles. This visualization is useful when analyzing short pieces of text, where we can easily spot the entities within the text with their respective labels. However, when analyzing a whole corpus of texts, aggregate plots are more meaningful. So, let us visualize the frequencies of the entity labels found in the data set.
Image 8 shows the frequency of each entity label found in all the article titles. The most frequent entity label was ORG (i.e., organizations), followed by PERSON and CARDINAL, with ORG entities about three times as frequent as PERSON entities. Next, let us identify the most frequent entity names within the ORG, PERSON, and GPE labels.
Image 9 shows the top 10 organization entities within the articles' titles. The top 5 organization entities (UX, h3-string">" How, h3, AI and h3-string">" What) are not organizations, unlike Microsoft and Amazon. This suggests that the spaCy model used for the NER analysis is not 100% accurate when identifying organization entities within texts. However, analysts can improve the results by defining rules that remove wrongly labeled entities, for example: exclude acronyms (such as UX, AI, and UI), remove words with fewer than n characters, or remove words containing specific punctuation marks.
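The three rules suggested above can be sketched as a small post-processing filter applied to the entity strings that an NER model returns. The blocklist, the minimum length of 3, and the set of banned punctuation marks are all assumptions chosen for illustration; they would need tuning for a real corpus, and a blanket acronym rule would also drop legitimate organizations whose names are acronyms.

```python
# Assumptions for illustration: the blocklist, length threshold, and banned
# characters are hand-picked, not taken from the original analysis.
ACRONYM_BLOCKLIST = {"UX", "AI", "UI"}
BANNED_CHARS = set('"<>=/')

def is_plausible_entity(name, min_length=3,
                        blocklist=ACRONYM_BLOCKLIST,
                        banned_chars=BANNED_CHARS):
    """Return False if any of the three cleanup rules rejects the name."""
    name = name.strip()
    if name.upper() in blocklist:                    # rule 1: known acronyms
        return False
    if len(name) < min_length:                       # rule 2: too short
        return False
    if any(ch in banned_chars for ch in name):       # rule 3: markup punctuation
        return False
    return True

# Entity strings quoted in the text above.
org_candidates = ["UX", 'h3-string">" How', "h3", "AI",
                  'h3-string">" What', "Microsoft", "Amazon"]
clean_orgs = [name for name in org_candidates if is_plausible_entity(name)]
print(clean_orgs)
```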
Image 10 shows the top 10 person entities found within the articles' titles. Just as with the ORG entities, most of these entities are not persons. For example, Machine Learning, Deep Learning, and AI are all research fields within the data science domain, not personal names. Of the top 10 person entities found, the only one that refers to a person is Markov, the last name of Andrey Andreyevich Markov, a Russian mathematician.
Image 11 shows the top 10 GPE entities (geopolitical entities, such as countries, cities, and states) found within the articles' titles. From this list, AI, Us (the plural pronoun), Blockchain, and Machine Learning do not represent real GPEs, in contrast to America, the UK, and the US. NER still has many open challenges and problems to be solved.
To improve the accuracy and reliability of NER results, the analyzed text must be as clean as possible to avoid wrongly labeled entities and noisy data. There are also multiple lines of research aimed at building more accurate models for natural language processing tasks.
Word networks show the interconnected and supporting use of words between textual units containing key terms. In addition, they help identify clusters of words that we can group by a given topic. For this analysis, let us build a word network from the articles' titles and try to spot clusters.
Image 12 shows the resulting word network from the most common bigrams of the articles' titles. At first glance, we see a very well-defined cluster (i.e., a large number of words connected to each other) at the middle right of the network, as shown in Image 13. Such a cluster reveals that there are multiple co-occurrences between these words in the text being analyzed. This cluster comprises words including science, machine, and data, which suggests that it is related to the topics of data science and machine learning. The words “learning” and “data” represent key nodes in this cluster, since they co-occur with multiple other words, as in machine learning and data scientist.
Another cluster is in the top right center of the network, as shown in Image 14. This cluster comprises words such as life, lessons, and learned, which suggests that it relates to personal experiences.
Finally, a third cluster is in the lower left section of the network, as shown in Image 15. This cluster comprises words such as neural, convolutional, and networks, which suggests that the topic of deep learning is present in the articles.
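A word network of this kind can be sketched by treating each frequent bigram as an edge between two word nodes. In the sketch below, connected components stand in for the clusters and node degree stands in for "key nodes"; both are simplifying assumptions, since the clusters in Images 13 through 15 were identified visually. The bigram counts are invented, and the `min_count` threshold is a hypothetical parameter.

```python
from collections import deque

def build_network(bigram_counts, min_count=1):
    """Build an undirected graph {word: set(neighbors)} from bigram counts."""
    graph = {}
    for (first, second), n in bigram_counts.items():
        if n >= min_count and first != second:
            graph.setdefault(first, set()).add(second)
            graph.setdefault(second, set()).add(first)
    return graph

def connected_components(graph):
    """Return the sets of mutually reachable words, largest first (BFS)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical bigram counts, for illustration only.
example_bigrams = {
    ("machine", "learning"): 3, ("deep", "learning"): 2,
    ("data", "science"): 3, ("data", "scientist"): 2,
    ("learning", "data"): 1,
    ("life", "lessons"): 2, ("lessons", "learned"): 2,
    ("convolutional", "neural"): 1, ("neural", "networks"): 2,
}
network = build_network(example_bigrams)
clusters = connected_components(network)
degree = {word: len(neighbors) for word, neighbors in network.items()}
```

Raising `min_count` removes rare bigrams, which usually splits the network into smaller, easier-to-read clusters.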
Topic modeling is a text-mining tool for analyzing text data and discovering hidden semantic structures in a corpus of texts. It allows the clustering of relevant documents addressing a common topic. For this example, let us assume that each of the six Medium publications addresses a unique topic (which may not be true) and find which set of words is assigned to each by using LDA (Latent Dirichlet Allocation), a specific topic modeling technique.
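To make the idea behind LDA more concrete, here is a minimal sketch that fits it with collapsed Gibbs sampling using only the standard library. It is not the implementation used for this article, and a real analysis would normally rely on an established library. The function names, the hyperparameters `alpha` and `beta`, the iteration count, and the four tokenized titles are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sampling sweep.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z = []
        for w in doc:
            k = rng.randrange(n_topics)
            z.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: P(topic) is proportional to
                # (doc-topic count + alpha) * (topic-word count + beta)
                # / (topic size + vocabulary size * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

def top_words(topic_word, n=3):
    """Return the n highest-count words for each topic."""
    return [sorted(counts, key=counts.get, reverse=True)[:n]
            for counts in topic_word]

# Hypothetical tokenized titles, for illustration only.
docs = [
    ["machine", "learning", "python"],
    ["data", "science", "machine", "learning"],
    ["startup", "marketing", "brand"],
    ["business", "marketing", "startup", "money"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
print(top_words(topic_word))
```

The number of topics is chosen by the analyst up front, which is why the assumption of one topic per publication matters for how the results are read.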
In Image 16, the six suggested topics are grouped into three clusters, which could indicate that some of the publications share common topics. Next, let us find the most relevant words for each topic.
Images 17 through 22 show each topic's 30 most relevant terms. Topics 2 and 4 (the upper cluster) appear to be related to data science and machine learning, since they contain words such as data, learning, Python, machine, science, deep, neural, models, and predicting.
On the other hand, topics 3, 5, and 6 (the bottom-left cluster) appear to be related to businesses, startups, and marketing, since they contain words such as business, work, success, money, marketing, brand, strategies, and content.
Finally, Topic 1 (the bottom-right cluster) forms its own cluster and is a noisy topic. The words strong, markup, and class are the most frequent because they come from HTML markup left in the article titles.
While visualizing text data still represents an ongoing challenge, multiple visualization tools are helpful for extracting meaningful information and insights from it. Some help analysts get a big picture of the most frequently used words, some help spot the main topics and fields covered within a corpus of texts, and others help understand the relationship between n-grams.
Natural language processing is an exciting field for data science practitioners and researchers, who keep looking for better and more reliable ways of transforming text data into information, information into knowledge, and knowledge into wisdom.