What Big Data Can Tell Us about What Words Mean: Saving Weighted Dictionaries with Word Embeddings

Friday, April 2, 2021 - 12:00pm

Place:

Virtual

Dustin Stoltz

Department of Sociology and Anthropology
Lehigh University

Much effort has been expended building weighted dictionaries based on hand-coding words' semantic content. Such pre-made dictionaries are used for a variety of tasks, such as sentiment analysis, extracting cognitive content, measuring the "abstractness" of text, and identifying the sensorimotor norms of words. Although these dictionaries are intuitive tools for identifying motives, desires, ideas, connotations, and themes, they cannot overcome the problems associated with the long-tail distribution of words in a corpus. Furthermore, even very large weighted dictionaries, such as crowd-sourced dictionaries of about 40 thousand words, will encounter the problem of out-of-vocabulary words. Text analysts can overcome these problems using word embeddings. Embeddings use the co-occurrence of words in large corpora to assign words locations in a multi-dimensional space of meaning. These locations can be used to "weight" words by measuring their distance from "anchors" in this space of meaning. I propose a method which combines the strengths of these two approaches by defining and validating these "anchors" using hand-weighted dictionaries.
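
As a rough illustration of the general idea of weighting words by their position relative to an "anchor" (not the speaker's actual method, which the abstract does not spell out), the sketch below builds an anchor direction from a tiny hand-weighted dictionary and scores words that the dictionary does not cover. The embedding values, the dictionary weights, the use of a weighted sum as the anchor, the use of cosine similarity as the distance measure, and all function names are assumptions made for this example only.

```python
import math

# Toy 3-dimensional embeddings with made-up values (assumption: real
# embeddings would come from a model trained on a large corpus).
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "awful": [-0.8, 0.2, 0.1],
    "superb": [0.7, 0.3, 0.1],     # not in the hand-weighted dictionary
    "dreadful": [-0.7, 0.3, 0.1],  # not in the hand-weighted dictionary
}

# Hypothetical hand-weighted sentiment dictionary (weights are invented).
DICTIONARY = {"good": 1.0, "great": 0.9, "bad": -1.0, "awful": -0.9}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_anchor(dictionary, embeddings):
    """Sum the embeddings of dictionary words, scaled by their hand-coded
    weights, to get one anchor direction (an assumed construction)."""
    dim = len(next(iter(embeddings.values())))
    anchor = [0.0] * dim
    for word, weight in dictionary.items():
        for i, value in enumerate(embeddings[word]):
            anchor[i] += weight * value
    return anchor


def embedding_weight(word, anchor, embeddings):
    """Weight a word by its cosine similarity to the anchor direction."""
    return cosine(embeddings[word], anchor)


anchor = build_anchor(DICTIONARY, EMBEDDINGS)
for word in ("superb", "dreadful"):
    print(word, round(embedding_weight(word, anchor, EMBEDDINGS), 3))
```

In this toy setup, a word missing from the dictionary still receives a weight because it has a location in the embedding space. A simple validation step under the same assumptions would be to check that words already in the dictionary receive embedding-based weights that agree in sign with their hand-coded weights.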

Virtual event via ZOOM: https://lehigh.zoom.us/j/92365141692

Event Semester:

Spring 2021