FYI: The latest approach and tools for text mining: An Improved Method of Automated Nonparametric Content Analysis for Social Science


An Improved Method of Automated Nonparametric Content Analysis for Social Science

Citation:

Connor T. Jerzak, Gary King, and Anton Strezhnev. Working Paper. "An Improved Method of Automated Nonparametric Content Analysis for Social Science". Copy at http://j.mp/2DyLYxL

Abstract:

Computer scientists and statisticians are often interested in classifying textual documents into chosen categories. Social scientists and others are often less interested in any one document and instead try to estimate the proportion falling in each category. The two existing types of techniques for estimating these category proportions are parametric "classify and count" methods and "direct" nonparametric estimation of category proportions without an individual classification step. Unfortunately, classify and count methods can sometimes be highly model dependent or generate more bias in the proportions even as the percent correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning and usage of language is too similar across categories or too different between training and test sets. We develop an improved direct estimation approach without these problems by introducing continuously valued text features optimized for this problem, along with a form of matching adapted from the causal inference literature. We evaluate our approach in analyses of a diverse collection of 73 data sets, showing that it substantially improves performance compared to existing approaches. As a companion to this paper, we offer easy-to-use software that implements all ideas discussed herein.
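
To make the "direct" estimation idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's improved method (no continuous text features, no matching) and not the companion software. It only illustrates the basic premise that the feature averages in an unlabeled set are a mixture of the per-category feature averages from a labeled set, so the category proportions can be solved for without classifying any single document. The function name `estimate_proportions`, the toy numbers, and the crude clip-and-renormalize step are all assumptions made for illustration.

```python
import numpy as np


def estimate_proportions(cond_feature_means, target_feature_means):
    """Estimate category proportions pi from target ~= cond @ pi.

    cond_feature_means: array of shape (n_features, n_categories); column j
        holds the mean feature values among labeled documents in category j.
    target_feature_means: array of shape (n_features,); mean feature values
        in the unlabeled set whose category proportions we want.
    """
    # Ordinary least squares for the mixture weights.
    pi, *_ = np.linalg.lstsq(cond_feature_means, target_feature_means, rcond=None)
    # Crude projection back to valid proportions (an illustrative shortcut,
    # not a proper constrained solver).
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()


# Hypothetical toy data: 4 features (rows) and 3 categories (columns).
cond = np.array([
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.9],
    [0.5, 0.2, 0.1],
])

# Assumed true proportions in the unlabeled set, used only to build the toy target.
true_pi = np.array([0.5, 0.3, 0.2])

# Noise-free feature means of the unlabeled set under the mixture assumption.
target = cond @ true_pi

est = estimate_proportions(cond, target)
print(np.round(est, 3))
```

In this noise-free toy case the solver recovers the assumed proportions. With real text, the per-category feature averages are estimated from a finite labeled set and may differ between training and test sets, which is the failure mode the abstract says the improved method is designed to address.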

Last updated on 05/18/2018