Machine Learning for Text Analysis by milesdwilliams15

Here's what I'm doing to create a structural topic model (STM) of primary and general presidential debates:

## Structural Topic Model (STM) of Primary & General Presidential Debates (2000-2016)
# Debate Transcripts stored in: text <- c(R1,...,P15)
# Metadata stored in: metDat<-data.frame(debateID,year)
library(stm) # load the stm package

First, I've created a character vector called "text", where each element (R1,...,P15) holds the transcript of one primary or general election debate. I then created an object called "metDat", a data frame of metadata for each transcript: the election year of the debate and whether it was a Republican primary debate, a Democratic primary debate, or a general election presidential debate. Once I've made these, I load the "stm" package in R.
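As a rough sketch of that setup, the snippet below builds a three-document version of the two objects. The transcript strings are made-up placeholders, not real debate text, and the debateID labels are assumed to be the ones used later in the plotting code ("Republican", "Democrat", "Presidential").

```r
# Toy stand-ins for the real transcripts (hypothetical strings, not debate text)
R1 <- "We need lower taxes and a stronger military."
D1 <- "Health care and education must come first."
P1 <- "The economy and foreign policy divide the candidates."

# One character string per debate, kept in a fixed order
text <- c(R1, D1, P1)

# One metadata row per debate, in the same order as `text`
metDat <- data.frame(
  debateID = factor(c("Republican", "Democrat", "Presidential")),
  year     = c(2000, 2000, 2000)
)

# Every transcript needs exactly one metadata row
stopifnot(is.character(text), length(text) == nrow(metDat))
str(metDat)
```

The row order matters: row i of metDat has to describe element i of text, because the documents and their metadata are matched by position.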

Before I can actually estimate an STM, I have to process the text data using the textProcessor() function in the stm package:

# Process text data:
debates<-textProcessor(documents=text,
                   metadata=metDat,
                   verbose=FALSE)

The processed text is now in an object called "debates."
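To see what such a processed object contains, here is a minimal sketch that runs textProcessor() on the gadarian survey data that ships with stm. That data is only a stand-in for the debate transcripts, and the object name processed is hypothetical. The returned list should hold the documents, the vocabulary, the metadata, and the indices of any documents that were dropped.

```r
library(stm)  # textProcessor() relies on the tm package; stemming may need SnowballC

# gadarian: example survey data shipped with stm, used here as a stand-in corpus
processed <- textProcessor(documents = gadarian$open.ended.response,
                           metadata  = gadarian,
                           verbose   = FALSE)

names(processed)             # components of the processed object
length(processed$documents)  # number of documents kept
length(processed$vocab)      # number of unique terms after cleaning
processed$docs.removed       # indices of documents dropped during processing, if any

# Documents and metadata rows should still line up one-to-one
stopifnot(length(processed$documents) == nrow(processed$meta))
```

If any documents are dropped, the metadata stored inside the processed object (here processed$meta) stays aligned with the remaining documents, while the original metadata object does not.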

Now I can go ahead and estimate an STM using the stm() function:

# Estimate STM:
stm1<-stm(documents=debates$documents,
          vocab=debates$vocab,
          K=15,
          prevalence=~debateID+s(year),
          data=metDat)
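For a version of this step that can be run without the debate data, the sketch below fits a small model to the gadarian example data shipped with stm. The names processed, out, and fit are hypothetical, and K = 3, the seed, and the iteration cap are arbitrary values chosen to keep the example quick. They are not recommendations for the debate corpus.

```r
library(stm)

# Stand-in corpus: the gadarian survey data that ships with stm
processed <- textProcessor(gadarian$open.ended.response,
                           metadata = gadarian, verbose = FALSE)

# prepDocuments() drops rare terms and keeps documents, vocab, and meta aligned
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     verbose = FALSE)

fit <- stm(documents  = out$documents,
           vocab      = out$vocab,
           K          = 3,                        # number of topics (arbitrary here)
           prevalence = ~ treatment + s(pid_rep), # covariates for topic prevalence
           data       = out$meta,                 # metadata aligned with the documents
           init.type  = "Spectral",
           seed       = 2016,                     # fixed seed for repeatable runs
           max.em.its = 10,                       # low cap so the sketch finishes fast
           verbose    = FALSE)

dim(fit$theta)  # documents x topics matrix of estimated topic proportions
```

Passing the metadata that comes out of the processing step, rather than the original data frame, guards against a mismatch if any documents were removed along the way.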

Results can be visualized in a number of ways by calling plot() on the fitted model, which dispatches to the package's plot method for STM objects (recent versions of stm do not export plot.STM() directly):

# Visualize results in "label" format:
par(mfcol=c(1,3))
plot(stm1,type="labels",topics=1:5,width=50,text.cex=1.25)
plot(stm1,type="labels",topics=6:10,width=50,text.cex=1.25)
plot(stm1,type="labels",topics=11:15,width=50,text.cex=1.25)

# Visualize results in "summary" format:
par(mfcol=c(1,1))
plot(stm1,type="summary")
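The words behind the "labels" plot can also be printed as text with labelTopics(). The sketch below uses gadarianFit, a small three-topic model that ships with stm, as a stand-in for stm1. The object name topic_labels and the choice of five words per topic are assumptions for illustration.

```r
library(stm)

# gadarianFit: a small fitted model shipped with stm, standing in for stm1
topic_labels <- labelTopics(gadarianFit, n = 5)

print(topic_labels)  # top words per topic under several weighting schemes
topic_labels$frex    # matrix of FREX words: one row per topic, one column per word
```

Reading the word lists as text is often easier than reading them off a plot when deciding what each topic is about.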

Beyond these basic summary visualizations, I can also estimate and plot covariate effects. Doing so lets me show which topics are more likely to appear in Democratic vs. Republican vs. general election debates, both overall and over time:

# Estimate Model Effects:
stm1effect<-estimateEffect(formula=1:15~debateID+s(year), #s() smooths the effect of year
                           stmobj=stm1,
                           metadata=metDat)

# Visualize Estimated Effects:
plot(stm1effect,
     covariate="debateID",
     model=stm1,
     topics=stm1effect$topics[1:5]) #makes a point-estimate plot

# Make overlapping continuous plots to show expected topic proportions over time per type of debate:
plot(stm1effect,                    #Topic proportions in Rep. debates
     covariate="year",
     model=stm1,
     topics=stm1effect$topics[10],
     method="continuous",
     xlab="Election Year",
     ylab="Expected Topic Proportions",
     moderator="debateID",
     moderator.value="Republican",
     ylim=c(-.1,.45),
     linecol="red",
     printlegend=FALSE)
plot(stm1effect,                    #Topic proportions in Dem. debates
     covariate="year",
     model=stm1,
     topics=stm1effect$topics[10],
     method="continuous",
     xlab="Election Year",
     ylab="Expected Topic Proportions",
     moderator="debateID",
     moderator.value="Democrat",
     ylim=c(-.1,.45),
     linecol="blue",
     printlegend=FALSE,add=TRUE)
plot(stm1effect,                    #Topic proportions in Pres. debates
     covariate="year",
     model=stm1,
     topics=stm1effect$topics[10],
     method="continuous",
     xlab="Election Year",
     ylab="Expected Topic Proportions",
     moderator="debateID",
     moderator.value="Presidential",
     ylim=c(-.1,.45),
     linecol="green",
     printlegend=FALSE,add=TRUE)
# Add a manual legend so the three lines can be told apart:
legend("topleft",
       legend=c("Republican","Democrat","Presidential"),
       col=c("red","blue","green"),
       lwd=2)
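The plots above are drawn from regressions of topic proportions on the covariates, and the coefficients themselves can be inspected with summary(). A minimal sketch, again using the gadarian data and gadarianFit model that ship with stm as stand-ins for the debate objects (the name prep is hypothetical):

```r
library(stm)

# Regress the proportions of topics 1-3 on the treatment indicator
prep <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)

# Regression table (estimates, standard errors, p-values) for topic 1
summary(prep, topics = 1)

# Point-estimate plot of the same effect across all three topics
plot(prep, covariate = "treatment", model = gadarianFit, method = "pointestimate")
```

The estimates are based on simulation, so the numbers in the table can shift slightly from run to run.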

Finally, I can plot the correlations between topics:

#Visualize Topic Correlations:
plot(topicCorr(stm1))
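The network plot only shows which topics are linked. To look at the correlations themselves, the object returned by topicCorr() can be inspected directly. A small sketch, using the gadarianFit example model shipped with stm in place of stm1 (the name tc is hypothetical):

```r
library(stm)

# Correlations between topics, computed from the fitted model's topic proportions
tc <- topicCorr(gadarianFit)

round(tc$cor, 2)  # topic-by-topic correlation matrix
tc$posadj         # 1 where two topics are positively correlated above the cutoff
```

Topics joined by an edge in the plot are the ones marked in tc$posadj, so the matrix makes it easy to check how strong each plotted link is.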
