kwic() function returns fewer rows than it should
I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just... works.
# create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

# tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
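One thing I suspect, though I haven't verified it, is that the two counts measure different things: grepl() matches substrings inside longer words (common with German compounds), while kwic() matches whole tokens. A minimal sketch of the difference:

# hedged illustration of substring vs. whole-token matching, not a confirmed fix
grepl("Trabi", "Trabis", ignore.case = TRUE)      # TRUE: substring match
nrow(kwic(tokens("Trabis"), pattern = "Trabi"))   # 0: no whole-token match
nrow(kwic(tokens("Trabis"), pattern = "Trabi*"))  # 1: a glob wildcard recovers it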
See also questions close to this topic
-
pivot_wider does not keep all the variables
I would like to keep the variable cat (category) in the output of my function. However, I am not able to keep it. The idea is to apply a function similar to m <- 1 - (1 - se * p2)^df$n based on the category. But in order to perform that step, I need to keep the variable category. Here's the code:
#script3
suppressPackageStartupMessages({
  library(mc2d)
  library(tidyverse)
})

sim_one <- function() {
  df <- data.frame(id = c(1:30),
                   cat = c(rep("a", 12), rep("b", 18)),
                   month = c(1:6, 1, 6, 4, 1, 5, 2, 3, 2, 5, 4, 6, 3:6, 4:6, 1:5, 5),
                   n = rpois(30, 5))
  nr <- nrow(df)
  df$n[df$n == "0"] <- 3
  se <- rbeta(nr, 96, 6)
  epi.a <- rpert(nr, min = 1.5, mode = 2, max = 3)
  p <- 0.2
  p2 <- epi.a * p
  m <- 1 - (1 - se * p2)^df$n
  results <- data.frame(month = df$month, m, df$cat)
  results %>%
    arrange(month) %>%
    group_by(month) %>%
    mutate(n = row_number(), .groups = "drop") %>%
    pivot_wider(
      id_cols = n,
      names_from = month,
      names_glue = "m_{.name}",
      values_from = m
    )
}

set.seed(99)
iters <- 1000
sim_list <- replicate(iters, sim_one(), simplify = FALSE)
sim_list[[1]]
#> # A tibble: 7 x 7
#>       n   m_1   m_2   m_3   m_4   m_5   m_6
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.970 0.623 0.905 0.998 0.929 0.980
#> 2     2 0.912 0.892 0.736 0.830 0.890 0.862
#> 3     3 0.795 0.932 0.553 0.958 0.931 0.798
#> 4     4 0.950 0.892 0.732 0.649 0.777 0.743
#> 5     5    NA    NA    NA 0.657 0.980 0.945
#> 6     6    NA    NA    NA 0.976 0.836    NA
#> 7     7    NA    NA    NA    NA 0.740    NA
Created on 2022-05-07 by the reprex package (v2.0.1)
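A hedged sketch of one possibility (not a verified fix): listing the category column, which data.frame() named df.cat, among the id columns so pivot_wider() carries it through:

# assumption: keeping df.cat as an id column preserves the category per row
results %>%
  arrange(month) %>%
  group_by(month) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = c(n, df.cat),
    names_from = month,
    names_glue = "m_{month}",
    values_from = m
  )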
-
Calculate weighted average over several columns with NA
I have a data frame like this one:
ID  duration1  duration2  total_duration  quantity1  quantity2
 1          5          2               7          3          1
 2         NA          4               4          3          4
 3          5         NA               5          2         NA
I would like to do a weighted mean for each subject like this:
df$weighted_mean <- ((df$duration1 * df$quantity1) + (df$duration2 * df$quantity2)) / df$total_duration
But as I have NAs, this command does not work, and it is not very elegant.
The result would be this:
ID  duration1  duration2  total_duration  quantity1  quantity2  weighted_mean
 1          5          2               7          3          1           2.43
 2         NA          4               4          3          4           4
 3          5         NA               5          2         NA           2
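A minimal base-R sketch that would reproduce these values, assuming a product involving NA should simply drop out of the numerator:

# treat NA duration*quantity products as zero via rowSums(..., na.rm = TRUE)
num <- rowSums(cbind(df$duration1 * df$quantity1,
                     df$duration2 * df$quantity2),
               na.rm = TRUE)
df$weighted_mean <- num / df$total_duration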
Thanks in advance for the help
-
Extracting data from a netCDF file for a specific location using R produces an error at the end of the code
I need some help with extracting data from NetCDF files using R. I downloaded them from CORDEX (the Coordinated Regional Climate Downscaling Experiment). In total I have several files. The files have dimensions of (longitude, latitude, time), and the variable is maximum temperature (tasmax). I need to extract the tasmax data at a specific location for different times. I wrote the code in R, but at the end of the code an error appears: "location subscript out of bounds".
getwd()
setwd("C:/Users/20120/climate change/rcp4.5/tasmax")
dir()

library(ncdf4)
library(ncdf4.helpers)
library(chron)

ncin <- nc_open("tasmax_AFR-44_ICHEC-EC-EARTH_rcp45_r1i1p1_KNMI-RACMO22T_v1_mon_200601-201012.nc")
lat <- ncvar_get(ncin, "lat")
lon <- ncvar_get(ncin, "lon")
tori <- ncvar_get(ncin, "time")
title <- ncatt_get(ncin, 0, "title")
institution <- ncatt_get(ncin, 0, "institution")
datasource <- ncatt_get(ncin, 0, "source")
references <- ncatt_get(ncin, 0, "references")
history <- ncatt_get(ncin, 0, "history")
Conventions <- ncatt_get(ncin, 0, "Conventions")
tunits <- ncatt_get(ncin, "time", "units")
tustr <- strsplit(tunits$value, "")
ncin$dim$time$units
ncin$dim$time$calendar
tas_time <- nc.get.time.series(ncin, v = "tasmax", time.dim.name = "time")
tas_time[c(1:3, length(tas_time) - 2:0)]
tmp.array <- ncvar_get(ncin, "tasmax")
dunits <- ncatt_get(ncin, "tasmax", "units")
tmp.array <- tmp.array - 273.15
nc_close(ncin)
which.min(abs(lat - 28.9))
which.min(abs(lon - 30.2))
tmp.slice <- tmp.array[126, 32981, ]
tmp.slice
Error in tmp.array[126, 32981, ] : subscript out of bounds
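A hedged guess at the cause (an assumption, not verified): on a CORDEX rotated-pole grid, lat and lon are 2-D matrices, so which.min() returns a linear index such as 32981, which cannot be used directly as a dimension subscript. Converting it first might help:

# convert the linear index of the nearest grid cell into row/column subscripts
idx <- arrayInd(which.min(abs(lat - 28.9) + abs(lon - 30.2)), dim(lat))
tmp.slice <- tmp.array[idx[1], idx[2], ]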
-
Shop name classification
I have a list of merchant names and their corresponding Merchant Category Codes (MCC). It seems that about 80 percent of the MCCs are correct. There are about 300 MCCs in total. A merchant name may contain one, two, or three words. I need to predict the MCC from the merchant name. How can I do that?
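A hedged sketch of one possible approach (the object names merchants$name and merchants$mcc are illustrative, not from the question): treat each merchant name as a tiny document and fit a supervised text classifier, for example naive Bayes from quanteda.textmodels:

library(quanteda)
library(quanteda.textmodels)

# one bag of words per merchant name, labeled with its MCC
dfm_train <- dfm(tokens(merchants$name))
model <- textmodel_nb(dfm_train, y = merchants$mcc)
head(predict(model, newdata = dfm_train))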
-
R: How can I add titles based on grouping variable in word_associate?
I am using the word_associate() function from the qdap package in R Markdown to create word clouds across a grouping variable with multiple categories. I would like the titles of each word cloud to be drawn from the character values of the grouping variable.
I have added trans_cloud(title=TRUE) to my code, but have not been able to resolve my problem. Here's my code, which runs but doesn't produce graphs with titles:
library(qdap)

word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               trans_cloud(title = TRUE))
I have also tried the following, which does not run:
library(qdap)

word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               title = TRUE)
Can anyone help me figure this out? I can't find any guidance in the documentation, and there are hardly any examples of or discussions about word_associate on the web.
Here's an example data frame that reproduces the problem:
id         text
question1  I love cats even though I'm allergic to them.
question1  I hate cats because I'm allergic to them.
question1  Cats are funny, cute, and sweet.
question1  Cats are mean and they scratch me.
question2  I only pet cats when I have my epipen nearby.
question2  I avoid petting cats at all cost.
question2  I visit a cat cafe every week. They have 100 cats.
question2  I tried to pet my friend's cat and it attacked me.
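For convenience, a runnable reconstruction of this data frame (structure assumed from the table above):

# rebuild the example data shown above as a data frame
df <- data.frame(
  id = rep(c("question1", "question2"), each = 4),
  text = c("I love cats even though I'm allergic to them.",
           "I hate cats because I'm allergic to them.",
           "Cats are funny, cute, and sweet.",
           "Cats are mean and they scratch me.",
           "I only pet cats when I have my epipen nearby.",
           "I avoid petting cats at all cost.",
           "I visit a cat cafe every week. They have 100 cats.",
           "I tried to pet my friend's cat and it attacked me."))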
Note that if I run this in R (instead of R Markdown), the figures automatically print "question1_list1" and "question2_list1" in bright blue at the top of the figure file. This doesn't work for me because I need the titles to exclude "_list1" and to be written in black. These automatically generated titles do not respond to changes in my trans_cloud specifications. For example:
library(qdap)

word_associate(df$text,
               match.string = c("cat"),
               grouping.var = c(df$id),
               text.unit = "sentence",
               stopwords = c(Top200Words),
               wordcloud = TRUE,
               cloud.colors = c("#0000ff", "#FF0000"),
               trans_cloud(title = TRUE, title.color = "#000000"))
In addition, I'm locked into using this package (as opposed to other options for creating word clouds in R) because I'm also using it to create network plots.
-
How to solve missing words in nltk.corpus.words.words()?
I have tried to remove non-English words from a text. The problem is that many valid English words are also absent from the NLTK words corpus.
My code:
import pandas as pd
import nltk

lst = ['I have equipped my house with a new [xxx] HP203X climatisation unit']
df = pd.DataFrame(lst, columns=['Sentences'])

nltk.download('words')
words = set(nltk.corpus.words.words())

df['Sentences'] = df['Sentences'].apply(
    lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words))
df
Input : "I have equipped my house with a new [xxx] HP203X climatisation unit" Result : "I have my house with a new unit"
Should have been : "I have equipped my house with a new climatisation unit"
I can't figure out how to extend nltk.corpus.words.words() so that words like "equipped" and "climatisation" are not removed from the sentences.
-
How to properly tokenize column in pandas?
I am trying to solve a tokenization problem in my dataset of social media comments. I want to tokenize, lemmatize, and remove punctuation and stop words from the pandas column, but I am struggling with how to do it for each comment. I receive the following error when trying to get tokens:
import pandas as pd
import nltk
...
merged['message_tokens'] = merged.apply(
    lambda x: nltk.tokenize.word_tokenize(x['Clean_message']), axis=1)

TypeError: expected string or bytes-like object
When I am trying to tell pandas that I am passing it a string object, it gives me the following error message:
merged['message_tokens'] = merged.apply(
    lambda x: nltk.tokenize.word_tokenize(x['Clean_message'].str), axis=1)

AttributeError: 'str' object has no attribute 'str'
What am I doing wrong?
-
Subword tokenization for the words with mistakes
Currently, I am working on tokenization techniques for languages that do not have spaces between words (Thai, Chinese, Japanese, etc.), in preparation for grammar checking in later stages of the project.
One question arises: how can we apply tokenization techniques if there are typos or other mistakes in the data? What would be the best way to approach this task?
-
Tokenization of Compound Words not Working in Quanteda
I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately I'm running into an error when attempting to tokenize the underlying dataset.
This is the subset of the dataset I'm using as a reproducible example:
test_cluster <- speeches_subset %>%
  filter(grepl('Schwester Agnes',
               speechContent,
               ignore.case = TRUE))

test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")
Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound word "Schwester Agnes". test_corpus transforms the underlying data into a quanteda corpus object.
When I then run the following code, I would expect, first, the content of the speechContent variables to be tokenized and, due to tokens_compound, the compound word "Schwester Agnes" to be tokenized as such. In a second step, I would expect the kwic() function to return a dataframe consisting of six rows, with the keyword variable including the compound word "Schwester Agnes". Instead, however, kwic() returns an empty dataframe containing 0 observations of 7 variables. I think this is because of some mistake I'm making with tokens_compound(), but I'm not sure... Any help would be greatly appreciated!

test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))

test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)
EDIT: I realize that the examples above are not easily reproducible, so please refer to the reprex below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.") data <- data.frame(id=1:3, speechContent = speech) test_corpus <- corpus(data, docid_field = "id", text_field = "speechContent") test_tokens <- tokens(test_corpus, remove_punct = TRUE, remove_numbers = TRUE) %>% tokens_compound(pattern = c("stack", "overflow")) test_kwic <- kwic(test_tokens, pattern = "stack overflow", window = 5)
-
How do I change the encoding of the texts within a corpus to UTF-8?
I am working with a corpus of speeches from the ParlSpeech v2 dataset. I would like to change the encoding of all of the texts to UTF-8. I am using R on Windows.
I have already checked the encoding of the corpus:
encoding(corp_labcon)
Probable encoding: UTF-8 (but other encodings also detected)
Encoding proportions: [***************-------------............~~~~~~~~aaaaaaaabbbb]
Samples of the first text as:
  [*] UTF-8         Like my hon. Friend the Member for Halesowen and Stourbridge
  [-] windows-1252  Like my hon. Friend the Member for Halesowen and Stourbridge
  [.] ISO-8859-1    Like my hon. Friend the Member for Halesowen and Stourbridge
  [~] ISO-8859-2    Like my hon. Friend the Member for Halesowen and Stourbridge
  [a] ISO-8859-9    Like my hon. Friend the Member for Halesowen and Stourbridge
Error in stri_encode(x[1], topEncodingsTable$encoding[i]) :
  The requested ICU resource file cannot be found. (U_FILE_ACCESS_ERROR)
Up until now, I have only found solutions that use the iconv command, for which I need to know exactly which encodings were used, or solutions for single text files. The solution could also be applied to the data frame from which I created the corpus. Any suggestions would be greatly appreciated, thank you!
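A hedged sketch of what I have in mind (the data frame and column names are illustrative, not from the dataset): forcing the text column to UTF-8 with stringi before rebuilding the corpus:

library(stringi)

# replace invalid byte sequences instead of failing on them
speeches_df$text <- stri_enc_toutf8(speeches_df$text, validate = TRUE)
corp_labcon <- corpus(speeches_df, text_field = "text")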
-
Tidyr Unite() Function Returns Empty Data Frame
When trying to merge two columns (pre and post) in a kwic dataframe created with the quanteda package, the resulting data frame contains only NA values. Using the paste() function from base R works perfectly fine, but I'd rather solve this issue with a tidy approach. Has anyone else experienced this before and knows what to do?
I'm including a reprex below, but unfortunately, in the reprex the unite function works perfectly fine. I'm wondering if it's related to the input being a data frame created with quanteda::kwic?
pre = c("Pre Text 1", "Pre Text 2", "Pre Text 3") post = c("Post Text 1", "Post Text 2", "Post Text 3") data <- data.frame(id=1:3, pre = pre, post = post) data2 <- data %>% unite("merged", pre, post, sep = " ")
EDIT: I'm including a better example in the code below. "x" is a data frame that resulted from applying kwic() to my dataset, and speeches_meta is metadata associated with the texts contained in "x". My issue is that when running the unite function on the "dput" object, it somehow doubles the number of variables, and all of the observations except two are empty (while the two that aren't contain a jumble of information from all variables).
merged_kwic <- left_join(x, speeches_meta, by = "docname")
dput <- dput(merged_kwic[1:3, c("pre", "post")])
dput <- dput %>%
  unite("merged", pre, post, sep = " ")
EDIT 2:
This is the output I get after running the following code:
dput(merged_kwic[1:3, c("pre", "post")])
structure(list(docname = c("585662", "586622", "650973"),
               from = c(377L, 1665L, 562L),
               to = c(377L, 1665L, 562L),
               pre = c("5 Dies kann weder durch",
                       "tief in die Mottenkiste der",
                       "unterstellen dass es ihnen um"),
               keyword = c("Ostalgie", "Ostalgie", "Ostalgie"),
               post = c("noch durch Amnesie durch Gedächtnisverlust",
                        "greifen würden 33 An dieser",
                        "geht um eine Werbung für"),
               pattern = structure(c(1L, 1L, 1L), .Label = "ostalgie", class = "factor"),
               id = c(585662, 586622, 650973),
               session = c(241, 245, 56),
               electoralTerm = c(13, 13, 15),
               firstName = c("Dietrich", "werner", "Vera"),
               lastName = c("Austermann", "schulz", "Lengsfeld"),
               politicianId = c(11000066, 11002108, 11002721),
               factionId = c(4, 3, 4),
               documentUrl = c("https://dip21.bundestag.de/dip21/btp/13/13241.pdf",
                               "https://dip21.bundestag.de/dip21/btp/13/13245.pdf",
                               "https://dip21.bundestag.de/dip21/btp/15/15056.pdf"),
               positionShort = c("Member of Parliament", "Member of Parliament",
                                 "Member of Parliament"),
               positionLong = c(NA_character_, NA_character_, NA_character_),
               date = structure(c(10395, 10402, 12236), class = "Date")),
          ntoken = c(`585662` = 839L, `586622` = 1724L, `650973` = 647L),
          row.names = c(NA, 3L),
          class = c("kwic", "data.frame"))
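A hedged guess (not verified): the dput output shows that the subset keeps the class c("kwic", "data.frame") plus extra attributes such as ntoken, which tidyverse verbs may not handle cleanly. Dropping the kwic class before uniting might help:

# strip the kwic class so unite() sees a plain data frame
plain <- as.data.frame(merged_kwic[1:3, c("pre", "post")])
plain <- plain %>%
  unite("merged", pre, post, sep = " ")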