Split a column of strings into multiple columns
I have a column with strings. Each variable is separated with a comma. Rows have different numbers of variables. Sometimes variable x is first, sometimes third, sometimes it's missing. Each entry looks like this: "year:2005". How can I extract the data from this column?
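A minimal tidyr sketch of one way to do this. The data frame `df`, the column name `info`, and the extra keys (`color`, `size`) are made up for illustration; `key:value` pairs stand in for entries like "year:2005".

```r
# Hypothetical data: each row is a comma-separated list of key:value pairs,
# in varying order and with some keys missing.
library(dplyr)
library(tidyr)

df <- data.frame(
  id = 1:3,
  info = c("year:2005,color:red",
           "color:blue,year:2007,size:XL",
           "size:S"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  separate_rows(info, sep = ",") %>%                    # one key:value pair per row
  separate(info, into = c("key", "value"), sep = ":") %>%
  pivot_wider(names_from = key, values_from = value)    # one column per key
```

Rows that lack a key simply get NA in that column, which handles the "sometimes x is missing" case.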
See also questions close to this topic

R SVM: how to compute odds ratio?
I'm struggling a bit to understand how to compute Odds ratio out of an svm regression.
My training dataset is: train.data; my test dataset is: test.data
Basically, I have the following model:
model_svm <- train(y ~ x + z, data = train.data, method = "svmLinear", trControl = trainControl("cv", number = 10))
At this stage, how can I compute the odds ratio?
Thank you very much for your time and help,
Best, Maurizio
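One hedged note: an SVM decision function does not estimate log-odds, so odds ratios are not directly defined for it. A common alternative is to fit a logistic regression on the same formula and exponentiate its coefficients. A sketch with simulated stand-in data (`train.data` here is made up, not the asker's data):

```r
# 'train.data' is simulated purely for illustration.
set.seed(1)
train.data <- data.frame(x = rnorm(100), z = rnorm(100))
train.data$y <- factor(ifelse(train.data$x + train.data$z + rnorm(100) > 0,
                              "yes", "no"))

# Logistic regression estimates log-odds; exponentiating the coefficients
# gives an odds ratio per unit increase of each predictor.
fit <- glm(y ~ x + z, data = train.data, family = binomial)
odds_ratios <- exp(coef(fit))
```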

How to remove rows that contain identical pairs in opposite order in 2 columns
In a correlation matrix I would like to get rid of rows that contain essentially the same information as another row, except that the var1 and var2 columns hold "B" and "A" instead of "A" and "B".
   var1 var2     value
1   cyl  mpg 0.8521620
2  disp  mpg 0.8475514
3    wt  mpg 0.8676594
4   mpg  cyl 0.8521620
5  disp  cyl 0.9020329
6    hp  cyl 0.8324475
7    vs  cyl 0.8108118
8   mpg disp 0.8475514
9   cyl disp 0.9020329
10   wt disp 0.8879799
11  cyl   hp 0.8324475
12  mpg   wt 0.8676594
13 disp   wt 0.8879799
14  cyl   vs 0.8108118
Here we could drop, for instance, row 4 with mpg vs cyl, since we already have cyl vs mpg in row 1.
I know I could filter for unique values in the value column, but I don't want to do this: with my enormous data set there is actually a chance of getting an identical correlation score for multiple pairs of columns. So it has to be done by matching the names in the var1 and var2 columns.
I have this code so far to filter out data rows that are above a certain correlation value, but are not 1 (variable vs itself)
mtcars %>%
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(value > 0.8 | value < -0.8) %>%
  filter(value != 1)
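One possible approach, sketched on a small stand-in data frame: build an order-independent key with pmin()/pmax() so that (mpg, cyl) and (cyl, mpg) collapse to the same key, then keep the first row per key:

```r
library(dplyr)

# Stand-in for the correlation pairs (a subset of the table above)
df <- data.frame(
  var1  = c("cyl", "disp", "mpg", "cyl"),
  var2  = c("mpg", "mpg", "cyl", "disp"),
  value = c(0.8521620, 0.8475514, 0.8521620, 0.9020329),
  stringsAsFactors = FALSE
)

deduped <- df %>%
  mutate(key = paste(pmin(var1, var2), pmax(var1, var2))) %>%  # "cyl mpg" for both orders
  distinct(key, .keep_all = TRUE) %>%
  select(-key)
```

Because the key ignores the order of var1 and var2, this drops the mpg/cyl duplicate while keeping pairs that happen to share the same correlation value.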
EDIT
Andre's answer
cor %>% {(function(x){x[upper.tri(x)] <- NA; x})(.)} %>%
is faster, but Rui's answer is more generic and can be applied to other situations other than cor matrix calculations.
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval cld
 Andre 4.818793 5.113676 5.630160 5.408955 5.704825 22.33730   100  a
   Rui 5.413692 5.761669 7.531146 6.003656 6.583750 78.02836   100   b

How to stop dplyr::summarize from using alphabetical order in R
I want to summarize relocations (between cities), based on a unique ID number. A sample dataframe, with two unique ID's:
  year ID city   adress
1 2013  1    B adress_1
2 2014  1    B adress_1
3 2015  1    A adress_2
4 2016  1    A adress_2
5 2013  2    B adress_3
6 2014  2    B adress_3
7 2015  2    C adress_4
8 2016  2    C adress_4
I have provided a sample code below. The summaries are correct, except for one thing. If, for example, a relocation is found between city B and city A, I want the output to report a relocation from city B to city A (with number of times = 1, i.e. seen once in the dataframe). However, because the summary function tends to store output in alphabetical order, I get the following output:
tmp <- df %>%
  group_by(ID, city, adress) %>%
  summarize(numberofyears = n())

tmp <- tmp %>%
  group_by(ID) %>%
  #filter(n() > 1) %>%
  mutate(from = city[1], from_adres = adress[1], from_years = numberofyears[1],
         to = city[2], to_adres = adress[2], to_years = numberofyears[2]) %>%
  distinct(ID, .keep_all = TRUE) %>%
  select(-c(2:3))

# A tibble: 2 x 8
# Groups:   ID [2]
     ID numberofyears from  from_adres from_years to    to_adres to_years
  <dbl>         <int> <fct> <fct>           <int> <fct> <fct>       <int>
1     1             2 A     adress_2            2 B     adress_1        2
2     2             2 B     adress_3            2 C     adress_4        2
Which is wrong, because we know that adress_1 precedes adress_2. When summarizing a relocation from city B to city C, I get the right result.
It is a very small detail, but an important one as I tried to demonstrate. Any suggestions would be very much appreciated!
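A hedged sketch of one fix (assuming dplyr >= 1.0 and the sample data above): record the first year each (ID, city) pair appears and arrange() by that year before summarizing, so "from" and "to" follow chronology rather than alphabetical factor order:

```r
library(dplyr)

df <- data.frame(
  year   = c(2013, 2014, 2015, 2016, 2013, 2014, 2015, 2016),
  ID     = c(1, 1, 1, 1, 2, 2, 2, 2),
  city   = c("B", "B", "A", "A", "B", "B", "C", "C"),
  adress = c("adress_1", "adress_1", "adress_2", "adress_2",
             "adress_3", "adress_3", "adress_4", "adress_4"),
  stringsAsFactors = FALSE
)

moves <- df %>%
  group_by(ID, city, adress) %>%
  summarize(numberofyears = n(), first_year = min(year), .groups = "drop") %>%
  arrange(ID, first_year) %>%          # chronological order, not alphabetical
  group_by(ID) %>%
  summarize(from = first(city), to = last(city), .groups = "drop")
```

With this ordering, ID 1 correctly reports a relocation from B to A.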

Cosine similarity and Euclidean distance between sentences
I want to calculate the cosine similarity and Euclidean distance between sentences.
Is there a python package through which I can compute this?
import math

vec1 = vector_dict[text1]
vec2 = vector_dict[text2]

def cosine_similarity(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    return round(float(numerator) / denominator, 2)
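The snippet above computes cosine similarity over bag-of-words dicts. A matching Euclidean distance over the same representation could look like this (a sketch, assuming the dicts map token -> count; `v1` and `v2` are made-up examples):

```python
import math

def euclidean_distance(vec1, vec2):
    # Treat each dict as a sparse vector; tokens missing from one side count as 0.
    tokens = set(vec1) | set(vec2)
    return math.sqrt(sum((vec1.get(t, 0) - vec2.get(t, 0)) ** 2 for t in tokens))

# hypothetical bag-of-words vectors
v1 = {"the": 2, "cat": 1}
v2 = {"the": 1, "dog": 1}
d = euclidean_distance(v1, v2)  # sqrt((2-1)^2 + 1^2 + 1^2) = sqrt(3)
```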

LDAvis HTML output from serVis does not work: JSON.parse error
As a beginner in R and topic modelling with R, I am following what other people suggested: first fit an LDA model on a corpus of corporate annual reports, and then visualize the results through LDAvis. Everything works fine until the very last step, when I open the directory in the browser and get the following error:
"SyntaxError: JSON.parse: bad control character in string literal at line 10 column 16177 of the JSON data"
Here is my code:
#load text mining library
library(tm)

#load files into corpus
#get listing of .txt files in directory
ceoletters <- read.csv("ceoletters.csv")
corpus <- iconv(ceoletters$ceoletter, to = "ASCII", sub = "")

#create corpus from vector
letters <- Corpus(VectorSource(corpus))

#start preprocessing
letters <- tm_map(letters, content_transformer(tolower))
letters <- tm_map(letters, removePunctuation)
letters <- tm_map(letters, removeNumbers)
letters <- tm_map(letters, removeWords, stopwords("english"))
letters <- tm_map(letters, stripWhitespace)

#Stem document
letters <- tm_map(letters, stemDocument)

#Create document-term matrix
dtm <- DocumentTermMatrix(letters)
#convert rownames to filenames
rownames(dtm) <- ceoletters$letter_id

#collapse matrix by summing over columns
freq <- colSums(as.matrix(dtm))
#length should be total number of terms
length(freq)
#create sort order (descending)
ord <- order(freq, decreasing = TRUE)
#List all terms in decreasing order of freq and write to disk
freq[ord]
write.csv(freq[ord], "word_freq.csv")

##fitting LDA
#load topic models library
library(topicmodels)
library(doParallel)

#Set parameters for Gibbs sampling
burnin <- 2000
iter <- 2000
thin <- 500
seed <- list(2003, 5, 63, 100001, 765)
nstart <- 5
best <- TRUE
registerDoParallel(4)

#Number of topics
k <- 100

#Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))

#write out results
#docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "DocsToTopics.csv"))

#top terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 10))
write.csv(ldaOut.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))

#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities, file = paste("LDAGibbs", k, "TopicProbabilities.csv"))
and here is my code to visualize the results:
library(LDAvis)
library(servr)

topicmodels2LDAvis <- function(x, ...) {
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]],
    theta = post[["topics"]],
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(mat, na.rm = TRUE),
    term.frequency = slam::col_sums(mat, na.rm = TRUE)
  )
}

serVis(topicmodels2LDAvis(ldaOut))
Any ideas on how to solve this problem? Thanks.
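One hedged workaround, based on the error message: raw control characters in the vocabulary can survive into the JSON string that createJSON() returns, and JSON.parse rejects unescaped control characters inside string literals. Since serVis() also accepts the JSON string directly, stripping control characters before serving may help (`json_string` below is a toy stand-in for the createJSON() output):

```r
# Toy JSON containing a literal tab control character inside a string,
# which is the kind of thing JSON.parse rejects.
json_string <- "{\"vocab\": [\"bad\tterm\", \"ok\"]}"

# Replace all control characters with spaces before handing it to serVis().
clean <- gsub("[[:cntrl:]]", " ", json_string)
# serVis(clean)
```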

How to avoid false positives using Geograpy?
I want to extract country names from affiliation of authors. For example, I have the following text:
affiliation = "1Key Laboratory of Marine Drugs, Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, PR China."
I utilized the following code:
import geograpy places = geograpy.get_place_context(text = affiliation) print(places.countries)
And the result is the following one:
['China', 'United States', 'Russian Federation']
Obviously, "United States" and "Russian Federation" are false positives. How could I eliminate them automatically?
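One simple, hedged workaround: post-filter geograpy's candidates, keeping only countries whose names literally appear in the affiliation string. `filter_countries` is a made-up helper, not part of geograpy:

```python
def filter_countries(candidates, text):
    # Keep only candidate countries that occur verbatim in the text
    # (case-insensitive), discarding indirect inferences.
    text_lower = text.lower()
    return [c for c in candidates if c.lower() in text_lower]

affiliation = ("1Key Laboratory of Marine Drugs, Ministry of Education, "
               "School of Medicine and Pharmacy, Ocean University of China, "
               "Qingdao 266003, PR China.")
candidates = ['China', 'United States', 'Russian Federation']
print(filter_countries(candidates, affiliation))  # ['China']
```

This drops "United States" and "Russian Federation" here, at the cost of missing countries mentioned only by alias (e.g. "PRC"), which would need an alias table on top.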