Why should I use a Hashing Vectorizer for text clustering?
I am trying to cluster some textual data, and I am following the scikit-learn example for doing so.
In the example, you have the option to use a HashingVectorizer followed by a TfidfTransformer, and this is the default pipeline:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Perform an IDF normalization on the output of HashingVectorizer
# (opts.n_features comes from the example script's command-line options)
hasher = HashingVectorizer(n_features=opts.n_features,
                           stop_words='english', alternate_sign=False,
                           norm=None)
vectorizer = make_pipeline(hasher, TfidfTransformer())
- What exactly does a HashingVectorizer do? I cannot work out what it does from either the documentation or Wikipedia.
- What are the advantages and disadvantages of using a HashingVectorizer for text clustering? In the example it is given as an option (you can also use only a TF-IDF vectorizer, but the default is HashingVectorizer + TfidfTransformer); a condensed sketch of the two options follows below.
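To make the comparison concrete, here is my condensed version of the two options. This is just a sketch rather than the exact example script, and the n_features value below is an arbitrary illustrative choice:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

documents = ["the cat sat on the mat", "the dog ate my homework"]

# Option 1 (default in the example): stateless hashing, then IDF weighting
hashing_tfidf = make_pipeline(
    HashingVectorizer(n_features=2 ** 16, stop_words='english',
                      alternate_sign=False, norm=None),
    TfidfTransformer(),
)
X_hashed = hashing_tfidf.fit_transform(documents)

# Option 2: a plain TF-IDF vectorizer that builds an explicit vocabulary
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(documents)

print(X_hashed.shape)  # (2, 65536): fixed-width hashed feature space
print(X_tfidf.shape)   # (2, vocabulary size learned from the corpus)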
See also questions close to this topic
- Capture live output of a subprocess without blocking it
I am currently writing a script that will, at some point, need to capture the output of a heavy grep command. The problem is that I can't get decent performance from the command compared to just running it in the shell (it appears to be at least 2x slower, sometimes never finishing). I'm grepping a whole partition (that's part of the purpose of the script), so I know it's a slow operation; what bothers me is the huge difference between the runtime in the shell and in my Python script.
I've struggled with it for quite some time. I've tried the queue library, threading, multiprocessing, and gave asyncio a bit of a shot. I'm getting really lost.
I've shortened it to its simplest form; here it is:
from subprocess import PIPE, Popen

p = Popen(['grep', '-a', '-b', 'MyTestString', '/dev/sda1'],
          stdin=None, stdout=PIPE, stderr=PIPE, bufsize=-1)

while True:
    output = p.stdout.readline()
    if output:
        print(output.strip())
So here, my grep command is way slower than in the shell. I've tried putting a time.sleep in my main loop, but it seems to make things worse. Just a few more details:
- There will be very little output from the command.
- The final goal would be to grab the output without blocking the main thread (a stripped-down sketch of my threading attempt is below), but one problem at a time.
- Again, I know that my grep command is a heavy one
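For reference, the threading attempt mentioned above looked roughly like this. It is a simplified sketch rather than my exact code, and the names are illustrative:

import queue
import threading
from subprocess import PIPE, Popen

def reader(pipe, out_queue):
    # Push each line of the subprocess output onto a queue
    for line in iter(pipe.readline, b''):
        out_queue.put(line)
    pipe.close()

p = Popen(['grep', '-a', '-b', 'MyTestString', '/dev/sda1'],
          stdout=PIPE, stderr=PIPE)
q = queue.Queue()
threading.Thread(target=reader, args=(p.stdout, q), daemon=True).start()

while p.poll() is None or not q.empty():
    try:
        print(q.get(timeout=0.5).strip())
    except queue.Empty:
        pass  # main thread is free to do other work here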
Thanks for your help if you have any ideas or suggestions. I'm on the verge of despair :(
- Invalid base64-encoded string: number of data characters (217) cannot be 1 more than a multiple of 4 error in Python Django
I am getting this incredibly weird error, which I guess is related to the number of characters in a string. I am a bit new to Django and have never seen such an error. I checked a few threads, and according to them this is about encoding and decoding something, which I really do not know how to resolve. Here is the full traceback:
Request Method: GET
Request URL: http://127.0.0.1:8000/profiles/myprofile/
Django Version: 2.1.5
Python Version: 3.9.0
Installed Applications:
['django.contrib.admin',
 'django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'posts',
 'profiles']
Installed Middleware:
['django.middleware.security.SecurityMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.common.CommonMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'django.middleware.clickjacking.XFrameOptionsMiddleware']

Traceback:
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\sessions\backends\base.py" in _get_session
  190. return self._session_cache

During handling of the above exception ('SessionStore' object has no attribute '_session_cache'), another exception occurred:

File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\core\handlers\exception.py" in inner
  34. response = get_response(request)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\core\handlers\base.py" in _get_response
  126. response = self.process_exception_by_middleware(e, request)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\core\handlers\base.py" in _get_response
  124. response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "D:\PROJECTS\social\src\profiles\views.py" in my_profile_view
  6. profile = Profile.objects.get(user=request.user)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\manager.py" in manager_method
  82. return getattr(self.get_queryset(), name)(*args, **kwargs)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\query.py" in get
  390. clone = self.filter(*args, **kwargs)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\query.py" in filter
  844. return self._filter_or_exclude(False, *args, **kwargs)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\query.py" in _filter_or_exclude
  862. clone.query.add_q(Q(*args, **kwargs))
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\sql\query.py" in add_q
  1263. clause, _ = self._add_q(q_object, self.used_aliases)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\sql\query.py" in _add_q
  1284. child_clause, needed_inner = self.build_filter(
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\sql\query.py" in build_filter
  1176. value = self.resolve_lookup_value(value, can_reuse, allow_joins)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\db\models\sql\query.py" in resolve_lookup_value
  1009. if hasattr(value, 'resolve_expression'):
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\utils\functional.py" in inner
  213. self._setup()
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\utils\functional.py" in _setup
  347. self._wrapped = self._setupfunc()
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\auth\middleware.py" in <lambda>
  24. request.user = SimpleLazyObject(lambda: get_user(request))
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\auth\middleware.py" in get_user
  12. request._cached_user = auth.get_user(request)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\auth\__init__.py" in get_user
  182. user_id = _get_user_session_key(request)
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\auth\__init__.py" in _get_user_session_key
  59. return get_user_model()._meta.pk.to_python(request.session[SESSION_KEY])
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\sessions\backends\base.py" in __getitem__
  55. return self._session[key]
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\sessions\backends\base.py" in _get_session
  195. self._session_cache = self.load()
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\sessions\backends\db.py" in load
  44. return self.decode(s.session_data) if s else {}
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\site-packages\django\contrib\sessions\backends\base.py" in decode
  101. encoded_data = base64.b64decode(force_bytes(session_data))
File "C:\Users\aarti\AppData\Local\Programs\Python\Python39\lib\base64.py" in b64decode
  87. return binascii.a2b_base64(s)

Exception Type: Error at /profiles/myprofile/
Exception Value: Invalid base64-encoded string: number of data characters (217) cannot be 1 more than a multiple of 4
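As far as I can tell, the last frame is plain base64 decoding failing on the stored session string. A minimal way to reproduce the same error outside Django (purely illustrative, using an arbitrary 217-character string) would be:

import base64

# 217 is 1 more than a multiple of 4, a length that base64 cannot represent,
# so decoding raises binascii.Error with the same message as in the traceback
base64.b64decode(b"A" * 217)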
- Django register user
Hi, I want to create a normal user using UserCreationForm together with my own registration form, but when I submit the form with the button, nothing is added to my Postgres auth_user and auth_users_register tables. Here is my code:
forms.py
class RegisterForm(ModelForm):
    class Meta:
        model = Register
        fields = ['date_of_birth', 'image_add']
        widgets = {
            'date_of_birth': DateInput(attrs={'type': 'date'})
        }


class CreateUserForm(UserCreationForm):
    class Meta:
        model = User
        fields = ['username', 'email', 'password1', 'password2']
models.py
def validate_image(image_add):
    max_height = 250
    max_width = 250
    if 250 < max_width or 250 < max_height:
        raise ValidationError("Height or Width is larger than what is allowed")


class Register(models.Model):
    user = models.OneToOneField(
        User, on_delete=models.CASCADE)
    date_of_birth = models.DateField(
        max_length=8, verbose_name="date of birth")
    image_add = models.ImageField(
        upload_to="avatars", verbose_name="avatar", validators=[validate_image])
views.py
class RegisterPageView(View):
    def get(self, request):
        if request.user.is_authenticated:
            return redirect('/')
        user_form = CreateUserForm(request.POST)
        register_form = RegisterForm(request.POST)
        return render(request, 'register.html',
                      {'user_form': user_form, 'register_form': register_form})

    def post(self, request):
        if request.method == 'POST':
            user_form = CreateUserForm(request.POST)
            register_form = RegisterForm(request.POST)
            if user_form.is_valid() and register_form.is_valid():
                user_form.save()
                register_form.save(commit=False)
                user = user_form.cleaned_data.get('username')
                messages.success(
                    request, 'Your account has been registered' + user)
                return redirect('login')
        user_form = CreateUserForm()
        register_form = RegisterForm()
        context = {'user_form': user_form, 'register_form': register_form}
        return render(request, 'register.html', context)
- Automatic font-size JavaScript
I need to write a JavaScript program that automatically changes the font size depending on the window and the amount of text. The text cannot wrap to a new line; it is just intended to shrink. If I add text to the div, the text should shrink so that it fits into the div without wrapping.
<div class="master">
  <div>Bigger Automatic Text - with any length </div>
  <p>Medium Automatic Text - with any length</p>
</div>
- Add progress bar (verbose) when creating gensim dictionary
I want to create a gensim dictionary from the lines of a dataframe. Each entry of df.preprocessed_text is a list of words.

from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary


def create_dict(df, bigram=True, min_occ_token=3):
    token_ = df.preprocessed_text.values
    if not bigram:
        return Dictionary(token_)

    bigram = Phrases(token_, min_count=3, threshold=1, delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])

    dictionary = Dictionary(bigram_token)
    dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
    dictionary.compactify()
    return dictionary
I couldn't find a progress bar option for it, and callbacks don't seem to work for it either. Since my corpus is huge, I would really appreciate a way to show progress. Is there one?
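The closest I can think of is wrapping the explicit loop with something like tqdm. This is only a sketch of that idea; it assumes tqdm is installed and it only covers the phrasing loop, not the Dictionary construction itself:

from tqdm import tqdm

# Only the explicit Python loop can easily report progress this way;
# Dictionary() and filter_extremes() still run without feedback.
bigram_token = []
for sent in tqdm(token_, desc="Applying bigram phraser"):
    bigram_token.append(bigram_phraser[sent])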
- Self-writing text animation in batch
I want to create an animation of self-writing text. I tried using cls and timeouts, like this:
echo s
CSCRIPT SLEEP.VBS 100> nul
cls
echo so
CSCRIPT SLEEP.VBS 100> nul
cls
echo som
It would solve my problem, but it would take a very large amount of code and time to write it this way. I did some research, but I couldn't find a way to do this in batch.
- TfidfVectorizer results in 1x1 sparse matrix with just 1 element
I'm trying to apply text-based multilabel classification to a subset of this dataset. When I try to transform my data, the result is a 1x1 sparse matrix that I can't do anything with, because its length isn't the same as my labels. My data before the split:
1        No7 Lift & Luminate Triple Action Serum 50...
2        No7 Stay Perfect Foundation Cool Vanilla by No7
3        Wella Koleston Perfect Hair Colour 44/44 Mediu...
4        Lacto Calamine Skin Balance Oil control 120 ml...
5        Mary Kay Satin Hands Hand Cream Travel MINI Si...
...      ...
98671    Panasonic Shockwave Portable Compact Disc Play...
98672    Jensen SC-340 Home-Theater Universal Remote Co...
98673    Motorola TalkAbout T250 2-Mile 14-Channel Two-...
98674    Sharp MDMT821 Ultra-Thin Minidisc Player/Recorder
98675    KLH KHP201TW Digital Headphones

98675 rows × 1 columns
Relevant Code:
vectorizer = TfidfVectorizer(max_features=10000)
vectorizer.fit(data)

Xtrain, Xtest, Ytrain, Ytest = train_test_split(data, labels, test_size=0.2, random_state=1)

x_train = vectorizer.transform(Xtrain)
x_test = vectorizer.transform(Xtest)
x_train
Output:
<1x1 sparse matrix of type '<class 'numpy.float64'>' with 1 stored elements in Compressed Sparse Row format>
- XGB model (or any other ML model) objective function vs scoring metrics
I was trying to set the random state for XGB using a numpy RandomState generator for hyperparameter tuning, so that each instance would give a different column subsampling and so on.
However, unlike normal sklearn regressors such as random forest, it seems that I cannot set the random_state parameter as such:
regr = XGBRegressor(random_state=np.random.RandomState(42))
regr.fit(x_train, y_train)
pred_y_test = regr.predict(x_test)
The following error occurs:
xgboost.core.XGBoostError: Invalid Parameter format for seed expect int but value='RandomState(MT19937)'
Do I have to set it as an integer only? What if I want the seed to change after every hyperparameter trial? Is there an alternative random seed generator that I can use, or should I just leave that parameter as None?
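To make the question concrete, this is the kind of thing I would like to do, drawing a fresh integer seed for every trial (just a sketch; the search loop is illustrative):

import numpy as np
from xgboost import XGBRegressor

rng = np.random.RandomState(42)  # one master generator for the whole search

for trial in range(5):
    # draw a plain int seed per trial, since XGBoost only accepts integers
    seed = int(rng.randint(0, 2 ** 31 - 1))
    regr = XGBRegressor(random_state=seed)
    # regr.fit(x_train, y_train) would go here for each hyperparameter trial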
- MultiLabelBinarizer with duplicated values
I have an expected array [1,1,3] and a predicted array [1,2,2,4] for which I want to calculate precision_recall_fscore_support, so I need a matrix in the following format:

>> mlb = MultiLabelBinarizerWithDuplicates()
>> transformed = mlb.fit_transform([(1, 1, 3), (1, 2, 2, 4)])
array([[1,1,0,0,1,0],
       [1,0,1,1,0,1]])
>> mlb.classes_
[1,1,2,2,3,4]
For the duplicated values I don't care which one of them is turned on, meaning that this is also a valid result:
array([[1,1,0,0,1,0],
       [0,1,1,1,0,1]])
MultiLabelBinarizer clearly says "All entries should be unique (cannot contain duplicate classes)", so it doesn't support this use case.
- How to draw a bipartite graph for term-document clustering with edges for the words in the documents?
I have a clustering project in which I've decided to do term-document clustering on the Hamshahri newspaper (a Persian dataset). At the end of the project I want to draw a bipartite graph with the documents as node set 0 (it only shows D1, D2, D3, ...) and the terms as node set 1, drawn from the tfidf_matrix that separated the words earlier in the project. However, I can't get the edges (the frequency with which each word is repeated in each document). Please help me. There are a couple of other slight problems in the code as well (such as the right-to-left alignment and the letter breaking for languages like Persian or Arabic in matplotlib, or the lists not displaying properly); you would notice them if you run the whole thing.
I want the bipartite graph to look simple and understandable, like this:
X, Y = bipartite.sets(B)
pos = dict()
pos.update((n, (1, i)) for i, n in enumerate(X))  # put nodes from X
pos.update((n, (2, i)) for i, n in enumerate(Y))  # put nodes from Y
nx.draw(B, pos=pos)
plt.show()
Here's the code for the whole project; the bipartite part is in the last section.
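What I am trying to get at for the edges is something like the following sketch: adding a weighted edge between a document and a term whenever the term occurs in that document. This is only an illustration, assuming tfidf_matrix is the document-term matrix and terms is the list of feature names from the vectorizer (here the tf-idf weights are reused as edge weights rather than raw counts):

import networkx as nx

B = nx.Graph()
doc_nodes = [f"D{i + 1}" for i in range(tfidf_matrix.shape[0])]
B.add_nodes_from(doc_nodes, bipartite=0)   # documents on one side
B.add_nodes_from(terms, bipartite=1)       # terms on the other side

# add one weighted edge per nonzero entry of the document-term matrix
coo = tfidf_matrix.tocoo()
for doc_idx, term_idx, weight in zip(coo.row, coo.col, coo.data):
    B.add_edge(doc_nodes[doc_idx], terms[term_idx], weight=weight)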
- Plot causes "Error: Incorrect Number of Dimensions"
I am learning about the "kohonen" package in R for the purpose of making Self Organizing Maps (SOM, also called Kohonen Networks - a type of Machine Learning algorithm). I am following this R language tutorial over here: https://www.rpubs.com/loveb/som
I tried to create my own data (this time with both "factor" and "numeric" variables) and run the SOM algorithm (this time using the "supersom()" function instead):
# load libraries and adjust colors
library(kohonen)       # fitting SOMs
library(ggplot2)       # plots
library(RColorBrewer)  # colors, using predefined palettes

contrast <- c("#FA4925", "#22693E", "#D4D40F", "#2C4382", "#F0F0F0", "#3D3D3D")  # my own, contrasting pairs
cols <- brewer.pal(10, "Paired")

# create and format data
a = rnorm(1000, 10, 10)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 5)
d = rnorm(1000, 5, 10)
e <- sample(LETTERS[1:4], 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25))
f <- sample(LETTERS[1:5], 100, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
g <- sample(LETTERS[1:2], 100, replace = TRUE, prob = c(0.5, 0.5))

data = data.frame(a, b, c, d, e, f, g)
data$e = as.factor(data$e)
data$f = as.factor(data$f)
data$g = as.factor(data$g)

cols <- 1:4
data[cols] <- scale(data[cols])

# som model
som <- supersom(data = as.list(data), grid = somgrid(10, 10, "hexagonal"),
                dist.fct = "euclidean", keep.data = TRUE)
From here, I was able to successfully make some of the basic plots:
# plots
# pretty gradient colors
colour1 <- tricolor(som$grid)
colour4 <- tricolor(som$grid, phi = c(pi/8, 6, -pi/6), offset = 0.1)

plot(som, type = "changes")
plot(som, type = "count")
plot(som, type = "quality", shape = "straight")
plot(som, type = "dist.neighbours", palette.name = grey.colors, shape = "straight")
However, the problem arises when I try to make individual plots for each variable:
# error
var <- 1  # define the variable to plot
plot(som, type = "property", property = getCodes(som)[, var],
     main = colnames(getCodes(som))[var], palette.name = terrain.colors)

var <- 6  # define the variable to plot
plot(som, type = "property", property = getCodes(som)[, var],
     main = colnames(getCodes(som))[var], palette.name = terrain.colors)
This produces an error:
"Error: Incorrect Number of Dimensions"
A similar error (NAs by coercion) is produced when attempting to cluster the SOM network:

# cluster (error)
set.seed(33)  # for reproducibility
fit_kmeans <- kmeans(data, 3)  # 3 clusters are used, as indicated by the wss development
cl_assignmentk <- fit_kmeans$cluster[data$unit.classif]

par(mfrow = c(1, 1))
plot(som, type = "mapping", bg = rgb(colour4), shape = "straight",
     border = "grey", col = contrast)
add.cluster.boundaries(som, fit_kmeans$cluster, lwd = 3, lty = 2, col = contrast[4])
Can someone please tell me what I am doing wrong? Thanks
Sources: https://www.rdocumentation.org/packages/kohonen/versions/2.0.5/topics/supersom
- Can't fit Fuzzy C Means Algorithm
I've tried clustering a mobile price dataset in a notebook. The problem is that when I select some columns from the dataset and then fit them with FCM, I get an error.
Here's my code
df = pd.read_csv(data_train)
df_using = df[['battery_power', 'px_width', 'ram']]
df_using.head()

fcm = FCM(n_clusters=3, m=2, error=0.005, max_iter=1000)
fcm.fit(df_using)

fcm_centers = fcm.centers
fcm_labels = fcm.u.argmax(axis=1)
and the error says:
TypeError                                 Traceback (most recent call last)
<ipython-input-11-6cc06ebd8dc4> in <module>()
      1 fcm = FCM (n_clusters=3, m=2, error=0.005, max_iter=1000)
----> 2 fcm.fit(df_using)
      3
      4 fcm_centers = fcm.centers
      5 fcm_labels = fcm.u.argmax(axis=1)

3 frames

/usr/local/lib/python3.6/dist-packages/jax/api.py in _check_arg(arg)
   2174   if not (isinstance(arg, core.Tracer) or _valid_jaxtype(arg)):
   2175     raise TypeError("Argument '{}' of type {} is not a valid JAX type"
-> 2176                     .format(arg, type(arg)))
   2177
   2178 # TODO(necula): this duplicates code in core.valid_jaxtype
Can anyone help me figure out why? ;( I really appreciate your help. I'm sorry for my bad English and the structure of my question.