Linux process sleeps at socket
I am working on a Python 2.7 project. I set up a process pool with 10 processes. My code structure is:
    pool = Pool(10)
    pool.map(ProcessJson, jsonFiles)

    def ProcessJson(jsonpath):
        # doing something and get (int)numUrls
        for idx in xrange(numUrls):
            flag = DownloadVideo(paras).run()
            if flag == 0:
                continue

    class DownloadVideo():
        def __init__(self, para):
            pass  # init

        def run(self):
            try:
                videopath = os.path.join(folder, sname.encode("utf-8"))
                try:
                    cmd = "wget -q -c --limit-rate=1M --tries=3 -T10 -P %s --output-document=%s \"%s\"" % (folder, videopath, url)
                    ret = os.system(cmd)
                    if (ret >> 8) != 0:
                        logger.error("cmd error---" + str(ret >> 8) + "---" + cmd + "\n")
                        return 0
                except Exception as e:
                    logger.error("python system cmd error---", e)
                    return 0
            except Exception as e1:
                logger.error("exception with downloading---", e1)
                return 0
After some time, the processes are all sleeping. I used sudo strace -p pid and lsof -p pid to find the problem. Here is what I get:
    sudo strace -p 27723
    Process 27723 attached
    wait4(-1,
Then I found its child process, 27724:
    sudo strace -p 27724
    Process 27724 attached
    read(6,

    :/proc/27724$ cat wchan
    do_wait_data
    :/proc/27724$ ls -l ./fd
    l-wx------ 1 cwz domain^users 64 7月 12 16:20 3 -> Documents/tools/log/__main__-1.log
    lr-x------ 1 cwz domain^users 64 7月 12 16:20 4 -> pipe:
    l-wx------ 1 cwz domain^users 64 7月 12 16:20 5 -> chenweizhao/Documents/tools/GoogleImages-1/Tim_Faraday/xXP4smHFrfLw9M.jpg
    lrwx------ 1 cwz domain^users 64 7月 12 16:20 6 -> socket:
    l-wx------ 1 cwz domain^users 64 7月 12 16:20 7 -> pipe:
    lr-x------ 1 cwz domain^users 64 7月 12 16:20 8 -> /dev/null
    lsof -p 27724
    wget 27724 cwz 3w REG  8,2 427352 5786641 chenweizhao/tools/log/__main__-1.log
    wget 27724 cwz 4r FIFO 0,8 0t0 8180487 pipe
    wget 27724 cwz 5w REG  8,2 24404 9971506 chenweizhao/tools/GoogleImages-1/Tim_Faraday/xXP4smHFrfLw9M.jpg
    wget 27724 cwz 6u IPv4 8286859 0t0 TCP myIP:47496->ec2-13-56-87-172.us-west-1.compute.amazonaws.com:https (ESTABLISHED)
    wget 27724 cwz 7w FIFO 0,8 0t0 8180490 pipe
    wget 27724 cwz 8r CHR  1,3 0t0 1029 /dev/null
How can I solve the problem? Thank you very much!
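For reference, the lsof output shows wget blocked reading the established TLS socket (fd 6), so one direction is a watchdog that kills a stuck download. A minimal sketch, assuming Python 2.7 and that folder, videopath and url are defined as in the code above; the deadline value is an assumption:

    import subprocess
    import threading

    def download_with_deadline(folder, videopath, url, deadline=600):
        # same wget invocation as above, but as an argument list so no shell quoting is needed
        cmd = ["wget", "-q", "-c", "--limit-rate=1M", "--tries=3", "-T", "10",
               "-P", folder, "--output-document=" + videopath, url]
        proc = subprocess.Popen(cmd)
        timer = threading.Timer(deadline, proc.kill)  # kill wget if it outlives the deadline
        timer.start()
        try:
            ret = proc.wait()
        finally:
            timer.cancel()
        return 0 if ret != 0 else 1  # mirror run()'s 0-on-failure convention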
See also questions close to this topic
Python List Comp with replace - "a bytes-like object is required, not 'str'"
I am trying to split a tab-delimited bytes object into lines and fields. In my input data, when a field is supposed to be empty, the data has --. I want to replace -- with something that will act as an empty value ('') when I use it to build a MySQL insert. I am new to list comprehensions, but I found a few examples that seemed similar.
    for line in line_split[1:]:
        field_split = line.split(b'\t')
        field_split = [x.replace('--', '') for x in field_split]
        print("f-", field_split)
        report_list.append(field_split)
If I comment out the replace line that errors, so the code can print, I get back the following line. If you scroll right, the field value I want to replace shows as b'--'. This seems like it should be a simple fix, but I have been messing around for way longer than I care to admit.
    f- [b'1020569383', b'X012312', b'42132LVPG0U', b'Glow', b'Sports', b'Glow', b'Amazon', b'18.85', b'18.85', b'11.61', b'10.67', b'1.54', b'36.02', b'inches', b'0.52', b'pounds', b'Lg-Std-Non-Media', b'USD', b'6.02', b'2.83', b'0.00', b'--', b'--', b'--', b'3.19']
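Since the fields are bytes, replace needs bytes arguments too. A minimal fix, either staying in bytes or decoding first if str fields are wanted for the MySQL insert (the utf-8 encoding is an assumption):

    # bytes.replace requires bytes arguments
    field_split = [x.replace(b'--', b'') for x in field_split]

    # or decode to str first, if str fields are wanted for the insert
    field_split = [x.decode('utf-8').replace('--', '') for x in field_split]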
Box plots in Python using Seaborn - creating duplicates for bigrams and trigrams
I am using Spyder as part of Anaconda and trying to classify tweets (text) by event type. To do this, I am using the function cross_val_score, having already vectorised my tweets using TfidfVectorizer and then transformed my training data using fit_transform for unigrams, bigrams and trigrams, as per the below:
    # TF-IDF on unigrams, bigrams and trigrams
    tfidf_words = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1', ngram_range=(1,1), stop_words='english')
    # vectorize for bigrams
    tfidf_bigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1', ngram_range=(2,2), stop_words='english')
    # vectorize for trigrams
    tfidf_trigrams = TfidfVectorizer(sublinear_tf=True, min_df=0, norm='l2', encoding='latin-1', ngram_range=(3,3), stop_words='english')

    # Transform and fit each of the outputs from TF-IDF (unigrams, bigrams and trigrams)
    x_train_words = tfidf_words.fit_transform(x_train_sm.preprocessed).toarray()
    # bigrams
    x_train_bigrams = tfidf_bigrams.fit_transform(x_train_sm.preprocessed).toarray()
    # trigrams
    x_train_trigrams = tfidf_trigrams.fit_transform(x_train_sm.preprocessed).toarray()
Now I perform cross-validation using cross_val_score to calculate the average accuracy for unigrams, bigrams and trigrams. Once complete, I try to produce and save a box plot of the accuracies achieved. This is done for 4 different models:
    # Create list of models to be tested: Random Forest, Linear SVC, Naive Bayes & Logistic Regression
    models = [OneVsRestClassifier(RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)),
              OneVsRestClassifier(LinearSVC()),
              OneVsRestClassifier(MultinomialNB()),
              OneVsRestClassifier(LogisticRegression(random_state=0))]

    # number of folds (10-fold cross validation performed for each model)
    CV = 10

    ########## Fitting, predicting and calculating average accuracy for unigrams data ##########

    # create blank dataframe with an index equal to the number of CV folds * number of models tested
    cv_words = pd.DataFrame(index=range(CV * len(models)))

    # create an empty list, which will be populated with the accuracies of each model at each fold
    entries = []

    # list of the names of the models tested
    names = ["Random Forest", "Linear SVC", "Naive Bayes", "Logistic Regression"]

    # convert y_train_sm from an array into a series to work in the 'cross_val_score' function
    # this series contains all of the event_ids for the corresponding encoded tweets (labels)
    # cross_val_score is a function used to calculate performance scores and implement cross-validation
    y_train_sm = pd.Series(y_train_sm.tolist())

    # calculate the accuracy at each fold and populate the results in the 'entries' list
    # populate the dataframe 'cv_words' with the fold and accuracy scores at each fold
    i = 0
    for model in models:
        # model_name = model.__class__.__name__
        model_name = names[i]
        # model => the model that will be used to fit the data
        # x_train_words => x training data after oversampling (unigrams)
        # y_train_sm => y training data after oversampling (event_id)
        # scoring => the type of score you want the function 'cross_val_score' to return
        # cv => number of folds you want to be performed with cross-validation
        accuracies = cross_val_score(model, x_train_words, y_train_sm, scoring='accuracy', cv=CV)
        for fold_idx, accuracy in enumerate(accuracies):
            entries.append((model_name, fold_idx, accuracy))
        cv_words = pd.DataFrame(entries, columns=['model_name_unigrams', 'fold_idx', 'accuracy'])
        i = i + 1

    # plot the results of each model on a single box plot
    box_words = sns.boxplot(x='model_name_unigrams', y='accuracy', data=cv_words)
    fig_words = box_words.get_figure()
    fig_words.savefig('boxplot_unigrams.png')
The output for the unigrams is exactly what I want:
Now when I run the code for bigrams and trigrams (highlight ALL code and hit 'play'), I get the following:
The code for each of these is identical, except they use 'cv_bigrams' and 'cv_trigrams' for the data input for the box plots. Code for each is below.
    # create blank dataframe with an index equal to the number of CV folds * number of models tested
    cv_bigrams = pd.DataFrame(index=range(CV * len(models)))

    # clear the previous list called 'entries' that was populated with values
    entries = []

    # calculate the accuracy at each fold and populate the results in the 'entries' list
    # populate the dataframe 'cv_bigrams' with the fold and accuracy score at each fold
    i = 0
    for model in models:
        # model_name = model.__class__.__name__
        model_name = names[i]
        # model => the model that will be used to fit the data
        # x_train_bigrams => x training data after oversampling (bigrams)
        # y_train_sm => y training data after oversampling (event_id)
        # scoring => the type of score you want the function 'cross_val_score' to return
        # cv => number of folds you want to be performed with cross-validation
        accuracies = cross_val_score(model, x_train_bigrams, y_train_sm, scoring='accuracy', cv=CV)
        for fold_idx, accuracy in enumerate(accuracies):
            entries.append((model_name, fold_idx, accuracy))
        cv_bigrams = pd.DataFrame(entries, columns=['model_name_bigrams', 'fold_idx', 'accuracy'])
        i = i + 1
    # create blank dataframe with an index equal to the number of CV folds * number of models tested
    cv_trigrams = pd.DataFrame(index=range(CV * len(models)))

    # clear the previous list called 'entries' that was populated with values
    entries = []

    # calculate the accuracy at each fold and populate the results in the 'entries' list
    # populate the dataframe 'cv_trigrams' with the fold and accuracy score at each fold
    i = 0
    for model in models:
        # model_name = model.__class__.__name__
        model_name = names[i]
        # model => the model that will be used to fit the data
        # x_train_trigrams => data that is to be fitted by the selected model (trigrams)
        # y_train_sm => y training data after oversampling (event_id)
        # scoring => the type of score you want the function 'cross_val_score' to return
        # cv => number of folds you want to be performed with cross-validation
        accuracies = cross_val_score(model, x_train_trigrams, y_train_sm, scoring='accuracy', cv=CV)
        for fold_idx, accuracy in enumerate(accuracies):
            entries.append((model_name, fold_idx, accuracy))
        cv_trigrams = pd.DataFrame(entries, columns=['model_name_trigrams', 'fold_idx', 'accuracy'])
        i = i + 1
Here is what happens if I select the below code only and run:
    # plot the results of each model as a box plot
    box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
    box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
    fig_bigrams = box_bigrams.get_figure()
    fig_bigrams.savefig('boxplot_bigrams.png')
Same for trigrams:
    # plot the results of each model as a box plot
    box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
    box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
    fig_trigrams = box_trigrams.get_figure()
    fig_trigrams.savefig('boxplot_trigrams.png')
Any idea why I am getting duplicate boxplots overlapping each other when I run all of the code at once (which I need to do when I put this code into production), rather than when I highlight the snippets and run them separately?
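For reference, one likely cause: seaborn draws onto the current matplotlib axes, so when all of the code runs in one go, the bigram and trigram boxplots land on the axes that already hold the earlier plots (and each snippet above also calls sns.boxplot twice). A minimal sketch that opens a fresh figure per plot:

    import matplotlib.pyplot as plt

    # open a fresh figure so each boxplot gets its own axes
    plt.figure()
    box_bigrams = sns.boxplot(x='model_name_bigrams', y='accuracy', data=cv_bigrams)
    box_bigrams.get_figure().savefig('boxplot_bigrams.png')

    plt.figure()
    box_trigrams = sns.boxplot(x='model_name_trigrams', y='accuracy', data=cv_trigrams)
    box_trigrams.get_figure().savefig('boxplot_trigrams.png')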
How to make Scrapy crawl in DFS Order
I have Scrapy code whose structure is like this:
I want Scrapy to crawl the pages in DFS order, i.e. all the level-3 links first, followed by level 2 and then level 1. But Scrapy doesn't crawl that way. I have tried every approach I could find to achieve this, but have been unable to get a solution. Can someone suggest the correct way to do this?
    def parse(self, response):
        print "url1"
        yield scrapy.Request(url, callback=self.parse2)

    def parse2(self, response):
        print "url2"
        yield scrapy.Request(url, callback=self.parse3)

    def parse3(self, response):
        # Do something
        pass
The output should be something like:
    url1
    url2
    url3
    ....
    ....
    ....
    url2
    url3
    ....
    ....
    url2
    url3
    ....
    ....
    url1
Thanks in advance
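For reference, a hedged settings sketch (the values are assumptions, not a guaranteed fix): Scrapy's scheduler queues are LIFO by default, which already gives roughly depth-first order, but strict DFS additionally needs concurrency turned down so only one request is in flight at a time:

    # settings.py sketch; the setting names are Scrapy's, the intent is strict DFS
    CONCURRENT_REQUESTS = 1   # one request in flight => strict ordering
    DEPTH_PRIORITY = 0        # default: no reordering by depth
    # LIFO scheduler queues (these are the defaults) pop the deepest requests first
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'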
Execute Bash Script via Cron Job
I've been searching for a while now, and I can't quite understand why I'm having this problem, nor find a usable solution. I'm trying to execute a bash script via a cron job.
I have the following two entries listed under crontab -l for root:
    * * * * * /tmp/test.sh
    * * * * * /bin/echo "cron works" >> /tmp/file
test.sh contains the following:
echo "test" >> debug.txt
With the following permissions:
    -rwxrwxr-x 1 root root 25 Jul 23 11:40 test.sh
I'm just trying to execute the test script once every minute. The second cron job, where I'm outputting "cron works" to /tmp/file, works fine, but the first one doesn't.
I can see that they're both executing via grep CRON /var/log/syslog:
    Jul 23 11:56:01 privacy CRON: (root) CMD (/tmp/test.sh)
    Jul 23 11:56:01 privacy CRON: (root) CMD (/bin/echo "cron works" >> /tmp/file)
Any suggestions are appreciated, and if there's any more information I can add, let me know.
Systemctl stop does not execute when status is failed
I have a service that starts a Java application; when everything is fine, it works fine too.
But when something goes wrong, the Java process keeps running while 'systemctl status myservice' shows failed. In this situation, when I try to stop the service by executing 'systemctl stop myservice', nothing happens: the process still exists, and I can still find it with 'ps -ef | grep java'.
I know I can kill the process by executing 'systemctl kill myservice', but I still want to know why this happens and how to solve the problem.
SSH key management in big and mixed environments (Linux)
In every company I have worked at, managing SSH keys was always a pain. We had different ways of managing them, but mostly it was some CM system like Puppet/Chef/Ansible, manual copying of keys, or even some ugly bash scripts :D
I have also heard that some people use LDAP or a database as an SSH key store. But you still need some additional automation on top, like a CM tool, to put keys on and delete keys from servers.
So the question is: is there some nice, modern way of doing this that I don't know about? How do big IT companies (like Google or Facebook) handle keys?
Java checking for running process
I want my Java program to "link" with other programs, so that when certain processes are running, my program reacts. I already know how to search for a process in the system's process list, but I am curious whether there is a better way to check for that than a loop searching the whole process list every second.
Automatically Change Process Priority Windows 10
First time posting, so I hope I get it right!
I've got a quick question regarding Process Priority in Windows 10. Specifically, changing the priority for all processes spawned by a particular application, in this case Unity 3D.
I've figured out how to set them all at once, using a CMD batch file. However, Unity is dynamically spawning and killing processes according to its current state, requiring me to re-run the batch file every so often.
What I'm looking for, and have not been able to find so far, is a means of either setting the priority for the application, and as such all its processes, or a way to intercept the process creation event to automate this.
The reason I'm looking at a way to do this via the CMD Prompt instead of using third-party applications is two-fold. One, to learn a bit about the CMD Prompt, and two, to have a simple script to share with fellow developers who may not be interested in downloading and configuring an application.
Thank you for your time!
Adjust token information during process runtime
I'm creating a process with a modified token (calling SetTokenInformation before creating the process). After the process is running, I would like to disable/enable TokenUIAccess for the process (without restarting it).
Is there a way to adjust the token information during process runtime? I'm expecting to find something like AdjustTokenPrivileges, but such a function doesn't exist.
What are these: http.mydomain.com and w.mydomain.com?
While crawling my page I found these:
They're copies of my page and contain errors, but I wasn't aware of their existence. What are they?
Getting a list of external URLs with Wget
I use wget on Windows 7 with GOW (GNU On Windows), a kind of lightweight Cygwin.
I want to crawl a domain recursively, from top to bottom, without limiting the nesting level, and save all external URLs (URLs with a different domain than the crawled one) into a text file. In doing so, I want to exclude domains with facebook, google, pinterest and instagram in the domain name from being saved. I tried the following:
    $ wget https://example.com -O - 2>C:\tmp | grep -oP 'href="\Khttp:.+?"' | sed 's/"//' | grep -v facebook -v google -v pinterest -v instagram > file.txt
I get Access denied on this, without any other alerts. Is it some kind of insufficient write access somewhere? Or something else? How could I get this done?
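If the wget pipeline stays stubborn, here is a rough Python 3 sketch of the same idea (one page deep only; the start URL, regex and skip list are assumptions, and real recursion would need a queue of visited pages):

    import re
    from urllib.request import urlopen

    SKIP = ('facebook', 'google', 'pinterest', 'instagram')
    START = 'https://example.com'  # hypothetical start page

    html = urlopen(START).read().decode('utf-8', errors='replace')
    links = re.findall(r'href="(https?://[^"]+)"', html)
    with open('file.txt', 'w') as out:
        for link in links:
            # keep only external links whose URL contains none of the skipped names
            if 'example.com' not in link and not any(s in link for s in SKIP):
                out.write(link + '\n')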
Distributed crawling and rate limiting / flow control
I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the URLs and putting the results into an Elasticsearch engine. The system continuously re-crawls the found URLs at an interval of X milliseconds.
This has served me well, but with some new large clients coming up, the crawler is going to hit its limits. I need to redesign the system as a distributed crawler to speed up the crawling. The problem is the combination of the requirements below.
The system must adhere to the following 2 rules:
- multiple workers (concurrency issues)
- variable rate limit per client. I need to be very sure the system doesn't crawl client X more than once every X milliseconds.
What I have tried:
I tried putting the URLs in a MySQL table and letting the workers query for a URL to crawl based on last_crawled_at timestamps in the clients and urls tables. But MySQL doesn't like multiple concurrent workers, and I receive all sorts of deadlocks.
I tried putting the URLs into a Redis engine. I got this kinda working, but only with a Lua script that checks and sets an expiring key for every client that is being served. This all feels way too hackish.
I thought about filling a regular queue, but this would violate rule number 2, as I can't be 100% sure the workers can process the queue in 'real time'.
Can anybody explain to me how the big players do this? How can we have multiple processes query a big/massive list of URLs based on a few criteria (like rate-limiting per client) and make sure we hand out each URL to only one worker?
Ideally we wouldn't need another database besides Elasticsearch holding all the available/found URLs, but I don't think that's possible?
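For what it's worth, the expiring-key idea doesn't need a Lua script: Redis's SET with NX and PX is already a single atomic command. A minimal sketch using redis-py (the INTERVAL_MS mapping, key names and TTLs are assumptions):

    import redis

    r = redis.StrictRedis()
    INTERVAL_MS = {'client_a': 500, 'client_b': 2000}  # hypothetical per-client limits

    def try_claim(client_id, url):
        # atomically claim the client's rate-limit slot; the key expires after
        # the client's interval, so nobody can re-crawl this client too soon
        if not r.set('ratelimit:%s' % client_id, 1, nx=True, px=INTERVAL_MS[client_id]):
            return False  # another worker served this client too recently
        # claim the individual URL as well, so exactly one worker fetches it
        return bool(r.set('claim:%s' % url, 1, nx=True, px=60000))

A worker that gets True crawls the URL; False simply means move on to the next candidate.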