How to validate an ant colony optimization algorithm vs. a parallel ant colony optimization algorithm for dimensionality reduction
I am using the ant colony algorithm for dimensionality reduction, and I am going to compare it with a parallel version of the same algorithm. My questions are: what kind of dataset should I use, and how do I validate these algorithms? Help please, it is for a paper.
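A common validation setup for a feature-selection metaheuristic like ACO is to run both variants on public benchmark datasets (e.g. from the UCI repository), score each selected feature subset with a classifier under cross-validation, and then compare the parallel version on solution quality and wall-clock speedup. A minimal sketch of such a fitness evaluation, assuming scikit-learn is available; the dataset, classifier, and the particular feature subset are illustrative choices, not part of the question:

```python
# Hypothetical sketch: score a feature subset proposed by ACO (or its
# parallel variant) with cross-validated classifier accuracy.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

def subset_fitness(feature_idx):
    """Mean 5-fold CV accuracy using only the chosen feature columns."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, feature_idx], y, cv=5).mean()

all_features = subset_fitness(list(range(X.shape[1])))
subset = subset_fitness([0, 6, 9, 12])  # a subset an ACO run might propose
print(all_features, subset)
```

Running both algorithm variants through the same fitness function makes the comparison fair: the sequential and parallel ACO should reach similar subset quality, with the parallel one judged on runtime.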
See also questions close to this topic

Pandas Parallel Processing to Change dtypes
I have a huge pandas df (millions of rows x thousands of columns) that is causing memory issues due to its dtypes. I have written code to scan through the data and encode each column in its optimal dtype by parsing the columns individually, but it takes around 24 hours to run. The work is RAM-intensive but doesn't usually use more than 1 core, so I want to optimize by parallelizing to leverage more cores and speed up the task.
I tried using the multiprocessing library in Python (code below), but the gains weren't monumental. I also looked into dask, but I'm not sure that would work. One of the big space savers is object > categorical, so I think I need an entire column for that to work. I know that I can specify the categorical dtype on read, but I am reading from many files (multiple years), so as soon as one level changes it goes back to object.
```python
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

def combine_n_clean(col):
    appended_df = []
    for calyear in range(start_yr, end_yr):
        df = pd.read_feather('dataloc.feather', columns=[col])
        appended_df.append(df)
    appended_df = pd.concat(appended_df, axis=0, ignore_index=True)
    appended_df = float_to_cat_conversion(appended_df)  # Float32 > Integer or Categorical
    appended_df = object_var_conversion(appended_df)    # Object > Categorical
    return appended_df

def object_var_conversion(df_name):
    object_df = df_name.select_dtypes(include=['object'])
    converted_obj = pd.DataFrame()
    for col in object_df.columns:
        num_unique_values = len(object_df[col].unique())
        num_total_values = len(object_df[col])
        if num_unique_values / num_total_values < 0.5:  # Only convert to Category if under 50% unique var
            converted_obj.loc[:, col] = object_df[col].astype('category')
        else:
            converted_obj.loc[:, col] = object_df[col]
    del object_df
    df_name = df_name.drop(columns=converted_obj.columns)  # Drop object columns
    df_name = pd.concat([df_name, converted_obj], axis=1)  # Add back in categorical columns
    del converted_obj
    return df_name

pool = ThreadPool(16)
results = pool.map(combine_n_clean, col_name_list)
pool.close()
pool.join()
results = pd.concat(results, axis=1)
```
The code above runs, but is offering essentially no time savings.
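One likely reason for the lack of speedup: `multiprocessing.dummy` is a thread pool, and the per-column conversion is CPU-bound pandas work that the GIL largely serializes; swapping it for the process-based `multiprocessing.Pool` (same `map` API) is the usual fix. Independent of the pool, the object-to-category rule is where the memory win comes from. A single-process sketch of that rule on illustrative data (the column names and 50% threshold are assumptions mirroring the question's code):

```python
# Sketch of the object -> category conversion driving the savings.
# For parallelism, this per-column function is what a process-based
# multiprocessing.Pool (rather than the thread-backed dummy pool)
# would map over the columns.
import pandas as pd

def convert_column(s, threshold=0.5):
    """Convert an object Series to category when values repeat a lot."""
    if s.dtype == object and s.nunique() / len(s) < threshold:
        return s.astype('category')
    return s

df = pd.DataFrame({'state': ['NY', 'CA', 'NY', 'TX'] * 25000,
                   'id': [str(i) for i in range(100000)]})
converted = pd.DataFrame({c: convert_column(df[c]) for c in df.columns})

before = df.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
print(before, after)
```

Here the low-cardinality `state` column becomes categorical and shrinks sharply, while the all-unique `id` column is left as object, matching the 50% heuristic in the question.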

Jenkins creates separate workspaces when running parallel nodes, but those temporary workspaces don't have all the files of the original workspace
Jenkins creates a separate workspace for each parallel node, but those temporary workspaces don't contain all the files from the original workspace. I want to use a file from the original workspace. Is there any workaround for this?

Parallel Primes in Prolog?
I have the following simple code to determine prime numbers. It's a simple generate-and-test approach, not extremely optimized:
```prolog
prime(N) :- M is floor(sqrt(N)), between(2, M, K), N mod K =:= 0, !, fail.
prime(_).
```
Here is an example run:
```prolog
?- between(1,20,N), prime(N), write(N), nl, fail; true.
1
2
3
5
7
11
13
17
19
true.
```
How would I parallelize the listing of primes over multiple threads in Prolog? The output listing need not be sorted.

PCA decomposition explained_variance_ratio_
I am new to Python. I used time-series feature extraction, applied data scaling, and then used PCA for dimensionality reduction of the extracted features. When I use explained_variance_ratio_, I find that the sum of the variance ratios is greater than one. Why?
```python
x = StandardScaler().fit_transform(result)
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf[K].append(pd.DataFrame(data=principalComponents))
print(pca.explained_variance_ratio_)
```
[0.99890074 0.99988208]
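For a single fitted PCA, explained_variance_ratio_ divides each component's variance by the total variance of the data it was fitted on, so the entries always sum to at most 1; two near-1 values like those printed typically come from separate fits (for example, a PCA object re-created inside a loop over K). A quick sanity check on illustrative random data, assuming only numpy and scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
data = rng.randn(200, 10)  # illustrative stand-in for the extracted features

x = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(x)
ratios = pca.explained_variance_ratio_
# Each ratio is in [0, 1], and the ratios of one fit sum to at most 1.
print(ratios, ratios.sum())
```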

How to integrate new data into t-SNE?
I need to embed new data without re-executing the t-SNE algorithm on all of the data.
Is there a way to do this? Pickling the model does not work well.
```python
tsne = TSNE(n_components=2, verbose=1, perplexity=30, n_iter=1000,
            learning_rate=30, early_exaggeration=12)
tsne_results = tsne.fit_transform(allData)
pickle.dump(tsne, open("TSNE.SAVED", 'wb'))
```
It seems to me that there is no such possibility, but I hope there is a way to solve this problem.
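scikit-learn's TSNE has no transform method for unseen points, which is why pickling the fitted object doesn't help here. One common workaround is to train a regressor from the original feature space to the fitted t-SNE coordinates and use it to place new points. A sketch of that idea, with random data and a PCA embedding standing in for `tsne_results` so the example runs quickly; the regressor choice is an assumption, not the question's code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
all_data = rng.randn(300, 20)

# Stand-in for the fitted t-SNE coordinates (tsne_results in the question).
embedding = PCA(n_components=2).fit_transform(all_data)

# Learn a map: original feature space -> 2-D embedding coordinates.
mapper = KNeighborsRegressor(n_neighbors=5).fit(all_data, embedding)

new_points = rng.randn(5, 20)
new_coords = mapper.predict(new_points)  # embed new data without refitting
print(new_coords.shape)
```

The regressor (rather than the t-SNE object itself) is then what gets pickled and reused for new data.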

python/sklearn  how to get clusters and cluster names after doing kmeans
So I have the following code where I do a kmeans clustering after doing dimensionality reduction.
```python
# Create CountVectorizer
vec = CountVectorizer(token_pattern=r'[a-z]+', ngram_range=(1,1), min_df=2,
                      max_df=.8, stop_words=ENGLISH_STOP_WORDS)
cv = vec.fit_transform(X)
print('Dimensions: ', cv.shape)

# Create LSA/TruncatedSVD with full dimensions
cv_lsa = TruncatedSVD(n_components=cv.shape[1]-1)
cv_lsa_data = cv_lsa.fit_transform(cv)

# Find dimensions with 80% variance explained
number = np.searchsorted(cv_lsa.explained_variance_ratio_.cumsum(), .8) + 1
print('Dimensions with 80% variance explained: ', number)

# Create LSA/TruncatedSVD with 80% variance explained
cv_lsa80 = TruncatedSVD(n_components=number)
cv_lsa_data80 = cv_lsa80.fit_transform(cv)

# Do Kmeans when k=4
kmean = KMeans(n_clusters=4)
clustered = kmean.fit(cv_lsa_data80)
```
Now I'm stuck on what to do next. I want to get the clusters identified by the kmeans object and get the top 10/most common used word in those clusters. Something like:
Cluster 1:
1st most common word - count
2nd most common word - count
Cluster 2:
1st most common word - count
2nd most common word - count
Reduce time of optimisation (GA and ACO), or an alternative to GA
I have a list of source-destination pairs; the list size is 65000. For each pair of source and destination, I have 6 random candidate paths. I applied a Genetic Algorithm (GA): each individual is a list holding one chosen path for every pair, and the cost function gives the cost of an individual. My objective is to minimise the cost function, but the GA is taking too much time to produce a solution. Is there another optimisation method that gives a fast solution, or any way to reduce the GA's running time? I am thinking of applying Ant Colony Optimization. Will it be better than GA? How can I resolve this issue?
I have tried the GA in IntelliJ.
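Whether GA (or ACO) is needed at all depends on the cost function: if the total cost is just the sum of independent per-pair path costs, the optimum is found directly by taking the cheapest of the 6 paths for each pair, in one linear pass over the 65000 pairs. A hedged sketch under that separability assumption (random costs stand in for the real ones; if pairs interact, a metaheuristic is still warranted):

```python
# Sketch, assuming the total cost decomposes into independent per-pair
# costs; path_costs[i] holds the 6 candidate path costs for pair i.
import random

random.seed(1)
num_pairs = 65000
path_costs = [[random.uniform(1, 100) for _ in range(6)]
              for _ in range(num_pairs)]

# With a separable cost, the best individual simply picks the cheapest
# path per pair -- no GA or ACO needed, and it runs in a single pass.
best_solution = [min(range(6), key=costs.__getitem__) for costs in path_costs]
best_cost = sum(costs[p] for costs, p in zip(path_costs, best_solution))
print(best_cost)
```

If the cost is not separable, a cheaper intermediate step than switching algorithms is often to profile and vectorise the GA's fitness evaluation, which usually dominates the runtime at this problem size.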

What is happening with vector array here?
I'm solving the Traveling Salesman Problem via an ACO implementation in C++. However, I found out that the program I've built so far gives a segmentation fault. (Note: I've limited the algorithm to only do one iteration of the colony for debugging purposes).
First off, I have a total of 52 cities taken from a file, and I distribute the ants so that every city has the same number of ants starting from it.
To store the distances between every pair of cities, I'm using a vector of vectors of doubles called Map (a square matrix). However, halfway through the execution it looks like these vectors are deleted. In this instance, it happens when calculating the path for ant number 55. I've added a section of code just to highlight exactly where it crashes:
```cpp
//DEBUGGING SECTION
cout << "Size Roulette: " << Roulette.size() << endl;
cout << "Size Remain: " << RemainingCities.size() << endl;
cout << "Size Map: " << Map.size() << " x " << Map[0].size() << endl;
int k = 0;
cout << "Test: Map access: " << endl;
for(int i = 0; i < Map.size(); ++i)  // HERE IT CRASHES AT ANT NUMBER 55
    cout << Map[0][i] << " ";
cout << endl;
cout << "Test: Operation: " << Map[Colony[ant_i][city_i-1]][RemainingCities[k]] << endl;
Roulette[k] = pow((MAX_DIST - Map[Colony[ant_i][city_i-1]][RemainingCities[k]]), heur_coef)
            + pow((pheromones[Colony[ant_i][city_i-1]][RemainingCities[k]]), pher_coef);
//END OF DEBUGGING SECTION
```
There, the call Map[0].size() normally returns 52 (just like Map.size(), as it's supposed to be a square matrix), but at the crashing iteration it returns what looks like a memory address, and the moment I try to access any element, a segmentation fault occurs.
I have checked that the memory access is always correct, and I can access any other variable without issue except Map until that 55th ant. I've tried different seeds for the roulette method, but it always crashes at the same place.
I have also varied the number of ants of the colony. If it's just one ant per city, the program executes without issue, but for any higher amount the program always crashes at the 55th ant.
You can download the full cpp file and the reading .tsp file from github:
https://github.com/yitosmash/ACO
In any case, I'll leave the full function here:
```cpp
void ACO(const vector<City>& cities, const vector<vector<double>>& Map, int max_it,
         int num_ants, double decay, double heur_coef, double pher_coef, double pher_coef_elit)
{
    srand(30);
    //Initialise colony of ants (each ant is a vector of city indices)
    vector<vector<int>> Colony(num_ants, vector<int>(cities.size(), 0));
    //Initialise pheromone matrix
    vector<vector<double>> pheromones(cities.size(), vector<double>(cities.size(), 0));
    //Initialise costs vector (for elitist expansion)
    vector<double> costs(cities.size(), 0);
    //Auxiliar vector of indices
    vector<int> cityIndices(cities.size());
    for (int i = 0; i < cities.size(); ++i)
        cityIndices[i] = i;
    //Longest distance from Map, used for heuristic values.
    vector<double> longests(cities.size(), 0);
    for(int i = 0; i < cities.size(); ++i)
        longests[i] = *(max_element(Map[i].begin(), Map[i].end()));
    const double MAX_DIST = *(max_element(longests.begin(), longests.end()));
    longests.clear();

    int i = 0;
    while(i < max_it)
    {
        for(int ant_i = 0; ant_i < num_ants; ++ant_i)
        {
            cout << "Ant: " << ant_i << endl;
            //City for ant_i to start at; each ant is assigned a determined starting city
            int starting_city = (int) ((float)ant_i/num_ants*cities.size());
            //cout << starting_city << endl;
            Colony[ant_i][0] = starting_city;
            //Get a vector with the cities left to visit
            vector<int> RemainingCities = cityIndices;
            //Remove starting city from remaining cities
            RemainingCities.erase(RemainingCities.begin() + starting_city);

            //Create path for ant_i
            for(int city_i = 1; city_i < Colony[ant_i].size(); ++city_i)
            {
                cout << "Calculating city number: " << city_i << endl;
                //Create roulette for next city selection
                vector<double> Roulette(RemainingCities.size(), 0);
                double total = 0;

                //DEBUGGING SECTION
                cout << "Size Roulette: " << Roulette.size() << endl;
                cout << "Size Remain: " << RemainingCities.size() << endl;
                cout << "Size Map: " << Map.size() << " x " << Map[0].size() << endl;
                int k = 0;
                cout << "Test: Map access: " << endl;
                for(int i = 0; i < Map.size(); ++i)  // HERE IT CRASHES AT ANT NUMBER 55
                    cout << Map[0][i] << " ";
                cout << endl;
                cout << "Test: Operation: " << Map[Colony[ant_i][city_i-1]][RemainingCities[k]] << endl;
                Roulette[k] = pow((MAX_DIST - Map[Colony[ant_i][city_i-1]][RemainingCities[k]]), heur_coef)
                            + pow((pheromones[Colony[ant_i][city_i-1]][RemainingCities[k]]), pher_coef);
                //END OF DEBUGGING SECTION

                for(int j = 0; j < RemainingCities.size(); ++j)
                {
                    //Heuristic value is MAX_DIST - current edge.
                    Roulette[j] = pow((MAX_DIST - Map[Colony[ant_i][city_i-1]][RemainingCities[j]]), heur_coef)
                                + pow((pheromones[Colony[ant_i][city_i-1]][RemainingCities[j]]), pher_coef);
                    total += Roulette[j];
                }
                cout << endl;

                //Transform roulette into stacked probabilities
                Roulette[0] = Roulette[0]/total;
                for(int j = 1; j < Roulette.size(); ++j)
                    Roulette[j] = Roulette[j-1] + Roulette[j] / total;

                //Select a city from Roulette
                int chosen = 0;
                double r = (double) rand()/RAND_MAX;
                while(Roulette[chosen] < r)
                    chosen++;

                //Add chosen city to the path
                Colony[ant_i][city_i] = RemainingCities[chosen];
                RemainingCities.erase(RemainingCities.begin() + chosen);
            }
            cout << endl;
            //Save cost of ant_i, for elitist expansion
            costs[ant_i] = pathCost(Colony[ant_i], Map);
        }
        i++;
    }
}
```