Text generation using the Estimator API
I have been trying to start transitioning to the Estimator API, since it is recommended by the TensorFlow team. However, I wonder how some of the basic stuff can be done efficiently within the Estimator framework. Over the weekend I tried to create a GRU-based model for text generation, following the TensorFlow example for building custom Estimators. I was able to create a model that I can train relatively easily, and its results match the non-Estimator version. However, for sampling (generating text) I ran into some trouble. I finally made it work, but it is very slow: every time it predicts a character, the Estimator framework reloads the whole graph, which makes the whole thing slow. Is there a way to avoid loading the graph every time, or any other solution? Second issue: I also had to use state_is_tuple=False, since I have to send the GRU state back and forth (between the model function and the generator function) and I can't send tuples. Does anyone know how to deal with this? Thanks. P.S. Here is a link to my code example: https://github.com/amirharati/sample_estimator_charlm/blob/master/RnnLm.py
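For context, one common mitigation (not from the post) relies on the fact that Estimator.predict returns a lazy generator: if the input_fn itself pulls characters from a queue-fed generator, the graph is built once and predictions can be pulled one character at a time. The sketch below illustrates that feeding pattern without TensorFlow; FakeEstimator and graph_loads are hypothetical stand-ins.

```python
import queue

class FakeEstimator:
    """Stands in for tf.estimator.Estimator: predict() is a lazy generator."""
    def __init__(self):
        self.graph_loads = 0  # counts how often the "graph" is (re)built

    def predict(self, input_fn):
        self.graph_loads += 1          # expensive setup happens once here
        for features in input_fn():    # then inputs are streamed lazily
            yield features * 2         # stand-in for one decoding step

feed = queue.Queue()

def input_fn():
    while True:
        item = feed.get()
        if item is None:   # sentinel ends generation
            return
        yield item

est = FakeEstimator()
predictions = est.predict(input_fn)   # nothing runs yet (lazy generator)

outputs = []
for c in [1, 2, 3]:
    feed.put(c)
    outputs.append(next(predictions))  # one step per character, same graph
feed.put(None)

print(outputs, est.graph_loads)
```

The same shape works with a real Estimator: keep the `predictions` iterator alive across characters instead of calling `predict` once per character.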
See also questions close to this topic

Is one concatenated matrix multiplication faster than multiple non-concatenated matmuls? If so, why?
The definition of the LSTM cell involves 4 matrix multiplications with the input and 4 matrix multiplications with the previous output. We can simplify the expression by using a single matrix multiply for each, concatenating the 4 small matrices (the resulting matrix is 4 times larger).
My question is: does this improve the efficiency of the matrix multiplication? If so, why? Because we can put them in contiguous memory? Anything else?
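A quick NumPy check (sizes arbitrary, not from the post) that the fused form computes the same values; the speedup then comes from issuing one large GEMM instead of four small ones, which amortizes dispatch overhead and lets the BLAS kernel use better blocking:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                       # hidden size (arbitrary for the demo)
x = rng.standard_normal((1, d))

# four separate gate matrices ...
Ws = [rng.standard_normal((d, d)) for _ in range(4)]
# ... versus one concatenated (d, 4d) matrix
W_cat = np.concatenate(Ws, axis=1)

small = np.concatenate([x @ W for W in Ws], axis=1)  # four small matmuls
big = x @ W_cat                                      # one fused matmul

# identical results either way
print(np.allclose(small, big), big.shape)
```

Timing the two variants with `timeit` on your own hardware is the honest way to confirm whether the fused version wins for your sizes.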

VGG16 TensorFlow implementation does not learn on CIFAR-10
This VGGNet was implemented from scratch using the TensorFlow framework, with all of the layers defined in the code. The main problem I am facing is that the training accuracy, not to mention the validation accuracy, does not go up even though I wait it out for a decent amount of time. There are a few problems that I suspect are causing this. First, I think the network is too deep and wide for the CIFAR-10 dataset. Second, extracting data batches from the whole dataset is not exhaustive, i.e. batch selection draws from the whole dataset over and over without eliminating the examples already selected in the ongoing epoch.
Still, I could not get this code to work after many hours and days of experiments.
I wish I could extract the problematic code section to ask about, but since I cannot pinpoint the exact section, let me upload my whole code.
import os
import sys
import tensorflow as tf
import numpy as np
import scipy as sci
import math
import matplotlib.pyplot as plt
import time
import random
import imageio
import pickle
import cv2
import json
from pycocotools.coco import COCO


class SVGG:
    def __init__(self, num_output_classes):
        self.input_layer_size = 0
        self.num_output_classes = num_output_classes

        # Data
        self.X = []
        self.Y = []
        self.working_x = []
        self.working_y = []
        self.testX = []
        self.testY = []

        # hard coded for now. Have to change.
        self.input_data_size = 32          # 32 x 32
        self.input_data_size_flat = 3072   # 32 x 32 x 3 == 3072
        self.num_of_channels = 3           # 3 for colour image

        self.convolution_layers = []
        self.convolution_weights = []
        self.fully_connected_layers = []
        self.fully_connected_weights = []

    def feed_examples(self, input_X, input_Y):
        """
        Feed examples to be learned
        :param input_X: Training dataset X
        :param input_Y: Training dataset label
        :return:
        """
        # Take first input and calculate its size
        # hard code size
        self.X = input_X
        self.Y = input_Y
        self.input_data_size_flat = len(self.X[0]) * len(self.X[0][0]) * len(self.X[0][0][0])

    def feed_test_data(self, test_X, test_Y):
        self.testX = test_X
        self.testY = test_Y

    def run(self):
        x = tf.placeholder(tf.float32, [None, self.input_data_size_flat], name='x')
        x_data = tf.reshape(x, [-1, self.input_data_size, self.input_data_size, 3])

        y_true = tf.placeholder(tf.float32, [None, self.num_output_classes], name='y_true')
        y_true_cls = tf.argmax(y_true, axis=1)

        """ VGG layers """
        # Create layers
        ############### Input Layer #################
        input_layer, input_weight = self.create_convolution_layer(
            x_data, num_input_channels=3, filter_size=3, num_filters=64, use_pooling=True)  # False

        ############### Conv Layer 1 #################
        conv_1_1, w_1_1 = self.create_convolution_layer(
            input=input_layer, num_input_channels=64, filter_size=3, num_filters=64, use_pooling=False)
        conv_1_2, w_1_2 = self.create_convolution_layer(
            input=conv_1_1, num_input_channels=64, filter_size=3, num_filters=128, use_pooling=True)

        ############### Conv Layer 2 #################
        conv_2_1, w_2_1 = self.create_convolution_layer(
            input=conv_1_2, num_input_channels=128, filter_size=3, num_filters=128, use_pooling=False)
        conv_2_2, w_2_2 = self.create_convolution_layer(
            input=conv_2_1, num_input_channels=128, filter_size=3, num_filters=256, use_pooling=True)

        ############### Conv Layer 3 #################
        conv_3_1, w_3_1 = self.create_convolution_layer(
            input=conv_2_2, num_input_channels=256, filter_size=3, num_filters=256, use_pooling=False)
        conv_3_2, w_3_2 = self.create_convolution_layer(
            input=conv_3_1, num_input_channels=256, filter_size=3, num_filters=256, use_pooling=False)
        conv_3_3, w_3_3 = self.create_convolution_layer(
            input=conv_3_2, num_input_channels=256, filter_size=3, num_filters=512, use_pooling=True)

        ############### Conv Layer 4 #################
        conv_4_1, w_4_1 = self.create_convolution_layer(
            input=conv_3_3, num_input_channels=512, filter_size=3, num_filters=512, use_pooling=False)
        conv_4_2, w_4_2 = self.create_convolution_layer(
            input=conv_4_1, num_input_channels=512, filter_size=3, num_filters=512, use_pooling=False)
        conv_4_3, w_4_3 = self.create_convolution_layer(
            input=conv_4_2, num_input_channels=512, filter_size=3, num_filters=512, use_pooling=True)

        ############### Conv Layer 5 #################
        conv_5_1, w_5_1 = self.create_convolution_layer(
            input=conv_4_3, num_input_channels=512, filter_size=3, num_filters=512, use_pooling=False)
        conv_5_2, w_5_2 = self.create_convolution_layer(
            input=conv_5_1, num_input_channels=512, filter_size=3, num_filters=512, use_pooling=False)
        conv_5_3, w_5_3 = self.create_convolution_layer(
            input=conv_5_2, num_input_channels=512, filter_size=3, num_filters=512, use_pooling=True)

        layer_flat, num_features = self.flatten_layer(conv_5_3)

        ############### Fully Connected Layer #################
        fc_1 = self.create_fully_connected_layer(input=layer_flat, num_inputs=num_features, num_outputs=4096)
        fc_2 = self.create_fully_connected_layer(input=fc_1, num_inputs=4096, num_outputs=4096)
        fc_3 = self.create_fully_connected_layer(input=fc_2, num_inputs=4096,
                                                 num_outputs=self.num_output_classes, use_dropout=False)

        # Normalize prediction
        y_prediction = tf.nn.softmax(fc_3)

        # The class number is the index of the largest element
        y_prediction_class = tf.argmax(y_prediction, axis=1)

        # Cost function to be optimized
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=fc_3, labels=y_true)

        # => Now we have a measure of how well the model performs on each image individually. But in order to use the
        # cross-entropy to guide the optimization of the model's variables we need a single value, so we simply take
        # the average of the cross-entropy for all the image classifications
        cost = tf.reduce_mean(cross_entropy)

        # Optimizer
        optimizer_adam = tf.train.AdamOptimizer(learning_rate=0.002).minimize(cost)

        # Performance measure
        correct_prediction = tf.equal(y_prediction_class, y_true_cls)
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

        total_iterations = 0
        num_iterations = 100000
        start_time = time.time()

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for i in range(num_iterations):
                x_batch, y_true_batch, _ = self.get_batch(X=self.X, Y=self.Y, low=0, high=40000, batch_size=128)
                feed_dict_train = {x: x_batch, y_true: y_true_batch}
                sess.run(optimizer_adam, feed_dict_train)

                if i % 100 == 99:
                    # Calculate the accuracy on the training set.
                    x_batch, y_true_batch, _ = self.get_batch(X=self.X, Y=self.Y, low=40000, high=50000,
                                                              batch_size=1000)
                    feed_dict_validate = {x: x_batch, y_true: y_true_batch}
                    acc = sess.run(accuracy, feed_dict=feed_dict_validate)
                    # Message for printing.
                    msg = "Optimization Iteration: {0:>6}, Training Accuracy: {1:>6.1%}"
                    # print(sess.run(y_prediction, feed_dict=feed_dict_train))
                    # print(sess.run(y_prediction_class, feed_dict=feed_dict_train))
                    print(msg.format(i + 1, acc))

                if i % 10000 == 9999:
                    oSaver = tf.train.Saver()
                    oSess = sess
                    path = "./model/_" + "iteration_" + str(i) + ".ckpt"
                    oSaver.save(oSess, path)

                if i == num_iterations - 1:
                    x_batch, y_true_batch, _ = self.get_batch(X=self.testX, Y=self.testY, low=0, high=10000,
                                                              batch_size=10000)
                    feed_dict_test = {x: x_batch, y_true: y_true_batch}
                    test_accuracy = sess.run(accuracy, feed_dict=feed_dict_test)
                    msg = "Test Accuracy: {0:>6.1%}"
                    print(msg.format(test_accuracy))

    def get_batch(self, X, Y, low=0, high=50000, batch_size=128):
        x_batch = []
        y_batch = np.ndarray(shape=(batch_size, self.num_output_classes))
        index = np.random.randint(low=low, high=high, size=batch_size)
        counter = 0
        for idx in index:
            x_batch.append(X[idx].flatten())
            y_batch[counter] = one_hot_encoded(Y[idx], self.num_output_classes)
            y_batch_cls = Y[idx]
            counter += 1
        return x_batch, y_batch, y_batch_cls

    def generate_new_weights(self, shape):
        w = tf.Variable(tf.truncated_normal(shape, stddev=0.05))
        return w

    def generate_new_biases(self, shape):
        b = tf.Variable(tf.constant(0.05, shape=[shape]))
        return b

    def create_convolution_layer(self, input, num_input_channels, filter_size, num_filters, use_pooling):
        """
        :param input: The previous layer
        :param num_input_channels: Number of channels in previous layer
        :param filter_size: W and H of each filter
        :param num_filters: Number of filters
        :return:
        """
        shape = [filter_size, filter_size, num_input_channels, num_filters]
        weights = self.generate_new_weights(shape)
        biases = self.generate_new_biases(num_filters)

        layer = tf.nn.conv2d(input=input, filter=weights, strides=[1, 1, 1, 1], padding='SAME')
        layer += biases

        # Max Pooling
        if use_pooling:
            layer = tf.nn.max_pool(layer, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

        # ReLU. Using elu for better performance
        layer = tf.nn.elu(layer)
        return layer, weights

    def create_fully_connected_layer(self, input, num_inputs, num_outputs, use_dropout=True):
        weights = self.generate_new_weights(shape=[num_inputs, num_outputs])
        biases = self.generate_new_biases(shape=num_outputs)

        layer = tf.matmul(input, weights) + biases
        layer = tf.nn.elu(layer)

        if use_dropout:
            keep_prob = tf.placeholder(tf.float32)
            keep_prob = 0.5
            layer = tf.nn.dropout(layer, keep_prob)
        return layer

    def flatten_layer(self, layer):
        """
        Flattens the dimensions output by a convolution layer. Flattening is needed
        to feed into a fully-connected layer.
        :param layer:
        :return:
        """
        # shape [num_images, img_height, img_width, num_channels]
        layer_shape = layer.get_shape()
        # Number of features h x w x channels
        num_features = layer_shape[1:4].num_elements()
        # Reshape
        layer_flat = tf.reshape(layer, [-1, num_features])
        # Shape is now [num_images, img_height * img_width * num_channels]
        return layer_flat, num_features


def unpickle(file):
    with open(file, 'rb') as file:
        dict = pickle.load(file, encoding='bytes')
    return dict


def convert_to_individual_image(flat):
    img_R = flat[0:1024].reshape((32, 32))
    img_G = flat[1024:2048].reshape((32, 32))
    img_B = flat[2048:3072].reshape((32, 32))
    # B G R
    mean = [125.3, 123.0, 113.9]
    img = np.dstack((img_R - mean[0], img_G - mean[1], img_B - mean[2]))
    img = np.array(img)
    # img = cv2.resize(img, (224, 224), img)
    return img


def read_coco_data(img_path, annotation_path):
    coco = COCO(annotation_path)
    ids = list(coco.imgs.keys())
    ann_keys = list(coco.anns.keys())
    print(coco.imgs[ids[0]])
    print(coco.anns[ann_keys[0]])


def one_hot_encoded(class_numbers, num_classes=None):
    if num_classes is None:
        num_classes = np.max(class_numbers) + 1
    return np.eye(num_classes, dtype=float)[class_numbers]


if __name__ == '__main__':
    data = []
    labels = []
    val_data = []
    val_label = []

    # cifar10
    counter = 0
    for i in range(1, 6):
        unpacked = unpickle("./cifar10/data_batch_" + str(i))
        tmp_data = unpacked[b'data']
        tmp_label = unpacked[b'labels']
        inner_counter = 0
        for flat in tmp_data:
            converted = convert_to_individual_image(flat)
            data.append(converted)
            labels.append(tmp_label[inner_counter])
            counter += 1
            inner_counter += 1
            cv2.imwrite("./img/" + str(counter) + ".jpg", converted)

    # Test data
    unpacked = unpickle("./cifar10/test_batch")
    test_data = []
    test_data_flat = unpacked[b'data']
    test_label = unpacked[b'labels']
    for flat in test_data_flat:
        test_data.append(convert_to_individual_image(flat))

    svgg = SVGG(10)
    svgg.feed_examples(input_X=data, input_Y=labels)
    svgg.feed_test_data(test_X=test_data, test_Y=test_label)
    svgg.run()
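On the second suspicion: np.random.randint in get_batch samples with replacement, so some examples repeat while others are never seen in an epoch. A standard alternative (my own sketch, names hypothetical) draws batches from a fresh permutation each epoch so every example is visited exactly once:

```python
import numpy as np

def epoch_batches(num_examples, batch_size, rng):
    """Yield index batches covering every example exactly once, in random order."""
    order = rng.permutation(num_examples)
    for start in range(0, num_examples - batch_size + 1, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
seen = np.concatenate(list(epoch_batches(10, 5, rng)))
print(sorted(seen.tolist()))   # every index appears exactly once
```

Inside get_batch this would replace the `np.random.randint(...)` draw, with the epoch loop advancing the generator.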

How to use tensorflow-gpu (with retrain.py) with an Nvidia GPU card on a Windows 10 PC?
I have a setup with retrain.py working with tensorflow, and it works fine. Now I want to try the tensorflow-gpu version on a new PC with an Nvidia GPU card and an Intel Xeon processor. Are there any code changes needed to run retrain.py? Do I need to install any special drivers?

Utility of Keras LSTM timesteps
Like many people, I'm confused by the timesteps dimension of Keras LSTM inputs. Most tutorials "prepare" the data by storing the n past values at every timestep, which gives a timestep dimension of n. But is that really necessary? From what I understand, LSTM units were designed to remember past values of the data. So if one provides n autoregressive components at every timestep, I don't see the point in using a sophisticated neural network unit. A simple neural network would be able to give a weight to every lagged value and thereby "learn" the sequential nature of a time series...
Thanks, Tibo
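For concreteness, the "preparation" those tutorials do is just a sliding window over the series; a minimal NumPy sketch (toy series, window length n = 3 chosen arbitrarily) that produces the (samples, timesteps, features) shape Keras expects:

```python
import numpy as np

series = np.arange(10, dtype=float)   # toy univariate time series 0..9
n = 3                                  # timesteps per sample (arbitrary)

# stack n consecutive values per sample -> shape (samples, timesteps, features)
X = np.stack([series[i:i + n] for i in range(len(series) - n)])[..., None]
y = series[n:]                         # target: the next value after each window

print(X.shape, y.shape)       # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])     # first window [0. 1. 2.] predicts 3.0
```

The alternative the question alludes to is a window of length 1 with a stateful RNN carrying the memory instead; both are valid, they just trade truncation length for training convenience.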

An alternative way to concatenate two LSTM cells in Keras (CRNN model)
So I am working on a CRNN model, a model for OCR; this is the link I am working from. This is the model-building part of the code:
x_reshape = Reshape(target_shape=(int(bn_shape[1]), int(bn_shape[2] * bn_shape[3])))(batchnorm_7)
fc_1 = Dense(128, activation='relu')(x_reshape)  # (?, 50, 128)

rnn_1 = LSTM(128, kernel_initializer="he_normal", return_sequences=True)(fc_1)
rnn_1b = LSTM(128, kernel_initializer="he_normal", go_backwards=True, return_sequences=True)(fc_1)
rnn1_merged = add([rnn_1, rnn_1b])

rnn_2 = LSTM(128, kernel_initializer="he_normal", return_sequences=True)(rnn1_merged)
rnn_2b = LSTM(128, kernel_initializer="he_normal", go_backwards=True, return_sequences=True)(rnn1_merged)
rnn2_merged = concatenate([rnn_2, rnn_2b])

drop_1 = Dropout(0.25)(rnn2_merged)
fc_2 = Dense(label_classes, kernel_initializer='he_normal', activation='softmax')(drop_1)

# model setting
base_model = Model(inputs=inputShape, outputs=fc_2)  # the model for predicting
(I have not included the whole model, as I just want to give a sense of how it is built.) So in this one line:
rnn2_merged = concatenate([rnn_2, rnn_2b])
it is concatenating two cells. I was wondering: is there any alternative way to concatenate these without using
Concatenate
in Keras? (I have a problem converting this model to CoreML; it raises the error
Only channel and sequence concatenation are supported.
on the concatenate, which is not supported by CoreML; that's why I wanted to implement it in another way.) I have read a couple of papers regarding this, like this one. They explain that in CoreML, Concatenate can only be done along a predetermined dimension!
Any input is appreciated.
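One mathematically exact alternative (my suggestion, not from the post): when a concatenation is immediately followed by a Dense layer, W applied to concat([a, b]) equals W_a·a + W_b·b with W split row-wise, so the Concatenate op can be replaced by two Dense projections and an add. A NumPy check of the identity, with toy shapes standing in for the two LSTM outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 128))   # e.g. forward LSTM output (toy batch of 4)
b = rng.standard_normal((4, 128))   # e.g. backward LSTM output
W = rng.standard_normal((256, 10))  # Dense weight applied to concat([a, b])

via_concat = np.concatenate([a, b], axis=-1) @ W
via_add = a @ W[:128] + b @ W[128:]   # split W row-wise: no Concatenate op

print(np.allclose(via_concat, via_add))
```

In Keras terms that means two separate Dense layers (one on rnn_2, one on rnn_2b) merged with `add`, which avoids Concatenate entirely; whether CoreML accepts the rest of the graph is a separate question.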

In which direction does a Keras RNN work?
I'm trying to predict sports events results using RNN.
I started from simple RNN model, that can imitate my data, just to check how it works and to get intuition about how RNNs work in general.
from keras.models import Sequential
from keras.layers import Dense, LSTM, GRU, SimpleRNN, Activation, Dropout, TimeDistributed
import numpy as np

np.random.seed(1337)

# now I need to create my fake data
x_seed = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0]]
y_seed = [-1., -1., -1., 1., 1., 1.]
# so basically I use the same input features but with different outputs
# to check the RNN's ability to predict time-dependent data

sz = len(x_seed)
features = len(x_seed[0])

# we need to reshape data to 3d format to be able to use it with RNN.
# I use one sample with 6 timesteps
x_train = np.array(x_seed).reshape(1, sz, features)
y_train = np.array(y_seed).reshape(1, sz, 1)

# creating simple RNN model with Keras
model = Sequential()
model.add(SimpleRNN(input_dim=features, output_dim=2, return_sequences=True))
model.add(SimpleRNN(input_dim=features, output_dim=2, return_sequences=True))
model.add(TimeDistributed(Dense(units=1, activation="tanh")))

# compiling and fitting
model.compile(loss="mse", optimizer='adam')
model.fit(np.array(x_train), np.array(y_train), verbose=2, epochs=10000,
          batch_size=1, shuffle=False)  # , validation_split=1)

# now it is time to check the results
print(model.predict(np.array([[[1, 0]]])))
So, to check my RNN, I passed the same input [1, 0]. As my y-labels change over time from -1 to 1, I expected a negative output from my network. But as a result I got [[[0.9986634]]], which is strange to me. Next I reversed the y-labels to [1, 1, 1, -1, -1, -1], and in this case I did get the expected negative result [[[-0.99519145]]]. So to me it seems that the Keras RNN treats its input in reversed order. Is that really true, or is there something wrong with my code?
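As a sanity check on the direction question: a SimpleRNN step can be unrolled by hand, and (with the default go_backwards=False) Keras consumes timesteps first-to-last. The loop below is a plain-NumPy unrolling with toy weights, just to make the iteration order explicit:

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b):
    """Unrolled SimpleRNN: h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b)."""
    h = np.zeros(W_h.shape[0])
    visited, outputs = [], []
    for t in range(x.shape[0]):          # forward order: t = 0 .. T-1
        h = np.tanh(x[t] @ W_x + h @ W_h + b)
        visited.append(t)
        outputs.append(h)
    return np.stack(outputs), visited

x = np.ones((6, 2))                      # 6 timesteps, 2 features, as in the post
W_x = np.full((2, 2), 0.1)
W_h = np.full((2, 2), 0.1)
b = np.zeros(2)

outs, order = simple_rnn_forward(x, W_x, W_h, b)
print(order)   # [0, 1, 2, 3, 4, 5] - timesteps processed first-to-last
```

So the surprising sign in the post is not explained by reversed input order; it more likely reflects what the trained network does with a length-1 sequence versus the length-6 training sequence.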

LSTM architecture in Keras implementation?
I am new to
Keras
and going through the LSTM and its implementation details in the Keras documentation
. It was going smoothly, but suddenly I came across this SO post and its comment, which have confused me about what the actual LSTM architecture is. Here is the code:
model = Sequential()
model.add(LSTM(32, input_shape=(10, 64)))
model.add(Dense(2))
As per my understanding, 10 denotes the no. of timesteps, and each of them is fed to its respective
LSTM cell
; 64 denotes the no. of features for each timestep. But the comment in the above post and the actual answer have confused me about the meaning of 32.
Also, how is the output from
LSTM
getting connected to the
layer? A hand-drawn diagrammatic explanation would be quite helpful in visualizing the architecture.
EDIT:
As far as this other SO post is concerned, it means 32 represents the length of the output vector produced by each of the
LSTM cells
if return_sequences=True
. If that's true, then how do we connect the 32-dimensional output produced by each of the 10 LSTM cells to the next Dense layer?
Also, kindly tell me whether the first SO post's answer is ambiguous or not.
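For what it's worth, the meaning of 32 (the `units` argument, i.e. the length of the hidden/output vector) can be verified from the parameter count that model.summary() reports for LSTM(32, input_shape=(10, 64)); the arithmetic below reproduces it:

```python
# LSTM(32) on 64-feature inputs: four gates, each with a kernel over the
# 64 input features, a recurrent kernel over the 32-dim hidden state, and a bias
features, units = 64, 32
lstm_params = 4 * (features * units + units * units + units)

# with return_sequences=False (the default here), only the last timestep's
# 32-dim vector is emitted, so Dense(2) connects to a single 32-vector
dense_params = units * 2 + 2

print(lstm_params, dense_params)   # 12416 66
```

So the 10 timesteps all pass through the same cell with the same weights; only when return_sequences=True do all 10 of the 32-dim outputs reach the next layer (and a Dense then applies per-timestep).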

Input of an RNN model layer: how does it work?
I don't understand the input of the RNN model. Why does it show None before the node size in every layer? Why is it (None, 1) and (None, 12)?
This is my code.
K.clear_session()
model = Sequential()
model.add(Dense(12, input_dim=1, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
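(Context, not part of the post: the None in (None, 1) and (None, 12) is the batch dimension, which Keras leaves unspecified so the same model can be fed any batch size; layer sizes and parameter counts involve only the remaining dimensions.) A small check of the numbers the summary reports for this model:

```python
# (None, 1) -> Dense(12) -> (None, 12): None is the (unknown) batch size,
# 1 and 12 are the per-example feature widths.
input_dim, units = 1, 12

# parameter counts are independent of the batch dimension
dense1_params = input_dim * units + units   # weights + biases = 24
dense2_params = units * 1 + 1               # = 13

print(dense1_params, dense2_params)
```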

Custom batches for tf.data.Dataset
I'm using the Estimator API of tensorflow and would like to create custom batches for training.
I have examples that look as follows:
example1 = {
    "num_sentences": 3,
    "sentences": [[1, 2], [3, 4], [5, 6]]
}
example2 = {
    "num_sentences": 2,
    "sentences": [[1, 2], [3, 4]]
}
So an example can have any number of fixed-size sentences. Now I would like to build batches whose size depends on the number of sentences in the batch. Otherwise I would have to use batch size 1, since a single example may already have "batch size" sentences, and a large batch size does not fit into GPU memory.
For example: I have a batch size of 6 and examples with sentence counts [5, 3, 3, 2, 2, 1]. Then I group the examples into the batches [5], [3, 3] and [2, 2, 1]. Note that the example with 1 sentence in the last batch would be padded.
I have written an algorithm that groups the examples into such batches. Now I am not able to feed the batches into tf.data.Dataset.
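Such a grouping can be done greedily; this sketch (my own, not the poster's algorithm) packs examples, sorted by sentence count, into a batch as long as the padded cost, batch length × max sentence count, stays within the budget:

```python
def group_into_batches(sentence_counts, budget):
    """Greedy packing: padded batch cost = len(batch) * max(batch) <= budget."""
    batches, current = [], []
    for n in sorted(sentence_counts, reverse=True):
        # batches are filled in descending order, so current[0] is the max
        if current and (len(current) + 1) * current[0] > budget:
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches

print(group_into_batches([5, 3, 3, 2, 2, 1], budget=6))
# reproduces the grouping from the post: [[5], [3, 3], [2, 2, 1]]
```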
I have tried using
tf.data.Dataset.from_generator
but the method seems to expect individual examples, and I get an error if the generator yields batches like [example1, example2]. How can I feed the Dataset with custom batches? Is there a more elegant way to solve my problem?
Update: I assume I am not providing the output_shapes parameter correctly. The following code works fine:
import tensorflow as tf

def gen():
    for b in range(3):
        # yield [{"num_sentences": 3, "sentences": [[1, 2], [3, 4], [5, 6]]}]
        yield {"num_sentences": 3, "sentences": [[1, 2], [3, 4], [5, 6]]}

dataset = tf.data.Dataset.from_generator(
    generator=gen,
    output_types={'num_sentences': tf.int32, 'sentences': tf.int32},
    # output_shapes=tf.TensorShape([None, {'num_sentences': tf.TensorShape(None), 'sentences': tf.TensorShape(None)}])
    output_shapes={'num_sentences': tf.TensorShape(None), 'sentences': tf.TensorShape(None)}
)

def print_dataset(dataset):
    it = dataset.make_one_shot_iterator()
    with tf.Session() as sess:
        print(dataset.output_shapes)
        print(dataset.output_types)
        while True:
            try:
                data = it.get_next()
                print("data" + str(sess.run(data)))
            except tf.errors.OutOfRangeError:
                break

print_dataset(dataset)
If I yield an array instead and uncomment the output_shapes, I get the error "int() argument must be a string, a bytes-like object or a number, not 'dict'".
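One workaround (my sketch, not from the post) is to do the batching and padding in plain Python and have the generator yield already-batched, padded arrays, so from_generator only ever sees one dict per batch and the shapes stay dense:

```python
import numpy as np

def pad_batch(examples, sentence_len=2):
    """Pad a list of examples into one dense batch dict (what gen() would yield)."""
    max_sents = max(e["num_sentences"] for e in examples)
    sents = np.zeros((len(examples), max_sents, sentence_len), dtype=np.int32)
    for i, e in enumerate(examples):
        sents[i, :e["num_sentences"]] = e["sentences"]   # zero-pad the rest
    return {
        "num_sentences": np.array([e["num_sentences"] for e in examples], dtype=np.int32),
        "sentences": sents,
    }

example1 = {"num_sentences": 3, "sentences": [[1, 2], [3, 4], [5, 6]]}
example2 = {"num_sentences": 2, "sentences": [[1, 2], [3, 4]]}

batch = pad_batch([example1, example2])
print(batch["sentences"].shape)   # one dense (2, 3, 2) tensor for the batch
```

The from_generator output_shapes would then describe a batch, e.g. `{'num_sentences': [None], 'sentences': [None, None, 2]}`, and the model reads num_sentences to mask the padding.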

MirroredStrategy doesn't use GPUs
I wanted to use tf.contrib.distribute.MirroredStrategy() on my multi-GPU system, but it doesn't use the GPUs for training (see the output below). I am running tensorflow-gpu 1.12.
I did try to specify the GPUs directly in the MirroredStrategy, but the same problem appeared.
model = models.Model(inputs=input, outputs=y_output)

optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
model.compile(loss=lossFunc, optimizer=optimizer)

NUM_GPUS = 2
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(model, config=config)
These are the results I am getting:
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:1
WARNING:tensorflow:Not all devices in DistributionStrategy are visible to TensorFlow session.
The expected result would obviously be for the training to run on the multi-GPU system. Is this a known issue?

Tensorflow Parameter Server Hangs when doing distributed training with Estimator
System information
- TensorFlow version: 1.5, 1.8, 1.12 (all give the same result)
- macOS 10.14.3
Is the parameter server expected to be killed after training is done when using tf.estimator.Estimator for distributed training? Or is it expected behavior that
ps
hangs forever? I am trying a simple MNIST example on localhost to get distributed training with an Estimator working, but I am not able to do so.
Here is the complete code (which I downloaded and modified from https://github.com/yuiskw/tensorflowservingexample/blob/master/python/train/mnist_custom_estimator.py):
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import simplejson
import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

tf.app.flags.DEFINE_integer('steps', 100, 'The number of steps to train a model')
tf.app.flags.DEFINE_string('job_name', 'master', 'job_name')
tf.app.flags.DEFINE_string('task_index', '0', 'task_index')
tf.app.flags.DEFINE_string('model_dir', './models/ckpt/', 'Dir to save a model and checkpoints')
FLAGS = tf.app.flags.FLAGS

INPUT_FEATURE = 'image'
NUM_CLASSES = 10


def model_fn(features, labels, mode):
    # Input Layer
    input_layer = features[INPUT_FEATURE]

    # Logits layer
    logits = tf.layers.dense(inputs=input_layer, units=NUM_CLASSES)

    predictions = {
        # Generate predictions (for PREDICT and EVAL mode)
        "classes": tf.argmax(input=logits, axis=1),
        # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
        # `logging_hook`.
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }

    # PREDICT mode
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions=predictions,
            export_outputs={
                'predict': tf.estimator.export.PredictOutput(predictions)
            })

    # Calculate Loss (for both TRAIN and EVAL modes)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics (for EVAL mode)
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
    }
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)


def main(_):
    # Load training and eval data
    mnist = tf.contrib.learn.datasets.load_dataset("mnist")
    train_data = mnist.train.images  # Returns np.array
    train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
    eval_data = mnist.test.images  # Returns np.array
    eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

    # reshape images
    # To have input as an image, we reshape images beforehand.
    train_data = train_data.reshape(train_data.shape[0], 28 * 28)
    eval_data = eval_data.reshape(eval_data.shape[0], 28 * 28)

    # Create the Estimator
    training_config = tf.estimator.RunConfig(
        model_dir=FLAGS.model_dir,
        save_summary_steps=20,
        save_checkpoints_steps=20)
    classifier = tf.estimator.Estimator(
        model_fn=model_fn,
        model_dir=FLAGS.model_dir,
        config=training_config)

    # Set up logging for predictions
    # Log the values in the "Softmax" tensor with label "probabilities"
    tensors_to_log = {"probabilities": "softmax_tensor"}
    logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=50)

    # Train the model
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={INPUT_FEATURE: train_data},
        y=train_labels,
        batch_size=FLAGS.steps,
        num_epochs=1,
        shuffle=True)

    # Evaluate the model and print results
    eval_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={INPUT_FEATURE: eval_data},
        y=eval_labels,
        num_epochs=1,
        shuffle=False)

    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
    # setup eval spec evaluating every n seconds
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

    tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)


def make_tf_training_config(args):
    """
    Returns TF_CONFIG that can be used to set the environment variable necessary for distributed training
    See https://github.com/clusterone/clusteronetutorials/blob/master/tfestimator/mnist.py
    """
    worker_hosts = ['localhost:2222']
    ps_hosts = ['localhost:2224']
    tf_config = {
        'task': {
            'type': FLAGS.job_name,
            'index': FLAGS.task_index
        },
        'cluster': {
            'master': [worker_hosts[0]],
            'ps': ps_hosts
        },
        'environment': 'cloud'
    }
    return tf_config


if __name__ == "__main__":
    print("@@@ Version: {}".format(tf.__version__))
    tf_config = make_tf_training_config(None)
    os.environ['TF_CONFIG'] = simplejson.dumps(tf_config)
    tf.app.run()
Here is how I launch one master job and one ps job:
# launch master worker job
python test.py
# launch ps job
python test.py --job_name ps
Here is the job log for the master worker job (only the last few lines, as this job succeeded and exited):
INFO:tensorflow:Finished evaluation at 2019-02-19-01:40:18
INFO:tensorflow:Saving dict for global step 3301: accuracy = 0.9918, global_step = 3301, loss = 0.025573652
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 3301: ./models/ckpt/model.ckpt-3301
INFO:tensorflow:Loss for final step: 0.02294134.
Here is the complete job log for the ps job:
@@@ Version: 1.12.0
WARNING:tensorflow:From test.py:118: load_dataset (from tensorflow.contrib.learn.python.learn.datasets) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data.
WARNING:tensorflow:From /Users/hjing/miniconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/__init__.py:80: load_mnist (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /Users/hjing/miniconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:300: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /Users/hjing/miniconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /Users/hjing/miniconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST-data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /Users/hjing/miniconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /Users/hjing/miniconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
INFO:tensorflow:TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'ps': [u'localhost:2224'], u'master': [u'localhost:2222']}, u'task': {u'index': u'0', u'type': u'ps'}}
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': device_filters: "/job:ps" device_filters: "/job:worker" device_filters: "/job:master" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }, '_keep_checkpoint_max': 5, '_task_type': u'ps', '_train_distribute': None, '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x127a47550>, '_model_dir': './models/ckpt/', '_protocol': None, '_save_checkpoints_steps': 20, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 1, '_tf_random_seed': None, '_save_summary_steps': 20, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 1, '_master': u'grpc://localhost:2224'}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Start Tensorflow server.
2019-02-18 17:38:15.844761: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-18 17:38:15.846148: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2019-02-18 17:38:15.846171: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2224}
2019-02-18 17:38:15.846704: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2224
And the ps hangs forever.
Is this behavior expected ?
Thanks!