TensorFlow out-of-memory error running Inception v3 distributed on 4 machines
I'm trying to run Inception v3 (https://github.com/tensorflow/models/tree/master/inception) distributed across up to 32 machines.
I see an out-of-memory error when I run it on 4 machines.
Here is the error:
INFO:tensorflow:Started 0 queues for processing input data.
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[2048,1001]
     [[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]]
     [[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]]
Traceback (most recent call last):
  File "imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 286, in train
    loss_value, step = sess.run([train_op, global_step])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[2048,1001]
     [[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]]
     [[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]]
Caused by op u'gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul', defined at:
  File "imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 215, in train
    grads = opt.compute_gradients(total_loss)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 229, in compute_gradients
    return self._opt.compute_gradients(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 253, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
    in_grads = _AsList(grad_fn(op, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 402, in _L2LossGrad
    return op.inputs[0] * grad
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 754, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 903, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1427, in mul
    result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
...which was originally created as op u'logits/logits/weights/Regularizer/L2Regularizer/L2Loss', defined at:
  File "imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
[elided 1 identical lines from previous traceback]
  File "imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 154, in train
    logits = inception.inference(images, num_classes, for_training=True)
  File "/home/ubuntu/indu/models/inception/inception/inception_model.py", line 87, in inference
    scope=scope)
  File "/home/ubuntu/indu/models/inception/inception/slim/inception_model.py", line 326, in inception_v3
    restore=restore_logits)
  File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/indu/models/inception/inception/slim/ops.py", line 300, in fc
    restore=restore)
  File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args
    return func(*args, **current_args)
  File "/home/ubuntu/indu/models/inception/inception/slim/variables.py", line 290, in variable
    trainable=trainable, collections=collections)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 830, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 673, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 217, in get_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 202, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
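Note the shape of the tensor that fails to allocate: it is the 2048x1001 logits weight matrix, which is tiny. A quick check (plain arithmetic, assuming 4-byte floats):

```python
# Size of the tensor that triggered the OOM: shape [2048, 1001], float32.
rows, cols, bytes_per_float = 2048, 1001, 4
size_bytes = rows * cols * bytes_per_float
print('%.1f MB' % (size_bytes / (1024.0 * 1024.0)))  # prints 7.8 MB
```

Failing to find roughly 8 MB means the 4 GB card was already almost entirely committed before this allocation, which shifts suspicion from this particular op to whatever reserved the rest of the memory.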
I'm using EC2 g2.8xlarge instances. These instances have:
- Intel Xeon E5-2670 (Sandy Bridge) processors
- 60 GB of memory
- Four GK104GL [GRID K520] GPUs, each with 4 GB of memory
- A 10 Gigabit NIC
I'm running Ubuntu 14.04.4 LTS on these machines.
I'm running one worker per GPU, so there are 16 workers in total.
I'm running one PS per machine, so 4 PS tasks in total.
I'm using a batch size of 8. (4 machines run out of memory with a batch size of 8; 32 machines run out of memory even with a batch size of 2.)
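One detail worth checking: the ps task is started with CUDA_VISIBLE_DEVICES='', but the four worker processes on each machine are not pinned to a GPU, so by default each TensorFlow process maps, and reserves memory on, all four visible GPUs. Below is a minimal sketch of pinning each worker to a single card before TensorFlow initializes; `pin_worker_to_gpu` is a hypothetical helper, not part of the Inception scripts:

```python
import os

def pin_worker_to_gpu(task_id, gpus_per_machine=4):
    """Restrict this worker process to one GPU (hypothetical helper).

    Must run before TensorFlow is imported, because the CUDA runtime
    reads CUDA_VISIBLE_DEVICES only once, at initialization.
    """
    gpu_index = task_id % gpus_per_machine
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_index)
    return gpu_index

# Example: worker task 6 (second worker on machine2) should see GPU 2.
if __name__ == '__main__':
    print(pin_worker_to_gpu(6))  # prints 2
```

Equivalently, CUDA_VISIBLE_DEVICES could be set per worker in the launch commands; another option (also an assumption on my part, not something I have verified on this setup) is passing tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True)) when creating the session, so each process allocates GPU memory on demand instead of up front.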
Installed CUDA and cuDNN versions:
ubuntu@ip-172-31-16-180:~$ ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root 322936 Aug 15 2015 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root     16 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root     19 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root 383336 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5.18
-rw-r--r-- 1 root root 720192 Aug 15 2015 /usr/local/cuda/lib64/libcudart_static.a
I installed TensorFlow from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl
ubuntu@ip-172-31-16-180:~$ python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
0.10.0rc0
Could someone please help me figure out how to fix this and run Inception v3 on a cluster of 32 machines?
More info: Here are the commands I'm executing on the machines in the cluster:
On machine1: CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=0 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=0 > /tmp/worker0 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=1 > /tmp/worker1 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=2 > /tmp/worker2 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker 
--task_id=3 > /tmp/worker3 2>&1 & On machine2: CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=1 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=4 > /tmp/worker4 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=5 > /tmp/worker5 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=6 > /tmp/worker6 2>&1 & python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 
--worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=7 > /tmp/worker7 2>&1 &

On machine3:

CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=2 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=8 > /tmp/worker8 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=9 > /tmp/worker9 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=10 > /tmp/worker10 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=11 > /tmp/worker11 2>&1 &

On machine4:

CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=3 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=12 > /tmp/worker12 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=13 > /tmp/worker13 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=14 > /tmp/worker14 2>&1 &
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=15 > /tmp/worker15 2>&1 &
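For reference, the topology these commands build is 4 parameter servers (one per machine on port 2222) and 16 workers (four per machine on ports 2230-2233). As a plain-Python restatement of the flags above (not part of the original scripts), this is the dictionary form that `tf.train.ClusterSpec` accepts:

```python
# The cluster the launch commands above describe: one ps task per machine
# on port 2222, and four worker tasks per machine on ports 2230-2233.
# This dict is the form accepted by tf.train.ClusterSpec(cluster).
cluster = {
    "ps": ["worker%d:2222" % m for m in range(1, 5)],
    "worker": ["worker%d:%d" % (m, port)
               for m in range(1, 5)
               for port in range(2230, 2234)],
}

print(len(cluster["ps"]), "ps tasks;", len(cluster["worker"]), "worker tasks")
print("task_id=8 is", cluster["worker"][8])  # the first worker on machine3
```

Note how `--task_id` indexes into the flat `worker` list, so tasks 8-11 are the four workers launched on machine3 and tasks 12-15 those on machine4.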
Update 1:
I tried the following experiments:
Experiment 1:
- Worker1, worker2, worker3 and worker4 on machine1
- ps1 on machine1, ps2 on machine2, ps3 on machine3, ps4 on machine4.
This is the same as the four-machine configuration that failed, except that the workers on three of the four machines are removed. The worker load on machine1 remains the same, and the communication load on machine1 (talking to four ps tasks) also remains the same. I expected this to run out of memory, but it worked perfectly fine.
Experiment 2:
- Worker1, worker2, worker3 and worker4 on machine1.
- ps1 (only ps) on machine2.
This worked like a charm, and training progressed faster than in experiment 1.
Given this, I wonder why four machines using all four GPUs run out of memory.
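One thing worth noting about the error itself: the tensor named in the OOM message is not large. A quick back-of-the-envelope check (plain Python, no TensorFlow needed) shows that a [2048, 1001] float32 tensor needs only about 8 MB, so the failing allocation is just the last straw on an already-exhausted GPU, not a single oversized buffer:

```python
# The allocation that failed, per the error message:
# shape [2048, 1001], dtype DT_FLOAT (float32, 4 bytes per element).
elements = 2048 * 1001
size_bytes = elements * 4
print("%d bytes (about %.1f MiB)" % (size_bytes, size_bytes / 2.0**20))
# -> 8200192 bytes (about 7.8 MiB)
```

This is consistent with the observation above: the same workers fit in memory in experiments 1 and 2, so the extra consumption in the four-machine run must come from somewhere other than any single gradient tensor.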
Original: https://stackoverflow.com/questions/39567835
Best answer
To do what you want, you should have something like this:
var user = new User({username: 'Name', password: 'unsecure'});
user.save();
There are a few things odd with your code, so I highly suggest going over a tutorial that uses express and mongoose to create a sample site (most likely you can find a blog).
Here is one I made: https://github.com/mathrawka/node-express-starter
Good luck!