2023-11-12 08:06:35.785982: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-11-12 08:06:35.786005: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Initializing orca context
Current pyspark location is : /opt/spark-2.4/python/lib/pyspark.zip/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is: --driver-class-path /home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.3.0.jar:/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.3.0-jar-with-dependencies.jar:/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.3.0-jar-with-dependencies.jar pyspark-shell
23/11/12 08:06:37 INFO spark.SparkContext: Running Spark version 2.4.3
23/11/12 08:06:37 INFO spark.SparkContext: Submitted application: b.py
23/11/12 08:06:37 INFO spark.SecurityManager: Changing view acls to: ubuntu
23/11/12 08:06:37 INFO spark.SecurityManager: Changing modify acls to: ubuntu
23/11/12 08:06:37 INFO spark.SecurityManager: Changing view acls groups to:
23/11/12 08:06:37 INFO spark.SecurityManager: Changing modify acls groups to:
23/11/12 08:06:37 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
23/11/12 08:06:37 INFO util.Utils: Successfully started service 'sparkDriver' on port 42335.
23/11/12 08:06:37 INFO spark.SparkEnv: Registering MapOutputTracker
23/11/12 08:06:37 INFO spark.SparkEnv: Registering BlockManagerMaster
23/11/12 08:06:37 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/11/12 08:06:37 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/11/12 08:06:37 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-27db52eb-faea-4255-b4c2-be4acbf93c70
23/11/12 08:06:37 INFO memory.MemoryStore: MemoryStore started with capacity 5.2 GB
23/11/12 08:06:37 INFO spark.SparkEnv: Registering OutputCommitCoordinator
23/11/12 08:06:37 INFO util.log: Logging initialized @3680ms
23/11/12 08:06:37 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
23/11/12 08:06:37 INFO server.Server: Started @3771ms
23/11/12 08:06:37 INFO server.AbstractConnector: Started ServerConnector@2059cb89{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
23/11/12 08:06:37 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4b661eaa{/jobs,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@737db98e{/jobs/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@318b1348{/jobs/job,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3dd855cf{/jobs/job/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@23772a33{/stages,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a8f6c46{/stages/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@357e9d4a{/stages/stage,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6117f6cb{/stages/stage/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2042b36e{/stages/pool,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@54c1b899{/stages/pool/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5eee1a54{/storage,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@111eec{/storage/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@35fd0443{/storage/rdd,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6ca17715{/storage/rdd/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@42b3fc6f{/environment,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4add6cab{/environment/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@17dfb6af{/executors,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@70a1729e{/executors/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2d2ca861{/executors/threadDump,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@cb5315b{/executors/threadDump/json,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@663584ac{/static,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@394a24ab{/,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@43b43fa4{/api,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@191a51fb{/jobs/job/kill,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@575b305b{/stages/stage/kill,null,AVAILABLE,@Spark}
23/11/12 08:06:37 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ubuntu1:4040
23/11/12 08:06:37 INFO spark.SparkContext: Added JAR file:///home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.3.0.jar at spark://ubuntu1:42335/jars/all-2.3.0.jar with timestamp 1699751197820
23/11/12 08:06:37 INFO spark.SparkContext: Added JAR file:///home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.3.0-jar-with-dependencies.jar at spark://ubuntu1:42335/jars/bigdl-dllib-spark_2.4.6-2.3.0-jar-with-dependencies.jar with timestamp 1699751197821
23/11/12 08:06:37 INFO spark.SparkContext: Added JAR file:///home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.3.0-jar-with-dependencies.jar at spark://ubuntu1:42335/jars/bigdl-orca-spark_2.4.6-2.3.0-jar-with-dependencies.jar with timestamp 1699751197821
23/11/12 08:06:37 INFO executor.Executor: Starting executor ID driver on host localhost
23/11/12 08:06:38 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37551.
23/11/12 08:06:38 INFO netty.NettyBlockTransferService: Server created on ubuntu1:37551
23/11/12 08:06:38 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/11/12 08:06:38 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ubuntu1, 37551, None)
23/11/12 08:06:38 INFO storage.BlockManagerMasterEndpoint: Registering block manager ubuntu1:37551 with 5.2 GB RAM, BlockManagerId(driver, ubuntu1, 37551, None)
23/11/12 08:06:38 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ubuntu1, 37551, None)
23/11/12 08:06:38 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, ubuntu1, 37551, None)
23/11/12 08:06:38 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@58189612{/metrics/json,null,AVAILABLE,@Spark}
2023-11-12 08:06:39,020 Thread-6 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2023-11-12 08:06:39,023 Thread-6 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2023-11-12 08:06:39,024 Thread-6 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2023-11-12 08:06:39,024 Thread-6 WARN The bufferSize is set to 4000 but bufferedIo is false: false
23-11-12 08:06:39 [Thread-6] INFO Engine$:122 - Auto detect executor number and executor cores number
23-11-12 08:06:39 [Thread-6] INFO Engine$:124 - Executor number is 1 and executor cores number is 4
23-11-12 08:06:39 [Thread-6] INFO ThreadPool$:95 - Set mkl threads to 1 on thread 18
23/11/12 08:06:39 WARN spark.SparkContext: Using an existing SparkContext; some configuration may not take effect.
23-11-12 08:06:39 [Thread-6] INFO Engine$:461 - Find existing spark context. Checking the spark conf...
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Successfully got a SparkContext
Vocabulary size: 7978
Maximum length: 189
2023-11-12 08:06:42,191 INFO services.py:1340 -- View the Ray dashboard at http://172.20.201.186:8265
{'node_ip_address': '172.20.201.186', 'raylet_ip_address': '172.20.201.186', 'redis_address': '172.20.201.186:6379', 'object_store_address': '/tmp/ray/session_2023-11-12_08-06-39_918267_21489/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-11-12_08-06-39_918267_21489/sockets/raylet', 'webui_url': '172.20.201.186:8265', 'session_dir': '/tmp/ray/session_2023-11-12_08-06-39_918267_21489', 'metrics_export_port': 57829, 'node_id': '2ef537723b0b546a62d7a716472adc697bac34537331623c04f82325'}
(Worker pid=21738) 2023-11-12 08:06:43.947024: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=21738) 2023-11-12 08:06:43.947051: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=21738) WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/tf_runner.py:337: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
(Worker pid=21738) Instructions for updating:
(Worker pid=21738) use distribute.MultiWorkerMirroredStrategy instead
(Worker pid=21738) 2023-11-12 08:06:45.205836: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=21738) 2023-11-12 08:06:45.205858: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=21738) 2023-11-12 08:06:45.205871: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ubuntu1): /proc/driver/nvidia/version does not exist
(Worker pid=21738) 2023-11-12 08:06:45.206336: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(Worker pid=21738) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=21738) 2023-11-12 08:06:45.209443: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 172.20.201.186:37441}
(Worker pid=21738) 2023-11-12 08:06:45.209579: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://172.20.201.186:37441
2023-11-12 08:06:45.500578: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-11-12 08:06:45.500601: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-11-12 08:06:45.500614: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ubuntu1): /proc/driver/nvidia/version does not exist
2023-11-12 08:06:45.500887: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/ubuntu/Downloads/spark-20231031T155920Z-001/spark/b.py", line 88, in <module>
    validation_data=test_dataset
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/ray_estimator.py", line 311, in fit
    for i in range(self.num_workers)])
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/orca/learn/tf2/ray_estimator.py", line 311, in <listcomp>
    for i in range(self.num_workers)])
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/actor.py", line 120, in remote
    return self._remote(args, kwargs)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 410, in _start_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/actor.py", line 167, in _remote
    return invocation(args, kwargs)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/actor.py", line 161, in invocation
    num_returns=num_returns)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/actor.py", line 945, in _actor_method_call
    list_args, name, num_returns, self._ray_actor_method_cpus)
  File "python/ray/_raylet.pyx", line 1609, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 1614, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 380, in ray._raylet.prepare_args
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 361, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 341, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 301, in _serialize_to_pickle5
    raise e
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 298, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1074, in __reduce__
    return convert_to_tensor, (self._numpy(),)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1117, in _numpy
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot convert a Tensor of dtype variant to a NumPy array.
Stopping orca context