alluxio master register itself as a worker

alluxio master register itself as a worker

Kun Li
I'm running Alluxio 1.8.0 with Spark 2.3.1 in a Kubernetes cluster.

The Alluxio master runs as a StatefulSet, so it has the domain name alluxio-master-0.alluxio-master.cdev.svc.cluster.local.

These are the Alluxio pods that I have:

$ kubectl get po -ncdev

NAME                             READY   STATUS    RESTARTS   AGE
alluxio-master-0                 2/2     Running   0          6m
alluxio-worker-121418788-5fsfk   2/2     Running   0          17h
alluxio-worker-121418788-m906g   2/2     Running   0          17h
alluxio-worker-121418788-zdsrf   2/2     Running   0          17h
...


And the following shows in the logs of alluxio-master-0:

2018-09-14 12:05:51,558 INFO  WebServer - Alluxio Master Web service started @ /0.0.0.0:19999

2018-09-14 12:05:51,563 INFO  AlluxioMasterProcess - Alluxio master version 1.8.0 started (gained leadership). bindHost=/0.0.0.0:19998, connectHost=ac04.rinc.com/10.1.1.27:19998, rpcPort=19998, webPort=19999
2018-09-14 12:05:51,598 INFO  DefaultSafeModeManager - Rpc server started, waiting 5000ms for workers to register
2018-09-14 12:05:51,600 INFO  FaultTolerantAlluxioMasterProcess - Primary started
2018-09-14 12:05:51,849 WARN  DefaultBlockMaster - Could not find worker id: 6064340727611749855 for heartbeat.
2018-09-14 12:05:51,871 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=alluxio-master-0.alluxio-master.cdev.svc.cluster.local, rpcPort=29998, dataPort=29999, webPort=29996, domainSocketPath=, tieredIdentity=TieredIdentity(node=alluxio-master-0.alluxio-master.cdev.svc.cluster.local, rack=null)} id: 7545956817790988295
2018-09-14 12:05:51,887 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=7545956817790988295, workerAddress=WorkerNetAddress{host=alluxio-master-0.alluxio-master.cdev.svc.cluster.local, rpcPort=29998, dataPort=29999, webPort=29996, domainSocketPath=, tieredIdentity=TieredIdentity(node=alluxio-master-0.alluxio-master.cdev.svc.cluster.local, rack=null)}, capacityBytes=37580963840, usedBytes=0, lastUpdatedTimeMs=1536897951886, blocks=[]}
2018-09-14 12:05:53,057 WARN  DefaultBlockMaster - Could not find worker id: 5141368466075040629 for heartbeat.
2018-09-14 12:05:53,058 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=ac04.rinc.com, rpcPort=29998, dataPort=29999, webPort=29996, domainSocketPath=, tieredIdentity=TieredIdentity(node=ac04.rinc.com, rack=null)} id: 8231393212956172642
2018-09-14 12:05:53,061 WARN  DefaultBlockMaster - Invalid block: 151095607296 from worker ac04.rinc.com.
2018-09-14 12:05:53,061 WARN  DefaultBlockMaster - Invalid block: 152051908608 from worker ac04.rinc.com.
2018-09-14 12:05:53,061 INFO  DefaultBlockMaster - Requesting delete for orphaned block: 151095607296 from worker ac04.rinc.com.
2018-09-14 12:05:53,061 INFO  DefaultBlockMaster - Requesting delete for orphaned block: 152051908608 from worker ac04.rinc.com.
2018-09-14 12:05:53,061 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=8231393212956172642, workerAddress=WorkerNetAddress{host=ac04.rinc.com, rpcPort=29998, dataPort=29999, webPort=29996, domainSocketPath=, tieredIdentity=TieredIdentity(node=ac04.rinc.com, rack=null)}, capacityBytes=37580963840, usedBytes=21308605, lastUpdatedTimeMs=1536897953061, blocks=[151095607296, 152051908608]}
2018-09-14 12:05:53,848 WARN  DefaultBlockMaster - Could not find worker id: 5786180206808424635 for heartbeat.
2018-09-14 12:05:53,849 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=aa04.rinc.com, rpcPort=29998, dataPort=29999, webPort=29996, domainSocketPath=, tieredIdentity=TieredIdentity(node=aa04.rinc.com, rack=null)} id: 3677615251792051127
2018-09-14 12:05:53,853 WARN  DefaultBlockMaster - Invalid block: 151280156672 from worker aa04.rinc.com.
2018-09-14 12:05:53,853 WARN  DefaultBlockMaster - Invalid block: 151112384512 from worker aa04.rinc.com.
2018-09-14 12:05:53,853 INFO  DefaultBlockMaster - Requesting delete for orphaned block: 151280156672 from worker aa04.rinc.com.
2018-09-14 12:05:53,853 INFO  DefaultBlockMaster - Requesting delete for orphaned block: 151112384512 from worker aa04.rinc.com.
2018-09-14 12:05:53,853 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=3677615251792051127, workerAddress=WorkerNetAddress{host=aa04.rinc.com, rpcPort=29998, dataPort=29999, webPort=29996, domainSocketPath=, tieredIdentity=TieredIdentity(node=aa04.rinc.com, rack=null)}, capacityBytes=37580963840, usedBytes=12492046, lastUpdatedTimeMs=1536897953853, blocks=[151280156672, 151112384512]}


 
So three workers are registered at this point, one of them under the master's own hostname, and my Spark job's write to Alluxio failed. The registered workers are:
aa04.rinc.com
ac04.rinc.com
alluxio-master-0.alluxio-master.cdev.svc.cluster.local
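
For anyone who wants to check for the same symptom, here is a minimal Python sketch (not part of Alluxio) that pulls the registered worker hosts out of master log lines shaped like the ones above and flags any that match the master's own hostname. The function names and the `MASTER_HOST` constant are made up for illustration, the sample lines are shortened copies of the registerWorker() entries from this post, and the regex assumes the 1.8.0 log format shown here.

```python
import re

# Hypothetical constant: the master's StatefulSet DNS name from this post.
MASTER_HOST = "alluxio-master-0.alluxio-master.cdev.svc.cluster.local"

# Assumes the registerWorker() log format shown in the master log above.
REGISTER_RE = re.compile(
    r"registerWorker\(\): MasterWorkerInfo\{id=(\d+), "
    r"workerAddress=WorkerNetAddress\{host=([^,]+),"
)


def registered_workers(log_lines):
    """Return (worker_id, host) pairs in the order they registered."""
    found = []
    for line in log_lines:
        match = REGISTER_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


def hosts_matching_master(workers, master_host):
    """Return worker hosts that are identical to the master's hostname."""
    return [host for _, host in workers if host == master_host]


# Shortened copies of the three registerWorker() lines from the log above
# (everything after rpcPort is cut off for brevity).
sample_log = [
    "2018-09-14 12:05:51,887 INFO  DefaultBlockMaster - registerWorker(): "
    "MasterWorkerInfo{id=7545956817790988295, workerAddress=WorkerNetAddress{"
    "host=alluxio-master-0.alluxio-master.cdev.svc.cluster.local, rpcPort=29998,",
    "2018-09-14 12:05:53,061 INFO  DefaultBlockMaster - registerWorker(): "
    "MasterWorkerInfo{id=8231393212956172642, workerAddress=WorkerNetAddress{"
    "host=ac04.rinc.com, rpcPort=29998,",
    "2018-09-14 12:05:53,853 INFO  DefaultBlockMaster - registerWorker(): "
    "MasterWorkerInfo{id=3677615251792051127, workerAddress=WorkerNetAddress{"
    "host=aa04.rinc.com, rpcPort=29998,",
]

workers = registered_workers(sample_log)
suspicious = hosts_matching_master(workers, MASTER_HOST)

for worker_id, host in workers:
    print(worker_id, host)
print("registered under the master's hostname:", suspicious)
```

In practice the lines would come from the real master log (for example, read from the output of `kubectl logs`) rather than from the hard-coded sample.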

Has anyone seen this before?

likun

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.
Re: alluxio master register itself as a worker

Bin Fan
Hi Kun,

It would be helpful if you could post the failure message from your Spark job.

- Bin

On Friday, September 14, 2018 at 7:25:00 AM UTC-7, Kun Li wrote:
> I'm running Alluxio 1.8.0 with spark 2.3.1, in a kubernetes cluster.
> [...]