Re: Alluxio not working on k8s


Re: Alluxio not working on k8s

Haoyuan Li
Please try our user mailing list.

Best regards,

Haoyuan (HY)




On Tue, Jun 12, 2018 at 1:57 AM Sushil Kumar Sah <[hidden email]> wrote:
The problem was the short-circuit domain socket path. I was using whatever is present by default in the Alluxio git repository. In the default integration/kubernetes/conf/alluxio.properties.template, the address for ALLUXIO_WORKER_DATA_SERVER_DOMAIN_SOCKET_ADDRESS was not a complete path. Enabling short-circuit reads in Alluxio worker containers through unix domain sockets is properly explained at https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Docker.html.
Because the unix domain socket path was incomplete, the Alluxio worker could not come up in Kubernetes when short-circuit reads were enabled: the worker tries to bind the socket at the configured path, and since that path (/opt/domain) already exists as the mounted hostPath directory, the bind fails with "Address in use" even though no TCP port is actually occupied.

When I corrected the path in integration/kubernetes/conf/alluxio.properties to ALLUXIO_WORKER_DATA_SERVER_DOMAIN_SOCKET_ADDRESS=/opt/domain/d, things started working properly. Some tests are still failing, but at least the Alluxio setup comes up properly; I will now debug why those tests fail.
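
To confirm the corrected value actually reaches the worker, it is worth checking the environment injected from the alluxio-config ConfigMap and the mounted socket directory inside the pod. A rough sketch, assuming the pod name from my run and that the image ships basic shell utilities:

# check that the corrected setting reached the worker pod's environment
kubectl exec alluxio-worker-knqt4 -- env | grep DOMAIN_SOCKET
# after a successful start, the worker should have created the socket file "d" under /opt/domain
kubectl exec alluxio-worker-knqt4 -- ls -l /opt/domain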

I have submitted this fix to the Alluxio git repository so it can be merged into the master branch:
https://github.com/Alluxio/alluxio/pull/7376


On Friday, June 8, 2018 at 4:45:43 PM UTC+5:30, Sushil Kumar Sah wrote:
I am trying Alluxio 1.7.1 with Docker 1.13.1 and Kubernetes 1.9.6 and 1.10.1.

I created the Alluxio Docker image as per the instructions at https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Docker.html.
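
For anyone reproducing this, the build step is roughly the following; it is only a sketch, assuming the Dockerfile lives under integration/docker in the 1.7.1 source tree, and the tag simply matches the image name I reference in the yamls:

# build the Alluxio image from the 1.7.1 source checkout (path and tag assumed from my setup)
cd integration/docker
docker build -t alluxio:1.7.1 .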

Then I followed the https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Kubernetes.html guide to run Alluxio on Kubernetes. I was able to bring up the Alluxio master pod properly, but when I try to bring up the Alluxio worker I get an "Address in use" error. I have not modified the yamls downloaded from the Alluxio git repository; the only changes I made were the Alluxio Docker image name and the Kubernetes API version so they match my cluster.
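
One detail from those yamls that matters here (also visible in the kubectl describe output further down): the worker mounts a hostPath volume at /opt/domain for the domain socket, backed by /tmp/domain on the node with HostPathType Directory. A hedged sketch of preparing that directory on each node in case it does not already exist; the path is taken from my yaml, so adjust if yours differs:

# create the hostPath directory backing the worker's /opt/domain mount, on every node
mkdir -p /tmp/domain
chmod a+w /tmp/domain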

I checked the ports in use in my Kubernetes cluster and on the nodes themselves. None of the ports Alluxio needs are used by any other process, yet I still get the "Address in use" error, and I cannot figure out what to debug next or what to change to make this work. No other application is running on my Kubernetes cluster. I tried both single-node and multi-node cluster setups, and both Kubernetes 1.9 and 1.10.
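
Note that netstat -tuplena only covers TCP and UDP sockets, while the bind failure in the worker log below refers to the unix domain socket at /opt/domain. A hedged way to look at that side as well (the /tmp/domain hostPath comes from my worker yaml, and ss may not be installed on every node):

# unix domain sockets are not shown by netstat -tuplena; list them separately
ss -x | grep domain
# inspect the hostPath directory backing the worker's /opt/domain mount
ls -l /tmp/domain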

There is definitely some issue on the Alluxio worker side that I am unable to debug.

This is the log that I get from worker pod:

[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl logs po/alluxio-worker-knqt4
Formatting Alluxio Worker @ vm-sushil-scrum1-08062018-alluxio-1
2018-06-08 10:09:55,723 INFO  Configuration - Configuration file /opt/alluxio/conf/alluxio-site.properties loaded.
2018-06-08 10:09:55,845 INFO  Format - Formatting worker data folder: /alluxioworker/
2018-06-08 10:09:55,845 INFO  Format - Formatting Data path for tier 0:/dev/shm/alluxioworker
2018-06-08 10:09:55,856 INFO  Format - Formatting complete
2018-06-08 10:09:56,357 INFO  Configuration - Configuration file /opt/alluxio/conf/alluxio-site.properties loaded.
2018-06-08 10:09:56,549 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.194.11.7, rack=null)
2018-06-08 10:09:56,866 INFO  BlockWorkerFactory - Creating alluxio.worker.block.BlockWorker
2018-06-08 10:09:56,866 INFO  FileSystemWorkerFactory - Creating alluxio.worker.file.FileSystemWorker
2018-06-08 10:09:56,942 WARN  StorageTier - Failed to verify memory capacity
2018-06-08 10:09:57,082 INFO  log - Logging initialized @1160ms
2018-06-08 10:09:57,509 INFO  AlluxioWorkerProcess - Domain socket data server is enabled at /opt/domain.
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:164)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:45)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
        at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:56)
Caused by: java.lang.RuntimeException: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
        at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:224)
        at alluxio.worker.DataServer$Factory.create(DataServer.java:45)
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:159)
        ... 3 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
        at io.netty.channel.unix.Errors.newIOException(Errors.java:117)
        at io.netty.channel.unix.Socket.bind(Socket.java:259)
        at io.netty.channel.epoll.EpollServerDomainSocketChannel.doBind(EpollServerDomainSocketChannel.java:75)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:504)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1226)
        at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:495)
        at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:480)
        at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
        at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:213)
        at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:305)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at java.lang.Thread.run(Thread.java:748)

-----------------------
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl get all
NAME                DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ds/alluxio-worker   1         1         0         1            0           <none>          42m
ds/alluxio-worker   1         1         0         1            0           <none>          42m

NAME                          DESIRED   CURRENT   AGE
statefulsets/alluxio-master   1         1         44m

NAME                      READY     STATUS    RESTARTS   AGE
po/alluxio-master-0       1/1       Running   0          44m
po/alluxio-worker-knqt4   0/1       Error     12         42m

NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)               AGE
svc/alluxio-master   ClusterIP   None         <none>        19998/TCP,19999/TCP   44m
svc/kubernetes       ClusterIP   10.254.0.1   <none>        443/TCP               1h

---------------------

[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl describe po/alluxio-worker-knqt4
Name:           alluxio-worker-knqt4
Namespace:      default
Node:           vm-sushil-scrum1-08062018-alluxio-1/10.194.11.7
Start Time:     Fri, 08 Jun 2018 10:09:05 +0000
Labels:         app=alluxio
                controller-revision-hash=3081903053
                name=alluxio-worker
                pod-template-generation=1
Annotations:    <none>
Status:         Running
IP:             10.194.11.7
Controlled By:  DaemonSet/alluxio-worker
Containers:
  alluxio-worker:
    Container ID:  docker://40a1eff2cd4dff79d9189d7cb0c4826a6b6e4871fbac65221e7cdd341240e358
    Image:         alluxio:1.7.1
    Image ID:      docker://sha256:b080715bd53efc783ee5f54e7f1c451556f93e7608e60e05b4615d32702801af
    Ports:         29998/TCP, 29999/TCP, 29996/TCP
    Command:
      /entrypoint.sh
    Args:
      worker
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 08 Jun 2018 11:01:37 +0000
      Finished:     Fri, 08 Jun 2018 11:02:02 +0000
    Ready:          False
    Restart Count:  14
    Limits:
      cpu:     1
      memory:  2G
    Requests:
      cpu:     500m
      memory:  2G
    Environment Variables from:
      alluxio-config  ConfigMap  Optional: false
    Environment:
      ALLUXIO_WORKER_HOSTNAME:   (v1:status.hostIP)
    Mounts:
      /dev/shm from alluxio-ramdisk (rw)
      /opt/domain from alluxio-domain (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7xlz7 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  alluxio-ramdisk:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  Memory
  alluxio-domain:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp/domain
    HostPathType:  Directory
  default-token-7xlz7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-7xlz7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                 Age                 From                                          Message
  ----     ------                 ----                ----                                          -------
  Normal   SuccessfulMountVolume  56m                 kubelet, vm-sushil-scrum1-08062018-alluxio-1  MountVolume.SetUp succeeded for volume "alluxio-domain"
  Normal   SuccessfulMountVolume  56m                 kubelet, vm-sushil-scrum1-08062018-alluxio-1  MountVolume.SetUp succeeded for volume "alluxio-ramdisk"
  Normal   SuccessfulMountVolume  56m                 kubelet, vm-sushil-scrum1-08062018-alluxio-1  MountVolume.SetUp succeeded for volume "default-token-7xlz7"
  Normal   Pulled                 53m (x5 over 56m)   kubelet, vm-sushil-scrum1-08062018-alluxio-1  Container image "alluxio:1.7.1" already present on machine
  Normal   Created                53m (x5 over 56m)   kubelet, vm-sushil-scrum1-08062018-alluxio-1  Created container
  Normal   Started                53m (x5 over 56m)   kubelet, vm-sushil-scrum1-08062018-alluxio-1  Started container
  Warning  BackOff                1m (x222 over 55m)  kubelet, vm-sushil-scrum1-08062018-alluxio-1  Back-off restarting failed container

---------------------

[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# netstat -tuplena
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 10.194.11.7:5002        0.0.0.0:*               LISTEN      0          149989     26448/docker-proxy-
tcp        0      0 10.194.11.7:10250       0.0.0.0:*               LISTEN      0          104935     19283/kubelet
tcp        0      0 127.0.0.1:10251         0.0.0.0:*               LISTEN      0          100472     17920/kube-schedule
tcp        0      0 10.194.11.7:6380        0.0.0.0:*               LISTEN      99         155827     26659/redis-server
tcp        0      0 127.0.0.1:10252         0.0.0.0:*               LISTEN      0          97624      17764/kube-controll
tcp        0      0 10.194.11.7:8879        0.0.0.0:*               LISTEN      0          143624     23758/nginx: master
tcp        0      0 10.194.11.7:10255       0.0.0.0:*               LISTEN      0          107608     19283/kubelet
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      0          12481      1/systemd
tcp        0      0 127.0.0.1:10256         0.0.0.0:*               LISTEN      0          100461     17793/kube-proxy
tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN      0          143622     23758/nginx: master
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      0          156660     27123/dnsmasq
tcp        0      0 127.0.0.1:8085          0.0.0.0:*               LISTEN      0          97730      17857/kube-apiserve
tcp        0      0 127.0.0.1:60053         0.0.0.0:*               LISTEN      0          61130      11533/skydns
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      0          42466      7310/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      0          11047      1017/master
tcp        0      0 10.194.11.7:8443        0.0.0.0:*               LISTEN      0          99373      17857/kube-apiserve
tcp        0      0 10.194.11.7:16380       0.0.0.0:*               LISTEN      99         154397     26839/redis-sentine
tcp        0      0 0.0.0.0:19998           0.0.0.0:*               LISTEN      0          452707     22265/java
tcp        0      0 0.0.0.0:19999           0.0.0.0:*               LISTEN      0          451738     22265/java
tcp        0      0 10.194.11.7:49152       0.0.0.0:*               LISTEN      0          121927     22081/glusterfsd
tcp        0      0 0.0.0.0:9090            0.0.0.0:*               LISTEN      0          143623     23758/nginx: master
tcp        0      0 10.194.11.7:24007       0.0.0.0:*               LISTEN      0          114455     21997/glusterd
tcp        0      0 10.194.11.7:5000        0.0.0.0:*               LISTEN      0          143625     23758/nginx: master
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      0          102303     19283/kubelet
tcp        0      0 10.194.11.7:5001        0.0.0.0:*               LISTEN      0          157094     27208/docker-proxy-
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      0          100463     17793/kube-proxy
tcp        0      0 127.0.0.1:8085          127.0.0.1:51610         ESTABLISHED 0          174185     17857/kube-apiserve
tcp        0      0 127.0.0.1:49512         127.0.0.1:8085          ESTABLISHED 0          97736      17793/kube-proxy
tcp        0      0 10.194.11.7:51556       10.194.11.7:2379        ESTABLISHED 0          73319      13004/confd
tcp        0      0 127.0.0.1:49514         127.0.0.1:8085          ESTABLISHED 0          92137      17793/kube-proxy
tcp        0      0 10.194.11.7:52036       10.194.11.7:2379        ESTABLISHED 0          95747      17857/kube-apiserve
tcp        0      0 10.194.11.7:43206       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:43088       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:52046       10.194.11.7:2379        ESTABLISHED 0          92074      17857/kube-apiserve
tcp        0      0 127.0.0.1:57840         127.0.0.1:8085          ESTABLISHED 0          283917     17764/kube-controll
tcp        0      0 10.194.11.7:43254       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:43242       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:52012       10.194.11.7:2379        ESTABLISHED 0          98944      17857/kube-apiserve
tcp        0      0 10.194.11.7:8443        192.168.1.0:43213       ESTABLISHED 0          104055     17857/kube-apiserve
tcp        0      0 10.194.11.7:6380        10.194.11.7:35385       ESTABLISHED 99         154404     26659/redis-server
tcp        0      0 127.0.0.1:49658         127.0.0.1:8085          ESTABLISHED 0          100657     17764/kube-controll
tcp        0      0 127.0.0.1:41282         127.0.0.1:8085          ESTABLISHED 0          384379     17764/kube-controll
tcp        0      0 10.194.11.7:43120       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:8085          127.0.0.1:49610         ESTABLISHED 0          99469      17857/kube-apiserve
tcp        0      0 127.0.0.1:59714         127.0.0.1:8085          ESTABLISHED 0          599303     17764/kube-controll
tcp        0      0 127.0.0.1:46484         127.0.0.1:8085          ESTABLISHED 0          449507     17764/kube-controll
tcp        0      0 127.0.0.1:8085          127.0.0.1:57998         ESTABLISHED 0          583873     17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:49530         ESTABLISHED 0          92150      17857/kube-apiserve
tcp        0      0 10.194.11.7:51986       10.194.11.7:2379        ESTABLISHED 0          98941      17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:41282         ESTABLISHED 0          384381     17857/kube-apiserve
tcp        0      0 127.0.0.1:49580         127.0.0.1:8085          ESTABLISHED 0          95894      17764/kube-controll
tcp        0      0 10.194.11.7:34222       10.194.11.7:8443        ESTABLISHED 0          96516      17857/kube-apiserve
tcp        0      0 10.194.11.7:51546       10.194.11.7:2379        ESTABLISHED 0          72583      13004/confd
tcp        0      0 10.194.11.7:43202       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:44862         127.0.0.1:8085          ESTABLISHED 0          435921     17764/kube-controll
tcp        0      0 10.194.11.7:43364       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:8085          127.0.0.1:37618         ESTABLISHED 0          350305     17857/kube-apiserve
tcp        0      0 10.194.11.7:51566       10.194.11.7:2379        ESTABLISHED 0          75882      13004/confd
tcp        0      0 127.0.0.1:8085          127.0.0.1:37900         ESTABLISHED 0          346954     17857/kube-apiserve
tcp        0      0 10.194.11.7:43208       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:8085          127.0.0.1:45628         ESTABLISHED 0          439238     17857/kube-apiserve
tcp        0      0 10.194.11.7:52040       10.194.11.7:2379        ESTABLISHED 0          95749      17857/kube-apiserve
tcp        0      0 127.0.0.1:51610         127.0.0.1:8085          ESTABLISHED 0          168623     17764/kube-controll
tcp        0      0 10.194.11.7:52042       10.194.11.7:2379        ESTABLISHED 0          96495      17857/kube-apiserve
tcp        0      0 127.0.0.1:10252         127.0.0.1:52924         TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:37224         127.0.0.1:8085          ESTABLISHED 0          651407     17764/kube-controll
tcp        0      0 127.0.0.1:8085          127.0.0.1:35422         ESTABLISHED 0          634065     17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:49514         ESTABLISHED 0          92138      17857/kube-apiserve
tcp        0      0 10.194.11.7:43164       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:52038       10.194.11.7:2379        ESTABLISHED 0          95748      17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:37698         ESTABLISHED 0          347661     17857/kube-apiserve
tcp        0      0 127.0.0.1:49596         127.0.0.1:8085          ESTABLISHED 0          96571      17764/kube-controll
tcp        0      0 10.194.11.7:43390       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:43114       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:51998       10.194.11.7:2379        ESTABLISHED 0          99378      17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:59714         ESTABLISHED 0          599304     17857/kube-apiserve
tcp        0      0 10.194.11.7:43112       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:8085          127.0.0.1:43490         ESTABLISHED 0          423791     17857/kube-apiserve
tcp        0      0 10.194.11.7:52064       10.194.11.7:2379        ESTABLISHED 0          98958      17857/kube-apiserve
tcp        0      0 127.0.0.1:49524         127.0.0.1:8085          ESTABLISHED 0          100506     17920/kube-schedule
tcp        0      0 10.194.11.7:52022       10.194.11.7:2379        ESTABLISHED 0          93987      17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:49618         ESTABLISHED 0          100646     17857/kube-apiserve
tcp        0      0 127.0.0.1:53774         127.0.0.1:8085          ESTABLISHED 0          245656     17764/kube-controll
tcp        0      0 10.194.11.7:52024       10.194.11.7:2379        ESTABLISHED 0          93988      17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:46484         ESTABLISHED 0          454863     17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:49512         ESTABLISHED 0          97737      17857/kube-apiserve
tcp        0      0 127.0.0.1:8085          127.0.0.1:49606         ESTABLISHED 0          99467      17857/kube-apiserve
tcp        0      0 10.194.11.7:24007       10.194.11.7:49151       ESTABLISHED 0          119971     21997/glusterd
tcp        0      0 127.0.0.1:58966         127.0.0.1:8085          ESTABLISHED 0          297833     17764/kube-controll
tcp        0      0 10.194.11.7:43234       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:52074       10.194.11.7:2379        ESTABLISHED 0          92080      17857/kube-apiserve
tcp        0      0 10.194.11.7:51564       10.194.11.7:2379        ESTABLISHED 0          71013      13004/confd
tcp        0      0 127.0.0.1:33658         127.0.0.1:8085          TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:49602         127.0.0.1:8085          ESTABLISHED 0          101494     17764/kube-controll
tcp        0      0 127.0.0.1:40456         127.0.0.1:8085          TIME_WAIT   0          0          -    
tcp        0      0 10.194.11.7:43278       10.194.11.7:2379        TIME_WAIT   0          0          -    
tcp        0      0 127.0.0.1:8085          127.0.0.1:49658         ESTABLISH
