Skip to main content

SSH命令无法创建远程连接

· 7 min read
orange
programmer on jvm platform

今天线上出现了一个问题, 此问题的现象是用户在页面上点击创建VNC远程链接会等待很久且响应失败

原因

这个问题的原因是因为ssh命令执行超时, ssh命令超时的原因是因为每次ssh执行命令时需要进行用户认证, 系统中接入了外置认证系统ldap进行认证, 但由于访问外部认证服务很慢导致触发了超时.
认证服务对应lb会转发到出问题的ldap服务器, 从而导致认证服务访问很慢.

排查过程

1. 查看后端服务日志

后端日志如下

2023-03-09 10:34:37	2023-03-09T02:34:36.844168954Z stdout F 2023-03-09 02:34:36.844  INFO 1 --- [nio-2000-exec-6] c.f.c.s.c.LocalAuthenticationProvider    : into local DB authenticate, user is::foo
2023-03-09 10:34:40 2023-03-09T02:34:40.258742954Z stdout F 2023-03-09 02:34:40.258 INFO 1 --- [io-2000-exec-11] com.jcabi.ssh.Execution$Default : $ getent passwd foo | awk -F: '{print $6}'
2023-03-09 10:34:44 2023-03-09T02:34:44.382676144Z stdout F 2023-03-09 02:34:44.382 INFO 1 --- [io-2000-exec-11] com.jcabi.ssh.Execution$Default : $ sudo -i -u foo mkdir -p /fastone/users/foo/.foo
2023-03-09 10:34:49 2023-03-09T02:34:49.36267293Z stdout F 2023-03-09 02:34:49.362 INFO 1 --- [io-2000-exec-11] com.jcabi.ssh.Execution$Default : $ echo -e "foo\foo\nn\n" | sudo -i -u foo sh -c "VNC_HOME=/fastone/users/foo/.foo vncserver -nolisten unix -listen tcp -maxclients 512 2>&1"
2023-03-09 10:35:04 2023-03-09T02:35:04.409356857Z stdout F 2023-03-09 02:35:04.409 WARN 1 --- [io-2000-exec-11] com.jcabi.ssh.Ssh : an exception during authentication
2023-03-09 10:35:04 2023-03-09T02:35:04.409378051Z stdout F java.net.SocketTimeoutException: Read timed out
2023-03-09 10:35:04 2023-03-09T02:35:04.410988161Z stdout F 2023-03-09 02:35:04.410 ERROR 1 --- [io-2000-exec-11] c.f.common.ssh.client.SshClient : 命令执行失败!
2023-03-09 10:35:04 2023-03-09T02:35:04.411003386Z stdout F
2023-03-09 10:35:04 2023-03-09T02:35:04.411007586Z stdout F java.io.IOException: com.jcraft.jsch.JSchException: Auth fail
2023-03-09 10:35:04 2023-03-09T02:35:04.411011213Z stdout F at com.jcabi.ssh.Ssh.session(Ssh.java:281) ~[jcabi-ssh-1.6.1.jar:na]
2023-03-09 10:35:04 2023-03-09T02:35:04.411014796Z stdout F at com.jcabi.ssh.AbstractSshShell.exec(AbstractSshShell.java:95) ~[jcabi-ssh-1.6.1.jar:na]
2023-03-09 10:35:04 2023-03-09T02:35:04.41101828Z stdout F at com.jcabi.ssh.Ssh.exec(Ssh.java:77) ~[jcabi-ssh-1.6.1.jar:na]

在代码中, 我们使用了jcabi-ssh来执行ssh命令.
整个创建VNC远程连接的事务需要执行四条ssh命令, 前三条命令都执行成功了, 但是最后一条命令执行失败了
报错信息如下

java.net.SocketTimeoutException: Read timed out
java.io.IOException: com.jcraft.jsch.JSchException: Auth fail

2. 登录服务器并查看sshd日志

首先需要看下sshd的日志, 通过journalctl命令查看

sudo journalctl -t sshd

输出如下

Mar 09 02:34:38 desktop-1 sshd[11758]: Accepted publickey for fastone from 10.0.109.90 port 49590 ssh2: RSA SHA256:
sqOIF4lpjrA0+1At5dDQRZgcwWEK+oO1Lwv9fad
Mar 09 02:34:38 desktop-1 sshd[11758]: pam_unix(sshd:session): session opened for user fastone by (uid=0)
Mar 09 02:34:41 desktop-1 sshd[11758]: pam_unix(sshd:session): session closed for user fastone
Mar 09 02:34:42 desktop-1 sshd[11766]: Accepted publickey for fastone from 10.0.109.90 port 57700 ssh2: RSA SHA256:
sqOIF4lpjrA0+1At5dDQRZgcwWEK+oO1Lwv9fad
Mar 09 02:34:42 desktop-1 sshd[11766]: pam_unix(sshd:session): session opened for user fastone by (uid=0)
Mar 09 02:34:46 desktop-1 sshd[11766]: pam_unix(sshd:session): session closed for user fastone
Mar 09 02:34:47 desktop-1 sshd[11778]: Accepted publickey for fastone from 10.0.109.90 port 57710 ssh2: RSA SHA256:
sqOIF4lpjrA0+1At5dDQRZgcwWEK+oO1Lwv9fad
Mar 09 02:34:47 desktop-1 sshd[11778]: pam_unix(sshd:session): session opened for user fastone by (uid=0)
Mar 09 02:34:54 desktop-1 sshd[11778]: pam_unix(sshd:session): session closed for user fastone
Mar 09 02:35:04 desktop-1 sshd[11875]: error: Received disconnect from 10.0.109.90 port 41620:3:
com.jcraft.jsch.JSchException: Auth fail [preauth]
Mar 09 02:35:04 desktop-1 sshd[11875]: Disconnected from authenticating user fastone 10.0.109.90 port 41620 [preauth]

ssh日志中前三次是正常的, 但是第四次就报错了, 从报错信息来看, 是Auth fail的问题.
从日志上看发现每次执行命令的时间都很长, 有时候会超过10s, 所以我猜测是ssh连接超时了.

3. 查看nslcd日志

根据上述的日志得出是auth fail的问题, 那么大概率跟用户认证有关
目前系统中使用的认证服务是ldap, 并通过nslcd服务来连接ldap并提供认证服务的.
根据经验判断可能是系统中的nslcd服务问题. 接下来需要查看其日志

sudo service nslcd status

输出如下

● nslcd.service - LSB: LDAP connection daemon
Loaded: loaded (/etc/init.d/nslcd; generated)
Active: active (running) since Thu 2023-03-09 02:00:08 UTC; 58min ago
Docs: man:systemd-sysv-generator(8)
Process: 1360 ExecStart=/etc/init.d/nslcd start (code=exited, status=0/SUCCESS)
Tasks: 6 (limit: 4915)
CGroup: /system.slice/nslcd.service
└─1497 /usr/sbin/nslcd




Mar 09 02:53:36 desktop-1 nslcd[1497]: [de9570] <group/member="root"> ldap_result() failed: Can't contact LDAP server
Mar 09 02:53:51 desktop-1 nslcd[1497]: [39d9ef] <group/member="root"> ldap_result() failed: Can't contact LDAP server
Mar 09 02:53:51 desktop-1 nslcd[1497]: [2e5952] <passwd=9159> ldap_result() failed: Can't contact LDAP server
Mar 09 02:54:01 desktop-1 nslcd[1497]: [0d1fd6] <passwd=9159> ldap_result() failed: Can't contact LDAP server
Mar 09 02:54:57 desktop-1 nslcd[1497]: [bec5a1] <passwd=9159> ldap_result() failed: Can't contact LDAP server
Mar 09 02:54:57 desktop-1 nslcd[1497]: [44bbc8] <passwd=9159> ldap_result() failed: Can't contact LDAP server
Mar 09 02:55:34 desktop-1 nslcd[1497]: [f50bda] <passwd=9159> ldap_result() failed: Can't contact LDAP server
Mar 09 02:56:25 desktop-1 nslcd[1497]: [c5f72e] <group/member="fastone"> ldap_result() timed out
Mar 09 02:56:25 desktop-1 nslcd[1497]: [f644f7] <group/member="u18846424945"> ldap_result() timed out
Mar 09 02:56:25 desktop-1 nslcd[1497]: [4db6bf] <group/member="u18846424945"> ldap_result() timed out

从日志中可以看到, nslcd服务访问ldap超时了

4. 启用nscd服务从而workaround此问题

启用nscd服务, 从而workaround此问题, nscd服务会缓存nslcd服务的结果, 从而提升速度.
但是这个方案只是workaround而已, 后续需要运维同学来解决ldap的连接问题

sudo service nscd start

启用nscd服务后, 确认下速度是否有提升

time sudo getent passwd -s ldap
real0m1.326s
user0m0.023s
sys0m0.023s

备注

如何避免类似问题

服务侧

对于认证服务侧需要确保服务的稳定性

客户端侧

对于认证系统的客户端侧, 需要确保客户端的稳定, 例如增加缓存机制, 从而提升速度

参考