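One of our Oracle database servers (running Oracle Linux Server release 5.7) briefly became unreachable over ssh around noon today. While ssh was failing, ping to the server was normal, and a psping check of port 22 also looked normal (it only sent 5 probes; I did not run it continuously). I could still log in to the database with SQL Developer and perform any operation, and the DPA tool showed that CPU and other resource usage on the server was very low. (Having confirmed that the database services were all fine, I went out for lunch.) By the time I got back, a colleague reported that ssh was working again, so I had missed the best window for diagnosis. In the meantime, another colleague had also run some checks:
He found that ping was normal, but psping against port 8088 showed very high latency and even timeouts. He took a screenshot comparing the two.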
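ping uses ICMP, which is a network-layer protocol, so a successful ping only shows that the network is reachable at layer 3. Tomcat, by contrast, is an application-layer service reached over TCP, so its port can be slow or time out even while ping looks healthy.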
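To check a TCP port rather than ICMP, a rough stand-in for the psping test can be sketched in the shell. This is only a sketch under assumptions: `tcp_probe` is a hypothetical helper name, it relies on bash's `/dev/tcp` redirection and GNU `date +%s%N`, and the reported time includes the cost of starting a child bash, so it is only good for spotting delays of hundreds of milliseconds or outright timeouts.

```shell
# Hypothetical helper: time a single TCP connect to host:port.
# Assumes bash (for /dev/tcp), coreutils timeout, and GNU date (%N).
tcp_probe() {
    local host=$1 port=$2 start end
    start=$(date +%s%N)
    # The child bash opens the socket and exits; timeout caps the wait at 3s.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        end=$(date +%s%N)
        echo "open $(( (end - start) / 1000000 ))ms"
    else
        # Connection refused, timed out, or otherwise not reachable.
        echo "failed"
        return 1
    fi
}
```

Running something like `tcp_probe mylnx01 22` in a loop next to a plain `ping` would show whether the TCP path degrades while ICMP stays healthy.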
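After lunch, ssh logins to the server were working normally again. Checking the sshd process showed that it had been running for more than two hundred days, which means the sshd service had neither died nor been restarted.
I used ps -ef | grep sshd to find the sshd process ID, then ran the following command: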
[root@mylnx01 ~]# ps -eo pid,lstart,etime | grep 3423
3423 Sun Feb 18 13:56:11 2018 234-09:01:48
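The etime column uses the format [[dd-]hh:]mm:ss. If you want to compare the uptime against an incident time in a script, it helps to turn it into seconds. A minimal sketch, where `etime_to_seconds` is a hypothetical helper name of my own:

```shell
# Hypothetical helper: convert a ps etime value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
    echo "$1" | awk '{
        days = 0
        t = $1
        # Split off the optional "dd-" day prefix.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3) secs = p[1] * 3600 + p[2] * 60 + p[3]   # hh:mm:ss
        else        secs = p[1] * 60 + p[2]                 # mm:ss
        print days * 86400 + secs
    }'
}
```

For the value above, `etime_to_seconds 234-09:01:48` prints 20250108, far longer than the few hours since the incident, which matches the conclusion that sshd was never restarted.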
Checking the log, I found several "Did not receive identification string from xxx" messages (some details have been redacted).
[root@mylnx01 log]# tail -100 /var/log/secure
Oct 8 14:50:48 mylnx01 sshd[4341]: pam_unix(sshd:session): session opened for user oracle by (uid=0)
Oct 8 14:50:49 mylnx01 sshd[4341]: pam_unix(sshd:session): session closed for user oracle
Oct 10 12:26:41 mylnx01 sshd[742]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[743]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[790]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[789]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[745]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[744]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[1007]: Connection closed by 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[1006]: Connection closed by 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[746]: Did not receive identification string from 192.168.xxx.xxx
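To see whether these messages cluster around the outage, they can be counted per minute. The following is a sketch only: `count_no_ident` is a hypothetical helper name, and it assumes the standard syslog line layout shown above (month, day, time as the first three fields).

```shell
# Hypothetical helper: read syslog-style lines on stdin and print
# "<count> <Mon> <day> <HH:MM>" for each minute that contains the message,
# in order of first appearance.
count_no_ident() {
    awk '/Did not receive identification string/ {
        key = $1 " " $2 " " substr($3, 1, 5)   # e.g. "Oct 10 12:26"
        if (!(key in seen)) order[++n] = key
        seen[key]++
    }
    END {
        for (i = 1; i <= n; i++) print seen[order[i]], order[i]
    }'
}
```

Usage: `count_no_ident < /var/log/secure`. For the excerpt shown above, all seven messages fall in the same minute, 12:26 on Oct 10.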
I looked up this error, and the usual explanation is:
This one below means ssh server waited and did not receive what it needed in a timely fashion. This is typically due to connectivity issues. In an ssh connection, the server first provides its identification string, then waits for the client to then provide its identification string. If there is a loss in connection, or the client just bails, this is what you will see in the logs.
If someone uses telnet or netcat to fetch your ssh banner, or other various scans, the logs on the server side will show this as well.
In short, the message means sshd waited for the client's identification string and did not receive it in time, which usually points to a connectivity problem. A connection that drops, or a client that quits right after connecting, produces exactly the entries seen in the log.
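The handshake order described above (server sends its identification string first, then waits for the client's) can be observed from a test client. This is a sketch under assumptions: both function names are hypothetical, and `fetch_ssh_banner` relies on bash's `/dev/tcp` redirection. Because it reads the banner and closes without sending anything, running it against a server should, per the explanation quoted above, leave a "Did not receive identification string" entry in that server's log.

```shell
# Hypothetical helpers; the names are illustrative only.

# Succeed if the line looks like an SSH identification string,
# i.e. "SSH-<protoversion>-<softwareversion>...".
is_ssh_ident() {
    case "$1" in
        SSH-[0-9]*-?*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Connect, read the server's first line, then close without ever
# sending a client identification string. Requires bash for /dev/tcp.
fetch_ssh_banner() {
    local host=$1 port=${2:-22} banner
    exec 3<>"/dev/tcp/$host/$port" || return 1
    IFS= read -r -t 5 banner <&3      # wait at most 5s for the banner
    exec 3<&-                         # hang up without replying
    printf '%s\n' "$banner" | tr -d '\r'
}
```

For example, `is_ssh_ident "$(fetch_ssh_banner mylnx01)" && echo "sshd answered"` would confirm that the daemon is answering at the protocol level even when interactive logins are failing.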
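I suspect a routing problem, but I have no solid network monitoring data to back that up. There is some circumstantial evidence, though: the network between the two sites has had quite a few problems recently, and just the day before yesterday we saw fairly serious packet loss. The network administrator raised it with the provider, but I don't know what came of it, since this area is not my responsibility.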