You might retest with NFSv3; the code handling v3 should be significantly different, since v3 is stateless and does not maintain long-term connections.

And if the long-term connection had some sort of issue, then 45 seconds may be how long it takes to figure that out and re-initiate the connection.

I know that around 2004 NFSv3 over TCP had some bugs: with more than 250 or so connections, the server harvested TCP connections, and the client never seemed to realize a connection was gone and never recreated it. It all worked perfectly well below 250 clients or so. I know this because we expanded an NFS setup from 240 nodes (which had run for months) to 270 or so, and after that v3/tcp never worked right; tcpdumps and other information showed that the server was harvesting the "unused" connections once it had too many, and the client was never handling it.

It could be that NFSv4 with persistent connections is creating a connection with a new dir/file access, and eventually you hit the magic limit, at which point NFS reconnections need to happen.

Running sar -n NFSD on the server, sar -n SOCK and sar -n SOCK6 on both client and server, and sar -n NFS on a client might show something abnormal during the issue.
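
As a complement to the sar counters, here is a rough sketch of how one might watch the number of established TCP connections involving the NFS port, to see whether it approaches some limit around the time of a stall. It assumes the standard Linux /proc/net/tcp text layout and the usual NFS port 2049; the function name count_nfs_connections is made up for this example.

```python
NFS_PORT = 2049          # assumption: NFS is on its usual port
TCP_ESTABLISHED = 0x01   # state code used in /proc/net/tcp


def count_nfs_connections(proc_net_tcp_text, port=NFS_PORT):
    """Count ESTABLISHED sockets whose local or remote port is `port`.

    `proc_net_tcp_text` is the full text of /proc/net/tcp (or tcp6).
    """
    count = 0
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        # Addresses look like HEXADDR:HEXPORT; only the port matters here.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        state = int(fields[3], 16)
        if state == TCP_ESTABLISHED and port in (local_port, remote_port):
            count += 1
    return count
```

On a live system the text would come from reading /proc/net/tcp (and /proc/net/tcp6 for IPv6), sampled every few seconds on both the server and a client.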

On Sat, Oct 2, 2021 at 12:29 PM Roger Heflin <rogerheflin@gmail.com> wrote:
What did the sar -d look like for the 2 minutes before and 2 minutes afterward?

Whether it is slow may depend on whether the directory/file fell out of cache and had to be reread from the disk.

I have also seen really large dirs take a really long time to search, but that typically takes thousands of files in a dir. If you do ls -ld <dirname> you will see how big the dir is; a really big dir can be slow under some conditions, but usually not 45 seconds.
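
For checking many directories at once, the same information that ls -ld gives can be gathered in a few lines. This is only a sketch; the helper names dir_size_bytes and count_entries are invented for the example.

```python
import os


def dir_size_bytes(path):
    """Size of the directory itself, as shown by `ls -ld` (not its contents)."""
    return os.stat(path).st_size


def count_entries(path):
    """Number of entries directly inside the directory."""
    with os.scandir(path) as entries:
        return sum(1 for _ in entries)
```

A directory whose own size or entry count is far larger than its neighbours would be a candidate for the slow lookups.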

On Sat, Oct 2, 2021 at 12:00 PM Terry Barnaby <terry1@beam.ltd.uk> wrote:

I am getting more sure this is an NFS/networking issue rather than an issue with disks in the server.

I created a small test program that, given a directory, finds a random file in a random directory three levels below, opens it and reads up to a block (512 bytes) of data from it. It times how long it took to find the file (opendir/readdir) and to read the block, and prints the results if the time is greater than previous ones (so showing the peak times). This is repeated every 10 seconds. The first value is the average time to find the file (there may not be a file three levels down, so it repeats those searches until it finds one that the user can access), the second is the time it took to find a file that existed (3 x opendir/readdir), and the last is how long it took to open, read and close the file.
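
For anyone wanting to reproduce this, a probe along the lines described could look roughly like the sketch below. It is not the actual test program: the names pick_random_file, probe and run are invented, os.scandir stands in for opendir/readdir, and "greater than previous ones" is assumed to mean that any of the three times exceeds its earlier peak.

```python
import os
import random
import time

BLOCK_SIZE = 512  # bytes read from the chosen file, as in the description


def pick_random_file(root, depth=3):
    """Walk `depth` random directory levels below root and return a random
    readable regular file found there, or None if this attempt dead-ends."""
    current = root
    try:
        for _ in range(depth):
            with os.scandir(current) as entries:
                subdirs = [e.path for e in entries
                           if e.is_dir(follow_symlinks=False)]
            if not subdirs:
                return None
            current = random.choice(subdirs)
        with os.scandir(current) as entries:
            files = [e.path for e in entries
                     if e.is_file(follow_symlinks=False)]
    except OSError:
        return None
    readable = [f for f in files if os.access(f, os.R_OK)]
    return random.choice(readable) if readable else None


def probe(root, depth=3, max_attempts=1000):
    """Return (avg_find_s, last_find_s, read_s, path), or None if no file
    was found within max_attempts (a limit assumed for this sketch)."""
    find_times = []
    path = None
    for _ in range(max_attempts):
        start = time.monotonic()
        path = pick_random_file(root, depth)
        find_times.append(time.monotonic() - start)
        if path is not None:
            break
    if path is None:
        return None
    start = time.monotonic()
    with open(path, "rb") as fh:
        fh.read(BLOCK_SIZE)
    read_time = time.monotonic() - start
    return sum(find_times) / len(find_times), find_times[-1], read_time, path


def run(root, interval=10.0, iterations=None):
    """Probe repeatedly and print a line whenever a new peak time is seen."""
    peaks = [0.0, 0.0, 0.0]
    count = 0
    while iterations is None or count < iterations:
        result = probe(root)
        if result is not None:
            times, path = result[:3], result[3]
            if any(t > p for t, p in zip(times, peaks)):
                peaks = [max(t, p) for t, p in zip(times, peaks)]
                stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
                print(stamp, *("%12.6f" % t for t in times), path)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
    return peaks
```

Calling run("/home") on the server and on an NFS client at the same time would give two peak logs that can be compared side by side.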

I set one of these processes running on the server starting at the /home dir and did the same on one of my clients that has /home NFS V4 mounted with defaults + async.

The server after 12 hours had peak timings of (file paths hidden):

2021-10-02T09:26:38     0.008858     0.043513     0.031735 /home/...
2021-10-02T09:26:58     0.005384     0.050870     0.039186 /home/...
2021-10-02T09:38:09     0.006684     0.081707     0.014616 /home/...
2021-10-02T10:18:42     0.037394     0.144025     0.012603 /home/...

The client had timings of:

2021-10-02T08:48:45     0.056195     0.110149     0.019353 /home/...
2021-10-02T09:06:31     0.098647     0.098647     0.015171 /home/...
2021-10-02T09:28:38     1.060605     0.001996     0.000422 /home/...
2021-10-02T09:31:28     4.896196     2.037488     0.000836 /home/...
2021-10-02T11:48:44     4.423502     7.087917     1.111684 /home/...
2021-10-02T11:51:02    27.711746    45.646627     0.021321 /home/...

So at one point the NFS-mounted client took 45 seconds to find a file (opendir/readdir 3 times), and once before that 7.08 seconds, with 1.1 seconds to read a block. The actual file it accessed is 46819 bytes long and can normally be accessed/copied etc. quickly.
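
To line these stalls up against sar or tcpdump output, the slow samples can be pulled out of the log by timestamp. The sketch below uses the client timings from this message; the function names and the 1-second threshold are arbitrary choices for illustration.

```python
CLIENT_LOG = """\
2021-10-02T08:48:45     0.056195     0.110149     0.019353 /home/...
2021-10-02T09:06:31     0.098647     0.098647     0.015171 /home/...
2021-10-02T09:28:38     1.060605     0.001996     0.000422 /home/...
2021-10-02T09:31:28     4.896196     2.037488     0.000836 /home/...
2021-10-02T11:48:44     4.423502     7.087917     1.111684 /home/...
2021-10-02T11:51:02    27.711746    45.646627     0.021321 /home/...
"""


def parse_line(line):
    """Split one log line into (timestamp, avg_find, find, read, path)."""
    stamp, avg_find, find, read, path = line.split(None, 4)
    return stamp, float(avg_find), float(find), float(read), path


def slow_samples(lines, threshold=1.0):
    """Return the parsed samples where any of the three times exceeds
    `threshold` seconds."""
    samples = [parse_line(line) for line in lines if line.strip()]
    return [s for s in samples if max(s[1:4]) > threshold]
```

The timestamps returned by slow_samples(CLIENT_LOG.splitlines()) are the moments worth inspecting in the other logs.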

"sar -d" reported no issues.

"mountstats /home" reported no issues

"/var/log/messages" in both systems reported no issues.

Generally the desktop system has been responsive all day (no other users and nothing obvious going on on either server or client) and I have not noticed a "lockup" on the GUI I have been using (intermittently). No noticeable network errors, no noticeable hard disk read issues, but occasional very long NFS opendir/readdir, which would match up with when I see the desktop lock up for around 30 secs or more.
