Hi,
We use torque from epel6 on our production cluster. Our cluster is part of a large set of clusters that use packages from the EPEL repository: https://twiki.cern.ch/twiki/bin/view/EMI/GenericInstallationConfigurationEMI3
Everything worked fine until torque was updated to torque-4.2.10-9.el6 with NUMA enabled: https://bugzilla.redhat.com/show_bug.cgi?id=1231148
The updated package doesn't start at all. I filed a bug against torque: https://bugzilla.redhat.com/show_bug.cgi?id=1321154
It was suggested there to remove NUMA support and to build a separate set of packages with NUMA enabled.
But the maintainer's solution was to add a file that is required for starting the pbs_mom service. Simply installing the torque-4.2.10-10.el6 update still doesn't work (the service starts, but the nodes are down). This solution requires additional reconfiguration of pbs_server, but we can't run such experiments on our cluster because there is no guarantee that everything will work as expected after reconfiguration.
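For context, a NUMA-enabled pbs_mom expects a layout file, and pbs_server needs matching node attributes. A rough sketch follows, using torque's standard file locations; the node name and board counts here are purely illustrative, not taken from the cluster in question:

```
# mom_priv/mom.layout on each compute node -- one line per NUMA board;
# without this file a NUMA-enabled pbs_mom will not start
nodes=0
nodes=1

# corresponding entry in server_priv/nodes on the pbs_server host
# (num_node_boards should match the number of mom.layout lines)
node01 np=16 num_node_boards=2
```

This is exactly the kind of coordinated two-sided change that is hard to roll out safely as an in-place update on a production cluster.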
The EPEL update policy recommends avoiding updates that cause such problems: https://fedoraproject.org/wiki/EPEL_Updates_Policy#Stable_Releases
To whom it may concern,
Several other people have commented on the bug about how adding the appropriate lines will solve the issue Alexey is having. I understand the need to have an "update friendly" experience from EPEL, as it supports software on RHEL. However, Alexey is ignoring a large part of the torque community that wants NUMA support enabled. I would consider not having NUMA support enabled to be one of the "Serious bugs that cannot be fixed in the existing version" of torque, as I doubt you can buy a laptop without some NUMA in the CPU. This support is critical for HPC applications to take advantage of, and they need to be able to lock adjacent processes to adjacent CPUs on modern NUMA systems.
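As an aside, the kind of pinning described here can be sketched with stock Linux tools (`numactl` from the numactl package, `taskset` from util-linux); the CPU and node numbers below are illustrative, and `./my_hpc_app` is a placeholder:

```shell
# Bind a job to the CPUs and memory of NUMA node 0 so its threads
# stay adjacent to the memory they allocate:
#   numactl --cpunodebind=0 --membind=0 ./my_hpc_app
# A plain CPU-affinity pin with taskset works similarly:
taskset -c 0 sh -c 'echo pinned'
```

A NUMA-aware resource manager does this binding automatically per job, which is what the NUMA-enabled torque build provides.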
This does, however, bring to mind something that's been growing in my head for a while now. I think we need a Fedora HPC SIG to help communicate the needs of the HPC community to Fedora (and then hopefully Red Hat) about how HPC clusters work and what kinds of support we need when it comes to the software we run. The EPEL Guidelines Digest (https://fedoraproject.org/wiki/EPEL/GuidelinesAndPolicies#Digest) isn't wrong; the scope is just different for HPC systems and specifically the HPC software we want. The scope for an HPC system is focused on the life cycle of the cluster we purchase, not the life cycle of the RHEL version of the OS. Furthermore, the things we need updated when we bring a new cluster online are the newest compiler, resource management, MPI, and cluster management software, as the software we run on HPC systems often requires the newest of those things to take advantage of the newest features in the hardware we just purchased. Others, like Alexey, have larger community-supported software stacks they have to integrate on site with whatever hardware they are tasked with using; the changes for that life cycle are based on what the larger community needs. Alexey will do updates when the community documentation has changed and he's required to update to continue to be a part of the community.
My argument for having an HPC SIG is to bridge the communication gaps between HPC workflows (mentioned above) and external software like the OpenHPC project (http://www.openhpc.community/) and EasyBuild (https://github.com/hpcugent/easybuild), since these projects focus on building software on RHEL systems but don't follow any of the guidelines we have in place, and thus produce substandard RPMs that most HPC administrators have to deal with. There are some very important features in these projects that the Fedora package manager should try to support. Alexey isn't wrong in his assessment of the situation; he's just in a different part of his cluster's life cycle. If he bought a cluster tomorrow to replace his current one, he might feel differently about the NUMA support (I don't know). However, other users have bought clusters and are more than likely taking advantage of the NUMA support to gain performance on their systems.
Just my thoughts...
Thanks, - David Brown
On 4/13/16, 11:00 AM, "alekcejk@googlemail.com" <alekcejk@googlemail.com> wrote:
[...]
-- Alexey Kurov nucleo@fedoraproject.org
On Wednesday, 13 April 2016 at 22:15, Brown, David M JR wrote: [...]
This does however bring to mind something that's been growing in my head for a while now. I think we need a Fedora HPC SIG to help communicate the needs of the HPC community to Fedora (and then hopefully Redhat) about how HPC clusters work and what kinds of support we need when it comes to the software we run.
[...]
+1, though it seems to overlap a bit with the SciTech SIG, which focuses on packaging scientific software for Fedora/EPEL.
I'd be interested in joining the HPC SIG as maintainer of several packages which are meant to be used on HPC clusters.
Regards, Dominik
[...]
Agreed; this would either be a clustered version of the Server SIG, or maybe add a more operational, cluster-focused dimension to the SciTech SIG, or somehow be a combination of both.
Thanks, - David Brown
epel-devel@lists.fedoraproject.org