FAQ

Over time, we have discovered a few problems that many users experience. The solutions to these common problems are posted here.

tip

Use the search feature to find the best answer! Your question might be answered on another page.

Couldn't find an answer here or via search? Check our support options.

Connection Questions

Why does MobaXterm throw an `Authorisation not recognized` error when I try to log into Palmetto?

This error looks like this:

/usr/bin/xauth:  error in locking authority file /home/username/.Xauthority
MoTTY X11 proxy: Authorisation not recognized

This usually happens when your home directory is full. Log into Palmetto with another terminal application (for example, PuTTY) and free up some space in your home directory. MobaXterm should then work normally.
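To see what is taking up space once you are logged in, a helper along these lines may be useful. This is only a sketch: the function name `largest_entries` is made up for illustration, and it assumes the GNU versions of `du` and `sort` that are standard on Linux systems.

```shell
#!/bin/bash
# List the largest files and directories directly under a directory,
# sorted so the biggest entries come last.
# Assumes GNU du and sort; largest_entries is a hypothetical helper name.
largest_entries() {
    local dir="${1:-$HOME}" count="${2:-10}"
    # -a: include files as well as directories
    # --max-depth=1: report only one level below $dir (plus $dir itself)
    du -ah --max-depth=1 "$dir" 2>/dev/null | sort -h | tail -n "$count"
}

# Example: show the 10 largest items in your home directory.
# largest_entries "$HOME" 10
```

The listing includes a line for the directory itself, which shows its total size.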

Why do I see the notice `REMOTE HOST IDENTIFICATION HAS CHANGED` when connecting to Palmetto?

If you are a Mac user, and try to ssh into Palmetto from the terminal, you might get this error message:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:Z2WLGvz7vX2t9VPap6ITwS3cBlCafN69FoIm8wmmF6g.
Please contact your system administrator.
Add correct host key in /Users/abcd/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Users/abcd/.ssh/known_hosts:5
RSA host key for login.palmetto.clemson.edu has changed and you have requested strict checking.
Host key verification failed.

This is a common problem, and it's easy to fix. Find the line that begins with "Offending RSA key" in the error message. In the example above, the offending key is on line 5. This means that the 5th line must be removed from the list of known SSH hosts; it will be recreated the next time you connect.

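There are several ways to fix the error. The first is to run the following command in a terminal:

sed -i.bak '5d' ~/.ssh/known_hosts

(Replace the number 5 with the line number of the offending RSA key from your error message. The -i.bak form works with both the macOS and Linux versions of sed, and it keeps a backup copy of the original file as ~/.ssh/known_hosts.bak.)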

Alternatively, you can run:

perl -pi -e 's/\Q$_// if ($. == 5);' ~/.ssh/known_hosts

(Again, replace the number 5 with the line number of the offending RSA key.)
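OpenSSH also provides a built-in way to drop stored keys by host name instead of by line number: the -R option of ssh-keygen. The sketch below wraps it in a small function; the name `forget_host` is made up for illustration. Note that this removes every stored key for the given host, and ssh-keygen keeps the previous contents in a file with an .old suffix.

```shell
#!/bin/bash
# Remove all stored keys for a host from a known_hosts file.
# forget_host is a hypothetical helper name; ssh-keygen -R and -f are
# standard OpenSSH options.
forget_host() {
    local host="$1" file="${2:-$HOME/.ssh/known_hosts}"
    ssh-keygen -R "$host" -f "$file"
}

# Example, using the host name from the error message above:
# forget_host login.palmetto.clemson.edu
```

The next time you connect, ssh will ask you to confirm the new host key, just as it did on your first connection.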

Compute Questions

How do I perform a fine-grained check for resource availability?

The following answer applies to Palmetto 1 and the PBS job scheduler. For Palmetto 2, see our main documentation.

Until recently, whatsfree was the standard tool for identifying available computing resources on Palmetto. However, with the addition of compute nodes that have many cores, large amounts of memory, and multiple GPU cards, we have seen nodes that use only a portion of their cores, memory, or GPUs but are still reported as unavailable by whatsfree.

To address this issue and improve utilization on Palmetto, a new tool called freeres (short for "free resources") is now available. It provides a more fine-grained view of the individual nodes in a specific phase. For example, at 15:12 on November 19, 2021, whatsfree showed no available nodes on phases 18b, 18c, and 19a:

[lngo@login001 ~]$ whatsfree

C2 CLUSTER (newest nodes with interconnect=HDR except for phase19b)
PHASE 18b  TOTAL =  65  FREE =   0  OFFLINE =   0  TYPE = Dell   R740    Intel Xeon  6148G,     40 cores, 372GB, HDR, 25ge, V100
PHASE 18c  TOTAL =  10  FREE =   0  OFFLINE =   0  TYPE = Dell   R740    Intel Xeon  6148G,     40 cores, 748GB, HDR, 25ge, V100
PHASE 19a  TOTAL =  28  FREE =   0  OFFLINE =   0  TYPE = Dell   R740    Intel Xeon  6248G,     40 cores, 372GB, HDR, 25ge, V100
PHASE 19b  TOTAL =   4  FREE =   1  OFFLINE =   0  TYPE = HPE    XL170   Intel Xeon  6252G,     48 cores, 372GB,      10ge
PHASE 20   TOTAL =  22  FREE =   0  OFFLINE =   0  TYPE = Dell   R740    Intel Xeon  6238R,     56 cores, 372GB, HDR, 25ge, V100S

However, freeres shows that many nodes on these phases still have a significant number of free cores or a significant amount of free memory:

[lngo@login001 ~]$ freeres phase18b phase18c phase19a
group file = /software/caci/cluster/phase18b
group file = /software/caci/cluster/phase18c
group file = /software/caci/cluster/phase19a
                 CPU       |       GPU       |   Memory (GB)   |
Node       Avail Used Free | Avail Used Free | Avail Used Free | State
---------------------------------------------------------------------------
node0060    40    12    28     2     2     0   376   372     4   free
node0152    40    26    14     2     2     0   376   173   203   free
node0064    40    12    28     2     2     0   376   372     4   free
node0084    40     8    32     2     2     0   376   372     4   free
node0059    40    30    10     2     2     0   376   173   203   free
node0096    40    32     8     2     2     0   376   184   192   free
node1302    40     3    37     2     0     2   376   372     4   free
node0123    40     8    32     2     2     0   376   372     4   free
node1228    40     8    32     2     2     0   376   372     4   free
node0146    40     8    32     2     2     0   376   372     4   free
node0131    40     8    32     2     2     0   376   372     4   free
node0148    40    37     3     2     2     0   376   304    72   free
node0156    40     8    32     2     2     0   376   372     4   free
node0181    40    33     7     2     2     0   376   358    18   free
node1523    40     8    32     2     2     0   376   372     4   free
node0209    40    32     8     2     2     0   376   258   118   free
node0201    40    12    28     2     2     0   376   372     4   free
node0045    40    36     4     2     2     0   754   240   514   free
node0176    40    30    10     2     2     0   376   244   132   free
node0246    40    12    28     2     2     0   376   372     4   free
node0183    40     8    32     2     2     0   376   372     4   free
node0153    40    32     8     2     2     0   754   270   484   free
node0766    40     8    32     2     2     0   376   372     4   free
node0219    40     8    32     2     2     0   376   372     4   free
node1266    40     8    32     2     2     0   376   372     4   free
node0032    40     8    32     2     2     0   376    44   332   free
node1397    40    32     8     2     2     0   376   264   112   free
node0047    40    16    24     2     2     0   376   372     4   free
node1415    40    36     4     2     1     1   376   270   106   free
node1438    40     8    32     2     2     0   376   372     4   free
node1459    40    26    14     2     2     0   376   174   202   free
node1462    40     8    32     2     2     0   376   372     4   free
node0079    40    32     8     2     2     0   376   128   248   free
node1414    40     8    32     2     2     0   376   372     4   free
node0089    40     8    32     2     2     0   376   372     4   free
node0136    40     8    32     2     2     0   376   372     4   free
node1521    40    32     8     2     1     1   376   214   162   free
node0020    40     8    32     2     2     0   376   372     4   free
node0040    40     8    32     2     2     0   376   372     4   free
node0058    40     8    32     2     2     0   376   372     4   free
node0137    40    32     8     2     2     0   376   128   248   free
node0119    40     8    32     2     2     0   376   372     4   free
node0130    40     8    32     2     2     0   376   372     4   free
checked 103 nodes in 0.42 Seconds

Using freeres, you can request portions of nodes that fit into these available resources. For example, the following qsub command, which specifies interconnect=25ge to target these nodes, was allocated in less than a minute.

[lngo@login001 ~]$ qsub -I -l select=6:ncpus=12:mem=160gb:interconnect=25ge
qsub (Warning): Interactive jobs will be treated as not rerunnable
qsub: waiting for job 4049542.pbs02 to start
qsub: job 4049542.pbs02 ready

[lngo@node0152 ~]$
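Scanning a long freeres listing by eye can be tedious, so the output can be filtered with awk. The following is a sketch only: the function name `nodes_with_room` is made up for illustration, and the column positions are assumed from the listing shown above (they may differ if the tool's output format changes).

```shell
#!/bin/bash
# Filter freeres-style output (read from stdin) for nodes that have at least
# MIN_CPUS free cores and MIN_MEM_GB free memory.
# Column layout assumed from the listing above:
#   $1 node name, $4 free CPUs, $7 free GPUs, $10 free memory (GB), $11 state
# nodes_with_room is a hypothetical helper name.
nodes_with_room() {
    local min_cpus="$1" min_mem_gb="$2"
    awk -v c="$min_cpus" -v m="$min_mem_gb" \
        '$1 ~ /^node/ && NF >= 11 && $4 >= c && $10 >= m { print $1, $4, $10 }'
}

# Example: nodes that can fit a chunk with ncpus=12 and mem=160gb.
# freeres phase18b phase18c phase19a | nodes_with_room 12 160
```

Each output line gives the node name, its free cores, and its free memory in GB. Applied to the listing above with 12 cores and 160 GB, the filter would select node0152, node0032, and node1459.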

Why does my program crash on login node with the message `Killed`?

When running commands or editing files on the login node, users may notice that their processes end abruptly with the error message Killed. Processes with names such as a.out, matlab, etc., are automatically killed on the login node because they may consume excessive computational resources. Unfortunately, this also means that benign processes, such as editing a file with the word matlab in its name, may also be killed.

Solution: Request an interactive session on a compute node (salloc), and then run the application/command.
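As a rough illustration, the helper below prints an salloc command line without running it, so you can review the request before submitting. The function name `salloc_cmd` and the resource values are placeholders chosen for this example; the flags are standard Slurm options, but the right values and any additional site-specific options depend on your workload.

```shell
#!/bin/bash
# Print (do not run) an salloc command for a small interactive session.
# salloc_cmd is a hypothetical helper; the defaults are placeholder values.
salloc_cmd() {
    local cpus="${1:-2}" mem="${2:-4G}" walltime="${3:-01:00:00}"
    printf 'salloc --nodes=1 --ntasks=1 --cpus-per-task=%s --mem=%s --time=%s\n' \
        "$cpus" "$mem" "$walltime"
}

# Example: 4 cores, 8 GB of memory, 2 hours.
# salloc_cmd 4 8G 02:00:00
```

Copy the printed command into your terminal on the login node. Once the allocation starts, run your application from the shell on the compute node.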

Storage Questions

How do I check how much of my `/project` storage allocation I am currently using?

There is a command you can use on the login nodes to check your quota usage.

For more details, please see the Indigo quota usage instructions.

Why does my folder not show up when I do `ls /project` or `ls /scratch`?

We recently introduced an autofs system that automatically mounts the user directory when it is accessed, and unmounts it after 5 minutes of inactivity. This feature increases the robustness of our file system, and will greatly decrease the visual clutter (especially important if you are accessing Palmetto through a graphical interface).

Due to the automatic mounting, you will not initially see your folder in /project or /scratch, and tab completion won't work. However, the folder is still there. You can cd directly into it, and you will not have any issues:

cd /project/tiger/football
cd /scratch/tiger

After you access it, you will see it when you do ls /project or ls /scratch. If there is no read or write activity on the directory for 5 minutes, it will be unmounted, so next time you will have to cd into it again in order to see it.
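The same behavior matters in scripts: a check that lists /project or /scratch and looks for your folder may wrongly conclude that it is missing. A safer pattern is to try entering the full path directly, as sketched below. The function name `can_enter` is made up for illustration.

```shell
#!/bin/bash
# Return success if DIR can be entered. Entering the directory directly
# (rather than listing its parent) mirrors the cd approach described above.
# can_enter is a hypothetical helper name.
can_enter() {
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$1" ) >/dev/null 2>&1
}

# Example with the path used above:
# if can_enter /scratch/tiger; then ls /scratch/tiger; fi
```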

Why are my home or scratch directories sluggish/unresponsive?

The /home and /scratch directories can become slow or unresponsive when one or more users read or write large amounts of data in these directories. When this happens, all users are affected, because these filesystems are shared by all nodes of the cluster.

To avoid this issue, you can use /local_scratch. Unlike /home or the /scratch directories, which are shared by all nodes, each node has its own /local_scratch directory. It is much faster to read/write data to /local_scratch, and doing so will not affect other users. For an example, see the local scratch documentation.

danger

However, note that /local_scratch is purged immediately when the job ends, so you must copy your final data back to /scratch or /home at the end of your job to avoid data loss.
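The usual pattern is to copy inputs to local scratch, run there, and copy everything back before the job ends. The function below is a minimal sketch of that pattern, not the documented Palmetto workflow: the name `stage_and_run` is made up for illustration, and the working directory you pass (for example, a job-specific folder under /local_scratch) is an assumption you should adapt to your job.

```shell
#!/bin/bash
# stage_and_run SRC_DIR DEST_DIR WORK_DIR CMD [ARGS...]
# Copy inputs from SRC_DIR to WORK_DIR, run CMD inside WORK_DIR, then copy
# the contents of WORK_DIR (inputs and results) to DEST_DIR.
# stage_and_run is a hypothetical helper name.
stage_and_run() {
    local src="$1" dest="$2" work="$3"
    shift 3
    mkdir -p "$work" "$dest" || return 1
    # "dir/." copies the directory contents, including hidden files.
    cp -r "$src"/. "$work"/ || return 1
    ( cd "$work" && "$@" )
    local rc=$?
    # Copy results back even if the command failed, so partial output survives.
    cp -r "$work"/. "$dest"/ || return 1
    return "$rc"
}

# Example inside a job script (all paths and the program name are placeholders):
# stage_and_run /scratch/tiger/input /scratch/tiger/output /local_scratch/myrun ./my_program
```

The function returns the exit status of the command it ran, so a job script can still detect failures after the results have been copied back.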

Why can't I access `/software`?

Typically losing access to /software means you have an Active Directory lockout on your account. You should call the CCIT helpdesk or drop by the support center in the library so they can help remove the lock and troubleshoot why it existed. A common cause is having a device with wrong credentials attempting to authenticate repeatedly on eduroam.

Once the lockout is removed you should have access to /software within an hour. If you still don't have access, please reach out to us and we can force sync the storage system.

Open OnDemand Questions

Why can't I request a bigger node or more than 8 hours?

Open OnDemand sessions provide an easy interactive experience for running various workloads for development and testing. We want to ensure users have fast access to resources so their jobs launch faster. We noticed a trend of increased wait times over earlier semesters. When investigating this trend, we discovered many interactive jobs started on Open OnDemand requested large walltimes and were not stopped after computation was complete. These idle jobs tied up compute resources that could have been better utilized by other jobs. We expect this problem to be exacerbated as we remove old CPU nodes and transition nodes to Slurm. The resource limits were put in place to encourage appropriate use of interactive jobs on Open OnDemand. This should enable faster access to resources for everyone.

This decision was also informed by examining interactive queueing policies at national HPC centers (NERSC: Perlmutter, PSC: Bridges-2, etc.).

Node Owners

Node owners can make full use of their purchased resources and are not limited to 8 hours. Be sure to change the selected queue from work1 to your owner queue at the top of the OnDemand form. After switching to an owner queue, you should be able to specify a custom chunk count, CPU count, memory, and walltime.

If you need more resources or a longer runtime, run your code as a batch job. Python scripts, Jupyter notebooks, and R scripts can all be run from a batch job. Within a batch job, you are no longer subject to the limits shown in the Open OnDemand session creation forms; the batch limits are higher. Be sure to monitor your job and make reasonable resource requests. We are happy to help you create batch jobs; join us in office hours or submit a support request.

Why are Jupyter Notebooks launched through Open OnDemand asking for a password?

This is caused by the "conda initialize" statements in your ~/.bashrc file. Remove the following section from your ~/.bashrc file:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/etc/profile.d/conda.sh" ]; then
        . "/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/etc/profile.d/conda.sh"
    else
        export PATH="/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

After this section is removed, activate your conda environments with source activate.

module load anaconda3
source activate <name_of_environment>
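If you would rather not edit ~/.bashrc by hand, the conda initialize block described above can be deleted with sed by matching its start and end marker lines. This is a sketch under the assumption that the markers appear exactly as shown above; the function name `remove_conda_init` is made up for illustration. A backup of the original file is kept with a .bak suffix.

```shell
#!/bin/bash
# Delete the block managed by 'conda init' from a shell startup file,
# from the ">>> conda initialize >>>" marker through the
# "<<< conda initialize <<<" marker, keeping a .bak backup of the original.
# remove_conda_init is a hypothetical helper name.
remove_conda_init() {
    local file="${1:-$HOME/.bashrc}"
    sed -i.bak '/^# >>> conda initialize >>>/,/^# <<< conda initialize <<</d' "$file"
}

# Example:
# remove_conda_init ~/.bashrc
```

Log out and back in (or start a new Open OnDemand session) afterwards so that the change takes effect.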

Why did my Open OnDemand job fail to launch or quit early?

View the Troubleshooting Guide for OnDemand jobs. Use the link next to the job in OnDemand (if applicable) to submit a ticket if the troubleshooting does not fix the problem.

Other

How do I fix the error `undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b` (PBS only)?

The following answer applies to Palmetto 1 and the PBS job scheduler. This error is caused by a version mismatch between the OpenSSL that is part of the Rocky operating system, and the versions that are required by some software packages. If you get that error, please load these modules:

module load libssh/0.8.5-gcc/9.5.0 krb5/1.19.3-gcc/9.5.0

If that fails, load the anaconda3/2022.05-gcc/9.5.0 module.

Error when creating conda environment after loading anaconda module

If an error occurs, try the following command:

export LD_PRELOAD=""

Error: couldn't get an RGB, Double-buffered visual

Try running with VirtualGL.

402: Could not map pixel buffer object

Try running with VirtualGL. Make sure you specify a device to vglrun, either egl to use the EGL back end or :0.0 or :0.1 if you need to use the GLX back end.
