My Feed

qiql

Water can carry a boat, and it can also race one

2023/07/07 06:51:52

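Greatly reduces communication latency between GPUs that sit in the same slot on different servers: all of that traffic is handled on a single TOR switch, which avoids the latency of 3 switch hops;

Avoids the conflicts that arise when traffic is load-balanced across multiple links with ECMP, so bandwidth utilization is maximized;

The fault domain also shrinks considerably: if one TOR switch fails, only one rail is affected;

Matches the ring and tree algorithms of NCCL (NVIDIA Collective Communications Library). As shown in the figure below, NCCL by default discovers the network topology automatically and builds a ring-shaped rail topology that connects all GPUs, which suits the common AllReduce operation.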
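The rail idea in the points above can be illustrated with a small model. This is only a sketch under assumptions that are not in the original post: 8 GPUs per server, one NIC per GPU, and the hypothetical helper names `tor_for`, `same_tor`, and `gpus_hit_by_tor_failure`. It is not how NCCL or any switch represents the topology.

```python
GPUS_PER_SERVER = 8  # assumption: 8 GPU slots per server, one NIC per GPU


def tor_for(server: int, slot: int) -> int:
    """Rail-optimized wiring: GPU slot k on every server connects to TOR switch k."""
    if not 0 <= slot < GPUS_PER_SERVER:
        raise ValueError(f"slot must be in 0..{GPUS_PER_SERVER - 1}")
    return slot  # the server index does not matter, only the slot


def same_tor(gpu_a: tuple, gpu_b: tuple) -> bool:
    """True if two (server, slot) GPUs reach each other through a single TOR switch."""
    return tor_for(*gpu_a) == tor_for(*gpu_b)


def gpus_hit_by_tor_failure(failed_tor: int, num_servers: int) -> list:
    """List the (server, slot) GPUs that lose their link when one TOR switch fails."""
    return [
        (server, slot)
        for server in range(num_servers)
        for slot in range(GPUS_PER_SERVER)
        if tor_for(server, slot) == failed_tor
    ]


# Slot 3 on server 0 and slot 3 on server 5 share one TOR switch (same rail).
print(same_tor((0, 3), (5, 3)))
# Different slots sit on different rails, so they do not share a TOR switch.
print(same_tor((0, 3), (0, 4)))
# With 4 servers, losing TOR 2 affects one GPU per server, i.e. one rail.
print(gpus_hit_by_tor_failure(2, num_servers=4))
```

In this model a TOR failure touches exactly one GPU per server, which is the "only one rail is affected" property.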

2023/06/30 01:18:27

Cause of nccl-tests hanging on RTX 4090 servers: GPU peer-to-peer (P2P) is disabled on these cards. Fix: run export NCCL_P2P_DISABLE=1 so that NCCL does not try to use P2P.
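When the test is launched from a script instead of an interactive shell, the same variable can be set in the child process environment. A minimal sketch: `NCCL_P2P_DISABLE` is the variable from the note above, while the helper name `nccl_env_without_p2p` and the `./build/all_reduce_perf` path in the comment are assumptions about the local nccl-tests checkout.

```python
import os


def nccl_env_without_p2p(base_env=None):
    """Return a copy of the environment with NCCL's P2P transport disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"  # same effect as `export NCCL_P2P_DISABLE=1`
    return env


# Demonstration on a tiny, made-up base environment.
env = nccl_env_without_p2p({"PATH": "/usr/bin"})
print(env["NCCL_P2P_DISABLE"])

# To actually launch the benchmark (path is an assumption, adjust to your build):
#   import subprocess
#   subprocess.run(["./build/all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "2"],
#                  env=nccl_env_without_p2p(), check=True)
```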

2023/06/25 18:59:40

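Scrolling the mouse wheel in a terminal may print garbage characters instead of scrolling the screen as expected.

Fix: type the reset command and press Enter.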

2023/06/22 06:35:30

The blog has passed 20,000 article views.

2023/06/11 23:16:04

Automatic SSL certificate renewal for nginx: https://u.sb/acme-sh-ssl/

2023/06/11 23:01:27
[2023年 06月 11日星期日 14:55:27 CST] Your cert is in: /root/.acme.sh/*.qiql.net_ecc/*.qiql.net.cer
[2023年 06月 11日星期日 14:55:27 CST] Your cert key is in: /root/.acme.sh/*.qiql.net_ecc/*.qiql.net.key
[2023年 06月 11日星期日 14:55:27 CST] The intermediate CA cert is in: /root/.acme.sh/*.qiql.net_ecc/ca.cer
[2023年 06月 11日星期日 14:55:27 CST] And the full chain certs is there: /root/.acme.sh/*.qiql.net_ecc/fullchain.cer
```
    ssl_certificate "/root/.acme.sh/*.qiql.net_ecc/fullchain.cer";
    ssl_certificate_key "/root/.acme.sh/*.qiql.net_ecc/*.qiql.net.key";
```
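The two nginx directives above follow the file layout that acme.sh reports for an ECC certificate. The sketch below derives those paths from a domain name. The layout (a `<domain>_ecc` directory holding `<domain>.cer`, `<domain>.key`, `ca.cer`, and `fullchain.cer`) is inferred from this post, and the helper names `acme_cert_paths` and `nginx_ssl_directives` are hypothetical.

```python
from pathlib import PurePosixPath


def acme_cert_paths(domain, acme_home="/root/.acme.sh", ecc=True):
    """Build the certificate file paths acme.sh reports for a domain."""
    # Assumption from the post: ECC certificates live in "<domain>_ecc".
    cert_dir = PurePosixPath(acme_home) / (domain + ("_ecc" if ecc else ""))
    return {
        "cert": str(cert_dir / f"{domain}.cer"),
        "key": str(cert_dir / f"{domain}.key"),
        "ca": str(cert_dir / "ca.cer"),
        "fullchain": str(cert_dir / "fullchain.cer"),
    }


def nginx_ssl_directives(domain, **kwargs):
    """Return the two nginx directives that point at the full chain and the key."""
    paths = acme_cert_paths(domain, **kwargs)
    return [
        f'ssl_certificate "{paths["fullchain"]}";',
        f'ssl_certificate_key "{paths["key"]}";',
    ]


# Wildcard domain from the post; the asterisk is part of the directory name.
for line in nginx_ssl_directives("*.qiql.net"):
    print(line)
```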
2023/06/11 23:00:57

acme.sh currently supports four production CAs: Let's Encrypt, Buypass, ZeroSSL, and SSL.com. The default is ZeroSSL; to switch, use one of the following commands:

Switch to Let's Encrypt
acme.sh --set-default-ca --server letsencrypt

Switch to Buypass
acme.sh --set-default-ca --server buypass

Switch to ZeroSSL
acme.sh --set-default-ca --server zerossl

2023/05/31 23:36:11

Documentation for older frp versions:
https://www.bookstack.cn/read/frp/spilt.2.spilt.3.README_zh.md

2023/05/31 15:53:30

Time synchronization on CentOS 7.9:
sudo yum install rdate
sudo /usr/bin/rdate -s time.nist.gov

2023/05/30 19:49:17

(base) user@host01:/ddn/support/pkgs$ gcc -O3 -o stream.intel stream.c -DSTATIC -DNTIMES=10 -DSTREAM_ARRAY_SIZE=2500000000 -mcmodel=large -Ofast -fopenmp -ffreestanding
(base) user@host01:/ddn/support/pkgs$ ./stream.intel

STREAM version $Revision: 5.10 $

This system uses 8 bytes per array element.

Array size = 2500000000 (elements), Offset = 0 (elements)
Memory per array = 19073.5 MiB (= 18.6 GiB).
Total memory required = 57220.5 MiB (= 55.9 GiB).
Each kernel will be executed 10 times.
The best time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.

Number of Threads requested = 32
Number of Threads counted = 32

Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 571887 microseconds.
(= 571887 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.

WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.

Function Best Rate MB/s Avg time Min time Max time
Copy: 53664.3 0.770183 0.745375 0.798007
Scale: 54045.2 0.756513 0.740121 0.780275
Add: 60452.2 1.008408 0.992519 1.023139
Triad: 60380.8 1.008532 0.993693 1.052042

Solution Validates: avg error less than 1.000000e-13 on all three arrays

(base) user@host01:/ddn/support/pkgs$
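The numbers in the STREAM output above can be checked by hand. The sketch below recomputes the memory footprint and the best rates from the array size, the 8-byte element size, and the minimum times in the log. It relies on STREAM's byte-counting convention (Copy and Scale touch two arrays per iteration, Add and Triad touch three, and 1 MB = 10^6 bytes); the constant and function names are my own.

```python
ELEMENT_BYTES = 8            # "This system uses 8 bytes per array element."
ARRAY_SIZE = 2_500_000_000   # -DSTREAM_ARRAY_SIZE=2500000000

# Arrays read plus arrays written per kernel iteration.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column from the log, in seconds.
MIN_TIME_S = {
    "Copy": 0.745375,
    "Scale": 0.740121,
    "Add": 0.992519,
    "Triad": 0.993693,
}


def memory_per_array_mib(n=ARRAY_SIZE):
    """Size of one STREAM array in MiB (2**20 bytes)."""
    return n * ELEMENT_BYTES / 2**20


def best_rate_mb_s(kernel):
    """Best rate in MB/s (10**6 bytes), computed from the minimum time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * ELEMENT_BYTES * ARRAY_SIZE
    return bytes_moved / MIN_TIME_S[kernel] / 1e6


print(f"Memory per array: {memory_per_array_mib():.1f} MiB")
print(f"Total for 3 arrays: {3 * memory_per_array_mib():.1f} MiB")
for kernel in MIN_TIME_S:
    print(f"{kernel}: {best_rate_mb_s(kernel):.1f} MB/s")
```

The recomputed values agree with the logged 19073.5 MiB per array and with the Best Rate column to within the rounding of the printed times.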

