Alright, let's continue with Lesson 16. Today we'll cover Linux clustering and high availability, key technologies for building reliable services.
Part 1: Cluster Fundamentals
1.1 What Is a Cluster?
A cluster is a group of interconnected computers (nodes) that work together to provide higher availability, reliability, or performance.
Cluster types:
High-availability cluster (HA cluster): automatically fails services over when a node fails
Load-balancing cluster: distributes requests across multiple nodes
High-performance computing cluster (HPC): processes complex computations in parallel
Storage cluster: distributed storage systems
1.2 Basic Cluster Architecture
text
Client
|
Load Balancer
/ \
Node 1 Node 2 (application servers)
\ /
Shared storage or data synchronization
1.3 Key Cluster Concepts
Failover: when the primary node fails, a backup node takes over its services
Failback: when the primary node recovers, services switch back to it
Heartbeat: the health-check mechanism between nodes
Virtual IP (VIP): the floating IP address through which the cluster serves clients
Split-brain: the nodes lose contact with one another and each believes it is the primary
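To make these concepts concrete, here is a minimal shell sketch (it assumes the VIP 192.168.1.99 and the hostname master2 used later in this lesson) that checks whether the current node holds the VIP and performs a crude heartbeat check against its peer:
bash
#!/bin/bash
# Illustration only: check VIP ownership and peer reachability.
VIP="192.168.1.99"
PEER="master2"
# Does this node currently hold the floating IP?
if ip addr show | grep -q "$VIP"; then
    echo "This node holds the VIP ($VIP) - acting as MASTER"
else
    echo "This node does not hold the VIP - acting as BACKUP"
fi
# Crude heartbeat: can the peer still be reached?
if ping -c 1 -W 1 "$PEER" >/dev/null 2>&1; then
    echo "Heartbeat to $PEER OK"
else
    echo "Heartbeat to $PEER lost - if both nodes now claim MASTER, that is split-brain"
fi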
Part 2: Preparing the Base Environment
2.1 Creating a Test Cluster Environment
We will build a highly available web cluster on three virtual machines:
bash
# Prepare three Ubuntu 22.04 servers
# Node 1: master1 (192.168.1.101)
# Node 2: master2 (192.168.1.102)
# Node 3: lb-node (192.168.1.100)
# Set the hostname and hosts entries on each machine
sudo hostnamectl set-hostname master1 # run each command on its own node
sudo hostnamectl set-hostname master2
sudo hostnamectl set-hostname lb-node
# Edit the hosts file on every node
sudo vim /etc/hosts
Add the following entries (on every node):
text
192.168.1.101 master1
192.168.1.102 master2
192.168.1.100 lb-node
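After saving the file on every node, a quick check (a small sketch using the hostnames above) confirms that name resolution and connectivity work:
bash
# Verify that each node can resolve and reach the others by name.
for node in master1 master2 lb-node; do
    ping -c 1 -W 1 "$node" >/dev/null 2>&1 && echo "$node: reachable" || echo "$node: NOT reachable"
done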
2.2 Configuring Passwordless SSH
bash
# 1. Generate an SSH key on each node (if one does not already exist)
ssh-keygen -t rsa -b 4096
# 2. Set up passwordless logins between all nodes
# On master1:
ssh-copy-id master1
ssh-copy-id master2
ssh-copy-id lb-node
# On master2:
ssh-copy-id master1
ssh-copy-id master2
ssh-copy-id lb-node
# On lb-node:
ssh-copy-id master1
ssh-copy-id master2
ssh-copy-id lb-node
# 3. Test the SSH connections
ssh master2 hostname # should print "master2" without asking for a password
2.3 Installing Base Software
bash
# Update the system on all nodes
sudo apt update
sudo apt upgrade -y
# Install common tools
sudo apt install -y vim curl wget net-tools telnet
# Install chrony for NTP time synchronization
sudo apt install -y chrony
sudo systemctl enable chrony
sudo systemctl start chrony
# Verify time synchronization
chronyc sources
timedatectl
Part 3: Load-Balancing Cluster
3.1 Installing and Configuring HAProxy
HAProxy is a high-performance TCP/HTTP load balancer.
bash
# Install HAProxy on lb-node
sudo apt install -y haproxy
# Back up the original configuration
sudo cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.backup
# Edit the HAProxy configuration
sudo vim /etc/haproxy/haproxy.cfg
Configuration:
bash
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
frontend http_front
    bind *:80
    stats uri /haproxy?stats
    default_backend http_back
backend http_back
    balance roundrobin
    server master1 192.168.1.101:80 check
    server master2 192.168.1.102:80 check
listen stats
    bind *:9000
    stats enable
    stats uri /stats
    stats realm HAProxy\ Statistics
    stats auth admin:admin123
bash
# Create the error-page directory
sudo mkdir -p /etc/haproxy/errors
# Create a sample error page
sudo tee /etc/haproxy/errors/503.http << 'EOF'
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html
<html>
<body>
<h1>503 Service Unavailable</h1>
No server is available to handle your request.
</body>
</html>
EOF
# Start HAProxy
sudo systemctl restart haproxy
sudo systemctl enable haproxy
# Check the status
sudo systemctl status haproxy
ss -tlnp | grep haproxy
# Test access
curl http://lb-node/haproxy?stats
# or open http://lb-node:9000/stats in a browser (admin/admin123)
3.2 Installing and Configuring Nginx (the Backend Web Servers)
bash
# Install Nginx on master1 and master2
sudo apt install -y nginx
# Give each node a distinct home page so you can tell them apart
# On master1:
sudo tee /var/www/html/index.html << 'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Web Server - Master1</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; background: #f0f0f0; }
.container { background: white; padding: 30px; border-radius: 10px; }
.server { color: green; font-weight: bold; }
.ip { color: blue; }
</style>
</head>
<body>
<div class="container">
<h1>Welcome to High Availability Cluster</h1>
<h2>This is served from: <span class="server">Master1</span></h2>
<p>Server IP: <span class="ip">192.168.1.101</span></p>
<p>Time: <span id="time">$(date)</span></p>
<hr>
<p>This is part of a highly available web cluster.</p>
<p>If this server fails, requests will be automatically routed to Master2.</p>
</div>
<script>document.getElementById('time').textContent = new Date().toString();</script>
</body>
</html>
EOF
# On master2:
sudo tee /var/www/html/index.html << 'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Web Server - Master2</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; background: #f0f0f0; }
.container { background: white; padding: 30px; border-radius: 10px; }
.server { color: red; font-weight: bold; }
.ip { color: blue; }
</style>
</head>
<body>
<div class="container">
<h1>Welcome to High Availability Cluster</h1>
<h2>This is served from: <span class="server">Master2</span></h2>
<p>Server IP: <span class="ip">192.168.1.102</span></p>
<p>Time: <span id="time">$(date)</span></p>
<hr>
<p>This is part of a highly available web cluster.</p>
<p>If Master1 fails, I will handle all requests.</p>
</div>
<script>document.getElementById('time').textContent = new Date().toString();</script>
</body>
</html>
EOF
# Start Nginx
sudo systemctl restart nginx
sudo systemctl enable nginx
# Test the backends directly
curl http://master1
curl http://master2
3.3 Testing Load Balancing
bash
# Test load balancing from lb-node
# Send several requests and watch the round-robin rotation
for i in {1..10}; do
    echo "Request $i: $(curl -s http://lb-node/ | grep -o 'Master[12]')"
    sleep 1
done
# View HAProxy statistics
curl -s http://lb-node:9000/stats -u admin:admin123 | grep -E "(server|Session)"
Part 4: High-Availability Cluster
4.1 Installing and Configuring Keepalived
Keepalived provides a virtual IP (VIP) and health checks, which together implement high availability.
bash
# Install Keepalived on master1 and master2
sudo apt install -y keepalived
# Configure master1 (the primary node)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
! Configuration File for keepalived
global_defs {
    router_id LVS_MASTER        # use LVS_BACKUP on the backup node
}
vrrp_script chk_nginx {
    script "/usr/bin/killall -0 nginx"   # check that the nginx process exists
    interval 2                  # check every 2 seconds
    weight -20                  # subtract 20 from the priority when the check fails,
                                # so the backup node (priority 90) can take over the VIP
    fall 2                      # 2 consecutive failures mark the check as failed
    rise 1                      # 1 success marks it as recovered
}
vrrp_instance VI_1 {
    state MASTER                # primary node; use BACKUP on the backup node
    interface eth0              # adjust to your actual network interface
    virtual_router_id 51        # virtual router ID; must match across the cluster
    priority 100                # priority: 100 on the primary, 90 on the backup
    advert_int 1                # advertisement interval of 1 second
    authentication {
        auth_type PASS
        auth_pass 1111          # authentication password; must match across the cluster
    }
    virtual_ipaddress {
        192.168.1.99/24         # the virtual IP (VIP)
    }
    track_script {
        chk_nginx               # attach the health-check script
    }
    notify_master "/etc/keepalived/notify.sh master"
    notify_backup "/etc/keepalived/notify.sh backup"
    notify_fault "/etc/keepalived/notify.sh fault"
}
EOF
# Configure master2 (the backup node)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
! Configuration File for keepalived
global_defs {
    router_id LVS_BACKUP
}
vrrp_script chk_nginx {
    script "/usr/bin/killall -0 nginx"
    interval 2
    weight -20
    fall 2
    rise 1
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.99/24
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/notify.sh master"
    notify_backup "/etc/keepalived/notify.sh backup"
    notify_fault "/etc/keepalived/notify.sh fault"
}
EOF
4.2 Creating the Notification Script
bash
# Create the notification script on master1 and master2
sudo tee /etc/keepalived/notify.sh << 'EOF'
#!/bin/bash
# Keepalived state-change notification script
TYPE=$1
NAME=$2
STATE=$3
log_file="/var/log/keepalived-notify.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - [$(hostname)] Type: $TYPE, Name: $NAME, State: $STATE" >> $log_file
case $STATE in
    "MASTER")
        echo "I am now MASTER" >> $log_file
        # Add commands to run after becoming MASTER here,
        # e.g. starting specific services or sending notifications
        ;;
    "BACKUP")
        echo "I am now BACKUP" >> $log_file
        # Add commands to run after becoming BACKUP here
        ;;
    "FAULT")
        echo "I am now in FAULT state" >> $log_file
        # Add commands to run in the FAULT state here
        ;;
    *)
        echo "Unknown state: $STATE" >> $log_file
        exit 1
        ;;
esac
EOF
sudo chmod +x /etc/keepalived/notify.sh
4.3 Configuring Nginx to Listen on the VIP
bash
# Adjust the Nginx site configuration so it listens on all addresses (including the VIP)
sudo tee /etc/nginx/sites-available/default << 'EOF'
server {
    listen 80 default_server;
    listen [::]:80 default_server;
    root /var/www/html;
    index index.html index.htm;
    server_name _;
    location / {
        try_files $uri $uri/ =404;
    }
}
EOF
# Restart Nginx
sudo systemctl restart nginx
4.4 Starting and Testing Keepalived
bash
# Start Keepalived on master1 and master2
sudo systemctl restart keepalived
sudo systemctl enable keepalived
# Check the status
sudo systemctl status keepalived
# View the virtual IP
ip addr show eth0
# master1 should now hold 192.168.1.99 (the VIP)
# Test access through the VIP
curl http://192.168.1.99
# Simulate a failover
# 1. Stop Nginx on master1
sudo systemctl stop nginx
# 2. Watch the VIP move to master2
# On master1, check that the VIP is gone
ip addr show eth0
# On master2, check that the VIP has appeared
ssh master2 "ip addr show eth0"
# 3. Confirm the service is still reachable through the VIP
curl http://192.168.1.99
# 4. Bring Nginx on master1 back up
sudo systemctl start nginx
# 5. Watch whether the VIP fails back to master1 (it should, since master1 has the higher priority)
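While running steps 1-5, it helps to watch the switchover from a separate client machine. The loop below is a minimal sketch (it assumes the VIP 192.168.1.99 and the Master1/Master2 demo pages created earlier); stop it with Ctrl+C:
bash
# Poll the VIP once per second and print which backend answered.
while true; do
    node=$(curl -s --max-time 2 http://192.168.1.99/ | grep -o 'Master[12]' | head -n1)
    echo "$(date '+%H:%M:%S') served by: ${node:-no response}"
    sleep 1
done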
Part 5: Data Synchronization and Shared Storage
5.1 Configuring NFS Shared Storage
bash
# Install the NFS server on master1
sudo apt install -y nfs-kernel-server
# Create the shared directory
sudo mkdir -p /shared/webdata
sudo chown nobody:nogroup /shared/webdata
sudo chmod 777 /shared/webdata
# Configure the NFS export
sudo tee /etc/exports << 'EOF'
/shared/webdata 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
EOF
# Apply the configuration
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
sudo systemctl enable nfs-kernel-server
# Install the NFS client on master2 and mount the share
sudo apt install -y nfs-common
sudo mkdir -p /mnt/nfs/webdata
# Test the mount
sudo mount 192.168.1.101:/shared/webdata /mnt/nfs/webdata
# Make it permanent (append to /etc/fstab)
echo "192.168.1.101:/shared/webdata /mnt/nfs/webdata nfs defaults 0 0" | sudo tee -a /etc/fstab
# Point Nginx at the shared storage
# On master2 (which mounts the export at /mnt/nfs/webdata):
sudo mv /var/www/html /var/www/html.local
sudo ln -s /mnt/nfs/webdata /var/www/html
# On master1, link to the exported directory itself (or mount the export locally as well):
# sudo mv /var/www/html /var/www/html.local && sudo ln -s /shared/webdata /var/www/html
# Create web content on the shared storage (on master1)
echo "<h1>This is served from shared storage</h1>" | sudo tee /shared/webdata/index.html
5.2 Synchronizing Data with rsync
bash
# Create the sync script
sudo tee /usr/local/bin/sync_webdata.sh << 'EOF'
#!/bin/bash
# Web data synchronization script
SOURCE_DIR="/var/www/html.local"
TARGET_DIR="/mnt/nfs/webdata"
LOG_FILE="/var/log/webdata_sync.log"
LOCK_FILE="/tmp/webdata_sync.lock"
# Prevent more than one instance from running at a time
if [ -f "$LOCK_FILE" ]; then
    echo "$(date) - Sync already running" >> "$LOG_FILE"
    exit 1
fi
touch "$LOCK_FILE"
echo "$(date) - Starting sync" >> "$LOG_FILE"
# Sync the data with rsync
rsync -avz --delete \
    --exclude='*.tmp' \
    --exclude='.git/' \
    "$SOURCE_DIR/" "$TARGET_DIR/" >> "$LOG_FILE" 2>&1
SYNC_STATUS=$?
if [ $SYNC_STATUS -eq 0 ]; then
    echo "$(date) - Sync completed successfully" >> "$LOG_FILE"
else
    echo "$(date) - Sync failed with status $SYNC_STATUS" >> "$LOG_FILE"
fi
rm -f "$LOCK_FILE"
exit $SYNC_STATUS
EOF
sudo chmod +x /usr/local/bin/sync_webdata.sh
# Schedule the sync to run every 5 minutes
echo "*/5 * * * * root /usr/local/bin/sync_webdata.sh" | sudo tee /etc/cron.d/webdata_sync
5.3 Configuring DRBD (Distributed Replicated Block Device)
DRBD provides block-level data replication between nodes.
bash
# Install DRBD on master1 and master2
sudo apt install -y drbd-utils
# Load the DRBD kernel module
sudo modprobe drbd
# Create the DRBD resource configuration
sudo tee /etc/drbd.d/webdata.res << 'EOF'
resource webdata {
    protocol C;
    disk {
        on-io-error detach;
    }
    on master1 {
        device /dev/drbd0;
        disk /dev/sdb1;          # assumes a spare disk partition exists
        address 192.168.1.101:7788;
        meta-disk internal;
    }
    on master2 {
        device /dev/drbd0;
        disk /dev/sdb1;
        address 192.168.1.102:7788;
        meta-disk internal;
    }
}
EOF
# Initialize DRBD
# Run on both machines:
sudo drbdadm create-md webdata
# Start the DRBD service
sudo systemctl start drbd
sudo systemctl enable drbd
# Promote master1 to primary
sudo drbdadm primary --force webdata
# Check DRBD status
sudo drbdadm status webdata
cat /proc/drbd
# Create a filesystem (on the primary only)
sudo mkfs.ext4 /dev/drbd0
# Mount it
sudo mkdir -p /mnt/drbd
sudo mount /dev/drbd0 /mnt/drbd
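In this single-primary setup only one node may have the DRBD device mounted at a time, so a manual switchover looks roughly like the sketch below (resource name and mount point as configured above):
bash
# On master1 (current primary): release the device and demote the resource.
sudo umount /mnt/drbd
sudo drbdadm secondary webdata
# On master2: promote the resource and mount the replicated device.
sudo drbdadm primary webdata
sudo mkdir -p /mnt/drbd
sudo mount /dev/drbd0 /mnt/drbd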
Part 6: Cluster Management and Monitoring
6.1 Cluster Status Monitoring Script
bash
sudo tee /usr/local/bin/cluster_status.sh << 'EOF'
#!/bin/bash
# Cluster status monitoring script
echo "=== Cluster Status Report ==="
echo "Time: $(date '+%Y-%m-%d %H:%M:%S')"
echo
# 1. Check the VIP
echo "1. Virtual IP status:"
VIP="192.168.1.99"
if ip addr show | grep -q "$VIP"; then
    echo "  ✓ VIP $VIP is on this node"
else
    echo "  ✗ VIP $VIP is not on this node"
fi
echo
# 2. Check Keepalived
echo "2. Keepalived status:"
if systemctl is-active keepalived >/dev/null; then
    echo "  ✓ Keepalived is running"
    # Infer the VRRP role from VIP ownership
    if ip addr show | grep -q "$VIP"; then
        echo "  Role: MASTER"
    else
        echo "  Role: BACKUP"
    fi
else
    echo "  ✗ Keepalived is not running"
fi
echo
# 3. Check Nginx
echo "3. Nginx status:"
if systemctl is-active nginx >/dev/null; then
    echo "  ✓ Nginx is running"
    # Check whether Nginx responds correctly
    if curl -s -o /dev/null -w "%{http_code}" http://localhost/ | grep -q "200"; then
        echo "  Service response: OK"
    else
        echo "  Service response: abnormal"
    fi
else
    echo "  ✗ Nginx is not running"
fi
echo
# 4. Check HAProxy (on the load-balancer node)
echo "4. HAProxy status:"
if systemctl is-active haproxy >/dev/null 2>&1; then
    echo "  ✓ HAProxy is running"
    # Check the backend server status via the CSV stats export
    echo "  Backend servers:"
    if command -v haproxy >/dev/null; then
        echo "    master1: $(curl -s 'http://localhost:9000/stats;csv' -u admin:admin123 | grep '^http_back,master1' | awk -F, '{print $18}')"
        echo "    master2: $(curl -s 'http://localhost:9000/stats;csv' -u admin:admin123 | grep '^http_back,master2' | awk -F, '{print $18}')"
    fi
else
    echo "  ✗ HAProxy is not running (or not installed)"
fi
echo
# 5. Check NFS mounts
echo "5. NFS mount status:"
if mount | grep -q nfs; then
    echo "  ✓ NFS is mounted"
    mount | grep nfs
else
    echo "  ✗ NFS is not mounted"
fi
echo
# 6. Check connectivity between nodes
echo "6. Node connectivity:"
for node in master1 master2 lb-node; do
    if ping -c 1 -W 1 $node >/dev/null 2>&1; then
        echo "  ✓ $node: reachable"
    else
        echo "  ✗ $node: unreachable"
    fi
done
echo
echo "=== End of report ==="
EOF
sudo chmod +x /usr/local/bin/cluster_status.sh
6.2 Automated Fault Detection and Recovery
bash
sudo tee /usr/local/bin/cluster_health_check.sh << 'EOF'
#!/bin/bash
# Cluster health check and automatic recovery script
LOG_FILE="/var/log/cluster_health.log"
VIP="192.168.1.99"
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}
# Check Nginx and restart it if needed
check_nginx() {
    if ! systemctl is-active nginx >/dev/null; then
        log "Nginx is not running, attempting restart..."
        systemctl restart nginx
        if systemctl is-active nginx >/dev/null; then
            log "Nginx restarted successfully"
            return 0
        else
            log "Nginx restart failed"
            return 1
        fi
    fi
    return 0
}
# Check Keepalived and restart it if needed
check_keepalived() {
    if ! systemctl is-active keepalived >/dev/null; then
        log "Keepalived is not running, attempting restart..."
        systemctl restart keepalived
        if systemctl is-active keepalived >/dev/null; then
            log "Keepalived restarted successfully"
            return 0
        else
            log "Keepalived restart failed"
            return 1
        fi
    fi
    return 0
}
# Check the VIP
check_vip() {
    local node_type="$1"
    if [ "$node_type" = "master" ]; then
        if ! ip addr show | grep -q "$VIP"; then
            log "VIP is not on the primary node, trying to reclaim it..."
            # Restart Keepalived so the VRRP election runs again
            systemctl stop keepalived
            sleep 2
            systemctl start keepalived
            sleep 5
            if ip addr show | grep -q "$VIP"; then
                log "VIP recovered"
                return 0
            else
                log "VIP recovery failed"
                return 1
            fi
        fi
    fi
    return 0
}
# Main function
main() {
    local node_type="$1"
    log "Starting cluster health check - node type: $node_type"
    # Run the checks
    check_nginx
    check_keepalived
    if [ "$node_type" = "master" ]; then
        check_vip "master"
    fi
    log "Health check finished"
}
# Run the main function
if [ -n "$1" ]; then
    main "$1"
else
    echo "Usage: $0 [master|backup]"
    exit 1
fi
EOF
sudo chmod +x /usr/local/bin/cluster_health_check.sh
# Schedule the check to run every minute; install only the file that matches the node's role
# On the primary node (master1):
echo "* * * * * root /usr/local/bin/cluster_health_check.sh master" | sudo tee /etc/cron.d/cluster_health_master
# On the backup node (master2):
echo "* * * * * root /usr/local/bin/cluster_health_check.sh backup" | sudo tee /etc/cron.d/cluster_health_backup
6.3 Monitoring the Cluster with Prometheus
bash
# Install Prometheus on the monitoring node
sudo apt install -y prometheus
# Configure Prometheus to scrape all cluster nodes
sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets:
          - 'master1:9100'
          - 'master2:9100'
          - 'lb-node:9100'
        labels:
          group: 'ha-cluster'
  - job_name: 'haproxy'
    static_configs:
      - targets: ['lb-node:9101']
  - job_name: 'nginx'
    static_configs:
      - targets:
          - 'master1:9113'
          - 'master2:9113'
EOF
# Install Node Exporter on every node
sudo apt install -y prometheus-node-exporter
# Install the HAProxy Exporter on the load-balancer node
wget https://github.com/prometheus/haproxy_exporter/releases/download/v0.13.0/haproxy_exporter-0.13.0.linux-amd64.tar.gz
tar xzf haproxy_exporter-*.tar.gz
sudo mv haproxy_exporter-*/haproxy_exporter /usr/local/bin/
# Create a systemd service for the exporter
sudo tee /etc/systemd/system/haproxy-exporter.service << 'EOF'
[Unit]
Description=HAProxy Exporter
After=network.target
[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/haproxy_exporter \
--haproxy.scrape-uri="http://admin:admin123@localhost:9000/stats;csv" \
--web.listen-address=":9101"
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl start haproxy-exporter
sudo systemctl enable haproxy-exporter
Part 7: Advanced Cluster Configuration
7.1 Configuring Pacemaker and Corosync (Enterprise-Grade HA)
bash
# Install Pacemaker, Corosync, and pcs on master1 and master2
sudo apt install -y pacemaker corosync pcs
# Set the hacluster user's password (use the same password on both nodes)
sudo passwd hacluster
# Configure the cluster from master1
# Note: pcs 0.10+ (as shipped with Ubuntu 22.04) uses "pcs host auth" and a positional
# cluster name; older pcs 0.9 used "pcs cluster auth ... --force" and "--name".
sudo pcs host auth master1 master2 -u hacluster -p <hacluster-password>
sudo pcs cluster setup ha_cluster master1 master2
sudo pcs cluster start --all
sudo pcs cluster enable --all
# Configure cluster properties
sudo pcs property set stonith-enabled=false # disable STONITH only in a test environment
sudo pcs property set no-quorum-policy=ignore
sudo pcs property set cluster-recheck-interval=2min
# Create the virtual IP resource
sudo pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
  ip=192.168.1.99 \
  cidr_netmask=24 \
  op monitor interval=30s
# Create the Nginx resource
sudo pcs resource create WebServer systemd:nginx \
  op monitor interval=20s \
  op start timeout=40s \
  op stop timeout=60s
# Create a resource group (resources start in order)
sudo pcs resource group add WebGroup ClusterIP WebServer
# Check the cluster status
sudo pcs status
sudo pcs cluster status
sudo crm_mon -1 # one-shot status snapshot (omit -1 to watch continuously)
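An optional failover drill with Pacemaker, as a sketch (the resource and node names match the setup above). Putting a node into standby forces its resources to move elsewhere:
bash
sudo pcs node standby master1     # WebGroup (VIP + Nginx) should move to master2
sudo pcs status                   # confirm where the resources are now running
sudo pcs node unstandby master1   # make master1 eligible to run resources again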
7.2 Configuring a Load-Balancer Cluster (HAProxy High Availability)
bash
# Configure HAProxy high availability on lb-node plus a second load balancer
# Assume a second node, lb-backup (192.168.1.103)
# Install HAProxy and Keepalived on both load balancers
sudo apt install -y haproxy keepalived
# Configure Keepalived (as in Part 4, but for the load balancers)
# On lb-node (primary):
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 52
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 2222
    }
    virtual_ipaddress {
        192.168.1.98/24 # VIP of the load-balancer pair
    }
}
# The virtual_server section below uses keepalived's built-in LVS balancing. It must be
# a top-level block (not nested inside vrrp_instance); if HAProxy already load-balances
# port 80 on the VIP, you can omit it entirely.
virtual_server 192.168.1.98 80 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP
    real_server 192.168.1.101 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
        }
    }
    real_server 192.168.1.102 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
        }
    }
}
EOF
# On lb-backup, use the same configuration but with state BACKUP and a lower priority
# Start the services
sudo systemctl restart keepalived
sudo systemctl enable keepalived
Part 8: Security Hardening
8.1 Cluster Network Security
bash
# Configure firewall rules
# Install UFW on all nodes
sudo apt install -y ufw
# On master1 and master2:
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 192.168.1.0/24 to any port 22 # SSH
sudo ufw allow from 192.168.1.0/24 to any port 80 # HTTP
sudo ufw allow from 192.168.1.0/24 to any port 443 # HTTPS
sudo ufw allow from 192.168.1.0/24 to any port 3306 # MySQL (if used)
# Note: Keepalived's VRRP advertisements (IP protocol 112) and the NFS ports must also be
# allowed between the cluster nodes; VRRP is not covered by ufw's simple "allow port"
# syntax, so add a rule for it in /etc/ufw/before.rules if keepalived runs on this node.
sudo ufw enable
# On the load balancer:
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 9000/tcp # HAProxy statistics
sudo ufw enable
# Harden SSH
sudo vim /etc/ssh/sshd_config
# Adjust the following settings:
# Port 2222               (if you change the port, also update the UFW rule above)
# PermitRootLogin no
# PasswordAuthentication no
# AllowUsers clusteradmin
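If you prefer to apply those SSH settings non-interactively, here is a minimal sketch (it assumes the clusteradmin user already exists with a working SSH key; keep an existing session open until you have confirmed you can still log in):
bash
# Apply the hardening settings in place, validate the config, then restart sshd.
sudo sed -i \
    -e 's/^#\?PermitRootLogin.*/PermitRootLogin no/' \
    -e 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' \
    /etc/ssh/sshd_config
echo "AllowUsers clusteradmin" | sudo tee -a /etc/ssh/sshd_config
sudo sshd -t && sudo systemctl restart ssh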
8.2 Cluster Authentication and Encryption
bash
# Configure Corosync authentication
sudo corosync-keygen
# Copy the generated authkey to every node that is part of the Corosync cluster
sudo scp /etc/corosync/authkey master2:/etc/corosync/
sudo scp /etc/corosync/authkey lb-node:/etc/corosync/ # only if lb-node joins the cluster
# Set the correct permissions
sudo chmod 400 /etc/corosync/authkey
sudo chown root:root /etc/corosync/authkey
# Configure DRBD peer authentication (a shared-secret challenge, not payload encryption)
# Add a net section *inside* the "resource webdata { ... }" block of /etc/drbd.d/webdata.res
# on both nodes (appending it after the closing brace would be invalid):
#     net {
#         cram-hmac-alg sha1;
#         shared-secret "MySecretKey123";
#     }
# Then apply the change on both nodes:
sudo drbdadm adjust webdata
Part 9: Failure Drills and Testing
9.1 Failure Testing Script
bash
sudo tee /usr/local/bin/cluster_failure_test.sh << 'EOF'
#!/bin/bash
# Cluster failure testing script
echo "=== Cluster Failure Test ==="
echo "Warning: this simulates failures and may disrupt service!"
echo
read -p "Choose a test:
1) Stop the Nginx service
2) Stop the Keepalived service
3) Simulate a network partition
4) Simulate a node outage
5) Restore all services
Enter a number: " test_type
case $test_type in
    1)
        echo "Test: stopping the Nginx service"
        sudo systemctl stop nginx
        echo "Nginx stopped, watch the failover..."
        ;;
    2)
        echo "Test: stopping the Keepalived service"
        sudo systemctl stop keepalived
        echo "Keepalived stopped, watch the VIP move..."
        ;;
    3)
        echo "Test: simulating a network partition"
        # Temporarily block traffic between the cluster nodes
        for node in master1 master2; do
            if [ "$(hostname)" != "$node" ]; then
                sudo iptables -A INPUT -s $(getent hosts $node | awk '{print $1}') -j DROP
                echo "Blocked traffic from $node"
            fi
        done
        ;;
    4)
        echo "Test: simulating a node outage"
        read -p "Node to take down (master1/master2): " failed_node
        if [ "$failed_node" = "master1" ] || [ "$failed_node" = "master2" ]; then
            echo "Simulating an outage on $failed_node..."
            ssh $failed_node "sudo systemctl stop nginx keepalived"
        else
            echo "Invalid node"
        fi
        ;;
    5)
        echo "Restoring all services"
        sudo systemctl start nginx keepalived
        sudo iptables -F # flush the DROP rules added above (this also clears UFW's rules; run "sudo ufw reload" afterwards if UFW is enabled)
        echo "Services restored"
        ;;
    *)
        echo "Invalid choice"
        ;;
esac
echo
echo "Test finished, checking cluster status:"
/usr/local/bin/cluster_status.sh
EOF
sudo chmod +x /usr/local/bin/cluster_failure_test.sh
9.2 Performance and Stress Testing
bash
# Install stress-testing tools
sudo apt install -y apache2-utils siege
# Stress test with ab
ab -n 10000 -c 100 http://192.168.1.99/
# Longer run with siege
siege -c 100 -t 60S http://192.168.1.99/
# Watch performance metrics while the test runs
watch -n 1 '
echo "Load: $(uptime)"
echo "Connections: $(ss -t | wc -l)"
echo "Memory: $(free -h | grep Mem)"
echo "HAProxy backend session rate: $(curl -s "http://lb-node:9000/stats;csv" -u admin:admin123 | grep "http_back,BACKEND" | awk -F, "{print \$34}")"
'
Part 10: Production Best Practices
10.1 Cluster Deployment Checklist
bash
#!/bin/bash
# cluster_deployment_checklist.sh
echo "=== Cluster Deployment Checklist ==="
echo
check_item() {
    local item="$1"
    local check_cmd="$2"
    local expected="$3"
    if eval "$check_cmd" | grep -q "$expected"; then
        echo "✓ $item"
        return 0
    else
        echo "✗ $item"
        return 1
    fi
}
echo "1. Network checks:"
check_item "Node-to-node connectivity" "ping -c 1 master2" "1 received"
check_item "Name resolution works" "getent hosts master1" "192.168.1.101"
check_item "LAN scan (requires arp-scan; inspect output for duplicate IPs)" "sudo arp-scan --localnet" "192.168.1."
echo
echo "2. Service checks:"
check_item "Passwordless SSH" "ssh -o BatchMode=yes master2 hostname" "master2"
check_item "Time synchronized" "chronyc sources" "^\^\*"
check_item "Firewall enabled" "sudo ufw status" "active"
echo
echo "3. High-availability checks:"
check_item "VIP present" "ip addr show" "192.168.1.99"
check_item "Keepalived running" "systemctl is-active keepalived" "active"
check_item "Nginx running" "systemctl is-active nginx" "active"
echo
echo "4. Load-balancing checks:"
check_item "HAProxy running" "systemctl is-active haproxy" "active"
check_item "Backends healthy" "curl -s 'http://localhost:9000/stats;csv' -u admin:admin123" "UP"
echo
echo "5. Data synchronization checks:"
check_item "NFS mounted" "mount | grep nfs" "nfs"
check_item "Shared directory writable" "touch /mnt/nfs/webdata/.writetest && echo writable && rm /mnt/nfs/webdata/.writetest" "writable"
echo
echo "=== Checklist complete ==="
10.2 Backup and Recovery Strategy
bash
# Create a backup script for the cluster configuration
sudo tee /usr/local/bin/cluster_backup.sh << 'EOF'
#!/bin/bash
# Cluster configuration backup script
BACKUP_DIR="/backup/cluster_config"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/cluster_backup_$DATE.tar.gz"
mkdir -p "$BACKUP_DIR"
echo "Backing up cluster configuration..."
echo "Backup file: $BACKUP_FILE"
# Back up the important configuration files
# (--ignore-failed-read keeps tar from aborting on files it cannot read;
#  some of these paths may not exist on every node)
tar -czf "$BACKUP_FILE" --ignore-failed-read \
    /etc/haproxy \
    /etc/keepalived \
    /etc/nginx \
    /etc/drbd.d \
    /etc/corosync \
    /etc/pcs \
    /etc/exports \
    /etc/fstab \
    /etc/hosts \
    /root/.ssh/id_rsa.pub \
    /usr/local/bin/cluster_*.sh 2>/dev/null
if [ $? -eq 0 ]; then
    echo "Backup succeeded"
    echo "File size: $(du -h "$BACKUP_FILE" | cut -f1)"
    # Keep only the last 7 days of backups
    find "$BACKUP_DIR" -name "cluster_backup_*.tar.gz" -mtime +7 -delete
    echo "Removed backups older than 7 days"
else
    echo "Backup failed"
    exit 1
fi
echo "Backup finished"
EOF
sudo chmod +x /usr/local/bin/cluster_backup.sh
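The script above only creates backups; a matching restore is sketched below (the backup file name is only an example). Extract into a staging directory and copy files back selectively rather than overwriting /etc wholesale:
bash
# Restore a chosen backup into a staging area and review it before copying
# individual files back into place.
BACKUP_FILE="/backup/cluster_config/cluster_backup_20260101_120000.tar.gz" # example name
sudo mkdir -p /tmp/cluster_restore
sudo tar -xzf "$BACKUP_FILE" -C /tmp/cluster_restore
ls /tmp/cluster_restore/etc
# Example: restore only the HAProxy configuration, then reload the service
sudo cp -a /tmp/cluster_restore/etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg
sudo systemctl reload haproxy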
Practice Projects
Project 1: Build a Complete Highly Available Web Cluster
Set up the cluster with three virtual machines
Configure Nginx as the web server
Configure Keepalived for high availability
Configure HAProxy for load balancing
Configure NFS shared storage
Deploy monitoring and alerting
Project 2: A Highly Available Database Cluster
Configure MySQL primary/replica replication
Use HAProxy for read/write splitting
Configure MHA (MySQL high availability) for automatic failover
Set up a backup and recovery strategy
Project 3: A Containerized Cluster (Kubernetes Basics)
Install and configure a Kubernetes cluster
Deploy a highly available application
Configure automatic scaling
Set up service discovery and load balancing
Project 4: A Hybrid-Cloud High-Availability Architecture
Deploy the cluster across a cloud platform and an on-premises data center
Configure cross-region data synchronization
Implement disaster recovery and business continuity
Optimize cross-region network latency
Today's Summary
Today we covered Linux clustering and high availability:
Cluster basics: concepts, types, architecture
Load balancing: HAProxy configuration and tuning
High availability: failover with Keepalived
Data synchronization: NFS shared storage and DRBD
Cluster management: status monitoring and health checks
Enterprise-grade solutions: Pacemaker and Corosync
Security hardening: network security, authentication and encryption
Failure drills: testing and recovery procedures
Best practices: production deployment and backups
Key principles:
Design for redundancy: eliminate single points of failure
Automate recovery: minimize manual intervention
Monitor continuously: detect and handle problems early
Test regularly: verify the failover mechanisms
Document everything: record the architecture and operating procedures
Clustering is the foundation for building reliable, scalable services. As your workloads grow, these skills only become more important.
Any questions? Once you have finished the practice projects, we can continue with Lesson 17: Linux automation and configuration management tools (Ansible, Puppet, Chef).