Alright, let's continue with Lesson 16. Today we'll cover Linux clustering and high availability, key technologies for building reliable services.
Part 1: Cluster Fundamentals
1.1 What Is a Cluster?
A cluster is a group of interconnected computers (nodes) that work together to provide higher availability, reliability, or performance.
Cluster types:
High-availability cluster (HA cluster): automatically fails services over when a node fails
Load-balancing cluster: distributes requests across multiple nodes
High-performance computing cluster (HPC): processes complex computations in parallel
Storage cluster: distributed storage systems
1.2 Basic Cluster Architecture
text
Client
|
Load Balancer
/ \
Node 1 Node 2 (application servers)
\ /
Shared storage or data synchronization
1.3 Key Cluster Concepts
Failover: when the primary node fails, a backup node takes over its services
Failback: when the primary node recovers, services switch back to it
Heartbeat: the health-check mechanism between nodes
Virtual IP (VIP): the floating IP address through which the cluster serves clients
Split-brain: the nodes lose contact with one another and each believes it is the primary
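To make these concepts concrete, here is a minimal shell sketch (it assumes the VIP 192.168.1.99 and the hostname master2 used later in this lesson) that checks whether the current node holds the VIP and performs a crude heartbeat check against its peer:
bash
#!/bin/bash
# Illustration only: check VIP ownership and peer reachability.
VIP="192.168.1.99"
PEER="master2"
# Does this node currently hold the floating IP?
if ip addr show | grep -q "$VIP"; then
    echo "This node holds the VIP ($VIP) - acting as MASTER"
else
    echo "This node does not hold the VIP - acting as BACKUP"
fi
# Crude heartbeat: can the peer still be reached?
if ping -c 1 -W 1 "$PEER" >/dev/null 2>&1; then
    echo "Heartbeat to $PEER OK"
else
    echo "Heartbeat to $PEER lost - if both nodes now claim MASTER, that is split-brain"
fi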
Part 2: Preparing the Base Environment
2.1 Creating a Test Cluster Environment
We will build a highly available web cluster on three virtual machines:
bash
# Prepare three Ubuntu 22.04 servers
# Node 1: master1 (192.168.1.101)
# Node 2: master2 (192.168.1.102)
# Node 3: lb-node (192.168.1.100)
# Set the hostname and hosts entries on each machine
sudo hostnamectl set-hostname master1 # run each command on its own node
sudo hostnamectl set-hostname master2
sudo hostnamectl set-hostname lb-node
# Edit the hosts file on every node
sudo vim /etc/hosts
Add the following entries (on every node):
text
192.168.1.101 master1
192.168.1.102 master2
192.168.1.100 lb-node
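After saving the file on every node, a quick check (a small sketch using the hostnames above) confirms that name resolution and connectivity work:
bash
# Verify that each node can resolve and reach the others by name.
for node in master1 master2 lb-node; do
    ping -c 1 -W 1 "$node" >/dev/null 2>&1 && echo "$node: reachable" || echo "$node: NOT reachable"
done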
2.2 Configuring Passwordless SSH
bash
# 1. Generate an SSH key on each node (if one does not already exist)
ssh-keygen -t rsa -b 4096
# 2. Set up passwordless logins between all nodes
# On master1:
ssh-copy-id master1
ssh-copy-id master2
ssh-copy-id lb-node
# On master2:
ssh-copy-id master1
ssh-copy-id master2
ssh-copy-id lb-node
# On lb-node:
ssh-copy-id master1
ssh-copy-id master2
ssh-copy-id lb-node
# 3. Test the SSH connections
ssh master2 hostname # should print "master2" without asking for a password
2.3 Installing Base Software
bash
# Update the system on all nodes
sudo apt update
sudo apt upgrade -y
# Install common tools
sudo apt install -y vim curl wget net-tools telnet
# Install chrony for NTP time synchronization
sudo apt install -y chrony
sudo systemctl enable chrony
sudo systemctl start chrony
# Verify time synchronization
chronyc sources
timedatectl
Part 3: Load-Balancing Cluster
3.1 Installing and Configuring HAProxy
HAProxy is a high-performance TCP/HTTP load balancer.
bash
# Install HAProxy on lb-node
sudo apt install -y haproxy
# Back up the original configuration
sudo cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.backup
# Edit the HAProxy configuration
sudo vim /etc/haproxy/haproxy.cfg
Configuration:
bash
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
frontend http_front
    bind *:80
    stats uri /haproxy?stats
    default_backend http_back
backend http_back
    balance roundrobin
    server master1 192.168.1.101:80 check
    server master2 192.168.1.102:80 check
listen stats
    bind *:9000
    stats enable
    stats uri /stats
    stats realm HAProxy\ Statistics
    stats auth admin:admin123
bash
# Create the error-page directory
sudo mkdir -p /etc/haproxy/errors
# Create a sample error page
sudo tee /etc/haproxy/errors/503.http << 'EOF'
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html
<html>
<body>
<h1>503 Service Unavailable</h1>
No server is available to handle your request.
</body>
</html>
EOF
# Start HAProxy
sudo systemctl restart haproxy
sudo systemctl enable haproxy
# Check the status
sudo systemctl status haproxy
ss -tlnp | grep haproxy
# Test access
curl http://lb-node/haproxy?stats
# or open http://lb-node:9000/stats in a browser (admin/admin123)
3.2 Installing and Configuring Nginx (the Backend Web Servers)
bash
# Install Nginx on master1 and master2
sudo apt install -y nginx
# Give each node a distinct home page so you can tell them apart
# On master1:
sudo tee /var/www/html/index.html << 'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Web Server - Master1</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; background: #f0f0f0; }
.container { background: white; padding: 30px; border-radius: 10px; }
.server { color: green; font-weight: bold; }
.ip { color: blue; }
</style>
</head>
<body>
<div class="container">
<h1>Welcome to High Availability Cluster</h1>
<h2>This is served from: <span class="server">Master1</span></h2>
<p>Server IP: <span class="ip">192.168.1.101</span></p>
<p>Time: <span id="time">$(date)</span></p>
<hr>
<p>This is part of a highly available web cluster.</p>
<p>If this server fails, requests will be automatically routed to Master2.</p>
</div>
<script>document.getElementById('time').textContent = new Date().toString();</script>
</body>
</html>
EOF
# On master2:
sudo tee /var/www/html/index.html << 'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Web Server - Master2</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; background: #f0f0f0; }
.container { background: white; padding: 30px; border-radius: 10px; }
.server { color: red; font-weight: bold; }
.ip { color: blue; }
</style>
</head>
<body>
<div class="container">
<h1>Welcome to High Availability Cluster</h1>
<h2>This is served from: <span class="server">Master2</span></h2>
<p>Server IP: <span class="ip">192.168.1.102</span></p>
<p>Time: <span id="time">$(date)</span></p>
<hr>
<p>This is part of a highly available web cluster.</p>
<p>If Master1 fails, I will handle all requests.</p>
</div>
<script>document.getElementById('time').textContent = new Date().toString();</script>
</body>
</html>
EOF
# Start Nginx
sudo systemctl restart nginx
sudo systemctl enable nginx
# Test the backends directly
curl http://master1
curl http://master2
3.3 Testing Load Balancing
bash
# Test load balancing from lb-node
# Send several requests and watch the round-robin rotation
for i in {1..10}; do
    echo "Request $i: $(curl -s http://lb-node/ | grep -o 'Master[12]')"
    sleep 1
done
# View HAProxy statistics
curl -s http://lb-node:9000/stats -u admin:admin123 | grep -E "(server|Session)"
Part 4: High-Availability Cluster
4.1 Installing and Configuring Keepalived
Keepalived provides a virtual IP (VIP) and health checks, which together implement high availability.
bash
# Install Keepalived on master1 and master2
sudo apt install -y keepalived
# Configure master1 (the primary node)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
! Configuration File for keepalived
global_defs {
    router_id LVS_MASTER        # use LVS_BACKUP on the backup node
}
vrrp_script chk_nginx {
    script "/usr/bin/killall -0 nginx"   # check that the nginx process exists
    interval 2                  # check every 2 seconds
    weight -20                  # subtract 20 from the priority when the check fails,
                                # so the backup node (priority 90) can take over the VIP
    fall 2                      # 2 consecutive failures mark the check as failed
    rise 1                      # 1 success marks it as recovered
}
vrrp_instance VI_1 {
    state MASTER                # primary node; use BACKUP on the backup node
    interface eth0              # adjust to your actual network interface
    virtual_router_id 51        # virtual router ID; must match across the cluster
    priority 100                # priority: 100 on the primary, 90 on the backup
    advert_int 1                # advertisement interval of 1 second
    authentication {
        auth_type PASS
        auth_pass 1111          # authentication password; must match across the cluster
    }
    virtual_ipaddress {
        192.168.1.99/24         # the virtual IP (VIP)
    }
    track_script {
        chk_nginx               # attach the health-check script
    }
    notify_master "/etc/keepalived/notify.sh master"
    notify_backup "/etc/keepalived/notify.sh backup"
    notify_fault "/etc/keepalived/notify.sh fault"
}
EOF
# Configure master2 (the backup node)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
! Configuration File for keepalived
global_defs {
    router_id LVS_BACKUP
}
vrrp_script chk_nginx {
    script "/usr/bin/killall -0 nginx"
    interval 2
    weight -20
    fall 2
    rise 1
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.99/24
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/notify.sh master"
    notify_backup "/etc/keepalived/notify.sh backup"
    notify_fault "/etc/keepalived/notify.sh fault"
}
EOF
4.2 Creating the Notification Script
bash
# Create the notification script on master1 and master2
sudo tee /etc/keepalived/notify.sh << 'EOF'
#!/bin/bash
# Keepalived state-change notification script
TYPE=$1
NAME=$2
STATE=$3
log_file="/var/log/keepalived-notify.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - [$(hostname)] Type: $TYPE, Name: $NAME, State: $STATE" >> $log_file
case $STATE in
    "MASTER")
        echo "I am now MASTER" >> $log_file
        # Add commands to run after becoming MASTER here,
        # e.g. starting specific services or sending notifications
        ;;
    "BACKUP")
        echo "I am now BACKUP" >> $log_file
        # Add commands to run after becoming BACKUP here
        ;;
    "FAULT")
        echo "I am now in FAULT state" >> $log_file
        # Add commands to run in the FAULT state here
        ;;
    *)
        echo "Unknown state: $STATE" >> $log_file
        exit 1
        ;;
esac
EOF
sudo chmod +x /etc/keepalived/notify.sh
4.3 Configuring Nginx to Listen on the VIP
bash
# Adjust the Nginx site configuration so it listens on all addresses (including the VIP)
sudo tee /etc/nginx/sites-available/default << 'EOF'
server {
    listen 80 default_server;
    listen [::]:80 default_server;
    root /var/www/html;
    index index.html index.htm;
    server_name _;
    location / {
        try_files $uri $uri/ =404;
    }
}
EOF
# Restart Nginx
sudo systemctl restart nginx
4.4 Starting and Testing Keepalived
bash
# Start Keepalived on master1 and master2
sudo systemctl restart keepalived
sudo systemctl enable keepalived
# Check the status
sudo systemctl status keepalived
# View the virtual IP
ip addr show eth0
# master1 should now hold 192.168.1.99 (the VIP)
# Test access through the VIP
curl http://192.168.1.99
# Simulate a failover
# 1. Stop Nginx on master1
sudo systemctl stop nginx
# 2. Watch the VIP move to master2
# On master1, check that the VIP is gone
ip addr show eth0
# On master2, check that the VIP has appeared
ssh master2 "ip addr show eth0"
# 3. Confirm the service is still reachable through the VIP
curl http://192.168.1.99
# 4. Bring Nginx on master1 back up
sudo systemctl start nginx
# 5. Watch whether the VIP fails back to master1 (it should, since master1 has the higher priority)
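While running steps 1-5, it helps to watch the switchover from a separate client machine. The loop below is a minimal sketch (it assumes the VIP 192.168.1.99 and the Master1/Master2 demo pages created earlier); stop it with Ctrl+C:
bash
# Poll the VIP once per second and print which backend answered.
while true; do
    node=$(curl -s --max-time 2 http://192.168.1.99/ | grep -o 'Master[12]' | head -n1)
    echo "$(date '+%H:%M:%S') served by: ${node:-no response}"
    sleep 1
done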
Part 5: Data Synchronization and Shared Storage
5.1 Configuring NFS Shared Storage
bash
# Install the NFS server on master1
sudo apt install -y nfs-kernel-server
# Create the shared directory
sudo mkdir -p /shared/webdata
sudo chown nobody:nogroup /shared/webdata
sudo chmod 777 /shared/webdata
# Configure the NFS export
sudo tee /etc/exports << 'EOF'
/shared/webdata 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
EOF
# Apply the configuration
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
sudo systemctl enable nfs-kernel-server
# Install the NFS client on master2 and mount the share
sudo apt install -y nfs-common
sudo mkdir -p /mnt/nfs/webdata
# Test the mount
sudo mount 192.168.1.101:/shared/webdata /mnt/nfs/webdata
# Make it permanent (append to /etc/fstab)
echo "192.168.1.101:/shared/webdata /mnt/nfs/webdata nfs defaults 0 0" | sudo tee -a /etc/fstab
# Point Nginx at the shared storage
# On master2 (which mounts the export at /mnt/nfs/webdata):
sudo mv /var/www/html /var/www/html.local
sudo ln -s /mnt/nfs/webdata /var/www/html
# On master1, link to the exported directory itself (or mount the export locally as well):
# sudo mv /var/www/html /var/www/html.local && sudo ln -s /shared/webdata /var/www/html
# Create web content on the shared storage (on master1)
echo "<h1>This is served from shared storage</h1>" | sudo tee /shared/webdata/index.html
5.2 Synchronizing Data with rsync
bash
# Create the sync script
sudo tee /usr/local/bin/sync_webdata.sh << 'EOF'
#!/bin/bash
# Web data synchronization script
SOURCE_DIR="/var/www/html.local"
TARGET_DIR="/mnt/nfs/webdata"
LOG_FILE="/var/log/webdata_sync.log"
LOCK_FILE="/tmp/webdata_sync.lock"
# Prevent more than one instance from running at a time
if [ -f "$LOCK_FILE" ]; then
    echo "$(date) - Sync already running" >> "$LOG_FILE"
    exit 1
fi
touch "$LOCK_FILE"
echo "$(date) - Starting sync" >> "$LOG_FILE"
# Sync the data with rsync
rsync -avz --delete \
    --exclude='*.tmp' \
    --exclude='.git/' \
    "$SOURCE_DIR/" "$TARGET_DIR/" >> "$LOG_FILE" 2>&1
SYNC_STATUS=$?
if [ $SYNC_STATUS -eq 0 ]; then
    echo "$(date) - Sync completed successfully" >> "$LOG_FILE"
else
    echo "$(date) - Sync failed with status $SYNC_STATUS" >> "$LOG_FILE"
fi
rm -f "$LOCK_FILE"
exit $SYNC_STATUS
EOF
sudo chmod +x /usr/local/bin/sync_webdata.sh
# Schedule the sync to run every 5 minutes
echo "*/5 * * * * root /usr/local/bin/sync_webdata.sh" | sudo tee /etc/cron.d/webdata_sync
5.3 Configuring DRBD (Distributed Replicated Block Device)
DRBD provides block-level data replication between nodes.
bash
# Install DRBD on master1 and master2
sudo apt install -y drbd-utils
# Load the DRBD kernel module
sudo modprobe drbd
# Create the DRBD resource configuration
sudo tee /etc/drbd.d/webdata.res << 'EOF'
resource webdata {
    protocol C;
    disk {
        on-io-error detach;
    }
    on master1 {
        device /dev/drbd0;
        disk /dev/sdb1;          # assumes a spare disk partition exists
        address 192.168.1.101:7788;
        meta-disk internal;
    }
    on master2 {
        device /dev/drbd0;
        disk /dev/sdb1;
        address 192.168.1.102:7788;
        meta-disk internal;
    }
}
EOF
# Initialize DRBD
# Run on both machines:
sudo drbdadm create-md webdata
# Start the DRBD service
sudo systemctl start drbd
sudo systemctl enable drbd
# Promote master1 to primary
sudo drbdadm primary --force webdata
# Check DRBD status
sudo drbdadm status webdata
cat /proc/drbd
# Create a filesystem (on the primary only)
sudo mkfs.ext4 /dev/drbd0
# Mount it
sudo mkdir -p /mnt/drbd
sudo mount /dev/drbd0 /mnt/drbd
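In this single-primary setup only one node may have the DRBD device mounted at a time, so a manual switchover looks roughly like the sketch below (resource name and mount point as configured above):
bash
# On master1 (current primary): release the device and demote the resource.
sudo umount /mnt/drbd
sudo drbdadm secondary webdata
# On master2: promote the resource and mount the replicated device.
sudo drbdadm primary webdata
sudo mkdir -p /mnt/drbd
sudo mount /dev/drbd0 /mnt/drbd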
Part 6: Cluster Management and Monitoring
6.1 Cluster Status Monitoring Script
bash
sudo tee /usr/local/bin/cluster_status.sh << 'EOF'
#!/bin/bash
# Cluster status monitoring script
echo "=== Cluster Status Report ==="
echo "Time: $(date '+%Y-%m-%d %H:%M:%S')"
echo
# 1. Check the VIP
echo "1. Virtual IP status:"
VIP="192.168.1.99"
if ip addr show | grep -q "$VIP"; then
    echo "  ✓ VIP $VIP is on this node"
else
    echo "  ✗ VIP $VIP is not on this node"
fi
echo
# 2. Check Keepalived
echo "2. Keepalived status:"
if systemctl is-active keepalived >/dev/null; then
    echo "  ✓ Keepalived is running"
    # Infer the VRRP role from VIP ownership
    if ip addr show | grep -q "$VIP"; then
        echo "  Role: MASTER"
    else
        echo "  Role: BACKUP"
    fi
else
    echo "  ✗ Keepalived is not running"
fi
echo
# 3. Check Nginx
echo "3. Nginx status:"
if systemctl is-active nginx >/dev/null; then
    echo "  ✓ Nginx is running"
    # Check whether Nginx responds correctly
    if curl -s -o /dev/null -w "%{http_code}" http://localhost/ | grep -q "200"; then
        echo "  Service response: OK"
    else
        echo "  Service response: abnormal"
    fi
else
    echo "  ✗ Nginx is not running"
fi
echo
# 4. Check HAProxy (on the load-balancer node)
echo "4. HAProxy status:"
if systemctl is-active haproxy >/dev/null 2>&1; then
    echo "  ✓ HAProxy is running"
    # Check the backend server status via the CSV stats export
    echo "  Backend servers:"
    if command -v haproxy >/dev/null; then
        echo "    master1: $(curl -s 'http://localhost:9000/stats;csv' -u admin:admin123 | grep '^http_back,master1' | awk -F, '{print $18}')"
        echo "    master2: $(curl -s 'http://localhost:9000/stats;csv' -u admin:admin123 | grep '^http_back,master2' | awk -F, '{print $18}')"
    fi
else
    echo "  ✗ HAProxy is not running (or not installed)"
fi
echo
# 5. Check NFS mounts
echo "5. NFS mount status:"
if mount | grep -q nfs; then
    echo "  ✓ NFS is mounted"
    mount | grep nfs
else
    echo "  ✗ NFS is not mounted"
fi
echo
# 6. Check connectivity between nodes
echo "6. Node connectivity:"
for node in master1 master2 lb-node; do
    if ping -c 1 -W 1 $node >/dev/null 2>&1; then
        echo "  ✓ $node: reachable"
    else
        echo "  ✗ $node: unreachable"
    fi
done
echo
echo "=== End of report ==="
EOF
sudo chmod +x /usr/local/bin/cluster_status.sh
6.2 Automated Fault Detection and Recovery
bash
sudo tee /usr/local/bin/cluster_health_check.sh << 'EOF'
#!/bin/bash
# Cluster health check and automatic recovery script
LOG_FILE="/var/log/cluster_health.log"
VIP="192.168.1.99"
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}
# Check Nginx and restart it if needed
check_nginx() {
    if ! systemctl is-active nginx >/dev/null; then
        log "Nginx is not running, attempting restart..."
        systemctl restart nginx
        if systemctl is-active nginx >/dev/null; then
            log "Nginx restarted successfully"
            return 0
        else
            log "Nginx restart failed"
            return 1
        fi
    fi
    return 0
}
# Check Keepalived and restart it if needed
check_keepalived() {
    if ! systemctl is-active keepalived >/dev/null; then
        log "Keepalived is not running, attempting restart..."
        systemctl restart keepalived
        if systemctl is-active keepalived >/dev/null; then
            log "Keepalived restarted successfully"
            return 0
        else
            log "Keepalived restart failed"
            return 1
        fi
    fi
    return 0
}
# Check the VIP
check_vip() {
    local node_type="$1"
    if [ "$node_type" = "master" ]; then
        if ! ip addr show | grep -q "$VIP"; then
            log "VIP is not on the primary node, trying to reclaim it..."
            # Restart Keepalived so the VRRP election runs again
            systemctl stop keepalived
            sleep 2
            systemctl start keepalived
            sleep 5
            if ip addr show | grep -q "$VIP"; then
                log "VIP recovered"
                return 0
            else
                log "VIP recovery failed"
                return 1
            fi
        fi
    fi
    return 0
}
# Main function
main() {
    local node_type="$1"
    log "Starting cluster health check - node type: $node_type"
    # Run the checks
    check_nginx
    check_keepalived
    if [ "$node_type" = "master" ]; then
        check_vip "master"
    fi
    log "Health check finished"
}
# Run the main function
if [ -n "$1" ]; then
    main "$1"
else
    echo "Usage: $0 [master|backup]"
    exit 1
fi
EOF
sudo chmod +x /usr/local/bin/cluster_health_check.sh
# Schedule the check to run every minute; install only the file that matches the node's role
# On the primary node (master1):
echo "* * * * * root /usr/local/bin/cluster_health_check.sh master" | sudo tee /etc/cron.d/cluster_health_master
# On the backup node (master2):
echo "* * * * * root /usr/local/bin/cluster_health_check.sh backup" | sudo tee /etc/cron.d/cluster_health_backup
6.3 Monitoring the Cluster with Prometheus
bash
# Install Prometheus on the monitoring node
sudo apt install -y prometheus
# Configure Prometheus to scrape all cluster nodes
sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets:
          - 'master1:9100'
          - 'master2:9100'
          - 'lb-node:9100'
        labels:
          group: 'ha-cluster'
  - job_name: 'haproxy'
    static_configs:
      - targets: ['lb-node:9101']
  - job_name: 'nginx'
    static_configs:
      - targets:
          - 'master1:9113'
          - 'master2:9113'
EOF
# Install Node Exporter on every node
sudo apt install -y prometheus-node-exporter
# Install the HAProxy Exporter on the load-balancer node
wget https://github.com/prometheus/haproxy_exporter/releases/download/v0.13.0/haproxy_exporter-0.13.0.linux-amd64.tar.gz
tar xzf haproxy_exporter-*.tar.gz
sudo mv haproxy_exporter-*/haproxy_exporter /usr/local/bin/
# Create a systemd service for the exporter
sudo tee /etc/systemd/system/haproxy-exporter.service << 'EOF'
[Unit]
Description=HAProxy Exporter
After=network.target
[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/haproxy_exporter \
--haproxy.scrape-uri="http://admin:admin123@localhost:9000/stats;csv" \
--web.listen-address=":9101"
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl start haproxy-exporter
sudo systemctl enable haproxy-exporter
Part 7: Advanced Cluster Configuration
7.1 Configuring Pacemaker and Corosync (Enterprise-Grade HA)
bash
# Install Pacemaker, Corosync, and pcs on master1 and master2
sudo apt install -y pacemaker corosync pcs
# Set the hacluster user's password (use the same password on both nodes)
sudo passwd hacluster
# Configure the cluster from master1
# Note: pcs 0.10+ (as shipped with Ubuntu 22.04) uses "pcs host auth" and a positional
# cluster name; older pcs 0.9 used "pcs cluster auth ... --force" and "--name".
sudo pcs host auth master1 master2 -u hacluster -p <hacluster-password>
sudo pcs cluster setup ha_cluster master1 master2
sudo pcs cluster start --all
sudo pcs cluster enable --all
# Configure cluster properties
sudo pcs property set stonith-enabled=false # disable STONITH only in a test environment
sudo pcs property set no-quorum-policy=ignore
sudo pcs property set cluster-recheck-interval=2min
# Create the virtual IP resource
sudo pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
  ip=192.168.1.99 \
  cidr_netmask=24 \
  op monitor interval=30s
# Create the Nginx resource
sudo pcs resource create WebServer systemd:nginx \
  op monitor interval=20s \
  op start timeout=40s \
  op stop timeout=60s
# Create a resource group (resources start in order)
sudo pcs resource group add WebGroup ClusterIP WebServer
# Check the cluster status
sudo pcs status
sudo pcs cluster status
sudo crm_mon -1 # one-shot status snapshot (omit -1 to watch continuously)
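An optional failover drill with Pacemaker, as a sketch (the resource and node names match the setup above). Putting a node into standby forces its resources to move elsewhere:
bash
sudo pcs node standby master1     # WebGroup (VIP + Nginx) should move to master2
sudo pcs status                   # confirm where the resources are now running
sudo pcs node unstandby master1   # make master1 eligible to run resources again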
7.2 Configuring a Load-Balancer Cluster (HAProxy High Availability)
bash
# Configure HAProxy high availability on lb-node plus a second load balancer
# Assume a second node, lb-backup (192.168.1.103)
# Install HAProxy and Keepalived on both load balancers
sudo apt install -y haproxy keepalived
# Configure Keepalived (as in Part 4, but for the load balancers)
# On lb-node (primary):
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 52
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 2222
    }
    virtual_ipaddress {
        192.168.1.98/24 # VIP of the load-balancer pair
    }
}
# The virtual_server section below uses keepalived's built-in LVS balancing. It must be
# a top-level block (not nested inside vrrp_instance); if HAProxy already load-balances
# port 80 on the VIP, you can omit it entirely.
virtual_server 192.168.1.98 80 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP
    real_server 192.168.1.101 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
        }
    }
    real_server 192.168.1.102 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
        }
    }
}
EOF
# On lb-backup, use the same configuration but with state BACKUP and a lower priority
# Start the services
sudo systemctl restart keepalived
sudo systemctl enable keepalived
Part 8: Security Hardening
8.1 Cluster Network Security
bash
# Configure firewall rules
# Install UFW on all nodes
sudo apt install -y ufw
# On master1 and master2:
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 192.168.1.0/24 to any port 22 # SSH
sudo ufw allow from 192.168.1.0/24 to any port 80 # HTTP
sudo ufw allow from 192.168.1.0/24 to any port 443 # HTTPS
sudo ufw allow from 192.168.1.0/24 to any port 3306 # MySQL (if used)
# Note: Keepalived's VRRP advertisements (IP protocol 112) and the NFS ports must also be
# allowed between the cluster nodes; VRRP is not covered by ufw's simple "allow port"
# syntax, so add a rule for it in /etc/ufw/before.rules if keepalived runs on this node.
sudo ufw enable
# On the load balancer:
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 9000/tcp # HAProxy statistics
sudo ufw enable
# Harden SSH
sudo vim /etc/ssh/sshd_config
# Adjust the following settings:
# Port 2222               (if you change the port, also update the UFW rule above)
# PermitRootLogin no
# PasswordAuthentication no
# AllowUsers clusteradmin
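If you prefer to apply those SSH settings non-interactively, here is a minimal sketch (it assumes the clusteradmin user already exists with a working SSH key; keep an existing session open until you have confirmed you can still log in):
bash
# Apply the hardening settings in place, validate the config, then restart sshd.
sudo sed -i \
    -e 's/^#\?PermitRootLogin.*/PermitRootLogin no/' \
    -e 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' \
    /etc/ssh/sshd_config
echo "AllowUsers clusteradmin" | sudo tee -a /etc/ssh/sshd_config
sudo sshd -t && sudo systemctl restart ssh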
8.2 Cluster Authentication and Encryption
bash
# Configure Corosync authentication
sudo corosync-keygen
# Copy the generated authkey to every node that is part of the Corosync cluster
sudo scp /etc/corosync/authkey master2:/etc/corosync/
sudo scp /etc/corosync/authkey lb-node:/etc/corosync/ # only if lb-node joins the cluster
# Set the correct permissions
sudo chmod 400 /etc/corosync/authkey
sudo chown root:root /etc/corosync/authkey
# Configure DRBD peer authentication (a shared-secret challenge, not payload encryption)
# Add a net section *inside* the "resource webdata { ... }" block of /etc/drbd.d/webdata.res
# on both nodes (appending it after the closing brace would be invalid):
#     net {
#         cram-hmac-alg sha1;
#         shared-secret "MySecretKey123";
#     }
# Then apply the change on both nodes:
sudo drbdadm adjust webdata
Part 9: Failure Drills and Testing
9.1 Failure Testing Script
bash
sudo tee /usr/local/bin/cluster_failure_test.sh << 'EOF'
#!/bin/bash
# Cluster failure testing script
echo "=== Cluster Failure Test ==="
echo "Warning: this simulates failures and may disrupt service!"
echo
read -p "Choose a test:
1) Stop the Nginx service
2) Stop the Keepalived service
3) Simulate a network partition
4) Simulate a node outage
5) Restore all services
Enter a number: " test_type
case $test_type in
    1)
        echo "Test: stopping the Nginx service"
        sudo systemctl stop nginx
        echo "Nginx stopped, watch the failover..."
        ;;
    2)
        echo "Test: stopping the Keepalived service"
        sudo systemctl stop keepalived
        echo "Keepalived stopped, watch the VIP move..."
        ;;
    3)
        echo "Test: simulating a network partition"
        # Temporarily block traffic between the cluster nodes
        for node in master1 master2; do
            if [ "$(hostname)" != "$node" ]; then
                sudo iptables -A INPUT -s $(getent hosts $node | awk '{print $1}') -j DROP
                echo "Blocked traffic from $node"
            fi
        done
        ;;
    4)
        echo "Test: simulating a node outage"
        read -p "Node to take down (master1/master2): " failed_node
        if [ "$failed_node" = "master1" ] || [ "$failed_node" = "master2" ]; then
            echo "Simulating an outage on $failed_node..."
            ssh $failed_node "sudo systemctl stop nginx keepalived"
        else
            echo "Invalid node"
        fi
        ;;
    5)
        echo "Restoring all services"
        sudo systemctl start nginx keepalived
        sudo iptables -F # flush the DROP rules added above (this also clears UFW's rules; run "sudo ufw reload" afterwards if UFW is enabled)
        echo "Services restored"
        ;;
    *)
        echo "Invalid choice"
        ;;
esac
echo
echo "Test finished, checking cluster status:"
/usr/local/bin/cluster_status.sh
EOF
sudo chmod +x /usr/local/bin/cluster_failure_test.sh
9.2 Performance and Stress Testing
bash
# Install stress-testing tools
sudo apt install -y apache2-utils siege
# Stress test with ab
ab -n 10000 -c 100 http://192.168.1.99/
# Longer run with siege
siege -c 100 -t 60S http://192.168.1.99/
# Watch performance metrics while the test runs
watch -n 1 '
echo "Load: $(uptime)"
echo "Connections: $(ss -t | wc -l)"
echo "Memory: $(free -h | grep Mem)"
echo "HAProxy backend session rate: $(curl -s "http://lb-node:9000/stats;csv" -u admin:admin123 | grep "http_back,BACKEND" | awk -F, "{print \$34}")"
'
Part 10: Production Best Practices
10.1 Cluster Deployment Checklist
bash
#!/bin/bash
# cluster_deployment_checklist.sh
echo "=== Cluster Deployment Checklist ==="
echo
check_item() {
    local item="$1"
    local check_cmd="$2"
    local expected="$3"
    if eval "$check_cmd" | grep -q "$expected"; then
        echo "✓ $item"
        return 0
    else
        echo "✗ $item"
        return 1
    fi
}
echo "1. Network checks:"
check_item "Node-to-node connectivity" "ping -c 1 master2" "1 received"
check_item "Name resolution works" "getent hosts master1" "192.168.1.101"
check_item "LAN scan (requires arp-scan; inspect output for duplicate IPs)" "sudo arp-scan --localnet" "192.168.1."
echo
echo "2. Service checks:"
check_item "Passwordless SSH" "ssh -o BatchMode=yes master2 hostname" "master2"
check_item "Time synchronized" "chronyc sources" "^\^\*"
check_item "Firewall enabled" "sudo ufw status" "active"
echo
echo "3. High-availability checks:"
check_item "VIP present" "ip addr show" "192.168.1.99"
check_item "Keepalived running" "systemctl is-active keepalived" "active"
check_item "Nginx running" "systemctl is-active nginx" "active"
echo
echo "4. Load-balancing checks:"
check_item "HAProxy running" "systemctl is-active haproxy" "active"
check_item "Backends healthy" "curl -s 'http://localhost:9000/stats;csv' -u admin:admin123" "UP"
echo
echo "5. Data synchronization checks:"
check_item "NFS mounted" "mount | grep nfs" "nfs"
check_item "Shared directory writable" "touch /mnt/nfs/webdata/.writetest && echo writable && rm /mnt/nfs/webdata/.writetest" "writable"
echo
echo "=== Checklist complete ==="
10.2 Backup and Recovery Strategy
bash
# Create a backup script for the cluster configuration
sudo tee /usr/local/bin/cluster_backup.sh << 'EOF'
#!/bin/bash
# Cluster configuration backup script
BACKUP_DIR="/backup/cluster_config"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/cluster_backup_$DATE.tar.gz"
mkdir -p "$BACKUP_DIR"
echo "Backing up cluster configuration..."
echo "Backup file: $BACKUP_FILE"
# Back up the important configuration files
# (--ignore-failed-read keeps tar from aborting on files it cannot read;
#  some of these paths may not exist on every node)
tar -czf "$BACKUP_FILE" --ignore-failed-read \
    /etc/haproxy \
    /etc/keepalived \
    /etc/nginx \
    /etc/drbd.d \
    /etc/corosync \
    /etc/pcs \
    /etc/exports \
    /etc/fstab \
    /etc/hosts \
    /root/.ssh/id_rsa.pub \
    /usr/local/bin/cluster_*.sh 2>/dev/null
if [ $? -eq 0 ]; then
    echo "Backup succeeded"
    echo "File size: $(du -h "$BACKUP_FILE" | cut -f1)"
    # Keep only the last 7 days of backups
    find "$BACKUP_DIR" -name "cluster_backup_*.tar.gz" -mtime +7 -delete
    echo "Removed backups older than 7 days"
else
    echo "Backup failed"
    exit 1
fi
echo "Backup finished"
EOF
sudo chmod +x /usr/local/bin/cluster_backup.sh
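The script above only creates backups; a matching restore is sketched below (the backup file name is only an example). Extract into a staging directory and copy files back selectively rather than overwriting /etc wholesale:
bash
# Restore a chosen backup into a staging area and review it before copying
# individual files back into place.
BACKUP_FILE="/backup/cluster_config/cluster_backup_20260101_120000.tar.gz" # example name
sudo mkdir -p /tmp/cluster_restore
sudo tar -xzf "$BACKUP_FILE" -C /tmp/cluster_restore
ls /tmp/cluster_restore/etc
# Example: restore only the HAProxy configuration, then reload the service
sudo cp -a /tmp/cluster_restore/etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg
sudo systemctl reload haproxy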
Practice Projects
Project 1: Build a Complete Highly Available Web Cluster
Set up the cluster with three virtual machines
Configure Nginx as the web server
Configure Keepalived for high availability
Configure HAProxy for load balancing
Configure NFS shared storage
Deploy monitoring and alerting
Project 2: A Highly Available Database Cluster
Configure MySQL primary/replica replication
Use HAProxy for read/write splitting
Configure MHA (MySQL high availability) for automatic failover
Set up a backup and recovery strategy
Project 3: A Containerized Cluster (Kubernetes Basics)
Install and configure a Kubernetes cluster
Deploy a highly available application
Configure automatic scaling
Set up service discovery and load balancing
Project 4: A Hybrid-Cloud High-Availability Architecture
Deploy the cluster across a cloud platform and an on-premises data center
Configure cross-region data synchronization
Implement disaster recovery and business continuity
Optimize cross-region network latency
Today's Summary
Today we covered Linux clustering and high availability:
Cluster basics: concepts, types, architecture
Load balancing: HAProxy configuration and tuning
High availability: failover with Keepalived
Data synchronization: NFS shared storage and DRBD
Cluster management: status monitoring and health checks
Enterprise-grade solutions: Pacemaker and Corosync
Security hardening: network security, authentication and encryption
Failure drills: testing and recovery procedures
Best practices: production deployment and backups
Key principles:
Design for redundancy: eliminate single points of failure
Automate recovery: minimize manual intervention
Monitor continuously: detect and handle problems early
Test regularly: verify the failover mechanisms
Document everything: record the architecture and operating procedures
Clustering is the foundation for building reliable, scalable services. As your workloads grow, these skills only become more important.
Any questions? Once you have finished the practice projects, we can continue with Lesson 17: Linux automation and configuration management tools (Ansible, Puppet, Chef).