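At its core, an ELK failure is a broken data flow: when any stage stalls, everything downstream runs dry. If Filebeat stops shipping, Logstash receives nothing, Elasticsearch gets no new documents, and Kibana shows a blank page. Diagnose in reverse: start from the missing logs in Kibana and trace back to whether Filebeat is sending at all. EFK replaces Logstash with Fluent Bit/Fluentd, and the same logic applies.
Start checking at the stage where the flow breaks. In production the usual trouble spots are between Filebeat and Logstash (buffering) and between Logstash and Elasticsearch (backpressure).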
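The reverse trace can be sketched as a small function. This is only an illustration: the stage names and the `healthy` dictionary are hypothetical, and in practice each boolean would come from a manual check of that stage.

```python
from typing import Dict, Optional

# Stages ordered from the user-facing end back to the log source.
STAGES = ["kibana", "elasticsearch", "logstash", "filebeat"]

def locate_fault(healthy: Dict[str, bool]) -> Optional[str]:
    """Walk upstream from Kibana and return the most upstream failing stage.

    A stage counts as "healthy" when it is receiving or emitting fresh data.
    The walk stops at the first healthy stage, because everything upstream
    of it must be delivering.
    """
    culprit = None
    for stage in STAGES:
        if healthy[stage]:
            break
        culprit = stage
    return culprit

# Kibana is empty and Elasticsearch has no new documents, but Logstash is
# receiving events: the break sits between Logstash and Elasticsearch.
print(locate_fault({"kibana": False, "elasticsearch": False,
                    "logstash": True, "filebeat": True}))
```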
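# 1. Check the service and its latest logs
systemctl status filebeat
journalctl -u filebeat -f

# 2. Check Filebeat's own statistics (built-in HTTP API)
curl 'http://localhost:5066/stats?pretty'
# Focus on harvester (files being read) and output.events.acked (events delivered)

# 3. Inspect the registry, which records how far each file has been read
jq . /var/lib/filebeat/registry/filebeat/data.json

# 4. Test the configuration and the output connection
filebeat test config
filebeat test output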
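| Cause | How to identify | Fix |
|---|---|---|
| Output unreachable | `filebeat test output` fails | Check network connectivity to Logstash/ES |
| Path mismatch | Harvester count is 0 | Check the `paths` setting and the file names |
| Insufficient permissions | Error log reports a permission error | Grant the filebeat user read access (chmod) |
| File already read | Registry shows the offset at the end of the file | Normal; wait for new writes |
| File truncated | File was truncated but the offset was not reset | Stop Filebeat, delete the registry, and re-read (expect duplicates) |
| Backpressure | Output queue is full | Downstream cannot keep up; fix the consumer side |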
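The table can be applied mechanically to the JSON returned by the stats API. The sketch below assumes the counters sit at `filebeat.harvester.running` and `libbeat.output.events.{acked,failed,active}`; the exact layout can differ between Filebeat versions, so treat the key paths and the returned messages as illustrative.

```python
def triage_filebeat_stats(stats: dict) -> str:
    """Map a parsed /stats payload onto the most likely cause from the table."""
    harvesters = stats.get("filebeat", {}).get("harvester", {}).get("running", 0)
    events = stats.get("libbeat", {}).get("output", {}).get("events", {})
    acked = events.get("acked", 0)
    failed = events.get("failed", 0)
    active = events.get("active", 0)

    if harvesters == 0:
        return "no harvesters: check paths and file permissions"
    if acked == 0 and (failed > 0 or active > 0):
        return "output blocked: run 'filebeat test output' and check downstream"
    if acked == 0:
        return "idle: files are open but nothing new has been written"
    return "ok"

# Hypothetical payload: files are being read, but nothing is acknowledged.
sample = {
    "filebeat": {"harvester": {"running": 3}},
    "libbeat": {"output": {"events": {"acked": 0, "failed": 12, "active": 2048}}},
}
print(triage_filebeat_stats(sample))
```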
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/myapp/*.log
    fields:
      app: myapp
      env: prod
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'  # a new event starts with a date
    multiline.negate: true
    multiline.match: after

# Tune collection performance
filebeat.registry.flush: 5s
queue.mem:
  events: 4096
  flush.min_events: 512

output.logstash:
  hosts: ["logstash:5044"]
  worker: 2
  bulk_max_size: 2048
  loadbalance: true

# Enable the monitoring API
http.enabled: true
http.port: 5066
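The three multiline settings are easy to misread, so here is a rough Python imitation of what `negate: true` with `match: after` does: any line that does not match the pattern is appended to the preceding line that did. This is a simplified model for intuition only, not Filebeat's implementation (it ignores timeouts and line limits).

```python
import re
from typing import Iterable, List

def group_multiline(lines: Iterable[str],
                    pattern: str = r"^\d{4}-\d{2}-\d{2}") -> List[str]:
    """Merge continuation lines into the event that precedes them."""
    start = re.compile(pattern)
    events: List[str] = []
    for line in lines:
        if start.search(line) or not events:
            events.append(line)          # pattern matched: a new event begins
        else:
            events[-1] += "\n" + line    # no match: continuation of the last event
    return events

raw = [
    "2025-01-01 10:00:00 ERROR request failed",
    "Traceback (most recent call last):",
    '  File "app.py", line 3, in handler',
    "2025-01-01 10:00:01 INFO recovered",
]
for event in group_multiline(raw):
    print(repr(event))
```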
tags: ["_grokparsefailure"]

# 1. Kibana → Dev Tools → Grok Debugger (the most convenient option)
# 2. Test from the command line
/usr/share/logstash/bin/logstash -e '
  input { stdin {} }
  filter { grok { match => { "message" => "%{COMBINEDAPACHELOG}" } } }
  output { stdout { codec => rubydebug } }
'
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse the Nginx access log
  grok {
    match => {
      "message" => '%{IPORHOST:clientip} - - \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status:int} %{NUMBER:bytes:int}'
    }
  }

  # Parse the time field into @timestamp
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }

  # Mark events that failed grok parsing so they can be found later
  if "_grokparsefailure" in [tags] {
    mutate {
      add_field => { "parse_failed" => "true" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["es:9200"]
    index => "nginx-%{+YYYY.MM.dd}"
  }
}
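To test a log line without starting Logstash, the grok and date steps can be approximated with a plain regular expression. The regex below is a hand-written stand-in for the grok pattern, not an exact translation of `IPORHOST`, `HTTPDATE`, or `DATA`, and the sample line is made up.

```python
import re
from datetime import datetime

# Rough equivalent of the grok pattern in the pipeline above.
LINE_RE = re.compile(
    r'(?P<clientip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>.*?) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d+) (?P<bytes>\d+)'
)

def parse_access_line(line: str) -> dict:
    """Return the parsed fields, or tag the event the way grok does on failure."""
    m = LINE_RE.match(line)
    if m is None:
        return {"message": line, "tags": ["_grokparsefailure"]}
    event = m.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Same layout as the date filter's "dd/MMM/yyyy:HH:mm:ss Z"
    # (%b expects English month names, i.e. the default C locale).
    event["@timestamp"] = datetime.strptime(event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return event

ok = parse_access_line('192.168.1.10 - - [11/May/2026:10:15:30 +0800] "GET /index.html HTTP/1.1" 200 612')
bad = parse_access_line("not an access log line")
print(ok["status"], ok["@timestamp"].isoformat(), bad["tags"])
```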
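# /etc/logstash/logstash.yml
dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue

# Events that the Elasticsearch output cannot index (for example mapping conflicts)
# go to the DLQ, where they can be consumed and analyzed separately.
# Grok parse failures are only tagged; they do not enter the DLQ on their own.
# Read the DLQ in a pipeline:
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dead_letter_queue"
    commit_offsets => true
    pipeline_id => "main"
  }
}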
# Check cluster health
GET _cluster/health?pretty

# Key fields
# status: green / yellow / red
# number_of_nodes
# active_primary_shards
# unassigned_shards (greater than 0 when shards are unassigned)
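| Status | Meaning | Urgency |
|---|---|---|
| Green | All shards are healthy | ✓ Normal |
| Yellow | All primary shards are healthy, some replicas are unassigned | ⚠ Readable and writable, but redundancy is lost |
| Red | At least one primary shard is unassigned | 🔴 The affected index is unavailable; act immediately |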
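# Explain why a shard is unassigned
# (with no request body, Elasticsearch explains one unassigned shard of its choosing)
GET _cluster/allocation/explain

# Show the shards of a specific index
GET _cat/shards/myindex?v

# List all unassigned shards (grep needs a shell, so use curl instead of the Kibana console)
curl -s 'http://es:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED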
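| Reason | Meaning | Fix |
|---|---|---|
| INDEX_CREATED | Newly created index, allocation in progress | Wait |
| CLUSTER_RECOVERED | Cluster has just started | Wait for recovery |
| NODE_LEFT | Node went offline | Bring the node back or add capacity |
| ALLOCATION_FAILED | Allocation failed | Check the explain output for details |
| DISK_THRESHOLD | Disk watermark exceeded (surfaces as the disk_threshold decider in the explain output) | Free disk space / adjust thresholds |
| FAILED_DECIDER | Allocation rules not satisfied (see the deciders in the explain output) | Check awareness settings / shard limits |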
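When many shards are unassigned, it helps to count them per reason before picking a fix. A minimal sketch that parses the text of the `_cat/shards` listing with the columns `index,shard,prirep,state,unassigned.reason`; the sample output is invented for illustration.

```python
from collections import Counter

def unassigned_by_reason(cat_output: str) -> Counter:
    """Count UNASSIGNED shards per unassigned.reason in _cat/shards text output."""
    counts: Counter = Counter()
    for row in cat_output.strip().splitlines()[1:]:   # skip the header row
        cols = row.split()
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            # The reason column is empty for assigned shards.
            reason = cols[4] if len(cols) > 4 else "UNKNOWN"
            counts[reason] += 1
    return counts

sample_shards = """\
index            shard prirep state      unassigned.reason
nginx-2025.05.10 0     p      STARTED
nginx-2025.05.10 0     r      UNASSIGNED NODE_LEFT
nginx-2025.05.11 0     p      STARTED
nginx-2025.05.11 0     r      UNASSIGNED NODE_LEFT
myindex          0     p      UNASSIGNED ALLOCATION_FAILED
"""
print(unassigned_by_reason(sample_shards).most_common())
```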
# 1. Retry allocation (for transient failures)
POST _cluster/reroute?retry_failed=true

# 2. Temporarily raise the disk watermarks (when disk is tight)
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}

# 3. Reduce the replica count (when there are few nodes)
PUT myindex/_settings
{ "index.number_of_replicas": 0 }

# 4. Force-allocate an empty primary (last resort; data may be lost!)
POST _cluster/reroute
{
  "commands": [{
    "allocate_empty_primary": {
      "index": "myindex",
      "shard": 0,
      "node": "node-1",
      "accept_data_loss": true
    }
  }]
}
PUT */_settings
{ "index.blocks.read_only_allow_delete": null }
# Check the write thread pool
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected

# Check node resources
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,disk.used_percent

# Check hot threads (what each node is busy with right now)
GET _nodes/hot_threads
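The `rejected` column is a cumulative counter since node start, so a single reading says little; what matters is whether it is still growing. A sketch that compares two snapshots of the `_cat/thread_pool/write` listing (columns `node_name,active,queue,rejected`); the sample values are hypothetical.

```python
from typing import Dict

def rejected_delta(before: str, after: str) -> Dict[str, int]:
    """Return nodes whose rejected count grew between two _cat snapshots."""
    def parse(text: str) -> Dict[str, int]:
        rows = (r.split() for r in text.strip().splitlines()[1:])  # skip header
        return {cols[0]: int(cols[3]) for cols in rows if len(cols) == 4}

    old, new = parse(before), parse(after)
    deltas = {node: count - old.get(node, 0) for node, count in new.items()}
    return {node: d for node, d in deltas.items() if d > 0}

snap_1 = """\
node_name active queue rejected
node-1    2      0     10
node-2    1      0     0
"""
snap_2 = """\
node_name active queue rejected
node-1    8      200   35
node-2    1      0     0
"""
print(rejected_delta(snap_1, snap_2))
```

A node that shows up here is rejecting writes right now, which points at backpressure on that node rather than at an old incident.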
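# 1. Increase refresh_interval (per index)
#    refresh_interval: default 1s; lengthen it for write-heavy, read-light indices
#    number_of_replicas: turn replicas off during a bulk import
#    translog.durability async: asynchronous fsync (fast, but recent writes can be lost on a crash)
PUT myindex/_settings
{
  "index.refresh_interval": "30s",
  "index.number_of_replicas": 0,
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s"
}

# Restore replicas once the import is done
PUT myindex/_settings
{ "index.number_of_replicas": 1 }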
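# Logstash output
output {
  elasticsearch {
    hosts => ["es:9200"]
    pipeline => "my_pipeline"
    # Retry on failure (intervals in seconds)
    retry_initial_interval => 2
    retry_max_interval => 64
  }
}

# flush_size and workers are obsolete in current versions of the elasticsearch output plugin.
# Batch size and concurrency are set in logstash.yml instead:
#   pipeline.batch.size: 5000   (events per worker batch, which bounds each bulk request)
#   pipeline.workers: 4         (parallel pipeline workers)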
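# dynamic "false": turn off automatic mapping of new fields
# level: fields that are not analyzed should be keyword
# user_id: "index": false disables indexing for a field that is never queried
PUT myindex
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "message": { "type": "text", "index": true },
      "user_id": { "type": "keyword", "index": false }
    }
  }
}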
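| Watermark | Default | Effect when exceeded |
|---|---|---|
| low | 85% | No new shards are allocated to the node |
| high | 90% | Existing shards are relocated away from the node |
| flood_stage | 95% | Every index with a shard on the node becomes read-only (deletes still allowed)! |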
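For alerting, the table reduces to a simple threshold check. A minimal sketch using the default percentages from the table; the function name and return labels are made up for illustration, and clusters that override the watermarks would pass their own values.

```python
def watermark_stage(used_percent: float,
                    low: float = 85.0,
                    high: float = 90.0,
                    flood: float = 95.0) -> str:
    """Classify a node's disk usage against the three disk watermarks."""
    if used_percent > flood:
        return "flood_stage"   # indices on this node go read-only
    if used_percent > high:
        return "high"          # shards are relocated away
    if used_percent > low:
        return "low"           # no new shards are allocated here
    return "ok"

for pct in (50, 86, 91, 96):
    print(pct, watermark_stage(pct))
```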
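# Step 1. Check disk usage
GET _cat/allocation?v

# Step 2. If flood_stage was triggered, lift the read-only block first
PUT */_settings
{ "index.blocks.read_only_allow_delete": null }

# Step 3. Delete old indices to free space (this listing is sorted by size, largest first)
GET _cat/indices?v&s=store.size:desc
# Delete specific indices
# (wildcard deletes are rejected when action.destructive_requires_name is true)
DELETE nginx-2024.05.*

# Step 4. Force merge (reclaims the space held by deleted documents)
POST myindex/_forcemerge?only_expunge_deletes=true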
# Index Lifecycle Management: automatic rollover and deletion
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "shrink": { "number_of_shards": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
# Check the segments of each index
GET _cat/segments/myindex?v&h=index,shard,segment,size

# Index count and sizes
GET _cat/indices?v&s=docs.count:desc

# Per-node fielddata and segment memory
GET _cat/nodes?v&h=name,fielddata.memory_size,segments.count,segments.memory
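# Merge old indices that are no longer written to into 1 segment (saves memory)
POST nginx-2025.04.*/_forcemerge?max_num_segments=1

# Notes:
# 1. forcemerge is I/O intensive; run it only during off-peak hours
# 2. Run it only on read-only old indices, never on indices still being written
# 3. Large indices can take several hours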
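Picking which indices are safe to force merge is mostly a matter of parsing the date suffix. A sketch assuming the daily naming scheme `nginx-YYYY.MM.dd` used above; the function name, the seven-day cutoff, and the index list are illustrative.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List

def indices_older_than(names: Iterable[str], days: int, today: date,
                       prefix: str = "nginx-") -> List[str]:
    """Return daily indices whose date suffix is more than `days` days old."""
    cutoff = today - timedelta(days=days)
    old = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue                      # not a dated index, leave it alone
        if day < cutoff:
            old.append(name)
    return sorted(old)

names = ["nginx-2025.04.01", "nginx-2025.05.10", "nginx-current", "app-2025.04.01"]
print(indices_older_than(names, days=7, today=date(2025, 5, 11)))
```

Anything returned here is old enough that it should no longer receive writes, which matches note 2 above; indices with unparseable names are skipped rather than guessed at.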
PUT _index_template/logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs"
    },
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ]
    }
  }
}
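# See which fields use the most fielddata
GET _cat/fielddata?v&h=node,field,size

# Emergency: clear the fielddata cache
POST _cache/clear?fielddata=true

# Permanently cap fielddata. indices.fielddata.cache.size is a static node setting,
# so set it in elasticsearch.yml on each data node and restart:
indices.fielddata.cache.size: 20%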
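Aggregate on keyword fields; text fields have aggregations disabled by default (to keep fielddata from exhausting the heap). Log fields such as service_name/level/host should always be keyword.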
# Find slow queries (the slow log has to be enabled first)
PUT myindex/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

# Check the search thread pool
GET _cat/thread_pool/search?v
A wildcard query such as *error* is extremely slow; use token matching instead.

// Slow: scans the terms from the start
GET myindex/_search
{
  "query": {
    "wildcard": {
      "message": "*error*"   // a leading * is extremely slow
    }
  }
}

// Fast: uses the inverted index
GET myindex/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" }}},
        { "match": { "message": "error" }},
        { "term": { "level": "ERROR" }}
      ]
    }
  }
}
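When queries are generated by scripts, it is worth building the fast form once and reusing it. A sketch that assembles the same filter-context body as the fast query above; the function name and default time range are assumptions, and sending the request is left out.

```python
import json

def build_log_query(text: str, level: str, since: str = "now-1h") -> dict:
    """Build a bool/filter search body: time range first, then token and term matches."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": since}}},
                    {"match": {"message": text}},   # token match on the analyzed text field
                    {"term": {"level": level}},     # exact match on a keyword field
                ]
            }
        }
    }

body = build_log_query("error", "ERROR")
print(json.dumps(body, indent=2))
```

Keeping every clause in `filter` skips relevance scoring and lets Elasticsearch cache the clauses, which suits log searches where ranking is irrelevant.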
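# Kibana → Stack Management → Advanced Settings
# discover:sampleSize             reduce the number of rows Discover loads
# discover:maxDocFieldsDisplayed  limit the number of fields shown
# query:queryString:options       adjust query parsing behavior

# On the dashboard side:
# - Limit the size of each visualization
# - Reuse queries through saved searches
# - Do not default dashboards to a "last 1 year" time range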
logs-* scans every index whose name starts with logs; to query a single day, naming logs-2026.05.11 directly is enough. Kibana's time-based index patterns narrow the indices by time automatically.
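For scripted searches outside Kibana, the index list for a date range can be generated instead of falling back to the wildcard. A small sketch assuming the daily `logs-YYYY.MM.dd` naming shown above; the function name is made up.

```python
from datetime import date, timedelta
from typing import List

def daily_indices(start: date, end: date, prefix: str = "logs-") -> List[str]:
    """List the daily index names covering start..end inclusive."""
    days = (end - start).days
    return [
        f"{prefix}{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]

# Comma-join the names to use them as the index part of a search URL.
print(",".join(daily_indices(date(2026, 5, 10), date(2026, 5, 11))))
```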