Re-planning the Monitoring Architecture: Thanos + Prometheus


 2021/04/02   Share

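Preface

While running our monitoring, we gradually ran into a number of problems. First, our work is organized by project team, and every time a team launched a new project it set up its own monitoring. There was no unified view, which made management very cumbersome.

I had been reading about Thanos recently and thought it was a good fit, so I tried reworking the monitoring architecture and also reorganized the onboarding conventions.

This change focuses on two main areas:

Store all monitoring data in one place and use shared dashboards
Define onboarding documentation and conventions for monitoring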

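Architecture

Data Structure

To store all monitoring data in one place, we need distinct labels to tell the data apart, so the label conventions have to be planned up front.

Because the business spans different publishers and different cloud providers, we distinguish data with the following labels:

region, business, and publisher. Some projects run on k8s or Mesos clusters, and for those we use the cluster label.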

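Some projects also run Prometheus in high-availability mode, which we need to identify by label as well. That label is defined here as replica.

# <region of the node running Prometheus: str>
region: ""      # e.g. "腾讯云-北美"
# <current business: str>
business: ""
# <publisher: str>
publisher: ""
# <owning cluster: str>
cluster: ""     # used for k8s and Mesos clusters
# <high-availability label: str>
replica: ""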
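As a rough illustration of how this label convention could be checked before a Prometheus instance is onboarded, here is a minimal Python sketch. The split into required labels (region, business, publisher) and optional ones (cluster, replica) is my reading of the text above, and the function name `missing_labels` is hypothetical.

```python
# Assumption: region/business/publisher are always required, while
# cluster (k8s/Mesos only) and replica (HA only) are optional.
REQUIRED_LABELS = ("region", "business", "publisher")
OPTIONAL_LABELS = ("cluster", "replica")


def missing_labels(external_labels: dict) -> list:
    """Return the required labels that are absent or empty, in schema order."""
    return [
        name
        for name in REQUIRED_LABELS
        if not str(external_labels.get(name, "")).strip()
    ]


def unknown_labels(external_labels: dict) -> list:
    """Return label names that are not part of the agreed schema."""
    allowed = set(REQUIRED_LABELS) | set(OPTIONAL_LABELS)
    return sorted(name for name in external_labels if name not in allowed)
```

A check like this could run in CI against each team's `external_labels` block, so that data from a new Prometheus never reaches shared storage without the labels the shared dashboards filter on.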

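Product Architecture

Architecture diagram

Sidecar: connects to Prometheus and reads its data, either to serve query requests or to upload it to object storage
Store Gateway: serves the metric blocks held in the object storage bucket through the Store API
Compactor: compacts and downsamples the data in object storage, and deletes data past its retention period
Receiver: receives Prometheus remote-write data and uploads it to object storage
Querier/Query: implements the Prometheus-compatible query API, fans queries out to the backends, and deduplicates the results

For the detailed architecture and features, I recommend going straight to the official documentation, so I won't go into more detail here.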

Deployment

store-gateway

store-gateway is responsible for querying the data in object storage.
This deployment uses an Alibaba Cloud OSS bucket.

bucket_storage.yaml

type: ALIYUNOSS
config:
  endpoint: "oss-cn-hangzhou.aliyuncs.com"
  bucket: "dawdaw"
  access_key_id: "dawdaw"
  access_key_secret: "dawdaw"

Start

thanos store --objstore.config-file /etc/thanos/bucket_storage.yaml --http-address 0.0.0.0:19091 --grpc-address 0.0.0.0:19191

thanos-querier

querier is responsible for querying data from the sidecars and the store-gateway.

Start

thanos query --http-address 0.0.0.0:9091 --query.replica-label replica --store.sd-files=/data/thanos/storage_sd.json

storage_sd.json

[
  {
    "targets": ["49.51.185.122:10901",
                "127.0.0.1:19191"]
  }
]

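There is one issue here: store discovery currently relies on a file. Upstream does not support, and does not plan to support, Consul-style service discovery, so we will have to solve the discovery problem ourselves later.

We could write a script that watches Consul and generates this file dynamically, or we could expose an API that updates the file whenever a new instance is submitted.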
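The first option could look roughly like the Python sketch below: read the instances of a service from Consul's catalog API and rewrite the file in the same shape as `storage_sd.json`. This is only a sketch under assumptions. The helper names are hypothetical, and it assumes the Thanos store endpoints are registered in Consul under a service name of your choosing.

```python
import json
import os
import tempfile
import urllib.request


def fetch_services(consul_url: str, service_name: str) -> list:
    """Query Consul's catalog API for all instances of one service."""
    url = f"{consul_url}/v1/catalog/service/{service_name}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def build_sd_groups(services: list) -> list:
    """Turn Consul catalog entries into the file-SD structure used above."""
    targets = sorted({
        # ServiceAddress may be empty, in which case the node address applies.
        f"{svc.get('ServiceAddress') or svc['Address']}:{svc['ServicePort']}"
        for svc in services
    })
    return [{"targets": targets}]


def write_sd_file(path: str, groups: list) -> None:
    """Write the file atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(groups, tmp, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp defaults to 0600
    os.replace(tmp_path, path)
```

Run from cron or a small loop, this would be called as `write_sd_file("/data/thanos/storage_sd.json", build_sd_groups(fetch_services(consul_url, name)))`. Writing to a temporary file and renaming it keeps the querier from ever reading a half-written list.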

This will be addressed later.

[TODO] Service discovery for store

Compactor

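Use the Compactor to compact the monitoring data.

The first thing to understand is that running the Compactor with downsampling does not reduce the storage used by monitoring data. It actually increases it. The purpose of downsampling is to return results quickly for queries over long time ranges (months or years).

So we need to plan the retention periods for raw data, 5-minute downsampled data, and 1-hour downsampled data.

Based on current requirements, we use the following plan:

Raw data is kept for 60 days
5-minute downsampled data is kept for 120 days
1-hour downsampled data is kept for 1 year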
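To make the effect of this plan concrete, the following small Python sketch answers "which resolutions can still serve data of a given age?". It is a simplification: it only models the retention windows listed above, treats 1 year as 365 days, and ignores the delay before the Compactor has produced downsampled blocks for recent data. The names are hypothetical.

```python
# Retention plan from the list above, in days.
RETENTION_DAYS = {"raw": 60, "5m": 120, "1h": 365}


def available_resolutions(age_days: float) -> list:
    """Return the resolutions whose retention still covers data this old."""
    return [res for res, keep in RETENTION_DAYS.items() if age_days <= keep]
```

For example, a dashboard panel looking 90 days back can no longer be served from raw samples and has to fall back to the 5-minute or 1-hour data, which is worth keeping in mind when choosing panel time ranges.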

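Configuration notes

retention.resolution-raw: how long raw data is retained
retention.resolution-5m: how long 5-minute downsampled samples are retained
retention.resolution-1h: how long 1-hour downsampled samples are retained
delete-delay: how long a block marked for deletion is kept before it is actually deleted; 0 deletes it immediately
wait: keep running in a loop instead of exiting after one pass
wait-interval: the interval between passes
compact.concurrency: number of concurrent compaction goroutines; the number of CPUs is a reasonable value
data-dir: directory for temporary data

Start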

/data/thanos/thanos compact --retention.resolution-raw=60d  --retention.resolution-5m=120d --retention.resolution-1h=1y --compact.concurrency=4 --delete-delay=48h  --consistency-delay=30m --data-dir="/data/tmp-data" --objstore.config-file=/data/thanos/bucket_storage.yaml --wait --wait-interval=5m

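[TODO] Receiver

Onboarding Deployment

This section briefly describes how to onboard a project to monitoring.

There are currently two onboarding methods: a customized prometheus-operator and manual deployment. This section covers the manual method; the logic is the same either way.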

Install Consul

docker run -d --name=consul --net=host -v /data/consul/data:/consul/data/ consul:1.9 agent -server -client 0.0.0.0 -bootstrap-expect=1 -advertise=172.17.0.50 -data-dir=/consul/data/ -ui

Register a host

curl -X PUT --data '{"id": "node-exporter-'${i}'","name": "node-exporter","address": "'${i}'","port": 9100,"tags": ["monitoring"],"checks": [{"http": "http://'${i}':9100/metrics", "interval": "5s"}], "meta":{"role":"'${role}'", "zone": "'${zone}'"}}' http://172.17.0.50:8500/v1/agent/service/register
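The nested shell quoting in that command is easy to get wrong, so here is an equivalent Python sketch that builds the same registration payload and sends it to the same Consul endpoint. The function names are hypothetical, and the `role` and `zone` values are whatever your own inventory provides, just like `${role}` and `${zone}` in the shell version.

```python
import json
import urllib.request


def node_exporter_registration(address: str, role: str, zone: str) -> dict:
    """Build the same service definition as the curl command above."""
    return {
        "id": f"node-exporter-{address}",
        "name": "node-exporter",
        "address": address,
        "port": 9100,
        "tags": ["monitoring"],
        "checks": [
            {"http": f"http://{address}:9100/metrics", "interval": "5s"}
        ],
        "meta": {"role": role, "zone": zone},
    }


def register(consul_url: str, payload: dict) -> int:
    """PUT the definition to the Consul agent and return the HTTP status."""
    request = urllib.request.Request(
        f"{consul_url}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status
```

Looping over a host list and calling `register("http://172.17.0.50:8500", node_exporter_registration(host, role, zone))` would reproduce the shell loop. The `monitoring` tag and the `meta` keys matter because the Prometheus relabel rules later in this post depend on them.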

Install Prometheus

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    region: "腾讯云-上海"
    business: "卡片怪兽"
    publisher: "Tencent"
    cluster: "BCS-MESOS-30000"
    replica: "0"

rule_files:
  - /prometheus-rules/*.rules

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
  - job_name: 'consul-prometheus'
    consul_sd_configs:
      - server: '172.30.12.167:8500'
        # Monitor all services
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        # Keep only services whose tags include "monitoring"
        regex: '.*,monitoring,.*'
        action: keep
      - regex: __meta_consul_service_metadata_(.+)
        action: labelmap
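The two relabel rules above can be hard to picture, so the Python sketch below imitates them on a plain dictionary of labels. It relies on two documented Prometheus behaviours: `__meta_consul_tags` holds the tags joined by the separator with a leading and trailing separator (for example `,monitoring,`), and relabel regexes are fully anchored. The function names are hypothetical, and this is an imitation for illustration rather than Prometheus's own code.

```python
import re

KEEP_REGEX = re.compile(r".*,monitoring,.*")
LABELMAP_REGEX = re.compile(r"__meta_consul_service_metadata_(.+)")


def consul_tags_label(tags: list, separator: str = ",") -> str:
    """Build the value Prometheus exposes as __meta_consul_tags."""
    return separator + separator.join(tags) + separator


def relabel(labels: dict):
    """Apply the keep rule, then the labelmap rule; None means dropped."""
    # action: keep -> drop the target unless the whole value matches.
    if not KEEP_REGEX.fullmatch(labels.get("__meta_consul_tags", "")):
        return None
    # action: labelmap -> copy each matching label to the captured name.
    result = dict(labels)
    for name, value in labels.items():
        match = LABELMAP_REGEX.fullmatch(name)
        if match:
            result[match.group(1)] = value
    return result
```

With the registration shown earlier, a node exporter tagged `monitoring` and carrying `role` and `zone` metadata is kept and ends up with plain `role` and `zone` labels. The surrounding commas in the regex are what stop a tag such as `monitoring-x` from matching by accident.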

docker run -d --net=host --restart always \
    -v /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /data/prometheus/data:/prometheus \
    -v /data/prometheus/rules:/prometheus-rules \
    -u root \
    --name prometheus \
    quay.io/prometheus/prometheus:v2.14.0 \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --storage.tsdb.max-block-duration=2h \
    --storage.tsdb.min-block-duration=2h \
    --web.listen-address=:9090 \
    --web.enable-lifecycle \
    --web.enable-admin-api && echo "Prometheus started!"

Install the Sidecar

storage-bucket.yaml

For reference, see: aliyun-oss

type: ALIYUNOSS
config:
  endpoint: ""
  bucket: ""
  access_key_id: ""
  access_key_secret: ""

docker run -d --net=host --restart always \
    -v /data/prometheus/storage-bucket.yaml:/etc/thanos/storage-bucket.yaml \
    -v /data/prometheus/data:/prometheus \
    --name prometheus-0-sidecar \
    -u root \
    quay.io/thanos/thanos:v0.19.0 \
    sidecar \
    --tsdb.path /prometheus \
    --objstore.config-file /etc/thanos/storage-bucket.yaml \
    --shipper.upload-compacted \
    --http-address 0.0.0.0:19090 \
    --grpc-address 0.0.0.0:19190 \
    --prometheus.url http://127.0.0.1:9090

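Alerting

There are two options for alerting: run your own Alertmanager, or connect to an Alertmanager cluster maintained by a dedicated team. To use the external cluster, you first need to agree on how rules are written.

Rules Conventions

alert: a short, to-the-point alert name; keep it concise.
annotations:
- message: detailed alert information, typically the main content of the alert email
- summary: a one-line summary of the alert
- runbook_url: the runbook describing how to resolve the alert; this is an internal link
expr: the expression that triggers the alert
for: how long the condition must hold after triggering before the alert is sent
labels: additional labels
- severity: alert level
  - critical: severe
  - warning: warning
  - info: informational

The fields above are required. Rules must follow this convention before they can be connected to external alerting.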
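One way to enforce this convention before rules reach the shared Alertmanager is a small lint step. The Python sketch below checks a single rule that has already been parsed into a dictionary (for example by a YAML loader). It assumes every field listed above is mandatory, and the function name `rule_violations` is hypothetical.

```python
# Assumption: all fields named in the convention are mandatory.
REQUIRED_FIELDS = ("alert", "expr", "for", "labels", "annotations")
REQUIRED_ANNOTATIONS = ("message", "summary", "runbook_url")
SEVERITIES = ("critical", "warning", "info")


def rule_violations(rule: dict) -> list:
    """Return human-readable problems; an empty list means the rule passes."""
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if field not in rule
    ]
    annotations = rule.get("annotations") or {}
    problems += [
        f"missing annotation: {name}"
        for name in REQUIRED_ANNOTATIONS
        if name not in annotations
    ]
    severity = (rule.get("labels") or {}).get("severity")
    if severity not in SEVERITIES:
        problems.append(f"invalid severity: {severity!r}")
    return problems
```

Running this over every rule file in CI and failing on a non-empty result would keep malformed rules out of the shared cluster, where a missing `severity` could otherwise break routing for everyone.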

Start Alertmanager

docker run -d -u 1000 --name alertmanager --net=host --restart always \
-v /data/alertmanager/templates/:/etc/alertmanager/templates/  \
-v /data/alertmanager/data/:/alertmanager/data/ \
-v /data/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
bitnami/alertmanager:0.21.0 \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/alertmanager/data \
--web.listen-address=":9095"

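How do we protect Alertmanager?

Connecting to external alerting inevitably raises one question: how do we protect an Alertmanager deployed on the public internet?

After some thought, I settled on an IP allowlist plus a forward proxy. Basic auth can be added on top if needed.

The reason for adding a forward proxy instead of relying on the IP allowlist alone is that there may be a large number of clusters, each of which would need its own allowlist entry. It is simpler to place one forward proxy in each data center, so that only a single IP has to be allowlisted.

Configure basic auth

htpasswd -c /etc/nginx/.htpasswd admin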

nginx.conf

http {
    server {
        listen 12321;

        location /prometheus/ {
            auth_basic           "Prometheus";
            auth_basic_user_file /etc/nginx/.htpasswd;

            proxy_pass           http://localhost:9090/;
        }
    }
}

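Use a forward proxy

Nginx was chosen for the forward proxy, and it is already packaged as a container.

Alertmanager only needs to allowlist the IP of the forward proxy.

docker run -e DNS=114.114.114.114 -e PORT=333 --net host --rm -d -v /data/nginx_proxy_logs:/var/log/nginx/ momo184/nginx-forward-proxy:latest

Add the Prometheus configuration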

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Alertmanager address
            - '120.132.84.198:9093'
      # Forward proxy address
      proxy_url: 'http://172.17.0.4:333'

Original author: Momo

Original link: https://mo.xmomo521.top/2021/04/02/重新规划了监控的架构Thanos-Prometheus/

Published: April 2nd 2021, 4:45:11 pm

Updated: December 21st 2021, 3:32:58 pm


