Linux Kernel Nexthop Objects: A Deep Dive into Patch 4c7e8084

1. Problem Origin

1.1 Performance Bottleneck

David Ahern began working on nexthop objects around 2017. The core motivation was the performance bottleneck of large-scale route injection. The numbers at the time:

Scenario | Duration
Inject 700k+ IPv4 routes, single path | 18 s
Inject 700k+ IPv4 routes, 4-path ECMP | about 28 s
Inject 700k+ IPv4 routes, 16-path ECMP | over 72 s

Injection time grew sharply with the number of ECMP (Equal-Cost Multi-Path) paths, which is unacceptable for workloads that load a full BGP internet routing table.

1.2 Root Cause in the Legacy Architecture

Before nexthop objects, next-hop information was embedded in the route entry itself:

Legacy mode, IPv4:
fib_info ──→ fib_nh[0]   ← next hops embedded at the end of the struct
             fib_nh[1]
             fib_nh[2]
             ...

Legacy mode, IPv6 (even worse):
fib6_info → sibling → sibling → ... → sibling
            (each sibling is a separate fib6_info)

This design had three major pain points:

Pain point 1: redundant validation wastes CPU

Every route validates its nexthop specification (device, gateway, encap) independently when it is created. Even when 1,000 routes point to the same gateway, the same validation runs 1,000 times.

Pain point 2: excessive synchronize_rcu calls (the performance killer)

The IPv4 route-change path made many synchronize_rcu() calls. Each call blocks until every CPU has left its current RCU read-side critical section, which adds up to a large delay during bulk route injection.

Pain point 3: cost scales linearly with the number of ECMP paths

Each route stores its own next-hop array, so N-path ECMP costs O(N) per route, or O(N × number of routes) in total. The IPv6 sibling list is worse: every path is a separate fib6_info, so memory and CPU cost both multiply.
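To make the scaling argument concrete, here is a minimal cost-model sketch. It is not kernel code: the function names are hypothetical, and the model only counts how many next-hop records exist (each one validated at creation) under the embedded layout versus a shared layout in which routes merely hold a reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model, not kernel code: count next-hop records. */

/* Embedded layout: every route carries its own copy of all N next hops. */
static size_t embedded_nh_records(size_t routes, size_t paths)
{
	assert(paths > 0);
	return routes * paths;
}

/* Shared layout: the N next hops exist once, plus one group object that
 * ties them together; each route only holds a reference. */
static size_t shared_nh_records(size_t routes, size_t paths)
{
	assert(paths > 0);
	(void)routes;	/* the count does not depend on the number of routes */
	return paths + 1;
}
```

Under this model, 700,000 routes with 16 paths each need 11,200,000 embedded records, while the shared layout needs 17 regardless of the route count.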


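2. Summary

Commit 4c7e8084 is the key step in the nexthop object work that connects standalone nexthop objects to fib_info, the core data structure of IPv4 FIB routes.

Its job is plumbing:

  1. Add a struct nexthop *nh pointer to fib_info so that a route can reference an independently managed nexthop object
  2. Add fi_list to struct nexthop to establish reverse tracking from a nexthop to its fib_info entries
  3. Adapt all related core functions to handle both the new and the legacy path
  4. Lay the groundwork for the follow-up patch “ipv4: Allow routes to use nexthop objects”

In one sentence: this patch plumbs a pointer from fib_info to a standalone nexthop object, which makes it possible for many routes to share a nexthop and for a nexthop change to take effect in one place. The series as a whole, together with an improved ip tool and a raised fib_sync_mem limit, cut route injection from 18 seconds (700k+ routes) to 4.3 seconds (743,799 routes).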

3. Patch Details

3.1 New Data Structure Relationships

Legacy mode: fib_info embeds fib_nh[]
fib_info ──→ fib_nh[0], fib_nh[1], ... (array at the end of the struct)

New mode: fib_info references a standalone nexthop object
fib_info ──→ nexthop ──→ nh_group ──→ nh_info[]
               │
               └── fi_list (reverse list tracking every fib_info that references the nexthop)

Complete diagram of the new architecture:

route A ──→ fib_info ──┐
                       │
route B ──→ fib_info ──┼──→ nexthop (id=100) ──→ nh_group
                       │      │                    ├─ nh_info → fib_nh (gw=10.0.0.1, dev=eth0)
route C ──→ fib_info ──┘      │                    ├─ nh_info → fib_nh (gw=10.0.0.2, dev=eth1)
                              │                    └─ nh_info → fib_nh (gw=10.0.0.3, dev=eth2)
                              │
                              └──→ fi_list (reverse list tracking every referencing fib_info)

Key design decisions:

  • Multiple routes share one nexthop object, referenced through the nh pointer
  • A nexthop object uses fi_list to track every fib_info that uses it
  • A nexthop change (for example, a link failure) is made in one place and applies to every route that references it
  • In fib_info, the nh pointer and the fib_nh[] array are mutually exclusive: a route takes either the new path or the legacy path
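The reverse-tracking idea can be sketched in plain userspace C. Everything here is a simplified stand-in: the toy_ types and functions are hypothetical, a hand-rolled singly linked list replaces the kernel's list_head, and locking, RCU and reference counting are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_nexthop;

/* Stand-in for fib_info: references a shared nexthop and is chained
 * into that nexthop's reverse list. */
struct toy_fib_info {
	struct toy_nexthop *nh;		/* shared nexthop object */
	struct toy_fib_info *nh_next;	/* next entry on nh->fi_list */
	bool dead;
};

/* Stand-in for struct nexthop with the head of its reverse list. */
struct toy_nexthop {
	unsigned int id;
	struct toy_fib_info *fi_list;	/* every fib_info using this nexthop */
};

/* Attach a route to a nexthop and record the back-reference. */
static void toy_fib_attach(struct toy_fib_info *fi, struct toy_nexthop *nh)
{
	assert(fi->nh == NULL);	/* a route references at most one nexthop object */
	fi->nh = nh;
	fi->dead = false;
	fi->nh_next = nh->fi_list;
	nh->fi_list = fi;
}

/* On nexthop removal, walk the reverse list and mark every user dead.
 * Returns the number of entries that were marked. */
static size_t toy_nexthop_remove(struct toy_nexthop *nh)
{
	size_t marked = 0;
	struct toy_fib_info *fi;

	for (fi = nh->fi_list; fi != NULL; fi = fi->nh_next) {
		fi->dead = true;
		marked++;
	}
	nh->fi_list = NULL;
	return marked;
}
```

Attaching routes A, B and C to one toy_nexthop and then removing it marks all three dead in a single list walk, which is the "change in one place" property the bullets describe.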

3.2 New Fields in fib_info

struct fib_info {
	...
	struct nexthop		*nh;      /* nexthop object pointer; mutually exclusive with fib_nh[] */
	struct list_head	nh_list;  /* linked into nexthop->fi_list; walked on nexthop update/delete */
	...
};

3.3 New Field in struct nexthop

struct nexthop {
	...
	struct list_head	fi_list;  /* tracks every fib_info that references this nexthop */
	...
};

3.4 Core Function Adaptations

Function | Change
fib_create_info | Validates the nexthop reference and adds the fib_info to the nexthop's fi_list
fib_release_info | Removes the fib_info from the nexthop's fi_list
free_fib_info_rcu | Drops the reference on the nexthop object
nh_comp | Uses nexthop_cmp to check whether two nexthops are the same
fib_info_hashfn | Hashes by nexthop id instead of by oif when the fib_info uses a nexthop object
fib_nlmsg_size | Reserves space for the RTA_NH_ID attribute
fib_select_multipath | Uses nexthop_path_fib_result() for ECMP path selection
fib_table_lookup | Handles the blackhole nexthop case
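As an illustration of what ECMP path selection by flow hash can look like, here is a generic hash-threshold sketch. It is an assumption for teaching purposes, not a copy of nexthop_path_fib_result(): the sketch_ names and the upper_bound field are hypothetical, and each leg is assumed to own a contiguous slice of the 32-bit hash space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One leg of a hypothetical multipath group. Legs are sorted by
 * upper_bound, and the last leg's bound covers the whole hash range. */
struct sketch_leg {
	int oif;
	uint32_t upper_bound;
};

/* Return the first leg whose upper bound is >= the flow hash, or NULL
 * if the bounds do not cover the hash. */
static const struct sketch_leg *sketch_select_path(const struct sketch_leg *legs,
						   size_t n, uint32_t hash)
{
	size_t i;

	assert(legs != NULL);
	for (i = 0; i < n; i++) {
		if (hash <= legs[i].upper_bound)
			return &legs[i];
	}
	return NULL;
}
```

Because the selection depends only on the flow hash and the group's bounds, every route that shares the group resolves a given flow to the same leg.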

3.5 New Nexthop Helper Functions

Function | Purpose
nexthop_cmp | Checks whether two nexthops are the same
nexthop_path_fib_result | ECMP selection: picks one path within a multipath nexthop
nexthop_fib_nhc | Returns a specific fib_nh_common within a nexthop
__remove_nexthop_fib | On nexthop deletion, walks fi_list and marks the associated fib_info entries as dead
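The "new path or legacy path" dispatch that these helpers enable can be sketched as follows. The demo_ types are simplified stand-ins with illustrative field names, not the kernel definitions; the only point is that one accessor hides which of the two storage layouts a given fib_info uses.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; field names are illustrative only. */
struct demo_nhc {
	int oif;
};

struct demo_nexthop {
	int num_nh;
	struct demo_nhc nhc[4];
};

struct demo_fib_info {
	struct demo_nexthop *nh;	/* new path, or NULL for the legacy path */
	int fib_nhs;			/* number of embedded next hops (legacy path) */
	struct demo_nhc fib_nh[4];	/* embedded next hops (legacy path) */
};

/* New path: look up leg nhsel inside the shared nexthop object. */
static struct demo_nhc *demo_nexthop_fib_nhc(struct demo_nexthop *nh, int nhsel)
{
	return (nhsel >= 0 && nhsel < nh->num_nh) ? &nh->nhc[nhsel] : NULL;
}

/* Single accessor used by callers; it picks the right storage. */
static struct demo_nhc *demo_fib_info_nhc(struct demo_fib_info *fi, int nhsel)
{
	/* The two layouts are mutually exclusive. */
	assert(fi->nh == NULL || fi->fib_nhs == 0);

	if (fi->nh)
		return demo_nexthop_fib_nhc(fi->nh, nhsel);
	return (nhsel >= 0 && nhsel < fi->fib_nhs) ? &fi->fib_nh[nhsel] : NULL;
}
```

Callers that go through the accessor do not need to know whether a route was created with an embedded array or with a nexthop object, which is why the accessor conversion was done as a preparatory patch.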

4. Original Commit Message

Add ‘struct nexthop’ and nh_list list_head to fib_info. nh_list is the fib_info side of the nexthop <-> fib_info relationship.

Add fi_list list_head to ‘struct nexthop’ to track fib_info entries using a nexthop instance. Add __remove_nexthop_fib and add it to __remove_nexthop to walk the new list_head and mark those fib entries as dead when the nexthop is deleted.

Add a few nexthop helpers for use when a nexthop is added to fib_info:

  • nexthop_cmp to determine if 2 nexthops are the same
  • nexthop_path_fib_result to select a path for a multipath ‘struct nexthop’
  • nexthop_fib_nhc to select a specific fib_nh_common within a multipath ‘struct nexthop’

Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses a ‘struct nexthop’, and mark fib_info_nh as only used for the non-nexthop case.

The bulk of the changes are in fib_semantics.c and most of that is moving the existing change_nexthops into an else branch.

Update the nexthop code to walk fi_list on a nexthop deleted to remove fib entries referencing it.


5. Original Cover Letter

The following excerpts come from the cover letter of the phase-two merge commit 48debfd736d5:

“When I started this idea almost 2 years ago, it took 18 seconds to inject 700k+ IPv4 routes with 1 hop and about 28 seconds for 4-paths. Some of that time was due to inefficiencies in ‘ip’, but most of it was kernel side with excessive synchronize_rcu calls in ipv4, and redundant processing validating a nexthop spec (device, gateway, encap). Worse, the time increased dramatically as the number of legs in the routes increased; for example, taking over 72 seconds for 16-path routes.”

“After this set, with increased dirty memory limits (fib_sync_mem sysctl), an improved ip and nexthop objects a full internet fib (743,799 routes based on a pull in January 2019) can be pushed to the kernel in 4.3 seconds.”

“Even better, the time to insert is ‘almost constant’ with increasing number of paths. The ‘almost constant’ time is due to expanding the nexthop definitions when generating notifications. A follow on patch will be sent adding a sysctl that allows an admin to avoid the nexthop expansion and truly get constant route insert time regardless of the number of paths in a route!”


6. Performance Results

Scenario | Legacy | With nexthop objects | Improvement
Inject a full IPv4 table (about 740k routes), single path | ~18 s | ~4.3 s | ~4.2x
Injection time as the number of paths grows | O(N), linear growth | Almost constant | -

7. Overall Timeline

graph TD
A["2017: Discovered the large-scale route injection bottleneck"] --> B["2018: Refactored fib_nh_common<br/>to unify IPv4/IPv6 next hops"]
B --> C["2019-05: Phase 1<br/>Nexthop infrastructure<br/>(UAPI + core framework + IPv4/v6/group)<br/>c38e57aecbb4"]
C --> D["2019-06: Phase 2 - Merge 1<br/>Preparation + nexthop plumbed into fib_info<br/>9ec49a7e58fb"]
D --> E["★ 4c7e8084<br/>ipv4: Plumb support for<br/>nexthop object in a fib_info"]
E --> F["2019-06: Phase 2 - Merge 2<br/>Routes can actually use nexthop objects<br/>48debfd736d5"]
F --> G["2019-06: selftests + replace support"]
G --> H["2021: Resilient nexthop groups<br/>(minimize traffic disruption)"]

style E fill:#ff6b6b,stroke:#333,color:#fff

8. Solution: Nexthop as a Standalone Object

8.1 Two-Phase Implementation

Phase 1: build the nexthop infrastructure (2019-05, 6 patches)

Merge commit: c38e57aecbb4 “net: API and initial implementation for nexthop objects”

# | Commit | Title | Description
1 | 65ee00a9409f | net: nexthop uapi | Defines the UAPI: RTM_*NEXTHOP commands and NHA_* attributes
2 | ab84be7e54fc | net: Initial nexthop code | Core framework: rbtree, RTM command handling, notifications
3 | 597cfe4fc339 | nexthop: Add support for IPv4 nexthops | IPv4 gateways plus netdev event handling
4 | 53010f991a9f | nexthop: Add support for IPv6 gateways | IPv6 gateway support
5 | b513bd035f40 | nexthop: Add support for lwt encaps | Lightweight tunnel (lwt) encapsulation
6 | 430a049190de | nexthop: Add support for nexthop groups | ECMP nexthop groups

At the end of this phase, nexthop objects can be created and deleted, but routes cannot reference them yet. The cover letter says:

“At the end of this set, nexthop objects can be created and deleted and userspace can monitor nexthop events, but ipv4 and ipv6 routes can not use them yet.”

Phase 2: connect nexthop objects to the routing subsystem (2019-06)

Phase 2 was split into two merges:

Merge 1: 9ec49a7e58fb “net: add struct nexthop to fib info” (7 patches)

# | Commit | Title | Description
1 | 5481d73f8154 | ipv4: Use accessors for fib_info nexthop data | Preparation: introduces accessor functions
2 | dcb1ecb50edf | ipv4: Prepare for fib6_nh from a nexthop object | Preparation: adapts to fib6_nh
3 | 4c7e8084fd46 | ipv4: Plumb support for nexthop object in a fib_info | ★ This patch: plumbs the nexthop pointer into fib_info
4 | f88d8ea67fbd | ipv6: Plumb support for nexthop object in a fib6_info | The matching IPv6 change
5 | 54250805d8e4 | mlxsw: Fail attempts to use routes with nexthop objects | mlxsw driver: rejects routes that use nexthop objects
6 | 6a87afc072c3 | mlx5: Fail attempts to use routes with nexthop objects | mlx5 driver: rejects routes that use nexthop objects
7 | dbcc4fa718ee | rocker: Fail attempts to use routes with nexthop objects | rocker driver: rejects routes that use nexthop objects

Merge 2: 48debfd736d5 “net: Enable nexthop objects with IPv4 and IPv6 routes” (20 patches)

# | Commit | Title | Description
1-10 | f88c9aa1..2d44234b | ipv6: Handle all fib6_nh in a nexthop in … | IPv6 code walks every fib6_nh in a nexthop
11 | 493ced1ac47c | ipv4: Allow routes to use nexthop objects | IPv4 routes can actually use nexthop objects
12 | 6c48ea5fe639 | ipv4: Optimization for fib_info lookup with nexthops | Lookup optimization
13 | 5b98324ebe29 | ipv6: Allow routes to use nexthop objects | IPv6 routes can use nexthop objects
14 | 7bf4796dd099 | nexthops: add support for replace | Nexthop replace support
15-20 | 243781db..cab14d10 | selftests: … | Test cases