how does a tcp server bind to a socket

The bound port is managed through the bhash table inside tcp_hashinfo,
using the same mechanism as TCP client port management.

Note: the bind system call does not yet put the socket onto the listening hash; that only happens during the listen system call.
bind merely reserves the port, so other programs can no longer bind to it.
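
To make the note concrete, here is a minimal user-space sketch (bind_port is just a local helper and port 8080 is an arbitrary example): the second bind() fails with EADDRINUSE even though the first socket is not listening yet, because bind alone already claims the port.

/* Demo: bind() alone reserves the port; listen() is a separate later step. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static int bind_port(int port)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(port),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                printf("bind(%d) failed: %s\n", port, strerror(errno));
                close(fd);
                return -1;
        }
        printf("bind(%d) succeeded, fd=%d\n", port, fd);
        return fd;
}

int main(void)
{
        int fd1 = bind_port(8080);      /* reserves the port */
        int fd2 = bind_port(8080);      /* expected to fail: EADDRINUSE */

        if (fd1 >= 0) {
                listen(fd1, 128);       /* only now does the socket become a listener */
                close(fd1);
        }
        if (fd2 >= 0)
                close(fd2);
        return 0;
}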


basic data structure of inet socket: inetsw

An array of protocol lists

The inet family is somewhat special: because it supports so many type/protocol combinations, it introduces inetsw, an array of lists.
As the comment in the source says, inetsw is the foundation of inet socket creation and contains everything needed to create an inet socket.

inetsw

/* The inetsw table contains everything that inet_create needs to
 * build a new socket.
 */
static struct list_head inetsw[SOCK_MAX];
static DEFINE_SPINLOCK(inetsw_lock);

inetsw is an array of list heads; each list contains entries of the same socket type (see socket type).
Each node is a struct inet_protosw, and it is inserted into the list for its type by
inet_register_protosw.
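
For reference, each node looks roughly like this (abridged from include/net/protocol.h; the exact field set varies a bit across kernel versions):

/* This is used to register socket interfaces for IP protocols. */
struct inet_protosw {
        struct list_head list;

        /* These two fields form the lookup key. */
        unsigned short   type;     /* 2nd argument to socket(2) */
        unsigned short   protocol; /* L4 protocol number */

        struct proto            *prot;
        const struct proto_ops  *ops;

        unsigned char    flags;    /* INET_PROTOSW_* flags */
};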


How to select tcp client port

Based on upstream sources, v4.2+ (commit ID: a794b4f).

Key variable

tcp_hashinfo

struct inet_hashinfo tcp_hashinfo;

Call chain

> connect syscall
> > sock->ops->connect
> > ==> tcp_v4_connect
> > > inet_hash_connect
> > > > port_offset = inet_sk_port_offset(sk);
> > > > __inet_hash_connect
> > > > > inet_get_local_port_range
> > > > > for each port in range: start with port_offset.
> > > > > > inet_is_local_reserved_port //skip the reserved ports.
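
The loop at the bottom of this chain can be modeled in user space roughly as follows. This is only an illustrative sketch, not the kernel code: pick_local_port is a made-up name, and is_reserved/is_available are hypothetical stand-ins for inet_is_local_reserved_port and the bind/timewait conflict checks done in __inet_hash_connect.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's reserved-port and conflict checks. */
static bool is_reserved(int port)  { return port == 32770; }
static bool is_available(int port) { return port != 32768; }

static int pick_local_port(unsigned int port_offset, int low, int high)
{
        int remaining = high - low + 1;

        for (int i = 0; i < remaining; i++) {
                int port = low + (int)((port_offset + i) % (unsigned)remaining);

                if (is_reserved(port))
                        continue;       /* skip ip_local_reserved_ports */
                if (is_available(port))
                        return port;    /* no conflicting user of this port */
        }
        return -1;                      /* local port range exhausted */
}

int main(void)
{
        /* 32768..60999 roughly matches the default ip_local_port_range. */
        printf("picked port %d\n", pick_local_port(12345u, 32768, 60999));
        return 0;
}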

There is an important data structure here, struct inet_hashinfo;
its corresponding global variable is tcp_hashinfo.

It consists of three parts (an abridged struct definition follows the list):

  1. ehash: the established hash, keyed by (net namespace, saddr, daddr, sport, dport). Each struct sock is linked onto a hash chain via &sk->sk_nulls_node.
  2. bhash: the bind hash, keyed by net namespace and local port.
    Each hash node is a struct inet_bind_bucket, linked onto its chain via the node field.
    One node corresponds to one local port. Each node also has an owner list; num_owners is the length of that list.
    Each struct sock is linked onto the owner list via sk->sk_bind_node.
  3. listening_hash: used by TCP servers; to be filled in later.
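
For orientation, the structure looks roughly like this (abridged from include/net/inet_hashtables.h in v4.x sources; comments added here, several fields omitted, and the layout varies by kernel version):

struct inet_hashinfo {
        /* 1. established connections, keyed by 4-tuple + netns */
        struct inet_ehash_bucket        *ehash;
        spinlock_t                      *ehash_locks;
        unsigned int                     ehash_mask;
        unsigned int                     ehash_locks_mask;

        /* 2. bound local ports: one inet_bind_bucket per (netns, port) */
        struct inet_bind_hashbucket     *bhash;
        unsigned int                     bhash_size;
        struct kmem_cache               *bind_bucket_cachep;

        /* 3. listening sockets, hashed by local port */
        struct inet_listen_hashbucket    listening_hash[INET_LHTABLE_SIZE];
};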


bridge zero copy transmit

Red Hat Enterprise Linux 7 fully supports it.

What the Red Hat 7 documentation says

5.5.1. Bridge Zero Copy Transmit

Zero copy transmit mode is effective on large packet sizes. It typically reduces the host CPU overhead by up to 15% when transmitting large packets between a guest network and an external network, without affecting throughput.
It does not affect performance for guest-to-guest, guest-to-host, or small packet workloads.
Bridge zero copy transmit is fully supported on Red Hat Enterprise Linux 7 virtual machines, but disabled by default. To enable zero copy transmit mode, set the experimental_zcopytx kernel module parameter for the vhost_net module to 1.
NOTE
An additional data copy is normally created during transmit as a threat mitigation technique against denial of service and information leak attacks. Enabling zero copy transmit disables this threat mitigation technique.
If performance regression is observed, or if host CPU utilization is not a concern, zero copy transmit mode can be disabled by setting experimental_zcopytx to 0.


how tcpdump direction filter works

How tcpdump supports the direction filter.

The key piece of kernel state is the pkt_type field of the skb.
On the receive and transmit paths this field is set to PACKET_OUTGOING or one of the other values.
The value is passed up to user space, and libpcap uses it to decide whether a packet's direction is the one requested (a small AF_PACKET example follows the value list below).

Possible values of pkt_type

#define PACKET_HOST             0       /* To us                */
#define PACKET_BROADCAST        1       /* To all               */
#define PACKET_MULTICAST        2       /* To group             */
#define PACKET_OTHERHOST        3       /* To someone else      */
#define PACKET_OUTGOING         4       /* Outgoing of any type */
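
A minimal user-space sketch of how this value becomes visible outside the kernel (assumes Linux and root privileges): an AF_PACKET socket reports the field as sll_pkttype in struct sockaddr_ll, which is what libpcap's inbound/outbound filtering looks at.

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

int main(void)
{
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
                perror("socket");
                return 1;
        }
        for (;;) {
                char buf[2048];
                struct sockaddr_ll sll;
                socklen_t slen = sizeof(sll);
                ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                     (struct sockaddr *)&sll, &slen);
                if (n < 0) {
                        perror("recvfrom");
                        return 1;
                }
                /* sll_pkttype carries PACKET_HOST, PACKET_OUTGOING, ... */
                printf("%zd bytes, pkt_type=%d%s\n", n, sll.sll_pkttype,
                       sll.sll_pkttype == PACKET_OUTGOING ? " (outgoing)" : "");
        }
}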


vhost net study

The goal of vhost-net is to avoid having to schedule QEMU on the host for the data path, which improves performance.
xmit: the VM's outgoing packets are transmitted directly from the host kernel.
rcv:

Core data structures

vhost_poll is the most important data structure in vhost.

/* Poll a file (eventfd or socket) */
/* Note: there's nothing vhost specific about this structure. */
struct vhost_poll {
        poll_table                table;
        wait_queue_head_t        *wqh;
        wait_queue_t              wait;
        struct vhost_work         work;
        unsigned long             mask;
        struct vhost_dev         *dev;
};
  • table: responsible for putting the wait entry onto wqh; vhost_net_open initializes its
    callback to vhost_poll_func (see the vhost_poll_init sketch after this list).
  • wqh: initialized to point at an eventfd's ctx (its wait queue head).
  • wait: the entry that gets placed on the wqh list. When the guest VM sends a packet, wait is
    removed from the queue and its func is executed; vhost_net_open initializes that func to
    vhost_poll_wakeup, which puts work onto the work_list of the corresponding vhost_dev.
  • work: each vhost_dev has a kernel thread that takes work nodes off the work_list and runs
    each node's fn; fn is the function that does the real work.
    For the rx vhost_virtqueue, vhost_net_open initializes fn to handle_rx_kick;
    for the tx vhost_virtqueue, it initializes fn to handle_tx_kick.
  • mask: the set of eventfd events to listen for.
  • dev: the vhost_dev this vhost_poll belongs to.
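
The place where these fields get wired together is vhost_poll_init, called from vhost_net_open (directly for the socket polls and via vhost_dev_init for the virtqueue kick polls). Roughly, in kernels of that era (abridged from drivers/vhost/vhost.c; details vary by version):

void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
                     unsigned long mask, struct vhost_dev *dev)
{
        /* wait: woken from the eventfd/socket wait queue, runs vhost_poll_wakeup */
        init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
        /* table: vhost_poll_func adds poll->wait to the file's wqh */
        init_poll_funcptr(&poll->table, vhost_poll_func);
        poll->mask = mask;
        poll->dev = dev;
        poll->wqh = NULL;

        /* work: fn (e.g. handle_tx_kick / handle_rx_kick) runs in the
         * per-device vhost worker thread */
        vhost_work_init(&poll->work, fn);
}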


ipvlan study

L2 mode

xmit packet
xmit a normal pkt to another physical machine
==> ipvlan_start_xmit
==> ==> ipvlan_xmit_mode_l2
==> ==> ==> skb->dev = ipvlan->phy_dev;
==> ==> ==> return dev_queue_xmit(skb);
xmit a normal pkt to another namespace
==> ipvlan_start_xmit
==> ==> ipvlan_xmit_mode_l2
==> ==> ==> if (ether_addr_equal(eth->h_dest, eth->h_source))
==> ==> ==> addr = ipvlan_addr_lookup(ipvlan->port, lyr3h, addr_type, true);
==> ==> ==> ipvlan_rcv_frame(addr, skb, true);
==> ==> ==> ==> skb->dev = dev; <== dst namespace dev
==> ==> ==> ==> dev_forward_skb(ipvlan->dev, skb)
==> ==> ==> ==> ==> return __dev_forward_skb(dev, skb) ?: netif_rx_internal(skb);
==> ==> ==> ==> ==> ==> enqueue_to_backlog(skb, get_cpu(), &qtail);
xmit a multicast pkt
==> ipvlan_start_xmit
==> ==> ipvlan_xmit_mode_l2
==> ==> ==> else if (is_multicast_ether_addr(eth->h_dest))
==> ==> ==> ipvlan_multicast_frame(ipvlan->port, skb, ipvlan, true);
==> ==> ==> ==> list_for_each_entry(ipvlan, &port->ipvlans, pnode)
==> ==> ==> ==> ==> dev_forward_skb or netif_rx(nskb);
recv packet

All packets are received through the rx_handler, ipvlan_handle_frame.
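
For context, the handler is attached to the underlying physical device when the ipvlan port is created, roughly like this (a simplified sketch paraphrasing ipvlan_port_create; the function name here is made up, and error handling plus the rest of the setup are omitted):

static int ipvlan_port_create_sketch(struct net_device *dev)
{
        struct ipvl_port *port;

        port = kzalloc(sizeof(*port), GFP_KERNEL);
        if (!port)
                return -ENOMEM;
        port->dev = dev;

        /* From now on every frame received by the master device is
         * funneled through ipvlan_handle_frame with `port` as data. */
        return netdev_rx_handler_register(dev, ipvlan_handle_frame, port);
}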

unicast packet

Look up the destination ipvlan device (net_device)
by the destination IPv4/IPv6 address, and deliver the packet to it.

==> ipvlan_handle_frame
==> ==> ipvlan_handle_mode_l2
==> ==> ==> ipvlan_addr_lookup(port, lyr3h, addr_type, true);
==> ==> ==> ==> skb->dev = dev;
==> ==> ==> ==> dev_forward_skb or ret = RX_HANDLER_ANOTHER;
multicast packet.
==> ipvlan_handle_frame
==> ==> ipvlan_handle_mode_l2
==> ==> ==> if (is_multicast_ether_addr(eth->h_dest)) {
==> ==> ==> ipvlan_addr_lookup(port, lyr3h, addr_type, true);
==> ==> ==> ==> if (ipvlan_external_frame(skb, port))
==> ==> ==> ==> ==> ipvlan_multicast_frame(port, skb, NULL, false);

L3 mode

ipvlan_start_xmit
static const struct net_device_ops ipvlan_netdev_ops = {
        .ndo_init               = ipvlan_init,
        .ndo_uninit             = ipvlan_uninit,
        .ndo_open               = ipvlan_open,
        .ndo_stop               = ipvlan_stop,
        .ndo_start_xmit         = ipvlan_start_xmit,
        .ndo_fix_features       = ipvlan_fix_features,
        .ndo_change_rx_flags    = ipvlan_change_rx_flags,
        .ndo_set_rx_mode        = ipvlan_set_multicast_mac_filter,
        .ndo_get_stats64        = ipvlan_get_stats64,
        .ndo_vlan_rx_add_vid    = ipvlan_vlan_rx_add_vid,
        .ndo_vlan_rx_kill_vid   = ipvlan_vlan_rx_kill_vid,
};
int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct ipvl_dev *ipvlan = netdev_priv(dev);
        struct ipvl_port *port = ipvlan_port_get_rcu(ipvlan->phy_dev);

        if (!port)
                goto out;

        if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr))))
                goto out;

        switch(port->mode) {
        case IPVLAN_MODE_L2:
                return ipvlan_xmit_mode_l2(skb, dev);
        case IPVLAN_MODE_L3:
                return ipvlan_xmit_mode_l3(skb, dev);
        }

        /* Should not reach here */
        WARN_ONCE(true, "ipvlan_queue_xmit() called for mode = [%hx]\n",
                  port->mode);
out:
        kfree_skb(skb);
        return NET_XMIT_DROP;
}
static int ipvlan_xmit_mode_l2(struct sk_buff *skb, struct net_device *dev)
{
        const struct ipvl_dev *ipvlan = netdev_priv(dev);
        struct ethhdr *eth = eth_hdr(skb);
        struct ipvl_addr *addr;
        void *lyr3h;
        int addr_type;

        if (ether_addr_equal(eth->h_dest, eth->h_source)) {
                lyr3h = ipvlan_get_L3_hdr(skb, &addr_type);
                if (lyr3h) {
                        addr = ipvlan_addr_lookup(ipvlan->port, lyr3h, addr_type, true);
                        if (addr)
                                return ipvlan_rcv_frame(addr, skb, true);
                }
                skb = skb_share_check(skb, GFP_ATOMIC);
                if (!skb)
                        return NET_XMIT_DROP;

                /* Packet definitely does not belong to any of the
                 * virtual devices, but the dest is local. So forward
                 * the skb for the main-dev. At the RX side we just return
                 * RX_PASS for it to be processed further on the stack.
                 */
                return dev_forward_skb(ipvlan->phy_dev, skb);

        } else if (is_multicast_ether_addr(eth->h_dest)) {
                u8 ip_summed = skb->ip_summed;

                skb->ip_summed = CHECKSUM_UNNECESSARY;
                ipvlan_multicast_frame(ipvlan->port, skb, ipvlan, true);
                skb->ip_summed = ip_summed;
        }

        skb->dev = ipvlan->phy_dev;
        return dev_queue_xmit(skb);
}

ftrace study

test case

We found that the ixgbe RX softirq poll function, ixgbe_poll, was being called even when no packets were arriving.
How can we prove this, and who calls it?

analysis

From browsing the source, ixgbe_poll should only be invoked via a NAPI schedule.
If that is true, __napi_schedule should be getting called.
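
One way to check this with ftrace is to trace __napi_schedule and record a kernel stack trace at each hit, so the caller shows up in the trace output. A minimal sketch (write_file is a local helper; assumes tracefs is mounted at /sys/kernel/debug/tracing and root privileges; the same steps can of course be done with plain shell redirection):

#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        const char *t = "/sys/kernel/debug/tracing";
        char path[256];

        /* Only trace __napi_schedule ... */
        snprintf(path, sizeof(path), "%s/set_ftrace_filter", t);
        write_file(path, "__napi_schedule");
        /* ... with the function tracer ... */
        snprintf(path, sizeof(path), "%s/current_tracer", t);
        write_file(path, "function");
        /* ... dumping a stack trace for every hit so the caller is visible. */
        snprintf(path, sizeof(path), "%s/options/func_stack_trace", t);
        write_file(path, "1");
        snprintf(path, sizeof(path), "%s/tracing_on", t);
        write_file(path, "1");

        printf("now inspect %s/trace to see who schedules ixgbe's NAPI poll\n", t);
        return 0;
}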
