udp rss hash causes low iperf performance

iperf performance problem on a VXLAN network

On a VXLAN network, an iperf TCP throughput test between two containers (each on a different host) only reaches about 3.5 Gb/s,
even though the two hosts are connected by a 10-gigabit network and each host has an ixgbe 10G NIC.

Solution

  1. On the sender, use the parallel-streams option -P

    iperf -c 192.168.51.2 -P8 -t 1000

  2. On the receiver, switch the ixgbe NIC's RSS hash to hash(SrcIP, DstIP, SrcPort, DstPort)

    ethtool -N em2 rx-flow-hash udp4 sdfn
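To see why hashing the ports matters, here is a toy model in Python — zlib.crc32 stands in for the NIC's real Toeplitz hash, and the addresses and ports are made up. With VXLAN, the NIC sees the outer UDP header; the VXLAN source port is derived from the inner flow, so including the ports in the udp4 hash spreads the -P8 streams across RX queues instead of pinning them all to one:

```python
import zlib

def rx_queue(key_fields, num_queues=8):
    """Toy RSS hash: crc32 is a stand-in for the NIC's Toeplitz hash."""
    data = ",".join(map(str, key_fields)).encode()
    return zlib.crc32(data) % num_queues

# Eight parallel streams between the same VTEP pair; only the source port differs.
src, dst = "192.168.51.1", "192.168.51.2"
flows = [(src, dst, sport, 4789) for sport in range(40000, 40008)]

queues_2tuple = {rx_queue((s, d)) for s, d, sp, dp in flows}   # hash(SrcIP, DstIP) only
queues_4tuple = {rx_queue(f) for f in flows}                   # sdfn: add SrcPort, DstPort

print("2-tuple queues:", len(queues_2tuple))   # identical key -> a single queue
print("4-tuple queues:", len(queues_4tuple))   # flows spread over several queues
```

With only (SrcIP, DstIP) hashed, every stream shares one RX queue (and one CPU); adding the ports lets the streams fan out.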

    Read More

TCP ack study

kernel version v4.5

两个重要并且容易混淆的函数:

  • tcp_v4_rcv
  • tcp_v4_do_rcv

Analogous to the top and bottom halves of interrupt handling,
TCP receive processing is split in two: tcp_v4_rcv is the overall
entry point, while tcp_v4_do_rcv does the real TCP protocol work
and delivers the data to user space.

tcp_v4_rcv takes care of the generic steps (checksum validation, socket lookup, and deciding whether to process the segment immediately or defer it to the prequeue/backlog); congestion control, out-of-order reassembly and the rest of the protocol logic happen further down, under tcp_v4_do_rcv.
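A simplified sketch of the receive call chain, from my reading of the v4.5 source:

```
> ip_local_deliver
> > tcp_v4_rcv                // checksum, socket lookup, immediate vs. prequeue/backlog
> > > tcp_v4_do_rcv
> > > > tcp_rcv_established   // fast/slow path for ESTABLISHED sockets
> > > > > tcp_ack             // ACK processing, congestion control
> > > > > tcp_data_queue      // in-order delivery and the out-of-order queue
```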

Read More

how does a tcp server bind a socket

The bound port is managed through the bhash table inside tcp_hashinfo,
the same mechanism used for tcp client port numbers.

Note: the bind syscall does not yet put the socket on the listen queue; that only happens at the listen syscall.
bind merely reserves the port, so that no other program can bind it.
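Both halves of this note can be observed from user space. A minimal sketch (addresses and ports are arbitrary): a second bind() to the same port fails with EADDRINUSE, yet connecting to the port is refused, because nothing has called listen():

```python
import errno
import socket

s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))            # port 0: let the kernel pick a free port
port = s1.getsockname()[1]           # this port is now held in bhash, but there is no listen queue

# Another socket cannot take the same port (no SO_REUSEADDR / SO_REUSEPORT set).
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))
    bind_again_failed = False
except OSError as e:
    bind_again_failed = (e.errno == errno.EADDRINUSE)

# Connecting is refused: bind() alone never puts the socket on the listening hash.
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    c.connect(("127.0.0.1", port))
    connect_refused = False
except ConnectionRefusedError:
    connect_refused = True

print(bind_again_failed, connect_refused)
```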

Read More

basic data structure of inet socket: inetsw

An array of protocol lists

inet is somewhat of an exception: because the socket types inet supports are so complex (maybe), an array of linked lists, inetsw, was introduced.
As the comment says, inetsw is the basis of inet socket creation: it holds all the information needed to create an inet socket.

inetsw

    /* The inetsw table contains everything that inet_create needs to
     * build a new socket.
     */
    static struct list_head inetsw[SOCK_MAX];
    static DEFINE_SPINLOCK(inetsw_lock);

inetsw is an array of list heads; each list holds entries of the same type (see socket type).
Each node is a struct inet_protosw, and is inserted into the list for
its type via inet_register_protosw.
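A simplified sketch of how inet_create consults the table (condensed from af_inet.c; the wildcard-protocol handling is omitted):

```
> socket(AF_INET, type, protocol)
> > inet_create
> > > list_for_each_entry_rcu(answer, &inetsw[sock->type], list)
> > > > if (protocol == answer->protocol) break    // found the matching inet_protosw
> > > sock->ops = answer->ops
> > > sk_alloc(..., answer->prot)                  // socket built from the entry's info
```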

Read More

How to select tcp client port

upstream source v4.2+ (commit ID: a794b4f).

key var

tcp_hashinfo

    struct inet_hashinfo tcp_hashinfo;

call chain

> connect syscall
> > sock->ops->connect
> > ==> tcp_v4_connect
> > > inet_hash_connect
> > > > port_offset = inet_sk_port_offset(sk);
> > > > __inet_hash_connect
> > > > > inet_get_local_port_range
> > > > > for each port in range: start with port_offset.
> > > > > > inet_is_local_reserved_port //skip the reserved ports.

There is an important data structure, struct inet_hashinfo;
the corresponding variable is tcp_hashinfo.

It consists of three parts:

  1. ehash: the established hash, keyed by (netnamespace, saddr, daddr, sport, dport). Each struct sock is linked onto its hash chain via &sk->sk_nulls_node.
  2. bhash: keyed by netnamespace and local port.
    Each hash-list node is a struct inet_bind_bucket, linked onto the hash chain through its node field.
    One node corresponds to one local port. Under each node hangs an owner list, and num_owners is that list's length.
    Each struct sock is linked onto the owner list via sk->sk_bind_node.
  3. listening_hash: used by tcp servers; details to be filled in later.
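The result of this port selection is visible from user space. A minimal sketch (Linux-specific, loopback addresses chosen arbitrarily): a connect() with no prior bind() gets its source port from __inet_hash_connect, inside the range published in /proc/sys/net/ipv4/ip_local_port_range:

```python
import socket

# A listener to connect to (loopback, kernel-chosen port).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())       # no bind(): the kernel picks the source port
sport = cli.getsockname()[1]

# The chosen port falls inside the configured ephemeral range.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    lo, hi = map(int, f.read().split())

print(sport, lo, hi)
```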

Read More

bridge zero copy transmit

Red Hat Enterprise Linux 7 fully supports it

What the Red Hat 7 documentation says

5.5.1. Bridge Zero Copy Transmit

Zero copy transmit mode is effective on large packet sizes. It typically reduces the host CPU overhead by up to 15% when transmitting large packets between a guest network and an external network, without affecting throughput.
It does not affect performance for guest-to-guest, guest-to-host, or small packet workloads.
Bridge zero copy transmit is fully supported on Red Hat Enterprise Linux 7 virtual machines, but disabled by default. To enable zero copy transmit mode, set the experimental_zcopytx kernel module parameter for the vhost_net module to 1.
NOTE
An additional data copy is normally created during transmit as a threat mitigation technique against denial of service and information leak attacks. Enabling zero copy transmit disables this threat mitigation technique.
If performance regression is observed, or if host CPU utilization is not a concern, zero copy transmit mode can be disabled by setting experimental_zcopytx to 0.
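Concretely, that module parameter can be set persistently through a modprobe configuration file (the file name below is my own choice; any .conf file under /etc/modprobe.d works):

```
# /etc/modprobe.d/vhost-net.conf
options vhost_net experimental_zcopytx=1
```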

Read More