how does a tcp server bind to a socket

The bound port is managed through the bhash table inside tcp_hashinfo,
using the same mechanism as TCP client port management.

Note: the bind system call does not yet put the socket onto the listening hash; that only happens during the listen system call.
bind merely reserves the port, so other programs can no longer bind to it.
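
To make the note concrete, here is a minimal user-space sketch (bind_port is just a local helper and port 8080 is an arbitrary example): the second bind() fails with EADDRINUSE even though the first socket is not listening yet, because bind alone already claims the port.

/* Demo: bind() alone reserves the port; listen() is a separate later step. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static int bind_port(int port)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(port),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                printf("bind(%d) failed: %s\n", port, strerror(errno));
                close(fd);
                return -1;
        }
        printf("bind(%d) succeeded, fd=%d\n", port, fd);
        return fd;
}

int main(void)
{
        int fd1 = bind_port(8080);      /* reserves the port */
        int fd2 = bind_port(8080);      /* expected to fail: EADDRINUSE */

        if (fd1 >= 0) {
                listen(fd1, 128);       /* only now does the socket become a listener */
                close(fd1);
        }
        if (fd2 >= 0)
                close(fd2);
        return 0;
}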


basic data structure of inet socket: inetsw

An array of protocol lists

The inet family is somewhat special: because it supports so many type/protocol combinations, it introduces inetsw, an array of lists.
As the comment in the source says, inetsw is the foundation of inet socket creation and contains everything needed to create an inet socket.

inetsw

/* The inetsw table contains everything that inet_create needs to
 * build a new socket.
 */
static struct list_head inetsw[SOCK_MAX];
static DEFINE_SPINLOCK(inetsw_lock);

inetsw is an array of list heads; each list contains entries of the same socket type (see socket type).
Each node is a struct inet_protosw, and it is inserted into the list for its type by
inet_register_protosw.
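
For reference, each node looks roughly like this (abridged from include/net/protocol.h; the exact field set varies a bit across kernel versions):

/* This is used to register socket interfaces for IP protocols. */
struct inet_protosw {
        struct list_head list;

        /* These two fields form the lookup key. */
        unsigned short   type;     /* 2nd argument to socket(2) */
        unsigned short   protocol; /* L4 protocol number */

        struct proto            *prot;
        const struct proto_ops  *ops;

        unsigned char    flags;    /* INET_PROTOSW_* flags */
};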


How to select tcp client port

Based on upstream sources, v4.2+ (commit ID: a794b4f).

Key variable

tcp_hashinfo

struct inet_hashinfo tcp_hashinfo;

Call chain

> connect syscall
> > sock->ops->connect
> > ==> tcp_v4_connect
> > > inet_hash_connect
> > > > port_offset = inet_sk_port_offset(sk);
> > > > __inet_hash_connect
> > > > > inet_get_local_port_range
> > > > > for each port in range: start with port_offset.
> > > > > > inet_is_local_reserved_port //skip the reserved ports.
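
The loop at the bottom of this chain can be modeled in user space roughly as follows. This is only an illustrative sketch, not the kernel code: pick_local_port is a made-up name, and is_reserved/is_available are hypothetical stand-ins for inet_is_local_reserved_port and the bind/timewait conflict checks done in __inet_hash_connect.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's reserved-port and conflict checks. */
static bool is_reserved(int port)  { return port == 32770; }
static bool is_available(int port) { return port != 32768; }

static int pick_local_port(unsigned int port_offset, int low, int high)
{
        int remaining = high - low + 1;

        for (int i = 0; i < remaining; i++) {
                int port = low + (int)((port_offset + i) % (unsigned)remaining);

                if (is_reserved(port))
                        continue;       /* skip ip_local_reserved_ports */
                if (is_available(port))
                        return port;    /* no conflicting user of this port */
        }
        return -1;                      /* local port range exhausted */
}

int main(void)
{
        /* 32768..60999 roughly matches the default ip_local_port_range. */
        printf("picked port %d\n", pick_local_port(12345u, 32768, 60999));
        return 0;
}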

There is an important data structure here, struct inet_hashinfo;
its corresponding global variable is tcp_hashinfo.

It consists of three parts (an abridged struct definition follows the list):

  1. ehash: the established hash, keyed by (net namespace, saddr, daddr, sport, dport). Each struct sock is linked onto a hash chain via &sk->sk_nulls_node.
  2. bhash: the bind hash, keyed by net namespace and local port.
    Each hash node is a struct inet_bind_bucket, linked onto its chain via the node field.
    One node corresponds to one local port. Each node also has an owner list; num_owners is the length of that list.
    Each struct sock is linked onto the owner list via sk->sk_bind_node.
  3. listening_hash: used by TCP servers; to be filled in later.
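
For orientation, the structure looks roughly like this (abridged from include/net/inet_hashtables.h in v4.x sources; comments added here, several fields omitted, and the layout varies by kernel version):

struct inet_hashinfo {
        /* 1. established connections, keyed by 4-tuple + netns */
        struct inet_ehash_bucket        *ehash;
        spinlock_t                      *ehash_locks;
        unsigned int                     ehash_mask;
        unsigned int                     ehash_locks_mask;

        /* 2. bound local ports: one inet_bind_bucket per (netns, port) */
        struct inet_bind_hashbucket     *bhash;
        unsigned int                     bhash_size;
        struct kmem_cache               *bind_bucket_cachep;

        /* 3. listening sockets, hashed by local port */
        struct inet_listen_hashbucket    listening_hash[INET_LHTABLE_SIZE];
};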


bridge zero copy transmit

Red Hat Enterprise Linux 7 fully supports it.

What the Red Hat 7 documentation says

5.5.1. Bridge Zero Copy Transmit

Zero copy transmit mode is effective on large packet sizes. It typically reduces the host CPU overhead by up to 15% when transmitting large packets between a guest network and an external network, without affecting throughput.
It does not affect performance for guest-to-guest, guest-to-host, or small packet workloads.
Bridge zero copy transmit is fully supported on Red Hat Enterprise Linux 7 virtual machines, but disabled by default. To enable zero copy transmit mode, set the experimental_zcopytx kernel module parameter for the vhost_net module to 1.
NOTE
An additional data copy is normally created during transmit as a threat mitigation technique against denial of service and information leak attacks. Enabling zero copy transmit disables this threat mitigation technique.
If performance regression is observed, or if host CPU utilization is not a concern, zero copy transmit mode can be disabled by setting experimental_zcopytx to 0.


how tcpdump direction filter works

How tcpdump supports the direction filter.

The key piece of kernel state is the pkt_type field of the skb.
On the receive and transmit paths this field is set to PACKET_OUTGOING or one of the other values.
The value is passed up to user space, and libpcap uses it to decide whether a packet's direction is the one requested (a small AF_PACKET example follows the value list below).

Possible values of pkt_type

#define PACKET_HOST             0       /* To us                */
#define PACKET_BROADCAST        1       /* To all               */
#define PACKET_MULTICAST        2       /* To group             */
#define PACKET_OTHERHOST        3       /* To someone else      */
#define PACKET_OUTGOING         4       /* Outgoing of any type */
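
A minimal user-space sketch of how this value becomes visible outside the kernel (assumes Linux and root privileges): an AF_PACKET socket reports the field as sll_pkttype in struct sockaddr_ll, which is what libpcap's inbound/outbound filtering looks at.

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

int main(void)
{
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
                perror("socket");
                return 1;
        }
        for (;;) {
                char buf[2048];
                struct sockaddr_ll sll;
                socklen_t slen = sizeof(sll);
                ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                     (struct sockaddr *)&sll, &slen);
                if (n < 0) {
                        perror("recvfrom");
                        return 1;
                }
                /* sll_pkttype carries PACKET_HOST, PACKET_OUTGOING, ... */
                printf("%zd bytes, pkt_type=%d%s\n", n, sll.sll_pkttype,
                       sll.sll_pkttype == PACKET_OUTGOING ? " (outgoing)" : "");
        }
}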


vhost net study

The goal of vhost-net is to avoid having to schedule QEMU on the host for the data path, which improves performance.
xmit: the VM's outgoing packets are transmitted directly from the host kernel.
rcv:

Core data structures

vhost_poll is the most important data structure in vhost.

/* Poll a file (eventfd or socket) */
/* Note: there's nothing vhost specific about this structure. */
struct vhost_poll {
        poll_table                table;
        wait_queue_head_t        *wqh;
        wait_queue_t              wait;
        struct vhost_work         work;
        unsigned long             mask;
        struct vhost_dev         *dev;
};
  • table: responsible for putting the wait entry onto wqh; vhost_net_open initializes its
    callback to vhost_poll_func (see the vhost_poll_init sketch after this list).
  • wqh: initialized to point at an eventfd's ctx (its wait queue head).
  • wait: the entry that gets placed on the wqh list. When the guest VM sends a packet, wait is
    removed from the queue and its func is executed; vhost_net_open initializes that func to
    vhost_poll_wakeup, which puts work onto the work_list of the corresponding vhost_dev.
  • work: each vhost_dev has a kernel thread that takes work nodes off the work_list and runs
    each node's fn; fn is the function that does the real work.
    For the rx vhost_virtqueue, vhost_net_open initializes fn to handle_rx_kick;
    for the tx vhost_virtqueue, it initializes fn to handle_tx_kick.
  • mask: the set of eventfd events to listen for.
  • dev: the vhost_dev this vhost_poll belongs to.
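
The place where these fields get wired together is vhost_poll_init, called from vhost_net_open (directly for the socket polls and via vhost_dev_init for the virtqueue kick polls). Roughly, in kernels of that era (abridged from drivers/vhost/vhost.c; details vary by version):

void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
                     unsigned long mask, struct vhost_dev *dev)
{
        /* wait: woken from the eventfd/socket wait queue, runs vhost_poll_wakeup */
        init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
        /* table: vhost_poll_func adds poll->wait to the file's wqh */
        init_poll_funcptr(&poll->table, vhost_poll_func);
        poll->mask = mask;
        poll->dev = dev;
        poll->wqh = NULL;

        /* work: fn (e.g. handle_tx_kick / handle_rx_kick) runs in the
         * per-device vhost worker thread */
        vhost_work_init(&poll->work, fn);
}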


ipvlan study

L2 mode

xmit packet
xmit a normal pkt to another physical machine
==> ipvlan_start_xmit
==> ==> ipvlan_xmit_mode_l2
==> ==> ==> skb->dev = ipvlan->phy_dev;
==> ==> ==> return dev_queue_xmit(skb);
xmit a normal pkt to another namespace
==> ipvlan_start_xmit
==> ==> ipvlan_xmit_mode_l2
==> ==> ==> if (ether_addr_equal(eth->h_dest, eth->h_source))
==> ==> ==> addr = ipvlan_addr_lookup(ipvlan->port, lyr3h, addr_type, true);
==> ==> ==> ipvlan_rcv_frame(addr, skb, true);
==> ==> ==> ==> skb->dev = dev; <== dst namespace dev
==> ==> ==> ==> dev_forward_skb(ipvlan->dev, skb)
==> ==> ==> ==> ==> return __dev_forward_skb(dev, skb) ?: netif_rx_internal(skb);
==> ==> ==> ==> ==> ==> enqueue_to_backlog(skb, get_cpu(), &qtail);
xmit a multicast pkt
==> ipvlan_start_xmit
==> ==> ipvlan_xmit_mode_l2
==> ==> ==> else if (is_multicast_ether_addr(eth->h_dest))
==> ==> ==> ipvlan_multicast_frame(ipvlan->port, skb, ipvlan, true);
==> ==> ==> ==> list_for_each_entry(ipvlan, &port->ipvlans, pnode)
==> ==> ==> ==> ==> dev_forward_skb or netif_rx(nskb);
recv packet

All packets are received through the rx_handler, ipvlan_handle_frame.
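
For context, the handler is attached to the underlying physical device when the ipvlan port is created, roughly like this (a simplified sketch paraphrasing ipvlan_port_create; the function name here is made up, and error handling plus the rest of the setup are omitted):

static int ipvlan_port_create_sketch(struct net_device *dev)
{
        struct ipvl_port *port;

        port = kzalloc(sizeof(*port), GFP_KERNEL);
        if (!port)
                return -ENOMEM;
        port->dev = dev;

        /* From now on every frame received by the master device is
         * funneled through ipvlan_handle_frame with `port` as data. */
        return netdev_rx_handler_register(dev, ipvlan_handle_frame, port);
}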

unicast packet

Look up the destination ipvlan device (net_device)
by the destination IPv4/IPv6 address, and deliver the packet to it.

==> ipvlan_handle_frame
==> ==> ipvlan_handle_mode_l2
==> ==> ==> ipvlan_addr_lookup(port, lyr3h, addr_type, true);
==> ==> ==> ==> skb->dev = dev;
==> ==> ==> ==> dev_forward_skb or ret = RX_HANDLER_ANOTHER;
multicast packet.
==> ipvlan_handle_frame
==> ==> ipvlan_handle_mode_l2
==> ==> ==> if (is_multicast_ether_addr(eth->h_dest)) {
==> ==> ==> ipvlan_addr_lookup(port, lyr3h, addr_type, true);
==> ==> ==> ==> if (ipvlan_external_frame(skb, port))
==> ==> ==> ==> ==> ipvlan_multicast_frame(port, skb, NULL, false);

L3 mode

ipvlan_start_xmit
static const struct net_device_ops ipvlan_netdev_ops = {
        .ndo_init               = ipvlan_init,
        .ndo_uninit             = ipvlan_uninit,
        .ndo_open               = ipvlan_open,
        .ndo_stop               = ipvlan_stop,
        .ndo_start_xmit         = ipvlan_start_xmit,
        .ndo_fix_features       = ipvlan_fix_features,
        .ndo_change_rx_flags    = ipvlan_change_rx_flags,
        .ndo_set_rx_mode        = ipvlan_set_multicast_mac_filter,
        .ndo_get_stats64        = ipvlan_get_stats64,
        .ndo_vlan_rx_add_vid    = ipvlan_vlan_rx_add_vid,
        .ndo_vlan_rx_kill_vid   = ipvlan_vlan_rx_kill_vid,
};
int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct ipvl_dev *ipvlan = netdev_priv(dev);
        struct ipvl_port *port = ipvlan_port_get_rcu(ipvlan->phy_dev);

        if (!port)
                goto out;

        if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr))))
                goto out;

        switch(port->mode) {
        case IPVLAN_MODE_L2:
                return ipvlan_xmit_mode_l2(skb, dev);
        case IPVLAN_MODE_L3:
                return ipvlan_xmit_mode_l3(skb, dev);
        }

        /* Should not reach here */
        WARN_ONCE(true, "ipvlan_queue_xmit() called for mode = [%hx]\n",
                  port->mode);
out:
        kfree_skb(skb);
        return NET_XMIT_DROP;
}
static int ipvlan_xmit_mode_l2(struct sk_buff *skb, struct net_device *dev)
{
        const struct ipvl_dev *ipvlan = netdev_priv(dev);
        struct ethhdr *eth = eth_hdr(skb);
        struct ipvl_addr *addr;
        void *lyr3h;
        int addr_type;

        if (ether_addr_equal(eth->h_dest, eth->h_source)) {
                lyr3h = ipvlan_get_L3_hdr(skb, &addr_type);
                if (lyr3h) {
                        addr = ipvlan_addr_lookup(ipvlan->port, lyr3h, addr_type, true);
                        if (addr)
                                return ipvlan_rcv_frame(addr, skb, true);
                }
                skb = skb_share_check(skb, GFP_ATOMIC);
                if (!skb)
                        return NET_XMIT_DROP;

                /* Packet definitely does not belong to any of the
                 * virtual devices, but the dest is local. So forward
                 * the skb for the main-dev. At the RX side we just return
                 * RX_PASS for it to be processed further on the stack.
                 */
                return dev_forward_skb(ipvlan->phy_dev, skb);

        } else if (is_multicast_ether_addr(eth->h_dest)) {
                u8 ip_summed = skb->ip_summed;

                skb->ip_summed = CHECKSUM_UNNECESSARY;
                ipvlan_multicast_frame(ipvlan->port, skb, ipvlan, true);
                skb->ip_summed = ip_summed;
        }

        skb->dev = ipvlan->phy_dev;
        return dev_queue_xmit(skb);
}

ftrace study

test case

We found that the ixgbe RX softirq poll function, ixgbe_poll, was being called even when no packets were arriving.
How can we prove this, and who calls it?

analysis

From browsing the source, ixgbe_poll should only be invoked via a NAPI schedule.
If that is true, __napi_schedule should be getting called.
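
One way to check this with ftrace is to trace __napi_schedule and record a kernel stack trace at each hit, so the caller shows up in the trace output. A minimal sketch (write_file is a local helper; assumes tracefs is mounted at /sys/kernel/debug/tracing and root privileges; the same steps can of course be done with plain shell redirection):

#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        const char *t = "/sys/kernel/debug/tracing";
        char path[256];

        /* Only trace __napi_schedule ... */
        snprintf(path, sizeof(path), "%s/set_ftrace_filter", t);
        write_file(path, "__napi_schedule");
        /* ... with the function tracer ... */
        snprintf(path, sizeof(path), "%s/current_tracer", t);
        write_file(path, "function");
        /* ... dumping a stack trace for every hit so the caller is visible. */
        snprintf(path, sizeof(path), "%s/options/func_stack_trace", t);
        write_file(path, "1");
        snprintf(path, sizeof(path), "%s/tracing_on", t);
        write_file(path, "1");

        printf("now inspect %s/trace to see who schedules ixgbe's NAPI poll\n", t);
        return 0;
}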
