dst garbage

dst garbage summary

Garbage collection is a common technique in the kernel.
When an object (a struct or a block of memory) becomes invalid,
it has to be freed, but it may still be referenced by others.

For example, a dst_entry may no longer be valid while it is
still referenced (in use) by others.

__dst_free is called in this case.
It first marks the dst as dead,
and then puts it on dst_garbage.list, chained through dst->next.

A workqueue task then checks the dst's reference count,
and frees (destroys) it once nothing references it any more.

The two key pieces are dst_garbage and dst_gc_work.
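
The deferred-free idea can be tried out in user space. This is only a minimal sketch, not the kernel code: struct toy_dst, toy_garbage_list and the toy_* functions are made-up names, the reference count is a plain int, and there is no locking and no real workqueue.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dst_entry; only what this sketch needs. */
struct toy_dst {
    int refcnt;            /* number of users still holding the entry */
    bool dead;             /* set once the entry has been invalidated */
    struct toy_dst *next;  /* links the entry into the garbage list */
};

/* Hypothetical stand-in for dst_garbage.list. */
static struct toy_dst *toy_garbage_list;

struct toy_dst *toy_dst_alloc(void)
{
    return calloc(1, sizeof(struct toy_dst));
}

/* The __dst_free step: mark the entry dead, then chain it onto the list. */
void toy_dst_free(struct toy_dst *dst)
{
    dst->dead = true;
    dst->next = toy_garbage_list;
    toy_garbage_list = dst;
}

/* The workqueue step: free unreferenced entries and keep the rest.
 * Returns how many entries are still waiting on the list. */
int toy_gc_run(void)
{
    struct toy_dst **pp = &toy_garbage_list;
    int remaining = 0;

    while (*pp) {
        struct toy_dst *dst = *pp;

        if (dst->refcnt == 0) {
            *pp = dst->next;   /* unlink, then release */
            free(dst);
        } else {
            pp = &dst->next;   /* still in use: leave it for the next pass */
            remaining++;
        }
    }
    return remaining;
}
```

An entry with refcnt > 0 survives every toy_gc_run pass until its last user drops the reference, which is the behaviour described above.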

dst ops

Call trace

Forwarding a packet:

> ip_rcv_finish
> > ip_route_input_noref
> > > ip_route_input_slow
> > > > fib_lookup
> > > > ip_mkroute_input
> > dst_input(skb)

Inside ip_mkroute_input:

> > > > ip_mkroute_input
> > > > > __mkroute_input
> > > > > > rth = rt_dst_alloc(...)
> > > > > > skb_dst_set(skb, &rth->dst);

how to xmit a packet with Qdisc

summary

Consider an ideal, simple case:

Call Trace:

> dev_queue_xmit
> > __dev_queue_xmit(skb, NULL);
> > > rcu_read_lock_bh();
> > > txq = netdev_pick_tx(dev, skb, accel_priv);
> > > q = rcu_dereference_bh(txq->qdisc);
> > > rc = __dev_xmit_skb(skb, q, dev, txq);
> > > > skb_dst_force(skb);
> > > > q->enqueue(skb, q);
> > > > qdisc_run_begin(q)
> > > > __qdisc_run(q);
> > > > > while (qdisc_restart(q))
> > > > > > __netif_schedule
> > > > > qdisc_run_end(q)
> > > rcu_read_unlock_bh();
> > > return rc;
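
The enqueue-then-drain shape of this trace can be sketched in user space. This is an assumption-laden toy, not the kernel path: struct toy_qdisc and the toy_* functions are made up, packets are plain ints, "sending" only bumps a counter, and there is no locking, quota or rescheduling.

```c
#include <stdbool.h>

#define TOY_QLEN 8

/* Hypothetical, much simplified queue; not the kernel's struct Qdisc. */
struct toy_qdisc {
    int pkts[TOY_QLEN];   /* ring buffer of packet ids */
    int head, count;
    bool running;         /* plays the role of the running flag */
    int sent;             /* packets handed to the "driver" */
};

/* Counterpart of q->enqueue(): returns false when the queue is full. */
bool toy_enqueue(struct toy_qdisc *q, int pkt)
{
    if (q->count == TOY_QLEN)
        return false;
    q->pkts[(q->head + q->count) % TOY_QLEN] = pkt;
    q->count++;
    return true;
}

/* Counterpart of one qdisc_restart() round: send one packet if any. */
static bool toy_restart(struct toy_qdisc *q)
{
    if (q->count == 0)
        return false;
    q->head = (q->head + 1) % TOY_QLEN;
    q->count--;
    q->sent++;
    return true;
}

/* Counterpart of __dev_xmit_skb(): enqueue, then drain unless a run
 * is already in progress. */
bool toy_xmit(struct toy_qdisc *q, int pkt)
{
    if (!toy_enqueue(q, pkt))
        return false;
    if (!q->running) {          /* qdisc_run_begin() */
        q->running = true;
        while (toy_restart(q))  /* __qdisc_run() */
            ;
        q->running = false;     /* qdisc_run_end() */
    }
    return true;
}
```

If the flag is already set when toy_xmit is called, the packet is only queued; the next run that gets the flag drains it together with its own packet.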

Qdisc running flag

Summary

In struct Qdisc, there are two similar fields.
The running flag is stored in __state of struct Qdisc, NOT in state.
Each time a packet is sent from the qdisc, qdisc_run_begin sets the
running flag, and qdisc_run_end clears it afterwards.

unsigned long           state;
...
unsigned int            __state;
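
A minimal sketch of how such a flag can be set and cleared, assuming a single thread (no atomics, no locks). The bit value TOY_STATE_RUNNING, struct toy_qdisc_state and the toy_* helpers are hypothetical names for illustration; the real flag definitions live in the kernel headers.

```c
#include <stdbool.h>

/* Hypothetical bit value, chosen only for this sketch. */
#define TOY_STATE_RUNNING 0x1u

struct toy_qdisc_state {
    unsigned long state;    /* other state bits, untouched here */
    unsigned int  __state;  /* holds the running flag */
};

bool toy_is_running(const struct toy_qdisc_state *q)
{
    return (q->__state & TOY_STATE_RUNNING) != 0;
}

/* Returns true if the caller took the run, false if one is in progress. */
bool toy_run_begin(struct toy_qdisc_state *q)
{
    if (toy_is_running(q))
        return false;
    q->__state |= TOY_STATE_RUNNING;
    return true;
}

/* Clears the flag so that the next sender can start a run. */
void toy_run_end(struct toy_qdisc_state *q)
{
    q->__state &= ~TOY_STATE_RUNNING;
}
```

Only __state changes in this sketch; state is never touched, which mirrors the point that the two fields are separate.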

TODO

Why is busylock needed?

how to create dev qdisc

Summary

Part 1: Register a multi-queue net device.

This part only prepares the qdisc framework,
and noop_qdisc is set as the default.

Prepare the netdev_queues.

For example, Intel igb hardware has 8 hardware TX queues,
and the NIC driver creates 8 corresponding struct netdev_queue
entries in the _tx array of struct net_device.

Prepare mq_qdisc.

The mq_qdisc is attached to the corresponding device.
In the private data of mq_qdisc, a default qdisc is
created for each of the NIC's hardware queues.
This is done in mq_init.
The default qdisc uses pfifo_fast_ops.

Attach mq_qdisc to netdev_queue.

In mq_attach, these qdiscs are attached to their corresponding
struct netdev_queue.
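
The three steps of Part 1 can be sketched with toy types. Everything named toy_* here is hypothetical and only mirrors the roles described above (the _tx array, the mq private data, the noop default); it is not how the kernel structures are actually laid out.

```c
#include <stddef.h>

#define TOY_NUM_TXQ 8   /* the 8-queue example from the text */

struct toy_q { const char *ops_id; };                       /* a qdisc */
struct toy_txq { struct toy_q *qdisc; };                    /* netdev_queue role */
struct toy_netdev { struct toy_txq tx[TOY_NUM_TXQ]; };      /* _tx array role */
struct toy_mq_priv { struct toy_q children[TOY_NUM_TXQ]; }; /* mq private data */

static struct toy_q toy_noop = { "noop" };

/* Registration: every TX queue starts out pointing at the noop qdisc. */
void toy_register(struct toy_netdev *dev)
{
    for (size_t i = 0; i < TOY_NUM_TXQ; i++)
        dev->tx[i].qdisc = &toy_noop;
}

/* The mq_init step: create one default child qdisc per hardware queue. */
void toy_mq_init(struct toy_mq_priv *priv)
{
    for (size_t i = 0; i < TOY_NUM_TXQ; i++)
        priv->children[i].ops_id = "pfifo_fast";
}

/* The mq_attach step: hand each child to its matching TX queue. */
void toy_mq_attach(struct toy_netdev *dev, struct toy_mq_priv *priv)
{
    for (size_t i = 0; i < TOY_NUM_TXQ; i++)
        dev->tx[i].qdisc = &priv->children[i];
}
```

Calling toy_register, toy_mq_init and toy_mq_attach in that order leaves queue i pointing at child i instead of the noop default.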

Part 2: Activate a net device with the right qdiscs

Only the mq_qdisc case is traced here.
When the device is brought up, dev_open is called, which in turn calls dev_activate.

qdisc study part1: qdisc_base

Qdisc_ops is the core of a Qdisc.
All Qdisc_ops instances are linked into one list headed by qdisc_base.
The key that tells one Qdisc_ops from another is id[IFNAMSIZ].

Note: the list is a singly linked list, not the kernel's usual struct list_head list.

struct Qdisc_ops {
    struct Qdisc_ops        *next;
    const struct Qdisc_class_ops    *cl_ops;
    char                    id[IFNAMSIZ];
    int                     priv_size;

    int                     (*enqueue)(struct sk_buff *, struct Qdisc *);
    struct sk_buff *        (*dequeue)(struct Qdisc *);
    struct sk_buff *        (*peek)(struct Qdisc *);
    unsigned int            (*drop)(struct Qdisc *);

    int                     (*init)(struct Qdisc *, struct nlattr *arg);
    void                    (*reset)(struct Qdisc *);
    void                    (*destroy)(struct Qdisc *);
    int                     (*change)(struct Qdisc *, struct nlattr *arg);
    void                    (*attach)(struct Qdisc *);

    int                     (*dump)(struct Qdisc *, struct sk_buff *);
    int                     (*dump_stats)(struct Qdisc *, struct gnet_dump *);

    struct module           *owner;
};

qdisc_base

/* The list of all installed queueing disciplines. */

static struct Qdisc_ops *qdisc_base;
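
To see how a singly linked list keyed by id works, here is a small user-space sketch. struct toy_qdisc_ops keeps only the next pointer and the id, TOY_IFNAMSIZ is assumed to be 16, and toy_register_qdisc / toy_lookup_qdisc are illustrative names; locking and the real validity checks are left out.

```c
#include <stddef.h>
#include <string.h>

#define TOY_IFNAMSIZ 16

/* Hypothetical cut-down Qdisc_ops: only the list link and the id key. */
struct toy_qdisc_ops {
    struct toy_qdisc_ops *next;
    char id[TOY_IFNAMSIZ];
};

/* Head of the singly linked list, in the role of qdisc_base. */
static struct toy_qdisc_ops *toy_qdisc_base;

/* Append ops at the tail; returns -1 if the id is already registered. */
int toy_register_qdisc(struct toy_qdisc_ops *ops)
{
    struct toy_qdisc_ops **qp;

    for (qp = &toy_qdisc_base; *qp; qp = &(*qp)->next)
        if (strcmp(ops->id, (*qp)->id) == 0)
            return -1;
    ops->next = NULL;
    *qp = ops;      /* qp now points at the tail's next field */
    return 0;
}

/* Walk the list and return the ops with a matching id, or NULL. */
struct toy_qdisc_ops *toy_lookup_qdisc(const char *id)
{
    struct toy_qdisc_ops *q;

    for (q = toy_qdisc_base; q; q = q->next)
        if (strcmp(id, q->id) == 0)
            return q;
    return NULL;
}
```

The pointer-to-pointer walk lets one loop both detect a duplicate id and find the tail, so no separate prev pointer is needed.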

how to remember ip/tcp/udp header

IP HEADER

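When I was job hunting, I was often asked which fields the IP header contains.
Now I find that once you really understand it, it is not that hard to remember.

When you set out on a trip, you always need to remember your destination and origin (src/dst ip),
and you roughly know how many transfers you will make (ttl).

Roads come big and small; there are highways and dirt roads.
Going from a big road onto a small one requires fragmentation, and going from a small road onto a big one may mean reassembly.
Getting on the highway requires a pass (qos/tos).

There are police on the road who check whether you are legal (csum, header len, ip option)
and what cargo you are carrying (protocol).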
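
The fields from the story line up with the fixed 20-byte IPv4 header. The decoder below is a sketch for illustration only: struct toy_ipv4 and toy_ipv4_parse are made-up names, the layout follows the standard IPv4 header, and the checksum is read but not verified.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded view of the fixed part of an IPv4 header. */
struct toy_ipv4 {
    uint8_t  version;      /* 4 for IPv4 */
    uint8_t  header_len;   /* in bytes; more than 20 means IP options */
    uint8_t  tos;          /* the "highway pass" */
    uint16_t total_len;
    uint16_t id;           /* shared by all fragments of one datagram */
    uint16_t frag_off;     /* flags plus fragment offset */
    uint8_t  ttl;          /* remaining hops, the "transfers" */
    uint8_t  protocol;     /* the "cargo": 6 = TCP, 17 = UDP */
    uint16_t csum;
    uint32_t saddr, daddr; /* origin and destination */
};

static uint16_t toy_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t toy_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode network-order bytes; returns -1 on a short or malformed header. */
int toy_ipv4_parse(const uint8_t *buf, size_t len, struct toy_ipv4 *h)
{
    if (len < 20)
        return -1;
    h->version = buf[0] >> 4;
    h->header_len = (uint8_t)((buf[0] & 0x0f) * 4);
    if (h->version != 4 || h->header_len < 20 || h->header_len > len)
        return -1;
    h->tos = buf[1];
    h->total_len = toy_be16(buf + 2);
    h->id = toy_be16(buf + 4);
    h->frag_off = toy_be16(buf + 6);
    h->ttl = buf[8];
    h->protocol = buf[9];
    h->csum = toy_be16(buf + 10);
    h->saddr = toy_be32(buf + 12);
    h->daddr = toy_be32(buf + 16);
    return 0;
}
```

Reading the bytes one by one avoids bitfields, so the sketch behaves the same on little-endian and big-endian hosts.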

how does tcp server accept a new connection request

summary

Two packets are processed on the TCP server side:

  1. SYN packet: for the first packet (SYN) of the handshake, the server first looks up the listen socket
    and creates a temporary req socket. It then sends a SYN+ACK packet to the client.
  2. ACK packet: for the third packet (ACK) of the handshake, the server looks up the req socket created
    in the previous step.

SYN packet call trace

Kernel version: v6.14

=> tcp_v4_rcv(struct sk_buff *skb)
=> => sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
=> => tcp_v4_do_rcv(sk, skb);
=> => => struct sock *nsk = tcp_v4_hnd_req(sk, skb)
=> => => tcp_rcv_state_process
=> => => => if (icsk->icsk_af_ops->conn_request(sk, skb) < 0) // equivalent to calling tcp_v4_conn_request
=> => => => tcp_v4_conn_request(&tcp_request_sock_ops, &tcp_request_sock_ipv4_ops, sk, skb);
=> => => => => inet_reqsk_alloc
=> => => => => tcp_openreq_init(req, &tmp_opt, skb);
=> => => => => inet_csk_reqsk_queue_hash_add
=> => => => => => reqsk_queue_hash_req
=> => => => => => => inet_ehash_insert // put the req sock into the established hash table
=> => => => => => inet_csk_reqsk_queue_added
=> => => => => => => reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue); // the half-open connection queue, although its real name is accept; it only counts now and no longer holds a queue
=> => => => => => => => atomic_inc(&queue->young);
=> => => => => => => => atomic_inc(&queue->qlen);
=> => => => => af_ops->send_synack // send the SYN+ACK packet
=> => => => return // tcp_v4_conn_request
=> => => return; // tcp_rcv_state_process
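Notes
  • tcp_request_sock_ops: this is the name of a struct and, at the same time, the name of a variable.
  • icsk_accept_queue: this queue in struct inet_connection_sock no longer stores the req sockets of the half-open connection queue; it only counts them.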
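
The two server-side steps from the summary (SYN creates a temporary req, ACK finds it again) can be sketched as follows. This is a toy model under heavy assumptions: struct toy_req, struct toy_listener and the toy_on_* functions are invented for illustration, a small array replaces the kernel hash table, and sequence numbers, timers and the actual SYN+ACK transmission are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REQ 4

/* Hypothetical request sock: just the peer that sent the SYN. */
struct toy_req {
    bool used;
    uint32_t peer_ip;
    uint16_t peer_port;
};

/* Hypothetical listener: a small table instead of a hash table. */
struct toy_listener {
    struct toy_req reqs[TOY_MAX_REQ];
    int qlen;          /* pending requests, in the role of queue->qlen */
    int established;   /* connections that completed the handshake */
};

static struct toy_req *toy_find_req(struct toy_listener *l,
                                    uint32_t ip, uint16_t port)
{
    for (int i = 0; i < TOY_MAX_REQ; i++)
        if (l->reqs[i].used && l->reqs[i].peer_ip == ip &&
            l->reqs[i].peer_port == port)
            return &l->reqs[i];
    return NULL;
}

/* Step 1, SYN: create a temporary req; true means a SYN+ACK would be sent. */
bool toy_on_syn(struct toy_listener *l, uint32_t ip, uint16_t port)
{
    if (toy_find_req(l, ip, port))
        return true;   /* retransmitted SYN: answer again, keep one req */
    for (int i = 0; i < TOY_MAX_REQ; i++) {
        if (!l->reqs[i].used) {
            l->reqs[i] = (struct toy_req){ true, ip, port };
            l->qlen++;
            return true;
        }
    }
    return false;      /* table full: drop the SYN */
}

/* Step 2, ACK: find the req created in step 1 and promote it. */
bool toy_on_ack(struct toy_listener *l, uint32_t ip, uint16_t port)
{
    struct toy_req *req = toy_find_req(l, ip, port);

    if (!req)
        return false;  /* no matching SYN was seen */
    req->used = false;
    l->qlen--;
    l->established++;
    return true;
}
```

An ACK without a preceding SYN is rejected because the lookup finds no req, which is the reason the req socket has to be stored somewhere between the two packets.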
