Docker Networking (Bridge)

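Docker networking is the set of techniques that lets containers communicate with one another reliably. Docker itself provides four network modes: Bridge, Host, None, and Container. This article focuses on Bridge.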

Environment

  • Docker version
    Docker version 1.12.5, build 7392c3b
  • OS version
    Red Hat Enterprise Linux Server release 7.2 (Maipo)
  • Kernel version
    Linux 3.10.0-327.el7.x86_64

Introduction to Bridges

Long, long ago :-)

docker-network-bridge-br0

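In early layer-2 networks, a bridge connected separate LANs. When host1 sends a packet, the other hosts on LAN1 and the bridge br0 all receive it. The bridge then floods the packet from the ingress port to its other ports (as I understand it, a multi-port bridge is what we call a switch). Hosts on LAN2 therefore also receive the packet sent by host1, which is how all hosts on the different LANs can communicate.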

docker-network-bridge-linux

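Later, the Linux kernel borrowed the principle of the hardware bridge to implement a virtual bridge. Together with veth pairs, it provides the layer-2 foundation for communication between different subnets.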

Docker Bridge

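(On to the main topic.) A Docker bridge is neither a plain Linux bridge nor a hardware bridge device, but it is built on three things: Linux bridge, network namespaces, and iptables.

  • Network Namespace
    Isolates the subnets from one another.
  • iptables
    Handles NAT mapping, so that containers can reach, and be reached from, external networks.
  • linux bridge
    Enables communication across subnets within the host.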

docker-network-bridge-main

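In bridge mode, the Docker daemon attaches veth0 to the docker0 bridge, so that packets from the host can be delivered to veth0. It then moves veth1 into the container's network namespace, so that any packet the host sends to veth0 is immediately received by veth1. For a container to reach external networks, NAT is required; more precisely, NAPT (Network Address Port Translation). NAPT is used here in two forms: SNAT and DNAT.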

Inbound access flow

docker-network-bridge-downflow
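Because the container's IP and port are not visible from outside, the packet's destination is the host's IP and port, 192.168.1.10:24.
The packet travels through the router to the host's eth0, which forwards it to the docker0 bridge.
A DNAT rule (Destination NAT, which rewrites a packet's destination address) translates the destination into the container's IP and port, 172.17.0.n:24. The docker0 bridge on the host recognizes the container's IP and port and sends the packet to the veth0 interface attached to it; veth0 passes the packet to veth1 inside the container, and the container receives it and responds.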
docker-network-bridge-downflow-detail
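
The DNAT step described above can be sketched as a single iptables rule. The helper below only prints the rule, so it can be read without root. The name print_dnat_rule and the container address 172.17.0.2 are assumptions for illustration (the text only gives 172.17.0.n), and Docker keeps its real rules in its own DOCKER chain rather than directly in PREROUTING.

```shell
# Print (not apply) a DNAT rule that forwards a host TCP port to a container.
# print_dnat_rule is a hypothetical helper; 172.17.0.2 is an assumed container
# address standing in for the 172.17.0.n of the text.
print_dnat_rule() {
    host_port="$1"       # port exposed on the host, e.g. 24
    container_ip="$2"    # address of the container on docker0
    container_port="$3"  # port the container listens on
    echo "iptables -t nat -A PREROUTING -p tcp --dport ${host_port}" \
         "-j DNAT --to-destination ${container_ip}:${container_port}"
}

print_dnat_rule 24 172.17.0.2 24
```

On a real Docker host, the rules actually installed can be inspected with sudo iptables -t nat -L -n.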

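Outbound access flow

Since the inbound flow above relies on DNAT, one would expect the outbound flow to be simply a matter of SNAT. In practice, that is not the whole story.
Requests from inside a container can leave the host because ip_forward is enabled on the host. If forwarding is turned off on the host (echo 0 > /proc/sys/net/ipv4/ip_forward), requests from a container only reach containers on its own network segment; other segments, and segments on other hosts, become unreachable. With default settings Docker typically also adds a MASQUERADE rule for the bridge subnet to the POSTROUTING chain of the nat table, so the source address of forwarded packets is still rewritten on the way out; forwarding is what allows those packets to be routed in the first place.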
docker-network-bridge-upflow-detail
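
To check the forwarding switch discussed above without changing it, a small helper along these lines may be handy. It is a minimal sketch: the name ip_forward_state and its optional path argument are hypothetical, the latter added only so the logic can be tried against an ordinary file.

```shell
# Report the state of IPv4 forwarding on this host.
# ip_forward_state is a hypothetical helper; its optional path argument exists
# only so the logic can be exercised against a plain file without root.
ip_forward_state() {
    file="${1:-/proc/sys/net/ipv4/ip_forward}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    case "$(cat "$file")" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Reading the switch needs no special privileges; changing it does.
ip_forward_state || true
```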

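Key technologies behind Docker bridge

Docker bridge makes full use of Linux bridge, iptables, namespaces, and related technologies. The many commands involved can be wrapped in automation scripts to make them easier to run and maintain. For example, pipework is a shell script for managing bridge networking.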

veth pair

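A veth pair is a pair of virtual network interfaces: a packet sent out of one veth interface arrives directly at its peer, as if a virtual link connected the two. A veth interface differs from a regular Ethernet interface only in its xmit routine, which hands the data to the peer and triggers the peer's Rx path.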
docker-network-bridge-vethpair
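A veth pair is a way to communicate between different network namespaces: it carries data from one network namespace to the veth end in another. When several network namespaces need to communicate with each other, a bridge is needed as well.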

veth pairs are created with ip-link, which is part of the iproute2 package.

Create the veth pairs: vp16 is paired with vp19, and vp26 with vp29.

$ sudo ip link add vp16 type veth peer name vp19
$ sudo ip link add vp26 type veth peer name vp29
$ sudo ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 00:1c:42:c6:de:63 brd ff:ff:ff:ff:ff:ff
3: vp19@vp16: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b6:62:99:1e:0c:2a brd ff:ff:ff:ff:ff:ff
4: vp16@vp19: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1e:ea:cf:86:51:ab brd ff:ff:ff:ff:ff:ff
5: vp29@vp26: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 4e:16:07:e6:77:2c brd ff:ff:ff:ff:ff:ff
6: vp26@vp29: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether be:98:08:90:da:7a brd ff:ff:ff:ff:ff:ff

Create the namespaces ns19 and ns29, then set the netns of vp19 and vp29 accordingly.

$ sudo ip netns add ns19
$ sudo ip netns add ns29
$ sudo ip netns list
ns19
ns29

$ sudo ip link set netns ns19 vp19
$ sudo ip link set netns ns29 vp29
$ sudo ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 00:1c:42:c6:de:63 brd ff:ff:ff:ff:ff:ff
4: vp16@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1e:ea:cf:86:51:ab brd ff:ff:ff:ff:ff:ff link-netnsid 0
6: vp26@if5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether be:98:08:90:da:7a brd ff:ff:ff:ff:ff:ff link-netnsid 1

Once the netns is set, vp19 and vp29 no longer appear when listing interfaces in the current namespace, but each one shows up inside ns19 and ns29 respectively.

$ sudo ip netns exec ns19 ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: vp19@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b6:62:99:1e:0c:2a brd ff:ff:ff:ff:ff:ff link-netnsid 0
$ sudo ip netns exec ns29 ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: vp29@if6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 4e:16:07:e6:77:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0

Now bring the four interfaces up and assign IP addresses so that they can communicate with each other.

$ sudo ip link set dev vp16 up
$ sudo ip link set dev vp26 up
$ sudo ip netns exec ns19 ip link set dev vp19 up
$ sudo ip netns exec ns29 ip link set dev vp29 up

$ sudo ip addr add 192.168.200.16/24 dev vp16
$ sudo ip addr add 192.168.200.26/24 dev vp26
$ sudo ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:1c:42:c6:de:63 brd ff:ff:ff:ff:ff:ff
inet 192.168.3.5/24 brd 192.168.3.255 scope global enp0s5
valid_lft forever preferred_lft forever
inet6 fe80::21c:42ff:fec6:de63/64 scope link
valid_lft forever preferred_lft forever
4: vp16@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 1e:ea:cf:86:51:ab brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.200.16/24 scope global vp16
valid_lft forever preferred_lft forever
inet6 fe80::1cea:cfff:fe86:51ab/64 scope link
valid_lft forever preferred_lft forever
6: vp26@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether be:98:08:90:da:7a brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet 192.168.200.26/24 scope global vp26
valid_lft forever preferred_lft forever
inet6 fe80::bc98:8ff:fe90:da7a/64 scope link
valid_lft forever preferred_lft forever

$ sudo ip netns exec ns19 ip addr add 192.168.200.19/24 dev vp19
$ sudo ip netns exec ns19 ip addr show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: vp19@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether b6:62:99:1e:0c:2a brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.200.19/24 scope global vp19
valid_lft forever preferred_lft forever
inet6 fe80::b462:99ff:fe1e:c2a/64 scope link
valid_lft forever preferred_lft forever

$ sudo ip netns exec ns29 ip addr add 192.168.200.29/24 dev vp29
$ sudo ip netns exec ns29 ip addr show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: vp29@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 4e:16:07:e6:77:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.200.29/24 scope global vp29
valid_lft forever preferred_lft forever
inet6 fe80::4c16:7ff:fee6:772c/64 scope link
valid_lft forever preferred_lft forever

  • Inside ns19 and ns29, neither namespace can ping its own address on the 192.168.200.0 segment.
    Bringing the lo device up in ns19 and ns29 makes each namespace's own address pingable.
  • ns19 can ping 192.168.200.16 and 192.168.200.26, but not 192.168.200.29.
    ns29, by contrast, cannot ping any of these addresses.
  • Once the veth pair belonging to ns29 is moved to a segment of its own (192.168.29.0), the two ends of that pair can ping each other normally.

$ sudo ip netns exec ns19 ping 192.168.200.16
PING 192.168.200.16 (192.168.200.16) 56(84) bytes of data.
64 bytes from 192.168.200.16: icmp_seq=1 ttl=64 time=0.060 ms
64 bytes from 192.168.200.16: icmp_seq=2 ttl=64 time=0.065 ms
...
$ sudo ip netns exec ns19 ping 192.168.200.26
PING 192.168.200.26 (192.168.200.26) 56(84) bytes of data.
64 bytes from 192.168.200.26: icmp_seq=1 ttl=64 time=0.041 ms
64 bytes from 192.168.200.26: icmp_seq=2 ttl=64 time=0.063 ms
64 bytes from 192.168.200.26: icmp_seq=3 ttl=64 time=0.057 ms
...
$ sudo ip netns exec ns29 ping 192.168.200.26 -w 5
PING 192.168.200.26 (192.168.200.26) 56(84) bytes of data.

--- 192.168.200.26 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 4999ms
$ sudo ip netns exec ns29 ping 192.168.200.16 -w 5
PING 192.168.200.16 (192.168.200.16) 56(84) bytes of data.

--- 192.168.200.16 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 4999ms
$ sudo ip netns exec ns29 ip addr delete 192.168.200.29/24 dev vp29
$ sudo ip netns exec ns29 ip addr add 192.168.29.29/24 dev vp29
$ sudo ip addr delete 192.168.200.26/24 dev vp26
$ sudo ip addr add 192.168.29.26/24 dev vp26
$ sudo ip netns exec ns29 ping 192.168.29.26
PING 192.168.29.26 (192.168.29.26) 56(84) bytes of data.
64 bytes from 192.168.29.26: icmp_seq=1 ttl=64 time=0.059 ms
64 bytes from 192.168.29.26: icmp_seq=2 ttl=64 time=0.051 ms
...

brctl

brctl, from the bridge-utils package, is a CLI tool for managing Linux bridges.

Create a bridge device

$ sudo brctl addbr vb
$ sudo brctl show
bridge name     bridge id               STP enabled     interfaces
vb              8000.000000000000       no

Adding the interfaces vp16 and vp26 to the bridge lets vp19 and vp29 communicate.

$ sudo brctl addif vb vp16
$ sudo brctl addif vb vp26
$ sudo brctl show
bridge name     bridge id               STP enabled     interfaces
vb              8000.1eeacf8651ab       no              vp16
                                                        vp26
$ sudo ip netns exec ns29 ping 192.168.200.19
PING 192.168.200.19 (192.168.200.19) 56(84) bytes of data.
64 bytes from 192.168.200.19: icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from 192.168.200.19: icmp_seq=2 ttl=64 time=0.072 ms
...
$ sudo ip netns exec ns19 ping 192.168.200.29
PING 192.168.200.29 (192.168.200.29) 56(84) bytes of data.
64 bytes from 192.168.200.29: icmp_seq=1 ttl=64 time=0.066 ms
64 bytes from 192.168.200.29: icmp_seq=2 ttl=64 time=0.058 ms
...
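
The same bridge setup can also be expressed with iproute2 alone, without bridge-utils. The sketch below mirrors the brctl commands above using the names from this article (vb, vp16, vp26). The function name setup_bridge and the RUN variable are assumptions: RUN defaults to echo, so the commands are only printed unless it is overridden.

```shell
# iproute2 counterpart of the brctl steps, in dry-run form by default.
# setup_bridge and RUN are hypothetical names: RUN=echo (the default) prints
# each command; RUN=sudo would execute it (needs root and iproute2).
setup_bridge() {
    run="${RUN:-echo}"

    # Create the bridge and bring it up (counterpart of: brctl addbr vb).
    $run ip link add name vb type bridge
    $run ip link set dev vb up

    # Attach the host-side veth ends (counterpart of: brctl addif vb vp16/vp26).
    for port in vp16 vp26; do
        $run ip link set dev "$port" master vb
    done

    # List the interfaces attached to vb (roughly what brctl show reports).
    $run ip link show master vb
}

setup_bridge
```

Printing first and executing second makes it easy to review the exact commands before touching the host's network configuration.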

For a namespace to be able to ping every network on the host, a default gateway has to be added.

$ sudo ip netns exec ns19 ping 192.168.3.5
connect: Network is unreachable
$ sudo ip netns exec ns29 ping 192.168.3.5
connect: Network is unreachable
$ sudo ip netns exec ns19 route add default gw 192.168.200.1
$ sudo ip netns exec ns19 route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.200.1   0.0.0.0         UG    0      0        0 vp19
192.168.200.0   0.0.0.0         255.255.255.0   U     0      0        0 vp19
$ sudo ip netns exec ns19 ping 192.168.3.5
PING 192.168.3.5 (192.168.3.5) 56(84) bytes of data.

--- 192.168.3.5 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 2999ms

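With the default gateway configured, "Network is unreachable" turns into a timeout. At this point the IP addresses on vp16 and vp26 need to be removed. The gateway address 192.168.200.1 belongs to the bridge vb itself, as the output below shows.
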
$ sudo ip addr delete 192.168.200.16/24 dev vp16
$ sudo ip addr delete 192.168.200.26/24 dev vp26
$ sudo ip addr show
...
4: vp16@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vb state UP group default qlen 1000
link/ether 1e:ea:cf:86:51:ab brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::1cea:cfff:fe86:51ab/64 scope link
valid_lft forever preferred_lft forever
6: vp26@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vb state UP group default qlen 1000
link/ether be:98:08:90:da:7a brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::bc98:8ff:fe90:da7a/64 scope link
valid_lft forever preferred_lft forever
7: vb: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 1e:ea:cf:86:51:ab brd ff:ff:ff:ff:ff:ff
inet 192.168.200.1/24 scope global vb
valid_lft forever preferred_lft forever
inet6 fe80::1cea:cfff:fe86:51ab/64 scope link
valid_lft forever preferred_lft forever
$ sudo ip netns exec ns19 ping 192.168.3.5
PING 192.168.3.5 (192.168.3.5) 56(84) bytes of data.
64 bytes from 192.168.3.5: icmp_seq=1 ttl=64 time=0.064 ms
64 bytes from 192.168.3.5: icmp_seq=2 ttl=64 time=0.062 ms
...
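
After experimenting, the lab built in this section can be torn down again. The following is a minimal sketch using the names from this article (ns19, ns29, vb). The function name teardown_lab and the RUN variable are assumptions; with the default RUN=echo the commands are only printed.

```shell
# Tear down the namespaces and the bridge created in this section.
# teardown_lab and RUN are hypothetical names: RUN=echo (the default) prints
# each command; RUN=sudo would execute it (needs root and iproute2).
teardown_lab() {
    run="${RUN:-echo}"

    # Deleting a namespace normally destroys the veth end inside it, and a
    # veth end does not outlive its peer, so vp16 and vp26 go away as well.
    for ns in ns19 ns29; do
        $run ip netns del "$ns"
    done

    # Bring the bridge down, then remove it.
    $run ip link set dev vb down
    $run ip link del dev vb
}

teardown_lab
```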

iptables/netfilter

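netfilter is a packet-processing module inside the Linux kernel. It provides network address translation (NAT), packet filtering, packet mangling, address masquerading, transparent proxying, filtering based on user and Media Access Control (MAC) address, stateful filtering, packet rate limiting, and more. netfilter works at layer 3.

iptables is the CLI tool for interacting with netfilter.

With the configuration above, a namespace can only reach the networks on the host itself. To ping other machines on the host's network, enable ip_forward (echo 1 > /proc/sys/net/ipv4/ip_forward; this must be run in a root shell, since prefixing the command with sudo is not enough when the redirection is performed by the unprivileged calling shell), and either turn off the firewall (iptables) or configure a POSTROUTING rule.
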
$ sudo iptables -t nat -A POSTROUTING -s 192.168.200.0/24 -p all -j MASQUERADE
$ sudo iptables -t nat -L -n
Chain PREROUTING (policy ACCEPT)
target prot opt source destination

Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE all -- 192.168.200.0/24 0.0.0.0/0
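
The two steps above (enable forwarding, then masquerade the namespace subnet) can be combined as sketched below. The function name enable_nat and the RUN variable are assumptions; by default the commands are only printed. sysctl -w is used here as an alternative to writing to /proc directly.

```shell
# Enable IPv4 forwarding and masquerade traffic leaving a given subnet.
# enable_nat and RUN are hypothetical names: RUN=echo (the default) prints
# each command; RUN=sudo would execute it (needs root).
enable_nat() {
    run="${RUN:-echo}"
    subnet="$1"   # source subnet to masquerade, e.g. 192.168.200.0/24

    # sysctl -w flips the same switch as writing 1 to
    # /proc/sys/net/ipv4/ip_forward, and it works under sudo because no
    # shell redirection is involved.
    $run sysctl -w net.ipv4.ip_forward=1

    # Rewrite the source address of packets from the subnet as they leave.
    $run iptables -t nat -A POSTROUTING -s "$subnet" -j MASQUERADE
}

enable_nat 192.168.200.0/24
```

Neither setting survives a reboot on its own; persisting them is distribution-specific and outside the scope of this sketch.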

TUN/TAP driver

Todo.

References & Acknowledgements