https://github.com/NunuM/linux_namespaces_tutorial
Tutorial of Linux Namespaces
- Host: GitHub
- URL: https://github.com/NunuM/linux_namespaces_tutorial
- Owner: NunuM
- Created: 2018-12-05T22:15:47.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2018-12-05T22:18:05.000Z (almost 6 years ago)
- Last Synced: 2024-08-01T13:36:27.396Z (3 months ago)
- Topics: linux, namespaces, tutorial
## Linux Namespaces
This tutorial covers Linux namespaces, with the objective of understanding them and using them to implement basic containers. Containers provide a lightweight kind of virtualization:
all software that runs inside one behaves as if it were running on its own physical host. The Linux kernel offers the tools to achieve this behaviour, namely namespaces, which were added gradually: mount namespaces came first (kernel 2.4.19) and cgroup namespaces last (kernel 4.6). The table below lists seven of them:

| Namespaces | Constant | Isolates |
| ------------- |:-------------:| -----:|
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues |
| Network | CLONE_NEWNET | Network devices, stacks, ports, etc. |
| Mount | CLONE_NEWNS | Mount points |
| PID | CLONE_NEWPID | Process IDs |
| User | CLONE_NEWUSER | User and group IDs |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name |

Each namespace wraps a particular global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.
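To see which namespace instances a process actually belongs to, the kernel exposes one symbolic link per namespace type under `/proc/<pid>/ns`. The following is a minimal sketch, not part of the original tutorial; the helper names `ns_inode` and `list_namespaces` are made up for illustration, and it assumes a Linux system with `/proc` mounted.

```bash
#!/bin/sh
# Sketch: list the namespaces a process belongs to by reading /proc/<pid>/ns.
# Each entry is a symlink whose target looks like "uts:[4026531838]".

# ns_inode LINK_TARGET (hypothetical helper)
# Extract the inode number from a link target such as "uts:[4026531838]".
# Prints nothing when the text does not have that shape.
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# list_namespaces PID (hypothetical helper)
# Print "<entry> <inode>" for every namespace link of PID.
list_namespaces() {
    pid=${1:-$$}
    for link in /proc/"$pid"/ns/*; do
        [ -L "$link" ] || continue          # skip if /proc is unavailable
        target=$(readlink "$link") || continue
        printf '%s %s\n' "$(basename "$link")" "$(ns_inode "$target")"
    done
}

# Show the namespaces of the current shell.
list_namespaces "$$"
```

Two processes are in the same namespace of a given type exactly when the inode numbers printed for that type are equal, which is a quick way to check whether a container really got its own UTS or network namespace.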
## Cgroups
Control groups (cgroups) make it possible to administer a number of resources and set limits on them by using resource controllers, also known as cgroup subsystems.
With this we can manage CPU, memory, network bandwidth and I/O amongst hierarchically ordered groups of processes. The hierarchy is composed of slices. Slices do not contain
processes directly; instead, they provide a blueprint for organizing a hierarchy around processes. A slice can contain scopes (transient processes, e.g. VMs and user sessions) and services (system services, normally started via systemd).

                 systemd
                    !
                    !
              ______!______
              !     !     !
          Service Scope Slice

There are four default slices:
1) -.slice: The root slice at the top of the cgroup tree
2) system.slice: The default place for all system services
3) user.slice: The default place for all user sessions
4) machine.slice: VMs and containers
![SystemD](https://image.ibb.co/jhhW5d/Screenshot_from_2018_05_13_23_28_08.png)
```bash
# See the cgroup hierarchy of processes
systemd-cgls

# See the number of tasks, CPU consumption, memory and I/O
systemd-cgtop

# Run a transient process in a new slice
systemd-run --unit=top --slice=nuno.slice top -b

# Check that it is running
systemctl status nuno.slice

# Check the hierarchy again
systemd-cgls

# Stop it
systemctl stop nuno.slice

# Apply a cgroup limit
# Run a program that requires 2 GB of RAM
echo "while true; do memhog 2g; sleep 2; done" > memhogtest.sh

# The quoted 'EOF' keeps ${MAINPID} literal so systemd, not the shell, expands it.
# ExecStart assumes the script above was saved as /home/nuno/memhogtest.sh.
cat <<'EOF' > memhogtest.service
[Unit]
After=network.target remote-fs.target nss-lookup.target

[Service]
Type=notify
ExecStart=/home/nuno/memhogtest.sh -DBACKGROUND
ExecStop=/bin/kill -WINCH ${MAINPID}
KillSignal=SIGTERM
PrivateTmp=true

[Install]
WantedBy=multi-user.target
EOF
cp memhogtest.service /usr/lib/systemd/system
systemctl daemon-reload
systemctl enable memhogtest.service

# Enforce the RAM limit
systemctl set-property --runtime memhogtest.service MemoryLimit=1G
systemctl daemon-reload
# Check whether the process was killed
systemctl status memhogtest.service
```
![SystemDD](https://image.ibb.co/c7Q7dy/Screenshot_from_2018_05_13_23_30_42.png)
Another way to limit process resources is to manipulate the files that the kernel exposes. For our goal, we will limit memory usage to 100 MB using cgroups.
Every container that we spawn will be subject to this limit. The kernel exposes cgroups through the /sys/fs/cgroup directory.
```bash
ls /sys/fs/cgroup
# Create a new memory group
mkdir /sys/fs/cgroup/memory/cogsi
```
![Cg](https://image.ibb.co/erjL0d/Screen_Shot_2018_05_15_at_01_54_24.png)
Once the directory is created, the kernel populates it with all the files that we can configure manually. Our goal is to cap memory at 100 MB and disable swap:
```bash
echo "100000000" > /sys/fs/cgroup/memory/cogsi/memory.limit_in_bytes
echo "0" > /sys/fs/cgroup/memory/cogsi/memory.swappiness
```
The special `tasks` file holds the PIDs of all processes to which this policy applies. Later on we will show a full container that
runs under this restriction.
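The limit file takes a plain byte count, so it can help to compute that number from a readable size and to review the writes before running them as root. The sketch below only prints the commands. The helpers `to_bytes` and `limit_commands` are hypothetical names, the suffixes are interpreted as powers of 1024 (so `100M` is 104857600 bytes, slightly more than the decimal 100000000 used above), and the paths assume the cgroup v1 layout that this tutorial uses.

```bash
#!/bin/sh
# to_bytes SIZE (hypothetical helper)
# Convert 512, 64K, 100M or 1G to a byte count, using powers of 1024.
to_bytes() {
    number=${1%[KMG]}
    case $number in
        ''|*[!0-9]*|0[0-9]*) echo "invalid size: $1" >&2; return 1 ;;
    esac
    case $1 in
        *K) echo $((number * 1024)) ;;
        *M) echo $((number * 1024 * 1024)) ;;
        *G) echo $((number * 1024 * 1024 * 1024)) ;;
        *)  echo "$number" ;;
    esac
}

# limit_commands GROUP SIZE (hypothetical helper)
# Print, without executing, the writes that would cap GROUP at SIZE
# and disable swap for it (cgroup v1 memory controller assumed).
limit_commands() {
    bytes=$(to_bytes "$2") || return 1
    dir=/sys/fs/cgroup/memory/$1
    printf 'mkdir -p %s\n' "$dir"
    printf 'echo %s > %s/memory.limit_in_bytes\n' "$bytes" "$dir"
    printf 'echo 0 > %s/memory.swappiness\n' "$dir"
}

# Dry run for the group used in this tutorial.
limit_commands cogsi 100M
```

Printing instead of executing keeps the sketch safe to try on any machine; the printed lines can be run by hand as root once they look right.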
## Network
Network namespaces let us have isolated network environments on a single host; each namespace has its own interfaces and routing table.
```bash
# List all existing named network namespaces
ip netns
```

    -----------------------
    !    Linux Kernel     !
    !   ______________    !
    !   !_Default NS_!    !
    !_____________________!

For our project we will add two namespaces:
1) net1
2) net2
```bash
ip netns add net1
ip netns add net2
```

          Default NS
       net1        net2

```bash
# Check that they were created
ip netns list
```
Once a namespace is added, a new file is created in /var/run/netns with the same name as the namespace. Our next goal is to make the two namespaces ping each other, using a virtual switch.
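Because each named namespace is represented by a file under /var/run/netns, a script can check for that file before trying to use the namespace. This is a small sketch under that assumption; `netns_path`, `netns_exists` and the `NETNS_DIR` override are hypothetical names introduced only for the example.

```bash
#!/bin/sh
# Directory where `ip netns add` creates one file per namespace.
# NETNS_DIR is a hypothetical override, handy for experimenting elsewhere.
NETNS_DIR=${NETNS_DIR:-/var/run/netns}

# netns_path NAME (hypothetical helper)
# Print the file that represents the named namespace.
netns_path() {
    case $1 in
        ''|*/*|.|..) echo "invalid namespace name: $1" >&2; return 1 ;;
    esac
    printf '%s/%s\n' "$NETNS_DIR" "$1"
}

# netns_exists NAME (hypothetical helper)
# Succeed when the namespace file is present.
netns_exists() {
    path=$(netns_path "$1") || return 1
    [ -e "$path" ]
}

# Report on the two namespaces used in this tutorial.
for ns in net1 net2; do
    if netns_exists "$ns"; then
        echo "$ns: present at $(netns_path "$ns")"
    else
        echo "$ns: missing (create it with: ip netns add $ns)"
    fi
done
```

The same path is what a program opens to obtain a file descriptor for `setns`, so the check doubles as a sanity test before launching a container that joins the namespace.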
```bash
apt-get install openvswitch-switch

# Start the service (the unit is named openvswitch-switch on Debian-based systems)
systemctl start openvswitch-switch

# Create a virtual switch
ovs-vsctl add-br name_switch

# Show the created switch
ovs-vsctl show
```
We need two virtual Ethernet (veth) pairs to connect the network namespaces to the switch. A veth device is created as a pair that behaves like a tube: we connect one end to the switch and the other end to the namespace.
```bash
# netX-netns will live in the namespace, and netX-ovs will be on the switch side
ip link add net1-netns type veth peer name net1-ovs
ip link add net2-netns type veth peer name net2-ovs

# Move netX-netns into the netX namespace
ip link set net1-netns netns net1
ip link set net2-netns netns net2

# Connect netX-ovs to the virtual switch
ovs-vsctl add-port name_switch net1-ovs
ovs-vsctl add-port name_switch net2-ovs
```

               Open vSwitch
    ---------------------------
    !  net1-ovs     net2-ovs  !
    !   !    !       !    !   !
    !   !----!       !----!   !
    ------!------------!-------
          !            !
          !            !
      net1-netns   net2-netns
          !            !
          !            !
       ___!___      ___!___
       !net1 !      !net2 !

```bash
# Now we can enable the devices in the 'default' namespace
sudo ip link set net1-ovs up
sudo ip link set net2-ovs up

# Enable the devices inside the namespaces
sudo ip netns exec net1 ip link set dev lo up
sudo ip netns exec net1 ip link set dev net1-netns up

sudo ip netns exec net2 ip link set dev lo up
sudo ip netns exec net2 ip link set dev net2-netns up

# Assign static addresses
sudo ip netns exec net1 ip addr add 10.0.0.1/24 dev net1-netns
sudo ip netns exec net2 ip addr add 10.0.0.2/24 dev net2-netns

# Ping between namespaces using the ip command
sudo ip netns exec net1 ping 10.0.0.2

# Ping between namespaces in a friendlier way:
# open a shell inside the namespace, then ping from it
sudo ip netns exec net1 /bin/bash
ping 10.0.0.2
```
This configuration gives us a local network that is used only by the namespaces.
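The per-namespace steps above follow one pattern, so they can be generated rather than typed. The sketch below prints the command sequence for namespace number N and does not execute anything. `plan_namespace` is a hypothetical helper, and it assumes the naming scheme of this tutorial (netN, netN-netns, netN-ovs, address 10.0.0.N/24) and a switch that already exists.

```bash
#!/bin/sh
# plan_namespace INDEX SWITCH (hypothetical helper)
# Print the commands that wire namespace net<INDEX> to SWITCH.
plan_namespace() {
    case $1 in
        ''|*[!0-9]*) echo "index must be a number: $1" >&2; return 1 ;;
    esac
    if [ "$1" -lt 1 ] || [ "$1" -gt 254 ]; then
        echo "index must be between 1 and 254: $1" >&2
        return 1
    fi
    n=$1
    switch=$2
    cat <<EOF
ip netns add net${n}
ip link add net${n}-netns type veth peer name net${n}-ovs
ip link set net${n}-netns netns net${n}
ovs-vsctl add-port ${switch} net${n}-ovs
ip link set net${n}-ovs up
ip netns exec net${n} ip link set dev lo up
ip netns exec net${n} ip link set dev net${n}-netns up
ip netns exec net${n} ip addr add 10.0.0.${n}/24 dev net${n}-netns
EOF
}

# Dry run for the two namespaces used in this tutorial.
for i in 1 2; do
    plan_namespace "$i" name_switch
done
```

Keeping the sketch as a dry run means it needs neither root nor Open vSwitch to try out; the printed lines can be reviewed and then run with sudo.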
![Network](https://image.ibb.co/haAvWJ/Screenshot_from_2018_05_14_00_50_41.png)
In a fully isolated container (similar to Docker):
![Container](https://image.ibb.co/bPGoJy/Screenshot_from_2018_05_14_01_04_04.png)
## Mount
This namespace isolates the mount points seen by the processes in a namespace. Each mount point can be given a marker
that determines how mount events propagate between it and its peers. There are currently four types:
1) MS_SHARED - All events are propagated to its peers
2) MS_PRIVATE - No event is propagated to its peers
3) MS_SLAVE - Events in the master are propagated to the slave, but not from the slave to the master
4) MS_UNBINDABLE - Like private, and in addition the mount point cannot be the source of a bind mount
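The marker of each existing mount point can be read back from `/proc/<pid>/mountinfo`, where it appears as optional fields (`shared:N`, `master:N`, `unbindable`) between the mount options and a `-` separator; a mount with none of them is private. The following sketch classifies one mountinfo line at a time. `propagation_of` is a hypothetical helper name, and the sample lines in the comments are illustrative, not taken from the tutorial's machine.

```bash
#!/bin/sh
# propagation_of LINE (hypothetical helper)
# Print "<mount point> <propagation>" for one line in mountinfo format, e.g.
#   36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw
# Field 5 is the mount point; optional fields start at field 7 and end at "-".
propagation_of() {
    printf '%s\n' "$1" | awk '{
        shared = 0; slave = 0; unbindable = 0
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) shared = 1
            else if ($i ~ /^master:/) slave = 1
            else if ($i == "unbindable") unbindable = 1
        }
        if (unbindable) kind = "unbindable"
        else if (shared && slave) kind = "shared+slave"
        else if (shared) kind = "shared"
        else if (slave) kind = "slave"
        else kind = "private"
        print $5, kind
    }'
}

# Classify every mount visible to this shell, when /proc is available.
if [ -r /proc/self/mountinfo ]; then
    while IFS= read -r line; do
        propagation_of "$line"
    done < /proc/self/mountinfo
fi
```

Running this inside and outside a container is a simple way to confirm that a mount made with `--make-shared` or `--make-slave` ended up with the intended marker.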
To achieve our goal, we will create a master-slave configuration. The master shares a read-only mount point that contains config
files and executables, while each container has its own mount point that only it has permission to write to. To emulate a real block device,
we will create RAM disks. A RAM disk uses normal RAM in main memory as if it were a partition on a hard drive, rather than
accessing the data bus normally used for secondary storage such as a hard disk.
```bash
mkfs -q /dev/ram1 8192
mkfs -q /dev/ram2 8192

mkdir -p /mnt/ram1
mkdir -p /mnt/ram2

mkfs -t ext4 -q /dev/ram1 8192
mkfs -t ext4 -q /dev/ram2 8192

mount /dev/ram1 /opt/container/shared
mount --make-shared /opt/container/shared
# Share a read-only filesystem with two containers
mount --bind -o ro /opt/container/shared /opt/container/rootfs/shared
mount --bind -o ro /opt/container/shared /opt/container/otherfs/shared
```
### Docker-like Container
This leads to the final part of this sprint: a program written in C that uses the namespace API to combine
almost all of the namespaces mentioned above. So far, we already have two network namespaces that can be used by two containers that
need to communicate, one shared mount point that both can read from, and one that only the two containers share. The launcher completes the setup by joining the network namespace with the setns system call, through the well-known file that represents the namespace. Before
spawning a container we need a container image.
```bash
# Download the image
wget https://github.com/NunuM/containers/archive/v0.1.0.zip
mkdir -p /opt/container
mv v0.1.0.zip /opt/container
cd /opt/container
unzip v0.1.0.zip
cd -
# Compile the source code
gcc file.c -o launcher

# Spawn a container (-N joins the net1 namespace, -H sets the hostname)
./launcher -n -u -i -m -p -N net1 -H zion chroot /opt/container/rootfs /bin/bash
```
This image is based on Debian Jessie and includes Python. Now that we are in our fresh container,
constrained in RAM, let's create a hungry Python program.
```bash
cat <<'EOF' > hungry.py
f = open("/dev/urandom", "r")
data = ""

i = 0
while True:
    data += f.read(10000000)  # 10 MB
    i += 1
    print "%dmb" % (i*10,)
EOF

/usr/bin/python hungry.py
```
Note that the script fails at first, because the special file /dev/urandom does not exist inside the container. We need to create it:
```bash
mknod -m 444 /dev/urandom c 1 9
```
![Run Inside container](https://image.ibb.co/e59HDy/Screen_Shot_2018_05_15_at_02_49_22.png)
The script is killed by the control group that is associated with it.
Next we want to have internet access inside the two containers. Almost everything is in place. To accomplish this, we add a port to the Open vSwitch bridge 'name_switch' that we created earlier. The port
binds the physical Ethernet interface to 'name_switch'. (The host loses internet connectivity at this point, since the interface is no longer connected to the default IP stack of the system.)
To recover the internet connection, two steps are required: 1) remove the address from the physical interface; 2) assign an address to name_switch. The system will then look like: IP stack -> name_switch -> enp0s25
```bash
# Find the enp0s25 address
ip addr

# Add the physical interface - CAUTION: connectivity is lost after this command
ovs-vsctl add-port name_switch enp0s25

# Delete the address
ip addr del 192.168.2.60/24 dev enp0s25

# Configure name_switch
dhclient name_switch
```
To test the result of the documented steps, two containers were launched in two different terminals:
```bash
# Copy a program to the shared filesystem
cat <<'EOF' > /opt/container/shared/hungry.py
f = open("/dev/urandom", "r")
data = ""

i = 0
while True:
    data += f.read(10000000)  # 10 MB
    i += 1
    print "%dmb" % (i*10,)
EOF

# Launch zion - terminal 1
./launcher -i -m -n -p -u -N net1 -H zion chroot /opt/container/rootfs /bin/bash
ping localhost
ping 10.0.0.2
ping 8.8.8.8
ping google.pt
# Launch moon - terminal 2
./launcher -i -m -n -p -u -N net2 -H moon chroot /opt/container/otherfs /bin/bash
```
As a result:
![Inside a container](https://image.ibb.co/jENNFd/Screenshot_from_2018_05_15_14_41_56.png)
We can also install other software using the apt command.
![APT Tool](https://image.ibb.co/koiq1J/Screenshot_from_2018_05_15_14_46_04.png)
And we can run a program from the shared filesystem:
![python prd](https://image.ibb.co/fO9mad/Screenshot_from_2018_05_15_14_48_26.png)
The last part demonstrates how to mount a filesystem in a running container. This shares a mount point exclusively between the two containers;
even the parent mount tree does not see this subtree. First we need to find the PIDs of the containers.
```bash
# We can cat /sys/fs/cgroup/memory/cogsi/tasks
cat /sys/fs/cgroup/memory/cogsi/tasks

# Or use ps
ps aux | grep bash
# To be sure, send a message to a pseudo-terminal, and repeat the process to find the other container
echo "Is you container?" > /dev/pts/18
```
Now we are 100% positive that we have found the PID:
![telephone ](https://image.ibb.co/ctLv1J/Screenshot_from_2018_05_15_14_57_45.png)
```bash
# Now we can create the private filesystem
nsenter -m -t [zion_PID] mount --make-slave --read-write /dev/ram2 /opt/container/rootfs/private
nsenter -m -t [moon_PID] mount --make-slave --read-write /dev/ram1 /opt/container/otherfs/private

# In the zion container - this file is visible only to the two containers
echo "I am private" > /private/text.txt

# In the moon container
ls /private

# We can also mount proc using this command
nsenter -p -t [zion_PID] mount -t proc proc /opt/container/rootfs/proc
```
The nsenter command allows us to execute commands inside a running container.
![private](https://image.ibb.co/mxnuTy/Screenshot_from_2018_05_15_15_08_39.png)
![proc](https://image.ibb.co/jdUmad/Screenshot_from_2018_05_15_15_16_24.png)
```c
#define _GNU_SOURCE             /* Needed for clone() and setns() */
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

struct child_args {
    char **argv;        /* Command to be executed by child, with arguments */
    char *hostname;     /* Hostname to set in the UTS namespace */
    char *netns;        /* Name of the network namespace to join */
    int pipe_fd[2];     /* Pipe used to synchronize parent and child */
};

static int verbose;

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] cmd [arg...]\n\n", pname);
    fprintf(stderr, "Create a child process that executes a shell command in new namespace(s).\n\n");
    fprintf(stderr, "Options can be:\n\n");
#define fpe(str) fprintf(stderr, " %s", str);
    fpe("-i New IPC namespace\n");
    fpe("-m New mount namespace\n");
    fpe("-n New network namespace\n");
    fpe("-p New PID namespace\n");
    fpe("-u New UTS namespace\n");
    fpe("-U New user namespace\n");
    fpe("-M uid_map Specify UID map for user namespace\n");
    fpe("-G gid_map Specify GID map for user namespace\n");
    fpe(" If -M or -G is specified, -U is required\n");
    fpe("-N netns Join an existing network namespace\n");
    fpe("-H hostname Set the hostname in the UTS namespace\n");
    fpe("-v Display verbose messages\n");
    fpe("\n");
    fpe("Map strings for -M and -G consist of records of the form:\n");
    fpe("\n");
    fpe(" ID-inside-ns ID-outside-ns len\n");
    fpe("\n");
    exit(EXIT_FAILURE);
}

/* Write 'mapping' to 'map_file' (a /proc/PID/uid_map or gid_map file).
   Commas in the mapping string are replaced by newlines, so that several
   records can be passed in a single command-line argument. */
static void
update_map(char *mapping, char *map_file)
{
    int fd;
    size_t j;
    size_t map_len;     /* Length of 'mapping' */

    map_len = strlen(mapping);
    for (j = 0; j < map_len; j++)
        if (mapping[j] == ',')
            mapping[j] = '\n';

    fd = open(map_file, O_RDWR);
    if (fd == -1) {
        fprintf(stderr, "open %s: %s\n", map_file, strerror(errno));
        exit(EXIT_FAILURE);
    }

    if (write(fd, mapping, map_len) != (ssize_t) map_len) {
        fprintf(stderr, "write %s: %s\n", map_file, strerror(errno));
        exit(EXIT_FAILURE);
    }

    close(fd);
}

/* Start function for the cloned child */
static int
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;
    char path[PATH_MAX];
    int fd;

    /* Wait until the parent has updated the maps and the cgroup. The
       parent signals this by closing its write end of the pipe, which
       makes read() return 0 (end of file). */
    close(args->pipe_fd[1]);
    if (read(args->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr, "Failure in child: read from pipe returned != 0\n");
        exit(EXIT_FAILURE);
    }

    if (args->hostname != NULL && strlen(args->hostname) > 0) {
        if (sethostname(args->hostname, strlen(args->hostname)) == -1)
            errExit("sethostname");
    }

    if (args->netns != NULL && strlen(args->netns) > 0) {
        snprintf(path, sizeof(path), "/var/run/netns/%s", args->netns);
        fd = open(path, O_RDONLY);  /* Get file descriptor for namespace */
        if (fd == -1)
            errExit("open");
        if (setns(fd, 0) == -1)     /* Join network namespace */
            errExit("setns");
        close(fd);
    }

    /* Execute a shell command */
    execvp(args->argv[0], args->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];    /* Space for child's stack */

int
main(int argc, char *argv[])
{
    int flags, opt, fd, n;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    char map_path[PATH_MAX];
    char pid_buf[32];

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    args.hostname = NULL;       /* Stay NULL unless -H / -N are given */
    args.netns = NULL;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:H:N:v")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'v': verbose = 1;                  break;
        case 'M': uid_map = optarg;             break;
        case 'G': gid_map = optarg;             break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        case 'H': args.hostname = optarg;       break;
        case 'N': args.netns = optarg;          break;
        default:  usage(argv[0]);
        }
    }

    /* -M or -G without -U is nonsensical */
    if ((uid_map != NULL || gid_map != NULL) &&
            !(flags & CLONE_NEWUSER))
        usage(argv[0]);

    args.argv = &argv[optind];

    if (pipe(args.pipe_fd) == -1)
        errExit("pipe");

    /* Create the child in new namespace(s) */
    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == -1)
        errExit("clone");

    /* Parent falls through to here */
    if (verbose)
        printf("%s: PID of child created by clone() is %ld\n",
               argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */
    if (uid_map != NULL) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                 (long) child_pid);
        update_map(uid_map, map_path);
    }
    if (gid_map != NULL) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                 (long) child_pid);
        update_map(gid_map, map_path);
    }

    /* Limit child memory: add the child to the 'cogsi' memory cgroup */
    fd = open("/sys/fs/cgroup/memory/cogsi/tasks", O_RDWR);
    if (fd == -1)
        errExit("open");

    n = snprintf(pid_buf, sizeof(pid_buf), "%ld", (long) child_pid);
    if (write(fd, pid_buf, n) == -1)
        errExit("write");
    close(fd);

    /* Close the write end of the pipe, to signal to the child that the
       maps and the cgroup have been set up */
    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\n", argv[0]);

    exit(EXIT_SUCCESS);
}
```
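The `-M` and `-G` options of the launcher take map strings made of `ID-inside-ns ID-outside-ns len` records, with commas standing in for the newlines that the kernel expects between records. The sketch below mirrors that comma-to-newline conversion in the shell and checks each record before it would be handed to the launcher. `format_id_map` is a hypothetical helper name, and the IDs in the example are illustrative values, not taken from the tutorial.

```bash
#!/bin/sh
# format_id_map MAP (hypothetical helper)
# Turn "0 1000 1,1 100000 65536" into one record per line, as it would be
# written to /proc/<pid>/uid_map. Fails when a record is not made of
# exactly three non-negative integers.
format_id_map() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '
        NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            print "invalid map record: " $0 > "/dev/stderr"
            bad = 1
            next
        }
        { print $1, $2, $3 }
        END { exit bad }
    '
}

# Example: map root inside the namespace to user 1000 outside (one ID),
# then map IDs 1..65536 inside to 100000.. outside.
format_id_map "0 1000 1,1 100000 65536"
```

Validating the string up front gives a clearer error than the `write` failure the launcher would report if the kernel rejected a malformed record.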