class: title, self-paced Week 3 Part 1:
Container Security
.nav[*Self-paced version*] .debug[ ``` ``` These slides have been built from commit: 44d41b6 [shared/title.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/title.md)] --- class: title, in-person Week 3 Part 1:
Container Security
.footnote[ **Slides[:](https://www.youtube.com/watch?v=h16zyxiwDLY) https://chicago.bretfisher.com/** ] .debug[[shared/title.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/title.md)] --- ## Prep: Things to do before we get started 1. Open these slides: https://chicago.bretfisher.com/ 2. Get a cluster IP. Teams has a spreadsheet of IPs. Pick one and add your initials. 3. Access your server over WebSSH (https://webssh.bret.today) - username: k8s | password: training 4. Your cluster has two nodes. Test ssh from node1 to node2 with `ssh node2` etc. **Note** - This is hands on. You'll want to do most of these commands with me. - Everything is take home (except for the server 😉). We'll get to that later. .debug[[logistics-bret-security.md](https://github.com/BretFisher/container.training/tree/tampa/slides/logistics-bret-security.md)] --- ## Introductions - Hello! I'm Bret Fisher ([@bretfisher]), a fan of 🐳 🏖 🥃 👾 ✈️ 🐶 - I'm a [DevOps Consultant+Trainer], [Open Source maintainer], and [Docker Captain]. - I have a weekly [DevOps live stream] with guests. Join us on Thursdays! - That show turns into a podcast called "[DevOps and Docker Talk]." - You can get my weekly updates in email by following [my Patreon page]. [@bretfisher]: https://twitter.com/bretfisher [DevOps Consultant+Trainer]: https://www.bretfisher.com/courses/ [Open Source maintainer]: https://github.com/bretfisher [Docker Captain]: https://www.docker.com/captains/bret-fisher/ [DevOps live stream]: https://www.youtube.com/channel/UC0NErq0RhP51iXx64ZmyVfg [DevOps and Docker Talk]: https://podcast.bretfisher.com/ [my Patreon page]: https://patreon.com/BretFisher .debug[[logistics-bret-security.md](https://github.com/BretFisher/container.training/tree/tampa/slides/logistics-bret-security.md)] --- ## Logistics - The training will run for 3 hours each day, with Q&A before and after. - We'll do a short half-time break. 
- Feel free to interrupt for questions at any time on voice or Teams chat.

- *Especially when you see full-screen container pictures!*

.debug[[logistics-bret-security.md](https://github.com/BretFisher/container.training/tree/tampa/slides/logistics-bret-security.md)]

---

## Exercises

- At the end of each day, there is an exercise.

- To make the most out of the training, please try the exercises!

  (it will help to practice and memorize the content of the day)

- We recommend taking at least one hour to work on the exercises.

  (if you understood the content of the day, it will be much faster)

- Each day will start with a quick review of the exercises of the previous day.

.debug[[logistics-bret-security.md](https://github.com/BretFisher/container.training/tree/tampa/slides/logistics-bret-security.md)]

---

## Limited-time signup for my video courses

- I make bestselling Docker & Kubernetes courses on Udemy (nearly 300,000 students).

- **As part of this workshop, you get free lifetime access to all of them!**

- **But you must "buy" each course with the coupon before the coupon expires.**

- Use the coupon code `CHICAGO22` to get the courses [in this list](https://www.udemy.com/user/bretfisher/).

- Details will be emailed out to you as a reminder at the end of this workshop.

.debug[[logistics-bret-security.md](https://github.com/BretFisher/container.training/tree/tampa/slides/logistics-bret-security.md)]

---

## Accessing these slides now

- We recommend that you open these slides in your browser: https://chicago.bretfisher.com/

- This is a public URL, you're welcome to share it with others!

- Use arrows to move to next/previous slide

  (up, down, left, right, page up, page down)

- Type a slide number + ENTER to go to that slide

- The slide number is also visible in the URL bar

  (e.g.
.../#123 for slide 123) .debug[[shared/about-slides.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/about-slides.md)] --- ## These slides are open source - The sources of these slides are available in a public GitHub repository: https://github.com/bretfisher/container.training - These slides are written in Markdown - You are welcome to share, re-use, re-mix these slides - Typos? Mistakes? Questions? Feel free to hover over the bottom of the slide ... .footnote[👇 Try it! The source file will be shown and you can view it on GitHub and fork and edit it.] .debug[[shared/about-slides.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/about-slides.md)] --- ## Accessing these slides later - Slides will remain online so you can review them later if needed (let's say we'll keep them online at least 1 year, how about that?) - You can download the slides using that URL: https://chicago.bretfisher.com/slides.zip (then open the file `security.yml.html`) - You can also generate a PDF of the slides (by printing them to a file; but be patient with your browser!) 
.debug[[shared/about-slides.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/about-slides.md)] --- ## These slides are constantly updated - https://container.training - Upstream repo https://github.com/jpetazzo/container.training .debug[[shared/about-slides.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/about-slides.md)] --- class: extra-details ## Extra details - This slide has a little magnifying glass in the top left corner - This magnifying glass indicates slides that provide extra details - Feel free to skip them if: - you are in a hurry - you are new to this and want to avoid cognitive overload - you want only the most essential information - You can review these slides another time if you want, they'll be waiting for you ☺ .debug[[shared/about-slides.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/about-slides.md)] --- name: toc-part-1 ## Part 1 - [Choosing your base image](#toc-choosing-your-base-image) - [CVE image scanning](#toc-cve-image-scanning) - [Running apps as a non-root user](#toc-running-apps-as-a-non-root-user) - [Linux security features in containers](#toc-linux-security-features-in-containers) - [Bret's container security advice](#toc-brets-container-security-advice) .debug[(auto-generated TOC)] --- name: toc-part-2 ## Part 2 - [(Extra security and advanced content)](#toc-extra-security-and-advanced-content) - [Deep dive into container internals](#toc-deep-dive-into-container-internals) - [Control groups](#toc-control-groups) - [Namespaces](#toc-namespaces) .debug[(auto-generated TOC)] .debug[[shared/toc.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/toc.md)] --- class: pic .interstitial[] --- name: toc-choosing-your-base-image class: title Choosing your base image .nav[ [Previous part](#toc-) | [Back to table of contents](#toc-part-1) | [Next part](#toc-cve-image-scanning) ] .debug[(automatically generated title slide)] --- 
# Choosing your base image

- *base image* = The existing image that you start new dev projects from

- We often recommend "Docker Hub Official Images" as your "base" image

- For prebuilt apps like MySQL, Postgres, Redis, you typically leave them unchanged

--

- For development languages and frameworks, it's not so simple

- We want feature-rich images for easy dev, but slim and secure images for prod

- Docker Hub defaults to "easy for beginners" with Node.js, PHP, Ruby, Python, etc.

--

- *Let's look at a few examples to understand the difference*

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Exploring official Python images

- Let's look at the size difference of various Python image types

.lab[

- Let's download a few different images

  ```bash
  docker pull python:latest
  docker pull python:slim
  docker pull python:slim-bullseye
  docker pull python:alpine
  ```

- Let's list only the python images to check their size

  ```bash
  docker images --filter=reference='python'
  # OR
  docker images | grep python
  ```

]

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Explaining the Python image differences

- All these images provide Python 3.x, so why is their size so different?

--

- `slim` is only 14% the size of `latest`

- `alpine` is only 5% the size of `latest`

- If you just `docker pull python`, it assumes `latest`, so isn't that the *best*?
--

- Docker Hub Official Images for programming languages are "easy by default"

- Security is often a pendulum, and `latest` sacrifices secure-by-default for convenience

- `latest` includes lots of tools like compilers and common libs we'll never use

- It's much better (but more work) to control our app dependencies ourselves

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Python image types

- Docker Hub Convention: Official images tend to have common tag names

- Docker Hub Convention: Official images built on minimal Debian, unless specified

- `:latest` = Latest stable version, with a bunch of extras for convenience

  - Never use this for anything but samples/demos/learning

- `:slim` = A slimmed-down, minimal version of the latest stable version

  - A much better starting point for your apps in the real world

- `:alpine` = Very minimal, replaces Debian with BusyBox + Alpine

  - Seems much more secure than Debian, but can have compatibility issues

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Debian as the base of Official images

- All images start with `scratch`, which is empty

- To use `apt` (`apt-get`) or `yum` packages, you need a Linux distribution's basic tools

  - Package managers, OpenSSL, shells (bash, zsh, sh), root certs, etc.

- Without these tools, you couldn't do much when building your app image

- Docker Inc. chose Debian as the standard base

  - A common distro, and the basis for many others like Ubuntu and Mint

- This makes migrating app builds to a Dockerfile way easier

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Official image builds are open source

- Still curious how an Official image is built? Dockerfiles are open source!
- Starting from the final `python:latest`, we can walk back through the Dockerfiles

- Click any image tag in a Docker Hub official image page to see the Dockerfile

- [hub.docker.com/_/python] is built on top of other images:

1. `python:latest` (920MB) is [built from its Dockerfile] in the Docker Hub repo README
2. That Dockerfile has `FROM buildpack-deps:bullseye` as its base image
3. `buildpack-deps:bullseye` (834MB) [Dockerfile installs a bunch] of packages
4. That Dockerfile has `FROM buildpack-deps:bullseye-scm` as its base image
5. `buildpack-deps:bullseye-scm` (306MB) [Dockerfile installs versioning tools]
6. That Dockerfile has `FROM buildpack-deps:bullseye-curl` as its base image
7. `buildpack-deps:bullseye-curl` (154MB) [Dockerfile installs curl, wget and others]
8. That Dockerfile has `FROM debian:bullseye` as its base image
9. `debian:bullseye` (118MB) is a [simple 3-line Dockerfile] that has minimal Debian

[hub.docker.com/_/python]: https://hub.docker.com/_/python
[built from its Dockerfile]: https://github.com/docker-library/python/blob/56cea612ab370f3d05b29e97466d418a0f07e463/3.10/bullseye/Dockerfile
[Dockerfile installs a bunch]: https://github.com/docker-library/buildpack-deps/blob/65d69325ad741cea6dee20781c1faaab2e003d87/debian/bullseye/Dockerfile
[Dockerfile installs versioning tools]: https://github.com/docker-library/buildpack-deps/blob/65d69325ad741cea6dee20781c1faaab2e003d87/debian/bullseye/scm/Dockerfile
[Dockerfile installs curl, wget and others]: https://github.com/docker-library/buildpack-deps/blob/98a5ab81d47a106c458cdf90733df0ee8beea06c/debian/bullseye/curl/Dockerfile
[simple 3-line Dockerfile]: https://github.com/debuerreotype/docker-debian-artifacts/blob/6251ccd8060ae10b12bd881975cf37eee84ffbb0/bullseye/Dockerfile

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Side effects of using Debian as the base

- Debian is a great open source org, but
it's not the right fit for everyone

- If you're a Red Hat shop (`yum`-based) you have to convert to `apt` and test

- Debian is historically slower to patch CVEs than Ubuntu or RHEL

- Debian has way more CVEs than Alpine, BusyBox, or scratch-based images

--

- But what other options are there?

- *Let's look at many options for Node.js, another dynamically compiled language*

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

.small[

## CVEs, Size, and Support of Node.js Images

| Image Name | Tier 1 Support | CVEs (High+Crit)/TOTAL | Node Version Control | Image Size (Files) | Min Pkgs |
| --- | --- | --- | --- | --- | --- |
| node:latest | Yes | 332/853 | **No** | 991MB (203,325) | No |
| node:16 | Yes | 259/1954 | Yes | 906MB (202,898) | No |
| node:16-alpine | **No** | 0/0 | Yes | 111MB (179,510) | Yes |
| node:16-slim | Yes | 36/131 | Yes | 175MB (182,843) | Yes |
| node:16-bullseye | Yes | 130/947 | Yes | 936MB (201,425) | No |
| node:16-bullseye-slim | Yes | 12/74 | Yes | 186MB (183,416) | Yes |
| ubuntu:20.04+nodesource package | Yes | 2/18 | Yes | 188MB (182,609) | No |
| **ubuntu:20.04+node:16-bullseye-slim** | Yes | **0/15** | Yes | 168MB (183,094) | **Yes** |
| **gcr.io/distroless/nodejs:16** | Yes | **1/12** | **No** | **108MB (2,120)** | **Yes** |

.footnote[
*CVEs and size are from April 2022.
RH UBI not considered since it's on Node 10.*
]]

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Explaining the previous slide table

- `node:latest` and `node:16` are almost 1GB (and don't even contain your app yet)

- Those "full" images have hundreds of CVEs

  - We can't easily remove a CVE if it's already in the base image

- `node:alpine` has NO CVEs and is small, but not supported by the Node.js project

  - YMMV: Many articles detail how Node.js on Alpine had issues in the real world

- `node:slim` makes a **huge** difference in CVEs and size over "full" images

- `node:16-bullseye-slim` replaces the underlying Debian version with an updated one

  - Fewer CVEs and more up-to-date `apt` packages!

- `ubuntu:20.04+node:16-bullseye-slim` copies `/usr/local` from `node` to `ubuntu`

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Better base image habits

- Use `slim` by default, if it's available.

- Pin exact versions of `FROM` images. You want deterministic builds.

- Only try `:alpine` variants once you're well versed in your app dependencies

  - You really need good automated testing, including performance, to trust it

- Consider standardizing on `ubuntu`, which has a lower CVE count

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Example of making a custom base image

```bash
FROM ubuntu:jammy-20220531 as base
RUN apt-get update \
    && apt-get -qq install -y --no-install-recommends \
    python3-minimal python3-pip \
    && rm -rf /var/lib/apt/lists/*

# add dev and test stages here

FROM base as prod
EXPOSE 3000
WORKDIR /app
COPY requirements.txt ./
RUN pip install -qr requirements.txt
COPY . .
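# note: requirements.txt is copied and pip-installed before the rest of the
# source, so the dependency layer stays cached until requirements.txt changes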
CMD ["python", "something.py"]
```

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Key steps in the previous Dockerfile example

- `FROM ubuntu:jammy-20220531 as base`

  - tag is specific to a build date

  - `as base` gives us an alias for multi-stage ease of use

- Install Python your preferred way. In this case we install `python3-minimal` via `apt`

- `FROM base as prod` starts a new stage from the end of a previous stage

  - This is so you could add more dev/test stages later

  - Then target specific stages during the build process

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

## Bonus: watch my DockerCon 2022 talk

- "Node.js Rocks in Docker"

- It focuses a lot on base images, custom base images, multi-stage images, and more

- [27min YouTube video]

- [GitHub Repo with tons] of examples and documentation

- 95% of it is directly transferable to Python, PHP, Ruby, etc.

- 75% of it is transferable to any other language image

[27min YouTube video]: https://www.youtube.com/watch?v=Z0lpNSC1KbM
[GitHub Repo with tons]: https://github.com/BretFisher/nodejs-rocks-in-docker

.debug[[containers/base-images.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/base-images.md)]

---

class: pic

.interstitial[]

---

name: toc-cve-image-scanning
class: title

CVE image scanning

.nav[ [Previous part](#toc-choosing-your-base-image) | [Back to table of contents](#toc-part-1) | [Next part](#toc-running-apps-as-a-non-root-user) ]

.debug[(automatically generated title slide)]

---

# CVE image scanning

- Wonderful news: CVEs are easier to manage and mitigate with containers

- Let's learn about what they are, how to scan for them, and how to mitigate

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## What is a CVE?
- CVE = Common Vulnerabilities and Exposures

- A list of publicly disclosed computer security flaws

- Every year the pace increases. We've seen 13,170 new CVEs in 2022

- Your sec and admin teams regularly update systems to avoid these vulnerabilities

- You might know them from CVEs in your app dependencies

  - GitHub and other repositories now scan your app dependencies automatically

- But often, your apps have `apt` or `yum` dependencies on those Linux hosts

  - Those are not tracked in your app dependency manager

  - There's no common way to control host-based dependencies

- Often we have system "drift", where prod dependencies are older than dev/test

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## Containers change vulnerability management

- Container images store *all* your app dependencies, including `apt` and `yum` packages

- Images are verifiable artifacts. They are immutable

- If you need to update a dependency, you rebuild the image and deploy

- This means engineers can verify the code running is the code you built

- No more drift with your app dependencies on servers!
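- As a sketch of that workflow (the image name, tag, and registry below are made up), a dependency fix becomes an ordinary rebuild-and-push:

```bash
# bump the pinned dependency, then rebuild and push a new immutable image
docker build -t registry.example.com/myapp:1.2.1 .
docker push registry.example.com/myapp:1.2.1

# later, anyone can verify exactly which artifact is running by its digest
docker images --digests registry.example.com/myapp
```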
.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## Image CVE scanners > app dep scanners alone

- Image scanners are the new normal, and work locally, in CI, or in production

- They can scan *every* dependency in the image, including base image and `apt`/`yum`

- Scanning images during CI builds = reasonable* certainty it's the same result in prod

- None of this was true before containers

.footnote[
.small[
Reasonable: A rogue agent with root on the host server could change things on the fly in a container, unless you enable a read-only filesystem
]]

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## Scan an image with Trivy

- We've installed [Trivy], an open source scanner from Aqua Security

.lab[

- `trivy i` is what we want for scanning an image (rather than a host or source code)

- View our command options

  ```bash
  trivy i --help
  ```

- Scan the `ubuntu:latest` image

  ```bash
  trivy i ubuntu:latest
  ```

]

[Trivy]: https://aquasecurity.github.io/trivy

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## Result of Trivy scans

- You'll get back a summary of the vulnerabilities found

- You'll get a table with details about each

- There are options for file output and format (JSON, SARIF, SPDX, etc.)

--

- Maybe we only care about HIGH and CRITICAL vulnerabilities?
.lab[

- Scan the `ubuntu:latest` image

  ```bash
  trivy i -s HIGH,CRITICAL ubuntu:latest
  ```

]

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## Scan Python images

- Let's look for HIGH and CRITICAL CVEs, and only output the first 4-line summary

.lab[

- Scan the latest "full" image

  ```bash
  trivy -q i -s HIGH,CRITICAL python:latest | head -n 4
  ```

- Scan the latest "slim" image

  ```bash
  trivy -q i -s HIGH,CRITICAL python:slim | head -n 4
  ```

- Scan the latest "alpine" image

  ```bash
  trivy -q i -s HIGH,CRITICAL python:alpine | head -n 4
  ```

]

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## Scanning ~~best~~ better practices

- Scan early and often, ideally in CI right after you build the image for testing

- Many tools now auto-scan, like image registries (Harbor, Docker Hub, JFrog)

- Output the results to a file and upload to your security tools (GitHub Security, etc.)

- Output a summary in PRs, so everyone can see the results before code is merged

--

- Remember: devs control the update process for their app dependencies

- Educate developers on which scanner you'll use and standards (e.g. HIGH/CRITICAL)

- Empower developers to (optionally) scan locally with the same settings they use in CI

--

- Goal: Catch CVEs as early as possible, either in local dev or pull-request builds

- Goal: Empower developers to know about CVEs in their app dependencies, and fix them

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

## CVE Scanner implementation plan: 1, 2, 3

1. Scan your existing images manually so you at least know where you stand
   1. If you've never scanned before, you likely have dozens or hundreds

--

2. Assign devs the job of reducing the count
   1. Start by only working on removing/updating CRITICAL CVEs
   2.
Then move to HIGH, then MEDIUM, LOW, etc.
   3. Don't expect zero

--

3. Assign DevOps/build engineers the job of automating scans during builds
   1. Start in audit-mode only for weeks/months
   2. Get the whole software lifecycle onboard and aware of this effort
   3. Get feedback from dev/ops/sec on how to elevate scan results
   4. Watch/support the team as they fix CVEs and lower the count
   5. Change from audit-mode to build-fail mode if a CRITICAL CVE is found
   6. Once processes are solid and workflow issues are resolved, fail for HIGH

.debug[[containers/cve-scanning.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/cve-scanning.md)]

---

class: pic

.interstitial[]

---

name: toc-running-apps-as-a-non-root-user
class: title

Running apps as a non-root user

.nav[ [Previous part](#toc-cve-image-scanning) | [Back to table of contents](#toc-part-1) | [Next part](#toc-linux-security-features-in-containers) ]

.debug[(automatically generated title slide)]

---

# Running apps as a non-root user

- Containers run anything inside them as a Linux user

- That user is controllable, and not the same as a host user

- The container runtime translates container->host users depending on setup

- This section is all about *inside* the container (the part *devs* control)

--

- With all programming language images on Hub, the default user is `root`

- You can change the user to `python` or `whatever`

- Linux doesn't care about names, just the User/Group IDs

- Many just use `1000` or `1001` as an ID convention for "my app user"

--

- This is good! We want to avoid apps-as-root, even in a container

- It's typical in production for Kubernetes admins to prevent running as root in pods

- Example: non-root enforcement in pod spec: `allowPrivilegeEscalation: false`

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## General steps to set non-root in a Dockerfile

1.
Create the user (`groupadd` and `useradd`)
2. Change the user of future RUN commands with `USER <user>`
3. Change the owner of files during COPY with `COPY --chown=<user>:<group>`
4. Test, rinse, and repeat (so many little things break when not root)

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Making a custom non-root Python base image

```bash
FROM ubuntu:jammy-20220531 as base
RUN apt-get update \
    && apt-get -qq install -y --no-install-recommends \
    python3-minimal python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN groupadd --gid 1000 python \
    && useradd --uid 1000 --gid python --shell /bin/bash --create-home python \
    && mkdir /app && chown -R python:python /app

# add dev and test stages here

FROM base as prod
EXPOSE 3000
WORKDIR /app
USER python
COPY --chown=python:python requirements.txt ./
RUN pip install --user -qr requirements.txt
COPY --chown=python:python . .
CMD ["python", "something.py"]
```

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Kubernetes-friendly non-root Python image

```bash
FROM ubuntu:jammy-20220531 as base
RUN apt-get update \
    && apt-get -qq install -y --no-install-recommends \
    python3-minimal python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN groupadd --gid 1000 python \
    && useradd --uid 1000 --gid python --shell /bin/bash --create-home python \
    && mkdir /app && chown -R python:python /app

# add dev and test stages here

FROM base as prod
EXPOSE 3000
WORKDIR /app
# change USER to a number so Kubernetes can be sure it's not root
USER 1000
COPY --chown=python:python requirements.txt ./
RUN pip install --user -qr requirements.txt
COPY --chown=python:python . .
CMD ["python", "something.py"]
```

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Key steps in the previous Dockerfile example

- `FROM ubuntu:jammy-20220531 as base`

  - tag is specific to a build date

  - `as base` gives us an alias for multi-stage ease of use

- `groupadd` and `useradd` create our non-root `python` user

- `FROM base as prod` starts a new stage from the end of a previous stage

- `USER python` sets our container to run as the `python` user

- `COPY --chown=python:python` ensures our code files have the right ownership

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Which user are we running as?

- `docker inspect <container>` doesn't help us

--

- `docker top <container>` shows processes, but the wrong non-root user

  - That's because it's translating user IDs in the container to host user names

  - It's been a bug for a [long time] 😢

--

- `ps aux` in a container will work, but many images don't have `ps` installed by default

- *Let's demo in a NGINX container*

[long time]: https://github.com/moby/moby/issues/17719

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Use `ps` (process status) inside a container

- We'll run a NGINX server in the background, named `finduser`

- Then we'll install `ps` with the `procps` package

- Finally, we'll run `ps` inside the container with options `aux`

.lab[

```bash
docker run -d --name finduser nginx
docker exec finduser bash -c 'apt-get update && apt-get install procps -y'
docker exec finduser ps aux
```

]

- Notice that NGINX runs as `root`, but its listening sub-processes run as the `nginx` user

- Running the NGINX main process as `root` is required to listen on ports 80/443

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Other ways to run as non-root

- Changing the Dockerfile is the most reliable and testable way to avoid running as root

- But you can change the user at runtime. Let's try it!

.lab[

- Check which user we are on the host

  ```bash
  whoami
  ```

- Run `ubuntu` as the default user and print out which user we are

  ```bash
  docker run --rm ubuntu whoami
  ```

- Ugh, we're root in the container by default.
Let's run `ubuntu` as the common `ubuntu` user

  ```bash
  docker run --rm --user ubuntu ubuntu whoami
  ```

]

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Other ways to run as non-root (cont)

- There is no `ubuntu` user by default in an `ubuntu` container

- We could add one with `useradd`

- Or we could check the `/etc/passwd` file and use one of those existing users

.lab[

- Print the list of users

  ```bash
  docker run --rm ubuntu cat /etc/passwd
  ```

- Start the container as the `www-data` user, a common user for web servers (Apache, PHP, etc.)

  ```bash
  docker run --rm --user www-data ubuntu whoami
  ```

- `ubuntu` has `ps` installed by default, let's try it

  ```bash
  docker run --rm --user www-data ubuntu ps aux
  ```

]

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

## Kubernetes pod spec for running as non-root

- Example pod spec hard-coding a pre-created user/group and preventing sudo/root use

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: httpenv
spec:
  securityContext:
    runAsUser: 1000 # this is httpenv user
    runAsGroup: 1000 # this is httpenv group
  containers:
    - name: httpenv
      image: bretfisher/httpenv
      ports:
        - containerPort: 8888
      securityContext:
        allowPrivilegeEscalation: false
```

.debug[[containers/non-root-user.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/non-root-user.md)]

---

class: pic

.interstitial[]

---

name: toc-linux-security-features-in-containers
class: title

Linux security features in containers

.nav[ [Previous part](#toc-running-apps-as-a-non-root-user) | [Back to table of contents](#toc-part-1) | [Next part](#toc-brets-container-security-advice) ]

.debug[(automatically generated title slide)]

---

# Linux security features in containers

- Namespaces and cgroups are not enough to ensure strong security

- We need extra mechanisms: capabilities, seccomp,
SELinux, AppArmor, etc.

--

- These mechanisms were already used before containers to harden security

- But most people didn't use them; they are off by default and complex

- They can be used together with containers, and k8s/Docker can manage them

- Note: the four mechanisms we'll discuss have a huge overlap

--

- [Docker enabled them by default] to make everything safer in a container

- They even scanned the top open source tools to design sane default profiles

- (This is why you'll hear me say any pre-container app is [more secure in Docker])

--

- Sadly, Kubernetes disabled many by default, but we can re-enable them!

[Docker enabled them by default]: https://docs.docker.com/engine/security/
[more secure in Docker]: https://docs.docker.com/engine/security/non-events/

.debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)]

---

## Capabilities

- In traditional UNIX, many operations are possible if and only if UID=0 (root)

- Some of these operations are very powerful:

  - changing file ownership, accessing all files ...

- Some of these operations deal with system configuration, but can be abused:

  - setting up network interfaces, mounting filesystems ...

- Some of these operations are not very dangerous but are needed by servers:

  - binding to a port below 1024

- Capabilities are per-process flags to allow these operations individually

.debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)]

---

## Some capabilities

- `CAP_CHOWN`: arbitrarily change file ownership and permissions

- `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions

- `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc.
- `CAP_NET_BIND_SERVICE`: bind a port below 1024 See `man capabilities` for the full list and details .debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)] --- ## Using capabilities - Container engines will typically drop all "dangerous" capabilities - You can then re-enable capabilities on a per-container basis, as needed - With the Docker engine: `docker run --cap-add ...` - Enable ALL capabilities with: `docker run --privileged ...` - This gives the container all the capabilities of the host `root` user - This is not recommended, but is sometimes necessary as a last resort - In Kubernetes, you can use the `capabilities` field in `spec.containers[].securityContext` to specify which capabilities are allowed - In Kubernetes, avoid `allowPrivilegeEscalation: true` in `spec.containers[].securityContext`, and avoid `privileged: true`, which is the same as `docker run --privileged` .debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)] --- ## Seccomp - Seccomp is "secure computing" - Achieves a high level of security by drastically restricting the available syscalls - Original seccomp only allows `read()`, `write()`, `exit()`, `sigreturn()` - The seccomp-bpf extension allows specifying custom filters with BPF rules - This allows filtering by syscall, and by parameter - BPF code can perform arbitrarily complex checks, quickly, and safely - [Docker enables a moderate policy by default], disabling 44 of the 300+ system calls - But, Kubernetes disables it by default, so re-enable it with: `spec.securityContext.seccompProfile.type: RuntimeDefault` [Docker enables a moderate policy by default]: https://docs.docker.com/engine/security/seccomp/
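To make capability sets concrete, here's a minimal sketch that decodes a capability bitmask like the `CapEff` value you'd see in `/proc/<pid>/status` (only a handful of the bits from `capability.h` are mapped here):

```python
# Minimal sketch: decode a capability bitmask such as the CapEff
# value found in /proc/<pid>/status (bit numbers from capability.h).
CAPS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(mask: int) -> set:
    """Return the names of the (known) capability bits set in mask."""
    return {name for bit, name in CAPS.items() if mask & (1 << bit)}

# 0x00000000a80425fb is Docker's well-known default capability set:
# it keeps CAP_NET_BIND_SERVICE but drops CAP_NET_ADMIN and CAP_SYS_ADMIN.
print(sorted(decode_caps(0x00000000A80425FB)))
```

Note how the default mask keeps the "needed by servers" bits but drops the system-configuration ones.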
.debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)] --- ## Linux Security Modules - The most popular ones are SELinux and AppArmor - Red Hat distros generally use SELinux - Debian distros (in particular, Ubuntu) generally use AppArmor - LSMs add a layer of access control to all process operations - If installed, [Docker enables a default policy] that works in most cases - Again, Kubernetes disables them by default [Docker enables a default policy]: https://docs.docker.com/engine/security/apparmor/ .debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)] --- ## Other Linux security features - There are other complex topics to research (esp. if you're Ops or DevSecOps) - Linux *User* Namespaces - Not enabled by default in Docker or Kubernetes, but possible (k8s coming soon) - Map container users to high UIDs on the host to further prevent escaping - This feature could have prevented attacks on multiple previous CVEs - Read-Only Container Filesystem - Not enabled by default in Docker or Kubernetes, but possible - Enabled per container filesystem; not all apps support this - Container Runtime Runs Rootless - By default, runtimes run as root, increasing the risk of privilege escalation - Usually the runtime *needs* root to do its job, but [that's changing] [that's changing]: https://kubernetes.io/docs/tasks/administer-cluster/kubelet-in-userns/ .debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)] --- ## More reading and references - [Kubernetes Docs excellent jumping off point] on all these security topics (and more) - Yes!
[Enable seccomp by default cluster-wide], an alpha feature in K8s 1.22 - Fantastic [Docker Docs on container runtime security], written for mere mortals [Kubernetes Docs excellent jumping off point]: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/ [Enable seccomp by default cluster-wide]: https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads [Docker Docs on container runtime security]: https://docs.docker.com/engine/security/ .debug[[containers/linux-security-in-containers.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/linux-security-in-containers.md)] --- class: pic .interstitial[] --- name: toc-brets-container-security-advice class: title Bret's container security advice .nav[ [Previous part](#toc-linux-security-features-in-containers) | [Back to table of contents](#toc-part-1) | [Next part](#toc-extra-security-and-advanced-content) ] .debug[(automatically generated title slide)] --- # Bret's container security advice - There's mountains of features, tools, and techniques to improve security for containers - I thought I'd give you a top 10-ish list of activities I see as valuable in each area - This is exclusively about containers, in 3 parts: - Image security - Container/Pod security - Kubernetes cluster security - This list was inspired by my [security AMA on the topic] [security AMA on the topic]: https://github.com/BretFisher/ama/discussions/150 .debug[[shared/brets-security-advice.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/brets-security-advice.md)] --- ## Image security top 10-ish - Use `slim` or `ubuntu` for language base images - Implement multi-stage builds so prod doesn't have dev/test dependencies - Create and use non-root users in Dockerfiles - Define your Dockerfile `USER` as the ID, to work best with Kubernetes - Don't reuse image tags for prod-destined images. 
Use semver and/or date tags - [Consider an init process] like `tini` to avoid zombie processes - Use comments heavily in Dockerfiles to document your build process - Copy `.gitignore` to `.dockerignore` everywhere there's a Dockerfile (and add `.git`!) - Focus on reducing CVE count. [Automate builds and CVE scans] for every PR commit [Consider an init process]: https://github.com/BretFisher/nodejs-rocks-in-docker#proper-nodejs-startup-tini [Automate builds and CVE scans]: https://github.com/BretFisher/allhands22 .debug[[shared/brets-security-advice.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/brets-security-advice.md)] --- ## Container security top 10-ish - You're running a non-root user in the container, right? - Run your apps on a high port (3000, 8000, 8080, etc.), for easier rootless containers - Lock down your pod spec with defaults for non-root, seccomp, and privilege escalation

```yaml
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: httpenv
    image: bretfisher/httpenv
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
```
Scan on every PR of infrastructure-as-code - K8s-specific tools include [Trivy] and [Datree] - Even "all in one" tools like [Super-Linter] and [MegaLinter] can help - [Research sigstore], and implement Content Trust for a secure supply chain - Besides the obvious logging and monitoring, [use Falco] to watch for bad behavior [Kubescape]: https://github.com/armosec/kubescape [Trivy]: https://aquasecurity.github.io/trivy [Datree]: https://www.datree.io/ [Super-Linter]: https://github.com/github/super-linter [MegaLinter]: https://oxsecurity.github.io/megalinter/latest/descriptors/kubernetes/ [Kyverno]: https://kyverno.io/ [a great post]: https://www.appvia.io/blog/podsecuritypolicy-is-dead-long-live [Research sigstore]: https://www.sigstore.dev/ [use Falco]: https://falco.org/ .debug[[shared/brets-security-advice.md](https://github.com/BretFisher/container.training/tree/tampa/slides/shared/brets-security-advice.md)] --- class: pic .interstitial[] --- name: toc-extra-security-and-advanced-content class: title (Extra security and advanced content) .nav[ [Previous part](#toc-brets-container-security-advice) | [Back to table of contents](#toc-part-2) | [Next part](#toc-deep-dive-into-container-internals) ] .debug[(automatically generated title slide)] --- # (Extra security and advanced content) .debug[[security.yml](https://github.com/BretFisher/container.training/tree/tampa/slides/security.yml)] --- class: pic .interstitial[] --- name: toc-deep-dive-into-container-internals class: title Deep dive into container internals .nav[ [Previous part](#toc-extra-security-and-advanced-content) | [Back to table of contents](#toc-part-2) | [Next part](#toc-control-groups) ] .debug[(automatically generated title slide)] --- # Deep dive into container internals In this chapter, we will explain some of the fundamental building blocks of containers.
This will give you a solid foundation so you can: - understand "what's going on" in complex situations, - anticipate the behavior of containers (performance, security...) in new scenarios, - implement your own container engine. The last item should be done for educational purposes only! .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## There is no container code in the Linux kernel - If we search "container" in the Linux kernel code, we find: - generic code to manipulate data structures (like linked lists, etc.), - unrelated concepts like "ACPI containers", - *nothing* relevant to "our" containers! - Containers are composed using multiple independent features. - On Linux, containers rely on "namespaces, cgroups, and some filesystem magic." - Security also requires features like capabilities, seccomp, LSMs... .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[] --- name: toc-control-groups class: title Control groups .nav[ [Previous part](#toc-deep-dive-into-container-internals) | [Back to table of contents](#toc-part-2) | [Next part](#toc-namespaces) ] .debug[(automatically generated title slide)] --- # Control groups - Control groups provide resource *metering* and *limiting*. 
- This covers a number of "usual suspects" like: - memory - CPU - block I/O - network (with cooperation from iptables/tc) - And a few exotic ones: - huge pages (a special way to allocate memory) - RDMA (resources specific to InfiniBand / remote memory transfer) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Crowd control - Control groups also allow grouping processes for special operations: - freezer (conceptually similar to a "mass-SIGSTOP/SIGCONT") - perf_event (gather performance events on multiple processes) - cpuset (limit or pin processes to specific CPUs) - There is a "pids" cgroup to limit the number of processes in a given group. - There is also a "devices" cgroup to control access to device nodes. (i.e. everything in `/dev`.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Generalities - Cgroups form a hierarchy (a tree). - We can create nodes in that hierarchy. - We can associate limits with a node. - We can move a process (or multiple processes) to a node. - The process (or processes) will then respect these limits. - We can check the current usage of each node. - In other words: limits are optional (if we only want accounting). - When a process is created, it is placed in its parent's groups. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Example The numbers are PIDs. The names are the names of our nodes (arbitrarily chosen).
.small[

```bash
cpu                      memory
├── batch                ├── stateless
│   ├── cryptoscam       │   ├── 25
│   │   └── 52           │   ├── 26
│   └── ffmpeg           │   ├── 27
│       ├── 109          │   ├── 52
│       └── 88           │   ├── 109
└── realtime             │   └── 88
    ├── nginx            └── databases
    │   ├── 25               ├── 1008
    │   ├── 26               └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008
```

] .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Cgroups v1 vs v2 - Cgroups v1 are available on all systems (and widely used). - Cgroups v2 are a huge refactor. (Development started in Linux 3.10, released in 4.5.) - Cgroups v2 have a number of differences: - single hierarchy (instead of one tree per controller), - processes can only be on leaf nodes (not inner nodes), - and of course many improvements / refactorings. - Cgroups v2 enabled by default on Fedora 31 (2019), Ubuntu 21.10... .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Memory cgroup: accounting - Keeps track of pages used by each group: - file (read/write/mmap from block devices), - anonymous (stack, heap, anonymous mmap), - active (recently accessed), - inactive (candidate for eviction). - Each page is "charged" to a group. - Pages can be shared across multiple groups. (Example: multiple processes reading from the same files.) - To view all the counters kept by this cgroup: ```bash $ cat /sys/fs/cgroup/memory/memory.stat ``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Memory cgroup v1: limits - Each group can have (optional) hard and soft limits. - Limits can be set for different kinds of memory: - physical memory, - kernel memory, - total memory (including swap).
.debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Soft limits and hard limits - Soft limits are not enforced. (But they influence reclaim under memory pressure.) - Hard limits *cannot* be exceeded: - if a group of processes exceeds a hard limit, - and if the kernel cannot reclaim any memory, - then the OOM (out-of-memory) killer is triggered, - and processes are killed until memory gets below the limit again. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Avoiding the OOM killer - For some workloads (databases and stateful systems), killing processes because we run out of memory is not acceptable. - The "oom-notifier" mechanism helps with that. - When "oom-notifier" is enabled and a hard limit is exceeded: - all processes in the cgroup are frozen, - a notification is sent to user space (instead of killing processes), - user space can then raise limits, migrate containers, etc., - once the memory usage is back below the hard limit, the cgroup is unfrozen. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Overhead of the memory cgroup - Each time a process grabs or releases a page, the kernel updates counters. - This adds some overhead. - Unfortunately, this cannot be enabled/disabled per process. - It has to be done system-wide, at boot time. - Also, when multiple groups use the same page: - only the first group gets "charged", - but if it stops using it, the "charge" is moved to another group.
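That last bit of bookkeeping can be sketched in a few lines of Python (an illustration of the behavior described above, not actual kernel code):

```python
# Illustration only: "first toucher gets charged" accounting
# for pages shared between cgroups (not actual kernel code).
users = {}    # page -> set of cgroups currently using it
charges = {}  # page -> the one cgroup charged for it

def touch(page, cgroup):
    """A process in `cgroup` starts using `page`."""
    users.setdefault(page, set()).add(cgroup)
    charges.setdefault(page, cgroup)  # only the first user is charged

def release(page, cgroup):
    """A process in `cgroup` stops using `page`."""
    users[page].discard(cgroup)
    if charges[page] == cgroup:
        # the charge moves to another group still using the page (if any)
        charges[page] = next(iter(users[page]), None)

touch("libc.so:0", "web")    # "web" is charged for the page
touch("libc.so:0", "db")     # "db" shares the page, no extra charge
release("libc.so:0", "web")  # the charge moves to "db"
print(charges["libc.so:0"])  # prints "db"
```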
.debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Setting up a limit with the memory cgroup Create a new memory cgroup:

```bash
$ CG=/sys/fs/cgroup/memory/onehundredmegs
$ sudo mkdir $CG
```

Limit it to approximately 100MB of memory usage: ```bash $ sudo tee $CG/memory.memsw.limit_in_bytes <<< 100000000 ``` Move the current process to that cgroup: ```bash $ sudo tee $CG/tasks <<< $$ ``` The current process *and all its future children* are now limited. (Confused about `<<<`? Look at the next slide!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## What's `<<<`? - This is a "here string". (It is a non-POSIX shell extension.) - The following commands are equivalent:

```bash
foo <<< hello
```

```bash
echo hello | foo
```

```bash
foo < <(echo hello)
```

- We used `sudo tee` because the write to the cgroup pseudo-file must happen with root privileges; another valid way is to run the whole redirection as root:

```bash
sudo sh -c "echo $$ > $CG/tasks"
```

The following commands, however, would be invalid:

```bash
sudo echo $$ > $CG/tasks
```

```bash
sudo -i # (or su)
echo $$ > $CG/tasks
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Testing the memory limit Start the Python interpreter:

```bash
$ python
Python 3.6.4 (default, Jan 5 2018, 02:35:40)
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

Allocate 80 megabytes: ```python >>> s = "!" * 1000000 * 80 ``` Add 20 megabytes more:

```python
>>> t = "!" * 1000000 * 20
Killed
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Memory cgroup v2: limits - `memory.min` = hard reservation (guaranteed memory for this cgroup) - `memory.low` = soft reservation ("*try* not to reclaim memory if we're below this") - `memory.high` = soft limit (aggressively reclaim memory; don't trigger OOMK) - `memory.max` = hard limit (triggers OOMK) - `memory.swap.high` = aggressively reclaim memory when using that much swap - `memory.swap.max` = prevent using more swap than this .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## CPU cgroup - Keeps track of CPU time used by a group of processes. (This is easier and more accurate than `getrusage` and `/proc`.) - Keeps track of usage per CPU as well. (i.e., "this group of processes used X seconds of CPU0 and Y seconds of CPU1".) - Allows setting relative weights used by the scheduler. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Cpuset cgroup - Pin groups to specific CPU(s). - Use-case: reserve CPUs for specific apps.
- Warning: make sure that "default" processes aren't using all CPUs! - CPU pinning can also avoid performance loss due to cache flushes. - This is also relevant for NUMA systems. - Provides extra dials and knobs. (Per zone memory pressure, process migration costs...) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Blkio cgroup - Keeps track of I/Os for each group: - per block device - read vs write - sync vs async - Set throttle (limits) for each group: - per block device - read vs write - ops vs bytes - Set relative weights for each group. - Note: most writes go through the page cache.
(So classic writes will appear to be unthrottled at first.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Net_cls and net_prio cgroup - Only works for egress (outgoing) traffic. - Automatically set traffic class or priority for traffic generated by processes in the group. - Net_cls will assign traffic to a class. - Classes have to be matched with tc or iptables, otherwise traffic just flows normally. - Net_prio will assign traffic to a priority. - Priorities are used by queuing disciplines. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Devices cgroup - Controls what the group can do on device nodes - Permissions include read/write/mknod - Typical use: - allow `/dev/{tty,zero,random,null}` ... - deny everything else - A few interesting nodes: - `/dev/net/tun` (network interface manipulation) - `/dev/fuse` (filesystems in user space) - `/dev/kvm` (VMs in containers, yay inception!) - `/dev/dri` (GPU) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[] --- name: toc-namespaces class: title Namespaces .nav[ [Previous part](#toc-control-groups) | [Back to table of contents](#toc-part-2) | [Next part](#toc-) ] .debug[(automatically generated title slide)] --- # Namespaces - Provide processes with their own view of the system. - Namespaces limit what you can see (and therefore, what you can use). - These namespaces are available in modern kernels: - pid - net - mnt - uts - ipc - user - time - cgroup (We are going to detail them individually.) - Each process belongs to one namespace of each type. 
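You can see these per-process namespace memberships for yourself; here's a minimal sketch, assuming a Linux machine with `/proc` mounted:

```python
import os

# Each process belongs to one namespace of each type; the kernel
# exposes them as symlinks named <type>:[<inode>] (Linux only).
for ns in sorted(os.listdir("/proc/self/ns")):
    print(ns, "->", os.readlink(f"/proc/self/ns/{ns}"))
```

Two processes sharing the same inode number for a given type are in the same namespace of that type.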
.debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Namespaces are always active - Namespaces exist even when you don't use containers. - This is a bit similar to the UID field in UNIX processes: - all processes have the UID field, even if no user exists on the system - the field always has a value / the value is always defined
(i.e. any process running on the system has some UID) - the value of the UID field is used when checking permissions
(the UID field determines which resources the process can access) - You can replace "UID field" with "namespace" above and it still works! - In other words: even when you don't use containers,
there is one namespace of each type, containing all the processes on the system. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Manipulating namespaces - Namespaces are created with two methods: - the `clone()` system call (used when creating new threads and processes), - the `unshare()` system call. - The Linux tool `unshare` allows doing that from a shell. - A new process can re-use none / all / some of the namespaces of its parent. - It is possible to "enter" a namespace with the `setns()` system call. - The Linux tool `nsenter` allows doing that from a shell. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Namespaces lifecycle - When the last process of a namespace exits, the namespace is destroyed. - All the associated resources are then removed. - Namespaces are materialized by pseudo-files in `/proc/
/ns`. ```bash ls -l /proc/self/ns ``` - It is possible to compare namespaces by checking these files. (This helps to answer the question, "are these two processes in the same namespace?") - It is possible to preserve a namespace by bind-mounting its pseudo-file. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Namespaces can be used independently - As mentioned in the previous slides: *A new process can re-use none / all / some of the namespaces of its parent.* - We are going to use that property in the examples in the next slides. - We are going to present each type of namespace. - For each type, we will provide an example using only that namespace. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## UTS namespace - gethostname / sethostname - Allows setting a custom hostname for a container. - That's (mostly) it! - Also allows setting the NIS domain. (If you don't know what a NIS domain is, you don't have to worry about it!) - If you're wondering: UTS = UNIX time sharing. - This namespace was named like this because of the `struct utsname`,
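The comparison trick mentioned above ("are these two processes in the same namespace?") can be sketched like this, assuming Linux and Python:

```python
import os

# Two processes are in the same namespace of a given type iff their
# /proc/<pid>/ns/<type> pseudo-files point to the same inode (Linux only).
def same_ns(pid_a, pid_b, ns_type="net"):
    ino = lambda pid: os.stat(f"/proc/{pid}/ns/{ns_type}").st_ino
    return ino(pid_a) == ino(pid_b)

# Comparing this process with itself, by PID and by "self":
print(same_ns(os.getpid(), "self", "uts"))  # prints True
```

(Comparing against another process's namespaces may require root, since dereferencing its `ns` files needs ptrace-level access.)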
which is commonly used to obtain the machine's hostname, architecture, etc. (The more you know!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Creating our first namespace Let's use `unshare` to create a new process that will have its own UTS namespace: ```bash $ sudo unshare --uts ``` - We have to use `sudo` for most `unshare` operations. - We indicate that we want a new uts namespace, and nothing else. - If we don't specify a program to run, a `$SHELL` is started. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Demonstrating our uts namespace In our new "container", check the hostname, change it, and check it: ```bash # hostname nodeX # hostname tupperware # hostname tupperware ``` In another shell, check that the machine's hostname hasn't changed: ```bash $ hostname nodeX ``` Exit the "container" with `exit` or `Ctrl-D`. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Net namespace overview - Each network namespace has its own private network stack. - The network stack includes: - network interfaces (including `lo`), - routing table**s** (as in `ip rule` etc.), - iptables chains and rules, - sockets (as seen by `ss`, `netstat`). - You can move a network interface from a network namespace to another: ```bash ip link set dev eth0 netns PID ``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Net namespace typical use - Each container is given its own network namespace. - For each network namespace (i.e. each container), a `veth` pair is created. 
(Two `veth` interfaces act as if they were connected with a cross-over cable.) - One `veth` is moved to the container network namespace (and renamed `eth0`). - The other `veth` is moved to a bridge on the host (e.g. the `docker0` bridge). .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Creating a network namespace Start a new process with its own network namespace: ```bash $ sudo unshare --net ``` See that this new network namespace is unconfigured:

```bash
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Creating the `veth` interfaces In another shell (on the host), create a `veth` pair: ```bash $ sudo ip link add name in_host type veth peer name in_netns ``` Configure the host side (`in_host`): ```bash $ sudo ip link set in_host master docker0 up ``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Moving the `veth` interface *In the process created by `unshare`,* check the PID of our "network container":

```bash
# echo $$
533
```

*On the host*, move the other side (`in_netns`) to the network namespace: ```bash $ sudo ip link set in_netns netns 533 ``` (Make sure to update "533" with the actual PID obtained above!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Basic network configuration Let's set up `lo` (the loopback interface): ```bash # ip link set lo up ``` Activate the `veth` interface and rename it to `eth0`: ```bash # ip link set in_netns name eth0 up ``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Allocating IP address and default route *On the host*, check the address of the Docker bridge: ```bash $ ip addr ls dev docker0 ``` (It could be something like `172.17.0.1`.) Pick an IP address in the middle of the same subnet, e.g. `172.17.0.99`.
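To double-check the pick, Python's `ipaddress` module can confirm that an address falls inside the bridge's subnet (using the example addresses from this slide):

```python
import ipaddress

# Sanity check: is the address we picked inside the bridge's /24?
bridge = ipaddress.ip_interface("172.17.0.1/24")  # docker0's address
picked = ipaddress.ip_address("172.17.0.99")      # our chosen address

print(picked in bridge.network)  # prints True
```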
*In the process created by `unshare`,* configure the interface: ```bash # ip addr add 172.17.0.99/24 dev eth0 # ip route add default via 172.17.0.1 ``` (Make sure to update the IP addresses if necessary.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Validating the setup Check that we now have connectivity: ```bash # ping 1.1 ``` Note: we were able to take a shortcut, because Docker is running, and provides us with a `docker0` bridge and a valid `iptables` setup. If Docker is not running, you will need to take care of this! .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Cleaning up network namespaces - Terminate the process created by `unshare` (with `exit` or `Ctrl-D`). - Since this was the only process in the network namespace, it is destroyed. - All the interfaces in the network namespace are destroyed. - When a `veth` interface is destroyed, it also destroys the other half of the pair. - So we don't have anything else to do to clean up! .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Other ways to use network namespaces - `--net none` gives an empty network namespace to a container. (Effectively isolating it completely from the network.) - `--net host` means "do not containerize the network". (No network namespace is created; the container uses the host network stack.) - `--net container` means "reuse the network namespace of another container". (As a result, both containers share the same interfaces, routes, etc.) 
.debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Mnt namespace - Processes can have their own root fs (à la chroot). - Processes can also have "private" mounts. This allows: - isolating `/tmp` (per user, per service...) - masking `/proc`, `/sys` (for processes that don't need them) - mounting remote filesystems or sensitive data,
but making them visible only to allowed processes - Mounts can be totally private, or shared. - At this point, there is no easy way to pass along a mount from one namespace to another. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Setting up a private `/tmp` Create a new mount namespace: ```bash $ sudo unshare --mount ``` In that new namespace, mount a brand new `/tmp`: ```bash # mount -t tmpfs none /tmp ``` Check the content of `/tmp` in the new namespace, and compare to the host. The mount is automatically cleaned up when you exit the process.
- Our new namespace still has access to the original `/proc`. - Therefore, it still sees host processes. - But it cannot affect them. (Try to `kill` a process: you will get `No such process`.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## PID namespaces, take 2 - This can be solved by mounting `/proc` in the namespace. - The `unshare` utility provides a convenience flag, `--mount-proc`. - This flag will mount `/proc` in the namespace. - It will also unshare the mount namespace, so that this mount is local. Try it: ```bash $ sudo unshare --pid --fork --mount-proc # ps faux ``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## OK, really, why do we need `--fork`? *It is not necessary to remember all these details.
This is just an illustration of the complexity of namespaces!* The `unshare` tool calls the `unshare` syscall, then `exec`s the new binary.
A process calling `unshare` to create new namespaces is moved to the new namespaces...
... Except for the PID namespace.
(Because this would change the current PID of the process from X to 1.) The processes created by the new binary are placed into the new PID namespace.
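This can be sketched with a quick experiment (a hedged illustration, not part of the official demos; assumes Linux with util-linux's `unshare`, run as root):

```shell
# The sh below is the process that called unshare(): it keeps its
# original PID, because only its *children* enter the new namespace.
sudo unshare --pid sh -c '
  echo "shell PID, still in the old namespace: $$"
  # The first child created here becomes PID 1 of the new namespace:
  sh -c "echo child PID in the new namespace: \$\$"
  # Once that PID 1 has exited, further forks fail with ENOMEM
  # (the shell also prints a "Cannot allocate memory" error):
  sh -c true || echo "second fork failed: the namespace has no PID 1"
'
```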
The first one will be PID 1.
If PID 1 exits, it is not possible to create additional processes in the namespace.
(Attempting to do so will result in `ENOMEM`.) Without the `--fork` flag, the first command that we execute will be PID 1 ...
... And once it exits, we cannot create more processes in the namespace! Check `man 2 unshare` and `man pid_namespaces` if you want more details. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## IPC namespace -- - Does anybody know about IPC? -- - Does anybody *care* about IPC? -- - Allows a process (or group of processes) to have own: - IPC semaphores - IPC message queues - IPC shared memory ... without risk of conflict with other instances. - Older versions of PostgreSQL cared about this. *No demo for that one.* .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## User namespace - Allows mapping UID/GID; e.g.: - UID 0→1999 in container C1 is mapped to UID 10000→11999 on host - UID 0→1999 in container C2 is mapped to UID 12000→13999 on host - etc. - UID 0 in the container can still perform privileged operations in the container. (For instance: setting up network interfaces.) - But outside of the container, it is a non-privileged user. - It also means that the UID in containers becomes unimportant. (Just use UID 0 in the container, since it gets squashed to a non-privileged user outside.) - Ultimately enables better privilege separation in container engines. .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## User namespace challenges - UID needs to be mapped when passed between processes or kernel subsystems. - Filesystem permissions and file ownership are more complicated. .small[(E.g. when the same root filesystem is shared by multiple containers running with different UIDs.)] - With the Docker Engine: - some feature combinations are not allowed
(e.g. user namespace + host network namespace sharing) - user namespaces need to be enabled/disabled globally
(when the daemon is started) - container images are stored separately
(so the first time you toggle user namespaces, you need to re-pull images) *No demo for that one.* .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Time namespace - Virtualize time - Expose a slower/faster clock to some processes (e.g. for simulation purposes) - Expose a clock offset to some processes (simulation, suspend/restore...) .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)] --- ## Cgroup namespace - Virtualize access to `/proc/<pid>/cgroup`
- Lets containerized processes view their relative cgroup tree ??? :EN:Containers internals :EN:- Control groups (cgroups) :EN:- Linux kernel namespaces :FR:Fonctionnement interne des conteneurs :FR:- Les "control groups" (cgroups) :FR:- Les namespaces du noyau Linux .debug[[containers/Namespaces_Cgroups.md](https://github.com/BretFisher/container.training/tree/tampa/slides/containers/Namespaces_Cgroups.md)]
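---

class: extra-details, deep-dive

## Cgroup namespace in action

A minimal sketch of that relative view (an added illustration; assumes a cgroup v2 host with util-linux's `unshare`, run as root):

```shell
# On the host, a process sees its full position in the cgroup tree:
cat /proc/self/cgroup
# e.g. 0::/user.slice/user-1000.slice/session-4.scope

# Unsharing the cgroup namespace makes the current cgroup the new root,
# so the same file now shows a path relative to that root:
sudo unshare --cgroup cat /proc/self/cgroup
# 0::/   (on cgroup v2)
```

This relative view is why a process inside a container sees `0::/` instead of its real location in the host's cgroup tree.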