{"id":22962940,"url":"https://github.com/alessandro308/ict-infrastructure","last_synced_at":"2026-01-12T06:46:28.617Z","repository":{"id":98218302,"uuid":"114417976","full_name":"alessandro308/ICT-infrastructure","owner":"alessandro308","description":"Just try to recap all the topic debated in the ICT Infrastructure Course","archived":false,"fork":false,"pushed_at":"2019-01-29T10:37:13.000Z","size":5187,"stargazers_count":9,"open_issues_count":0,"forks_count":8,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-02-07T18:19:35.958Z","etag":null,"topics":["architecture","data-center","datacenter","disk","ethernet","infrastructure","protocol","storage"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alessandro308.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-15T23:00:36.000Z","updated_at":"2023-03-09T01:33:21.000Z","dependencies_parsed_at":"2023-03-16T18:30:55.899Z","dependency_job_id":null,"html_url":"https://github.com/alessandro308/ICT-infrastructure","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alessandro308%2FICT-infrastructure","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alessandro308%2FICT-infrastructure/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alessandro308%2FICT-infrastructure/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/alessandro308%2FICT-infrastructure/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alessandro308","download_url":"https://codeload.github.com/alessandro308/ICT-infrastructure/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246750114,"owners_count":20827653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["architecture","data-center","datacenter","disk","ethernet","infrastructure","protocol","storage"],"created_at":"2024-12-14T19:18:47.633Z","updated_at":"2026-01-12T06:46:28.611Z","avatar_url":"https://github.com/alessandro308.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"ICT Infrastructures - University of Pisa (Italy)\n\n*Since there is no material on ICT Infrastructures course, I'm trying to recap all lessons done in this page. The notes are written trying to remember the contents of the course (in accordance with the OneNote Notebook published on course page) and then expanding that contents with structured resources found online. 
If you find any error please, fork and submit a pull request!*\n\n# Table of contents\n\u003c!-- AUTO-GENERATED-CONTENT:START (TOC:collapse=true\u0026collapseText=Click to expand) --\u003e\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand\u003c/summary\u003e\n\n- [Introduction](#introduction)\n- [Cloud Computing Reference Model](#cloud-computing-reference-model)\n  * [Virtual Layer](#virtual-layer)\n  * [Control Layer](#control-layer)\n  * [Service Layer](#service-layer)\n  * [Service Managment](#service-managment)\n  * [Business Continuity](#business-continuity)\n- [Datacenters](#datacenters)\n- [Design and Architectures](#design-and-architectures)\n- [Cooling](#cooling)\n    + [CRAC: Computer Room Air Conditioner](#crac-computer-room-air-conditioner)\n    + [Hot aisle datacenter](#hot-aisle-datacenter)\n    + [In-Row cooling](#in-row-cooling)\n  * [Liquid cooling](#liquid-cooling)\n- [Current](#current)\n  * [Power Distribution](#power-distribution)\n  * [PUE: Power Usage Effectiveness](#pue-power-usage-effectiveness)\n- [Fabric](#fabric)\n    + [Ethernet](#ethernet)\n    + [Infiniband](#infiniband)\n    + [Omni-Path](#omni-path)\n      - [RDMA: Remote Direct Memory Access](#rdma-remote-direct-memory-access)\n  * [Some consideration about numbers](#some-consideration-about-numbers)\n    + [Real use case](#real-use-case)\n  * [Connectors \u0026 plugs](#connectors--plugs)\n  * [Software Defined *** and Open Newtwork](#software-defined--and-open-newtwork)\n    + [Open Flow](#open-flow)\n    + [SDN: Software Defined Networking](#sdn-software-defined-networking)\n    + [Software-defined data center](#software-defined-data-center)\n  * [Hyperconvergence](#hyperconvergence)\n- [Network topologies](#network-topologies)\n    + [Spanning Tree Protocol (STP)](#spanning-tree-protocol-stp)\n    + [Three-tier design](#three-tier-design)\n    + [Network Chassis](#network-chassis)\n    + [Stacking](#stacking)\n    + [Spine and leaf 
Architecture](#spine-and-leaf-architecture)\n    + [Full Fat Tree](#full-fat-tree)\n  * [VLAN](#vlan)\n  * [Switch Anatomy](#switch-anatomy)\n- [Disks and Storage](#disks-and-storage)\n- [Interfaces](#interfaces)\n- [Redundancy](#redundancy)\n- [IOPS](#iops)\n- [Functional programming](#functional-programming)\n- [Memory Hierarchy](#memory-hierarchy)\n  * [NVMe](#nvme)\n  * [Storage aggregation](#storage-aggregation)\n- [Network Area Storage (NAS)](#network-area-storage-nas)\n- [Storage Area Network (SAN)](#storage-area-network-san)\n  * [Benefits](#benefits)\n- [HCI - Hyperconvergent Systems](#hci---hyperconvergent-systems)\n- [SDS - Software Defined Storage](#sds---software-defined-storage)\n- [Non-RAID drive architectures](#non-raid-drive-architectures)\n- [Some consideration about Flash Drives](#some-consideration-about-flash-drives)\n- [Storage in the feature](#storage-in-the-feature)\n- [Hypervisors](#hypervisors)\n- [Servers](#servers)\n- [Form-factors](#form-factors)\n    + [Miscellaneous](#miscellaneous)\n- [Cloud](#cloud)\n  * [Rapid Elasticity](#rapid-elasticity)\n  * [High Avaialability](#high-avaialability)\n- [Cloud computering Layer](#cloud-computering-layer)\n    + [Phyisical Layer](#phyisical-layer)\n- [Virtual Layer](#virtual-layer-1)\n    + [About the virtual memory:](#about-the-virtual-memory)\n      - [Balooning](#balooning)\n    + [Other considerations about the Virtual Layer](#other-considerations-about-the-virtual-layer)\n      - [vMotion - Live Migration](#vmotion----live-migration)\n    + [Docker](#docker)\n- [Control Layer](#control-layer-1)\n  * [Service orchestration Layer](#service-orchestration-layer)\n- [Business Continuity](#business-continuity-1)\n    + [Backups](#backups)\n  * [Security](#security)\n    + [Firwall](#firwall)\n  * [Service Managment](#service-managment-1)\n- [GDPR General Data Protection Regulation](#gdpr-general-data-protection-regulation)\n- [Vendor Lock-in](#vendor-lock-in)\n  * 
[Standardization-Portability](#standardization-portability)\n- [Orchestration](#orchestration)\n- [Fog Computing](#fog-computing)\n- [Miscellaneous](#miscellaneous-1)\n  * [Redundancy](#redundancy-1)\n- [In class exercises](#in-class-exercises)\n- [1 - Discuss the difference between spine and leaf fabric and the more traditional fabric architecture based on larger chassis. How bandwidth and latency are affected?](#1---discuss-the-difference-between-spine-and-leaf-fabric-and-the-more-traditional-fabric-architecture-based-on-larger-chassis-how-bandwidth-and-latency-are-affected)\n- [Spine and Leaf](#spine-and-leaf)\n- [Traditional Chassis](#traditional-chassis)\n- [2 - What actions can take the orchestration layer of a cloud system, and based on what information, in order to decide how many web server istances should be used to serve a Web system?](#2---what-actions-can-take-the-orchestration-layer-of-a-cloud-system-and-based-on-what-information-in-order-to-decide-how-many-web-server-istances-should-be-used-to-serve-a-web-system)\n- [3 - Discuss a datacenter architecture made of 10 racks. Assuming a power distribution of 15 W/ rack.](#3---discuss-a-datacenter-architecture-made-of-10-racks-assuming-a-power-distribution-of-15-w-rack)\n- [4 - A service requires a sustained throughput towards the storage of 15 GB/s. Would you recomment using a SAN architecture or an hyperconvergent one.](#4---a-service-requires-a-sustained-throughput-towards-the-storage-of-15-gbs-would-you-recomment-using-a-san-architecture-or-an-hyperconvergent-one)\n  * [SAN area network (recap)](#san-area-network-recap)\n  * [NAS](#nas)\n  * [HCI (hyperconvergent)](#hci-hyperconvergent)\n    + [Discussion](#discussion)\n- [What should I look for..](#what-should-i-look-for)\n- [5 - A service requires a sustained throughput towards the storage of 15 GB/s. 
How would you dimension an hyperconvergent system to ensure it works properly?](#5---a-service-requires-a-sustained-throughput-towards-the-storage-of-15-gbs-how-would-you-dimension-an-hyperconvergent-system-to-ensure-it-works-properly)\n- [References](#references)\n\n\u003c/details\u003e\n\u003c!-- AUTO-GENERATED-CONTENT:END --\u003e\n\n# Introduction\nThe world is changing and a lot of axioms are becoming false. Some examples? In the bachelor course (and not only there, sigh), the teachers say: \"The main bottleneck is the disk\", and so all performance is evaluated with reference to disk usage, number of I/O operations and so on... This, nowadays, is false.  Just think of the [Intel Optane SSD](https://www.anandtech.com/show/11702/intel-introduces-new-ruler-ssd-for-servers), whose technology, based on 3D XPoint memory rather than NAND flash, reads and writes faster than previous SSDs (the disks that we have installed in our systems, sigh number 2), so we have to redesign the system. There is also **nvRAM**, non-volatile RAM: a module that is persistent like a hard drive but really fast. Some distributed file systems, written in the '90s, are crashing because they rely on the axiom that disks are slower than the CPU, so that there is enough time to do all the computation needed. False! \n\nAnother example is in application and server distribution. In the past many applications were managed on each server with shared storage; nowadays we deploy a large application on a cluster of servers with local storage, so new systems to develop and manage distributed computing applications are needed (Hadoop, Cassandra (distributed DB), Spark (computation)...).\n\nThe world is evolving faster than I can write these notes, so maybe some things written here are already obsolete. Let's not waste any more time on the introduction, to avoid having to rewrite it. \n\nLet's start by seeing how a datacenter is built to support new requests. 
\n\n# Cloud Computing Reference Model\nJust a brief overview of the reference model of cloud computing.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"600\" src=\"./assets/referenceModel.png\"\u003e\n\u003c/p\u003e\n\n### Virtual Layer\nThe physical server is partitioned into many virtual ones to use the hardware better.  \n**High Performance Computing** bypasses the virtual layer for performance reasons.\n\n### Control Layer\nDynamic allocation rather than static.\n\n### Service Layer\nA kind of self service. Use the resources you need without knowing where they are allocated.\n\n### Service Managment\nUpgrading the software or the firmware while the system is running.\n\n### Business Continuity\n**Backups** vs **Replicas**: doing a backup of 1 PB may be a problem.  \n**Fault Tolerance**: I should be able to power off a server without anyone noticing it.\n\n\n# Datacenters\n\nA data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup components and infrastructure for power supply, data communications connections, environmental controls (e.g. air conditioning, fire suppression) and various security devices. A large data center is an industrial-scale operation using as much electricity as a small town.\n\nOn average there are only 6 people managing 1 million servers.\nPrefabricated groups of racks, already cabled and cooled, are automatically inserted in the datacenter (POD, Point Of Delivery). If something is not working in the prefabricated unit, the specific server is shut down. If more than 70% of it is not working, the POD producer will simply replace the entire unit.\n\nThe datacenter is a place where we concentrate IT systems in order to reduce costs. Servers are demanding in terms of current, cooling and security. \n\n\n## Design and Architectures\n\n## Cooling\n\nToday cooling is air based. Liquid cooling is just at the beginning.  
\nThe air pushed through the server heats up by 10-15 degrees.\n#### CRAC: Computer Room Air Conditioner\n\nPopular in the '90s (3-5 kW/rack), but not very efficient in terms of energy consumption.  \nThere is a **floating floor**: all the cabling and the cooling is done under the floor. The air goes up because of thermal convection.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"600\" src=\"./assets/crac.png\"\u003e\n\u003c/p\u003e\n\nDrawbacks are density (if we want to go dense this approach fails) and the absence of locality.\nNo one is doing this today.\n\n#### Hot aisle datacenter\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"600\" src=\"./assets/crac1.png\"\u003e\n\u003c/p\u003e\nHot and cold corridors. \n**Workload balancing** may be a problem: one rack can be hotter than another depending on the workload, and it is difficult to modulate the amount of hot and cold air. In the CRAC model the solution is to pump enough air for the highest consumer.\n\n#### In-Row cooling\n\nIn-row cooling technology is a type of air conditioning system commonly used in data centers in which the cooling unit is placed between the server cabinets in a row for offering cool air to the server equipment more effectively.\n\nIn-row cooling systems use a horizontal airflow pattern utilizing hot aisle/cold aisle configurations and they only occupy one-half rack of row space without any additional side clearance space. Typically, each unit is about 12 inches wide by 42 inches deep.\n\nThese units may be a supplement to raised-floor cooling (creating a plenum to distribute conditioned air) or may be the primary cooling source on a slab floor.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"600\" src=\"./assets/in-row-cooling.jpg\"\u003e\n\u003c/p\u003e\n\nThe in-row cooling unit draws warm exhaust air directly from the hot aisle, cools it and distributes it to the cold aisle. 
This ensures that inlet temperatures are steady for precise operation. Coupling the air conditioning with the heat source produces an efficient direct return air path; this is called “close coupled cooling,” which also lowers the fan energy required. In-row cooling also prevents the mixing of hot and cold air, thus increasing efficiency.\n\nIt's possible to give more cooling to a single rack, modulating the air needed. In front of the rack there are temperature and humidity sensors (humidity should be avoided because it can conduct electricity).\nThere are systems collecting data from the sensors and adjusting the fans. The racks are covered to separate cool air and hot air. It's also possible to optimize the datacenter cooling according to the temperature changes of the region where the datacenter is.\n\nGenerally the layout is 2 racks (70 cm each), 1 cooling row (30 cm), 2 racks, 1 cooling row, and so on.\n\n### Liquid cooling\nHaving water in a DC is a risky business (even if there are different ways to handle a fire). Making the water flow over the CPUs lowers the temperature by ~40%. One way of chilling the water could be pushing it down into the ground. This requires a Water Distribution System, similar to the Power Distribution System.\n\n## Current\nA 32 kW datacenter (the consumption of 10 apartments) is small.  \n**Direct Current Transformers** convert AC to DC. Direct current is distributed inside the datacenter even if it is more dangerous than alternating current.\n\nWatt = cos φ * V * A  \n**cos φ** (the power factor) gives the efficiency of the power supply and generally changes according to the amount of current needed (idle vs under load).\n\nFor example an idle server with 2 CPUs (14 cores each) consumes 140 Watts.\n\n### Power Distribution\nThe industrial current is 380 Volts, 3 phases.  \nThe amount of current allowed in a DC is given by the amperes available on the **PDU** (Power Distribution Unit).\n\nThere are one or more lines (for reliability and fault tolerance reasons) coming from different generators to the datacenter (e.g. 
each line carries 80 kW, about 200 A. It can feed 6 racks at 32 A per rack; since a rack rarely draws the whole 32 A, more racks can be attached).  \nThe lines are attached to a **UPS Uninterruptible Power Supply/Source**. It is a rack or half a rack of batteries (not enough to keep the servers on for long) that in some cases can power the DC for ~20 minutes. There are also a **Control Panel** and a **Generator**. When the power lines fail, the UPS supplies power between their failure and the start of the generator.  The energy that arrives at the UPS has to be divided among the servers and the switches.\nThe UPS is attached to the **PDU** (Power Distribution Unit), which is linked to the **server PDU** with a pair of lines for redundancy. The server PDU has its power plugs in a row, and they can be monitored via a web server running on the rack PDU. Example of rack PDU: 2 banks, 12 plugs each, 16 A each bank, 15 kW per rack, 42 servers per rack.\n\n### PUE: Power Usage Effectiveness\n\nPUE is a ratio that describes how efficiently a computer data center uses energy; specifically, how much energy is used by the computing equipment (in contrast to cooling and other overhead).\n\nPUE is the ratio of the total amount of energy used by a computer data center facility to the energy delivered to computing equipment. PUE is the inverse of data center infrastructure efficiency (DCIE).\n\nAs an example, the PUE of the university's datacenter during 2018 was less than 1.2, while the PUE of the average Italian datacenter is around 2-2.5.\nA PUE equal to 2 means that for each Watt used for computing, one additional Watt is spent on cooling and other overhead (2 Watts in total).\nThe ratio is the total power drawn by the facility divided by the power delivered to the computing equipment.\n\n# Fabric\nThe fabric is the interconnection between nodes inside a datacenter. We can think of this level as a bunch of switches and wires. 
\n\nWe call North-South traffic the traffic going out of and coming into the datacenter (internet), while East-West traffic is the internal traffic between servers.\n\n#### Ethernet\nThe connection can be performed with various technologies. The most famous is **Ethernet**, commonly used in Local Area Networks (LAN) and Wide Area Networks (WAN). Ethernet uses twisted pair and fiber optic links. Ethernet has some famous features, such as the 48-bit MAC address and the Ethernet frame format, that influenced other networking protocols. \n\n**MTU** (Maximum Transmission Unit) up to 9 KB with the so called **Jumbo Frames**.\nOn top of Ethernet there are the TCP/IP protocols (this is a standard), which introduce about 70-100 microseconds of latency.\n\n#### Infiniband\nEven if Ethernet is so famous, there are other communication standards. **InfiniBand (IB)** is another standard used in high-performance computing (HPC) that features very high throughput and very low latency (about 2 microseconds). InfiniBand is both a protocol and a physical infrastructure, and it can send messages of up to 2 GB with 16 priority levels.\n[RFC 4391](https://tools.ietf.org/html/rfc4391) specifies a method for encapsulating and transmitting IPv4/IPv6 and Address Resolution Protocol (ARP) packets over InfiniBand (IB).\n\nInfiniBand transmits data in packets of up to 4 KB. A message can be:\n - a direct memory access read from or write to a remote node (**RDMA**)\n - a channel send or receive\n - a transaction-based operation (that can be reversed)\n - a multicast transmission\n - an atomic operation\n\nPros:\n - no retransmission\n - QoS, traffic preserved, reliable\n\n#### Omni-Path\nAnother communication architecture that exists and is interesting to look at is Omni-Path. This architecture is owned by Intel and performs high-performance communication. 
Production of Omni-Path products started in 2015 and mass delivery of these products started in the first quarter of 2016 (more details can be found on [Wikipedia](https://en.wikipedia.org/wiki/Omni-Path)). \nThe interest of this architecture is that Intel plans to develop technology based on it that will serve as the on-ramp to exascale computing (a computing system capable of at least one exaFLOPS). \n\n##### RDMA: Remote Direct Memory Access\nIf you read the Wikipedia pages about IB and Omni-Path you will find an acronym: RDMA. This acronym means Remote Direct Memory Access, a direct memory access (really!) from one computer into that of another without involving either one's OS. This permits high-throughput, low-latency networking.\n\nRDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.\n\n### Some consideration about numbers\nLet's start thinking about the real world. We have some servers with 1 Gbps links (not such a high speed: it is what you can reach with your laptop by plugging in a cable in a classroom at the university). We have to connect these servers to each other using switches (each of them has 48 ports). We have a lot of servers... The computation is shown below.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"600\" src=\"./assets/speed-required.png\"\u003e\n\u003c/p\u003e\n\n#### Real use case\nAs we see, we need a lot of bandwidth to manage a lot of services (you don't say?) 
and even if the north-south traffic (the traffic that goes outside our datacenter) can be relatively small (the university connects to the world at 40 Gbps), the east-west traffic (the traffic inside the datacenter) can reach a huge number of Gbps. The [Aruba datacenter](https://www.arubacloud.com/infrastructures/italy-dc-it1.aspx) (called IT1), together with another Aruba datacenter (IT2), reaches 82 Gbps of Internet connection bandwidth.\n\nYesterday I went to the master's thesis defence of a friend of mine. He is a physicist and his experiment requires 2.2 Tbps of bandwidth to store the data it produces, so a public cloud is impossible to use. How can we manage 2.2 Tbps? Maybe we can answer this question (hopefully, otherwise the exam is failed :/ ).\n\n### Connectors \u0026 plugs\nNow we try to analyse the problem from the connector point of view. The fastest wire technology available is optical fiber. It can be divided into two categories: monomodal (1310 nm) or multimodal (850 nm). The monomodal fiber is more expensive but has better properties; the multimodal one is acceptable for a datacenter. They also use different transceivers. There are two kinds of connectors: LC, fine for datacenters, and SC, usually used in metropolitan areas because it has better signal propagation (there can be a cable with an LC on one side and an SC on the other side).  \n\nOf course, a wire is a wire, and we need something to connect it to the equipment. One option is the Small form-factor pluggable transceiver (SFP), a compact, hot-pluggable optical module transceiver. Its upgrade is the SFP+, which supports data rates up to 16 Gbps and 10 Gigabit Ethernet. A QSFP+ module groups four such 10 Gbps lanes to reach 4x10 Gbps, while QSFP28 groups four 25 Gbps lanes to reach 100 Gbps on Ethernet, the upper limit for the data rate at the time of writing.\n\nFrom left to right: RJ45 plug, SFP+ and QSFP+ **transceiver module**, LC connector. 
\n\u003cp float=\"left\"\u003e\n  \u003cimg width=\"100\" src=\"./assets/rj45.jpeg\"\u003e\n  \u003cimg width=\"150\" src=\"./assets/sfpplus.jpg\"\u003e\n  \u003cimg width=\"250\" src=\"./assets/qsfpplus.png\"\u003e\n  \u003cimg width=\"150\" src=\"./assets/lc-duplex.jpg\"\u003e\n\u003c/p\u003e\n\n\nThe **RJ45** plug supports 10/100 Mbps and 1/2.5/5 Gbps, but in datacenters there are almost no installations of it.  \nCables have categories: \n- cat4\n- cat5  \n- cat6  \n\n2.5/5 Gbps are new standards working on cat5 and cat6 cables respectively, in order to deliver more bandwidth to the wifi access point.  \nUp to 16 Gbps uses the **SFP+** plug; **SFP28** is its 25 Gbps successor (the 28 refers to the maximum lane rate of 28 Gbps, not to a number of pins).  \n40 Gbps (4 lanes of 10 Gbps each) uses **QSFP+**; **QSFP28** carries 100 Gbps (4 lanes of 25 Gbps each).\n\nNowadays we have:\n- 25 Gbps \n- 50 Gbps (2 * 25)\n- 100 Gbps (4 * 25)\n\nThe **Transceiver module** can serve copper or optical fiber; it has a chip inside and is not cheap.\n\n### Software Defined *** and Open Newtwork\n\nSoftware Defined something, where something is Networking (**SDN**) or Storage (**SDS**), is a novel approach to cloud computing. \n\n#### Open Flow\n\n[OpenFlow](https://en.wikipedia.org/wiki/OpenFlow) is a communications protocol that gives access to the forwarding plane of a network switch or router over the network.\nOnce the firewall has approved the initial connection, the switch redirects the allowed traffic to another port, bypassing the firewall, since the firewall is not able to handle the entire data flow bandwidth ([Open daylight](https://www.opendaylight.org/)).\n\n- copy/redirect/close the flow to optimize and control the behaviour of the network.\n\n#### SDN: Software Defined Networking\nSDN is an architecture that aims to be dynamic, manageable and cost-effective, plus some more nice attributes listed [here](https://en.wikipedia.org/wiki/Software-defined_networking#Concept). 
This type of software creates a virtual network to make the network simpler to manage.\n\nThe main concepts are the following:\n - Network control is directly programmable (also remotely)\n - The infrastructure is agile, since it can be dynamically adjusted\n - It is programmatically configured and is managed by a software-based SDN controller\n - It is Open Standard-based and Vendor-neutral\n\nThere is a **flow table** in the switches that remembers the connections. The routing policies are applied according to this table.  \nDeep packet inspection is performed by a level 7 firewall. The firewall validates the flow and, if it is aware that the flow needs bandwidth, allows it to bypass the redirection (through the firewall).\n\n#### Software-defined data center\nSoftware-defined data center is a sort of upgrade of the previous term and indicates the application of a series of virtualization concepts, such as abstraction, pooling and automation, to all data center resources and services to achieve IT as a service.\n\n### Hyperconvergence\nSo we virtualize the networking, the storage, the data center... and the cloud! Some vendors, such as [Nutanix](https://www.nutanix.com/hyperconverged-infrastructure/), build [hyperconverged infrastructure HCI](https://en.wikipedia.org/wiki/Hyper-converged_infrastructure) technology.\n\n## Network topologies\n\nA way of cabling that allows multiple computers to communicate. It's not necessarily a graph, but for reliability purposes it is often realized as a set of connected nodes. At least 10% of nodes should be connected in order to guarantee sufficient reliability ([Small World Theory](https://en.wikipedia.org/wiki/Small-world_network)).\n\nAt layer 2 there is no routing table, even if there are some cache mechanisms. 
The topology is more like a tree than a graph because some edges can be cut, preserving reachability and lowering the costs.\n\n#### Spanning Tree Protocol (STP) \n\nThe Spanning Tree Protocol is a network protocol that builds a logical loop-free topology for Ethernet networks. The spanning tree is built using Bridge Protocol Data Unit (BPDU) frames. In 2001 the IEEE introduced the Rapid Spanning Tree Protocol (RSTP), which provides significantly faster spanning tree convergence after a topology change.\n\nNowadays this protocol is used only in campus networks and not in datacenters, due to its high convergence latency (up to 10-15 seconds to activate a backup line).\n\n#### Three-tier design\n\nThis is a simple architecture where each component has a redundant unit to replace it in case of failure.\n\n#### Network Chassis\nThe Network Chassis is a sort of big modular and resilient switch. At the bottom it has a pair of power plugs, and then it's made of modular **line cards** (with some kind of ports) and a pair of **RPM** (Routing Processing Modules) that ensure that the line cards work. The chassis can be over-provisioned to withstand aging, but only up to a limit.  \nPros\n- resilient\n- 1 CLI per switch\n- expandable  \n\nCons\n- expensive\n- not entirely future proof (today some switches may need up to a 1 kW power supply, while years ago they needed only 200 W)\n- aging problem\n\nThe chassis is connected with the rack's **tor** and **bor** (top/bottom of rack) switches via a double link. \n\n#### Stacking\nIndependent switches stacked with dedicated links. It's cheaper than the chassis but there is less redundancy.\n\n#### Spine and leaf Architecture\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"600\" src=\"./assets/spine-leaf-vs-3-tier.png\"\u003e\n\u003c/p\u003e\n\nWith the increased focus on east-west data transfer the three-tier design architecture is being replaced with Spine-Leaf design. 
The switches are divided into 2 groups, the leaf switches and the spine switches. Every leaf switch in a leaf-spine architecture connects to every spine switch in the network fabric. \nIn that topology the **Link Aggregation Control Protocol (LACP) is used**. It provides a method to control the bundling of several physical ports together to form a single logical channel. The bandwidth is aggregated (e.g. 2*25 Gbps), but a single flow is still capped at 25 Gbps because at any given time it travels over only one of the physical links. \n\n- fixed form factor (non modular switches)\n- active-active redundancy\n- loop aware topology (no links disabled).\n- interconnect using standard cables (decide how many links to use to interconnect spines with leaves and how many other links go to the racks).\n\nWith this architecture it's possible to turn off one switch, upgrade it and reboot it without compromising the network.\n\nA typical configuration of the ports and bandwidth of the leaves is:\n- one third going upwards and two thirds going downwards\n- 48 ports at 10 Gbps each, 6 ports at 40 Gbps each\n- or 48 ports at 25 Gbps each, 6 ports at 100 Gbps each\n\nJust a small remark: with spine and leaf we introduce more hops, so more latency, than the chassis approach.\n\n#### Full Fat Tree\n\nIn this network topology, the links that are nearer the top of the hierarchy are \"fatter\" (thicker) than the links further down the hierarchy. It is used only in high performance computing, where performance has priority over budget.\n\nThe full fat tree resolves the problem of over-subscription. Adopting the spine and leaf there is the risk that the links closer to the spines can't sustain the traffic coming from all the links going from the servers to the leaves. The full fat tree is a way to build a tree so that the capacity is never less than the incoming traffic. 
It's quite expensive, and for this reason some over-subscription can be accepted.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"200\" src=\"./assets/full-fat-tree-network.png\"\u003e\n\u003c/p\u003e\n\n### VLAN\nNow, the problem is that every switch can be connected to every other one, so there is no more LAN separation in the datacenter: every packet can go wherever it wants and some problems may appear. The VLAN was invented to solve this problem. It partitions a broadcast domain and creates isolated computer networks.\n\nIt works by applying _tags_ to network packets (in the Ethernet frame) and handling these tags in the networking systems. \n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"600\" src=\"./assets/vlan.png\"\u003e\n\u003c/p\u003e\n\nA switch can be configured to accept some tags on some ports and some other tags on some other ports. \n\nVLANs are useful to manage access control to some resources (and to prevent access to some subnetworks from other subnetworks). Different VLANs serve different purposes.\n\n### Switch Anatomy\nA switch is built around an ASIC (application-specific integrated circuit). It can have a proprietary or a non-proprietary architecture. Layer 2 switches receive packets and implement the equivalent of a bus: store and forward (there is a special address allowing broadcast). At layer 3 there is no loop problem, as there is at layer 2, because of the routing table.\n\nDatacenter switches are usually non-blocking. This basically means that these switches have the forwarding capacity to support all ports concurrently at full port capacity.\n\nNow some standards are trying to impose a common structure on the network elements (switches included) to facilitate the creation of standard orchestration and automation tools.\n\nThe internals are made of a **control plane**, which is configurable, and a **data plane**, where the ports are. The control plane has evolved over the years: now it runs an OS on Intel CPUs. 
Through a CLI (Command Line Interface) it's possible to configure the control plane. Some examples of commands are:\n- show running config\n- show interfaces status \n- show vlan\n- config (to enter config mode)\n\nSome protocols in the switch (bold ones are important):\n- PING to test connectivity.\n- LLDP Link Layer Discovery Protocol (a way to explore the graph).\n- **STP** Spanning Tree Protocol (to avoid loops).\n- RSTP Rapid-STP\n- DCBX Data Center Bridging eXchange (QoS, priority)\n- PFC Priority Flow Control\n- ETS Enhanced Transmission Selection (priority)\n- **LACP**  Link Aggregation Control Protocol (use two wires as if they were one).\n\n**ONIE** (Open Network Install Environment) boot loader  \n\nThe switch has a firmware and two slots for the OS images. When updating, the first slot stores the old OS image and the second slot the new one.\n\n**NFV** Network Functions Virtualization (5G is mostly NFV based)  \nThe data plane is connected to a VM in the datacenter which acts as a control plane.\n\n# Disks and Storage\nAfter the fabric, another fundamental component of a datacenter is the storage. The storage can be provided with various technologies. \nThe simplest one is to put the disks inside each server and use them as we use the disk on our laptop. Of course this is not useful if we have a large amount of data to manage, and a networked solution can be a better choice.\n\n## Interfaces\n\n- SATA\n- SAS Serial Attached SCSI\n- NVMe (Non-Volatile Memory express): controller-less, protocol used over the PCI Express bus\n- ...\n\n## Redundancy\n\n[RAID](https://en.wikipedia.org/wiki/RAID#Standard_levels) stands for Redundant Array of Independent Disks. The RAID is done by the disk controller or the OS.   \nThe most common RAID configurations are:\n\n- RAID-0: striping, two drives aggregated that work as a single one.\n- RAID-1: mirroring, writes go to both drives so that one is a copy of the other.\n- RAID-5: block-level striping with distributed parity. 
It's XOR based: the first bit goes on the first disk, the second bit on the second one and their XOR on the third. If one disk crashes I can recompute its content from the other two (for every two bits of information I need one extra bit, so one third of the total disk capacity is spent on parity).\n- RAID-6: block-level striping with double distributed parity. Similar to RAID-5 but with a second parity block, so it tolerates the failure of two disks.\n\n## IOPS\n\nInput/output operations per second is an input/output performance measurement used to characterize computer storage devices (associated with an access pattern: random or sequential).\n\n## Functional programming\n\nIt has become popular also because of its nature: pure functions can easily be computed in a parallel system (no stored state, so no need for locks). It suits event-based programming: pass a function to be called when something happens. In object-oriented languages this is more complicated because we have interfaces, event listeners...  \n\n## Memory Hierarchy\n\n- CPU Registers\n- CPU Cache \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; | Caching\n- RAM\n- nvRAM  \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;| Memory tiering\n- SS Memory\n- Hard drive\n- Tape \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;| Storage tiering\n\nAs technology evolves, it gets harder to maintain a model that lasts. Memory tiering is a new term introduced with the Intel Skylake processor family (Xeon). \n\nnvRAM uses [nvDIMM](https://en.wikipedia.org/wiki/NVDIMM) (non-volatile Dual Inline Memory Module) to save energy because you can change the amount of current given to each pin; moreover the data doesn't need to be refreshed periodically to be retained. 
\n\nIn-memory databases, like Redis. If you lose power there are still mechanisms to avoid data loss.\n\nProcesses can share memory through the memory mapping technique (the memory is seen as a file).\n\n### NVMe\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/3d-xpoint-performance.jpg\" width=\"600\"\u003e\n\u003c/p\u003e\n\nIt's a protocol on the PCI Express bus and it's totally controller-less. From the software side it's simpler to talk with the disk this way because the drive is directly attached to the PCI bus: there is no controller and the latency is lower.\n\nA bus is a component to which I can attach different devices. It has a clock and some lanes (16 in PCIe, ~15 GB per second in total because each lane carries slightly less than 1 GB per second). Four drives are enough to exhaust a full PCIe Gen 3 bus. They are also capable of saturating a 100 Gbps link.\n\nNAND is a standard solid-state technology.\n\nBesides volatile RAM it's now possible to have persistent RAM.\n\n### Storage aggregation\n\nThe strategy for accessing drives makes the difference.  \nFibre Channel is the kind of fabric dedicated to storage. The link coming from the storage ends up in the Host Bus Adapter (HBA) in the server.\n\n## Network Attached Storage (NAS)\nNAS is a file-level computer data storage server connected to a computer network providing data access to a heterogeneous group of clients. NAS systems are networked appliances which contain one or more storage drives, often arranged into logical, redundant storage containers or RAID. They typically provide access to files using network file sharing protocols such as NFS, SMB/CIFS, or AFP over an optical fiber.\n\nWhen using a network file system protocol, you are using a NAS.\n\n## Storage Area Network (SAN)\nWhile NAS provides both storage and a file system, SAN provides only block-based storage and leaves file system concerns on the \"client\" side. 
SAN protocols include Fibre Channel, iSCSI (SCSI over TCP/IP), ATA over Ethernet (AoE) and HyperSCSI. It can be implemented as some controllers attached to some JBODs (Just a Bunch Of Disks).  \nThe SAN can be divided into different LUNs (Logical Units).\n\nIf the drive is seen as physically attached to the machine and a block transmission protocol is adopted, it means that you are using a SAN. The optical fiber has become the bottleneck (just four drives are enough to saturate a link).\n\nWith SAN the server has the impression that the LUN is attached directly to it, locally; with NAS there isn't this kind of abstraction.\n\nSome latency can be reduced if we stripe data in the correct way and exploit multiple seeks in parallel.\n\n### Benefits\nThe main features provided by a storage system are the following:\n - Thin provisioning\n\t- This is a virtualization technology that gives the appearance of having more physical resources than are actually available. Thin provisioning allows space to be easily allocated to servers, on a just-enough and just-in-time basis. Thin provisioning is called \"sparse volumes\" in some contexts.\n - Deduplication\n\t- If the same file is required in two contexts, it is saved once and served to both contexts.\n - Compression\n - Authentication\n - RTO/RPO \"support\" DR\n \t- The Recovery Point Objective is defined by business continuity planning. It is the maximum targeted period in which data might be lost from an IT service due to a major incident. \n - Network Interface (iSCSI, Fibre Channel...)\n - RAID\n - Tiering\n \t- Tiering is a technology that assigns a category to data in order to choose among various types of storage media and reduce the total storage cost. Tiered storage policies place the most frequently accessed data on the highest performing storage. 
Rarely accessed data goes on low-performance, cheaper storage.\n - NAS Protocols\n - Snapshot\n\n## HCI - Hyperconvergent Systems\n\n- Nutanix: the current leader of this technology\n- Ceph: a different architecture/approach\n- vSAN\n- S2D - Storage Spaces Direct\n\nThis kind of software is expensive (Nutanix HCI is fully software defined, so you do not depend on the vendor's hardware).\n\nThe main idea is not to design three different systems (compute, networking, storage) and then connect them, but to have a bit of each in every server I deploy. \"Adding servers adds capacity\".\n\nThe software works through the cooperation of different controllers (VMs), one in each node (server). The controller (VM) implements the storage abstraction across the nodes and also implements the logical moving of data. Every write keeps a copy on the local server storage, exploiting the PCI bus and avoiding the network cap; a copy of the data is given to the controller of another node. Reads are performed locally, gaining high performance. The VM is aware that there are two copies of the data so it can exploit this fact. Once a drive fails, the surviving copy is used to make another copy of the data.\n\n## SDS - Software Defined Storage\nSoftware-defined Storage is a term for computer data storage software for policy-based provisioning and management of data storage independent of the underlying hardware. This type of software includes a form of storage virtualization to separate the storage hardware from the software that manages it.  \nIt's used to build a distributed system that provides storage services.\n\n**object storage** (i.e. S3 by Amazon)  \nWrite, read, rewrite, version and delete an object using HTTP.  
\nAn object has:\n- object ID\n- metadata\n- binary data\n\n\n\n## Non-RAID drive architectures\nOther architectures also exist and are used when RAID is too expensive or not required.\n - JBOD (\"just a bunch of disks\"): multiple hard disk drives operated as individual independent hard disk drives\n - SPAN: a method of combining the free space on multiple hard disk drives from a \"JBOD\" to create a spanned volume\n - DAS (Direct-attached storage): digital storage directly attached to the computer accessing it.\n\n## Some considerations about Flash Drives\nThe bottleneck in new drives is the connector. The SATA connector is too slow to use an SSD at its maximum speed. Some results can be seen [here](http://www.itc.unipi.it/wp-content/uploads/2016/02/ITC-TR-01-16.pdf).\n\nThe solution? Remove the connector and attach the drive to PCIe. So a new specification is used, NVMe, an open logical device interface specification for accessing non-volatile storage media attached via a PCI Express bus.\n\n## Storage in the future\n\n![Memory History](https://img.digitaltrends.com/image/3dxpointslide1-1000x559.jpg)\n\nAs we can see in the image, it has been decades since the last mainstream memory update. In fact, SSDs became popular in recent years because of their cost, but they have existed since 1989. \n\n![3D XPoint Technology](http://cdn.wccftech.com/wp-content/uploads/2015/07/Intel-Micron-3D-XPoint-Memory.jpg)\n\nA new technology, 3D XPoint, was introduced in 2015. Does this improvement take the ICT world into a new phase? Yesterday our problem was disk latency, so we designed all algorithms to reduce I/O operations; now the disk is almost as fast as DRAM, as shown in the following image:\n\n![Disk latency](https://images.anandtech.com/doci/9470/asd14.PNG)\n\nWith NVMe drives we can reach 11 GBps, aka 88 Gbps. 
Software latency is around 5 microseconds and the TCP/IP software stack adds a further 70-80 microseconds, so the disk is no longer the problem.\n\n![RDMA how does it work](https://image.slidesharecdn.com/1mellanox-140331123657-phpapp02/95/infiniband-essentials-every-hpc-expert-must-know-10-638.jpg?cb=1396269459)\n\n# Hypervisors\nA hypervisor is software, firmware or hardware that creates and runs virtual machines. \nIt can be a bare-metal hypervisor or a hosted hypervisor. A bare-metal hypervisor is one where the hypervisor is the OS itself; it often requires certified hardware. VirtualBox is an example of a hosted hypervisor.\n\nA hypervisor makes it possible to overbook physical resources, allocating more resources than actually exist.\n\nIt also creates a virtual switch to distribute networking over all the VMs. \n\n# Servers\nThey are really different from desktops; the only common part is the CPU instruction set.\nFor instance, servers have ECC memory, with an Error Correction Code built in.\n\nRacks are divided into Units: 1 U is the minimal size you can allocate on a rack. Generally a 2 meter rack has 42 Units. \n\n## Form-factors\n\n- 1U Pizza box:  \ntwo sockets (CPU),  \n~10 drives arranged horizontally.  \nIn the bottom part there are 2 power plugs, networking plugs for KVM (configuration console) and a **BMC** (Baseboard Management Controller), which is a stand-alone system talking with the motherboard, used for remote monitoring, shutdown ...  \nThe drives are in the front (up) part, and immediately above them there are the fans and the disk controller. Typically the max number of CPUs is four and they are close to the memory modules.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/pizzabox.png\" width=\"600\"\u003e\n\u003c/p\u003e\n\n- 2U: 2 CPUs, 24 drives arranged vertically.\n- 2U Twin square: 24 drives on the front arranged vertically, 4 servers with 2 CPUs each, they share only the power.\n- 10U Blade server: big chassis, up to 16 servers with 2 CPUs each, simpler cabling, easy management and reduced cost. 
\n- Intel Ruler: up to 1 petabyte, but there is no room for a CPU because it is purely solid-state media. It is possible to design a half-PB ruler with room for CPUs.\n\nServers differ from desktop systems: \n- CPU architecture with non-uniform memory access ([NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access)).\n- [Hyper threading](https://en.wikipedia.org/wiki/Hyper-threading)\n- [Intel UltraPath Interconnect](https://en.wikipedia.org/wiki/Intel_UltraPath_Interconnect)\n- Intra socket connection \n- Intel [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) CPU architecture\n- MCDRAM (Multi-Channel DRAM) with less latency\n\n#### Miscellaneous\nTrade-off in CPU design: high frequency means fewer cores. It all depends on the application running: it may or may not benefit from high frequency (big data systems are more about capacity than latency).\n\nLatency is slightly higher when I access a RAM bank of another socket because I have to ask for it via a bus that interconnects them.\n\nCrossbar interconnection between CPUs (each CPU at the vertex of a square, connected along the edges and the diagonals too) to save 1 hop.\n\n**NUMA** Non-Uniform Memory Access  \nDrop the assumption that all the RAM is equal. NUMA is supported by the most used servers and virtualizers. Create threads and processes that are NUMA aware: split the data in an array so that each thread works on a part of it.\n\n**Inter socket** and **Intra Socket** connection: initially cores used a token ring or two token rings, now they use a mesh. \n\nInside the core there are some functional units like the branch misprediction unit and the FMA (Floating point Multiply Add). Each core has a dedicated cache at L1 and a shared cache at L2.\n\nIf I have two threads, in many cases I can execute 2 instructions at a time (thread overlapping, hyper threading). \n\nMulti Channel DRAM: more bandwidth than DDR.\n\n**SMART technology** in drives: a predictive system in the drive that gives the probability that the drive will fail in the next hours. 
Used by the drive vendor for statistics and usage patterns.\n\n# Cloud\n\nIt is a business model. The cloud is someone else's computer that you can use (by paying) to execute your application with more reliable features than your laptop (i.e. paying to run tests on your app using the cloud infrastructure because you need more resources). A cloud is a collection of network-accessible IT resources.  \nWhen you program for the cloud you don't know where your process will be executed or where your data will be stored.\n- over-provisioning the system\n- renting the over-provisioned resources\n- reallocating resources, VMs\n\n**Private Cloud**: a set of IT resources that are local.\n\nThere is a trade-off between centralization (the bottleneck is the storage) and distribution (the bottleneck is the network).\n\n**SLA** Service Level Agreement: how much do I make users pay?\n![Infrastructure](./assets/cloud-services.png)\n\n\n### Rapid Elasticity\nConsumers can adapt to variations in workload and maintain the required performance levels. This also makes it possible to reduce costs by avoiding over-provisioning.\n\n### High Availability\nThe cloud provides high availability. This feature can be achieved with redundancy of resources to avoid system failure. A load balancer is used to distribute the requests among all the resources, to avoid failures due to resource saturation on some machine.\n\nThe cloud infrastructure can be public, if it is provisioned for open use by the general public; or private, if it is provisioned for exclusive use by a single organization comprising multiple consumers.\n\n## Cloud Computing Layers\nThe cloud infrastructure can be seen as a layered infrastructure. \n\n#### Physical Layer\nExecutes the requests generated by the virtualization and control layers. Specifies the entities that operate at this layer (devices, systems, protocols...)\n\n## Virtual Layer\nDeployed on the physical layer. Abstracts physical resources and makes them appear as virtual resources. 
Executes the requests generated by the control layer. It permits a better use of the hardware when you have services that underuse it.  With VMs there is a performance loss of about 10%, but we gain in flexibility, security ...\n\nThis allows a **multi-tenant environment**, since I can run multiple organizations' VMs on the same server.\n\nThe **hypervisor** is responsible for running multiple VMs. Since I want to execute the x86 ISA on an x86 server I don't need to translate the code. **KVM** kernel, preempting the VOS process.\n- **paravirtualization**: the virtual kernel cooperates with the hosting OS.\n- the CPU is aware of the virtualization: it distinguishes the interrupts generated by the VOS.\n- **driver integration**: you don't have to emulate all the drivers, but you can ask the underlying OS for this service.\n\nEach VM has a **configuration file** containing the values that answer the questions: how much memory, how much disk, where is the disk file, how many CPU cores ...\n\nThe disk is virtualized using a file, while for the network there is a VNIC (virtual Network Interface Card) connected to a VSWITCH, which communicates with the physical NIC. The VNIC is also used by the real OS because its physical NIC is busy acting as the VSWITCH.  \nThe Virtual Disk is a file of fixed size or a dynamically expanding one. The VOS can be shared among the VMs and stored elsewhere than in the vdisk file. Each write goes to the vdisk (so all the write operations can be undone), while each read first looks in the \"file\" where the VOS is, then in the vdisk file if the previous check wasn't successful.  \n\nThe Virtual CPU masks the features of a CPU from a VM. VCPUs can be overbooked, up to twice the number of cores. The CPU has several rings of protection (user ... nested VOS, VOS, OS).\n\n#### About the virtual memory:  \nVirtual memory may not be used as VM RAM, because the sum of the VMs' RAM should be less than or equal to the actual RAM. 
Fragmentation could be a problem if there is a lot of unused reserved memory.\n\n##### Ballooning \nThe VM is told: \"Look, you have 1 TB of RAM but most of it is occupied\". In this way we have dynamically expanding blocks of RAM: if the OS needs memory I can deflate the balloon.\n\n#### Other considerations about the Virtual Layer\nThe persistent state of a VM is made of the **conf file** and the disk file. Moving a VM is really simple: just stop it and move the two files just mentioned.  \n\n##### vMotion -  Live Migration\nMoving a VM from server A to server B while it's running. The user could experience a degradation of the service but not a disruption.\n- copy the RAM and, at the end, copy the pages written during this phase.\n- create an empty drive on B\n- copy the CPU registers (the VM is stopped for a really short period)\n- manage the VSwitch and the ARP protocol. The virtual switch must be aware of the migration: if the old vswitch receives a packet for the just-migrated VM it should send it to B.\n- continue running the VM on B; only when it needs the disk do you stop it and start copying the disk file. Jumbo frames can be used to avoid storage traffic fragmentation.\n\n#### Docker\nIt exploits Linux control groups (cgroups). The processes in the container can see only a part of the OS. The containers have to share the networking. Docker separates different software stacks on a single node.\n\n## Control Layer\nEnables resource configuration and resource pool configuration. Enables resource provisioning. Executes the requests generated by the service layer. It takes physical or virtual resources and puts them in a common domain, allocating existing and new resources.\n\n**OpenStack**  \nGood idea but bad implementation. A collection of open source software components, difficult to deploy, with lots of dead code and a bad security implementation. It has a small form of orchestration but it's not a service orchestrator (i.e. 
no distribution of the workload, scaling)\n\n### Service orchestration Layer\nProvides workflows for executing automated tasks. \n\n## Business Continuity \n#### Backups\nIt's a data protection solution.\n\n**RTO** Recovery Time Objective: the time it will take to have a full recovery.  \n**RPO** Recovery Point Objective: what is the last consistent copy of the storage I will find? How many data points do you have to go back in time?\n\nThe network is the first problem when I want to make a backup, because the backup is large compared to the available network bandwidth.  \nSometimes it's simply impossible to make a backup.\n\n**incremental backup**\nBack up only the updated parts. High RTO because I have to reconstruct the whole file hierarchy by going back through the backups. Sometimes snapshots are needed.\n\n**image level**  \nUses snapshots. It's agentless (agent == client): the agent can't crash since there isn't one.\n\n**backup windows**  \nThe horizon effect: you decide a window but the stuff you need will always be in the deleted part.\n\nsome servers + backup unit  \nsome other servers + some other backup unit\n\nTake the hash of two identical files, store only one of the two files and both the hashes.\n\nThe **replica** is a whole complete copy.  The synchronous replica needs an acknowledgement before proceeding. DBs like Oracle and SQL Server want synchronous replicas. \nWith the **backup** you can choose the chunk of files to \"backup\".\n\n\n### Security\nFirewall, antivirus, standard procedures to direct the safe execution of operations...\n\nThree levels of security:\n- **Procedural**: phishing, the weakest link is the human.\n- **Logical**: abstraction produced by the OS. **mandatory access** (classification of the information); **discretionary access** (~ ACL)\n- **Physical**\n\nAccess Control Lists are difficult to manage with lots of users.  
\n**PAM** (Linux) Pluggable Authentication Modules: few systems use ACLs via PAM.\n\n**auditing**: the activity of checking that system security is working properly. Keep monitoring the interaction of the user with a resource; get an alert when something suspicious occurs.\n\nMINIMUM PRIVILEGE PRINCIPLE: every user must be able to access only the information and resources that are necessary for its legitimate purpose.\n\n**right != privilege**  \nThe first is given to you by someone, the second is possessed by you just because of who you are.  \nIn Windows you (the admin) can take ownership, but you can't give it. No one logs in as **system** (like Linux root but in Windows). The SID in Windows is unique for the entire system. (sysprep, Sysinternals, Process Explorer)\n\n**OAuth** authorization mechanism  \n**OpenID** authentication  \n**RBAC** Role-Based Access Control\n\n**Kerberos**: based on symmetric cryptography. The client first asks the Kerberos KDC for a ticket, then it can access the resource.\n\n**biometric security**: once it gets compromised it can't be restored, because you can't change someone's biometric data.\n\nDisable the possibility of changing the MAC address at the hypervisor level.\n\n#### Firewall\n- **level 3 firewall**: looks at the envelope: source address, port ...\n- **level 7 firewall**: reconstructs the full packet, looking inside its content.\n\n---\n\nShare the identities of the users so as not to replicate them on each server:\n- **LDAP** Lightweight Directory Access Protocol: distributed database organized as a tree where we store the names of the users.\n\n- **Active Directory**: uses a secure protocol to exchange credentials through the network. It's a centralized data structure listing users.\n\n### Service Management\nBe aware of regulations and legal constraints that define how to run a system.\n\nLevel of compliance with the policy. Demonstrate compliance. Is this system behaving according to the regulations?  
\nInformation processors (cloud providers) are responsible for the information they process.\n\n**SLA** Service Level Agreement: legal contract that you sign as a customer with the provider, defining what the user is paying for.  \n**service availability** = 1 - (downtime / agreed service time)  \nThe uptime is difficult to define and to test because the reachability of the cloud also depends on the service providers.\n\nThe lower the resources used, the higher the margin. Low-margin business: very high numbers * low margins = big profits.\n\n**Service Operation** is crucial, it keeps the whole thing running.  \n**Service Level**: not only functional requirements.\n\nEnsure **charge-back** (pay per use) and **show-back** (I exhausted the resources so I need more): make good use of the money spent on hardware and people. Measure how efficient you are in spending money.\n\n**TCO** Total Cost of Ownership: time cost, resources ...  \nReducing risk is a kind of **ROI** Return On Investment.\n\n**CAPEX** CAPital EXpenses: buy something.  \n**OPEX** OPerational EXpenses: use something.\n\n**capacity planning**: make some forecasts to find out when we will exhaust the resources and how many resources we will really need.  \n**monitoring**: collecting data (in a respectful way).\n\nKeep track of things, processes, servers and configurations so that you can roll back.\n\n**Incident/Problem Management**\nIdentify the impact of a failure on all the other services.\n\nOvercommitment of resources can lead to capacity issues.\n\n## GDPR General Data Protection Regulation\nAbout the protection of personal data. What is personal data? E.g. matricola (student ID number), email, phone number... it's everything that uniquely identifies you.\n\nGDPR applies both to digital and non-digital information. 
\n\nIf you, as an individual, are damaged by a bad use of your personal data, you can complain to the data owner and get compensated.\n\n\n## Vendor Lock-in\nThe cloud introduces some problems, and one of them is vendor lock-in. It appears when I write software that uses a vendor API that does not follow any standard. If I want to change the cloud I use, I need to modify the code (good luck!).\n\nEven in Open Source there is vendor lock-in, due to the difficulty of moving from a dependency on one piece of software to another one. To avoid vendor lock-in you should rely on different software and vendors.\n\n### Standardization-Portability\nIt's rare for a leading vendor to define a common standard. Standardization is important but it's not feasible. It partly avoids lock-in. \"The only thing that can be standardized is the VM\". Every platform tends to have its own API. REST is the standard that is working today in the cloud.\n\n\n# Orchestration\n\n2 types of orchestration:\n- low level: e.g. installation of a new VM\n- high level: e.g. configuration of the new VM. 
At the end of this process the VM will be up and running\n\n# Fog Computing\nFog computing is an architecture that uses one or more collaborative end-user clients or near-user edge devices to carry out a substantial amount of storage (rather than stored primarily in cloud data centers), communication (rather than routed over the internet backbone), control, configuration, measurement and management (rather than controlled primarily by network gateways such as those in the LTE core network).\n\n# Miscellaneous\n\n**greenfield installation**: format, configure everything from scratch.\n\n**license**: limit on the number of installations you can have.\n\nIt's acceptable that some users experience performance issues while upgrading.\n\n**Procedures** are really important: knowing the procedure and applying it can avoid the loss of data, users, money.\n\n**NIC teaming**\n\n**Erasure Coding** like RAID 5 (xor)\n\n### Redundancy\n\nTry to have links in rings instead of single lines. \n\nSome services run in multiple **zones**.  \n**Service Availability Zones**: the system is divided into zones; things in the same zone can fail together, but things in different zones can't. Run multiple instances in different zones (i.e. racks). Zones can also describe geographical locations. \n\n**cross connection**: typical pattern for redundancy.\n\n**active-passive**: the second system is off and will come online only in case of failure of the first one.  \n**active-active**: i.e. two aggregated links both working.\n\n**active/passive failure**: when a system fails but the \"passive\" part also fails immediately because I haven't checked it.\n\n**n+1, n+2 schema**: need n components, deploy n+1 \n\n**multipath**: give different addresses to each component.\n\n# In class exercises\n\n## 1 - Discuss the difference between spine and leaf fabric and the more traditional fabric architecture based on larger chassis. 
How are bandwidth and latency affected?\n\n\n## Spine and Leaf\nNon-modular, fixed switches are interconnected with some MLAG (Multi-chassis Link Aggregation). It's a loosely coupled form of aggregation: the two switches are independent and share some form of aggregation. The LACP protocol allows binding multiple links into a single conceptual link (link aggregation, active-active).  \n**over-subscription**: the links to the spine should be able to sustain the traffic coming from all the links below. This is not a problem for east-west traffic between servers attached to the same switch (because the link to the spine is not affected).  \nPros:\n- resilient\n- active-active\n- can be updated while the system is running\n\nIt became popular after 10 Gbps; before it was difficult to use it with 4/8/16 ports per server. Different VLANs are used.\n\n## Traditional Chassis\nTypically two modular chassis connected by two links (STP) in active-passive (the second chassis goes up only when the first isn't working). The ratio between the number of ports and the bandwidth is completely different from spine and leaf. Link aggregation is possible but it's not convenient.  \nPros:\n- room for growth\n- protection of the investment\n- shared power\n- pay only once and just add line cards\n- ~ simpler cabling \n\nToday it is not used so much because it's difficult to design a backplane offering terabits.  \n- **Capex and Opex reasons**: in active-passive I use only half of the bandwidth I'm paying for.  
\n- **latency issues**: with STP, when a link goes down it can take up to several seconds to activate the other link.\n\n---\n\n## 2 - What actions can the orchestration layer of a cloud system take, and based on what information, in order to decide how many web server instances should be used to serve a Web system?\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/ex2.png\" width=\"400\"\u003e\n\u003c/p\u003e\n\n- Assume the DB is distributed and has infinite capacity, because typically the bottleneck is the Web Server (WS).\n\nAn orchestrator can:\n- create a new VM running the WS, which gets a new IP and talks to the **Load Balancer**\n- delete a VM\n- save a VM (freezing it)\n- increase memory\n- etc.\n\nBased on:\n- average response time\n- available memory in the WS\n- latency of web requests (if it goes beyond a threshold, spawn another instance)\n- number of connections (requests)\n- CPU usage\n\n## 3 - Discuss a datacenter architecture made of 10 racks, assuming a power distribution of 15 kW/rack.\nUse an in-row cooling approach, trying to reduce the number of rows to be cooled. Do not forget to mention the PDU and the UPS (2 plugs per rack, 32 A each).\n\n## 4 - A service requires a sustained throughput towards the storage of 15 GB/s. Would you recommend using a SAN architecture or a hyperconvergent one?\n\n- 15 GB/s is roughly the maximum bandwidth of a PCI Express 3.0 bus with 16 lanes.  \n- 100 Gbps is the bandwidth of a single link (even if internally it is 4 * 25 Gbps).  \n- 15 GB/s * 8 = 120 Gbps  \n- PCIe has some overhead, so its usable bandwidth is not the full 15 GB/s.  \n- 15 GB/s = 54 TB/hour = about 1.3 PB/day = about half an exabyte per year. I also have to consider where I am going to store this data, not only the bandwidth.  \n- A mechanical drive has a capacity of 8 to 10 TB.  \n- A SATA SSD has about 500 MB/s of bandwidth.  \n- For 15 GB/s, 4 Fibre Channel links are enough.\n\n\n### SAN - Storage Area Network (recap)\niSCSI (Internet SCSI, i.e. SCSI over IP) allows mounting blocks/disks.  
\nBlock-based access: you mount a chunk of bytes that is seen as a drive.  \nA LUN (Logical Unit Number) can be replicated, compressed and overbooked.  \nServers and drives are separated; drives are pooled together.\n\n- **Capex and Opex reasons**: when I first buy the SAN I need to pay for extra room to grow (Capex cost). The risk is being overtaken by technology changes (not a good investment, bad ROI). Overprovisioning can sometimes be bad: it is a suboptimal allocation of resources.\n\n### NAS\nNAS instead uses a network file system protocol (CIFS, NFS) to access the pooled resources. We access files, not blocks.  \nNAS provides the file system; with SAN I decide the FS.  \nSecurity in a SAN is bound to the compute OS, which decides the authentication domain.\nNAS instead is responsible for security and for the filesystem abstraction (Active Directory and NFS security).\n\nBoth SAN and NAS separate the storage from the compute. Configure the storage once for everything (backup, compression...) and look at it as blocks or files.  \nThis architecture is failing because modern drives are very fast and their throughput saturates the link.\n\n### HCI (hyperconvergent)\nSo far we talked about three independent units: compute, storage and network. With HCI we instead have boxes (servers), each with a little bit of network, drive and compute.\n\nIt is not true that compute and drives are completely unrelated and can be completely separated: CPUs also have their own limits in data processing, even if these are large (risk of wasting resources).\n\nHCI by Nutanix lets you add a bit of storage, a bit of compute and a bit of network simply by buying a server. You pay as you grow.\n\n#### Discussion\nThe choice also depends on the kind of data I expect to process (assume at least one: sensors, bank financial data ...). For example, HCI is not convenient if I want to do archiving, because I pay for extra unused CPU.  
\n\nIt's not enough to say \"I take 5 big drives\", because their bandwidth can be a bottleneck.\n\nSAN could be a good solution because it's cheaper. SAN can be used with **tiering**: in the first layer I keep SSD \"buffers\", in the second layer mechanical drives. If I keep a buffer of 1 TB, I have about a minute (1000 GB / 15 GB/s is about 67 s) to copy the buffered data down to the mechanical drives.\n\n## What should I look for?\n- Capex, Opex\n- Resilience\n- Bandwidth (network, drives)\n- etc.\n\n## 5 - A service requires a sustained throughput towards the storage of 15 GB/s. How would you dimension a hyperconvergent system to ensure it works properly?\n\nLook first at the network (the fabric is the glue of the infrastructure).  \nI can't have 100 Gbps to each server because of spine and leaf.\n\nJust 1 or 2 ports at 100 Gbps are enough to saturate the PCIe bus.\n\nIt is not good to have 100 Gbps for each node, because I would be overloading that single node while HCI is distributed.\n\n400 Gbps links are used for the spine.  \n10 Gbps or 25 Gbps links to the nodes are better, depending on Capex.  \nWith spine and leaf I have 50 Gbps because the links are doubled (active-active).\n\nConsider at least 5 fully used nodes with a 25 Gbps network (120 Gbps / 25 Gbps = 4.8). Since I want some redundancy and efficiency I can use 8 to 10 nodes. I'm overprovisioning, but that is good.\n\nEvery HCI node will have some SSDs (at least 2, writing at 1 GB/s) and some mechanical drives. If I use SATA drives I need at least 6 per node, because the bottleneck is their bandwidth. I can use NVMe drives instead: fewer drives, but I pay more.\n\n- **Consider the SLA**: how much am I going to pay for a missed target or lost data? 
If it's a lot, it's better to overprovision.\n\nRemember that the nominal bandwidth is never fully usable because of some overhead.\n\n\n\n# References\n - https://tools.ietf.org/html/rfc4391\n - https://en.wikipedia.org/wiki/Omni-Path\n - https://en.wikipedia.org/wiki/Remote_direct_memory_access\n - https://www.arubacloud.com/infrastructures/italy-dc-it1.aspx\n - https://en.wikipedia.org/wiki/Software-defined_networking\n - https://en.wikipedia.org/wiki/Software-defined_storage\n - https://en.wikipedia.org/wiki/Software-defined_data_center\n - https://en.wikipedia.org/wiki/Spanning_Tree_Protocol#Rapid_Spanning_Tree_Protocol\n - https://en.wikipedia.org/wiki/Multitier_architecture\n - https://blog.westmonroepartners.com/a-beginners-guide-to-understanding-the-leaf-spine-network-topology/\n - http://searchdatacenter.techtarget.com/definition/Leaf-spine\n - https://en.wikipedia.org/wiki/Network-attached_storage\n - https://en.wikipedia.org/wiki/Non-RAID_drive_architectures\n - https://en.wikipedia.org/wiki/Fog_computing\n - https://www.openfogconsortium.org\n - https://en.wikipedia.org/wiki/Power_usage_effectiveness\n - https://howdoesinternetwork.com/2015/what-is-a-non-blocking-switch\n - https://en.wikipedia.org/wiki/Network_function_virtualization\n - http://www.itc.unipi.it/index.php/2016/02/23/comparison-of-solid-state-drives-ssds-on-different-bus-interfaces/\n - http://www.itc.unipi.it/wp-content/uploads/2016/05/ITC-TR-02-16.pdf\n - https://www.nutanix.com/hyperconverged-infrastructure/","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falessandro308%2Fict-infrastructure","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falessandro308%2Fict-infrastructure","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falessandro308%2Fict-infrastructure/lists"}