Tuesday, May 14, 2013

Creating Solaris(11.1) zones on the Exalogic Shared Storage

Introduction

Until recently, running Solaris zones on the older Exalogic release (Solaris 11 Express) was possible but carried a significant limitation: for a supported configuration the zone had to be located on the local SSD drive of the Exalogic compute node.  Because these disks are small, this effectively limited the number and size of zones that could be created on each compute node.  With the recent release of Exalogic support for Solaris 11, and some further development from the Engineering teams, it is now possible to run Solaris zones on the ZFS appliance using the iSCSI protocol.

Prerequisites

To get this working on an Exalogic you should image the rack to Solaris (Exalogic 2.0.4.0.0), upgrade the rack to the latest patch set (April 2013 PSU - My Oracle Support ID=1545364.1) and apply a specific patch (My Oracle Support ID=16514816) for "Zones on Shared Storage (ZOSS) over iSCSI".

Creating the LUNs on the ZFS Appliance

The first activity is to create the iSCSI targets, initiators and their groups on the ZFS appliance so that the LUNs that will host the zones can be created.  This is a fairly simple process that involves setting up a SAN (Storage Area Network) with iSCSI targets and initiators, which are then linked to the LUN storage that is made available to the compute nodes.

We will start by explaining the terminology for the components we need to set up:-

| Term | Description |
| --- | --- |
| Logical Unit | A term used to describe a component in a storage system. Each one is uniquely numbered, giving what is referred to as a Logical Unit Number, or LUN.  The ZFS Appliance may contain many LUNs. These LUNs, when associated with one or more SCSI targets, form a unique SCSI device that can be accessed by one or more SCSI initiators. |
| Target | An end-point that provides a service of processing SCSI commands and I/O requests from an initiator.  A target, once configured, consists of zero or more logical units. |
| Target Group | A set of targets. LUNs are exported over all the targets in one specific target group. |
| Initiator | An application or production system end-point that is capable of initiating a SCSI session, sending SCSI commands and I/O requests. Initiators are also identified by unique addressing methods. |
| Initiator Group | A set of initiators. When an initiator group is associated with a LUN, only initiators from that group may access the LUN. |

1. Create iSCSI Targets

To set things up on the ZFS appliance, navigate to Configuration --> SAN and select iSCSI Targets, then click on the + sign beside the iSCSI Targets title to add a target.  Once added, the target can be dragged and dropped to the right of the screen into an iSCSI Target Group, either adding it to an existing group or creating a new one.  (To drag and drop, hover the mouse over the target until a crossed pair of arrows appears, then click on the arrows to pick up the target and drag it over to the groups.)

Setting up iSCSI Targets on the ZFS Storage Appliance BUI

2. Setup iSCSI Initiators

The setup for the iSCSI initiators and groups is similar to the setup of the targets: click on the + symbol for the iSCSI Initiators, fill in the details, then drag and drop the initiator over to the initiator groups, either creating a new group or adding it to an existing one.  The only significant complication is that creating an iSCSI Initiator involves specifying an Initiator IQN.  This is a unique name that identifies a specific host (the compute node that will mount a LUN).  To find it, log on to each compute node in the Exalogic rack and run the iscsiadm list initiator-node command.



# iscsiadm list initiator-node
Initiator node name: iqn.1986-03.com.sun:01:e00000000000.51891a8b
Initiator node alias: el01cn01
        Login Parameters (Default/Configured):
                Header Digest: NONE/-
                Data Digest: NONE/-
                Max Connections: 65535/-
        Authentication Type: NONE
        RADIUS Server: NONE
        RADIUS Access: disabled
        Tunable Parameters (Default/Configured):
                Session Login Response Time: 60/-
                Maximum Connection Retry Time: 180/-
                Login Retry Time Interval: 60/-
        Configured Sessions: 1

So in the example above the Initiator IQN is:-

 iqn.1986-03.com.sun:01:e00000000000.51891a8b

This is reflected in the ZFS BUI as shown for the first compute node in the list on the left.

ZFS Appliance iSCSI Initiators added and included in a group.
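
With many compute nodes in a rack, copying each IQN by hand is error prone.  The following is a minimal Python sketch, not part of the original procedure, that pulls the IQN and alias out of saved iscsiadm list initiator-node output.  The function name extract_initiator and the idea of saving the output to text first are assumptions made for illustration; the sample text is the output shown above.

```python
import re

# Sample text copied from the iscsiadm output shown in this post.
SAMPLE = """\
Initiator node name: iqn.1986-03.com.sun:01:e00000000000.51891a8b
Initiator node alias: el01cn01
        Authentication Type: NONE
"""


def extract_initiator(output):
    """Return {'name': <IQN>, 'alias': <alias>} from the command output.

    Hypothetical helper: it only looks for the two 'Initiator node' lines
    and ignores everything else.
    """
    fields = {}
    for line in output.splitlines():
        match = re.match(r"\s*Initiator node (name|alias):\s*(\S+)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    info = extract_initiator(SAMPLE)
    print(info["alias"], info["name"])
```

The alias is kept alongside the IQN so that each initiator entered in the BUI can be matched back to its compute node.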

3. Create Storage Project and LUNS

The final step on the storage side is to create your project and LUNs.  The process to create the project and shares (LUNs in this case) is similar to the process for creating filesystems for use via NFS, as described in an earlier blog posting, except that you create a LUN rather than a filesystem.

Creating a LUN on the ZFS Storage Appliance
 
The LUN will now be available to be mounted on any of the compute nodes that are part of the Initiator Group.

Creating the Solaris Zone on the Shared Storage

We now have the storage prepared so that it can be mounted on the compute nodes.  Our intention is to store the zone on the shared storage and to set up an additional bonded network, on the 10GbE Exalogic client network, to which the zone will have exclusive access.

1. Ensure the disk (LUN) is visible to the node and ready for use.

The first step is to ensure that the LUN created on the storage appliance is visible to the compute node and that the disk is formatted ready for use.  Before checking for the disk it may be necessary to run the iscsiadm commands that register the shared storage as a discovery address, so that its LUNs are presented to the node.  This should only need to be done once on each compute node, but we have found that when all zones are removed from a node it is necessary to re-run the add discovery-address command to make the LUNs visible again.

# iscsiadm add discovery-address <IP of ZFSSA>
# iscsiadm modify discovery -t enable
# devfsadm -c iscsi
# echo | format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t600144F09C96CCA90000518CDEB10005d0 <SUN-ZFS Storage 7320-1.0-64.00GB>
          /scsi_vhci/disk@g600144f09c96cca90000518cdeb10005
       1. c0t600144F09C96CCA90000518CDF100006d0 <SUN-ZFS Storage 7320-1.0-64.00GB>
          /scsi_vhci/disk@g600144f09c96cca90000518cdf100006
       2. c0t600144F09C96CCA90000518CDFB60007d0 <SUN-ZFS Storage 7320-1.0-64.00GB>
          /scsi_vhci/disk@g600144f09c96cca90000518cdfb60007
       3. c0t600144F09C96CCA900005190BFC4000Ad0 <SUN-ZFS Storage 7320-1.0 cyl 8352 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f09c96cca900005190bfc4000a

       4. c7t0d0 <LSI-MR9261-8i-2.12-28.87GB>
          /pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@0,0
Specify disk (enter its number): Specify disk (enter its number):

Identifying the LUN on the compute node

The format command lists the LUN, which is presented to the compute node as a local disk.  The value after /scsi_vhci/disk@g in the device path is the GUID of the LUN that was created.  This identifies c0t600144F09C96CCA900005190BFC4000Ad0 as the disk to be formatted and labelled.
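
The GUID-to-disk mapping described above can also be done mechanically.  The following Python sketch searches saved format output for a given LUN GUID and returns the matching disk name.  The function name disk_for_guid is hypothetical, and the sketch assumes the two-line layout (disk name line followed by device path line) seen in the listing above.

```python
import re

# Two entries copied from the format listing shown in this post.
SAMPLE = """\
       3. c0t600144F09C96CCA900005190BFC4000Ad0 <SUN-ZFS Storage 7320-1.0 cyl 8352 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f09c96cca900005190bfc4000a
       4. c7t0d0 <LSI-MR9261-8i-2.12-28.87GB>
          /pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@0,0
"""


def disk_for_guid(format_output, guid):
    """Return the disk name whose device path ends in the given GUID, or None.

    Hypothetical helper; the comparison ignores case because the disk name
    uses upper-case hex while the device path uses lower-case hex.
    """
    wanted = guid.lower()
    current_disk = None
    for line in format_output.splitlines():
        entry = re.match(r"\s*\d+\.\s+(\S+)\s+<", line)
        if entry:
            current_disk = entry.group(1)
            continue
        path = re.search(r"/scsi_vhci/disk@g([0-9a-fA-F]+)", line)
        if path and path.group(1).lower() == wanted:
            return current_disk
    return None


if __name__ == "__main__":
    print(disk_for_guid(SAMPLE, "600144F09C96CCA900005190BFC4000A"))
```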

# format -e c0t600144F09C96CCA900005190BFC4000Ad0
selecting c0t600144F09C96CCA900005190BFC4000Ad0
[disk formatted]

FORMAT MENU:
...
format> fdisk
No fdisk table exists. The default partition for the disk is:

  a 100% "SOLARIS System" partition

Type "y" to accept the default partition,  otherwise type "n" to edit the
partition table. n
SELECT ONE OF THE FOLLOWING:
...
Enter Selection: 1
Select the partition type to create:
   1=SOLARIS2   2=UNIX      3=PCIXOS     4=Other        5=DOS12
   6=DOS16      7=DOSEXT    8=DOSBIG     9=DOS16LBA     A=x86 Boot
   B=Diagnostic C=FAT32     D=FAT32LBA   E=DOSEXTLBA    F=EFI (Protective)
   G=EFI_SYS    0=Exit? f

SELECT ONE...
...
6

format> label
[0] SMI Label
[1] EFI Label
Specify Label type[1]: 1

Ready to label disk, continue? y

format> quit

We can now see that the format command shows the disk as available and now sized as per the LUN size.

# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t600144F09C96CCA90000518CDEB10005d0 <SUN-ZFS Storage 7320-1.0-64.00GB>
          /scsi_vhci/disk@g600144f09c96cca90000518cdeb10005
       1. c0t600144F09C96CCA90000518CDF100006d0 <SUN-ZFS Storage 7320-1.0-64.00GB>
          /scsi_vhci/disk@g600144f09c96cca90000518cdf100006
       2. c0t600144F09C96CCA90000518CDFB60007d0 <SUN-ZFS Storage 7320-1.0-64.00GB>
          /scsi_vhci/disk@g600144f09c96cca90000518cdfb60007
       3. c0t600144F09C96CCA900005190BFC4000Ad0 <SUN-ZFS Storage 7320-1.0-64.00GB>
          /scsi_vhci/disk@g600144f09c96cca900005190bfc4000a

       4. c7t0d0 <LSI-MR9261-8i-2.12-28.87GB>
          /pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@0,0
Specify disk (enter its number):

2. Setup the Networking for Client access.  (10GbE network.)

The zone that is being set up will be given an exclusive IP network.  This means that we need to create the appropriate VNICs in the global zone and hand control of these VNICs over to the zone to manage.  An earlier blog posting discusses setting up the 10GbE network for Solaris running on an Exalogic, and this section builds on that knowledge.

All we need to do in the global zone is create the VNICs.  To do this, first identify the physical links that relate to the Ethernet over Infiniband devices which the switches present to the Infiniband Host Channel Adapter and hence, as devices, to the OS.  Then use the two links (one for each physical port) to create the VNICs.

# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net6              Infiniband           up         32000  unknown   ibp1
net0              Ethernet             up         1000   full      igb0
net1              Ethernet             unknown    0      unknown   igb1
net3              Ethernet             unknown    0      unknown   igb3
net4              Ethernet             up         10     full      usbecm0
net8              Ethernet             up         10000  full      eoib1
net2              Ethernet             unknown    0      unknown   igb2
net5              Infiniband           up         32000  unknown   ibp0
net9              Ethernet             up         10000  full      eoib0


Create one VNIC on each link, in this case net8 and net9 from the output above:

# dladm create-vnic -l net8 -v 1706 vnic2_1706
# dladm create-vnic -l net9 -v 1706 vnic3_1706

3. Create the Zone

We now have the prerequisites necessary to create our zone.  (A fairly simple example.)  Namely, the storage available via iSCSI and the VNICs we will hand in to the zone to use.

# zonecfg -z zone04
Use 'create' to begin configuring a new zone.
zonecfg:zone04> create
create: Using system default template 'SYSdefault'
zonecfg:zone04> set zonepath=/zones/zone04
zonecfg:zone04> add rootzpool
zonecfg:zone04:rootzpool> add storage iscsi://192.168.14.133/luname.naa.600144f09c96cca900005190bfc4000a
zonecfg:zone04:rootzpool> end
zonecfg:zone04> remove anet
zonecfg:zone04> add net
zonecfg:zone04:net> set physical=vnic2_1706
zonecfg:zone04:net> end
zonecfg:zone04> add net
zonecfg:zone04:net> set physical=vnic3_1706
zonecfg:zone04:net> end

zonecfg:zone04> verify
zonecfg:zone04> commit
zonecfg:zone04> info
zonename: zone04
zonepath: /zones/zone04
brand: solaris
autoboot: false
bootargs:
file-mac-profile:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs-allowed:
net:
    address not specified
    allowed-address not specified
    configure-allowed-address: true
    physical: vnic2_1706
    defrouter not specified
net:
    address not specified
    allowed-address not specified
    configure-allowed-address: true
    physical: vnic3_1706
    defrouter not specified
rootzpool:
    storage: iscsi://192.168.14.133/luname.naa.600144f09c96cca900005190bfc4000a
zonecfg:zone04>

During this configuration process we use the default zone creation template, which includes a network for the net0 link (1GbE management).  We do not need this in our zone so we remove it as part of the configuration.  The storage is defined using the URI for the LUN, which is the LUN GUID prefixed by iscsi://<IP Address of the Shared Storage>/luname.naa.

The next step is to install the zone and boot it up.  Before attempting this, ensure that you have a valid repository for the Solaris installation set up on the global zone.  The zone installation will use this repository to lay down the OS files for the zone.
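
As an illustration of how the storage URI is assembled, here is a small Python sketch that builds it from the appliance IP address and the LUN GUID.  The function name zone_storage_uri is hypothetical, and lower-casing the GUID is an assumption made only to match the form used in the zonecfg session above.

```python
def zone_storage_uri(appliance_ip, lun_guid):
    """Build the iscsi:// storage URI used in the rootzpool resource.

    Hypothetical helper: it validates that the GUID is hexadecimal and
    lower-cases it to match the form shown in the zonecfg session.
    """
    guid = lun_guid.strip().lower()
    if not guid or any(ch not in "0123456789abcdef" for ch in guid):
        raise ValueError("LUN GUID must be a hexadecimal string: %r" % lun_guid)
    return "iscsi://%s/luname.naa.%s" % (appliance_ip, guid)


if __name__ == "__main__":
    # GUID as it appears in the disk name reported by format.
    print(zone_storage_uri("192.168.14.133", "600144F09C96CCA900005190BFC4000A"))
```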


# zoneadm -z zone04 install
Configured zone storage resource(s) from:
    iscsi://192.168.14.133/luname.naa.600144f09c96cca900005190bfc4000a
Created zone zpool: zone04_rpool
Progress being logged to /var/log/zones/zoneadm.20130513T104657Z.zone04.install
       Image: Preparing at /zones/zone04/root.

 AI Manifest: /tmp/manifest.xml.lPaGVo
  SC Profile: /usr/share/auto_install/sc_profiles/enable_sci.xml
    Zonename: zone04
Installation: Starting ...

              Creating IPS image
Startup linked: 1/1 done
              Installing packages from:
                  exa-family
                      origin:  http://localhost:1008/exa-family/acbd22da328c302a86fb9f23d43f5d10f13cf5a6/
                  solaris
                      origin:  http://install1/release/solaris/
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            185/185   34345/34345  229.7/229.7 10.6M/s

PHASE                                          ITEMS
Installing new actions                   48269/48269
Updating package state database                 Done
Updating image state                            Done
Creating fast lookup database                   Done
Installation: Succeeded

        Note: Man pages can be obtained by installing pkg:/system/manual

 done.

        Done: Installation completed in 81.509 seconds.


  Next Steps: Boot the zone, then log into the zone console (zlogin -C)

              to complete the configuration process.

Log saved in non-global zone as /zones/zone04/root/var/log/zones/zoneadm.20130513T104657Z.zone04.install


# zoneadm -z zone04 boot

The zone should boot up very quickly, after which you can zlogin to the zone to set up the networking.  This involves using the VNICs given to the zone for exclusive control to create interfaces, bonding them together using the Solaris IPMP functionality and allocating an IP address.  We found that we also had to add a default route to the routing table.

# zlogin zone04
[Connected to zone 'zone04' pts/7]
Oracle Corporation    SunOS 5.11    11.1    December 2012

root@zone04:~# dladm show-vnic
LINK                OVER         SPEED  MACADDRESS        MACADDRTYPE       VID
vnic2_1706          ?            10000  2:8:20:f5:83:fa   random            1706
vnic3_1706          ?            10000  2:8:20:fa:ab:98   random            1706
root@zone04:~# ipadm create-ip vnic2_1706
root@zone04:~# ipadm create-ip vnic3_1706
root@zone04:~# ipadm create-ipmp bond1
root@zone04:~# ipadm add-ipmp -i vnic2_1706 -i vnic3_1706 bond1
root@zone04:~# ipadm set-ifprop -p standby=on -m ip vnic3_1706
root@zone04:~# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
vnic2_1706 ip       ok       yes    --
vnic3_1706 ip       ok       no     --
bond1      ipmp     down     no     vnic2_1706 vnic3_1706
root@zone04:~# ipadm create-addr -T static -a local=138.3.51.2/22 bond1/v4
root@zone04:~# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
vnic2_1706 ip       ok       yes    --
vnic3_1706 ip       ok       no     --
bond1      ipmp     ok       yes    vnic2_1706 vnic3_1706
root@zone04:~# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
bond1/v4          static   ok           138.3.51.2/22
lo0/v6            static   ok           ::1/128
root@zone04:~# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
127.0.0.1            127.0.0.1            UH        2          0 lo0      
138.3.48.0           138.3.51.2           U         2          0 bond1    

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If  
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2       0 lo0  
root@zone04:~# route -p add default 138.3.48.1
add net default: gateway 138.3.48.1
add persistent net default: gateway 138.3.48.1
root@zone04:~# netstat -rn

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              138.3.48.1           UG        1          0          
127.0.0.1            127.0.0.1            UH        2          0 lo0      
138.3.48.0           138.3.51.2           U         2          0 bond1    

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If  
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2       0 lo0  
root@zone04:~#
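
The default route only works if the gateway is on the same subnet as the address given to bond1.  As a quick sanity check, this Python sketch uses the standard ipaddress module to confirm that the address and gateway from the session above share a network.  The function name gateway_on_link is hypothetical.

```python
import ipaddress


def gateway_on_link(address_cidr, gateway):
    """Return True if the gateway lies inside the interface's subnet."""
    interface = ipaddress.ip_interface(address_cidr)
    return ipaddress.ip_address(gateway) in interface.network


if __name__ == "__main__":
    interface = ipaddress.ip_interface("138.3.51.2/22")
    # The network matches the 138.3.48.0 destination shown by netstat -rn.
    print(interface.network)
    print(gateway_on_link("138.3.51.2/22", "138.3.48.1"))
```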

Migrating the Zone from one host to another.

As a final activity we went through the process of moving the zone from one physical host to another to see how simple it is.  The approach worked smoothly and allowed the zone to be moved in a matter of minutes, although the zone did have to be shut down during the process.  (i.e. if you need 100% service availability then make sure you use a clustered software solution that provides continuous availability.)

Firstly, on the compute node that originally hosts the zone, shut down and detach the zone, then export the configuration.  We exported it to a filesystem on the ZFS storage that was mounted on both the original and target hosts (/u01/common/general).  (Alternatively the export could simply be scp'd between the nodes.)


# zoneadm -z zone04 shutdown
# zoneadm -z zone04 detach
zoneadm: zone 'zone04': warning(s) occured during processing URI: 'iscsi://192.168.14.133/luname.naa.600144f09c96cca900005190bfc4000a'
Could not remove one or more iSCSI discovery addresses because logical unit is in use
Exported zone zpool: zone04_rpool
Unconfigured zone storage resource(s) from:
        iscsi://192.168.14.133/luname.naa.600144f09c96cca900005190bfc4000a


# mkdir -p /u01/common/general/zone04
# zonecfg -z zone04 export > /u01/common/general/zone04/zone04.cfg

Then on the new zone host we import the zone from the export created on the original host, attach the zone and boot it up.

# zonecfg -z zone04 -f /u01/common/general/zone04/zone04.cfg
# zoneadm -z zone04 attach
Configured zone storage resource(s) from:
    iscsi://192.168.14.133/luname.naa.600144f09c96cca900005190bfc4000a
Imported zone zpool: zone04_rpool
Progress being logged to /var/log/zones/zoneadm.20130513T135704Z.zone04.attach
    Installing: Using existing zone boot environment
      Zone BE root dataset: zone04_rpool/rpool/ROOT/solaris
                     Cache: Using /var/pkg/publisher.
  Updating non-global zone: Linking to image /.
Processing linked: 1/1 done
  Updating non-global zone: Auditing packages.
No updates necessary for this image.

  Updating non-global zone: Zone updated.
                    Result: Attach Succeeded.
Log saved in non-global zone as /zones/zone04/root/var/log/zones/zoneadm.20130513T135704Z.zone04.attach

# zoneadm -z zone04 boot

The only issue that we identified was that detaching and attaching caused the zone to boot up with the system configuration wizard running.  The wizard must be completed before the zone can boot fully; log on to the console (# zlogin -C zone04) to complete it.

Friday, May 10, 2013

10GbE connections with Exalogic running Solaris 11.1

Networking has changed significantly from Solaris 10 to Solaris 11, and full details of how to configure Solaris 11 networking can be found in the documentation.  This is a short "how to" posting about setting up a 10GbE client connection on an Exalogic rack running Solaris.

Infiniband Switch Configuration

Firstly there are some changes needed to the Infiniband switch.  It is necessary to run the allowhostconfig command because with Solaris 11 some of the VNIC configuration and VLAN setup is done on the compute node and pushed out to the Infiniband switches; running allowhostconfig sets the switch to permit this.  Then create a VLAN with ID -1 on each of the connectors to the 10GbE network, and repeat the process to create the VLAN you want to use (VLAN 1706 in our example).  Finally create the VNICs on the IB switch as described in the Exalogic documentation, ensuring that the VNICs created are for VLAN 0.

#
# allowhostconfig
# createvlan 1A-ETH-1 -vlan -1 -pkey ffff
# showvlan
  Connector/LAG  VLN   PKEY
  -------------  ---   ----
  1A-ETH-1        0    ffff
  1A-ETH-1        1706 ffff
# createvnic ......
#

Go through the process for identifying the GUIDs etc. on the host and defining a MAC address to create the vnic.  (Use ibstat on the host to identify the Infiniband GUID.)

Once all the VNICs for each compute node are created on the switch it will look something like this:-


# showvnics
ID  STATE     FLG IOA_GUID                NODE                             IID  MAC               VLN PKEY   GW
--- --------  --- ----------------------- -------------------------------- ---- ----------------- --- ----   --------
  1 WAIT-IOA    N 0021280001A18A10         EL-C  192.168.14.125            0000 A0:8A:10:50:00:25 NO  ffff   1A-ETH-1
  7 WAIT-IOA    N 0021280001CED42F         EL-C  192.168.14.123            0000 A0:D4:2F:50:00:23 NO  ffff   1A-ETH-1
 10 WAIT-IOA    N 0021280001CEC533         EL-C  192.168.14.119            0000 A0:C5:33:50:00:19 NO  ffff   1A-ETH-1
  0 WAIT-IOA    N 0021280001CEC644         EL-C  192.168.14.124            0000 A0:C6:44:50:00:24 NO  ffff   1A-ETH-1
  2 WAIT-IOA    N 0021280001CED348         EL-C  192.168.14.128            0000 A0:D3:48:50:00:28 NO  ffff   1A-ETH-1
  3 WAIT-IOA    N 0021280001CED44C         EL-C  192.168.14.127            0000 A0:D4:4C:50:00:27 NO  ffff   1A-ETH-1
 11 WAIT-IOA    N 0021280001CED45B         EL-C  192.168.14.120            0000 A0:D4:5B:50:00:20 NO  ffff   1A-ETH-1
  6 WAIT-IOA    N 0021280001CED368         EL-C  192.168.14.126            0000 A0:D3:68:50:00:26 NO  ffff   1A-ETH-1
  9 WAIT-IOA    N 0021280001CED373         EL-C  192.168.14.122            0000 A0:D3:73:50:00:22 NO  ffff   1A-ETH-1
  5 UP          N 0021280001CED37C         EL-C  192.168.14.129            0040 A0:D3:7C:50:00:29 NO  ffff   1A-ETH-1
  4 UP          N 0021280001CED384         EL-C  192.168.14.130            0040 A0:D3:84:50:00:30 NO  ffff   1A-ETH-1
 12 WAIT-IOA    N 0021280001CEC99B         EL-C  192.168.14.117            0000 A0:C9:9B:50:00:17 NO  ffff   1A-ETH-1
  8 WAIT-IOA    N 0021280001CEC6A7         EL-C  192.168.14.121            0000 A0:C6:A7:50:00:21 NO  ffff   1A-ETH-1
 13 WAIT-IOA    N 0021280001CED3EF         EL-C  192.168.14.118            0000 A0:D3:EF:50:00:18 NO  ffff   1A-ETH-1

Expect to see them in the WAIT-IOA state; they will not change to UP until the corresponding create-ip is run against the links on the Solaris hosts.
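
To see at a glance how many VNICs are still waiting for their host, the showvnics listing can be summarised.  This Python sketch counts the STATE column of saved showvnics output; the function name vnic_states is hypothetical and the sketch assumes the column layout shown above, with data rows starting with a numeric ID.

```python
from collections import Counter

# Three rows copied from the showvnics listing shown in this post.
SAMPLE = """\
ID  STATE     FLG IOA_GUID                NODE                             IID  MAC               VLN PKEY   GW
--- --------  --- ----------------------- -------------------------------- ---- ----------------- --- ----   --------
  1 WAIT-IOA    N 0021280001A18A10         EL-C  192.168.14.125            0000 A0:8A:10:50:00:25 NO  ffff   1A-ETH-1
  7 WAIT-IOA    N 0021280001CED42F         EL-C  192.168.14.123            0000 A0:D4:2F:50:00:23 NO  ffff   1A-ETH-1
  5 UP          N 0021280001CED37C         EL-C  192.168.14.129            0040 A0:D3:7C:50:00:29 NO  ffff   1A-ETH-1
"""


def vnic_states(output):
    """Count VNICs per state, skipping the header, separator and blank lines."""
    counts = Counter()
    for line in output.splitlines():
        columns = line.split()
        # Data rows start with a numeric ID; the state is the second column.
        if len(columns) >= 2 and columns[0].isdigit():
            counts[columns[1]] += 1
    return counts


if __name__ == "__main__":
    for state, count in sorted(vnic_states(SAMPLE).items()):
        print(state, count)
```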

Exalogic Solaris Host Configuration


The Exalogic I used to discover this information was set up for manual network configuration, which is how I would expect it always to be, but you can check the setup using the netadm list command.  Assuming manual configuration, the first step is to remove any of the hostname files in /etc that relate to the bond you wish to create.  (The bond1 hostname files were created during the running of the ECU for the 2.0.4.0.0 Exalogic release.)  Then reboot the node.


# rm /etc/hostname.bond1
# rm /etc/hostname.eoib0
# rm /etc/hostname.eoib1
# reboot

Now we need to use the dladm and ipadm commands which allow manipulation of the Solaris networking.
dladm - Data Link administration, ipadm - IP administration.

Firstly identify the data links that relate to the VNICs you created on the Infiniband switch, using the dladm show-phys command.  These are the links whose devices are named eoibN.  In the output below they map to the links net8 and net9.  (On most of the nodes these were net7 and net8.)  You can also use the -m option to display the MAC addresses, which will correspond to the MAC addresses used on each Infiniband switch for that VNIC.


# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net6              Infiniband           up         32000  unknown   ibp1
net0              Ethernet             up         1000   full      igb0
net1              Ethernet             unknown    0      unknown   igb1
net3              Ethernet             unknown    0      unknown   igb3
net4              Ethernet             up         10     full      usbecm0
net8              Ethernet             up         10000  full      eoib1
net2              Ethernet             unknown    0      unknown   igb2
net5              Infiniband           up         32000  unknown   ibp0
net9              Ethernet             up         10000  full      eoib0
root@el2bcn01:~# dladm show-phys -m
LINK                SLOT     ADDRESS            INUSE CLIENT
net6                primary  unknown            no   --
net0                primary  0:21:28:d7:e9:44   yes  net0
net1                primary  0:21:28:d7:e9:45   no   --
net3                primary  0:21:28:d7:e9:47   no   --
net4                primary  2:21:28:57:47:17   yes  usbecm0
net8                primary  a0:c5:80:50:0:1    no   --
net2                primary  0:21:28:d7:e9:46   no   --
net5                primary  unknown            no   --
net9                primary  a0:c5:7f:50:0:1    no   --

Now use ipadm to create the IPMP group and the IP interfaces for the two links you have identified.  The link names are those identified from the dladm show-phys command.

# ipadm create-ipmp bond1
# ipadm create-ip net8
# ipadm create-ip net9

Now create the VNICs on the links identified earlier; in our case they use a tagged VLAN.  The dladm create-vnic command ties together the physical link and a VLAN and gives the resulting interface a name.  The dladm show-vnic command displays the VNICs that have been created.

Now create the IP interfaces for these VNICs using the ipadm create-ip command again, then set the standby property on one of them to make it the standby for the bonded interface.  (In Solaris speak IPMP ~= Linux bond.)  Finally add the two interfaces to the IPMP group we created earlier.

# dladm create-vnic -l net9 -v 1706 eoib0_1706
# dladm create-vnic -l net8 -v 1706 eoib1_1706
# dladm show-vnic
LINK                OVER         SPEED  MACADDRESS        MACADDRTYPE       VID
eoib1_1706          net8         10000  2:8:20:3:df:5f    random            1706
eoib0_1706          net9         10000  2:8:20:c9:8b:b2   random            1706
# ipadm create-ip eoib0_1706
# ipadm create-ip eoib1_1706
# ipadm set-ifprop -p standby=on -m ip eoib1_1706
# ipadm add-ipmp -i eoib0_1706 -i eoib1_1706 bond1

Then give the bonded interface an IP address.


# ipadm create-addr -T static -a local=10.100.44.68/22 bond1/v4
# ipadm show-if
IFNAME     CLASS    STATE    ACTIVE OVER
lo0        loopback ok       yes    --
net0       ip       ok       yes    --
net4       ip       ok       yes    --
net8       ip       down     no     --
net9       ip       down     no     --
bond0_0    ip       ok       yes    --
bond0_1    ip       ok       no     --
bond0      ipmp     ok       yes    bond0_0 bond0_1
bond1      ipmp     ok       yes    eoib1_1706 eoib0_1706
eoib1_1706 ip       ok       no     --
eoib0_1706 ip       ok       yes    --


# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
net0/v4           static   ok           138.3.2.87/21
net4/v4           static   ok           169.254.182.77/24
bond0/v4          static   ok           192.168.14.101/24
bond1/v4          static   ok           138.3.48.35/22
bond1/v4a         static   ok           138.3.51.1/22
lo0/v6            static   ok           ::1/128
net0/v6           addrconf ok           fe80::221:28ff:fed7:e944/10

Your Solaris 10GbE network connection using a tagged VLAN should now be up and running.  Looking at the VNICs on the Infiniband switch we can see that an additional VNIC now appears, with a MAC address matching that of the active underlying interface of bond1.  You can further check the Infiniband IOA_GUID against the host channel adapter in Solaris by using either the dladm show-ib command or ibstat to output the GUIDs.

# showvnics | grep -e STATE -e ----- -e 101
ID  STATE     FLG IOA_GUID                NODE                             IID  MAC               VLN PKEY   GW
--- --------  --- ----------------------- -------------------------------- ---- ----------------- --- ----   --------
 18 UP          N 0021280001CEC57F         EL-C  192.168.14.101            0000 A0:C5:7F:50:00:01 NO  ffff   1A-ETH-1
 35 UP          H 0021280001CEC57F         EL-C  192.168.14.101            8001 02:08:20:C9:8B:B2 1706 ffff   1A-ETH-1


And on the compute node:

# ifconfig eoib0_1706
eoib0_1706: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 12
        inet 0.0.0.0 netmask ff000000
        groupname bond1
        ether 2:8:20:c9:8b:b2
root@el2bcn01:~# ifconfig eoib1_1706
eoib1_1706: flags=261000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY,INACTIVE,CoS> mtu 1500 index 11
        inet 0.0.0.0 netmask ff000000
        groupname bond1
        ether 2:8:20:3:df:5f
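
Note that Solaris prints MAC addresses without leading zeros and in lower case (2:8:20:c9:8b:b2) while the switch prints them zero-padded in upper case (02:08:20:C9:8B:B2), so a plain text comparison will not match.  This Python sketch normalises both forms before comparing; the function name normalize_mac is hypothetical.

```python
def normalize_mac(mac):
    """Return a MAC address as six zero-padded, lower-case hex octets."""
    octets = mac.strip().split(":")
    if len(octets) != 6:
        raise ValueError("expected six colon-separated octets: %r" % mac)
    return ":".join("%02x" % int(octet, 16) for octet in octets)


if __name__ == "__main__":
    solaris_mac = "2:8:20:c9:8b:b2"    # from ifconfig eoib0_1706
    switch_mac = "02:08:20:C9:8B:B2"   # from showvnics on the switch
    print(normalize_mac(solaris_mac) == normalize_mac(switch_mac))
```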


Note - Under Solaris 11 the VNICs we create are dynamic and appear on the Infiniband switches when the host OS starts up the VNIC.  Because we configure things with an IPMP group the VNIC is only reported on the actively used switch.  Unlike in a physical Linux environment, these VNICs do not appear in the /conf/bx.conf file on the switch.

So in summary the commands to use on the Solaris 11.1 compute node are:-

# ipadm create-ipmp <BOND NAME>
# ipadm create-ip <link name of eoib0>
# ipadm create-ip <link name of eoib1>
# dladm create-vnic -l <link name of eoib0> -v <VLAN ID> <IF 1 NAME>
# dladm create-vnic -l <link name of eoib1> -v <VLAN ID> <IF 2 NAME>
# ipadm create-ip <IF 1 NAME>
# ipadm create-ip <IF 2 NAME>
# ipadm set-ifprop -p standby=on -m ip <IF 2 NAME>
# ipadm add-ipmp -i <IF 1 NAME> -i <IF 2 NAME> <BOND NAME>
# ipadm create-addr -T static -a local=<IPv4 ADDRESS>/<Netmask in CIDR format> <BOND NAME>/v4
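
When the same sequence has to be repeated on every compute node, it can help to generate the commands from a few parameters rather than retype them.  The following Python sketch fills in the placeholders from the summary above and returns the command lines as strings; it does not run anything.  The function name tagged_vlan_bond_commands and its parameter names are hypothetical.

```python
def tagged_vlan_bond_commands(bond, link1, link2, vlan_id, if1, if2, address_cidr):
    """Return the summary command sequence with the placeholders filled in.

    link1/link2 are the data links for eoib0/eoib1, if1/if2 are the VNIC
    names to create, and if2 becomes the standby interface.
    """
    return [
        "ipadm create-ipmp %s" % bond,
        "ipadm create-ip %s" % link1,
        "ipadm create-ip %s" % link2,
        "dladm create-vnic -l %s -v %s %s" % (link1, vlan_id, if1),
        "dladm create-vnic -l %s -v %s %s" % (link2, vlan_id, if2),
        "ipadm create-ip %s" % if1,
        "ipadm create-ip %s" % if2,
        "ipadm set-ifprop -p standby=on -m ip %s" % if2,
        "ipadm add-ipmp -i %s -i %s %s" % (if1, if2, bond),
        "ipadm create-addr -T static -a local=%s %s/v4" % (address_cidr, bond),
    ]


if __name__ == "__main__":
    # Values taken from the worked example in this post.
    for command in tagged_vlan_bond_commands(
        "bond1", "net9", "net8", 1706, "eoib0_1706", "eoib1_1706", "10.100.44.68/22"
    ):
        print("# " + command)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere and leaves the review of each line to the administrator.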
 

Many thanks to Steve Hall for working all this out!

Friday, April 5, 2013

Access LDAP from internal Exalogic vServers

Introduction

As discussed in my earlier postings about setting up LDAP to enable access to shared storage for NFSv4, a virtualised Exalogic also needs a shared directory for authentication.  In many data centres a directory of some sort may already be set up external to the Exalogic to hold the user accounts for access to Unix environments, and an Exalogic should be able to use this authentication source.

An LDAP service may be available on one or more networks but is most likely to appear on one only.  An Exalogic has at least two networks connected to the datacentre: a management network that links the physical components together, and a 10GbE network that can provide access to a deployed application.  Often these two networks are kept separate for security reasons.  For Exalogic this poses an issue for shared authentication between the storage and the running vServers: the storage has no direct access to the 10GbE network and the vServers have no direct access to the 1GbE management network, yet both need to be able to access a shared LDAP resource.

The issue is compounded further if we are building up a secure vServer topology with only the web tier having access to the client network, as shown in the deployment topology discussed when considering the Infiniband network.


Figure 1 : vServer deployment with a web and application tier

In this situation the application tier vServers have no access to either the 10GbE network or the 1GbE management network.  As such how can we use an external directory to provide a shared authentication source?

This posting considers a few possible solutions to the scenario.

Problem Statement

The problem is that both vServers and the shared storage require access to the same directory for authentication purposes so that shares can be mounted using NFSv4.  The visibility of networks is limited and different for different components as shown in Table 1.

Component                   | Management Network (1GbE) | External Network (10GbE) | Internal/Private Network (IB)
ZFS Storage Appliance       | Yes                       | No                       | Yes (vServer-shared-storage)
vServer (web tier)          | No                        | Yes                      | Yes (vServer-shared-storage)
vServer (internal/app tier) | No                        | No                       | Yes (vServer-shared-storage)
Table 1 - Component network access

So how do we set up an environment in which all vServers and the shared storage can access the same directory service?

Potential Solutions

  1. Ensure the directory service is available/routable on both the management and 10GbE networks.  Give all vServers an interface to the 10GbE network.  (The 10GbE network can be VLAN tagged to a management only network.)
  2. Ensure the directory service is available/routable on both the management and 10GbE networks.  Create a new vServer that has interfaces on both the 10GbE network and the internal private network.   (IPoIB-vserver-shared-storage is a good internal candidate for this.)  Then set up this vServer to be a gateway/router.  All internal vServers must have a static route that will go via this gateway for the IP addresses of the directory servers.
  3. Make the directory available on the 10GbE network and then create replicas of the directory that run in vServers on the Exalogic rack.  These replicas can make their services available to the internal components.
  4. Make the directory available on the 10GbE network and then include a vServer that runs an LDAP proxy so that internal components can access the external directory through the proxy service.
This blog posting is going to consider the fourth option in more detail.  This is partially in the light of the recent 11.1.1.7 release of Oracle Traffic Director (OTD), which now supports load balancing LDAP requests and hence is a good candidate for use as an LDAP proxy.

OTD can be downloaded from the public Oracle website.  The primary new functionality in this release is:
  1. TCP load-balancing support. This allows OTD to be an entry point to load balance HTTP and non HTTP traffic including connect-time LDAP, T3/RMI etc.
  2. HTML5 WebSockets reverse proxy support
  3. Graphical expression builder for reverse proxy routing rules
  4. Additional WLS load-balancing/keepalive synchronization optimizations
  5. Web Application Firewall Support - (ModSecurity based Firewall to inspect and reject requests). Supports well recognized rulesets from OWASP Core Ruleset
  6. OAM 11g WebGate support
  7. Inter-operability certification with FMW 11.1.1.7 and with Classic Portal / Forms.
  8. Exalogic Solaris support
We are interested in the load balancing of LDAP.  In a recent Blog posting Paul Done wrote about using OTD with T3/RMI load balancing.

LDAP Proxy Solution

For this solution we will set up a vServer that hosts OTD.  It will listen on the internal networks and forward LDAP requests to the external LDAP server.  The architecture of such a design is shown in Figure 2.

Figure 2: High Level Architecture of using LDAP Proxy

Thus in this case the ZFS SA and the internal vServer are both pointing to the LDAP Proxy, which is set up using OTD as a TCP/LDAP load balancer.  It listens on the IPoIB-vserver-shared-storage network for incoming LDAP requests.  These requests are forwarded on to the external LDAP service.  Thus any vServer with access to the IPoIB-vserver-shared-storage network is able to mount the shares from the internal ZFS appliance using NFSv4.

Considerations

 

High Availability

In the architecture that is shown in Figure 2 the external LDAP is a highly available service that is running on two physically separate OS instances so that should one fail the other is able to service the requests.  The diagram shows a single LDAP proxy vServer, so should that fail then the NFSv4 mounts would also fail because the ZFS Appliance would not have a route to the external directory.  There are two solutions to this issue: either use the HA features of vServers running on Exalogic or use the HA features of OTD to create a VIP and run two vServers as part of a failover group.

The former case of using Exalogic vServer HA is by far the simplest solution.  If Exalogic senses that the LDAP Proxy vServer has failed it will automatically restart the server.  Thus, provided the OTD instances are configured to start on boot, the LDAP service should only be down for a short period of time while the vServer restarts.  This is probably acceptable in non-production environments.  However, it is possible for the service within the vServer to fail for some reason, and in this scenario the LDAP service would then become unavailable as the Exalogic vServer HA would not be activated.

To cater for this situation two vServers in a distribution group should be configured as LDAP proxies with OTD running as an HA Failover group.  This solution would identify a vServer failure very quickly and migrate the VIP over to the remaining vServer immediately.  It is a slightly more complex environment to configure, but for a production environment where any down time is critical this solution should be used.

vServer access on the vserver-shared-storage network

When a vServer is given access to the vserver-shared-storage network it will automatically be setup as a limited member of the Infiniband partition.   This makes perfect sense as a security consideration because it means that any vServer on this network is only able to access the shared storage appliance and no other vServer on the network.  However, in the case of setting up an LDAP proxy server we want to enable the vServer to be a full member of the partition so that any of the other vServers can access it.   Only a full system administrator of the Exalogic rack will be able to do this.  The process to follow is:-

1.  Shutdown the vServer you want to promote. (This example assumes a server has access to the IPoIB-vserver-shared-storage and it is this network that is being promoted to full member.)

2.  Locate the vm.cfg of the server by using ssh to connect to any of the underlying OVS physical compute nodes. Change directory to /OVS/Repositories/nnnnnn/VirtualMachines.  The number in the example path shown below is unique for each Exalogic Control implementation.  In the example below we are going to make the LDAP proxy server visible on this network.

[root@el01cn01 ~]# cd /OVS/Repositories/0004fb00000300000ca29f8ce7f571fa/VirtualMachines
[root@el01cn01 VirtualMachines]# grep -r ldap .
./0004fb0000060000d4f615c6df13c8f1/vm.cfg:OVM_simple_name = 'ldap-proxy'

This identifies the vm.cfg file we need to edit.

3.  Identify the partition number for the network you want the vServer to become a full member of. Generally this is likely to be the IPoIB-vserver-shared-storage, in which case the default partition is 0005, as shown in the Exalogic Control screenshot below.

Figure 3 : Network summary details showing the Partition (P-Key)

4.  Edit the vm.cfg file and, in the line identified by exalogic_ipoib, change the partition from 0005 to 8005. (The most significant bit of an IB partition key indicates the membership type, hence 0005 and 8005 refer to the same partition, but with an 8 at the start the vServer becomes a full member.)


exalogic_ipoib = [{'pkey': ['0x0005', '0x0003'], 'port': '1'}, {'pkey': ['0x0005', '0x0003'], 'port': '2'}]

To

exalogic_ipoib = [{'pkey': ['0x8005', '0x0003'], 'port': '1'}, {'pkey': ['0x8005', '0x0003'], 'port': '2'}]


Remember to change the partition key for BOTH ports.
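
The change made in step 4 can be sketched in shell as follows.  This is only a minimal illustration and not a supported procedure: the function names to_full_pkey and promote_line are hypothetical, and the sed expression assumes the pkey entries are quoted exactly as in the example lines above.  It sets the most significant bit of the 16 bit partition key and rewrites every occurrence of the limited key in the text it is given.

```shell
# Hypothetical helper: set the most significant bit (0x8000) of a 16 bit
# partition key, turning a limited member key into the full member key.
to_full_pkey() {
    printf '0x%04x\n' $(( $1 | 0x8000 ))
}

# Hypothetical helper: rewrite every quoted occurrence of the limited key
# in the text read from standard input (assumes the quoting used in the
# exalogic_ipoib example line).
promote_line() {
    limited=$1
    full=$(to_full_pkey "$limited")
    sed "s/'$limited'/'$full'/g"
}

to_full_pkey 0x0005
echo "exalogic_ipoib = [{'pkey': ['0x0005', '0x0003'], 'port': '1'}, {'pkey': ['0x0005', '0x0003'], 'port': '2'}]" | promote_line 0x0005
```

The first command prints 0x8005 and the second prints the exalogic_ipoib line with the key changed for both ports.  Working on a copy of vm.cfg while the vServer is shut down would be the cautious way to try this.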

5.  Restart the vServer to ensure that the visibility is as expected and it can be accessed from other vServers.


Appendix

Auto-start of OTD instance

Below is a very simple example script that can be used to automatically startup the OTD instance.


[root@ldap-proxy ~]# cat /etc/init.d/otd
#!/bin/sh
# chkconfig init header
#
# otd: Oracle Traffic Director
#
# chkconfig: 345 92 8
# description: Oracle Traffic Director Server \
# Start/Stop the OTD installation automatically
#
#
# Script to start and stop the OTD instances during shutdown and restart of the machine
PATH=/usr/bin:/bin:/usr/local/bin:$PATH
export PATH
OTD_HOME=/u01/instances/otd/admin
export OTD_HOME
installUser=oracle

case "$1" in
start)
# Start the administration server, then the LDAP proxy instance
COMMAND="$OTD_HOME/admin-server/bin/startserv"
su - "$installUser" -c "$COMMAND"
COMMAND="$OTD_HOME/net-ldap-proxy/bin/startserv"
su - "$installUser" -c "$COMMAND"
;;
stop)
# Stop the administration server, then the LDAP proxy instance
COMMAND="$OTD_HOME/admin-server/bin/stopserv"
su - "$installUser" -c "$COMMAND"
COMMAND="$OTD_HOME/net-ldap-proxy/bin/stopserv"
su - "$installUser" -c "$COMMAND"
;;
status)
# List the running processes of the LDAP proxy instance
ps -ef | grep net-ldap-proxy
;;
*)
echo "Usage: $0 {start|stop|status}"
exit 1
;;
esac

Simply create the otd file in /etc/init.d then use the # chkconfig --add otd command to add it to the list of managed services.  The service should then automatically start on boot.

[root@ldap-proxy ~]# chkconfig --list otd
otd             0:off   1:off   2:on    3:on    4:on    5:on    6:off
[root@ldap-proxy ~]# service otd stop
server has been shutdown
server has been shutdown
[root@ldap-proxy ~]# service otd start
Oracle Traffic Director 11.1.1.7.0 B01/14/2013 04:13
[NOTIFICATION:1] [OTD-80118] Using [Java HotSpot(TM) 64-Bit Server VM, Version 1.6.0_35] from [Sun Microsystems Inc.]
[NOTIFICATION:1] [OTD-80000] Loading web module in virtual server [admin-server] at [/admin]
[NOTIFICATION:1] [OTD-80000] Loading web module in virtual server [admin-server] at [/jmxconnector]
[NOTIFICATION:1] [OTD-10358] admin-ssl-port: https://ldap-proxy:1895 ready to accept requests
[NOTIFICATION:1] [OTD-10487] successful server startup
Oracle Traffic Director 11.1.1.7.0 B01/14/2013 04:13
[NOTIFICATION:1] [OTD-10358] tcp-listener-1: tcp://tcpserver:3389 ready to accept requests
[NOTIFICATION:1] [OTD-10487] successful server startup
[root@ldap-proxy ~]# service otd status
oracle   28131     1  0 06:07 ?        00:00:00 trafficd-wdog -d /u01/instances/otd/admin/net-ldap-proxy/config -r /u01/products/otd -t /tmp/net-ldap-proxy-7dd0931e -u oracle
oracle   28132 28131  1 06:07 ?        00:00:00 trafficd -d /u01/instances/otd/admin/net-ldap-proxy/config -r /u01/products/otd -t /tmp/net-ldap-proxy-7dd0931e -u oracle
oracle   28133 28132  0 06:07 ?        00:00:00 trafficd -d /u01/instances/otd/admin/net-ldap-proxy/config -r /u01/products/otd -t /tmp/net-ldap-proxy-7dd0931e -u oracle
root     28162 28160  0 06:07 pts/0    00:00:00 grep net-ldap-proxy

Thursday, March 14, 2013

Virtualised Exalogic and its use of Infiniband Partitions

Introduction

Right from the outset Exalogic has depended heavily on the Infiniband network interconnect to link all the components together.  With the release of the virtualised Exalogic and all the multi-tenancy features the flexibility of the Infiniband fabric to provide secure networking is critical.  This posting is an attempt to explain some of the underlying operations of the Infiniband network and how it provides secure networking capability.

Infiniband Partitions

When thinking about how we can keep virtual servers (or vServers, which equate to the guest operating systems) isolated from each other there is often a statement made that Infiniband Partitions are analogous to VLANs in the Ethernet world.  This analogy is a good one although the underpinnings of Infiniband (IB) are very different from Ethernet routing.

An IB fabric will consist of a number of switches.  These can either be Spine switches or Leaf switches, a spine switch being one that connects switches together and a leaf being one that connects to hosts.  On a standalone Exalogic, or a smaller Exalogic cabled together with an Exadata, we can link all the compute nodes and storage heads/Exadata storage cells together via leaf switches only.  Each physical component or host that is connected to the fabric contains a dual ported Host Channel Adapter (HCA) card that allows cabling from the host to multiple switches.  There are therefore two possible deployment architectures, as shown below:-


Simple topology - 1 level only
Spine Switch topologies - 2 Levels
In both of these cases a "Fat Tree" or "Clos" topology is followed.  This simply means that there are more physical cables connecting the switches together so that there is additional bandwidth/capacity on the busier logical links.  In the above diagrams the "host" could be a compute node or storage component.

When we create an IB partition what we are doing is instructing the fabric about which hosts can communicate with other hosts in the fabric over that particular partition.  There are a couple of very simple rules to an Infiniband fabric:-
  • A full member can communicate with all members of the partition
  • A limited member can only communicate with a full member of the partition
Using these rules we can create multiple partitions to build up the appropriate visibility.   So for example suppose we are using an Exalogic and are deploying App A to hosts 1 & 2 and App B to hosts 3 & 4, but both applications need to access a shared service App C that is located on hosts 5 & 6.  Then we might have partitions set up as shown below.

A Simple Partitioning example

In this case Hosts 1 & 2 are both full members of partition 1 so App A can talk to both instances.  Partition 2 has hosts 3 & 4 as full members for App B.  Then Partition 3 has hosts 5 & 6 as full members and Hosts 1-4 as limited members.  In this way both applications A & B are able to access the services of App C, but because limited members cannot communicate with each other it is impossible for App A to use partition 3 to gain any access to App B.
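
The two membership rules and the example above can be modelled with a few lines of shell.  This is purely an illustrative sketch: the function names membership, can_talk and reachable are hypothetical and the host and partition numbers are simply those of the example, not anything read from a real fabric.

```shell
# Hypothetical model of the example: print the membership type of a host
# in a partition.  Usage: membership <partition> <host>
membership() {
    case "$1:$2" in
        1:1|1:2) echo full ;;                 # App A
        2:3|2:4) echo full ;;                 # App B
        3:5|3:6) echo full ;;                 # App C (shared service)
        3:1|3:2|3:3|3:4) echo limited ;;      # clients of App C
        *) echo none ;;
    esac
}

# Two hosts can talk over a partition if both are members and at least
# one of them is a full member.  Usage: can_talk <partition> <hostA> <hostB>
can_talk() {
    a=$(membership "$1" "$2")
    b=$(membership "$1" "$3")
    [ "$a" != none ] && [ "$b" != none ] || return 1
    [ "$a" = full ] || [ "$b" = full ]
}

# Succeeds if any of the three partitions allows the two hosts to talk.
reachable() {
    for p in 1 2 3; do
        can_talk "$p" "$1" "$2" && return 0
    done
    return 1
}

for pair in "1 2" "1 5" "1 3"; do
    set -- $pair
    if reachable "$1" "$2"; then
        echo "host $1 <-> host $2: allowed"
    else
        echo "host $1 <-> host $2: blocked"
    fi
done
```

Hosts 1 and 2 are allowed via partition 1, hosts 1 and 5 are allowed via partition 3, and hosts 1 and 3 are blocked because the only partition they share has them both as limited members.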


Virtualised Exalogic and Partitions

Now we have an idea of what a partition does, let's dive into the details of a virtualised Exalogic and consider just how we can create multiple vServers and maintain network isolation between them.  We will do this by considering an example topology consisting of two applications, each deployed in a 3 tier topology, although we will not concentrate much on the DB tier in this blog posting.  The first tier is a load balancing set of "Oracle Traffic Director" instances, then there is an application tier with multiple WebLogic server instances, and below that a database tier.   The diagram below shows the sort of Exalogic deployment we will consider, with the database tier omitted for simplicity.

Application Deployments on a virtualised Exalogic

So in this case the client population is split across different tagged VLANs -  Application A using VLAN 100 and Application B using VLAN 101.  Both of these VLANs are connected to different "Ethernet over Infiniband" or EoIB networks on the Exalogic rack, where two partitions are created with each partition including the special Ethernet "Bridge" ports on the Exalogic gateway switches.  Internally each set of vServers is also connected to another private network which is implemented as another IB partition.  Should the vServers need access to the ZFS Storage appliance that is held within the Exalogic rack then they are connected to the storage network.  The storage network is a special one created on installation of the rack.  Any connected vServers are limited members with just the storage appliance as a full member, so there is no way to connect between vServers on this network.

Now think about the switch setup for each of these networks and how the data is transmitted.


Ethernet over Infiniband Networks (EoIB connecting to VLAN tagged external network)

 

These networks can be created as described in the "tea break snippets".  What we will concentrate on is what is happening behind the scenes, using some of the Infiniband commands to investigate further.  If we consider the connection to the Ethernet VLAN 100 then we can investigate just how the partition has been set up, firstly using the showvnics command to identify the external NICs that are in use and then using the smpartition list active command to pick out the partition setup.


[root@<gw name> ~]# showvnics | grep UP | grep 100
ID  STATE     FLG IOA_GUID                NODE                        IID  MAC               VLN PKEY   GW
--- --------  --- ----------------------- --------------------------- ---- ----------------- --- ----   --------
  4 UP          N A5960EE97D134323        <CN name> EL-C <cn IP>      0000 00:14:4F:F8:69:5D 100 800a   0A-ETH-1
 38 UP          N 4A282D7AEEF49768        <CN name> EL-C <cn IP>      0000 00:14:4F:FB:34:19 100 800a   0A-ETH-1
 48 UP          N 0C525ADA561F7179        <CN name> EL-C <cn IP>      0000 00:14:4F:FB:22:14 100 800a   0A-ETH-1
 49 UP          N F9E4DC33D70A0DA2        <CN name> EL-C <cn IP>      0000 00:14:4F:F8:BE:1F 100 800a   0A-ETH-1
 30 UP          N AE82EFAD4B2425C7        <CN name> EL-C <cn IP>      0000 00:14:4F:F8:55:58 100 800a   0A-ETH-1
 25 UP          N 73BDD3B88EE8CFE1        <CN name> EL-C <cn IP>      0000 00:14:4F:F9:FA:01 100 800a   0A-ETH-1
 50 UP          N CC5301F73630E6EA        <CN name> EL-C <cn IP>      0000 00:14:4F:F9:5A:85 100 800a   0A-ETH-1
 28 UP          N D2B7E3F14328A5F6        <CN name> EL-C <cn IP>      0000 00:14:4F:FA:66:7E 100 800a   0A-ETH-1
 29 UP          N 31E36AB07CFE81FB        <CN name> EL-C <cn IP>      0000 00:14:4F:FB:5B:45 100 800a   0A-ETH-1

#
Display of virtual Network Interface Cards on IB network

We are seeing the virtual NICs that are configured for VLAN 100 on the switch.  In this example we have 9 vNICs operational and they are using the partition identified by the hex number 800a.  (In my test environment there are more vServers than shown in the architecture diagrams above.)  The IOA GUID is reflected on the vServer that is associated with the specific vNIC; this can be shown using the mlx4_vnic_info command on the vServer, with the Ethernet NIC matched by the MAC or HWaddr.

[root@<My vServer> ~]# ifconfig eth166_1.100
eth166_1.100 Link encap:Ethernet  HWaddr 00:14:4F:FB:22:14 
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:514101 errors:0 dropped:0 overruns:0 frame:0
          TX packets:52307 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:769667118 (734.0 MiB)  TX bytes:9041716 (8.6 MiB)



[root@<My vServer> ~]# mlx4_vnic_info  -i eth166_1.100 | grep -e IOA
IOA_PORT      mlx4_0:1
IOA_NAME      localhost HCA-1-P1
IOA_LID       0x0040
IOA_GUID      0c:52:5a:da:56:1f:71:79
IOA_LOG_LINK  up
IOA_PHY_LINK  active
IOA_MTU       2048
Matching GUID of VNIC to GUID in the vServer
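
The switch reports the GUID as 16 upper case hex digits while mlx4_vnic_info prints it in lower case with colon separators, so comparing the two by eye is error prone.  A small sketch of normalising both forms before comparing them is shown below.  The function name normalise_guid is hypothetical, and stripping a leading 0x is an extra assumption so that the same helper also copes with the GUID format printed by ibstat and smpartition.

```shell
# Hypothetical helper: reduce a GUID to 16 upper case hex digits so that
# the different notations can be compared as plain strings.
normalise_guid() {
    printf '%s\n' "$1" | tr -d ':' | sed 's/^0[xX]//' | tr 'abcdef' 'ABCDEF'
}

switch_guid=$(normalise_guid "0C525ADA561F7179")          # IOA_GUID from showvnics
vserver_guid=$(normalise_guid "0c:52:5a:da:56:1f:71:79")  # IOA_GUID from mlx4_vnic_info

if [ "$switch_guid" = "$vserver_guid" ]; then
    echo "vNIC matches: $switch_guid"
else
    echo "vNIC does not match"
fi
```

With the values taken from the two listings above the GUIDs match, which ties vNIC ID 48 on the switch to the eth166_1.100 interface on the vServer.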

We now can consider the actual partition.

[root@<gateway 1> ~]# smpartition list active
...
  = 0x800a:
0x0021280001a16917=both,
0x0021280001a1809f=both,
0x0021280001a180a0=both,
0x0021280001a17f23=both,
0x0021280001a17f24=both,
0x0021280001a1788f=both,
0x0021280001a1789b=both,
0x0021280001a1716c=both,
0x0021280001a1789c=both,
0x0021280001a1716b=both,
0x0021280001a17890=both,
0x0021280001a17d3c=both,
0x0021280001a17d3b=both,
0x0021280001a17717=both,
0x0021280001a16918=both,
0x0021280001a17718=both,
0x002128c00b7ac042=full,
0x002128c00b7ac041=full,
0x002128bea5fac002=full,
0x002128bea5fac001=full,
0x002128bea5fac041=full,
0x002128bea5fac042=full,
0x002128c00b7ac001=full,
0x002128c00b7ac002=full;
...


Partition Membership for partition 800a

What we can see here is that the partition includes 16 port GUIDs that have membership of "both" and a further 8 port GUIDs that are full members.  (IB partition membership of "both" is an Oracle value add to Infiniband to allow vServers to be given either full or limited partition membership.)

What I can surmise from this output is that we are dealing with a 1/4 rack.  I know this because there are 16 entries with both membership and each physical compute node is dual ported, so has two entries in the partition, giving 8 compute nodes.  We can match the GUID to the physical channel adapter (CA) by running the ibstat command on the compute node.

[root@el2dcn07 ~]# ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x0021280001a17d3a
        System image GUID: 0x0021280001a17d3d
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 64
                LMC: 0
                SM lid: 57
                Capability mask: 0x02510868
                Port GUID: 0x0021280001a17d3b
                Link layer: IB
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 65
                LMC: 0
                SM lid: 57
                Capability mask: 0x02510868
                Port GUID: 0x0021280001a17d3c
                Link layer: IB
Infiniband port information on a compute node

There are also the 8 full member entries.  These relate to the Ethernet bridge technology of the Infiniband gateway switch.  Each switch has two physical ports that allow Ethernet connectivity to the external datacentre and each port is viewed as a dual ported channel adapter, hence each IB switch has four adapter GUIDs or 8 for the pair of gateway switches.
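
The counting argument above can be reproduced with a short shell sketch.  The function name count_members is hypothetical, and the listing is simply the set of entries for partition 0x800a copied from the smpartition output shown earlier; the divisors come from the reasoning in the text (two HCA ports per compute node, four adapter GUIDs per gateway switch).

```shell
# Hypothetical helper: count the entries with a given membership type in
# partition listing text read from standard input.
count_members() {
    grep -c "=${1}[,;]"
}

listing=$(cat <<'EOF'
0x0021280001a16917=both,
0x0021280001a1809f=both,
0x0021280001a180a0=both,
0x0021280001a17f23=both,
0x0021280001a17f24=both,
0x0021280001a1788f=both,
0x0021280001a1789b=both,
0x0021280001a1716c=both,
0x0021280001a1789c=both,
0x0021280001a1716b=both,
0x0021280001a17890=both,
0x0021280001a17d3c=both,
0x0021280001a17d3b=both,
0x0021280001a17717=both,
0x0021280001a16918=both,
0x0021280001a17718=both,
0x002128c00b7ac042=full,
0x002128c00b7ac041=full,
0x002128bea5fac002=full,
0x002128bea5fac001=full,
0x002128bea5fac041=full,
0x002128bea5fac042=full,
0x002128c00b7ac001=full,
0x002128c00b7ac002=full;
EOF
)

both=$(printf '%s\n' "$listing" | count_members both)
full=$(printf '%s\n' "$listing" | count_members full)

echo "compute nodes:    $(( both / 2 ))"   # two HCA ports per compute node
echo "gateway switches: $(( full / 4 ))"   # four adapter GUIDs per switch
```

For this listing the sketch reports 8 compute nodes and 2 gateway switches, which agrees with the quarter rack conclusion.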

For a virtual Exalogic, partitions are enforced at the end points - in the ports of the host channel adapter.  The partition plays no part in the routing of traffic through the fabric, which is managed by the subnet manager using the Local Identifiers (LIDs).  Thus traffic linked to a particular partition can be routed anywhere in the fabric.  It is possible to use Infiniband such that each switch maintains a partition table; the switch can then inspect the packet headers to match the P-Key in the header and enforce that only packets matching entries in the partition table are allowed through.  This is not done for a virtual Exalogic.

Each HCA has two ports and each port maintains its own partition table, which is updated by the subnet manager with all the partition keys that are accessible by that HCA.  Thus when a packet comes in the pKey of the header is matched to the local partition table and if no match is found then the packet is dropped.  So in our example we can see that the partition handling traffic from the external world is effectively allowed to travel to every compute node.  This is necessary because the vServer may migrate from physical compute node to compute node and must still be able to communicate with the external world.   However, how does this help with security and multi-tenancy if traffic can flow to every node?

The answer is that the isolation is solved at a different level.  Each vServer gets allocated a virtual function within the HCA.  The diagram below shows how the physical HCA can create multiple virtual functions  (up to 63) that are then allocated to each virtual machine.  This is also the mechanism that provides single root IO virtualisation (SR-IOV) for the optimal performance with flexibility.



From the fabric perspective partitioning is always set up at the physical level. There is one physical partition table per port of the HCA, and the Subnet Manager updates this table with the pKeys of partitions that are accessible by that HCA. So when a packet comes in, the pKey in its header is matched with the partition table of the physical port receiving the packet, and if no match is found the packet is dropped.  In a virtual Exalogic all compute nodes in the Exalogic are in the partition so traffic is never rejected via this route, unless it has been directed at the storage or another IB connected device such as an Exadata.

So what about isolation for Virtual Machines?   Each Virtual Function (PCIe function) has its own virtual partition table.  This table is not visible to or programmed by the Subnet Manager; it is programmed by the Dom0 driver. The partition table carries an index of the entries in the physical partition table that are accessible for each virtual function, i.e. a mapping of partitions in the Infiniband network to specific vServers running on the compute node.

Enforcement is achieved by using the Infiniband construct of a queue pair (QP).  Queue pairs consist of a pair of send and receive queues that are used by software to communicate between hardware nodes.  Each HCA can support millions of QPs and each QP belongs to a PCIe function (either a physical function or a virtual function).  Assignment is performed by the Dom0 driver on creation of a QP.  Each QP can only belong to one partition, so when Dom0 creates a queue pair for a vServer on a specific partition it will use the virtual partition table of the VF to ensure this QP can be created.  When the QP receives a packet it checks that the pKey in the packet header matches the pKey that is assigned to the QP.

Or to put it another way, when a virtual machine is started up on a compute node then the configuration of the guest VM interacts with the hypervisor to ensure that the QPs for the guest VM will only be for partitions that it is allowed to communicate over.  The allowed partitions are defined in the vServer's vm.cfg file.
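
The check made by a queue pair can be illustrated with a small shell model.  This is a sketch under assumptions rather than a description of the HCA firmware: the function names pkeys_match and qp_accepts are hypothetical, and the matching rule used (the low 15 bits identify the partition and at least one side must have the full membership bit set) is the general Infiniband convention, which goes further than the statement above that the keys must match.

```shell
# Hypothetical helper: succeed if two 16 bit partition keys match, i.e.
# the low 15 bits (the partition number) are equal and at least one side
# has the membership bit (0x8000) set.
pkeys_match() {
    [ $(( $1 & 0x7fff )) -eq $(( $2 & 0x7fff )) ] &&
        [ $(( ($1 | $2) & 0x8000 )) -ne 0 ]
}

# Hypothetical model of the decision a queue pair takes for an incoming
# packet.  Usage: qp_accepts <QP pKey> <packet pKey>
qp_accepts() {
    if pkeys_match "$1" "$2"; then
        echo "accept"
    else
        echo "drop"
    fi
}

qp_accepts 0x8005 0x0005   # full member QP, packet from a limited member
qp_accepts 0x0005 0x0005   # both sides limited members of partition 5
qp_accepts 0x8005 0x800a   # keys from two different partitions
```

Under these assumptions the first packet is accepted and the other two are dropped, which is the behaviour the full and limited membership rules describe.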


Conclusion

By using Infiniband partitions and the Exalogic "secret sauce" in the virtualised Exalogic we set up secure communication paths.  The IB fabric and HCA ensure that traffic on a particular partition can only communicate with the Exalogic Compute Nodes in the environment, then the hypervisor and HCA work in conjunction with each other to ensure that each vServer only has access to the specific partitions that have been administratively allocated to it.
In this way it is possible to maintain complete network isolation between vServers while making use of a shared physical infrastructure.