amit.handa
2018-03-01 08:26
My VirtualBox VM is not PXE booting from DRP (bootenv: sledgehammer) after tftp'ing lpxelinux.0. The VM network logs (Wireshark) show the following error:
```
477 0.378296662 10.10.20.76 10.10.31.96 TFTP 159 Error Code, Code: File not found, Message: open /var/lib/dr-provision/tftpboot/pxelinux.cfg/16089a59-9abd-48c2-850a-2ac3bc134935: no such file or directory
```

amit.handa
2018-03-01 08:27
I see that there is no such file in the dr-provision file area

amit.handa
2018-03-01 08:27
What could the possible bug be? I might have done something wrong.

amit.handa
2018-03-01 08:27
Thanks

vlowther
2018-03-01 13:19
That is expected behavior for lpxelinux.0; see http://www.syslinux.org/wiki/index.php?title=PXELINUX#Configuration

vlowther
2018-03-01 14:52
Specifically, lpxelinux tries to fetch config files in a specific order. The first one is the DHCP client ID, which we don't use because (for servers or anything else that network boots) the client ID is a much worse unique identifier than the MAC address of the interface or the IP address the interface was assigned.
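
For reference, per the syslinux docs linked above, that "file not found" is just the first miss in lpxelinux's config search: it walks a fixed list of candidate paths until one answers. Roughly like this (the UUID is the one from the trace above, the hex IP assumes 10.10.31.96 is the client, and the MAC is a placeholder):

```
pxelinux.cfg/16089a59-9abd-48c2-850a-2ac3bc134935   (client UUID)
pxelinux.cfg/01-88-99-aa-bb-cc-dd                   (01- plus the MAC address)
pxelinux.cfg/0A0A1F60                               (IP in hex, then progressively shortened)
pxelinux.cfg/0A0A1F6
pxelinux.cfg/0A0A1F
...
pxelinux.cfg/default
```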

greg
2018-03-01 15:53
@wdennis - `tip` community content has a fix for the Ubuntu install.

greg
2018-03-01 15:54
You can use the current content `tip` with stable DRP.

greg
2018-03-01 15:55
I also have a PR in community content that moves all the partman options into the schema file. I know this will break community, so I haven't pulled it in.

greg
2018-03-01 15:55
I also want to spend more time on it to reorg it a little more.

wdennis
2018-03-01 16:28
Thx @greg

wdennis
2018-03-01 16:30
I do think consolidating all partman directives into a single template is the sanest option... But do recognize the need to get current users on board with that change.

amit.handa
2018-03-02 12:47
thanks !

wdennis
2018-03-02 17:07
@greg Confirming new partman preseed directives in `tip` community content work now...

wdennis
2018-03-02 17:07
I got:
```
#Partitioning Scheme
d-i partman-auto/disk string /dev/sda
d-i grub-installer/choose_bootdev select /dev/sda
d-i grub-installer/bootdev string /dev/sda
d-i partman-auto/method string lvm
d-i partman-auto-lvm/guided_size string max
d-i partman-auto-lvm/new_vg_name string testnode01
d-i partman-auto/choose_recipe select atomic
d-i grub-installer/only_debian boolean true
d-i partman/confirm_write_new_label boolean true
d-i partman/choose_partition select finish
d-i partman/confirm boolean true
d-i partman/confirm_nooverwrite boolean true
```
and it successfully partitioned the drive.

wdennis
2018-03-02 17:53
Refresh my memory - the string value of `select-kickseed` is the template name WITH or WITHOUT the `.tmpl` at the end?

greg
2018-03-02 17:53
with

wdennis
2018-03-02 17:54
Ah, that's why it didn't work :stuck_out_tongue_winking_eye:

wdennis
2018-03-02 17:55
It's the part-scheme one that doesn't want the .tmpl

lae
2018-03-02 19:34
is it possible to have like a bootenv with some set templates and kernel parameters, but then also have several stages with an "extra" template and kernel parameter?

lae
2018-03-02 19:35
and while I was typing that out, it just hit me that I guess this is a scenario I could also solve with a drpcli runner, hm...

lae
2018-03-02 19:40
Is there a better way to get the bare DRP server IP or hostname other than parsing out `.Env.InstallUrl` or `.ProvisionerURL`?


rstarmer
2018-03-03 00:23
FYI, something is wrong in either the upstream, or more likely the ubuntu repo pointers:
```
drpcli bootenvs uploadiso ubuntu-16.04-install
Error: Unable to initiate download of http://mirrors.kernel.org/ubuntu-releases/16.04/ubuntu-16.04.3-server-amd64.iso: 404 Not Found
```

rstarmer
2018-03-03 00:23
this was from a new install against stable.

shane
2018-03-03 00:24
@rstarmer checking it ...

shane
2018-03-03 00:24
we had an issue w/ CentOS yanking the 7.3 ISOs with zero warnings

rstarmer
2018-03-03 00:25
seems .3 ISO is gone.

shane
2018-03-03 00:25
also note ... if you have a copy of the ISO (Until we update the Contents) - you can install the ISO via the UX - go to the Boot ISOs menu item

shane
2018-03-03 00:26
or you can copy the ISO to your tftpboot/isos/ directory (either in ~/drp/drpdata, or the /var/lib/dr-provision directory)

shane
2018-03-03 00:26
then restart DRP
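
A sketch of those two options, assuming a production-mode install and the ISO already on the endpoint (filename taken from the URL above; isolated installs use the `~/drp/drpdata` location shane mentions instead):

```bash
# option 1: upload via the CLI
drpcli isos upload ubuntu-16.04.3-server-amd64.iso as ubuntu-16.04.3-server-amd64.iso

# option 2: copy it into place, then restart DRP
cp ubuntu-16.04.3-server-amd64.iso /var/lib/dr-provision/tftpboot/isos/
systemctl restart dr-provision
```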

shane
2018-03-03 00:27
we'll put out an updated version of Contents shortly

rstarmer
2018-03-03 00:34
Ubuntu has the older ISOs here too for future reference: http://old-releases.ubuntu.com/releases/xenial/ubuntu-16.04.3-server-amd64.iso

shane
2018-03-03 00:34
yeah ... never mind using a symlink to point "latest" at the moving target ... sigh ...

rstarmer
2018-03-03 00:35
Question, or docs pointer if possible: The host I installed DR on has letsencrypt credentials, how do I tell DR who it actually is?

shane
2018-03-03 00:36
did you do `isolated` or production install ?

rstarmer
2018-03-03 00:36
production

rstarmer
2018-03-03 00:50
also, can I pass a domain name rather than IP address to the config?

shane
2018-03-03 00:53
@rstarmer there is a cert and key that gets installed (the self-signed). It's either in `/` or in `/var/lib/dr-provision/` directories. replace those with your certs and restart dr-provision

shane
2018-03-03 00:53
on the domain name - are you referring to `drpcli` commands ?

rstarmer
2018-03-03 00:56
no, I was thinking about the advertised address from the server; I'd rather it hand out the domain name than the IP.

rstarmer
2018-03-03 00:56
was wondering if I should have passed a name with --static-ip=

shane
2018-03-03 00:56
via the `--static-ip` flag ?

rstarmer
2018-03-03 00:57
yeah, that's what I was wondering :slightly_smiling_face:

shane
2018-03-03 00:57
ah - you really don't need `static-ip` we do a bunch of magic with caching address tables and interface info - and we dynamically serve the correct IP to a client based on their network connection

shane
2018-03-03 00:58
the problem w/ using a hostname/domain - you have to make sure that resolves at the PXE firmware/boot level ... and any DNS issues will impact your provisioning activities

shane
2018-03-03 00:58
you can of course hand out DNS servers w/ the DHCP assignment ... nothing stopping you from doing that ...

rstarmer
2018-03-03 00:58
ok, won't worry there then.

shane
2018-03-03 00:58
but you'll be circumventing our magic logic

rstarmer
2018-03-03 00:59
so the key/cert are in /, which seems like a bad place to put these items. And in reality, I'd rather point to their locations in the letsencrypt directories (so that they stay up to date). Is there a config parameter I can set somewhere?

rstarmer
2018-03-03 00:59
no desire to bypass the magic!

shane
2018-03-03 01:00
```
dr-provision --help
  --tls-key=    The TLS Key File (default: server.key)
  --tls-cert=   The TLS Cert File (default: server.crt)
```

rstarmer
2018-03-03 01:00
k. will re-provision with those.

shane
2018-03-03 01:01
you can specify the location ... that's just the default

rstarmer
2018-03-03 01:23
so that doesn't seem to be working. When I restart dr-provision it regenerates new self-signed keys. Note I'm passing in .pem formatted key/cert, but I also don't see any errors in trying to read them

shane
2018-03-03 01:24
@rstarmer I'm not sure if we accept a PEM format ... will have to check w/ @greg and/or @vlowther on that one

rstarmer
2018-03-03 01:24
Ok, I'm seeing the following:
```
curl -fsSL get.rebar.digital/stable | bash -s -- --tls-key=/etc/letsencrypt/live/gitlab.kumulus.co/privkey.pem --tls-cert=/etc/letsencrypt/live/gitlab.kumulus.co/cert.pem install
Overriding TLS_KEY with /etc/letsencrypt/live/gitlab.kumulus.co/privkey.pem
Overriding TLS_CERT with /etc/letsencrypt/live/gitlab.kumulus.co/cert.pem
'dr-provision' service is not running, beginning install process ...
Ensuring required tools are installed
Installing Version stable of Digital Rebar Provision ...
```

rstarmer
2018-03-03 01:28
started manually, passed the tls params, and that works.

rstarmer
2018-03-03 01:28
going to restart the service and see if it takes this time

greg
2018-03-03 01:28
You have to add them to the service file

shane
2018-03-03 01:29
ah

rstarmer
2018-03-03 01:29
^which - where?

shane
2018-03-03 01:29
yeah - in `/etc/systemd/system/dr-provision.service` - assuming SystemD

greg
2018-03-03 01:29
The install.sh script doesn't do anything with them. We need a plan for that

shane
2018-03-03 01:30
@rstarmer the content update w/ the 16.04.4 fix will be out later this evening; the PR just needs to go through the approval and release process now


rstarmer
2018-03-03 01:35
ok, trying that now. I'll let you know if/as I succeed

rstarmer
2018-03-03 01:37
Yes, success; I had to update the /etc/systemd/system/dr-provision.service file:
```
[Service]
ExecStart=/usr/local/bin/dr-provision --tls-key=/etc/letsencrypt/live/gitlab.kumulus.co/privkey.pem --tls-cert=/etc/letsencrypt/live/gitlab.kumulus.co/cert.pem
```
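
After editing the unit file, systemd has to re-read it before a restart picks up the new flags; the standard sequence:

```bash
systemctl daemon-reload
systemctl restart dr-provision
```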

shane
2018-03-03 01:39
yay !

rstarmer
2018-03-03 01:54
now I'm stuck, the UI just spins on loading plugins

shane
2018-03-03 01:55
shift-reload

rstarmer
2018-03-03 02:41
Helps a lot if I read the error message. Gotta go find my glasses... (e.g. "you must install providers before installing plugins")

rstarmer
2018-03-03 06:29
interesting behavior with the terraform plugin: only the first node gets provisioned, and the system state doesn't seem to get updated completely, leaving the one provisioned node in "power off" state, though it is running in Packet.

rstarmer
2018-03-03 06:33
thoughts on how I might debug this?

stanchan740
2018-03-03 12:15
I have an Ansible role I wrote to set up dr-provision on alpine/debian/centos... just updated it to install 3.7.0, but post-setup steps seem to be failing (set up a new admin user, drop the rocketskates user, set up all the preferences and profiles, and set up the boot environments). Seems to be auth related? Did something change from 3.6.0? The UI seems to work fine.

zehicle
2018-03-03 14:28
can you share the script?

ced.hnyda
2018-03-03 17:26
has joined #community201803

spector
2018-03-03 17:32
hello @ced.hnyda $welcome

2018-03-03 17:32
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

stanchan740
2018-03-03 17:52

stanchan740
2018-03-03 17:53
I flipped it back to v3.6.0 and everything works... seems to be related to the new token auth system that is in place.

greg
2018-03-03 19:14
@stanchan740 - I'll check, but first pass is that the drp options changed

stanchan740
2018-03-03 20:45
Thanks for the tip... I'll try looking at the options, but I believe I took all the defaults from `dr-provision --help`

greg
2018-03-03 20:48
actually, that isn't it.

greg
2018-03-03 20:48
Sigh. I'm working on it now.

greg
2018-03-03 20:48
I hope to have a PR for you shortly.

stanchan740
2018-03-03 20:53
no rush... thanks for the help greg!

stanchan740
2018-03-03 21:03
are there API docs somewhere for dr-provision? if I wanted to create a custom frontend for it?

greg
2018-03-03 21:03
yes, but that is what drpcli is.

greg
2018-03-03 21:04
$docs

greg
2018-03-03 21:04
$faq


greg
2018-03-03 21:04
That should be close. The nav should have API or something like that.

stanchan740
2018-03-03 21:05
swagger?

greg
2018-03-03 21:06
dang it. Something broke

greg
2018-03-03 21:06
something else to look at.

greg
2018-03-03 21:06
you can hit the endpoint with /swagger-ui

greg
2018-03-03 21:06
it will have a graphical UI
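
A sketch, assuming the default API port of 8092 and a placeholder hostname:

```bash
# browse to the interactive Swagger UI greg mentions
xdg-open https://drp.example.local:8092/swagger-ui
```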

stanchan740
2018-03-03 21:06
found it in the docs :slightly_smiling_face:

stanchan740
2018-03-03 21:06
thanks

greg
2018-03-03 21:07

stanchan740
2018-03-03 21:07
yah... I just noticed that v3.7.0 was released not too long ago :slightly_smiling_face:

shane
2018-03-03 21:09
@stanchan the `drpcli` usage is a very very good tutor for the API - the CLI is dynamically generated from the API - so the resources closely follow the API resources ... that coupled with the Swagger-UI should give you the complete picture ...

stanchan740
2018-03-03 21:12
thanks @shane

shane
2018-03-03 21:13
and ... v3.7.2 will be released shortly, fixing a few minor issues

greg
2018-03-03 21:13
already out

shane
2018-03-03 21:13
dang @greg - you's too fast !!

stanchan740
2018-03-03 21:15
how is logging handled? is there existing support for a prometheus endpoint? wanted to do something like opentracing against provisioning jobs and display it in something like jaeger or zipkin... just openly thinking :slightly_smiling_face:

greg
2018-03-03 21:16
umm - well - umm ... that sounds cool. We don't do that. You can grab events from a websocket stream.

greg
2018-03-03 21:16
or we can work with you on a plugin to push data as appropriate.

greg
2018-03-03 21:17
plugin to push to prometheus sounds plausible.

stanchan740
2018-03-03 21:18
:thumbsup: would be interested on working on something like that

greg
2018-03-03 21:18
What would "a prometheus endpoint" need?

greg
2018-03-03 21:18
I haven't looked at any of it. We can off-line it as well.

stanchan740
2018-03-03 21:19
that part is easy to implement... opentracing would be the interesting part

shane
2018-03-03 21:19
Plugin is definitely the best way, IMO ... but we also support Websocket events - so you can register for specific events via that standard method, details: http://provision.readthedocs.io/en/tip/doc/integrations/websocket.html#rs-websocket
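
For the websocket route, authentication uses the same token endpoint the CLI does (it shows up in an error message later in this log); a sketch with the default rocketskates credentials and a placeholder hostname - see the linked docs for the websocket URL and event-registration format:

```bash
# fetch an auth token for the events websocket (default API port 8092)
curl -sk -u rocketskates:r0cketsk8ts \
  https://drp.example.local:8092/api/v3/users/rocketskates/token
```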

stanchan740
2018-03-03 21:21
was 3.7.2 just released? it says 7 hours ago

greg
2018-03-03 21:21
yes

shane
2018-03-03 21:22
here's a websocket listener that can log out to prometheus https://github.com/closeio/socketshark

stanchan740
2018-03-03 21:26
prometheus is more for internal service metrics... distributed tracing is used for end-to-end transactions. All CNCF projects.

stanchan740
2018-03-03 21:32
still returns an error

stanchan740
2018-03-03 21:33
```
2018/03/03 13:31:16 &{403 Forbidden 403 HTTP/1.1 1 1 map[Date:[Sat, 03 Mar 2018 21:31:16 GMT] Content-Length:[0] Content-Type:[text/plain; charset=utf-8]] {} 0 [] false false map[] 0xc4201b6800 0xc4200a6370}
```

stanchan740
2018-03-03 21:33
I'll revert back to v3.6.0 for now for testing

greg
2018-03-03 21:33
Your playbook needs a little tweaking.

stanchan740
2018-03-03 21:34
is this valid?

greg
2018-03-03 21:34
The problem is the password setting section of the playbook

stanchan740
2018-03-03 21:34
`RS_KEY=\"admin:password\"`

greg
2018-03-03 21:34
I'm fixing other things too

greg
2018-03-03 21:34
Hmm - it should be, but I'm not getting the admin password set.

stanchan740
2018-03-03 21:34
or should I switch to tokens

greg
2018-03-03 21:35
You can, but I'm almost there.

greg
2018-03-03 21:35
not sure why
```
drpcli -U {{ provision_admin_user }} -P "{{ provision_admin_password }}" prefs list >/dev/null 2>&1 && exit 0 || drpcli users password {{ provision_admin_user }} "{{ provision_admin_password }}" && exit 99
```

greg
2018-03-03 21:35
is not working correctly.

stanchan740
2018-03-03 21:35
oh... that part isn't fixed yet :slightly_smiling_face:

stanchan740
2018-03-03 21:35
sorry

greg
2018-03-03 21:36
where are you getting the map error thing?

stanchan740
2018-03-03 21:36
just running `drpcli users list`

greg
2018-03-03 21:37
okay - well if admin's password isn't set then it will have problems if you set `RS_KEY=admin:password`

greg
2018-03-03 21:37
don't need the extra quotes

stanchan740
2018-03-03 21:37
ah

stanchan740
2018-03-03 21:37
see the issue

greg
2018-03-03 21:37
`export RS_KEY="admin:password"`

stanchan740
2018-03-03 21:37
it never changed the password :slightly_smiling_face:

greg
2018-03-03 21:37
right!!

greg
2018-03-03 21:55
I feel some more unit tests and a v3.7.3 coming on

greg
2018-03-03 21:56
Can't set user passwords for some reason.

stanchan740
2018-03-03 22:16
yah... I see the same issue

stanchan740
2018-03-03 22:16
it responds back like it did something

greg
2018-03-03 22:17
yeah - in 3.7.0, to fix all the deadlocks in 3.6.0, we introduced a system to prevent that. The password save is a special pass that we didn't undo all the testing for.

stanchan740
2018-03-03 22:41
I'll use the default password for now... I tried using the API to change the password without any luck

greg
2018-03-03 22:41
yeah - the cli and the API use the same backend path.

greg
2018-03-04 03:22
@stanchan740 - I fixed the user bug and cut a v3.7.3 release. I have a pull request against your tree that does quite a few changes and fixes. It seems to work for me.

stanchan740
2018-03-04 03:44
@greg Thanks! Looks good. Will merge. I added a few things for idempotence in my working branch. Decided to call the dr-provision API directly for that since it seems to be much more predictable.

stanchan740
2018-03-04 03:49
Really enjoy using dr-provision... much better and easier to use than cobbler! It just seems to work without much fiddling around. Will be working on an awx integration to do a tensorflow on top of kubernetes demo. Looking forward to digging into the code a bit more, too.

greg
2018-03-04 03:49
cool

zehicle
2018-03-04 04:28
@stanchan740 I looked at API integration w/ Tower (AWX) earlier. It would be a natural plugin to push machine updates into AWX, including when machines are online or not. Something to talk about 1x1

lae
2018-03-04 19:54
@stanchan I'm only skimming through chat but 3.7.0 deployed fine for me with the following:
```
- name: Configure DR Provision API user
  shell: "drpcli users create {{ provision_api_user }}"
  args:
    creates: "/var/lib/dr-provision/digitalrebar/users/{{ provision_api_user }}.json"

- name: Update DR Provision API password
  shell: "drpcli -U {{ provision_api_user }} -P \"{{ provision_api_password }}\" prefs list >/dev/null 2>&1 && exit 0 || drpcli users password {{ provision_api_user }} \"{{ provision_api_password }}\" && exit 114"
  register: provision_password_update
  changed_when: provision_password_update.rc == 114
  failed_when: provision_password_update.rc != 114 and provision_password_update.rc != 0

- name: Update shell environment with DR Provision API credentials
  copy:
    content: "#!/bin/bash\nexport RS_KEY=\"{{ provision_api_user }}:{{ provision_api_password }}\""
    dest: "/etc/profile.d/dr-provision.sh"
    mode: 0755

- name: Remove default rocketskates user if different api_user is set
  shell: "drpcli users destroy rocketskates"
  args:
    removes: "/var/lib/dr-provision/digitalrebar/users/rocketskates.json"
```

lae
2018-03-04 19:54
well 3.7.1

lae
2018-03-04 19:56
(hadn't modified that since 3.2.0 or whatever was out last august)

greg
2018-03-04 20:17
@lae I think that would only work if you already had the admin user created and password set. A fresh install wouldn't work, until 3.7.3

stanchan740
2018-03-04 20:41
yah... I usually create a new instance every time I test a change

rakeshrhcss
2018-03-05 12:17
Hello all, I am trying to install CentOS 7 on one of our servers using the drp community content. It starts the automated install but asks to create a 1MiB biosboot partition: *Your BIOS-based system needs a special partition to boot from a GPT disk label. To continue, please create a 1MiB 'biosboot' type partition.*

rakeshrhcss
2018-03-05 12:18
so it looks like we will have to modify our centos7 KS template to create a biosboot partition:
```
part biosboot --fstype biosboot --size=1 {{if .ParamExists "operating-system-disk"}}--ondisk={{.Param "operating-system-disk"}}{{end}}
```

rakeshrhcss
2018-03-05 12:18
Any suggestions please.

rakeshrhcss
2018-03-05 12:24
And after some research I found when exactly the biosboot partition is required:
- did the system boot in EFI mode or BIOS mode?
  - EFI: use gpt and never make biosboot
  - BIOS: is the disk larger than the max for msdos (2TB)?
    - yes: use gpt and ensure there's a biosboot partition
    - no: use msdos

rakeshrhcss
2018-03-05 12:25
So in our preinstall script (ks) we should be looking for the EFI capabilities and the disk size.
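
A hypothetical `%pre` sketch of that logic (the device path and the 2TB threshold are illustrative):

```bash
DISK=/dev/sda
if [ -d /sys/firmware/efi ]; then
  # EFI boot: gpt, never make biosboot
  : > /tmp/biosboot.part
elif [ "$(blockdev --getsize64 $DISK)" -gt 2199023255552 ]; then
  # BIOS boot on a >2TB disk: gpt plus a biosboot partition
  echo "part biosboot --fstype biosboot --size=1 --ondisk=$DISK" > /tmp/biosboot.part
else
  # BIOS boot on a small disk: msdos label, no biosboot
  : > /tmp/biosboot.part
fi
# the kickstart body can then pull the fragment in with: %include /tmp/biosboot.part
```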

rakeshrhcss
2018-03-05 12:26
Please let me know your suggestions on this. Thanks.

greg
2018-03-05 14:08
@rakeshrhcss - I'm guessing your system has disks that are greater than 2TB in size.

greg
2018-03-05 14:09
My guess is for the time being, you will need to create a custom install bootenv with your own partitioning layout.

greg
2018-03-05 14:10
until someone can get to looking into templatizing the CentOS installer to take custom partition sections.


stanchan740
2018-03-05 17:02
@zehicle There are multiple ways to handle inventory with awx. Ansible 2.5, which should hit rc soon, has a few changes to how inventory can be handled in awx as well. Going to have to experiment with a few implementations. The drp dynamic inventory script works fine as an inventory provider in awx, but pushing the inventory updates to awx might be a better option.

zehicle
2018-03-05 17:10
some progress to show off... Immutable Image Deploys! https://youtu.be/tDcEzirTLbo

romain.lafontaine
2018-03-05 17:55
@zehicle You made my day

stanchan740
2018-03-05 18:53
when an iso is uploaded and it shows up in the isos list, it means it's completely uploaded? is it a blocking state?

romain.lafontaine
2018-03-05 20:06
@zehicle May I ask more info about the packer-to-drp flow ? Like which output format is used on packer side, how it's digested on DRP side ?

zehicle
2018-03-05 20:07
yes! 1) Raw Format 2) you have choices, right now, it's in the ISOs area

greg
2018-03-05 20:11
@stanchan740 - if you use the drpcli command, it returns upon completion.

stanchan740
2018-03-05 20:11
just noticed that it's a blocking call :slightly_smiling_face: thanks!

greg
2018-03-05 20:12
With regard to polling from a different thread: it will show up as `.<filename>.part` in the iso directory while it is uploading, and upon completion it gets renamed to `<filename>`
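
A sketch of that polling approach from the filesystem side (production-mode path):

```bash
# wait until no partial uploads remain in the iso directory
while ls /var/lib/dr-provision/tftpboot/isos/.*.part >/dev/null 2>&1; do
  sleep 5
done
```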

stanchan740
2018-03-05 20:15
for some reason, it downloads it every time... but it should skip if it's already uploaded?

stanchan740
2018-03-05 20:15
```
for _, iso := range isos {
	if iso == env.OS.IsoFile {
		return nil
	}
}
```

greg
2018-03-05 20:16
no. It assumes that you are replacing/updating. We could probably make it smarter, but...

greg
2018-03-05 20:16
Let me check to be sure though

stanchan740
2018-03-05 20:17
yah... I might just add a check for it in the isos directory

greg
2018-03-05 20:17
Actually, that would be a pretty good enhancement.

greg
2018-03-05 20:17
If the bootenv is already available, we don't need to upload the iso for that bootenv.

greg
2018-03-05 20:17
it would then skip it.

greg
2018-03-05 20:18
yeah - I think that makes sense and would make things simpler for the installer.

stanchan740
2018-03-05 20:20
the blobstore is currently just a wrapper around the filesystem... replaceable with S3 or minio in the future I assume?

greg
2018-03-05 20:24
Yes to the filesystem piece. In general, our philosophy is more to let you reference your files where you want.

greg
2018-03-05 20:25
You can build a bootenv that references them, or use the repos params to define the external location of the iso repos. The same goes for kernels and initrds. We owe some more docs and examples on this, but the point is that through the repos params, you can reference separate off-DRP locations for just about everything.

greg
2018-03-05 20:26
You could use NFS today if you wanted to maintain the DRP managed location aspect.

stanchan740
2018-03-05 20:27
cool... that makes sense

clint
2018-03-05 21:58
Really exciting stuff!

rakeshrhcss
2018-03-06 05:00
thanks @greg

rackn.slack
2018-03-06 14:26
has joined #community201803

rackn.slack
2018-03-06 14:32
Hi all, I'm a complete noob with this lot! I'm wondering if anyone can point me in the direction of setting up Rebar to PXE boot Rock64s. I am trying to boot ayufan linux builds... Thanks in advance!

spector
2018-03-06 14:37
hello @rackn.slack $welcome

2018-03-06 14:37
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

rackn.slack
2018-03-06 14:40
I have a Rebar server set up and running. I am building Self Hosted Kubernetes clusters with it. It is awesome!!!! I am going to provide the modified krib template when I have it thoroughly tested. I have a 20xRock64 cluster that I'm now looking at installing it onto!

spector
2018-03-06 14:45
Sounds great - the Engineering folks are driving in to work right now and will be online shortly to provide a response

shane
2018-03-06 14:45
@rackn.slack - welcome - is this what you're referring to? http://wiki.pine64.org/index.php/ROCK64_Main_Page

rackn.slack
2018-03-06 14:46
...yes that's the fellow! I have 20 of them in a box! 4GB memory makes it all possible!

shane
2018-03-06 14:49
Are you planning to run Digital Rebar Provision (DRP) on one of these, too - or are you just planning to install to them ?

rackn.slack
2018-03-06 14:49
...I thought of that but you need 6Gb minimum!?

shane
2018-03-06 14:49
nah - I run DRP on 768 MB VMs regularly ...

shane
2018-03-06 14:51
it's possible the old Digital Rebar ver2 had a 6GB memory footprint requirement, but the current DRP ver3 does NOT

rackn.slack
2018-03-06 14:51
...I think I tried it tentatively but stopped when I got a message saying min 6GB. I *may* have a project that could use that though...

rackn.slack
2018-03-06 14:52
my first issue is installing ayufan's linux on them over PXE. Once I get that cracked I'm in business!

shane
2018-03-06 14:53
we do build DRP (ver3) for ARM64 architecture, but it does not receive heavy testing - in general - we're a single Golang binary (only 30 MB in size), with very very few external dependencies (currently only 7zip, bsdtar, and unzip)

shane
2018-03-06 14:54
we do not have any patterns for you on PXE booting ARM64 architecture, we are certainly available to help here on the #community channel as questions or issues arise ... but right now, we haven't been PXE booting ARM64 hardware

rackn.slack
2018-03-06 14:55
Ok, thanks. I'll keep on looking.

shane
2018-03-06 14:56
we def. should be able to PXE boot ARM64, just not a lot of patterns for you to follow

florent.wagener
2018-03-06 14:58
hi there! Can you guys explain to me why, after some time in the sledgehammer-wait stage, I am losing connectivity to my servers? They seem to lose their IP even though the DHCP lease is still valid. I tried to renew it using the `dhclient ethx` command; they get their IP back but are still impossible to ping... The only solution I have found so far is to reboot the server...

rackn.slack
2018-03-06 14:59
...I have not been able to find any! I've got as far as the DHCP server returning 'a' filename for the Vendor Class, I just don't know what it should be, or where to get it!!!!!

shane
2018-03-06 15:00
post your details here - we have some guys that are really good with hardware - may be able to sort out what needs to be served for ARM64

shane
2018-03-06 15:02
@florent.wagener - are the machines in Sledgehammer when they lose their lease, or in a provisioned OS instance ?

florent.wagener
2018-03-06 15:02
@shane they are on sledgehammer

rackn.slack
2018-03-06 15:03
What I have so far is the rom loaded with ayufan's u-boot loader. This, in theory, allows the Rock64 to PXE boot. I have a MS$ DHCP server configured to send 'a' boot file name when the Vendor Class = 'U-Boot.armv8'.... but that's as far as I have got.

shane
2018-03-06 15:04
there are a couple of things that may be at play:
* increase the Preferences setting for DHCP Lease time
* `dhclient` may not be running persistently inside the sledgehammer image (I'll have to check this)
* we use Tokens to authenticate the Machine to DRP - those Tokens have a timeout as well (but theoretically, that shouldn't affect the DHCP lease)
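
On the first item, a sketch of extending those lease times, assuming they are the `ActiveLeaseTime`/`ReservedLeaseTime` fields (in seconds) on the DRP subnet object and `local_subnet` is a placeholder name:

```bash
drpcli subnets update local_subnet '{ "ActiveLeaseTime": 86400, "ReservedLeaseTime": 172800 }'
```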

florent.wagener
2018-03-06 15:05
I'll try to extend the active lease time/reservation lease time and see if there is a difference :slightly_smiling_face:

shane
2018-03-06 15:06
looking at my sledgehammer, it's not running `dhclient` so it's probably a single `dhclient` lease request, which would explain the issue

shane
2018-03-06 15:06
nothing is trying to renew the lease

shane
2018-03-06 15:08
extending the Lease on the DRP side will work as long as you make it really long - but this isn't necessarily the best solution - looking in to it

florent.wagener
2018-03-06 15:08
thanks !

florent.wagener
2018-03-06 15:10
on a recently rebooted machine (less than 30 min) I can see that dhclient is running:
```
<sledgehammer> [root@E16968917902274 ~]# ps aux | grep dhclient
root      3238  0.0  0.0 113372 13140 ?  Ss  15:04  0:00 dhclient eth0
```

florent.wagener
2018-03-06 15:12
on a machine that wasn't recently rebooted, dhclient isn't running:
```
<sledgehammer> [root@localhost ~]# ps aux | grep dhclient
root      6865  0.0  0.0 112660   972 tty1  R+  16:22  0:00 grep --color=auto dhclient
```

greg
2018-03-06 15:12
Is drpcli still running?

greg
2018-03-06 15:13
dhclient may exit when drpcli exits out of the sledgehammer service

florent.wagener
2018-03-06 15:13
no it isn't

florent.wagener
2018-03-06 15:14
on the other machine, it is:
```
<sledgehammer> [root@E16968917902274 ~]# ps aux | grep drpcli
root      3342  0.1  0.0 731576 17120 ?  Sl  15:04  0:00 drpcli machines processjobs f8466cb9-acba-43a5-b0e8-1bcc5954b86e
```

florent.wagener
2018-03-06 15:14
that's weird because I have no job running on any machine at the time though...

florent.wagener
2018-03-06 15:15
`f8466cb9-acba-43a5-b0e8-1bcc5954b86e` is the UUID of the last job that was executed 10min ago

shane
2018-03-06 15:17
`drpcli` may continue to run, depending on your stage actions when the last stage runs

florent.wagener
2018-03-06 15:18
interesting - when I try to run a drpcli command on the machine without an IP, I get this error:
```
Error creating sessions: CLIENT_ERROR: Get https://127.0.0.1:8092/api/v3/users/rocketskates/token: dial tcp 127.0.0.1:8092: getsockopt: connection refused
```

florent.wagener
2018-03-06 15:18
the IP is wrong here...

shane
2018-03-06 15:18
if the machine has no IP - then it can't reach back out to the DRP endpoint

shane
2018-03-06 15:18
we provide the DRP Endpoint IP addr to the `drpcli` dynamically (if `--static-ip` is not set on the DRP side)

florent.wagener
2018-03-06 15:19
Got it.

florent.wagener
2018-03-06 15:19
makes sense :slightly_smiling_face:

shane
2018-03-06 15:19
So - the lesson is ... Sledgehammer brings up `dhclient` - and if the Runner (`drpcli`) isn't running, then the DHCP Lease times out ... there is also the Token expiry that the Runner (`drpcli`) is operating with, which can also time out

shane
2018-03-06 15:19
these things need to be managed for long living Sledgehammer runs

shane
2018-03-06 15:20
most of the use case patterns with Sledgehammer are quick use - and move into an installed OS ... but in the case of "ready state" infrastructure - we need to harden Sledgehammer to be longer lived with its management of the Runner token and (subsequently) DHCP client lease renewal

florent.wagener
2018-03-06 15:20
Yep :slightly_smiling_face:

shane
2018-03-06 15:21
some of this can be effected with the Preferences settings by simply extending the times to really long lease lengths

vlowther
2018-03-06 15:21
@rackn.slack What type of firmware and network booting situation do those rock64 boards have?

florent.wagener
2018-03-06 15:21
Anyway this is not a big issue for our testing purposes, but it could be in production as we might need to have a "ready state" for our servers.

florent.wagener
2018-03-06 15:22
Im gonna try to extend the lease to see if that changes anything

vlowther
2018-03-06 15:22
@florent.wagener I think it is some legacy code left over from some DHCP shenanigans we had to pull for DRv2.

vlowther
2018-03-06 15:23
If so, unwinding that behaviour should be a simple thing to do.

florent.wagener
2018-03-06 15:25
@vlowther cool! Waiting for the fix then :slightly_smiling_face:

rackn.slack
2018-03-06 15:35
@vlowther ...they have nothing installed by default. You flash them with https://github.com/ayufan-rock64/linux-build/releases/download/0.6.25/u-boot-flash-spi-rock64.img.xz to get network booting going. This all seems to work nicely.... i.e. I see the required BOOTP messages using WireShark.

greg
2018-03-06 15:42
@rackn.slack - if you are using DRP as the dhcp server, it would be really useful for us to see a debug trace of the DHCP packet stream. This can be done by turning the DHCP Debug up to debug and catching the output of DRP.
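
A sketch of doing that, assuming `debugDhcp` is the relevant DRP v3 pref (0=off, 2=most verbose) and a SystemD production install for log capture:

```bash
drpcli prefs set debugDhcp 2
journalctl -u dr-provision -f > dhcp-trace.log   # Ctrl-C when done
```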

wdennis
2018-03-06 16:01
FYI - seeing this on my Ubuntu 16.04 installs - just hangs here on this screen for a while near the end of the install...


shane
2018-03-06 16:03
Ctrl-Alt-F4 to see console messages

shane
2018-03-06 16:03
it's hanging in the preseed somewhere - should be a clue on screen 4

rackn.slack
2018-03-06 16:04
@greg...I'm not using DRP as the DHCP server in this case. I have to use MS$. I am using WireShark however. What filter would you like on that? How do I get the resulting file to you?

greg
2018-03-06 16:04
just dhcp messages. @rackn.slack

greg
2018-03-06 16:04
There is a bigger issue. We don't have an ARM-based sledgehammer yet.

rackn.slack
2018-03-06 16:05
...that will slow things down!

wdennis
2018-03-06 16:05
@shane Interesting - Ctrl/Alt/F4 not working to bring up the terminal...

greg
2018-03-06 16:05
sometimes try ALT-F4

rackn.slack
2018-03-06 16:06
@greg is that something that is currently being worked on in V3?

wdennis
2018-03-06 16:06
Ok @greg that worked


wdennis
2018-03-06 16:09
(Sorry for the screen reflection...) looks like it's hauling down net-post-install.sh & running it

greg
2018-03-06 16:11
It has been a lower priority. We don't have one today. @rackn.slack We don't have hardware to play with it and we haven't had customer interest to drive it.

vlowther
2018-03-06 16:12
@rackn.slack It is sorta driven by customer demand and/or community involvement.

vlowther
2018-03-06 16:12
and we mostly want to target ARM64 stuff that is running a UEFI firmware

vlowther
2018-03-06 16:13
rather than other, less well documented network boot protocols.

vlowther
2018-03-06 17:35
We are working on a new job runner that may affect existing workloads, so I am looking for feedback on the design and implementation before we replace the current implementation with it.

vlowther
2018-03-06 17:37
https://github.com/digitalrebar/provision/blob/add-auto-reconnection-and-make-an-FSM-machine-agent/api/agent.go <-- the code for the new job runner. @wdennis and other interested parties should take a look at the comments in that code and give me feedback as to how it will affect your current workflows.

wdennis
2018-03-06 17:41
@vlowther What are the motivations for the change? (simplification?)

vlowther
2018-03-06 17:41
Easier maintenance and simplification.

greg
2018-03-06 17:42
To not have to remember the dang STOP all the time.

greg
2018-03-06 17:42
:slightly_smiling_face:

vlowther
2018-03-06 17:43
This runner will do what I consider to be the Right Thing: waiting by default when there is nothing to do, and rebooting whenever the bootenvs change (unless it should exit instead to let an OS install finish)

wdennis
2018-03-06 17:44
I do not believe it will affect anything I'm currently doing; I only have a simple workflow (not default, which in my implementation is empty, but tied to a custom profile) and the KRIB one

wdennis
2018-03-06 17:47
This is my custom profile workflow:
```
"change-stage/map": {
  "discover": "prep-install:Success",
  "prep-install": "ubuntu-16.04-install:Reboot",
  "ubuntu-16.04-install": "complete-nowait:Success"
}
```

wdennis
2018-03-06 17:49
And my KRIB one:
```
"change-stage/map": {
  "docker-install": "krib-install:Success",
  "finish-install": "docker-install:Success",
  "krib-install": "complete:Success",
  "runner-service": "finish-install:Stop",
  "ssh-access": "runner-service:Success",
  "ubuntu-16.04-install": "ssh-access:Success"
}
```

greg
2018-03-06 17:52
Thank you, @wdennis - you answered a question for me.

vlowther
2018-03-06 19:04
@florent.wagener https://github.com/digitalrebar/provision-content/pull/68 <-- should be in tip soonish.

florent.wagener
2018-03-06 19:04
@vlowther w00t !

wdennis
2018-03-06 20:58
So @greg is the "curtin" thing something you guys invented?

greg
2018-03-06 21:06
No. It is a tool in Ubuntu/MaaS. MaaS wraps a lot of crap around it.

greg
2018-03-06 21:06
We drive it for now. It was quick and expedient. I want to eventually replace it.

greg
2018-03-06 21:07
I've been looking at ignition from CoreOS (now Red Hat), but it has issues too.

greg
2018-03-06 21:07
I've tinkered with changing ignition and have some stuff lying around to switch to it, but it has more work to do.

greg
2018-03-06 21:08
I like ignition better because it is Go-based and easier to work with, but it doesn't have the pieces to handle setting up bootloaders like curtin does.

greg
2018-03-06 21:08
Anyway, it is the way it is for right now.

greg
2018-03-06 21:08
@wdennis

wdennis
2018-03-06 21:09
The idea is very cool, congrats on shipping something :+1::skin-tone-2:

wdennis
2018-03-06 21:12
In other news, all my Ubuntu installs (which are the only kind I do ;) are hanging on the net-post-install.sh phase...


greg
2018-03-06 21:12
yeah - I saw. I tried it and it passed for me.

wdennis
2018-03-06 21:13
This wasn't happening to me pre-3.7

greg
2018-03-06 21:13
hmm

greg
2018-03-06 21:13
can you `alt-f2`

greg
2018-03-06 21:13
ps -ef | grep drpcli

wdennis
2018-03-06 21:13
"Works on my machine" ;)

greg
2018-03-06 21:14
Trying to think about changes that could change your path.

greg
2018-03-06 21:14
From this workflow:
```
"change-stage/map": {
  "discover": "prep-install:Success",
  "prep-install": "ubuntu-16.04-install:Reboot",
  "ubuntu-16.04-install": "complete-nowait:Success"
}
```


greg
2018-03-06 21:15
I prefer to do `complete-nowait:Stop`.

greg
2018-03-06 21:15
We thought what you have should have worked, but we might have been sneaky and changed something.

wdennis
2018-03-06 21:16
I'll change it as you prefer, and see what I get

greg
2018-03-06 21:20
@wdennis - what version are you running?

greg
2018-03-06 21:27
Did your machine make it to complete-nowait in the UX while it was hung?

greg
2018-03-06 21:29
I think I see the change that might have changed the behavior. STOP should fix it.

greg
2018-03-06 21:30
@vlowther's new stuff will also fix the problem.

greg
2018-03-06 21:31
The new stuff will fix it in two ways.

wdennis
2018-03-06 21:40
@greg Yes, it shows as `complete-nowait` in the UX

wdennis
2018-03-06 21:41
I just reboot the nodes, and they're fine

wdennis
2018-03-06 21:42
Also, running v3.7.1 right now - I'll upgrade soon(tm)

killsudo
2018-03-07 04:28
has joined #community201803

killsudo
2018-03-07 04:31
When does the /var/lib/dr-provision/tftpboot/machines/$UUID folder get created and the kickstart file stuck in it?

shane
2018-03-07 04:31
@killsudo $welcome

2018-03-07 04:31
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

shane
2018-03-07 04:32
Kickstarts and all templated files are never rendered to disk - they're rendered on the fly by DRP

shane
2018-03-07 04:32
and served from the in-memory virtual filesystem

killsudo
2018-03-07 04:32
I've got my DR-Provision setup working with some test VMs. I can get an IP via DHCP from rebar and see it load ipxe then sledgehammer

killsudo
2018-03-07 04:33
but after the machine is registered and I change its bootenv to centos7, set the ks template to centos-7.ks.tmpl, and reboot the test VM, centos7 never installs


shane
2018-03-07 04:34
are you familiar with the `drpcli` command ?

killsudo
2018-03-07 04:34
interesting.. this is very much like VMware's AutoDeploy for vCenter for automating ESXi

killsudo
2018-03-07 04:35
yea, I've been playing with drpcli and getting comfortable with it

shane
2018-03-07 04:35
(except MUCH MUCH better ....) :slightly_smiling_face:

killsudo
2018-03-07 04:35
My end goal is to wrap up the drp API for StackStorm

shane
2018-03-07 04:35
can you please provide `drpcli prefs list` output

killsudo
2018-03-07 04:36

shane
2018-03-07 04:36
if you have the defaultStage setting set - then you need to manipulate what machines get installed with via Stages (workflow), and not BootEnvs

killsudo
2018-03-07 04:37
hmm yea, I clicked that wizard button and was just trying to find better docs on how to use that workflow tool

shane
2018-03-07 04:37
since you have defaultStage set, you need to set a Stage on the machine, which contains a BootEnv definition - do not manipulate the Machine via BootEnv settings

shane
2018-03-07 04:37
we don't have the workflow fully documented yet

killsudo
2018-03-07 04:38
ok so I'm not crazy, my research ended up at http://provision.readthedocs.io/en/stable/doc/workflows.html

shane
2018-03-07 04:38
for Docs - please switch to `tip` version - they are MUCH MUCH more updated

killsudo
2018-03-07 04:38
which, while nice, doesn't help that much if you are already very familiar with dhcp/pxe/autodeploy etc

shane
2018-03-07 04:39
(use the lower right floating version selector)

killsudo
2018-03-07 04:40
so I guess workflows are optional then?

killsudo
2018-03-07 04:40
my use case might be slightly different than a normal user's

shane
2018-03-07 04:41
"technically" workflow (stages) can be optional, but we have found everyone adopts them once they understand them

greg
2018-03-07 04:42
`defaultBootEnv` should be sledgehammer to be consistent.

killsudo
2018-03-07 04:42
I don't want/need a TheForeman/Cobbler etc experience with lifecycle management. When I start a provisioning run I have already acquired all of my variables from different inventories etc, so I mostly need a nice API-driven way to define a machine and, when it boots, just pull down the OS with a firstboot script etc. Once I get into the OS to pre-stage some stuff, I have to back out of the OS cleanly and hand off the machine to a third party

killsudo
2018-03-07 04:43
so no leaving keys etc and zero need to be able to come back to it in the future

killsudo
2018-03-07 04:43
fire and forget style setups

killsudo
2018-03-07 04:44
ideally, if I can tie a 'machine' to a dhcp option 82 string and then assign a bootenv to get the OS on disk, the better

killsudo
2018-03-07 04:44
then I don't even need to futz with subnets or MACs

shane
2018-03-07 04:46
I think you'll still find that Stages and workflow are useful for you. Particularly if your "first boot script" needs to vary between machines.

zehicle
2018-03-07 04:46
considering your request, the stages/runner would do exactly what you are asking for post provision. No keys, no ssh, no return access

shane
2018-03-07 04:46
You can build the workflow to do common boot things, select the OS, then do the post-provisioning twiddly bits. Our install method is good because our agent dissolves by default after provisioning - and you don't need to leave artifacts behind

killsudo
2018-03-07 04:47
I already have my OS's 100% handled in Ansible beautifully including backing myself out gracefully. So mostly just need winrm with https on windows and ssh on linux/esxi so stackstorm can launch my ansible/powershell plays

killsudo
2018-03-07 04:47
so this is sounding good

killsudo
2018-03-07 04:50
so Default BootEnv in global properties is now sledgehammer, thanks @greg

killsudo
2018-03-07 04:50
ok so cleared out my workflows

killsudo
2018-03-07 04:50
so should I still use the wizard or start by building a simple single line flow

zehicle
2018-03-07 04:56
the wizard builds a basic one

greg
2018-03-07 04:56
@killsudo - let's be clear on what you are trying to do.

greg
2018-03-07 04:56
You already have inventoried, IPMI-managed, RAID-configured, BIOS-adjusted systems that you control from another system or systems.

killsudo
2018-03-07 04:56
+1 - http://provision.readthedocs.io/en/tip/doc/arch/provision.html tip has much better write up and starting to make some sense

killsudo
2018-03-07 04:57
@greg yup, let's assume that

greg
2018-03-07 04:58
You want to PXE boot these machines (triggered externally), and the result of some intervening process is a booted OS (of some type) with an ssh key (or winrm enablement).

greg
2018-03-07 04:58
I assume you also probably want some notification that this is done.

killsudo
2018-03-07 04:58
notification would be a nice extra

killsudo
2018-03-07 04:59
I can listen for or poll events (stackstorm uses OpenStack Mistral under the hood for workflows)

greg
2018-03-07 05:00
How do you expect to inform the black box of the OS selection and ssh key to use?

greg
2018-03-07 05:00
I assume the ssh key will be consistent and constant to your secondary provisioner.

killsudo
2018-03-07 05:01
depends on the logic once I start manipulating it

killsudo
2018-03-07 05:01
but yea the key could be static to start with and just a simple edit to the kickstart

killsudo
2018-03-07 05:02
not hard for me to have the OS call my stackstorm hook to notify it that it's alive and send along some info

killsudo
2018-03-07 05:02
I plan to put the mistral workflow into pause state while I wait around for the boot/dhcp/pxe/os load part

killsudo
2018-03-07 05:03
then I can resume when I see the incoming call

greg
2018-03-07 05:03
okay - now we are getting somewhere. A `task` that calls home when ready.

greg
2018-03-07 05:03
How do you want to install the OS?

killsudo
2018-03-07 05:03
you mean from where or do I want to use disk images?

greg
2018-03-07 05:04
well - kickstarts/preseeds, raw images, rootfs images?

greg
2018-03-07 05:04
immutable in-memory CoreOS/Rancher?

killsudo
2018-03-07 05:05
probably more than one; I have access to the MAC addresses and know the OS before talking to drp

greg
2018-03-07 05:05
ips?

killsudo
2018-03-07 05:05
yup I can also know that

killsudo
2018-03-07 05:05
or I can just use my option-82 string that my dhcp relays are injecting

killsudo
2018-03-07 05:05
and map that to an OS

greg
2018-03-07 05:06
yep - both are options. Requiring different paths.

killsudo
2018-03-07 05:06
yea, that's where I am stuck: trying to understand all of my options

greg
2018-03-07 05:07
Are you using DRP as a DHCP server?

killsudo
2018-03-07 05:08
I do have a working dhcp/pxe system now, it's just old school centos/isc-dhcp/pxe with some bash and ansible automating it

killsudo
2018-03-07 05:08
I can if that's best

killsudo
2018-03-07 05:08
but no issue using relays and next-boot / next-server options in regular dhcp

greg
2018-03-07 05:08
Do the machines have sane IPXE clients?

killsudo
2018-03-07 05:09
does that exist?

greg
2018-03-07 05:09
:slightly_smiling_face: fair enough.

greg
2018-03-07 05:09
http bzimage support

killsudo
2018-03-07 05:09
yup, all of my current systems chainload ipxe

killsudo
2018-03-07 05:10
or they can, if that's what you're getting at

killsudo
2018-03-07 05:10
it's all enterprise kit

greg
2018-03-07 05:10
yeah.

killsudo
2018-03-07 05:10
more concerned about uefi stuff (argh)

greg
2018-03-07 05:10
well there is uefi ipxe bootloader

killsudo
2018-03-07 05:10
if it works

killsudo
2018-03-07 05:11
I know Hyper-V Gen2 VMs will be a pain with uefi being enforced

greg
2018-03-07 05:11
VMs' uefi implementations suck

killsudo
2018-03-07 05:11
I think my first test run with drp, those failed to even reach ipxe

killsudo
2018-03-07 05:11
my other test machines seemed happy with the defaults

greg
2018-03-07 05:11
They all seem to assume a disk

greg
2018-03-07 05:13
To reduce your initial learning space, you may want to use your existing DHCP services to chain load into sledgehammer/discovery.

killsudo
2018-03-07 05:14
well I have my test lab already successfully working with DRP dhcp service

greg
2018-03-07 05:15
Okay - well - I was more thinking for your IP management system that you already have, but either way.

greg
2018-03-07 05:15
I'd let the machines get discovered. Default to the `discover` stage.

greg
2018-03-07 05:15
Transition that to a new stage with a new task that acts as an in-line classifier.

killsudo
2018-03-07 05:18
what is an 'in-line classifier'? The template system?

greg
2018-03-07 05:19
parse your option 82 and convert it to an install stage (centos-7-install or whatever). Transition that install stage to a stage that installs your post-reboot hook and reboots the machine.

greg
2018-03-07 05:20
`in-line classifier` == stage with a task represented by a template that would parse the option 82 out of the dhcp lease file and convert it to an install stage.
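
A hypothetical sketch of such a classifier task body (the option-82 value is assumed to have been extracted into a shell variable already; the stage names and mapping are placeholders, and the machine UUID would normally be wired in via template expansion):

```bash
# map the relay circuit ID to the desired install stage
case "$OPTION82" in
  rack1-*) STAGE="centos-7-install" ;;
  rack2-*) STAGE="ubuntu-16.04-install" ;;
  *)       STAGE="sledgehammer-wait" ;;
esac
# transition this machine to that stage
drpcli machines update "$MACHINE_UUID" "{ \"Stage\": \"$STAGE\" }"
```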

killsudo
2018-03-07 05:20
let's start simpler so I can build up to that. I need to get my head wrapped around the basic 1-2-3 operations. Otherwise my questions are not gonna be as helpful or concise

killsudo
2018-03-07 05:21
dhcp > pxe > discovery > centos7

killsudo
2018-03-07 05:22
what's the workflow way to just discover every machine that network boots against drp and install centos7 automatically with the default ks

greg
2018-03-07 05:22
set the default stage to centos-7-install

killsudo
2018-03-07 05:23
what about if I want to hit 'discover' first then centos7?

greg
2018-03-07 05:23
that is where you will need a workflow

killsudo
2018-03-07 05:23
ok so on that page and loaded up the default global workflow with the wizard

greg
2018-03-07 05:23
Read about workflows and `ssh-access`.

greg
2018-03-07 05:25
You aren't using virtualbox

greg
2018-03-07 05:26
Will want this:

greg
2018-03-07 05:26
discover->centos-7-install:Reboot

greg
2018-03-07 05:27
centos-7-install->finish-install:Stop
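
Expressed as the `change-stage/map` param shown earlier in this log, that is:

```
"change-stage/map": {
  "discover": "centos-7-install:Reboot",
  "centos-7-install": "finish-install:Stop"
}
```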

killsudo
2018-03-07 05:29
ok so seeing that

killsudo
2018-03-07 05:29
would 'discover->sledgehammer:reboot|centos7-install:reboot->finish-install:Stop' also be acceptable?

greg
2018-03-07 05:30
we don't have a sledgehammer stage.

killsudo
2018-03-07 05:30
*sledgehammer-wait*

greg
2018-03-07 05:31
sledgehammer-wait is just a holding stage. It is meant as a near-line waiting system.

killsudo
2018-03-07 05:32
ok, so that's more of a live environment I could break into and do actions on the box before proceeding with the OS install

killsudo
2018-03-07 05:32
like updating firmware or something if I was to write that task up

greg
2018-03-07 05:32
or heaven forbid automate them. :slightly_smiling_face:

shane
2018-03-07 05:32
as Stages !

greg
2018-03-07 05:33
RackN has stages that do that and more.

greg
2018-03-07 05:34
Sledgehammer is pretty rich so you could use stackstorm to trigger things there if you wanted.

killsudo
2018-03-07 05:34
yea your previous statement cleared it up, I was confused as to what the actual difference between discovery and sledgehammer was

greg
2018-03-07 05:35
Discover will put ssh keys in place if you define enough parameters.

killsudo
2018-03-07 05:35
I was thinking sledgehammer was like the autodeploy boot env that loads and sends the ipxe details back to vcenter and must be booted the first time through on an unknown machine

greg
2018-03-07 05:36
it can be used that way, but not required.

greg
2018-03-07 05:36
You could directly create machines and just jump to centos install

greg
2018-03-07 05:36
That was another path

killsudo
2018-03-07 05:37
well, if you guys like the workflows, I'll give them a chance

greg
2018-03-07 05:39
most people want a lifecycle around their machines. Workflows really help with that.

killsudo
2018-03-07 05:44
hmm still no go, still failed to fetch ks

killsudo
2018-03-07 05:45
and $url:8091/machines/8bc9b109-0034-42d6-847b-91d716d65333/seed just 404's

shane
2018-03-07 05:45
Kickstart is rendered as `compute.ks` (not seed)

shane
2018-03-07 05:46
for centos-7-install ... for an ubuntu-16.04-install you'll get `seed`
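
So, continuing the 404 example above, the working URL for a centos-7-install machine would be (with `$DRP_IP` standing in for the endpoint address):

```bash
curl http://$DRP_IP:8091/machines/8bc9b109-0034-42d6-847b-91d716d65333/compute.ks
```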

killsudo
2018-03-07 05:46
there we go

killsudo
2018-03-07 05:46
so it did generate the ks, and this time my vm went thru discovery, then auto-rebooted and pxe booted into centos7

killsudo
2018-03-07 05:46
so progress

killsudo
2018-03-07 05:47
the machine details also show it inheriting the centos7 bootenv from the workflow

killsudo
2018-03-07 05:48
weird.. 'curl: (23) Failed writing body (8520 != 16384)'

greg
2018-03-07 05:48
Depending on file and timing, it can change.

shane
2018-03-07 05:49
the kickstart/preseed is only available to be rendered during a provisioning activity

shane
2018-03-07 05:49
I actually create a "phantom" machine I can place in to various BootEnvs to render templates against to test template rendering

killsudo
2018-03-07 05:51
that curl was from dracut inside the testvm when it tried to locate ks

killsudo
2018-03-07 05:51
apparently 1GB of RAM isn't enough with the fs

shane
2018-03-07 05:51
nope

greg
2018-03-07 05:51
Correct

shane
2018-03-07 05:51
not for CentOS 7 - you need 1.5 GB

shane
2018-03-07 05:51
(that's what I use on my test VMs)

greg
2018-03-07 05:51
sledgehammer is rich

shane
2018-03-07 05:51
Ubuntu is ok w/ 1 GB

killsudo
2018-03-07 05:51
yup, you guys are on it - set 4GB and now centos7 is running KS

greg
2018-03-07 05:51
I usually use 2GB. It is designed for real servers.

killsudo
2018-03-07 05:52
yea not an issue, I can change those values on vm's before and after the install

killsudo
2018-03-07 05:54
I assume windows and esxi are supported well enough if you're willing to put in the grunt work?

shane
2018-03-07 05:54
both are RackN commercial components ...

killsudo
2018-03-07 05:54
looked like the explode script knew how to handle those isos

killsudo
2018-03-07 05:54
what does that mean? it'll never work without some addon?

shane
2018-03-07 05:54
we support Windows via Image based deployment (eg Immutable Infrastructure)

shane
2018-03-07 05:55
there are no community based Open Source pieces to do ESXi or Windows

killsudo
2018-03-07 05:56
but can someone implement the details themselves if they want to?

killsudo
2018-03-07 05:56
if all you want is to get pxe to launch the winpe stuff and image the disk?

shane
2018-03-07 05:57
it should be possible with the right effort

killsudo
2018-03-07 05:57
I'll look into the RackN stuff. I assume questions regarding esxi/windows aren't really for this #community channel then?

killsudo
2018-03-07 05:58
is it only windows that has a commercial piece?

shane
2018-03-07 06:01
Both Windows and ESXi - both have required a lot of engineering work to get working repeatably and at scale.

killsudo
2018-03-07 06:03
yea I bet, both gave me a headache on my current pxe system

killsudo
2018-03-07 06:08
heyo, first box is online. not too bad now that I'm grasping the steps

killsudo
2018-03-07 07:30
odd one with this Hyper-V Gen2 VM.. I can see that DRP serves it up the proper efi boot file via option 67, and then ipxe loads, but after ipxe tries to fetch '' it bombs out

killsudo
2018-03-07 07:31
```
dr-provision2018/03/07 07:24:38.728050 [3271:665]static [error]: /home/travis/gopath/src/github.com/digitalrebar/provision/midlayer/tftp.go:82
```

killsudo
2018-03-07 07:31
```
dr-provision[3877]: [3271:665]TFTP: ipxe.efi: transfer error: sending block 0: code=8, error: User aborted the transfer
```

greg
2018-03-07 12:50
@killsudo - check in the tftpboot directory for that file. It should be there.

vlowther
2018-03-07 14:59
@killsudo What was the error on the ipxe side?

vlowther
2018-03-07 15:00
Our pxe serving logic for UEFI has been built around "whatever recent-ish tianocore expects", and I have no idea if hyperv uses a tianocore based UEFI stack or if it rolls its own.

vlowther
2018-03-07 15:12
and that "User aborted the transfer" error you are seeing is usually just the firmware or ipxe getting the size if the thing it wants to load next.

shane
2018-03-07 16:16
- quick poll ... would you like to see an Immutable Provisioning demo at the next meetup? (Immutable = image-based deployments). Please vote on the meetup page: https://www.meetup.com/digitalrebar/polls/1263248/

florent.wagener
2018-03-07 18:47
following up on the DHCP issue I mentioned yesterday (even though I know that you have fixed it in the last tip version). Extending the lease seems to do the trick !

shane
2018-03-07 18:48
Yep - it's a band-aid though ... and not necessarily the right solution for some environments that want short renewal cycles on their leases ... but, glad that works for the moment :slightly_smiling_face:

shane
2018-03-07 18:48
we'll have this fix in a release out soon ... or you can update to `tip` to get it

florent.wagener
2018-03-07 19:37
I'm working on something else right now but I will definitely test it :slightly_smiling_face:

killsudo
2018-03-07 22:18
@vlowther I think it might be related to this - http://git.ipxe.org/ipxe.git/commitdiff/9366578

vlowther
2018-03-07 22:21
Possible. We pull in the latest ipxe.efi whenever we cut a build, so given the patch date that fix should be in there.

vlowther
2018-03-07 22:21
Doesn't mean that patch hasn't been reverted or broken by something else in the meantime, though. :confused:

killsudo
2018-03-07 22:23
yea looks like hyper-v is gonna have to remain a virt disk copy as that comment doesn't sound reassuring until I can revisit it

killsudo
2018-03-07 22:25
I usually try and start stuff with hyper-v as an edge case cause if it works in hyper-v I can pretty much make it work on anything else

shane
2018-03-07 22:25
That's kind of like saying "I hit my self in the face with a shovel before digging a hole, because I like the feeling it gives me ... "

vlowther
2018-03-07 22:26
yeah, I go with something a little more mainstream, like qemu+kvm

vlowther
2018-03-07 22:26
:slightly_smiling_face:

killsudo
2018-03-07 22:31
heh it's what you live with when hyper-v is involved. Now I get to finally move on to testing proxmox and vmware. Both of their networking layers are virtualized. proxmox is openvswitch with vlans dumping into a evpn/vxlan fabric. vsphere is the same with some nsx involved and dumping into a evpn/vxlan fabric. All subnet gateways are anycasted gateways (using evpn) on juniper mx routers

killsudo
2018-03-07 22:32
This topology works awesome right now with the pxe booting of physical servers but the virt side throws a few extra hops in the way that *should work*

greg
2018-03-07 22:35
@killsudo - if you already have DHCP relays working in those environments, then it should work pretty cleanly.

killsudo
2018-03-07 22:35
yea it's more making sure ovs or nsx doesn't eat the broadcast

killsudo
2018-03-08 07:28

killsudo
2018-03-08 07:28
Looks like this *should* be doable on Gen2 hyper-v

greg
2018-03-08 14:14
@killsudo - if you want to try different `ipxe.efi`, you can just place them in the tftpboot directory. If you want them to persist over drp restarts, you need to place them in the `replace` directory as well.

greg
2018-03-08 14:15
If you are in `isolated` mode, these are in the directory you started drp from.

greg
2018-03-08 14:15
If you are in `production` mode, these are in `/var/lib/dr-provision`.
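
A minimal sketch of both steps, assuming a production-mode install under `/var/lib/dr-provision`:
```
# drop a candidate ipxe.efi into the served tree (takes effect immediately)
cp ./ipxe.efi /var/lib/dr-provision/tftpboot/ipxe.efi
# mirror it into the replace directory so it survives dr-provision restarts
mkdir -p /var/lib/dr-provision/replace
cp ./ipxe.efi /var/lib/dr-provision/replace/ipxe.efi
```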

greg
2018-03-08 14:15
If you find one that works, let us know where you got it from. :slightly_smiling_face:

dave.parker
2018-03-08 22:07
Hey all. I'm trying to create a bootable iPXE image that I can use to manually point a machine at a dr-provision host on another subnet. I'm doing this because I can't run a DHCP server on the subnet the host to be discovered lives on.

dave.parker
2018-03-08 22:07
Oh and for extra complications, it's also a UEFI boot only system.

dave.parker
2018-03-08 22:08
I have the image booting and am trying to chain boot bootx64.efi. But I get the error
```
no config file found in
forcing interactive mode due to config file error(s)
ELILO boot: .............
```

dave.parker
2018-03-08 22:08
Then just an endless string of dots.

dave.parker
2018-03-08 22:08
It never seems to get any farther.

dave.parker
2018-03-08 22:09
So it downloads bootx64.efi from the dr-prov server, but can't go any farther apparently.

greg
2018-03-08 22:14
@dave.parker - There is ProxyDHCP for this case if needed. It can run in parallel and only hand out boot options.

dave.parker
2018-03-08 22:14

greg
2018-03-08 22:15
@dave.parker - Second, I think we've dropped support for bootx64.efi because it was buggy. And switched to `ipxe.efi` because it worked better. @vlowther will have better info.

dave.parker
2018-03-08 22:15
Ok.

dave.parker
2018-03-08 22:16
I can't install anything in this subnet really. So I don't know that I can do the ProxyDHCP either.

greg
2018-03-08 22:16
okay - so - how do you control what it is going to boot the first time?

dave.parker
2018-03-08 22:17
With a bootable iso, with an iPXE executable that runs that script I posted.

greg
2018-03-08 22:18
okay - so you could try using ipxe.efi with a pxe script that points to `http://<ip>:8091/default.ipxe`

greg
2018-03-08 22:18
where `<ip>` is your DRP endpoint.

vlowther
2018-03-08 22:18
yeah, that. :slightly_smiling_face:

greg
2018-03-08 22:18
and `8091` is your HTTP port on the DRP endpoint.
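
A minimal sketch of such a chain script; the IP and port are placeholders for your DRP endpoint, and the `make` line shows one (assumed) way to embed it when building iPXE media from source:
```
# write a tiny iPXE script that DHCPs, then chains to DRP's rendered default.ipxe
cat > chain-drp.ipxe <<'EOF'
#!ipxe
dhcp
chain http://10.0.0.10:8091/default.ipxe
EOF
# e.g. embed it while building iPXE media from the ipxe source tree:
make bin-x86_64-efi/ipxe.efi EMBED=chain-drp.ipxe
```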

dave.parker
2018-03-08 22:19
Got it

greg
2018-03-08 22:19
The `default.ipxe` file is rendered with enough to handle discovery and directed boot operations.

greg
2018-03-08 22:20
It is provided by the `discovery` bootenv. It looks like this:

greg
2018-03-08 22:21

greg
2018-03-08 22:21
The loopback addresses are just because it is how I asked for it.

greg
2018-03-08 22:23
It basically looks for ipxe files on the DRP endpoint for the machine specifically (`<ip>.ipxe` and `<mac>.ipxe`). Bootenvs are supposed to provide one of those files for IPXE boots.

greg
2018-03-08 22:23
If they don't exist, then it boots sledgehammer and starts discovery.

dave.parker
2018-03-08 22:23
Ok, cool.

dave.parker
2018-03-08 22:24
I'll give that a try.

dave.parker
2018-03-08 22:31
Hey, that kind of worked.

dave.parker
2018-03-08 22:32
It grabbed sledgehammer and booted into it. But it then set the IP address to 0.0.0.0 and is just looping `Sending discover...` over and over

greg
2018-03-08 22:34
sledgehammer expects DNS domain in the DHCP options, I think.

greg
2018-03-08 22:35
You can log into sledgehammer from the console with root/rebar1.

greg
2018-03-08 22:35
To see what DHCP client is getting upset about.
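
Some illustrative places to look from the Sledgehammer console once logged in as root/rebar1 (generic commands, not DRP-specific):
```
ip addr show               # did any interface actually get a lease?
journalctl | grep -i dhcp  # what the DHCP client logged while retrying
```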

greg
2018-03-08 22:35
That may be where you are.

dave.parker
2018-03-08 22:43
I can't get in because it just keeps trying to send discover. I never get a login prompt

dave.parker
2018-03-08 23:07
So the sledgehammer image is always going to try to configure the network with DHCP? So if I can't have a DHCP server on the network (or there's one there I don't control) will sledgehammer not work?

greg
2018-03-08 23:12
It needs DHCP, and what you have should work. There are a couple of things going on.

greg
2018-03-08 23:13
We made a change to let dhcp and drpcli run separately in sledgehammer. This is good, but may be preventing drpcli from running. It is a new change. Probably a bug.

greg
2018-03-08 23:13
The second is that our dhcp server requires one option (I think): DNS Domain.

greg
2018-03-08 23:14
We can look at removing that requirement.

greg
2018-03-08 23:18
We are just a little busy with some other things at the moment.

dave.parker
2018-03-08 23:23
I don't have any subnets configured on my dr-provision server currently because I'm testing these remote installs. Do I need one configured anyway?

greg
2018-03-08 23:24
No. It should be fine.

greg
2018-03-08 23:25
There are bugs for your use case. It would be nice to know what options your dhcp server is sending

dave.parker
2018-03-08 23:40
I don't even see anything on the server end. Nothing in the logs at all.

dave.parker
2018-03-08 23:41
Oh well I'm going to come back to this tomorrow.

zerocarbthirty
2018-03-09 14:27
has joined #community201803

spector
2018-03-09 14:29
Welcome @zerocarbthirty $

spector
2018-03-09 14:30
sorry, $Welcome

2018-03-09 14:30
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

zerocarbthirty
2018-03-09 14:35
Are there shims or example entrypoints for DRP that show things like: "This is where you put your script that SSH's into a switch and configures ip dhcp-helper on an interface so that a machine will pxeboot when it reboots" or "this is the URL where DRP can retrieve information about the network interfaces for a machine" or are defining all of the tasks left to the user?

spector
2018-03-09 14:45
Thanks for the question, I am on our short morning meeting with engineering, they will be checking this channel shortly.

vlowther
2018-03-09 15:30
@zerocarbthirty We provide several common tasks as part of the community content

vlowther
2018-03-09 15:32
For example, getting the information about network interfaces for a machine is done as part of initial system discovery, which runs the gohai task.
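
A hedged example of pulling that inventory back out afterwards; the `gohai-inventory` param name is an assumption based on the community content, and the UUID is a placeholder:
```
# inspect the inventory a machine reported during discovery
drpcli machines get <machine-uuid> param gohai-inventory
```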

vlowther
2018-03-09 15:33
We don't have one that configures the DHCP helper on a switch -- our usual assumption is that everything is using DHCP all the time anyways, and that the networking guys don't want us touching their switches.

shane
2018-03-09 15:38
@zerocarbthirty - If you have the admin access to the switches in question - and if you have existing tooling to be able to implement those changes, you can build a Stage that would make switch port changes as part of the Workflow aspects of Digital Rebar Provision

vlowther
2018-03-09 15:39
More generally, we expect that each environment will need some customized content (tasks, stages, etc) and workflows to accomplish your deployment goals.

shane
2018-03-09 15:47
however, as @vlowther states, we don't have precanned components in the community tooling that do that - there is a bit of a chicken-and-egg issue: if you are not DHCPing your host against DRP, you'd need to build Reservations for the machine in advance, and add a machine in advance of it booting against us - and your first Stage would be to configure the Switch port to dhcp relay - then power on

shane
2018-03-09 15:48
you either have to have DHCP and our Discovery mechanisms to add machines in to DRP - or you have to build the information in advance to manage the Machine - then make the switch port changes and boot the machine for provisioning
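
A hedged sketch of the build-in-advance path; all names, addresses, and MACs below are placeholders:
```
# pre-register a machine and a MAC-tied reservation ahead of its first boot
drpcli machines create '{ "Name": "rack1-node01" }'
drpcli reservations create '{ "Addr": "10.10.24.201", "Token": "52:54:00:12:34:56", "Strategy": "MAC" }'
```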

mark.yeun
2018-03-09 17:36
has joined #community201803

shane
2018-03-09 17:36
@mark.yeun $welcome

2018-03-09 17:36
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

mark.yeun
2018-03-09 17:37
howdy, very cool stuff you're doing :slightly_smiling_face:

shane
2018-03-09 17:38
thanks !

mark.yeun
2018-03-09 17:38
i hope you don't mind me just popping in to ask questions...

vlowther
2018-03-09 17:40
Well, since you ask nicely. :slightly_smiling_face:

mark.yeun
2018-03-09 17:42
:slightly_smiling_face: I've got dr-provision working nicely in a libvirt environment. I'm trying to get it working on metal. I have the server set up, dhcp relay is working. I have a serial console to my baremetal box. When I pxeboot, I see on tcpdump that dhcp works, and tftp for lpxelinux.0 works.

mark.yeun
2018-03-09 17:42
then... nothing

mark.yeun
2018-03-09 17:42
I believe the next thing that should happen is another tftp for a series of files, which should fail, then a tftp for pxelinux.cfg/default

shane
2018-03-09 17:42
fw or iptables blocking ports 8091 and 8092 on the DRP endpoint ?

mark.yeun
2018-03-09 17:43
wide open

vlowther
2018-03-09 17:43
Yep, that is what you should see.

mark.yeun
2018-03-09 17:45
so my theory is lpxelinux.0 doesn't like my hardware?

vlowther
2018-03-09 17:45
What version of dr-provision are you running, what hardware are you testing, and is it booting via UEFI or legacy BIOS?

shane
2018-03-09 17:45
...and what does your serial console on your Machine show ?

mark.yeun
2018-03-09 17:45
i _think_ it's legacy BIOS.

mark.yeun
2018-03-09 17:46
```
$ dr-provision --version
dr-provision2018/03/09 17:45:45.087689 Version: v3.7.3-tip-5-eb82a0429c7c94bb1885cc32528c15e376417138
```

vlowther
2018-03-09 17:46
Cool.

mark.yeun
2018-03-09 17:46
and I don't have access to vga -- only serial console

vlowther
2018-03-09 17:46
No worries.

shane
2018-03-09 17:46
vga is only for winders ... a real OS only needs a serial console ...

vlowther
2018-03-09 17:46
heh

mark.yeun
2018-03-09 17:46
lol nice

mark.yeun
2018-03-09 17:47
do you want to see the subnet def?

mark.yeun
2018-03-09 17:47
nothing special

vlowther
2018-03-09 17:47
What does the subnet definition look like?

vlowther
2018-03-09 17:47
yes. :slightly_smiling_face:

mark.yeun
2018-03-09 17:47
haha ```{ "ActiveEnd": "10.10.24.171", "ActiveLeaseTime": 60, "ActiveStart": "10.10.24.151", "Available": true, "Description": "", "Enabled": true, "Errors": [], "Meta": {}, "Name": "kube1_subnet", "NextServer": "", "OnlyReservations": false, "Options": [ { "Code": 1, "Value": "255.255.254.0" }, { "Code": 3, "Value": "10.10.24.1" }, { "Code": 6, "Value": "10.40.20.101,10.40.20.102" }, { "Code": 15, "Value": "http://tower-research.com" }, { "Code": 28, "Value": "10.10.25.255" } ], "Pickers": [ "hint", "nextFree", "mostExpired" ], "Proxy": false, "ReadOnly": false, "ReservedLeaseTime": 7200, "Strategy": "MAC", "Subnet": "10.10.24.0/23", "Unmanaged": false, "Validated": true } ```

vlowther
2018-03-09 17:48
That is fine for 3.7.3

vlowther
2018-03-09 17:49
I wanted to make sure it wasn't forcing the wrong bootloader or something like that.

vlowther
2018-03-09 17:50
Can you throw a screenshot of the failed boot into the channel?

mark.yeun
2018-03-09 17:50
it's blank

vlowther
2018-03-09 17:50
...

mark.yeun
2018-03-09 17:50
serial

mark.yeun
2018-03-09 17:50
so F12, then blank

vlowther
2018-03-09 17:51
so the serial console shows the nic firmware doing its thing, loads lpxelinux.0, then goes blank?

shane
2018-03-09 17:51
^^^

mark.yeun
2018-03-09 17:51
yessir

shane
2018-03-09 17:51
you need to set serial console on the DRP side

mark.yeun
2018-03-09 17:51
actually, as usual, my bios is blanking the screen after POST

vlowther
2018-03-09 17:52
hm

mark.yeun
2018-03-09 17:52
I've set serial console, and I see it in the rendered pxelinux.cfg/default

mark.yeun
2018-03-09 17:52
but i'm not getting that far

vlowther
2018-03-09 17:52
can whatever you are serial consoling into the system with capture the output?

vlowther
2018-03-09 17:52
ah

vlowther
2018-03-09 17:52
Yeah, I suspect it is what shane mentioned then.

mark.yeun
2018-03-09 17:52
lpxelinux.0 doesn't download the config file, so it doesn't know to output on serial

vlowther
2018-03-09 17:53
We should make having a serial console enabled the default one of these fine days.

mark.yeun
2018-03-09 17:53
and even if it did, the rendered pxelinux.cfg/default has console=... for the kernel but not for itself

mark.yeun
2018-03-09 17:54
```
$ tftp REDACTEDHOSTNAME -c get pxelinux.cfg/default
$ cat default
DEFAULT discovery
PROMPT 0
TIMEOUT 10
LABEL discovery
  KERNEL sledgehammer/9743e672ff33179cd5218d8fe506c03cf2a31d18/vmlinuz0
  INITRD sledgehammer/9743e672ff33179cd5218d8fe506c03cf2a31d18/stage1.img
  APPEND rootflags=loop root=live:/sledgehammer.iso rootfstype=auto ro liveimg rd_NO_LUKS rd_NO_MD rd_NO_DM provisioner.web=http://10.40.20.30:8091 -- console=ttyS1,115200n8
  IPAPPEND 2
```

greg
2018-03-09 17:54
Another thing to try would be to set option 67 in the subnet to `ipxe.pxe` and see if that works.

shane
2018-03-09 17:54
is your serial console *actually* on ttyS1 ? not ttyS0 ?

shane
2018-03-09 17:54
ttyS1 is kinda non-standard - it's what http://packet.net baremetal systems use by default

mark.yeun
2018-03-09 17:54
tried both :slightly_smiling_face:

mark.yeun
2018-03-09 17:55
i tried this, no change in behavior `drpcli subnets set kube1_subnet option 67 to '{{if (eq (index . 77) "iPXE") }}default.ipxe{{else if (eq (index . 93) "0")}}lpxelinux.0{{else}}bootx64.efi{{end}}'`

vlowther
2018-03-09 17:55
wow, that old string is still out there.

vlowther
2018-03-09 17:56
try {{if (eq (index . 77) "iPXE") }}default.ipxe{{else if (eq (index . 93) "0")}}ipxe.pxe{{else}}ipxe.efi{{end}}

shane
2018-03-09 17:56
are you using a USB to Serial converter, or an actual 9-pin serial port ?


mark.yeun
2018-03-09 17:57
actually i have an avocent

shane
2018-03-09 17:57
um ... that's my fault ...

mark.yeun
2018-03-09 17:58
ok trying that option67 value

mark.yeun
2018-03-09 17:59
``` { "Code": 67, "Value": "try {{if (eq (index . 77) \"iPXE\") }}default.ipxe{{else if (eq (index . 93) \"0\")}}ipxe.pxe{{else}}ipxe.efi{{end}}" } ``` gonna give it a go

greg
2018-03-09 17:59
too much

greg
2018-03-09 17:59
get rid of try

vlowther
2018-03-09 17:59
there is no try. :slightly_smiling_face:

mark.yeun
2018-03-09 17:59
lol

mark.yeun
2018-03-09 17:59
do or do not

zerocarbthirty
2018-03-09 17:59
What backend system does DRP use to store information gathered with gohai/etc ?

greg
2018-03-09 18:00
This command: `drpcli subnets set kube1_subnet option 67 to '{{if (eq (index . 77) "iPXE") }}default.ipxe{{else if (eq (index . 93) "0")}}ipxe.pxe{{else}}ipxe.efi{{end}}'`

greg
2018-03-09 18:00
It uses a file-based data store.

greg
2018-03-09 18:00
@zerocarbthirty

zerocarbthirty
2018-03-09 18:01
Oh, I figured it was using Collins or something like that.

greg
2018-03-09 18:01
@zerocarbthirty - we try to keep it minimal / lightweight for embedding in things.

vlowther
2018-03-09 18:01
Nope. We want complete standalone operation + a zero-dependency install.

greg
2018-03-09 18:02
We can use different backends, but they need to be more blob-store-like than Collins.

vlowther
2018-03-09 18:02
so the dr-provision binary embeds all the things it must have to install.

zerocarbthirty
2018-03-09 18:05
So I asked the question earlier about having DRP configure switchports for DHCP when it needs to pxeboot a system rather than having DHCP always on. I've been watching a lot of the videos about DRP (that use packet as infrastructure) and it seems a little odd that a network of that size would want every port forwarding DHCP traffic from every host to the DHCP server all the time. Although I guess they probably don't use DR for their own DHCP/pxe so maybe they do actually configure the ports.

shane
2018-03-09 18:06
yes, they configure switch ports dynamically for every single host - every single host by default is isolated in a /31 size L3 boundary and VXLAN separated (updated my typo to /31 boundary)

shane
2018-03-09 18:06
Digital Rebar Provision can configure ports just like they do ... it would be part of the Stages of workflow that you'd write for content

mark.yeun
2018-03-09 18:06
lol i thought you were making a yoda joke but I get it now. removed the "try ", retrying

shane
2018-03-09 18:09
the open source Digital Rebar Provision out-of-the-box does not have switch management built in

mark.yeun
2018-03-09 18:10
boys i see many packets


zerocarbthirty
2018-03-09 18:11
Hmm I was under the impression that they didn't actually use any L2 VLAN/VXLAN stuff so that people can use their own overlays.

zerocarbthirty
2018-03-09 18:12
but it's amusing to me that they use /31s just because of NANOG's history of arguing about whether or not anyone should ever use /31s haha

mark.yeun
2018-03-09 18:12
i forgot to put serial console back on ttyS0, but the machine showed up in drpcli machines list

mark.yeun
2018-03-09 18:13
will give it another crack with proper serial

mark.yeun
2018-03-09 18:13
but you are the _man_

mark.yeun
2018-03-09 18:13
seems I'd have been stuck without the slack inquiry

mark.yeun
2018-03-09 18:13
thanks guys! much appreciated.

greg
2018-03-09 18:14
What type of hardware is it? @mark.yeun

mark.yeun
2018-03-09 18:14
supermicro X10 series with mellanox 10g

shane
2018-03-09 18:14
hmm - the mellanox drivers may not be in the open source centos/ubuntu - have you checked that ?

greg
2018-03-09 18:14
lpxelinux.0 might not like the mellanox. `ipxe.pxe` is "newer"

shane
2018-03-09 18:14
I know they were a problem (especially Ubuntu 14.x)

shane
2018-03-09 18:15
which Mellanox cards ?

zerocarbthirty
2018-03-09 18:16
Anyway, earlier I wasn't asking about whether DRP has the ability to configure networking equipment I was asking whether there were example workflows that illustrate where one might just pop in their own code to look stuff up from IPAM/configure switches, etc

mark.yeun
2018-03-09 18:17
02:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

shane
2018-03-09 18:17
should be good - that's an older gen card - which I think is supported in the mlx4 driver

shane
2018-03-09 18:18
(it might be in the mlx5 driver - but both are in CentOS7 by default ... pretty sure ubuntu 16 too)

mark.yeun
2018-03-09 18:18
one more quick question -- I haven't looked into this yet. is IPMI support behind the pay wall?

vlowther
2018-03-09 18:18
Yep.

zerocarbthirty
2018-03-09 18:19
Hmm, I think crowbar used to manage ipmi

zerocarbthirty
2018-03-09 18:19
odd that they would decide to make that a paid for feature

greg
2018-03-09 18:20
We have to have some encouragement to move to revenue generating customers.

mark.yeun
2018-03-09 18:20
okay, if this sexy thing keeps looking sexy i'll hit up your sales team for at least ballpark pricing :slightly_smiling_face:

greg
2018-03-09 18:21
You can play with it in the IPMI content bundles.

greg
2018-03-09 18:21
@mark.yeun - that is amazingly easy. Seeing as you've talked with over half the team. :slightly_smiling_face:

mark.yeun
2018-03-09 18:21
okay that triggers one more question. what sort of features do we get with the paywall bios support?

vlowther
2018-03-09 18:22
Dell firmware updates and BIOS config are there presently

2018-03-09 18:22
Time to feed the :bear:!

vlowther
2018-03-09 18:22
seeing as how that is what we have the most experience with.

zerocarbthirty
2018-03-09 18:23
@mark.yeun You should be able to just boot into a thin linux environment like Alpine and then use whatever vendor's tools to configure your BIOS by just having alpine curl a script that builds the config based on either the MAC or IP of the client.

zerocarbthirty
2018-03-09 18:24
omconfig is especially easy to do that with

vlowther
2018-03-09 18:24
I have implemented Supermicro support (via sum) for DRv2 and an earlier customer, and have been waiting for demand to port that support over to dr-provision.

mark.yeun
2018-03-09 18:24
@zerocarbthirty thanks, we have such stuff for our old heavy OS builds. lots of witchcraft though with our mixed hardware platforms

zerocarbthirty
2018-03-09 18:24
I don't know how far along redfish is but some of that may be standardized now

greg
2018-03-09 18:25
@zerocarbthirty - or you could use our tools and workflows to automate and orchestrate. Or build a task/stage that runs your own tools as part of the process. DRP is flexible.

vlowther
2018-03-09 18:25
ya, our BIOS and RAID plugins go through an in-house tool that implements a standard format and idempotent config flow on top of vendor-provided tooling.

shane
2018-03-09 18:25
@zerocarbthirty...it all sounds so _easy_ on paper .... :stuck_out_tongue_winking_eye:

vlowther
2018-03-09 18:26
yeah, we take care of blundering into and padding all the sharp bits so you don't have to.

zerocarbthirty
2018-03-09 18:27
and i'm guessing that also Windows is paywalled too?

vlowther
2018-03-09 18:27
so you can (for instance) hand us a JSON blob representing a RAID config and we take care of driving megacli/storcli/whatever to make that config happen.

vlowther
2018-03-09 18:27
Ditto for BIOS config.

zerocarbthirty
2018-03-09 18:28
You mean in the event that we have access to the rackn stuff

greg
2018-03-09 18:28
@zerocarbthirty - yes, image-based deploys are paywalled. Because they are tricky and usually require consulting.

zerocarbthirty
2018-03-09 18:29
oh, we have it down to where we boot into PE and use command-line DISM to do phase 1 then the system reboots and finishes the install. We didn't find it very complicated even using pxelinux

greg
2018-03-09 18:30
Okay - then that is good for you. We've found that building images and deploying them is much faster and our customers like that workflow better. It fits into CICD pipelines better and gives better controls.

zerocarbthirty
2018-03-09 18:30
I assume you are referring to non .wim images?

greg
2018-03-09 18:31
Correct - though I have prototypes that deploy those as well.

zerocarbthirty
2018-03-09 18:34
So you are referring to images that you drop onto the disk when the install is past what is commonly referred to as the 'PE' phase?

mark.yeun
2018-03-09 18:34
hey I want to thank you guys again. will go have a play, and may come back for more magic.

vlowther
2018-03-09 18:35
@zerocarbthirty: yes.

vlowther
2018-03-09 18:36
@mark.yeun: No problemo.

zerocarbthirty
2018-03-09 18:41
That is interesting, I haven't really looked for different imaging formats than .wim since we used ghost+floppies to do installs but I suppose if you include the right drivers or only use hardware that has native drivers in $target_windows_version that method could be a bit faster.

shane
2018-03-09 18:42
It is extremely fast - only requires one reboot of the machine to provision to a completed OS, and we support adding in post-provisioning bootstrap config changes as well

shane
2018-03-09 18:42
along with our Agent, with which you can enable longer-term lifecycle management (if desired; by default our Agent "dissolves" after initial provisioning)

shane
2018-03-09 18:43
coupled with a CI/CD pipeline to validate/test your "gold" image, you can roll out patch updates very very quickly across very large scale infrastructure

shane
2018-03-09 18:44
you also can roll forward/roll back images quickly through this mechanism in the event a new image exhibits behavior problems "in the wild"

zerocarbthirty
2018-03-09 18:45
So, pardon me if I am nosy but you would take the image when it is at the "Please wait while we are setting everything up for you" phase which I believe they still call OOBE and then have an execution configured to grab a script in powershell or (whatever) to execute the changes?

zerocarbthirty
2018-03-09 18:47
I suppose one could also just roll cloud-init into an imagine for physical hosts

zerocarbthirty
2018-03-09 18:47
err image

greg
2018-03-09 18:48
Hmmm. Maybe. :grinning:

zerocarbthirty
2018-03-09 18:50
although you would want to avoid joining the domain with such a host until after the computer name is changed for obvious reasons

greg
2018-03-09 18:50
If that is a concern. Most definitely

zerocarbthirty
2018-03-09 18:57
Discovery uses open sledgehammer? does that do cdp/lldp to gather network info?

shane
2018-03-09 18:58
lldp is in the Sledgehammer image so you can definitely build a Stage to collect switch port info

zerocarbthirty
2018-03-09 18:59
Cool, i'm having the team install 40 PowerEdge 440s to play with DRP next week so I'm just trying to think of everything I am going to wonder about ahead of time.

shane
2018-03-09 19:00
Stages can be used to integrate in to external DCIM/Asset Mgmt/Cfg Mgmt databases as well - we can either push inventory info to them, or pull info to build provisioning decisions against

shane
2018-03-09 19:00
(and IPAM as well)

zerocarbthirty
2018-03-09 19:03
The problem I find mostly is that the team will rack 500 servers and then make a bunch of mistakes describing them, whether it's the actual specs, the network info, whatever. so that's really the part that is the biggest pain in the ass. I didn't realize this until today but Packet never changing their hardware after it is installed makes a lot of things a whole lot easier.

shane
2018-03-09 19:03
yep, I know that pain ...

shane
2018-03-09 19:04
correlating the physical infrastructure design from what it _should_ be to what it really is (or isn't; if it's broken) ... can be a mess

shane
2018-03-09 19:04
you can audit via Sledgehammer what the reality of your network ports are and compare that to a "it should be this" design

shane
2018-03-09 19:05
we use the LLDPD implementation which supports LLDP, CDP, EDP, SONMP, and FDP protocols
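
An illustrative one-liner from a Sledgehammer shell using lldpd's CLI to dump what the switch advertises:
```
# show everything the connected switch port announces over LLDP/CDP/etc.
lldpcli show neighbors details
```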

zerocarbthirty
2018-03-09 19:05
Well, if you devote a portion of your DC to being totally 'fixed' i.e. you won't change the specs/won't change the network, etc it makes a lot of that trivial

zerocarbthirty
2018-03-09 19:06
but if you are constantly adding/removing drives+ram+pci-e cards, etc

zerocarbthirty
2018-03-09 19:06
your inventory is going to get out of date pretty quickly

shane
2018-03-09 19:07
one could implement use of our Agent (i.e. not let it dissolve), and use the built-in `gohai` inventory to report back to your provisioning service or other external services on a periodic basis

shane
2018-03-09 19:08
then you can support continual sweeping of inventory management - these are some of the larger lifecycle management solutions that can be built around the Agent if desired

zerocarbthirty
2018-03-09 19:08
my initial thought on that was to have a maintenance mode that would cause it to boot back into the discovery to update the inventory but then I just figured out that we could probably just make a certain percentage of stuff not changeable

shane
2018-03-09 19:09
Sledgehammer is designed to support that use, and DRP is designed to let you in-memory/live boot systems "do stuff", then boot them back to their installed OS easily enough

shane
2018-03-09 19:09
but that's a fairly disruptive pattern to have ...

zerocarbthirty
2018-03-09 19:09
I always find it humorous that if you add RAM to a dell server it takes a new inventory but there is no way to tell it to send that information anywhere

2018-03-09 19:09
Time to feed the :bear:!

shane
2018-03-09 19:10
use of the Agent reporting back to DRP can centralize and manage that information as opposed to the rebooting - our Agent is cross-compiled for Arm, Intel, 32, 64bit, Linux, Windows, and Darwin (Mac) currently - pretty easy to port to other things if desired

zerocarbthirty
2018-03-09 19:10
Oh, yeah man but it's not like you are going to upgrade RAM on a server while it's running

greg
2018-03-09 19:10
I would call those workflows. :grinning:

zerocarbthirty
2018-03-09 19:11
so there will be some amount of disruption anyway

zerocarbthirty
2018-03-09 19:11
and if it's something that can't be disrupted it should be running on block storage that is accessible from more than a single host anyway

shane
2018-03-09 19:12
that is a good modern design pattern, but sadly, not all shops are there ...

zerocarbthirty
2018-03-09 19:15
Yeah earlier I was mostly just asking if there were example workflows that are just missing the rackn pieces

zerocarbthirty
2018-03-09 19:15
like "this would be where rackn did this cool thing"

zerocarbthirty
2018-03-09 19:15
that would make it easy to figure out what pieces we have/need to build/want to buy

zerocarbthirty
2018-03-09 19:23
You guys have piqued my interest on this windows image thing, must find file format :smiley:

shane
2018-03-09 19:27
this Tuesday we will be demo'ing our Image Deployment capabilities - including the Windows Image deploy on the weekly Meetup

greg
2018-03-09 19:33
@zerocarbthirty - The default community content has stages and tasks in it. You can see the tasks / stages in the RackN content that is available by logging into the RackN SaaS.

shane
2018-03-09 19:33

zehicle
2018-03-09 22:47
When you create an account, you'll have access to some licensed content on a trial basis from the catalog.

zehicle
2018-03-09 22:48
The account turns on "registration wall" UX features.

zehicle
2018-03-10 03:03
little bonus features.... added live Task view to Bulk Edit

zehicle
2018-03-10 03:03

zehicle
2018-03-10 03:03
It will be in the latest after we test it a bit

jweber
2018-03-11 16:57
has joined #community201803

spector
2018-03-11 16:58
$Welcome

2018-03-11 16:58
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

florent.wagener
2018-03-12 14:08
hey guys, quick question, I have 3 burnin jobs that are stuck in a "running" state and I can't delete them. How can I force them into a fail state ?

florent.wagener
2018-03-12 14:08
I've tried this: `drpcli jobs destroy ce9c668a-7888-4308-84c6-e1591deedc0d --force` without success :disappointed:

florent.wagener
2018-03-12 14:10
nevermind, I found it: `drpcli jobs update ce9c668a-7888-4308-84c6-e1591deedc0d '{"State": "failed"}'`

amit.handa
2018-03-13 06:41
am trying to set up a kubernetes cluster on virtualbox VMs (3 VMs)

amit.handa
2018-03-13 06:41
have completed kubernetes setup via kubespray playbooks

amit.handa
2018-03-13 06:42
however, unable to access the dashboard

amit.handa
2018-03-13 06:42
it says 'unauthorized'. Any idea how to fix it ?

amit.handa
2018-03-13 06:42
Thank you !

amit.handa
2018-03-13 06:42
``` { "kind": "Status", "apiVersion": "v1", "metadata": { }, "status": "Failure", "message": "Unauthorized", "reason": "Unauthorized", "code": 401 } ```

greg
2018-03-13 12:56
@amit.handa - I think you need some kubeproxy magic, but not sure. Something like this may help you start. https://github.com/kubernetes/dashboard/issues/692

amit.handa
2018-03-13 12:58
thanks greg. Let me look for cure

shane
2018-03-13 13:11
@amit.handa are you trying to access via the `kubectl proxy` command ? http://provision.readthedocs.io/en/tip/doc/integrations/krib.html#kubernetes-dashboard-via-proxy

amit.handa
2018-03-13 13:12
I am opening it via https://<kube-master-ip>:6443

amit.handa
2018-03-13 13:12
as mentioned in the drp docs

wayneeseguin
2018-03-13 14:00
has joined #community201803

zehicle
2018-03-13 15:26
@amit.handa if kubespray changed security defaults then the docs will be out of date. The integration just supplies the inventory. It's likely they tweaked the auth system

shane
2018-03-13 15:27
@wayneeseguin $welcome

2018-03-13 15:27
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

amit.handa
2018-03-13 15:58
thanks @zehicle for the info, I am new to kubernetes. I'll check and update.

amit.handa
2018-03-13 17:18
thanks @greg

amit.handa
2018-03-13 17:19
I stopped the docker proxy container and ran the following on the master node

amit.handa
2018-03-13 17:19
```kubectl proxy --address 0.0.0.0 --accept-hosts '.*'```

amit.handa
2018-03-13 17:19
I can see the dashboard

jpresley
2018-03-13 17:41
has joined #community201803

jpresley
2018-03-13 17:57
I'm exploring use of digital rebar to provision bare metal machines in the office and in the data center. Is it a lot of effort use digital rebar when the local network already has a dhcp server? My experience is more in devops and software provisioning rather than traditional ops

zehicle
2018-03-13 18:59
@jpresley there are several ways to handle shared DHCP, including setting next-boot options from your existing DHCP server (and not using DRP's DHCP), or using DRP as a DHCP Proxy to add next-boot instructions to DHCP requests

zehicle
2018-03-13 18:59
I think some of those are in the $faq


wdennis
2018-03-13 19:13
@jpresley I am using DRP on a subnet with existing DHCP server - I just am setting "next server" and "default file name" params to appropriate values
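
A hedged example of those two params for an ISC dhcpd server; the IP, boot file, and service name are placeholders that vary by environment:
```
# point an existing dhcpd at the DRP endpoint for network boot
cat >> /etc/dhcp/dhcpd.conf <<'EOF'
next-server 10.0.0.10;     # the DRP endpoint
filename "lpxelinux.0";    # the boot file DRP serves over TFTP
EOF
systemctl restart dhcpd    # service name varies by distro
```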

greg
2018-03-13 19:30
@jpresley - proxy dhcp may work for you

lcrozzoli
2018-03-13 23:11
has joined #community201803

shane
2018-03-14 00:49
@lcrozzoli $welcome

2018-03-14 00:49
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

wayneeseguin
2018-03-14 13:36
:smile:

zehicle
2018-03-14 14:52
Good morning @wayneeseguin

lcrozzoli
2018-03-14 15:13
Hello and thanks to all. i'm very glad to be here

nkabir
2018-03-14 16:08
Hello, all. I've managed to navigate the documentation well enough to create a bundle, upload it, and successfully provision machines. It works beautifully and I appreciate everyone's hard work. I have two questions about the process: 1) I'd like to customize the ubuntu preseed "late_command" but cannot find references to it in the documentation. Nor could I find references to it in the standard collection of parameters (I used "select-kickseed" to customize other parts of the installation which worked perfectly). What is the preferred way to override "late_command"? 2) I have a "discover -> prep-install -> ubuntu-16.04-install -> finish-install -> complete-nowait" workflow to accomplish this. Is this reasonable? Apologies if I should open these questions as Github issues. I know it's always a challenge to document every possible use case. I'd like to help out any way I can so I'd be happy to post these questions as Github issues so they can be organized/triaged.

shane
2018-03-14 16:09
hi @nkabir - glad to hear you've gotten things rolling !

shane
2018-03-14 16:12
taking a look at your late_command question

shane
2018-03-14 16:17
currently - there are two paths for you: 1. with `select-kickseed` - you can define your own post install script - either replacing the `post-install.sh` call, or adding a second line after it ... 2. you can clone the ubuntu-16.04-install BootEnv, and make the `template` call changes in that - using your cloned BootEnv
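
For path 1, a hedged example of pointing a machine at a cloned kickseed template; the UUID and template name are placeholders, and the value is JSON (hence the nested quotes):
```
# select a custom kickseed template for one machine
drpcli machines set <machine-uuid> param select-kickseed to '"my-net-seed.tmpl"'
```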

shane
2018-03-14 16:17
However, if you clone the BootEnv, when we make DRP Community Content updates, you won't get those changes for that cloned BootEnv - but the original ubuntu-16.04-install BootEnv will be updated

shane
2018-03-14 16:20
it looks like we actually call "post-install.sh" twice (erroneously) - we call it once in the BootEnv definition in the `templates` section, then again in the `net-seed.tmpl`

shane
2018-03-14 16:25
since you are already using `select-kickseed`, the `net-seed.tmpl` that you (presumably) originally cloned can be modified to point to a different `post-install.sh`; and our `post-install.sh` will still run, as it's defined by the BootEnv definition to run last - and you don't want to disturb the `reset-workflow` and `runner` template calls in that, otherwise you'll get bad behavior with the Workflow/Stages and jobs.

nkabir
2018-03-14 16:28
Thank you for the clarification, @shane. I'll give that a try!

shane
2018-03-14 16:29
Also - if you are unsure exactly what a template is going to do - you can render it to see the final product - not everything can be rendered in its final state, since the context in which it was called matters in some cases

shane
2018-03-14 16:29
but there is a $faq on test rendering templates


shane
2018-03-14 16:31
the rendering works for other things too, any template defined in a BootEnv can be rendered against the "machines" url, example: http://drp.domain.com:8091/machines/817cbf29-30be-4807-b5a0-1234567890/post-install.sh

shane
2018-03-14 16:32
the trick is ... the machine must be in the Stage that the templates are defined in - so if the `ubuntu-16.04-install` Stage defines the templates, put a "phantom" machine into that bootenv, and then render away

shane
2018-03-14 16:32
this is true for other stages too - I use a phantom machine to render stuff all the time

nkabir
2018-03-14 16:37
I was able to get that far. I wanted to work within DRP preferred conventions and ensure I didn't customize too far afield of the tool's best-practices. Would my custom post-install.sh script need to follow the BootEnv post-install script's configuration of the chroot environment i.e. using a here-doc and executing in /target?

shane
2018-03-14 16:41
I'd suggest cloning the existing `net-post-install.sh.tmpl`, and working within the same conventions - just modify in the HereDoc what you want to actually perform - and change the name of the written out file from `update_system2.sh` to something like `update_norms_secret_post_install.sh` - otherwise keeping everything else intact around the HereDoc

nkabir
2018-03-14 16:41
:+1: will give that a try!

shane
2018-03-14 17:35
@nkabir - additionally you could add a new Stage, with a Task `norms-post-install`, which calls the new `post-install` script - you'd then insert that in your work flow between `ubuntu-16.04-install` and `finish-install`
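
A hedged sketch of splicing such a stage into a profile's stage map via the `change-stage/map` param; the profile name, stage names, and transition actions below are illustrative, not a tested map:
```
# insert the custom stage between install and finish in the stage map
drpcli profiles set my-profile param change-stage/map to '{
  "discover": "prep-install:Reboot",
  "prep-install": "ubuntu-16.04-install:Reboot",
  "ubuntu-16.04-install": "norms-post-install:Success",
  "norms-post-install": "finish-install:Reboot",
  "finish-install": "complete-nowait:Success"
}'
```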

nkabir
2018-03-14 19:36
@shane I like your suggestion better. It's more discoverable and self-documenting. Thanks!

shane
2018-03-14 19:36
The Stage/Task/Template solution?

shane
2018-03-14 19:38
^^^ was @greg and @vlowther reminding me that is the better path to follow ... :slightly_smiling_face:

wdennis
2018-03-15 00:55
@shane Speaking of stages and tasks, I'm having a problem with a custom Stage/Task as follows...

wdennis
2018-03-15 00:56
I made a custom stage based on `prep-install` named `necla-prep-install` which is as follows:

wdennis
2018-03-15 00:56

wdennis
2018-03-15 00:57
This uses the Task I named `totally-erase-sda` which is as follows:

wdennis
2018-03-15 00:58

wdennis
2018-03-15 00:58
I wired the stage up in a stage-map in my profile as follows:

wdennis
2018-03-15 00:59

wdennis
2018-03-15 01:01
So, as I understand it, this will boot SH, then execute the template script in the task called by the `necla-prep-install` stage, which should do the `dd` command which will zero out /dev/sda. After this completes, the system will reboot, and go into the Ubuntu 16.04 install.

wdennis
2018-03-15 01:02
The systems calling the profile with the stage map are booting SH, but then nothing is executing (no `dd` is happening.) Why might this be?

wdennis
2018-03-15 01:06
I should also mention that `necla-prep-install` is a modified clone of `prep-install`, and `totally-erase-sda` is a modified clone of `erase-hard-disks-for-os-install`

wdennis
2018-03-15 01:08
The prior stage map was calling the `prep-install` Stage, which was successfully executing the template in `erase-hard-disks-for-os-install`, but the disk wipe was not sufficient, and the Ubuntu install thereafter was failing.

wdennis
2018-03-15 01:09
I have found that if I do a `dd if=/dev/zero of=/dev/sda bs=512` then all goes smoothly with the resulting install

shane
2018-03-15 02:23
not sure off the top of my head, I'm out at dinner right now - will look at this a little bit later

wdennis
2018-03-15 14:14
Mornin', DR folk!

wdennis
2018-03-15 14:15
Any ideas on my problem above? ^^^

shane
2018-03-15 14:20
sorry - had client work I was dealing with didn't get a chance to look at it ... will try and take a peek soon

wdennis
2018-03-15 14:21
@shane Thx

wdennis
2018-03-16 00:59
OK, I give up... Pretty sure it's not a DRP problem (except for the disk wiping issue not working as above) - dd'd the disks with zeros manually and the Ubuntu preseed still crapping out. Going to try MAAS and see if I have a different experience.

zehicle
2018-03-16 01:04
I believe Ubuntu writes the partition tables in a way that is hard to undo. Greg has had to fix it for other people. I don't know the details.

wdennis
2018-03-16 01:07
@zehicle Zeroing the disk should overwrite all of that, yes? I don't believe I ever ran into this problem though using Cobbler to reinstall previously-installed machines, and of course doing a manual install via USB on a previously-used disk work fine...

wdennis
2018-03-16 01:08
It's just that I have machines to reinstall, and I can't get DRP to work reliably on used disks...

zehicle
2018-03-16 01:24
There is data written in places you can't easily find. Sorry I don't have the details. It's a Ubuntu install issue.

zehicle
2018-03-16 01:33
I'm not assuming that I google better than you -> there's something about how the partition table is written.

zehicle
2018-03-16 01:34
It took Greg a while to fix it before (I remember him describing it as "blah Ubuntu blah Drive Partitions blah install turds blah") so I know it's a thing.

zehicle
2018-03-16 01:34
Greg never says turds, that's my color commentary

nkabir
2018-03-16 02:12
@wdennis I am running Ubuntu 16.04 installs (and re-installs) on a set of machines to familiarize myself with the tool. I started with "discover (start) -> prep-install (reboot) -> ubuntu-16.04-install (reboot) -> finish-install (reboot) -> complete-nowait (success)" before venturing into custom stages. To restart the process, I'm executing "# dd if=/dev/zero of=/dev/sda bs=1024 count=1", resetting the DRP Machine entry stage to "discover", and rebooting. It appears to re-install successfully. Have you managed to get a stock life-cycle working? I've noticed it's easy for me to make typos in "shell + template" files. I started out with MAAS until I discovered DRP. I prefer DRP.

wdennis
2018-03-16 13:53
@nkabir Yes, your workflow was just about exactly like mine -- but I find that `prep-install` doesn't do enough to clear the disk for re-use (I'm using LVM, and I'm pretty sure that's causing the issue.) I have done "stock" DRP-provided template installs, and they do work. But, I cannot stick with that due to my partitioning needs (as well as other needed preseed customizations.) I realize that RackN doesn't support debugging preseed problems, but I'm losing too much time on this trying to figure it out myself... So I'm going to do a test using MAAS and see what my experience is there. Maybe I'll quickly come back to DRP :wink:

greg
2018-03-16 13:56
`prep-install` seems to work for many. It would be good to see your workflow again @wdennis. I'm out though and will get yelled at for this message.

shane
2018-03-16 13:56
@greg - go back to your vacation

wdennis
2018-03-16 13:57
Wow, that didn't take long...

greg
2018-03-16 13:58
@nkabir - I think you can replace the `prep-install(reboot)` with `prep-install(success)` - one less reboot that way. `finish-install(reboot)` should be `finish-install(stop)`, but with the tip code base it will "fix" it for you.

mark.yeun
2018-03-16 14:45
hi guys i'm back. I had dr-provision working well with mellanox cards, but now on a machine with SolarFlare, I'm getting stuck.

mark.yeun
2018-03-16 14:46
I have the option 67 override, which I needed for mellanox: `67: "{{if (eq (index . 77) \"iPXE\") }}default.ipxe{{else if (eq (index . 93) \"0\")}}ipxe.pxe{{else}}ipxe.efi{{end}}"`

mark.yeun
2018-03-16 14:47
I have only serial console to the machine I'm trying to provision. the screen goes blank.

mark.yeun
2018-03-16 14:47
i have a tcpdump on the drp box, and it shows tftp transfer of ipxe.pxe, then nothing

mark.yeun
2018-03-16 14:53
Ah I got someone to take a pic of the screen. I have this, and it froze
```
PXE->EB: !PXE at 9AC0:0790, entry point at 9AC0:0248
UNDI code segment 9AC0:0838, data segment 9B44:1c??
UNDI device is PCI 06:00.0, type gPXE
619kB free base memory after PXE unload
Oops! Unable to find realmode segment
```

nkabir
2018-03-16 15:00
@greg Thank you. That will save me some time as I iterate through my set up! Now back to enjoying Mai Tais (or equivalent)! @wdennis I am using LVM on a single disk but I haven't ventured into customizing partitioning yet. My staging machines have a single boot disk (`/dev/sda`) and three additional disks that are configured for ZFS/RAIDZ by Ansible once the DRP provisioning sets up the single boot disk. Ansible handles the partitioning of ZFS. I just need the boot disk to host the OS.

mark.yeun
2018-03-16 15:02
hm removing option 67 worked for this box.

mark.yeun
2018-03-16 15:03
so hmm. lpxelinux.0 freezes my mellanox box and ipxe.pxe freezes my solarflare box

mark.yeun
2018-03-16 15:03
that puts a skunk in the works

shane
2018-03-16 15:03
gotta love the PXE "_standards_" ...

mark.yeun
2018-03-16 15:03
do you have any tricks up your sleeve?

shane
2018-03-16 15:04
@greg or @vlowther might have some iPXE script options to help - unfortunately, I'm not sure how to implement that right to work for both cards

vlowther
2018-03-16 15:09
What nic is in that solarflare box?

mark.yeun
2018-03-16 15:10
solarflare is the nic

vlowther
2018-03-16 15:12
...

vlowther
2018-03-16 15:12
Interesting.

mark.yeun
2018-03-16 15:12
sorry, having trouble with copy/paste

mark.yeun
2018-03-16 15:12
Solarflare SFC9020

mark.yeun
2018-03-16 15:13
I see option 67 has template interpolation -- would we be able to switch on gohai inventory?

vlowther
2018-03-16 15:13
I assume they work fine once we get booted into Sledgehammer?

mark.yeun
2018-03-16 15:13
although that doesn't help for discovery

mark.yeun
2018-03-16 15:13
yup

vlowther
2018-03-16 15:14
Alas, the template interpolation that happens for option 67 only has access to the incoming DHCP packet.

vlowther
2018-03-16 15:16
hm.

vlowther
2018-03-16 15:16
What firmware are those solarflare nics running?

mark.yeun
2018-03-16 15:16
soo, we were pxe booting both types with our own pxelinux.0 via tftp

mark.yeun
2018-03-16 15:17
maybe I can tweak option 67 to have it try pxelinux.0 instead of lpxelinux.0 / ipxe.pxe

vlowther
2018-03-16 15:17
Trolling through the firmware release notes indicates that they have had a few firmware updates to fix PXE related issues.

mark.yeun
2018-03-16 15:18
trying to dig up the version


vlowther
2018-03-16 15:18
release notes for their latest utility bundle.

vlowther
2018-03-16 15:19
It looks like they use an embedded gPXE for their PXE booting needs.

mark.yeun
2018-03-16 15:19
i have this so far, still trying

mark.yeun
2018-03-16 15:19
```
Solarstorm Boot Manager (v3.2.0.6061)
Solarflare Communications 2008-2010
```

mark.yeun
2018-03-16 15:20
reading those release notes

vlowther
2018-03-16 15:30
heh, later releases allow you to embed your own custom ipxe ROM image.

vlowther
2018-03-16 15:30
Could have all sorts of fun with that. :slightly_smiling_face:

mark.yeun
2018-03-16 15:32
ah trying to avoid that kind of fun for now :slightly_smiling_face:

mark.yeun
2018-03-16 15:32
```
Firmware version:   v3.2.1
Controller type:    Solarflare SFC9000 family
Controller version: v3.2.0.6071
Boot ROM version:   v3.2.0.6061
```

mark.yeun
2018-03-16 15:32
quite far behind

vlowther
2018-03-16 15:33
In the mean time, though, if you have a version of pxelinux and/or ipxe that work for both systems in question you could just chuck them into /var/lib/dr-provision/tftpboot and rewrite your option 67 to use them instead.

mark.yeun
2018-03-16 15:35
ok thanks

vlowther
2018-03-16 15:35
or...

vlowther
2018-03-16 15:36
sorry, bad advice.

vlowther
2018-03-16 15:38
Make /var/lib/dr-provision/replace, and name them the same as their respective files in /var/lib/dr-provision/tftpboot

vlowther
2018-03-16 15:39
sorry, I am in the bad habit of just recompiling when I want to test different embedded assets

mark.yeun
2018-03-16 15:40
so mkdir /var/lib/dr-provision/replace; cp pxelinux.0 /var/lib/dr-provision/replace/mypxelinux.0, then change option 67 to send mypxelinux.0

vlowther
2018-03-16 15:41
no, I think it will have to be named lpxelinux.0

vlowther
2018-03-16 15:41
and you will need to include the related .c32 files for pxelinux

vlowther
2018-03-16 15:42
ipxe is easier because it will not try to load binary modules.
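
A hedged sketch of that masking trick; the exact `.c32` module list is an assumption and depends on your pxelinux build:
```
# mask the embedded bootloader with your own build; names must match the served files
mkdir -p /var/lib/dr-provision/replace
cp pxelinux.0 /var/lib/dr-provision/replace/lpxelinux.0
cp ldlinux.c32 libutil.c32 /var/lib/dr-provision/replace/   # pxelinux loads .c32 modules too
```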

vlowther
2018-03-16 15:42
but I bet you will need a firmware update to get ipxe working.

mark.yeun
2018-03-16 15:44
oooh i c. so /var/lib/dr-provision/replace masks /var/lib/dr-provision/tftpboot?

mark.yeun
2018-03-16 15:44
i'm bricking, er, working on upgrading the firmware

vlowther
2018-03-16 15:45
That is the idea, yes. :slightly_smiling_face:

mark.yeun
2018-03-16 15:46
does sledgehammer on serial console have password login?

vlowther
2018-03-16 15:46
root/rebar1

mark.yeun
2018-03-16 15:46
silly me did the firmware update over ssh and of course lost connection

mark.yeun
2018-03-16 15:46
omg thank you

mark.yeun
2018-03-16 15:49
YES! firmware update seems to have done it

mark.yeun
2018-03-16 15:50
@vlowther thank you for heroic advice :slightly_smiling_face:

vlowther
2018-03-16 15:50
:slightly_smiling_face:

vlowther
2018-03-16 15:51
And it looks like the latest firmware has a slew of fixes and enhancements over your older version.

mark.yeun
2018-03-16 16:01
ok i'll come back again when I get stuck

vlowther
2018-03-16 16:02
:slightly_smiling_face:

vlowther
2018-03-16 16:04
@wdennis Is it still the --force --force --I-really-mean-it thing?

vlowther
2018-03-16 16:10
@nkabir our erase script is a little more paranoid than zeroing the first megabyte: https://github.com/digitalrebar/provision-content/blob/master/content/tasks/erase-hard-disks-for-os-install.yaml

vlowther
2018-03-16 16:10
we iterate over all the vgs, forcibly erase them, do the same for all the pvs

vlowther
2018-03-16 16:11
then wipe out the first and last meg of every partition and raw block device.
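
A rough shell rendition of that flow -- not the actual RackN task (see the linked yaml), just the shape of it:
```
# forcibly tear down LVM metadata
for vg in $(vgs --noheadings -o vg_name); do vgremove -ff "$vg"; done
for pv in $(pvs --noheadings -o pv_name); do pvremove -ff "$pv"; done
# wipe the first and last meg of each raw disk (device glob is illustrative)
for d in /dev/sd?; do
    sectors=$(blockdev --getsz "$d")                                   # 512-byte sectors
    dd if=/dev/zero of="$d" bs=512 count=2048                          # first meg
    dd if=/dev/zero of="$d" bs=512 seek=$((sectors - 2048)) count=2048 # last meg
done
```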

wdennis
2018-03-16 16:17
wat?

vlowther
2018-03-16 16:21
You had a similar issue a few weeks ago where vgremove was failing because one --force was not enough.

wdennis
2018-03-16 16:23
So my solution was to zeroize the entire disk via `dd if=/dev/zero of=/dev/sda bs=512`

wdennis
2018-03-16 16:23
That should do the trick, right?

vlowther
2018-03-16 16:24
Yes, the main reason I don't want to do that by default is because it can take hours if (for example) you are zeroing a multi-terabyte drive.

vlowther
2018-03-16 16:25
so I would prefer an approach that is a bit more targeted in what it erases

wdennis
2018-03-16 16:25
Except that it doesn't... The Ubuntu install I am trying to do afterwards still craps out...

vlowther
2018-03-16 16:26
possibly with fixes to the seed files for Debianoids that tell the install to really ignore anything that might already be on the disk.

wdennis
2018-03-16 16:26
Yes, I'd prefer a shorter wipe as well, but only if it works...

vlowther
2018-03-16 16:26
but you know all about the lack of documentation there. :confused:

wdennis
2018-03-16 16:27
I was using your `prep-install` stage but was experiencing faults when the installer got to the partitioning step

wdennis
2018-03-16 16:28
The interesting thing is that a standard USB install (i.e. a regular ISO installer) doesn't complain about anything when you use a used disk

wdennis
2018-03-16 16:28
It just partitions however you told it to and moves on

vlowther
2018-03-16 16:28
So it probably knows about seed options we don't

wdennis
2018-03-16 16:28
yes

wdennis
2018-03-16 16:29
I need a specific partitioning to support how we want to configure the boot/root disk

wdennis
2018-03-16 16:30
It's basically:

wdennis
2018-03-16 16:30
```
d-i partman-auto/expert_recipe string root_home_lvm : \
    500 533 1024 free \
        $iflabel{ gpt } $reusemethod{ } \
        method{ efi } format{ } \
    . \
    1024 1024 1024 ext2 \
        $primary{ } method{ format } \
        format{ } use_filesystem{ } filesystem{ ext2 } \
        mountpoint{ /boot } \
    . \
    1 1073741824 -1 ext3 \
        $defaultignore{ } method{ lvm } \
        vg_name{ vg00 } \
    . \
    50% 20 100% linux-swap \
        $lvmok{ } \
        lv_name{ lv_swap } in_vg{ vg00 } \
        method{ swap } format{ } \
    . \
    204800 204800 204800 ext4 \
        $lvmok{ } in_vg{ vg00 } \
        lv_name{ lv_root } method{ format } \
        format{ } use_filesystem{ } filesystem{ ext4 } \
        mountpoint{ / } \
    . \
    512000 512000 512000 ext4 \
        $lvmok{ } in_vg{ vg00 } \
        lv_name{ lv_home } method{ format } \
        format{ } use_filesystem{ } filesystem{ ext4 } \
        mountpoint{ /home } \
    .
```

wdennis
2018-03-16 16:31
So it makes LV's for "/", swap and "/home"

wdennis
2018-03-16 16:31
The idea is to make it partition any size of disk (we have very hetero hardware here...)

tim_epkes
2018-03-16 16:32
has joined #community201803

wdennis
2018-03-16 16:32
Basically make a reasonable-sized "/" LV, reasonable-sized swap LV, then all the rest of the VG space goes to the "/home" LV

vlowther
2018-03-16 16:34
yes

vlowther
2018-03-16 16:35
the hard part is finding the right partman-lvm and partman-md and other related options to make it really clear everything out and not ask questions

wdennis
2018-03-16 16:36
We need that because the researchers get local home dirs on the servers, and tend to pile lots of data in them... If "/home" were part of the "/" LV, they'd run the OS partition out of space and crash the servers...

wdennis
2018-03-16 16:36
It's *so* easy in freakin' kickstart... Preseed is so obtuse.

wdennis
2018-03-16 16:37
But, the researchers all want Ubuntu OS...

gary.berger
2018-03-16 16:51
has joined #community201803

spector
2018-03-16 16:51
@tim_epkes $welcome

2018-03-16 16:51
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

tim_epkes
2018-03-16 16:51
Thanks

fpisupati
2018-03-16 18:03
has joined #community201803

nkabir
2018-03-16 18:06
@vlowther Yes. It's very thorough! I'm repeatedly cycling through the whole process in my staging environment to get more comfortable with the moving parts. I run your erase stage in my workflow. However, I only erase the first megabyte when I want a machine to return to discovery via PXE. Then the workflow takes over.

nkabir
2018-03-16 18:12
@nkabir uploaded a file: https://rackn.slack.com/files/U8BTZ6HPT/F9RAUAV4J/-.sh and commented: @wdennis with your successful manual install, have you dumped the working machine's configuration and compared it with your DRP template?

zehicle
2018-03-16 18:16
@fpisupati $welcome

2018-03-16 18:16
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

fpisupati
2018-03-16 19:15
thanks

romain.lafontaine
2018-03-16 20:25
Random/unrelated : Was browsing your website and found a nice pic of you @shane @zehicle @greg https://www.rackn.com/company/ ^^ I felt it was worth sharing with the community

shane
2018-03-16 20:26
LOL ... yeah, well @zehicle had a "wardrobe malfunction" and didn't end up wearing his kilt that day ... that's @vlowther in the middle, @greg on the right, and /me on the left ...

romain.lafontaine
2018-03-16 20:26
^^

spector
2018-03-16 20:27
These are unapproved marketing show uniforms :grinning:

shane
2018-03-16 20:27
s/unapproved/maverick/g

greg
2018-03-16 20:28
The strange part is that the merely average guy is short.

greg
2018-03-16 20:28
And comfortable

patrick.miller
2018-03-16 21:16
has joined #community201803

zehicle
2018-03-16 21:17
@patrick.miller $welcome

2018-03-16 21:17
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

patrick.miller
2018-03-16 21:17
thanks zehicle

shane
2018-03-16 21:17
indeed, welcome, @patrick.miller :slightly_smiling_face:

patrick.miller
2018-03-16 21:18
how can I display the kickstart file for a machine?

shane
2018-03-16 21:18
the rendered one - or the template that builds the final KS ?

patrick.miller
2018-03-16 21:18
rendered


patrick.miller
2018-03-16 21:19
excellent thanks!

shane
2018-03-16 21:19
there are a lot of nuggets in the $faq


shane
2018-03-16 21:20
(oops, I need to update Slackbot responses to point to `tip` doc instead of `latest`)

shane
2018-03-16 21:21
ok - done: $quickstart

shane
2018-03-16 21:31
@patrick.miller in a short bit, the `tip` docs will update with some clean ups and enhancements to that Render doc - but the info is correct as you see it right now

shane
2018-03-16 23:44
@wdennis - I got a chance to look at your Erase SDA issue ... the problem is you set `sane-exit-codes` - we expect appropriate Exit Codes to mark success. However - our "insane" (aka older) Exit Code definitions also assume an Exit Code of 0 (zero) for success. The `dd` command will always exit with code `1` and the message "out of disk space". This causes the Agent/Runner to mark the job as failed, and your Workflow won't advance. Explicitly setting an `exit 0` at the end of the scriptlet will fix your issue. For reference, I am attaching a content pack that contains the testing I did. NOTE - I moved your scriptlet out of the payload of the Task, and in to a separate Template, as it makes it easier to extend/edit/track/modify, etc. But the principle should be the same (I did not test using the scriptlet embedded in the task w/ a modified `exit 0`). I'm including 2 workflows - your original NECLA one and a modified `rebar-prep` workflow that does the same thing, but uses the Digital Rebar/RackN `erase-hard-disks-for-os-install` Task. If this needs modification to work right - we'd appreciate the feedback on what needs to change. It's 1000x faster than a pure-`dd` solution on bigger disks.
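
The shape of that fix in a scriptlet, per the explanation above (`/dev/sda` as in the original task):
```
# dd of a whole device exits 1 when it hits end-of-disk, which the runner reads as failure
dd if=/dev/zero of=/dev/sda bs=512 || true
exit 0   # tell the Agent/Runner the task succeeded
```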

shane
2018-03-16 23:46
@shane uploaded a file: https://rackn.slack.com/files/U6QFVRJNB/F9REST4BE/NECLA_and_Rebar_wipe_prep_content_.yaml and commented: Use like regular content pack. First delete your conflicting content with the same names. Then: `drpcli contents create -< necla-and-rebar-prep.yaml` Attach the appropriate Workflow stagemap to your machines, and boot as usual.

scsikid
2018-03-17 02:19
has joined #community201803

shane
2018-03-17 03:18
@scsikid $welcome !

2018-03-17 03:18
Digital Rebar welcome information is here > http://rebar.digital/community/welcome.html

patrick.miller
2018-03-17 03:39
ah ok thanks Shane.

scsikid
2018-03-17 17:16
thanks @shane