15:01:16 #startmeeting oVirt Infra
15:01:16 Meeting started Mon Mar 3 15:01:16 2014 UTC. The chair is knesenko. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:01:25 eedri: ewoud here ?
15:01:29 bkp: here ?
15:01:29 * eedri here
15:01:33 #chair eedri
15:01:33 Current chairs: eedri knesenko
15:01:34 * bkp here
15:01:39 #chair bkp
15:01:39 Current chairs: bkp eedri knesenko
15:01:41 doron_afk, you want to join?
15:01:55 knesenko, obasan & dcaro are ooo
15:01:59 * doron here
15:02:06 #chair doron
15:02:06 Current chairs: bkp doron eedri knesenko
15:02:22 ewoud: hi how are you ? Joining the meeting ?
15:02:25 orc_orc: hey
15:02:32 orc_orc: want to join ?
15:02:44 anyone from kimchi ?
15:02:59 alitke: here?
15:03:22 hi
15:03:45 #topic Hosting
15:03:48 ok hi all
15:03:54 let's start the meeting
15:04:03 a few updates regarding the hosting ....
15:04:44 there was an outage in rackspace, so our servers were down for 15-20 min
15:04:54 but they are back and everything is ok now
15:05:26 rackspace said :
15:05:48 knesenko: yes
15:06:05 there was a cable issue
15:06:08 #chair ewoud
15:06:08 Current chairs: bkp doron eedri ewoud knesenko
15:06:38 #info there was an outage in rackspace. Issue fixed and everything works fine now
15:06:59 also we took rackspace03 and now we are using it as a jenkins slave
15:07:11 #info rackspace03 was added as a jenkins slave
15:07:15 eedri: anything else ?
15:07:43 knesenko: single slave or with virtualisation?
15:08:02 ewoud: regarding the outage ?
15:08:37 knesenko: I meant is rackspace03 a single slave or is it a virtualisation host with virtual slaves on top?
15:08:51 ewoud: single slave
15:08:55 knesenko: ok
15:09:00 knesenko, i think we're not using it too much now
15:09:18 knesenko, rackspace03 i mean, we tried using it for building engine, but now that we solved the build issue (open files)
15:09:25 knesenko, we can utilize it more
15:09:32 eedri: ok
15:09:33 #info rackspace03 added as a bare metal jenkins slave
15:10:27 anything else on hosting ?
15:11:35 moving to Foreman and Puppet then
15:11:44 #topic Foreman and Puppet
15:11:50 q to ewoud
15:12:07 ewoud: can we easily reprovision slaves from the foreman UI ?
15:13:00 all: hi, relative newbie here, gonna hang out and see what you're covering
15:13:51 ccowley, hey, welcome!
15:14:06 ccowley: welcome to the infra session!
15:14:29 ccowley: welcome
15:14:30 knesenko: we should
15:14:44 ccowley: welcome
15:14:57 ewoud: did you try it ?
15:15:00 ccowley: any specific interest?
15:15:23 knesenko: not in ovirt infra foreman
15:15:28 knesenko: hardware or virtual?
15:15:37 ewoud: virtual ....
15:15:49 ewoud: and I assume we'll need both at some point
15:16:19 ewoud: but a pxe solution won't work for us, right, since we are talking about different networks here
15:16:34 ewoud: Many, Puppet and Foreman probably primarily, but I am pretty broad (and deep in many subjects).
15:16:48 knesenko: we could deploy a smartproxy which manages PXE there
15:16:53 ccowley: sounds familiar :)
15:16:59 ewoud: Currently consulting on an OpenStack project in the day job to give you an idea
15:17:10 ccowley, nice..
15:17:26 ccowley: and timezone wise?
15:17:57 ewoud: GMT+1 (France, but I am English)
15:18:02 knesenko: but wasn't there a template in rackspace ovirt to deploy?
15:18:16 ewoud: I have no idea ...
15:18:21 eedri: ^^ ?
15:18:32 * eedri reading
15:18:36 ok never mind, I am just asking ....
15:18:51 ewoud, we don't have space
15:18:51 ccowley: I think we're mostly in that timezone as well, so that's convenient
15:18:54 ewoud, for templates
15:18:54 anyway we will plan for new infra and will do it properly
15:18:59 ewoud, that's the problem...
15:19:12 ewoud, vms wouldn't start... due to the limitation of 10% free space
15:19:26 ewoud, this is the reason we wanted to move to gluster storage
15:19:43 ewoud, but then hit issues with rackspace...
15:19:46 eedri: ewoud also we need to create a puppet class to clean the jenkins slaves' workspaces, right ?
15:19:59 knesenko: don't think so
15:20:05 knesenko: you may be thinking of cleaning /tmp
15:20:08 ewoud, knesenko maybe we should revisit this issue, since it might take some time to get the new hardware
15:20:22 ewoud, knesenko surely a couple of months at the least
15:20:31 knesenko: eedri what's the concrete use case of reprovision now?
15:20:47 ewoud, what knesenko is saying is that we have rackspace vm slaves with not enough space
15:20:55 upgrade from f18 to f20 through reinstall?
15:20:58 ewoud, and you can't control how many jobs will run on it
15:21:15 ewoud, so we can add a cronjob via puppet (ugly) to clean old workspaces (3 days old?)
15:21:37 ewoud, or if there is another way of limiting a certain slave's space for workspaces via jenkins
15:21:49 eedri: jenkins has no built in mechanism for this?
15:21:58 ewoud, i think till now we used isos
15:22:20 ewoud, not sure, i think it can warn or take the slave offline if it doesn't have enough space
15:22:39 ewoud, but i'm not sure it will actively remove data in the workspace
15:22:48 ewoud, or delete old workspaces
15:23:57 thoughts ?
15:24:03 I'm confused, do we have 2 issues now?
15:24:05 knesenko, ewoud https://wiki.jenkins-ci.org/display/JENKINS/Workspace+Cleanup+Plugin
15:24:17 there was one of reprovision and another of filling up slaves?
15:24:23 we could use this, but it will add more time per build (i.e. delete workspace after build is done)
15:24:26 or is it the same issue?
15:24:30 ewoud, different issues
15:25:16 so better to do it periodically on the slave or via the master with a groovy script
15:25:44 ewoud, unless you have another proposal
15:26:01 eedri: I like the plugin with post-build cleanup
15:26:15 ewoud, only downside is it will make builds run longer
15:26:23 won't be 100% failsafe, but it sounds like the easiest short term solution
15:26:28 ewoud, yea
15:26:38 ewoud, we can try it out and see how much time it adds
15:26:46 but other scripts might interfere with jenkins actually running
15:26:53 eedri: ewoud so we agreed on trying that plugin ?
15:26:56 ewoud, the long term solution is adding more slaves, or scheduling reprovisioning of slaves
15:27:05 eedri: +1
15:27:13 ewoud, not sure, if they clean only very old dirs, like a few days old
15:27:21 ewoud, but using the plugin is safer
15:29:13 eedri: ewoud so plugin then ?
15:29:24 knesenko, let's try it
15:29:30 knesenko, add it to the todo list
15:29:56 #info try to use a workspace cleanup plugin for jenkins slaves
15:30:02 #info https://wiki.jenkins-ci.org/display/JENKINS/Workspace+Cleanup+Plugin
15:30:11 anything else on puppet/foreman ?
15:30:15 knesenko, another issue is the gerrit hooks on gerrit
15:30:19 knesenko, not sure it's related to puppet/foreman
15:30:32 eedri: not related ...
15:30:36 and dcaro is not here ...
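For context, a minimal sketch of the cron-based fallback eedri mentions above (the meeting agreed to try the Workspace Cleanup Plugin first, so this is only the "ugly" alternative if the plugin adds too much time per build). The workspace path and the 3-day threshold are assumptions, not agreed values:

    #!/bin/bash
    # Assumed Jenkins workspace root on a slave; adjust per host.
    WORKSPACE_ROOT=/home/jenkins/workspace
    # Remove top-level workspace directories not touched for more than 3 days.
    find "$WORKSPACE_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} +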
15:30:48 knesenko, yea, anyway it's worth adding an open issue
15:30:49 but we can discuss it after the Jenkins topic
15:30:52 knesenko, +1
15:30:56 #topic Jenkins
15:31:00 eedri: hello :) :)
15:31:12 knesenko, ok, few issues i'm aware of
15:31:31 knesenko, 1st - i changed the default behavior of the gerrit trigger plugin to not fail on build failure
15:31:47 knesenko, not sure why we didn't do it till now, it will prevent false positives on patches failing on infra issues
15:31:57 knesenko, so now jenkins will only give -1 on unstable builds
15:32:45 eedri: +1
15:32:47 knesenko, 2nd, like i said there are some open issues with the new hooks installed, regarding bug-url, so dcaro should look into that once he's back
15:33:05 knesenko, i think there should also be a wiki describing all existing hooks and their logic
15:33:14 maybe there is one and i'm not aware of it
15:33:36 #action dcaro create a wiki about gerrit hooks and their logic
15:33:55 another issue was strange git failures..
15:34:13 knesenko, which people sent to infra, not sure if all of them were caused by loop devices on rackspace vms
15:34:19 but should also be looked into
15:34:40 fabiand, i remember that some ovirt-node jobs were leaving open loop devices right?
15:34:46 which forced us to reboot the slave
15:34:48 eedri: correct
15:34:56 eedri: I remember that too
15:35:19 knesenko, there was also an SELinux issue, not sure if it's resolved yet
15:35:20 eedri, in some circumstances that can happen yes, but also the ovirt-live job has this risk
15:35:25 I see those orphan loop devices when I get build failures ... perhaps a wrapper to do clean up is in order?
15:35:29 knesenko, it was one of the minidells
15:35:44 orc_orc, can it be cleaned while the host is up?
15:35:46 orc_orc, w/o reboot?
15:35:52 eedri: ues
15:36:01 yes ... sorry -- broken typing hand
15:36:02 orc_orc, i believe that the test should handle it
15:36:09 orc_orc, and a post cleanup phase
15:36:13 eedri, rbarry is working on docker support, maybe that will help - in the Node case - with the loop device problem temporarily
15:36:20 orc_orc, usually each job should be independent
15:36:24 and not affect the slave for other jobs
15:36:28 you check to see if anything holds it open, and if not, can remove it
15:36:36 eedri, orc_orc - the host needs to be rebooted when there are orphaned loop devices
15:36:51 orc_orc, so each resource the job creates -> it should remove at the end
15:36:55 * fabiand had this check in some ovirt-node jobs, but back then no one was interested ..
15:37:03 eedri, sometimes that is just not possible
15:37:14 fabiand, hmm
15:37:21 eedri, livecd-tools is quite good at removing orphans and it very often does, the problem is that in some cases it fails ..
15:37:27 but those cases are very hard to catch ..
15:37:30 fabiand, so maybe the ideal solution for ovirt-node is to reinstall the vm each time it runs?
15:37:37 fabiand, but that's not possible yet
15:37:42 fabiand, with our infra
15:38:04 eedri, we can limit the number of times ovirt-node is built, that will reduce the risk of getting orphans
15:38:06 fabiand, still need a jenkins plugin for ovirt or foreman + connection to provision vms on the fly
15:38:15 yep, that would be great ..
15:38:39 eedri, in the long term our build system will change, then the risk is mitigated ..
15:38:46 as we will use VMs to build node .
15:38:50 fabiand, ok
15:38:51 eedri: what blocker prevents spinning up a new VM per build from a gold master, and tearing it down later, per build?
15:38:53 for now there ain't much ..
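A minimal sketch of the cleanup wrapper orc_orc suggests above: check whether anything still holds each loop device open and detach it if not. This is best effort -- as fabiand notes, some orphans (e.g. from a cancelled livecd-tools run) still require a reboot -- and where to hook it in (post-build step vs. cron) is an assumption:

    #!/bin/bash
    # Walk all configured loop devices and try to detach the ones nothing uses.
    for dev in $(losetup -a | cut -d: -f1); do
        # fuser -s exits non-zero when no process holds the device open
        if ! fuser -s "$dev" 2>/dev/null; then
            # losetup -d refuses busy devices (e.g. still mounted), so this is safe
            losetup -d "$dev" && echo "detached $dev"
        fi
    done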
15:39:04 orc_orc, well
15:39:09 eedri, we could have a dedicated vm for node building. then the VM could reboot after each build ..
15:39:11 orc_orc_, which vm would you like to spin?
15:39:20 orc_orc, are you talking about a jenkins slave?
15:39:27 orc_orc, or the job itself to add a vm?
15:39:50 fabiand, that's also a possibility
15:40:03 fabiand, we'll need to see how many vms we have, not sure the current infra can support it
15:40:34 eedri, ack - a global note: the orphans are more likely to appear when a job with livecd-tools is canceled (ovirt-node or ovirt-live)
15:40:52 eedri: the last listed .. a job to spin up a VM and then move into it to build, with a teardown when done
15:40:52 orc_orc, if you want the master jenkins to spin vms on demand, then you need an api for the relevant cloud service
15:41:26 orc_orc, ok, that means that we need the job to run on a baremetal slave
15:41:37 orc_orc, and spin a vm via ovirt/libvirt ?
15:41:55 eedri: yes
15:42:14 orc_orc, needs coding to do that, i think fabiand has something with igord
15:42:16 eedri, orc_orc_ - once we are at that point we can also do igor testing (functional testing of node)
15:42:22 :)
15:42:28 orc_orc, we can try doing that on the minidells
15:42:40 orc_orc, since they are the only baremetal hosts we have, or on rackspace03
15:42:47 eedri, can't we hack our hosts to support nesting - should be fine if they are AMDs
15:42:57 fabiand, we can
15:43:02 not sure we have AMD
15:43:02 I had forgotten igor although I did a CO ... i will try this locally
15:43:06 do we ?
15:43:17 fabiand, but i'm not sure we want to do it on our minidells while the network to tlv is 10mb
15:43:18 problem w nesting is performance I thought
15:43:28 I think it's working with intel as well, but AMD seems to be a bit more mature ..
15:43:29 fabiand: I do nesting on Intel's too, it is no problem
15:43:48 orc_orc_, but IMO performance is not the critical point here ..
15:43:56 fabiand: ok
15:43:59 fabiand, as long as it's not hogging the build
15:44:03 eedri, agreed - that's why I wanted to bring in the nested thing ..
15:44:06 fabiand, and causing the queue to build up
15:44:23 fabiand, the whole infra is in kind of a "halt" status
15:44:32 I don't think the performance is that bad. software emulation would be bad, but nesting should be ok ..
15:44:32 fabiand, limbo if you may call it
15:44:45 eedri, but yes - we can also change the node build schedule ..
15:44:47 :)
15:44:59 fabiand, since on the one hand we decided we might migrate out of rackspace, but we didn't get new hardware yet
15:45:05 eedri: If I set up a short term unit w 72G ram and 6T of disk, but with low bandwidth, would this be useful?
15:45:22 orc_orc, anything will be useful for jenkins.ovirt.org :)
15:45:25 I have one spare sitting not yet in production
15:45:46 orc_orc, it doesn't have to be open for ssh either, you can connect it with jnlp
15:45:51 like the minidells
15:46:21 eedri: it would need to locally mirror the git etc, as I could not take the load of repeated pulls
15:46:38 there is ssh access through two paths
15:46:57 but it is otherwise NAT isolated
15:47:24 C6 or F19 or 20 base preferred?
15:47:48 i think c6 is better
15:47:59 me too, but I am prejudiced :)
15:48:43 I cannot get to setting it up until next Monday but will do so then
15:48:56 orc_orc, no problem
15:49:10 if RHEL 7 drops, would you prefer the beta instead?
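For reference, a rough sketch (run as root) of how nested KVM could be checked and enabled on the bare-metal hosts discussed above; the module differs per CPU vendor, and none of this was agreed in the meeting:

    #!/bin/bash
    # Pick the KVM module based on CPU flags (vmx = Intel, svm = AMD).
    if grep -q vmx /proc/cpuinfo; then mod=kvm_intel; else mod=kvm_amd; fi
    # 'Y' (or '1') means nested virtualization is already enabled.
    cat /sys/module/${mod}/parameters/nested
    # Enable persistently; takes effect after reloading the module or rebooting.
    echo "options ${mod} nested=1" > /etc/modprobe.d/kvm-nested.conf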
15:49:20 knesenko, ewoud we may need to do a meeting on infra status and what we can do in the meantime
15:49:25 until new hardware is in place
15:49:27 orc_orc, yea
15:49:36 orc_orc, rhel7 might be great, there is an open ticket on it
15:49:39 eedri: ok -- I have most of that rebuild solved
15:49:50 eedri: +1 on meeting
15:50:14 ewoud, maybe we should also schedule a recurrent monthly meeting
15:50:22 to handle long term issues or tickets
15:50:36 eedri: sounds like a good idea
15:51:00 #info orc_orc_ to provision an isolated testing unit, preferably on rhel 7
15:51:05 eedri: or we can dedicate the first or last 10 minutes of this meeting to long term issues.
15:51:25 hi buddies, i get a Caught event [NETWORK_UPDATE_VM_INTERFACE] from another product when creating a 2-nic vm. could it be caused by ovirt ?
15:51:52 doron, yea, but past experience showed we end the meeting before we can review tickets, for e.g.
15:51:56 karimb: there is a meeting active ... please stand by
15:52:40 eedri: some meetings take longer, and that's fine. getting everyone together is not an easy task.
15:53:01 so as long as we do it here, we should be able to clarify the relevant issues.
15:53:19 doron, ok
15:54:15 ok, let's continue
15:54:26 knesenko, can you add info/action to what we agreed
15:54:54 eedri: I lost you ... was in the middle of an ovirt-node update for fabiand
15:55:04 knesenko+
15:55:10 eedri: you can do it as well
15:55:14 eedri: #action
15:55:52 #action orc_orc will try to add an additional jenkins slave, possibly rhel7 beta
15:56:25 #action agreed to try and think about adding nested vms or spawning vms on baremetal slaves
15:56:50 these might be worth adding as trac tickets to follow up
15:57:18 * nod * as to trac -- I am likely to need help in getting local mirroring set up
15:57:35 I do not know how all the moving parts fir together
15:57:38 did we agree to have a section in the meeting for long-term infra issues?
15:57:42 ... fit ...
15:57:44 or we'll schedule another meeting
15:57:49 orc_orc, sure
15:58:09 eedri: I think we can do it in this meeting as well
15:58:21 there is already a ticket for nesting: https://fedorahosted.org/ovirt/ticket/78
15:58:34 fabiand, +1
15:58:53 fabiand, so it's just a matter of deciding how to push the infra, considering our current status
15:59:12 If I understood you correctly: yes
15:59:13 :)
15:59:57 fabiand: any particular reason to run with nesting, rather than something thinner (LXC?)
16:00:38 ccowley, yep - the loop device orphans - IIUIC the orphans will not vanish when we use lxc ..
16:01:41 #action check offline jenkins slaves on rackspace and re-enable/reprovision
16:01:52 fabiand: OK, valid point - I am not fully up to speed with these things yet :-)
16:02:13 ccowley, :)
16:03:08 ok, let's continue
16:03:31 * ewoud semi afk due to other work
16:03:34 ping me if needed
16:03:34 knesenko, you want to talk about the build system?
16:03:42 eedri: we are out of time ...
16:03:53 eedri: and I don't think it's related
16:04:13 knesenko, i would like to spend a few min reviewing the trac tickets
16:04:20 knesenko, if people are willing to stay
16:04:39 I am here
16:04:47 orc_orc, doron ewoud ?
16:05:02 still here
16:05:04 eedri: I'm here for a while longer
16:05:16 still here all day
16:05:19 Hanging on
16:05:20 knesenko, ok.. so let's do a quick scan
16:05:33 #topic Review tickets
16:06:26 #link https://fedorahosted.org/ovirt/report/1
16:07:10 suggestion - maybe worth doing the meeting with bluejeans/hangout?
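As a sketch of the JNLP option eedri mentions for the NAT-isolated slave orc_orc is offering in the action above: the slave dials out to the master instead of accepting inbound ssh. The node name and secret below are placeholders to be read from the node's page on jenkins.ovirt.org once the node is created:

    # download the agent jar from the master and connect out over JNLP
    wget http://jenkins.ovirt.org/jnlpJars/slave.jar
    java -jar slave.jar \
        -jnlpUrl http://jenkins.ovirt.org/computer/NEW-SLAVE/slave-agent.jnlp \
        -secret <secret-from-node-page>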
16:07:12 a date sort of most recent first is probably most useful, to triage from?
16:07:26 so we can review the tickets for e.g.
16:07:29 orc_orc, +1
16:08:02 eedri: irc is not enough ? :)
16:08:25 how about closing the infra meeting, and restarting as a triage meeting?
16:08:36 eedri: IRC is SOOOOO 90s, all the cool kids are on Hangouts
16:08:47 ccowley, or blue jeans
16:08:48 ccowley: but it leaves no useful log
16:09:25 knesenko, orc_orc looking at the list of open tickets, that might take some time
16:09:32 eedri: never heard of blue jeans (I am old ... not really).
16:09:40 eedri: only one way to eat an elephant
16:09:44 might be better to review them offline and maybe continue on the list
16:09:49 or do a follow up meeting
16:10:03 cause i see some are opened by dcaro
16:10:03 orc_orc_: true
16:10:05 and he's not around
16:10:16 unless there is a specific ticket anyone wants to talk about?
16:10:18 http://bluejeans.com/trial/video-conferencing-from-blue-jeans?utm_source=google&utm_medium=cpc&utm_term=bluejeans&utm_campaign=Brand_-_BlueJeans_-_Exact&gclid=CKTbuZPg9rwCFbFaMgodxBcAfg seems to be yet another non-free vidconf system
16:10:19 and it is urgent
16:10:55 we didn't hear from aline
16:11:03 from kimchi
16:11:16 on power pc hardware/vms
16:11:17 strangely there is not a priority column in that canned trac report query
16:11:28 * eedri doesn't fancy trac too much
16:11:53 that's why i suggested a separate meeting for the tickets, seems it might take a while
16:12:10 eedri: there are worse ;)
16:12:26 i think some can be closed
16:12:27 for example
16:12:30 https://fedorahosted.org/ovirt/ticket/100
16:12:31 C moved from bugzilla to Mantis, which is REALLY bad
16:12:33 this was created
16:12:36 JIRA is nice
16:12:38 or reminde
16:13:21 eedri: Jira is great, even if you have to pay for it
16:13:25 https://fedorahosted.org/ovirt/ticket/104 - this i think can also be closed
16:13:34 but waiting for david
16:13:50 +1 close 100
16:13:52 knesenko, maybe add an action to send an email to the list? so people can review tickets and update status
16:13:58 orc_orc, closed
16:15:26 knesenko, can we close https://fedorahosted.org/ovirt/ticket/72
16:17:14 this looks like a better form for the canned query, but I cannot seem to save it: https://fedorahosted.org/ovirt/query?status=assigned&status=new&status=reopened&col=id&col=summary&col=status&col=owner&col=type&col=priority&col=milestone&group=priority&order=priority
16:17:16 eedri: yes close
16:17:51 I am running an oVirt cluster with iscsi as my shared storage. Creating, running, migrating, deleting, and HA on guests works great. As soon as I try to create a template from a powered off guest, my iscsi datacenter goes inactive for 10 minutes before the template creation fails/quits. Has anyone else experienced behavior like this?
16:17:53 https://fedorahosted.org/ovirt/ticket/48
16:17:54 ok let's end the meeting. Seems like it will never end :)
16:17:57 #endmeeting