I wrote a while back about growing pains with Chef, which is the newish, hyped-up system management tool. I’ve been having a couple of other frustrations with it in the past few months and needed a place to gripe.
The first issue started a couple of months ago, when some systems began restarting Splunk every single time Chef ran. It may have been going on longer than that, but that’s when I first noticed it. After a couple of hours of troubleshooting I tracked it down to Chef seemingly randomizing the order of the attributes for the configuration, which resulted in it writing a new configuration (the same configuration, just in a different order) every time and triggering a restart. I think it was isolated primarily to the newer version(s) of Chef (maybe specific to 0.10.10). My co-worker, who knows more Chef than I do (and the more I use Chef the more I really want cfengine; disclaimer: I’ve only used cfengine v2 to date), says after spending some time troubleshooting it himself that the only Chef solution might be to somehow set the order of the attributes in a static fashion (probably some Ruby thing that lets you do that? I don’t know). In any case he hasn’t spent time on doing that and it’s over my head, so these boxes just sit there restarting Splunk once or twice an hour. They make up a small portion of the systems; the vast majority are not affected by this behavior.
So this morning I am alerted to a failure in some infrastructure that still lives in EC2 (oh how I hate thee). Turns out the disk is going bad and I need to build a new system to replace it. So I do, and Chef spits out one of its usual helpful error messages:
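For what it’s worth, the "set the order in a static fashion" idea could look something like the sketch below. This is plain Ruby, not our actual cookbook and not any Chef API: the render_stanza helper, the stanza name and the settings are all made up for illustration. The point is only that sorting the keys before writing them out gives the same text no matter what order the attributes arrive in, so nothing would see a "changed" file.

```ruby
# Hypothetical helper (not part of Chef or Splunk): render a config stanza
# with its keys in sorted order, so the output does not depend on the order
# in which the settings hash happens to be iterated.
def render_stanza(name, settings)
  lines = ["[#{name}]"]
  settings.sort_by { |key, _value| key.to_s }.each do |key, value|
    lines << "#{key} = #{value}"
  end
  lines.join("\n") + "\n"
end

# The same settings supplied in two different orders (illustrative values).
first  = render_stanza("monitor:///var/log/myapp",
                       { "sourcetype" => "myapp", "index" => "main" })
second = render_stanza("monitor:///var/log/myapp",
                       { "index" => "main", "sourcetype" => "myapp" })

puts first
puts "identical: #{first == second}"
```

If the rendered text is identical from run to run, there is no new file to write and so no reason to restart anything. Whether that can be wired into the real template is the part that is over my head.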
 [Tue, 29 May 2012 16:35:36 +0000] ERROR: link[/var/log/myapp] (/var/cache/chef/cookbooks/web/recipes/default.rb:50:in `from_file') had an error:
 link[/var/log/myapp] (web::default line 50) had an error: TypeError: can't convert nil into String
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:106:in `stat'
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:106:in `stat'
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:61:in `set_owner'
 /usr/lib/ruby/vendor_ruby/chef/file_access_control/unix.rb:30:in `set_all'
 /usr/lib/ruby/vendor_ruby/chef/mixin/enforce_ownership_and_permissions.rb:33:in `enforce_ownership_and_permissions'
 /usr/lib/ruby/vendor_ruby/chef/provider/link.rb:96:in `action_create'
 /usr/lib/ruby/vendor_ruby/chef/resource.rb:454:in `send'
 /usr/lib/ruby/vendor_ruby/chef/resource.rb:454:in `run_action'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:49:in `run_action'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `each'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:85:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection.rb:94
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:116:in `call'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:116:in `call_iterator_block'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:85:in `step'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:104:in `iterate'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection/stepable_iterator.rb:55:in `each_with_index'
 /usr/lib/ruby/vendor_ruby/chef/resource_collection.rb:92:in `execute_each_resource'
 /usr/lib/ruby/vendor_ruby/chef/runner.rb:80:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/client.rb:330:in `converge'
 /usr/lib/ruby/vendor_ruby/chef/client.rb:163:in `run'
 /usr/lib/ruby/vendor_ruby/chef/application/client.rb:254:in `run_application'
 /usr/lib/ruby/vendor_ruby/chef/application/client.rb:241:in `loop'
 /usr/lib/ruby/vendor_ruby/chef/application/client.rb:241:in `run_application'
 /usr/lib/ruby/vendor_ruby/chef/application.rb:70:in `run'
 /usr/bin/chef-client:25
So I went to look at this file. Line 50 looked perfectly reasonable; there haven’t been any changes to this file in a long time and it has worked up until now. What a TypeError is I don’t know (it’s been explained to me before but I forgot what it was 30 seconds after it was explained). I’m not a developer (hey, fancy that). I have seen it tons of times before though, and it was usually a syntax problem (tracking down the right syntax has been a bane for me in Chef, it’s so cryptic, just like the stack trace above).
So I went to the Chef website to verify the syntax, and yep, at least according to those docs it was right. So, WTF?
I decided to delete the user and group config values and ran Chef again, and it worked! Well, until the next TypeError. Rinse and repeat about four more times and I finally got Chef to complete. Now for all I know my modifications to make the recipes work on this version of Chef will break on the others. Fortunately I was able to figure this syntax error out; usually I just bang my head on my desk for two hours until it’s covered in blood and then wait for my co-worker to come figure it out (he’s in a vastly different time zone from me).
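As far as I can tell from the trace, the complaint is that something expected a string and got nil (nothing) instead. The sketch below is plain Ruby, not Chef’s code: it reproduces the same class of error by handing nil to File.stat, the call named at the top of the trace, and then shows a made-up missing_settings helper that would at least name the empty values. The setting names and values are hypothetical, and the exact wording of the message differs between Ruby versions.

```ruby
# Reproduce the class of error from the trace: File.stat wants a path
# string, so passing nil raises TypeError.
begin
  File.stat(nil)
rescue TypeError => e
  message = e.message
end
puts "TypeError: #{message}"

# Hypothetical helper: list every setting whose value is nil, so a failure
# can name the culprit instead of surfacing as a bare TypeError later on.
def missing_settings(settings)
  settings.select { |_key, value| value.nil? }.map { |key, _value| key }.sort
end

# Illustrative values only.
settings = { "owner" => "myapp", "group" => nil, "target" => nil }
puts missing_settings(settings).inspect
```

Something like that in the error output would have saved me four rounds of rinse and repeat.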
So what’s next? I get an alert for the number of Apache processes on this host, and that brings back another memory with regards to Chef attributes. I haven’t specifically looked into this issue again but am quite certain I know what the issue is, just no idea how to fix it. The issue the last time this came up was that Chef could not decide what type of EC2 (ugh) instance this system is, and there are different thresholds for different sizes. Naturally one would expect Chef to check to see what size it is; it’s not as if Amazon has the ability to dynamically change sizes on you, right? But for some reason Chef thinks it is size A on one run and size B on another run. Makes no sense. Thus the alerts when the threshold gets set for the wrong size. Again, this only seems to impact the newest version(s) of Chef.
I’m sure it’s something we’re doing wrong, or, if it was VMware, it would be something Chef was doing wrong before and is doing right now; what we’re doing hasn’t changed and now all of a sudden it is broken. I believe another part of the issue is that the legacy EC2 bootstrap process pulls in the latest Chef during the build, whereas our new (non-EC2) stuff maintains a static version: fewer surprises.
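A rough sketch of how I’d want the threshold lookup to behave, in plain Ruby. The instance sizes and numbers are placeholders, and apache_process_limit is a made-up name, not something from our recipes or from Chef. The idea is that an unknown or missing size fails loudly instead of quietly ending up with some other size’s threshold.

```ruby
# Hypothetical per-size thresholds; the sizes and limits are placeholders.
APACHE_PROCESS_LIMITS = {
  "m1.small" => 50,
  "m1.large" => 150
}.freeze

# Look up the alert threshold for an instance type. An unknown or nil type
# raises an error rather than silently falling back to another size's limit.
def apache_process_limit(instance_type)
  APACHE_PROCESS_LIMITS.fetch(instance_type) do
    raise ArgumentError,
          "no apache threshold for instance type #{instance_type.inspect}"
  end
end

puts apache_process_limit("m1.large")

begin
  apache_process_limit(nil)
rescue ArgumentError => e
  puts "refusing to guess: #{e.message}"
end
```

That wouldn’t explain why the size flips between runs, but it would turn a bogus alert into an obvious failure at the point where the size goes missing.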
Annoyed to have to come back from a nice short holiday to have to immediately deal with two things I hate to deal with – Chef and EC2.
This coming trip to Amsterdam will provide the infrastructure to move the vast majority of the remaining EC2 stuff out of EC2, so am excited about that portion of the trip at least. Getting off of chef is another project I don’t feel like tackling now since I’m in the minority as to my feelings for it. I just try to minimize my work in it for my own sanity, there’s lots of other things I do instead.
On an unrelated note, for some reason during a recent apt-get upgrade my Debian system pulled in what feels like a significantly newer version of WordPress, though I think the version number only changed a little (I don’t recall what the original version number was). I did a major Debian 5.0->6.0 upgrade a couple of months ago, but this version came in after that and has a bunch of UI changes. I’m not sure if it breaks anything. I think I need to re-test how the site renders in IE9, as I manually patched a file after getting a report that it didn’t work right, and the most recent update may have overwritten that fix.
I had the very same problem. I installed chef on Ubuntu 12.04 with apt. The chef code was placed under /usr/lib/ruby/vendor_ruby/chef.
It worked at first, then it broke suddenly (still don’t know why). Anyhow, installing chef as a gem (gem install chef --no-doc --no-ri) solved it for me.
Comment by Robin Wenglewski — June 26, 2012 @ 6:50 am
Thanks for the post! My co-worker mentioned to me that it was a new (perhaps known) bug in the version at the time (I don’t have the bug # handy). I haven’t tried to fix it (other than the workaround mentioned in the post); we’re really close to migrating all of those VMs off to another data center anyway, and I decided to keep the older, more stable version of Chef at that site.
I’m really surprised such a bug managed to get past whatever testers they have and into the production version. It’s got to be a trivial thing to catch, especially given how simple the configuration stanza is.
Comment by Nate — June 26, 2012 @ 7:14 am