vSphere vMotion and NTP ... DONG!

Have you ever run a VMware infrastructure where, almost every time a vMotion happens, the NTP daemon of the guest just goes BOING!?

Here I am, on my white horse, coming to help!
That’s because the source ESXi host and the destination host do not have the same time for some reason.

Ha! I can hear: “Yeah of course!!!” or “we already knew that … noob..”
Or… you would probably think: “but wait a minute, we disabled Time Synchronization between the host and the guest in vmware-tools!!!”

I’m sure you did, but you only disabled one of the synchronization processes.
VMware still syncs the guest time with the host time on other occasions, for various reasons.

For example, say your VM has been suspended for an hour. When you resume the machine, VMware will come and say: “This guest has been resumed, its time must be wrong! I’m going to update it!”
Well, you can actually bypass this and let your NTP daemon do that job!

First, here are the common reasons why VMware would sync your guest to the ESXi host (common because I don’t know if there are any others lol):

  • Periodically
  • After taking a snapshot
  • After reverting a snapshot
  • After migrating a guest to another host
  • After resuming a guest from suspend
  • After defrag of a vdisk
  • When the vmware-tools daemon starts up
  • After the host resumes from sleep

To make sure VMware won’t touch your guest time anymore, you need to disable all those triggers.

To do so, you need to edit the .vmx file of your VM (make sure to shut down your VM first) and add the following lines:

time.synchronize.continue = "FALSE"
time.synchronize.restore = "FALSE"
time.synchronize.resume.disk = "FALSE"
time.synchronize.shrink = "FALSE"
time.synchronize.tools.startup = "FALSE"
time.synchronize.resume.host = "FALSE"

Shaazzaaaam!!!
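
If you have a pile of VMs to fix, here is a minimal sketch that reports which of the six keys above are still missing from a .vmx file. The function name missing_time_sync_keys and the path in the usage note are made up for illustration, and I am assuming a value counts as "disabled" when it is FALSE or 0, quoted or not.

```shell
# The six keys from the list above, one per line.
TIME_SYNC_KEYS="time.synchronize.continue
time.synchronize.restore
time.synchronize.resume.disk
time.synchronize.shrink
time.synchronize.tools.startup
time.synchronize.resume.host"

# Hypothetical helper: print every key that is NOT set to FALSE (or 0)
# in the .vmx file given as first argument.
missing_time_sync_keys() {
  for key in $TIME_SYNC_KEYS; do
    awk -F= -v key="$key" '
      {
        k = $1; v = $2
        gsub(/[ \t"\r]/, "", k)   # strip spaces, quotes and stray CRs
        gsub(/[ \t"\r]/, "", v)
      }
      k == key && (toupper(v) == "FALSE" || v == "0") { found = 1 }
      END { exit !found }
    ' "$1" || echo "$key"
  done
}
```

Usage would look like: missing_time_sync_keys /path/to/your-vm.vmx (an empty output means all six triggers are disabled).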

Dell Single bit warnings error...

Hi penguins-working-on-dell-servers,

It’s been maybe 4 years since I promised myself to write a post about this …
Time flies I guess :D

Working on Dells, I would assume that you all have OpenManage installed on your servers.
Have you ever received an alert about DIMMs on Dell servers?

Looking at the logs you get:

# omreport chassis
Health
Main System Chassis
SEVERITY : COMPONENT
Non-Critical : Memory
Ok : Power Management
Ok : Processors
[...]
# omreport system alertlog
Alert Log
Alert Log contains...
Severity : Non-Critical
ID : 1403
Date and Time : Thu Feb 28 20:36:14 2013
Category : Instrumentation Service
Description : Memory device status is non-critical
Memory device location: DIMM_A4
Possible memory module event cause:Single bit warning error rate exceeded
[...]

You called the Dell call centre straight away, waited, and received an answer like: “Try this command … or this command … have you tried to power it off and on again?”

Well, I did a while ago! Just don’t fire at the Dell techs! It’s their job to ask you all those questions, as much as it is your job to get it resolved quickly.

Guess what, unless this becomes a real issue, you don’t need to call the Dell call centre anymore.

After all, this is a Non-Critical issue. It basically means that the ECC memory detected a single-bit error and corrected it.

So what I would advise is: calm down, drink some tea, and just clear the memory failures!

On Linux, you can clear them with:

# /opt/dell/srvadmin/sbin/dcicfg command=clearmemfailures

When you print the health of your server again, everything will be OK!!

But if this happens again on the same DIMM within a few weeks, your DIMM or its slot might have a real problem.
In this case, open your server and swap the memory in this slot with the memory from another slot.
For example, swap the DIMMs between slot 4 (in this case) and slot 8.

If the error then shows up on slot 8, the error followed the DIMM, so you have a bad memory module. If it happens again on slot 4, then your slot is bad, so you’ll need to change your motherboard … now you can call Dell!
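
To spot a DIMM that keeps coming back, here is a rough sketch that tallies the single bit warnings per DIMM. It assumes the alert log looks exactly like the excerpt above (a "Memory device location:" line followed by the "Single bit warning" cause line); the function name count_single_bit_warnings is mine, not something OpenManage ships.

```shell
# Hypothetical helper: read "omreport system alertlog" output on stdin
# and print "<DIMM> <number of single bit warnings>", one DIMM per line.
count_single_bit_warnings() {
  awk '
    /Memory device location:/ {
      sub(/.*Memory device location:[ \t]*/, "")   # keep only the DIMM name
      dimm = $0
    }
    /Single bit warning/ && dimm != "" {
      count[dimm]++
      dimm = ""
    }
    END { for (d in count) print d, count[d] }
  ' | sort
}
```

You would feed it with something like: omreport system alertlog | count_single_bit_warnings. A DIMM whose counter keeps growing after you cleared the failures is the one to swap.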

Hope it helps in your day to day stress :P

Live stats for dd ... everything is possible!

Good Afternoon!! … or morning … depends where you are :P

Well, in the computing world, it’s always shiny, happy, warm!! So, we don’t care about the time ^^
But sometimes, we do care … especially when you started a dd of gigabytes of data that is running forever!!

I’ll show you how to get stats even while the command is still running ..

You, as a dd user, certainly know that when this command ends, it prints some stats like:

15634841+0 records in
15634841+0 records out
8005038592 bytes (8.0 GB) copied, 583.169 s, 13.7 MB/s

But the manpage of dd states it very clearly: you can get live stats from dd by sending it a signal, SIGUSR1.

What you have to do is to open another terminal, get the PID of the running command, then use:

~# kill -USR1 <PID of dd>

In the terminal where your dd is running, you should get the stats output without stopping the process.
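
If you want to try it without risking a real copy, here is a small self-contained sketch. It assumes GNU dd on Linux (on BSD and macOS, SIGUSR1 kills dd and SIGINFO is the signal to use instead); the /dev/zero to /dev/null copy and the temporary log file are only there for the demo.

```shell
# Assumption: GNU dd on Linux.
stats_log=$(mktemp)

# A harmless endless copy that touches no real disk; dd writes its stats to stderr.
LC_ALL=C dd if=/dev/zero of=/dev/null bs=1M 2>"$stats_log" &
dd_pid=$!

sleep 1                   # give dd time to install its signal handler
kill -USR1 "$dd_pid"      # ask for live stats; the copy keeps running
sleep 1
kill "$dd_pid"            # stop the demo copy
wait "$dd_pid" 2>/dev/null || true

cat "$stats_log"          # the "records in / records out" lines printed on SIGUSR1
```

The first sleep matters: if the signal arrives before dd has set up its handler, the default action of SIGUSR1 terminates the process.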

Hype!

Configure network interfaces under FreeBSD.

Hello ladies and gentlemen,

As promised about one year ago, I am now opening the BSD category !!! What’s up!? ^^
So, today, I’ll teach you how to configure network interfaces under FreeBSD…

Yeah .. I know … some of you will think that I think you’re a bunch of noobs … but no! It’s just that I don’t think that all of you are BSD nerds …

Linux and FreeBSD are both Unix systems (actually Linux is not really a Unix system … but a derivative), so we could guess that the network configuration is the same. Well, not really.

We are still using the same command: ifconfig, but with a slight difference. Here’s the basic syntax of the command under FreeBSD:

ifconfig <netif> <net_class> <address> [netmask <netmask>]

Here we can see the different parameters that we need to give to configure our interface, but first we have to get the name of the interface. The net class would be inet for us (for IPv4).

Under Linux, it would have been ethX, raX or wlanX, but under BSD the naming is quite different. On Linux, the name is based on the link characteristics (ethernet …), but BSD creates the interface name according to the driver.

So you first have to get the name of the interface you want to configure, using:

# ifconfig -a
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 192.168.1.3 netmask 0xffffff00 broadcast 192.168.1.255
        ether 00:a0:cc:da:da:da
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet 127.0.0.1 netmask 0xff000000

Here we can see that we have the loopback interface (lo0) and a NIC (em0).
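
Notice that ifconfig prints the netmask in hexadecimal (0xffffff00), which is not how most of us write it. As a side note, here is a tiny sketch to turn it into the dotted form; the function name hex_to_dotted is just something I made up for the example.

```shell
# Hypothetical helper: convert a hexadecimal netmask such as 0xffffff00
# into dotted-decimal notation.
hex_to_dotted() {
  n=$(( $1 ))   # shell arithmetic understands the 0x prefix
  printf '%d.%d.%d.%d\n' \
    $(( (n >> 24) & 255 )) \
    $(( (n >> 16) & 255 )) \
    $(( (n >> 8) & 255 )) \
    $(( n & 255 ))
}

hex_to_dotted 0xffffff00   # prints 255.255.255.0
hex_to_dotted 0xff000000   # prints 255.0.0.0
```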

Note that on BSD, you can change the name of the interface using the following command:

ifconfig <netif> name <new_netif_name>

For example:

# ifconfig em0 name eth0
# ifconfig -a
eth0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 192.168.1.3 netmask 0xffffff00 broadcast 192.168.1.255
        ether 00:a0:cc:da:da:da
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet 127.0.0.1 netmask 0xff000000

So, now, to assign an IP address, we would use for example:

# ifconfig eth0 inet 192.168.0.24 netmask 255.255.255.0

That’s about it.

Now, let’s see how to configure the network interface automatically at boot.
You just have to edit the file /etc/rc.conf and add the corresponding lines:

ifconfig_em0_name="eth0"
ifconfig_eth0="inet 192.168.0.24 netmask 255.255.255.0"

Reboot, and your interface will then have the name eth0 with the corresponding IP.

Isn’t it easier than on Ubuntu (for Ubuntu users :P)? He he
Any comments appreciated ^^

Graphite and retentions

I’m Backkkk … again!!!

So here, we’ll talk about Graphite woo!
If you don’t know what Graphite is, have a look at: http://graphite.wikidot.com/

Data retention can be configured in Graphite.

The way Graphite does it is by using a pair: timePerPoint:timeToStore.

timePerPoint is the interval, in seconds, between two stored datapoints (so it is also the precision: values are averaged over that period).
timeToStore is the number of datapoints we want to keep.

The retention configuration is done through /opt/graphite/conf/storage-schemas.conf.

You can specify different retention settings for the same data by separating them with commas.
Example:

[mysql]
priority = 50
pattern = ^mysql\.
retentions = 60:43200,300:350400

This translates to: for any metric under mysql, keep 43200 datapoints at 1-minute precision and 350400 datapoints at 5-minute precision, which is:

  • 43200 / 60 / 24 ==> keep 1-minute precision datapoints for 30 days
  • 350400 / 12 / 24 / 365 ==> keep 5-minute precision datapoints for a little over 3 years

Also, be aware that the retention is written into the metric file when the first datapoint is received.
This means that if you decide to change the retention on a metric that you already have data for, you have to either:

  • Delete the metric file, and let Graphite recreate it
  • Resize the file with the whisper-resize.py command
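
If you don’t trust my mental arithmetic, here is a small sketch that does the conversion for you. It only understands the plain seconds:points pairs used in this post, and the function name describe_retentions is invented for the example.

```shell
# Hypothetical helper: turn "secondsPerPoint:points,..." into a duration in days
# (integer division, so the result is rounded down).
describe_retentions() {
  # Split the pairs on commas, then each pair on the colon.
  echo "$1" | tr ',' '\n' | while IFS=: read -r step points; do
    echo "${step}s x ${points} points = $(( step * points / 86400 )) days"
  done
}

describe_retentions "60:43200,300:350400"
# 60s x 43200 points = 30 days
# 300s x 350400 points = 1216 days
```

1216 days is roughly 3.3 years, which is where the "a little over 3 years" comes from.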

For example, here’s a file that doesn’t have the retention we want, which would be 1-minute precision for 30 days:

~ $ whisper-info.py /opt/graphite/storage/whisper/servers/host/network/if_eth0/down.wsp
maxRetention: 33955200
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 6791068
Archive 0
retention: 33955200
secondsPerPoint: 60
points: 565920
size: 6791040
offset: 28

To change retention use:

~ $ whisper-resize.py /opt/graphite/storage/whisper/servers/host/network/if_eth0/down.wsp 60:43200
Retrieving all data from the archives
Creating new whisper database: /opt/graphite/storage/whisper/servers/host/network/if_eth0/down.wsp.tmp
Created: /opt/graphite/storage/whisper/servers/host/network/if_eth0/down.wsp.tmp (518428 bytes)
Migrating data...
Renaming old database to: /opt/graphite/storage/whisper/servers/host/network/if_eth0/down.wsp.bak
Renaming new database to: /opt/graphite/storage/whisper/servers/host/network/if_eth0/down.wsp
~ $ whisper-info.py /opt/graphite/storage/whisper/servers/host/network/if_eth0/down.wsp
maxRetention: 2592000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 518428
Archive 0
retention: 2592000
secondsPerPoint: 60
points: 43200
size: 518400
offset: 28
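
The sizes in those outputs are easy to predict. Going by the numbers above, a whisper file seems to be a 16-byte header, plus 12 bytes of archive description per archive (hence the offset of 28 with a single archive), plus 12 bytes per datapoint. Here is a sketch under that assumption; whisper_file_size is a name I made up.

```shell
# Hypothetical helper: estimate the size of a whisper file.
# Arguments: the number of points of each archive.
# Assumed layout: 16-byte header + 12 bytes per archive + 12 bytes per point.
whisper_file_size() {
  archives=$#
  total_points=0
  for p in "$@"; do
    total_points=$(( total_points + p ))
  done
  echo $(( 16 + 12 * archives + 12 * total_points ))
}

whisper_file_size 565920   # prints 6791068, the file before the resize
whisper_file_size 43200    # prints 518428, the file after the resize
```

Handy to estimate how much disk a new retention will eat before you roll it out on thousands of metrics.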

Also … what if you did the resize but you don’t get any disk space back? Well, the resize command keeps the old data in a backup file: .bak
You can either delete the .bak file or use the command flag: --nobackup

That’s it!!
I hope this will help you manage your graphite data better!

check_snmp not working as it used to...

Hi All,

Another post to announce something big!!!

So my example is about check_snmp on a Citrix Netscaler … but the issue is valid for any OID that returns a string.

Basically, sometime between versions 1.4.13 and 1.4.15 of the nagios-plugins package (which contains check_snmp), check_snmp string parsing stopped working properly.

I was having trouble with our SNMP checks when checking for “Time of last state transition” on the Netscaler.
I verified other SNMP checks against the Netscaler and they were just fine.
So it wasn’t anything to do with the Netscaler.

I ran the command on version 1.4.16 of check_snmp and got this output:

/usr/lib/nagios/plugins/check_snmp -H -o NS-ROOT-MIB::haTimeofLastStateTransition.0 -C
SNMP OK - Timeticks: (1304917500) 151 days, 0:46:15.00 |

Whereas when I ran the exact same command on version 1.4.12, I had this output:

SNMP OK - Timeticks: (1304922200) 151 days, 0:47:02.00 | NS-ROOT-MIB::haTimeofLastStateTransition.0=Timeticks: (1304922200) 151 days, 0:47:02.00

That’s a first difference between the versions!

But that was the least of my concerns.

What I was concerned about was that when trying to use the critical threshold flag it wouldn’t work at all!!

/usr/lib64/nagios/plugins/check_snmp -H -o NS-ROOT-MIB::haTimeofLastStateTransition.0 -C -c 86400:0
Range format incorrect

WTF!?! HAAAAAA

Well, the fix was actually easy (after banging my head against the laptop a few times)!
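To trigger critical for anything under a certain value, the syntax has changed and is now: <value>: (nothing after the colon, no zero!). The old 86400:0 form has a start greater than its end, which is most likely why newer versions reject it.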
Example:

/usr/lib64/nagios/plugins/check_snmp -H -o NS-ROOT-MIB::haTimeofLastStateTransition.0 -C -c 86400:
SNMP OK - 1304963700 | NS-ROOT-MIB::haTimeofLastStateTransition.0=1304963700

And look!!! Even the output is fixed!!

Weird, isn’t it?
I didn’t look too much into it, but feel free to, and comment on this post with what you find!
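
For the record, here is how I read a threshold like 86400: (a minimum with nothing after the colon): the check alerts when the value is below the minimum. The sketch below only mimics that single case to make the semantics concrete; it is not how check_snmp is implemented, and below_minimum is a made-up name.

```shell
# Hypothetical helper: succeed (exit 0, meaning "alert") when the value is
# below the minimum of a "min:" range such as "86400:".
below_minimum() {
  value=$1
  range=$2
  min=${range%:}            # strip the trailing colon: "86400:" -> "86400"
  [ "$value" -lt "$min" ]
}

# The value from the example above is well over 86400, so no alert.
if below_minimum 1304963700 86400:; then echo "CRITICAL"; else echo "OK"; fi
# A value of one hour would raise the alert.
if below_minimum 3600 86400:; then echo "CRITICAL"; else echo "OK"; fi
```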

Get rid of defunct processes ... it's possible :P

Hi guys,

Has it ever gotten on your nerves that you can’t kill a process ??
It hasn’t happened to me :P But I was just wondering how to get rid of them.

I’ll teach you how to do it in this tip!

First of all, do you know what a defunct process is?
Under Unix, we call this kind of process a zombie.

We call them zombies because they are already dead, so we can’t kill them.
Note that zombies are not really processes. They are just entries in the process table that have not been cleaned up.

This happens when a parent process did not wait for its child to exit (it never called wait() to collect the exit status).

To get rid of one, find the parent of that process and kill it, so that the zombie gets a new parent, which happens to be init, and the latter will clean up its entry in the process table.

To find those defunct processes with their parent PID, use:

~# ps -f ax | sed '1p;/d[e]funct/!d'
UID PID PPID C STIME TTY STAT TIME CMD
root 339 333 0 Apr05 ? Z 0:00 [sh] <defunct>
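
If you prefer something easier to read than the sed incantation, here is an alternative sketch that filters on the process state instead of the word "defunct". It assumes the Linux procps ps (the -eo options differ on BSD), and list_zombies is a name I made up.

```shell
# Hypothetical helper: read "PID PPID STAT COMMAND" rows on stdin and
# report every process whose state starts with Z (zombie).
list_zombies() {
  awk '$3 ~ /^Z/ { print "zombie " $1 " parent " $2 }'
}

# The trailing "=" signs suppress the header line (Linux procps syntax).
ps -eo pid=,ppid=,stat=,comm= | list_zombies
```

With the zombie from the example above, this would print "zombie 339 parent 333".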

Then you can use the PPID of the process to check all its children by using:

~# ps --ppid 333
PID TTY TIME CMD
339 ? 00:00:00 sh <defunct>

If none of the children is a process you want to keep, you can just kill the parent:

~# kill 333
~# ps -f ax | sed '1p;/d[e]funct/!d'
UID PID PPID C STIME TTY STAT TIME CMD

Finished! Comments are welcome ;)

Zero padding in bash

Hi folks,

I have been asked how to do zero padding in bash … Just as the C language does it with printf, Bash also offers a printf command that has almost the same syntax as the one in C…

printf is used to format the display of some text.

If you need zero padding in Bash, you can use it as follows:

~# printf "%03d\n" 15
015

This looks at the number (15 in this example) and works out how many zeros need to be added at the beginning so that 3 numeric characters are displayed.

Another example would be:

~# printf "%03d\n" 1232
1232

The %d specifies that we want to display a number. Between the percent sign and the letter d, we can specify the format in which the number will be displayed. Here we force the display to print at least 3 numeric characters.

  • If the number we specify has fewer than 3 numeric characters, zeros will be added at the beginning of it so that it displays 3 numeric characters.
  • If the number we specify has 3 numeric characters or more, printf will just print the number.
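
If the width is not known in advance, printf also accepts * as the width and takes it from the arguments. Here is a small sketch; the function name zero_pad is made up for the example.

```shell
# Hypothetical helper: zero_pad <width> <number>
# The "*" in the format tells printf to read the width from the arguments.
zero_pad() {
  printf '%0*d\n' "$1" "$2"
}

zero_pad 3 15     # prints 015
zero_pad 3 1232   # prints 1232
zero_pad 5 42     # prints 00042
```

One thing to watch out for: a number that already has a leading zero (like 08) is read as octal by %d, so strip the leading zeros before padding again.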

Hope this could help ;)

Modify root environment using the right way under FreeBSD

Hello fellows!

Today, after a few months off … I’m back with a few tips that you will definitely love - or not.
Here, I’ll explain how to modify the root environment the proper way… What an easy task, but do you know how?

For example, /bin/csh is the default shell for root after installing FreeBSD.
But maybe some of you don’t like that shell and would rather use bash, so you installed it, and you want to modify the root settings so that bash is used when you log in.

Most of us will use the following command:

bsd # pw user mod -n root -s /usr/local/bin/bash

This will certainly work, but what if you boot your system in single-user mode?
If you installed your OS properly, / is on one partition and /usr on another one.

We all know that in single-user mode the other partitions are not mounted, so when the system boots it will error out with something like “/usr/local/bin/bash not found”, or whatever … I actually don’t remember the error message :D

I could try again and boot in single-user mode to find out the real message, but I’m wayyy too lazy loool.

So you will have to choose another available shell… soooo boring…

So here comes the magic of BSD!!! Watch carefully /etc/passwd:

~# cat /etc/passwd | awk -F ':' '$3 == 0 {print}'
root:*:0:0:Admin Tribute:/root:/bin/csh
toor:*:0:0:Bourne-again Superuser:/root:

toor also has the UID 0! And yeah, forget about root!
We always say that root is evil, so use toor to make the shell modification and log in with it!!

~# pw user mod -n toor -s /usr/local/bin/bash

Well … toor is as evil as root but shhhh!!! Don’t tell my followers :P

APC 7920 Lost Admin Password!!!

Argh!!! We lost the admin password of our APC (PDU)…
The first thing that a normal and respectable system or network administrator would do is read the manual to see if there is a section called “Lost Password Recovery”.

Happily, you will find it … but you will quickly sink into a deep abyss …

Why do I say that? It’s because the manual will say: “Connect to the APC via the Serial connector… bla bla bla … and then, press the reset button and try to login again with the default admin user / pass: apc / apc, within the next 30 seconds.”

You could be as lucky as you want, it won’t work!

So here are the real instructions:

  1. Connect via serial to the APC (use either HyperTerminal on Windows or minicom on Linux)
  2. Press and hold the Reset button on the back of the APC for 10 seconds
  3. Release it, and immediately press it a second time for just a second
  4. Now, go back to your terminal, and try to log in as apc / apc

It should just magically work, and guess what? Any configuration you had previously done on your APC has not been changed.
Just the admin password has been reset. Isn’t it great? ^^