Thursday, November 8, 2007

The previous 48 hours

Were eventful to say the least :)
Started off with an emergency puzzle-website-making spree by us on orders from our great Techkriti team. Our work was to design the website and think up puzzles for its content. In 24 hours. :P
This was for the online event TesseracT, for which I was supposedly one of the three coordinators (the other two were Utkarsh and Teja). The three of us sat up through the night, collected questions, built the website, and then went about our normal day routine without sleep. We thought we'd just finish up what was left and launch the website at night - pretty much everything was ready.
I asked the students' server admin, Naresh, to install mod_perl on the server in the afternoon, since the submit script between the static xml+xsl pages used Perl to check the answers. He was somewhat busy, but was able to start setting it up. In the end, the website came up half an hour late due to a variety of problems (and some frantic debugging) on our side and the server side.

The real trouble started when the website did come up.
Now, here I must tell you that the students' server is the biggest piece of crap in the server room in the Computer Center. It has 500MB of.. swap and 256MB of RAM. It runs FreeBSD, and that was probably the reason why it ran at all in the hours of agony that followed.

Within minutes of the site being released, we found that the server had become excruciatingly slow. At first we thought it must be purely because of the sudden increase in hits, but we soon realised something was horribly wrong. The first culprit that came to mind was the Perl script that we were using. We checked the code and came to the conclusion that nothing could be wrong with it; all it did was a foreach over a hash with 10 entries :P Running it on another server showed no abnormal behaviour; Apache never went above 0.1% CPU.

In the meantime, the server had stopped responding to anything but pings, and was refusing new ssh connections. And the bloody thing was locked inside a server room, so we didn't have physical access to it at all. I was the only one with an ssh connection open to the server, and that too by pure chance - Praneeth, the other admin (Flash, video, sound, and overall suckiness warning for his blog), had uploaded my ssh key onto the techkriti account so I could upload the TesseracT website (pure chance because the uploading would normally have been done by the webteam). Unfortunately, the user 'techkriti' was not in the wheel group, and hence could not `su -`, so the account was practically useless. The only thing I could run was `top`, which showed us that the machine was out of memory :|


Naresh: are you logged into students
?
me: Yeah
no commands are working
Naresh: fine
me: they take forever to execute
hey
I have top running
it shows 1% CPU :|
Naresh: ?!
me: Yes.
CPU states: 4.0% user, 0.0% nice, 20.1% system, 0.0% interrupt, 75.9% idle
Mem: 134M Active, 27M Inact, 83M Wired, 608K Cache, 35M Buf, 656K Free
Swap: 487M Total, 483M Used, 3692K Free, 99% Inuse, 1704K In, 332K Out
memory
Naresh: fuck
me: ?
Does it have htop?
Naresh: no
out of memory right?
me: Yeah
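
For the curious, here is a minimal sketch of how the numbers in that `top` output add up. It only parses the one `Swap:` line pasted above; the function names `to_kib` and `swap_usage` are made up for illustration and have nothing to do with what we actually ran that night.

```python
import re

# Multipliers to convert top-style size suffixes to KiB.
UNITS = {"K": 1, "M": 1024, "G": 1024 * 1024}


def to_kib(value: str) -> int:
    """Convert a top-style size such as '487M' or '3692K' to KiB."""
    return int(value[:-1]) * UNITS[value[-1]]


def swap_usage(line: str) -> float:
    """Return the fraction of swap in use, read from a 'Swap:' line."""
    pairs = re.findall(r"(\d+[KMG]) (Total|Used|Free)", line)
    fields = {name: to_kib(size) for size, name in pairs}
    return fields["Used"] / fields["Total"]


# The exact line from the chat log above.
line = "Swap: 487M Total, 483M Used, 3692K Free, 99% Inuse, 1704K In, 332K Out"
print(f"{swap_usage(line):.0%}")  # 99%
```

With 483M of 487M swap used and only 656K of RAM free, the box had nowhere left to put anything, which is why every command took forever even though the CPU was mostly idle.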


Arun called me up after a while asking what had happened. After we discussed the situation, he suggested we try `ssh root@localhost`. That sounded like an excellent idea at the time, but for some reason, we did an `su praneeth` instead (Praneeth had come over to where I was in the meantime), and then did an `su -`. Mind you, both these commands took ~30 mins to run. While we were waiting for them to finish, we set up TesseracT on an internal server so that at least the campus people could play it.
Oh, and in hindsight, `ssh root@localhost` would not have worked since the machine was refusing ssh connections :)

When we finally had root access on the server, we issued an `apachectl stop`, which took ~1.5 hours to run. And when it had finished, something very strange happened: mod_perl stopped working, and the main Techkriti page came back up. This was quite a disaster actually, as hundreds of people who were waiting for next.pl to load suddenly saw the file's source load into their browsers, and thanks to my stupidity, saw the answers as well ~_~
Arun told me soon afterwards (as I had already realised, with an immense sense of stupidity, when mod_perl stopped working) that I should have put the answers in a separate file. Oh well, lesson learnt for the main event (this was just the prelims with no official registration or prize money).
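
A minimal sketch of that lesson, in Python rather than the Perl we actually used: the checking script carries no answers itself and reads them from a separate file instead. Storing SHA-256 digests instead of plain answers is my own extra precaution here, not something Arun suggested, and the file name `answers.json`, the level key and the sample answer are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def digest(answer: str) -> str:
    """Hash a normalised answer so the file never holds plain text."""
    return hashlib.sha256(answer.strip().lower().encode("utf-8")).hexdigest()


def check_answer(answers_file: Path, level: str, submitted: str) -> bool:
    """Compare a submission against the stored digest for a level."""
    stored = json.loads(answers_file.read_text(encoding="utf-8"))
    return stored.get(level) == digest(submitted)


# Demo with a throwaway file; in real use the answers file would live
# outside the web server's document root so it can never be served.
with tempfile.TemporaryDirectory() as tmp:
    answers_file = Path(tmp) / "answers.json"
    answers_file.write_text(json.dumps({"1": digest("hypercube")}), encoding="utf-8")
    print(check_answer(answers_file, "1", " Hypercube "))  # True
    print(check_answer(answers_file, "1", "cube"))  # False
```

This way, even if the script gets served as plain text again, all anyone sees is the checking logic and a file path.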

After this, I ran `top` as root and saw that stocksim (a crappy Java program, run by the Business Club people for online stock thingies) was one of the culprits - it was using >225MB of memory. The other culprit, as Praneeth theorised, was mod_perl itself. He said he had read that mod_perl on FreeBSD was buggy and prone to high memory usage.

We decided to kill stocksim and take our chances with the mod_perl again. As Naresh was working on the thing, something happened, and the server went down, and this time, it _really_ went down - I'm not sure what happened, but the end result was that now we could do nothing about the server.

Arun and I decided to give techkriti's IP address to the temporary server till the next morning. Arun did all the vhosts stuff and in 15 mins, techkriti.org was "back up" - it just had the following message:


Okay, our server died on us. This is a temp server.<br/>
While we restore the Techkriti website, Play <a href="tesseract/">TesseracT</a> :)


TesseracT was back up for the world! ;)

Praneeth had been trying to restore/reinstall the old phpbb that was on the temporary server (which was actually our Navya server (internal link) ;)), and had gotten tired, given up, and gone back to his room. After >32 hours of staying awake, I was in no mood to finish his work. All three of us lumbered back to our respective rooms from the CSE dept at ~3am.

We did, however, have enough energy to take a 60 second exposure picture of a flowering tree behind the CC on the way back ^_^



We found this photo to be such a beauty in the midst of our sullen tiredness that we decided to make it the last page in the puzzle - which I did right after I got back to my room :)

As things stand now, the students' server is back up, and mod_perl is off on the server. So TesseracT is down for everyone outside IITK. People inside IITK can still play it at http://navya.junta.iitk.ac.in/tesseract .

On a different note, I read this here:


The decision to use Tracker by default in Ubuntu rather than the similar Beagle indexing system is somewhat controversial. Beagle can index more content and provides a more functional search tool with features like date sorting. Beagle is already included in popular Linux distributions like OpenSUSE and has been tested more extensively.


I've been wondering for the past few months if this was the result of Tracker fanboyism, Mono-hatred, or Ubuntu's experimentalism.

Update: Pics! :)

Update 2: Utkarsh's flashback on the thing; mostly stuff that I didn't write in detail about above.

PS: New Navya Server = 2 Dual-Core AMD Opteron 2.4GHz Processors, 8GB RAM, 500GB SATA HDD.
Speed = Compiling MySQL (Gentoo) takes 6 mins (compared to 25 mins on my Pentium M, 1GB laptop).

2 comments:

Utkarsh said...

Near great description ... I intend to write a blog entry of the times about Tesseract that existed when I and Teja were still wondering about the most innocuous excuse for not doing it. :)

bheekling said...

Looking forward to it :D