Thursday, January 26, 2012

A Chapter in Python List Comprehension Abuse

Apparently Facebook held the qualification round for this year's Hackercup last weekend. I just came across some of the problems and decided to write a quick solution for one of them. And since Python has this awesome thing called list comprehensions, I managed to cram the solution into a single (ugly) line.
The task is basically this: given a string, find how many times the string "HACKERCUP" can be formed from its letters without reusing any letters. (Full description here.) Seems simple enough, let's see..
First, a little bit on list comprehensions. Usually, to define a list you do the following:
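For example (my_list and numbers are just placeholder names):

    # An empty list, or one with its elements written out explicitly
    my_list = []
    numbers = [1, 2, 3, 4, 5]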


To fill a list, we could append to it in a loop (the first half of the snippet below). But it turns out there's an easier way - Python lets you create lists "implicitly", so to speak. You can iterate over range(10) inside the list itself and create the elements dynamically:
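Both versions side by side (squares is just an example name):

    # The explicit way: start with an empty list and append in a loop
    squares = []
    for x in range(10):
        squares.append(x * x)

    # The list comprehension way: the same list in a single expression
    squares = [x * x for x in range(10)]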

Looks interesting, and there's a lot you can do with this, because there are very few limitations on what you can put inside a list comprehension. You can even nest them to create matrices or do other things. The following line will flatten a list of lists:
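Something along these lines (nested being any list of lists):

    nested = [[1, 2], [3, 4], [5, 6]]
    flat = [x for sublist in nested for x in sublist]
    # flat is now [1, 2, 3, 4, 5, 6]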
For my solution to the problem, I actually read the input inside a list comprehension, which gives me a list of the test cases to work with.

Here's the full code:
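In Python 2 (raw_input does the reading, and / on integers rounds down), the line goes roughly like this:

    # roughly the one-liner: read the number of cases, read that many lines,
    # compute each answer and print them all in the required format
    print("\n".join(["Case #%d: %d" % (n + 1, min([tC.count(c) / "HACKERCUP".count(c) for c in "HACKERCUP"])) for n, tC in enumerate([raw_input() for x in range(int(raw_input()))])]))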


EXAMPLE INPUT

5
WELCOME TO FACEBOOK HACKERCUP
CUP WITH LABEL HACKERCUP BELONGS TO HACKER
QUICK CUTE BROWN FOX JUMPS OVER THE LAZY DOG
MOVE FAST BE BOLD
HACK THE HACKERCUP


EXAMPLE OUTPUT

Case #1: 1
Case #2: 2
Case #3: 1
Case #4: 0
Case #5: 1
It's a bit obfuscated, but should still be understandable. The print("\n".join(......)) part prints each element of the following list on its own line, so the output is in the required format. After that comes the actual generation of our results.
In the innermost part - [raw_input() for x in range(int(raw_input()))] - I read the input: the first raw_input() gives the number of lines that follow, and the comprehension then reads that many lines. The outer list comprehension works over this list and applies a formula to count how many times "HACKERCUP" can be formed out of each input string's letters (tC is the test case string, n is the index of the test case we're on).
The actual formula is this:
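Roughly, with Python 2's integer division doing the rounding down:

    # for each letter c of "HACKERCUP": how many copies does the test case supply,
    # divided by how many copies one "HACKERCUP" needs; the scarcest letter wins
    min([tC.count(c) / "HACKERCUP".count(c) for c in "HACKERCUP"])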



We count how many times each letter (the variable c) in "HACKERCUP" occurs in the test case string, and divide by how often it appears in "HACKERCUP" in the first place (rounding down). We store each result in a list, and then take the minimum of that list, giving us our final result.

So, the code is pretty messy and arguably unpythonic, but it demonstrates pretty well how powerful list comprehensions are in Python. Wrap one up in a while True loop and you can even create whole mini-games in one line of code!

Wednesday, January 25, 2012

ICPC @ TUM and a Quick State of the Union

Guess what's coming to TUM:
Alright, you didn't actually have to guess.
The ICPC is a yearly programming contest held at universities around the world, so naturally I was very excited today to find out that TUM is also participating. Teams can have two or three members, and I've already recruited Robert for mine, which is definitely something to be excited about as well, given that he's been coding since he was about 12 years old.
I've been doing a lot of problems on Project Euler lately (in fact I recently hit 50 solved problems), but competing in a live event will be a very different feeling, I expect.
The competition is a week from Saturday (04/02/2012) and will run for 5 hours as far as I know, during which everyone tries to solve as many of the 10 problems as they can. At first glance I was hesitant to spend all Saturday at uni, but then I found out there's gonna be pizza, so that was that.

If you're studying at TUM and are looking for a team to join for ICPC, drop me a comment here or send me an email! You can find more info on ICPC here: http://cm.baylor.edu/welcome.icpc and here: http://icpc.in.tum.de/index.php/Main_Page

By the way, Robert is also the guy who's working on the WikiGraph (still unsure of that name) project with me. He's been talking about starting a blog, so I'll be sure to post a link on here once that happens. Seriously, the guy has skills.
Speaking of WikiGraph, there's some news on what's going on with that as well:

  • We're getting good graphs for networks of up to fifteen thousand or so nodes, but beyond that our computational capacities don't allow for completing graphs in a reasonable amount of time.
  • I've talked to two of my professors, and it looks like we may be able to get time on a 30-40 core cluster to run our program, which might allow us to graph the entire German Wikipedia (1.2 million articles). This would be pretty amazing, but we're definitely going to have to work on properly scaling the algorithm so it doesn't break down as the network gets huge.
  • I'm also working on a way to read the input from a database (specifically the Wikipedia dumps of all pages and pagelinks, available at http://dumps.wikimedia.org/), but I'll post some technical details on that later. The gist of it is that I have to combine the pages dump (320MB) and the pagelinks dump (2.6GB) to get a full description of all the connections. I'm still unsure what the best way to do that is, and I don't want to waste time doing it inefficiently when dealing with these huge amounts of data. Finding disconnected subgraphs may also turn out to be a problem.
  • Different datasets are also being worked on. One thing I'm particularly interested in is mapping Darknet/Meshnet, so I'm looking for a way to actually extract the entire set of nodes and their connections. I'm not sure that's even possible, so I'll have to look into it more closely.
  • Finally, I used some of the source code from this project to create a simulation of planets/particles interacting through gravity - a planetary system, basically. It's almost more fun to play around with, in fact. Also more on that soon.

Thursday, January 19, 2012

9868 nodes in WikiGraph

Ha! I located one of the bugs: apparently the setting of the repulsion strength parameter in the balancing algorithm was what caused nodes to fly off into infinity. I fixed it, and right away managed to graph almost 10,000 articles centered on Mathematics! I could probably go larger, but each of these graphs takes up to half an hour to fully balance on my computer. Let's see, maybe TUM has a nice multicore machine we can use for a bit...



A little Preview

So, there haven't been many posts on here the last few days, but certainly not because I've gotten lazy. Robert and I have been coding like crazy on the WikiGraph project this week, with a few sessions of 6+ hours at times.
We're coming along quite well. We've managed to multithread all of the force calculations, which has given us a speedup factor of 4 on my machine; the implementation allows for an arbitrary number of threads and automatically creates as many as the machine it runs on has cores (or virtual cores, with hyperthreading). Along with that comes the improved algorithm, which again roughly doubles performance.
This allows us to display some really large graphs - by really large, I mean up to around 6000 nodes at the moment, and that limit is mostly due to some unexpected behavior (nodes flying off to infinity), which we suspect is a result of the parameters we use in our algorithm. I expect that by the time we release a beta version (which at the current development speed would be sometime next week), we'll be able to display graphs of maybe 15,000 nodes, which is around 3 levels of pagelinks deep if all links are loaded.
We're also implementing OpenGL support at the moment, since CPU rendering of these graphs actually takes more resources and time than calculating a single iteration of forces, so offloading that to the graphics card will definitely help performance.
Apart from that, other changes will mostly be in UI and design, adding options for users to alter the graph and such.

Even though we aren't ready to put out a stable and usable release yet, we've already created some pretty neat visualizations:
Close up of an earlier version, 2 levels centered around Quicksort
3 levels deep (German Wikipedia), also centered around Quicksort I believe. We're actually displaying over 5000 nodes here.
3 levels around Mathematics, over 6000 nodes. This one is not quite stable: the Mathematics article needs to be fixed in place, otherwise it flies off. I'm looking into why it does that; all the other nodes seem to behave as expected.

Sunday, January 15, 2012

An improved algorithm for N-body simulations

TL;DR: A simple algorithm that speeds up force calculations for n-body systems by a factor of nearly 2.

So the last post on here was about graph balancing using a spring-electrical model, in which each node exerts a repulsion on each other node according to Coulomb's law, and each edge attracts adjacent nodes according to the spring equation.
This is a good example of an N-body problem - the most well-known one is that of planetary/gravitational systems - and one unfortunate thing about these is that modeling them can be quite slow.
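To make that concrete, here's roughly what the two per-pair forces look like (a Python sketch of the model, not our actual code; repulsion_k, spring_k and rest_length are made-up tuning constants):

    import math

    def repulsion(a, b, repulsion_k=1000.0):
        # Coulomb-style: pushes a away from b, falling off with 1/r^2
        dx, dy = a[0] - b[0], a[1] - b[1]
        r = math.hypot(dx, dy) or 1e-9        # avoid division by zero
        f = repulsion_k / (r * r)
        return (f * dx / r, f * dy / r)

    def spring(a, b, spring_k=0.05, rest_length=50.0):
        # Hooke-style: pulls a towards b along an edge
        dx, dy = b[0] - a[0], b[1] - a[1]
        r = math.hypot(dx, dy) or 1e-9
        f = spring_k * (r - rest_length)
        return (f * dx / r, f * dy / r)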
Intuitively, the first algorithm that comes to mind is to simply iterate through all the bodies and, for each one, iterate through every other body and calculate the force they exert on each other. For n bodies (nodes), this comes out to n(n-1) force calculations for the whole system - (n-1) per body, since a body does not exert a force on itself. That is of course a growth rate of O(n^2), not very efficient.
But there's an easy improvement that cuts this down quite a bit while making only minimal changes to the algorithm:
  • We start with the first body and iterate through all the other bodies, calculating each force and adding it to the first body's net force
  • Since the forces between any two bodies are equal and opposite, we subtract this force (or add it negatively) from the other body's net force right away, and afterwards add a reference to the original body to a list (or in Qt a QSet, which is faster) that keeps track of all the bodies that have already been processed
  • We go to the next body and again iterate through all the other bodies, except now we skip any body already in our set of processed bodies. This means one less force calculation for each body we work through
And that's about it. With each successive body we have one less force to calculate, so the total number of force calculations is now (n-1) + (n-2) + ... + 1 = n(n-1)/2.
This is significantly better - exactly half the force calculations of the naive approach. In practice the extra bookkeeping (the set lookups) eats a little of that gain, so for very large systems the algorithm ends up close to twice as fast.
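Here's the idea as a Python sketch (not our actual Qt code; force(a, b) stands in for whatever pairwise force law you're using, e.g. the repulsion above, and returns the force on a exerted by b):

    def net_forces(bodies, force):
        # net force on each body, computing every pairwise force only once
        net = [(0.0, 0.0) for _ in bodies]
        done = set()                              # bodies we've already fully processed
        for i in range(len(bodies)):
            for j in range(len(bodies)):
                if j == i or j in done:
                    continue                      # skip self and already-processed bodies
                fx, fy = force(bodies[i], bodies[j])
                net[i] = (net[i][0] + fx, net[i][1] + fy)
                # equal and opposite: add the negative to the other body right away
                net[j] = (net[j][0] - fx, net[j][1] - fy)
            done.add(i)
        return net

Drop the done set and you're back to the naive version that computes every pair twice.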
I implemented this algorithm in our Wikipedia grapher, and it's working quite nicely. Here's one of the graphs I made with it today (554 nodes, zoomed out):

There are also O(n log n) algorithms for these systems, but frankly I don't yet understand how they work well enough to write about or implement one, so I'll call it a night for now and get to those another time.

Tuesday, January 10, 2012

Graphing wikipedia - update

I fixed up the crosslinking in our wiki grapher, so now each expanding node checks the list of existing nodes and connects to them rather than creating a new node for every page it finds (the pattern is sketched at the end of this post). A next step would be to use Wikipedia's incoming-links feature and also check node links in the other direction; however, this would probably hurt performance, since a full web request would have to be made for every single node at creation time, not just after it is clicked/expanded. Anyway, this fix already makes the resulting graph a lot more interesting:

By the way, this is centered around Mergesort, and outgoing links are limited to 10 for testing purposes. I might keep it that way and then decide on a way to display only the 10 (or so) most important links, although it'll be difficult to decide how to judge link importance.
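For the curious, the crosslinking fix mentioned above boils down to a lookup table of existing nodes. A rough Python sketch of the pattern (the real implementation is C++/Qt; Node and fetch_links here are made-up stand-ins):

    class Node(object):
        def __init__(self, title):
            self.title = title
            self.edges = []

    nodes = {}          # page title -> Node, so every page exists only once

    def get_or_create(title):
        if title not in nodes:
            nodes[title] = Node(title)
        return nodes[title]

    def expand(node, fetch_links):
        # fetch_links(title) would return the outgoing links scraped from that page
        for linked_title in fetch_links(node.title):
            target = get_or_create(linked_title)   # reuse existing nodes -> crosslinks
            node.edges.append(target)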

Visualizing Pagelinks

In this blog's last post I talked about force-based graph balancing and some of the things it can be used for, more specifically visualizing networks such as links between webpages. Robert Brünig and I are currently working on a small project that does just that, by scraping pages (currently Wikipedia articles) for links and then representing them graphically. It's written in C++ using the Qt framework. This would've been fairly simple in Python, but for larger networks the program becomes very computationally intensive, and we wanted good performance. Besides, using C++/Qt allows for much more stable deployment, and Qt has a lot to offer in terms of modules (widgets, graphics, web access and such). Screenshot below.

As you can see, the graph quickly grows quite large for Wikipedia pages due to the high number of pagelinks in each article. We're thinking of focusing on YouTube instead, visualizing a given clip's related videos - perhaps we could even embed the videos in the nodes, so that they can be watched directly in the program.

There are also still some issues with the graph balancing algorithm: we need to find a good combination of attraction/repulsion/weighting parameters that produces a smooth, stable graph and balances quickly.