A cluster version of Zen is running on cgos 19x19

Post by Hideki Kato
I'm now testing a cluster version of Zen (Zengg-4x4c-tst), developed
in a joint project with Yamato, on cgos 19x19. So far it has won all
its games (except the first one, lost on timeout due to a bug). More
strong programs running there would be much appreciated.

Hideki, thx for your activity.

Do I have a Christmas wish for free already?
It is: Let the cluster also run on KGS - against the humans.

Ingo.


Hideki Kato

2009-11-24 19:58:26 UTC

Hideki, thx for your activity.
Do I have a Christmas wish for free already?
It is: Let the cluster also run on KGS - against the humans.

I'd like to do so but it's not allowed to connect the cluster
to the Internet, sigh.

Hideki
--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Ingo Althöfer

2009-11-24 20:06:43 UTC

Post by Ingo Althöfer
Do I have a Christmas wish for free already?
It is: Let the cluster also run on KGS - against the humans.

I'd like to do so but it's not allowed to connect the
cluster to the Internet, sigh.

Hmm. As CGOS is also Internet, it seems that Zen-author
does not allow you to connect to KGS.

Is Zen-Author reading here?
Maybe, he can rethink about the possibility.

I want Cluster-Zen for Christmas, Cluster-Zen-for Christmas,
Cluster-Zen for Christmas, please, please, please, please...

Little child In-Go.


Hideki Kato

2009-11-24 20:42:13 UTC

Post by Ingo Althöfer
Do I have a Christmas wish for free already?
It is: Let the cluster also run on KGS - against the humans.

I'd like to do so but it's not allowed to connect the
cluster to the Internet, sigh.

Hmm. As CGOS is also Internet, it seems that Zen-author
does not allow you to connect to KGS.

Ah, I was unclear. I was writing about the T2K HPC cluster, which is the
main target of my development, not my home cluster. My mini cluster can
freely be connected to KGS, though I have no rated bot account yet.

Post by Ingo Althöfer
Is Zen-Author reading here?
Maybe, he can rethink about the possibility.

He is sleeping now 'cause it's 5:30 am in Japan :).

Post by Ingo Althöfer
I want Cluster-Zen for Christmas, Cluster-Zen-for Christmas,
Cluster-Zen for Christmas, please, please, please, please...
Little child In-Go.

I'll throw it into KGS after tuning several parameters.

Hideki
--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Nick Wedd

2009-11-24 21:21:14 UTC

Post by Ingo Althöfer
Do I have a Christmas wish for free already?
It is: Let the cluster also run on KGS - against the humans.

I'd like to do so but it's not allowed to connect the
cluster to the Internet, sigh.

Hmm. As CGOS is also Internet, it seems that Zen-author
does not allow you to connect to KGS.

Post by Ingo Althöfer
Is Zen-Author reading here?
Maybe, he can rethink about the possibility.

He is sleeping now 'cause it's 5:30 am in Japan :).

Post by Ingo Althöfer
I want Cluster-Zen for Christmas, Cluster-Zen-for Christmas,
Cluster-Zen for Christmas, please, please, please, please...
Little child In-Go.

I'll throw it into KGS after tuning several parameters.

The December KGS bot tournament will be 9x9. I guess that if a
cluster-Zen competes in that (I am hoping it will), it will be
unbeatable.

The existing pattern of KGS bot tournaments (see
http://www.weddslist.com/kgs/future.html) means that the January one
will also be 9x9, then February and March will both be 19x19. A cluster
Zen in a 19x19 event will be even more interesting to watch.

Nick

Post by Hideki Kato
Hideki
--
_______________________________________________
computer-go mailing list
http://www.computer-go.org/mailman/listinfo/computer-go/

--
Nick Wedd ***@maproom.co.uk

Hideki Kato

2009-11-24 21:52:42 UTC

Hi Nick,

I'll participate in the coming tournaments as much as possible, but the
cluster version is still under development and needs much more work and
time to reach full performance.

Since my mini cluster uses ordinary Gigabit Ethernet, which is much
slower than expensive InfiniBand or similar high-speed interconnects,
it does not perform much better on 9x9. So please do not expect much :).

Also, on the 19x19 board, the current 16-core cluster version performs
almost the same as an 8-core shared-memory PC such as the Mac Pro that
Yamato used for KGS.

Hideki

--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Darren Cook

2009-11-24 23:06:46 UTC

Post by Hideki Kato
Also, on 19x19 board, current 16-core cluster version performs almost
the same as 8-core shared memory pc such as Mac Pro, which Yamato used
for KGS.

Hi Hideki,
Is that difference due to a scaling limit of Zen, or is this due to the
cluster overhead? Would moving from gigabit to infiniband help, or is
the limit more to do with the lack of shared memory?

Post by Hideki Kato
T2K HPC cluster

This seems to be a cluster specification rather than an actual machine.
Can you tell us more about how many cores you are experimenting with,
and how the programs scale? (Are all your experiments with Zen, or are
you trying to run other programs on a cluster too?)

Darren

--
Darren Cook, Software Researcher/Developer
http://dcook.org/gobet/ (Shodan Go Bet - who will win?)
http://dcook.org/mlsn/ (Multilingual open source semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)

Hideki Kato

2009-11-25 05:05:00 UTC

Post by Hideki Kato
Also, on 19x19 board, current 16-core cluster version performs almost
the same as 8-core shared memory pc such as Mac Pro, which Yamato used
for KGS.

I'm right now evaluating the scaling (:-).

The performance gap is perhaps due to the algorithms. Almost all
cluster versions of current strong programs (MoGo, MFG, Fuego and Zen)
use root parallelization, while shared-memory computers allow us to use
thread parallelism, which gives better performance. The main reason, I
guess, is that the latter increases the depth of the search tree
with the number of processors (cores) while the former does not.
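
A minimal sketch of root parallelization as described above, under loud
assumptions: the three root moves, their win probabilities and the uniform
move choice are hypothetical stand-ins for a real Go engine, and this is not
Zen's code. Each simulated node searches independently and only the root
statistics are merged, so adding nodes adds samples at the root but does not
make any single search deeper.

```python
import random
from collections import Counter

# Hypothetical toy problem: three root moves with hidden win probabilities.
TRUE_WIN_RATE = {"A": 0.45, "B": 0.55, "C": 0.50}

def search_root(seed, n_playouts):
    """One independent worker: returns (visits, wins) per root move."""
    rng = random.Random(seed)
    visits, wins = Counter(), Counter()
    for _ in range(n_playouts):
        # A uniform choice stands in for the engine's real tree policy.
        move = rng.choice(sorted(TRUE_WIN_RATE))
        visits[move] += 1
        if rng.random() < TRUE_WIN_RATE[move]:
            wins[move] += 1
    return visits, wins

def merge_roots(results):
    """Combine only the root statistics of all workers by summing them."""
    total_visits, total_wins = Counter(), Counter()
    for visits, wins in results:
        total_visits.update(visits)
        total_wins.update(wins)
    return total_visits, total_wins

def best_move(visits, wins):
    """Pick the root move with the highest merged win rate."""
    return max(visits, key=lambda m: wins[m] / visits[m])

results = [search_root(seed, 2000) for seed in range(4)]  # 4 simulated nodes
visits, wins = merge_roots(results)
print(best_move(visits, wins), dict(visits))
```

In a thread-parallel search on shared memory, by contrast, all workers would
update one common tree, which is why that tree can grow deeper as cores are
added.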

One interesting observation about root parallelization is that the
scaling depends on the time per move: longer time settings scale
better when the interval for exchanging root information is fixed. In
other words, each time setting has its own best number of nodes. This
makes things complicated :(.

The scaling limit of Zen is still unknown, though before starting this
joint project with Yamato I expected that Zen's playouts were not
random enough for it to scale well.

Post by Hideki Kato
T2K HPC cluster

I'm running only Zen on the cluster, though I'd like to run my Fudo
Go as well if I have (had?) time.

Name: T2K Open Supercomputer (Todai)
#Todai is an abbreviation of University of Tokyo in Japanese.
Hardware: HITACHI HA8000-tc/RS425
Number of nodes: 952
Number of cores of each node: 16
#I can use up to 64 nodes; 1024 cores in total
Processor: AMD Opteron 8356 (quad-core) 2.3 GHz
Memory of each node: 32 GB
Interconnect: Myricom Myri-10G
Operating System: RedHat Enterprise Linux 5
#Flops numbers are omitted. :)

http://www.cc.u-tokyo.ac.jp/service/ha8000/intro.html (in Japanese)

T2K stands for Tokyo, Tsukuba and Kyoto (T, T, K). See
http://www.open-supercomputer.org/ (in English) for the idea of T2K
Open Supercomputer.

Hideki
--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Olivier Teytaud

2009-11-25 07:04:26 UTC

Post by Hideki Kato
The performance gap is perhaps due to the algorithms. Almost all
cluster versions of current strong programs (MoGo, MFG, Fuego and Zen)
use root parallel while shared memory computers allow us to use thread
parallelism, which gives better performance.

I think you should not have trouble with your network, at least with
the number of machines you are considering.

Perhaps you should slightly increase the time between two communications?
With something like mpi_all_reduce for averaging the statistics over the
whole tree at each communication, more than 3 or 4 communications per second
is useless. Averaging statistics in nodes with less than 5% of the total
number of simulations might be useless too.
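
The 5% rule of thumb above could be sketched as a simple filter applied before
each exchange. This is only an illustration: the function name, the node
labels and the visit counts are made up, and a real program would walk its own
tree structure instead of a flat dictionary.

```python
def nodes_to_share(visit_counts, total_simulations, threshold=0.05):
    """Keep only tree nodes holding at least `threshold` of all simulations."""
    cutoff = threshold * total_simulations
    return {node: n for node, n in visit_counts.items() if n >= cutoff}

# Hypothetical visit counts after 10000 simulations on one machine.
visit_counts = {
    "root": 10000, "D4": 4200, "Q16": 3100,
    "D4/Q4": 900, "C3": 300, "K10": 40,
}
shared = nodes_to_share(visit_counts, total_simulations=10000)
print(sorted(shared))  # only these nodes would be averaged across machines
```

With these numbers the cutoff is 500 simulations, so the two rarely visited
nodes are left out of the exchange.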

Best regards,
Olivier

Hideki Kato

2009-11-25 07:49:45 UTC

Thank you Olivier,

I think you should not have troubles with your networks, at least with
the number of machines you are considering.
Perhaps you should increase a little the time between two communications ?
With something like mpi_all_reduce for averaging the statistics over all the
tree at each communication, more than 3 or 4 communications per second
is useless. Averaging statistics in nodes with less than 5% of the total
number of simulations might be useless also.

In your (or Sylvain's?) recent paper, you wrote that an interval of less
than one second was useless. I've observed something similar. I'm now
evaluating the performance with 0.2, 0.4, 1 and 4 second intervals at
5 seconds per move on the 19x19 board, on 32 nodes of the HA8000 cluster.

Though I don't have enough games yet, the current best is the 1 second
interval, which gains about 400 Elo in self-play. Why, then, do we get
similar results with different implementations of root parallelism, based
on different programs and on different clusters? I don't use MPI for
the cluster version of Zen. Zen's playouts are slower than MoGo's.
Etc... One second is a mysterious time :(.
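
For scale, a rating gap can be turned into an expected score with the usual
logistic Elo formula. The sketch below assumes that standard model; the
message does not say which baseline the 400 Elo is measured against.

```python
def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(round(expected_score(400), 3))
```

Under that model a 400 Elo advantage corresponds to an expected score of
10/11, roughly 91%.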

Hideki
--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Olivier Teytaud

2009-11-25 08:05:59 UTC

Post by Hideki Kato
In your (or Sylvain's?) recent paper, you wrote less than one second
interval was useless. I've observed similar. I'm now evaluating the
performance with 0.2, 0.4, 1 and 4 second intervals for 5 second per
move setting on 19x19 board on 32 nodes of HA8000 cluster.

Yes, one second is fine for 5 seconds per move.
Maybe you can check whether you get a linear speed-up if you artificially
simulate zero communication time?
My guess is that the communication time should not be a problem, but if you
don't use MPI, maybe there is something in your implementation of the
communications?

By the way, a cluster parallelization in MPI can be developed very quickly,
and MPI is efficient: mpi_all_reduce has a computational cost logarithmic in
the number of nodes.

Good luck,
Olivier

Hideki Kato

2009-11-25 09:09:31 UTC

Hmm, I don't think my communication code is the problem.

Post by Olivier Teytaud
By the way, a cluster parallelization in MPI can be developed very quickly,
and MPI is efficient: mpi_all_reduce has a computational cost logarithmic in
the number of nodes.

Even if the summing is done in logarithmic time (binary-tree style),
the time to collect all information from all nodes is proportional to
the number of nodes if the master node has few communication ports,
isn't it?

MPI is the best choice for dedicated HPC clusters, I agree. It imposes,
however, several constraints, for example that nodes cannot be unplugged
or plugged in during operation. MPI cannot be installed on some computers
with not-so-common operating systems, or on small computers without
enough memory, such as game consoles. I just want a freer parallel and
distributed computing environment for MCTS than MPI. My code is now
running on a mini PC cluster at my home. I don't want to install MPI
on my computers :).

By the way, have you experimented with a scheme that just adds instead of
averaging? When I tested that, my code had some bugs and I had no success.

Post by Olivier Teytaud
Good luck,

Thanks,

Hideki
--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Olivier Teytaud

2009-11-25 09:19:27 UTC

Post by Hideki Kato
Even if the sum-up is done in a logarithmic time (with binary tree
style), the collecting time of all infomation from all nodes is
proportional to the number of nodes if the master node has few
communication ports, isn't it?

No (unless I misunderstood what you mean, sorry in that case!)!
Use a tree of nodes to aggregate the information, and everything is
logarithmic. This is done implicitly in MPI.

If you have 8 nodes A, B, C, D, E, F, G, H,
then
(i) first layer
A and B send information to B
C and D send information to D
E and F send information to F
G and H send information to H
(ii) second layer
B and D send information to D
F and H send information to H
(iii) third layer
D and H send information to H

then do the same in the reverse order so that the cumulated information is
sent back to all nodes.
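
The layered scheme above can be simulated in a few lines. This is only a
sketch that mirrors the 8-node description (index 0 is A, index 7 is H) and
assumes a power-of-two node count; real MPI libraries pick their own
algorithms behind MPI_Allreduce, so it says nothing about what a particular
implementation does.

```python
def tree_allreduce(values):
    """Sum one value per node by pairwise tree reduction, then broadcast back.

    Returns (per-node results, number of communication rounds).
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch handles powers of two only"
    partial = list(values)
    rounds = 0
    stride = 1
    # Reduce: in each layer, a sender passes its partial sum to the node
    # `stride` positions to its right (A to B, then B to D, then D to H).
    while stride < n:
        for receiver in range(2 * stride - 1, n, 2 * stride):
            partial[receiver] += partial[receiver - stride]
        stride *= 2
        rounds += 1
    # Broadcast: the same layers in reverse order, so every node ends up
    # holding the total.
    stride = n // 2
    while stride >= 1:
        for sender in range(2 * stride - 1, n, 2 * stride):
            partial[sender - stride] = partial[sender]
        stride //= 2
        rounds += 1
    return partial, rounds

result, rounds = tree_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(result, rounds)
```

For 8 nodes this takes 3 rounds up and 3 rounds down, that is 2 * log2(N)
rounds in total, with the transfers inside one round running in parallel.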

Post by Hideki Kato
By the way, have you experimented not averaging but just adding sceme?
When I tested that my code had some bugs and no success.

Yes, we have tested that. Surprisingly, there was no significant
difference. But I don't know if this would still hold today, as we now
have some pattern-based exploration. For code whose score depends almost
only on percentages, it's not surprising that averaging and summing are
equivalent.
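
A small worked example of why the two merging schemes agree on percentages.
The per-machine (wins, visits) pairs are invented for illustration. Averaging
divides both totals by the number of machines, so the win rate is unchanged
and only the visit count differs.

```python
from fractions import Fraction

def merged_win_rate(per_node_stats, average):
    """Merge (wins, visits) pairs by summing, optionally averaging over nodes.

    Returns (win rate, merged visit count) as exact fractions.
    """
    k = len(per_node_stats)
    wins = sum(Fraction(w) for w, _ in per_node_stats)
    visits = sum(Fraction(n) for _, n in per_node_stats)
    if average:
        wins, visits = wins / k, visits / k
    return wins / visits, visits

stats = [(60, 100), (45, 80), (30, 70)]  # hypothetical per-machine counts
rate_sum, n_sum = merged_win_rate(stats, average=False)
rate_avg, n_avg = merged_win_rate(stats, average=True)
print(rate_sum, rate_avg, n_sum, n_avg)
```

Any term that depends on the absolute visit count, such as an exploration
bonus, would see different numbers under the two schemes, which fits the
caveat above about pattern-based exploration.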

Best regards,
Olivier

Steve Kroon

2009-11-25 11:16:02 UTC

Hi.

I hope to have a student for the next month or two who can look into some
computer Go before starting his Masters degree. He is interested in using CUDA
for his Masters, so I thought it would be nice for him to investigate
applicability of CUDA for computer Go.

I know there was quite a debate about this not so long ago, and I don't mean to
re-open it. I was just wondering if anyone has a specific investigation of
limited scope they would like done, or think might be valuable with respect to CUDA.

I hope to continue further working with Fuego, so if you have suggestions
specific to Fuego, that would be particularly useful.

Thanks,

--
Steve Kroon | Computer Science Division, Stellenbosch University
(084) 458 8062 (Cell) | (086) 655 4386 (Fax) | kroonrs (Skype)
http://www.cs.sun.ac.za/~kroon | ***@sun.ac.za

Christian Nentwich

2009-11-25 11:27:11 UTC

Steve,

I was one of the people who posted in the debate - I implemented light
playouts on CUDA for testing purposes, with one thread per playout
(rather than one per intersection).

I think one of the main things that I am curious about, and I don't
think I am the only one, is whether texture memory or other such
facilities could be used for fast pattern matching. How heavy playouts
would perform on CUDA is a big unanswered question. Depending on the
answers you get, it may be possible to run heavy playouts on the CPU,
but delegate a pattern matching queue to CUDA. No idea - a good field
for research :)

Christian

Petr Baudis

2009-11-26 10:16:33 UTC

Hi!

Post by Steve Kroon
I hope to have a student for the next month or two who can look into some
computer Go before starting his Masters degree. He is interested in using CUDA
for his Masters, so I thought it would be nice for him to investigate
applicability of CUDA for computer Go.
I know there was quite a debate about this not so long ago, and I don't mean to
re-open it. I was just wondering if anyone has a specific investigation of
limited scope they would like done, or think might be valuable with respect to CUDA.

I agree with Christian that it would be great to explore the use of
texture memory. I have implemented the per-intersection parallelization
for CUDA; if your student already has some CUDA experience, maybe he could
have a quick look at it to see if there are any obvious non-optimal
spots. ;-)

Other than that, I also agree that pattern matching is the most
interesting area to explore; CUDA could either do just the pattern
matching on many boards, or full heavy playouts. If your patterns
generate a probability distribution of moves across the whole board,
I think full heavy playouts might be possible, but of course this
depends a lot on the set of features you choose for pattern matching.
(You can refer to Remi Coulom's ELO paper for a nice set of features
to match.)

Generally, two basic attempts (very different in their approach to the
problem) at a CUDA implementation of Monte Carlo playouts have been made
so far, but I think neither of us spent much time optimizing them,
trying variations, or using them in practice by integrating them into
an existing engine.

Post by Steve Kroon
I hope to continue further working with Fuego, so if you have suggestions
specific to Fuego, that would be particularly useful.

I think the CUDA projects are fairly universal - you will very likely
need to design CUDA-specific independent data structures anyway, and then
it's not a big deal to write a quick converter from/to the data structures
of any reasonable program.

--
Petr "Pasky" Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth

Hideki Kato

2009-11-25 14:58:59 UTC

No (unless I misunderstood what you mean, sorry in that case!) !
Use a tree of nodes, to agregate informations, and everything is
logarithmic. This is implicitly done in MPI.
If you have 8 nodes A, B, C, D, E, F, G, H,
then
(i) first layer
A and B send information to B
C and D send information to D
E and F send information to F
G and H send information to H
(ii) second layer
B and D send information to D
F and H send information to H
(iii) third layer
D and H send information to H
then do the same in the reverse order so that the cumulated information is
sent back to all nodes.

Interesting, surely the order is almost logarithmic. But how long does
it take a packet to pass through a layer? I'm afraid the actual delay
time may increase.

Post by Hideki Kato
By the way, have you experimented not averaging but just adding sceme?
When I tested that my code had some bugs and no success.

Simple adding has the advantage that no synchronization is required to
sum up the statistics of all computers, so the time from sending a
statistics packet to receiving it and adding it to the root node
is reduced. This advantage, however, may not be effective in
MPI environments because the number of packets increases from N to N^2
if real (i.e. UDP) broadcasting is not used. It's not so
surprising that there was no significant difference in MPI
environments. Ah, if the tree structure is used to broadcast packets,
things may vary.
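
The packet counts mentioned above can be written down directly. This is a
back-of-the-envelope sketch with made-up names. It assumes one packet per
sender for a true broadcast, one packet per ordered pair of machines for
unicast, and one message per tree edge in each direction for the tree scheme.

```python
def messages_per_exchange(n_nodes):
    """Packets needed for every node to learn every other node's statistics."""
    return {
        # Each node sends its packet separately to each of the others.
        "unicast_all_to_all": n_nodes * (n_nodes - 1),
        # Each node sends a single packet that all others receive.
        "udp_broadcast": n_nodes,
        # N - 1 messages up the reduction tree, N - 1 back down.
        "tree_reduce_then_broadcast": 2 * (n_nodes - 1),
    }

print(messages_per_exchange(16))
```

For 16 machines this gives 240 unicast packets against 16 broadcast packets,
while the tree needs 30 messages but spreads them over several sequential
layers.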

Thanks a lot,
Hideki
--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Olivier Teytaud

2009-11-25 15:14:32 UTC

Post by Hideki Kato
Interesting, surely the order is almost logarithmic. But how long it
takes a packet to pass through a layer. I'm afraid the actual delay
time may increase.

With gigabit ethernet, my humble opinion is that you should have no problem.
But testing what happens if you "artificially" cancel the time of the
messages might confirm or refute this. If you do have trouble due to the
communication time, I'm sure you can optimize it.

MPI provides plenty of well-made primitives for encoding communications.
Unless you need very precise optimization, it's not worth working
directly with sockets.
Olivier

Hideki Kato

2009-12-25 01:41:46 UTC

Ingo, Zengg19 is now running in the Computer Go room as a rank-free bot
with 30 minutes sd. It's running on a (mini) cluster of four hand-built
Intel quad-core computers.

Hideki

--
***@nue.ci.i.u-tokyo.ac.jp (Kato)

Ingo Althöfer

2009-11-24 20:47:52 UTC

Hi Hideki,

Post by Ingo Althöfer
Is Zen-Author reading here?
Maybe, he can rethink about the possibility.

He is sleeping now 'cause it's 5:30 am in Japan :).

OK, let him have his good sleep.

Post by Ingo Althöfer
I want Cluster-Zen for Christmas, Cluster-Zen-for Christmas,
Cluster-Zen for Christmas, please, please, please, please...
Little child In-Go.

I'll throw it into KGS after tuning several parameters.

You are a 100-%-darling. Thanks a lot in advance.

Ingo.


Ingo Althöfer

2009-12-28 10:34:10 UTC