Friday, December 29, 2006

gonna love google adsense

It lets me see how Google judges my taste, so I learn more about myself. And I find I am someone who likes to click on those ads.

Thursday, December 28, 2006

I can't believe it, why is my blog showing this ad?

Infertility testimonial: I finally conceived a baby
Shanghai Xiehe Hospital's six-in-one gene fertility therapy; six doctoral-supervisor physicians carrying on Xiehe's century of refined medical skill
Electric motors
Zhejiang Huaming Motor Co., Ltd. manufactures and sells all kinds of single-phase and three-phase motors; products carry CCC and CE certification.
Ads by Google

 
Oops, I am out of beta too~~

I have earned $0.29!

                      Page impressions   Clicks   Page CTR   Page eCPM [?]   Earnings
AdSense for content   169                4        2.37%      US$1.69         US$0.29
AdSense for search    - no data available -
Referrals             - no data available -
Total earnings        US$0.29
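The eCPM column is just earnings normalized per thousand page impressions. A quick sketch of the arithmetic behind the report (assuming AdSense's usual definitions; the numbers are the ones from the table, and the report's US$1.69 presumably comes from unrounded earnings):

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a percentage."""
    return clicks / impressions * 100

def ecpm(earnings: float, impressions: int) -> float:
    """Effective cost per mille: earnings per 1,000 page impressions."""
    return earnings / impressions * 1000

# The report above: 169 impressions, 4 clicks, US$0.29 earned.
print(round(ctr(4, 169), 2))     # → 2.37, matching the reported page CTR
print(round(ecpm(0.29, 169), 2))
```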

And the Taiwan earthquake has stopped my income! Visitors can't reach the advertisers' websites in the US!

Wednesday, December 27, 2006

The Interrelation of Scientific and Humanistic Development

-- Taking Piracy of Digital Products as an Example

郑达韦华

20621221


Science is the foundation for standing in the world: only by understanding science, mastering science, respecting objective laws, and following objective laws can one establish oneself in it. The humanities are the root of being human: only by valuing the humanities, comprehending the humanities, integrating into human society, and respecting human dignity can one meet the basic standard of a human being. Science education chiefly awakens the intellect, while humanities education awakens both the intellect and, even more, our humanity. Only when science education and humanities education are fused can education correctly answer the call of the times.

Since the reform and opening-up, the principal form taken by the contradiction between science and the humanities has shifted: from the struggle between those who belittled science and those who defended it, and from the opposition between conservative thinking and reform, toward a tension between a purely scientific standpoint and an emerging humanistic one. This judgment may not be entirely accurate, but it is undoubtedly illuminating.

According to the Statistical Bulletin on National Education Development 2002 [1], mainland China has over 200 million students in school (about 120 million in primary school, 70 million in junior high, 30 million in senior high, and 6 million in higher education). Educational software has played a role in China's educational development that cannot be overlooked, since both the training of ordinary students and of professionals depends on its use. Take scientific computing software as an example: it provides a powerful tool for teaching and applications in science and engineering. At present, many courses at China's science and engineering universities use the commercial package MATLAB. But because of its high price (the core single-machine MATLAB license costs over 10,000 RMB, about twice the price of a personal computer, not counting the extra cost of the various toolboxes), many Chinese users run it illegally. The present fact is that while Chinese universities are centers of knowledge creation, they also host large numbers of pirate users. Worse, the unofficial websites and pages of some schools have become download hubs for pirated software. We cannot put too much blame on the students here: many of our teachers base their courses on commercial software such as MATLAB, so students can only complete their assignments with pirated copies. Meanwhile, short of education funding, some school administrators simply look the other way.

Recently, Chinese universities have begun a discussion of "university spirit and university culture". University culture is not only the transmission of knowledge; it should also include the pursuit of a reasoned and wise attitude toward life, and the cultivation of students' character and values. Against this background, we believe the discussion cannot ignore the reality that Chinese universities have already become major sites of illegal software use. Without a practical, strictly enforced intellectual-property regime accompanying it in Chinese education, this effort to build a "university culture" is very likely to remain empty talk.

Pointing out these problems in education does not imply that China's other sectors are untainted. The crisis of honesty and good faith among Chinese people, and the deterioration of moral character, are plain for all to see. Our point is that education (including the educational arms of research institutions) should be the best entry point for changing the state of intellectual-property protection in China. If young people grow up with the awareness and habit of using legal software, they have a chance of becoming law-abiding citizens later in life. The Chinese government has taken many forceful measures against piracy, but the results are unsatisfying. Why? Mainly because too little effort has gone into education.

As China aims to build the world's largest society of lifelong learning, it already faces unavoidable challenges in developing and deploying educational software. Lu Xun's character Kong Yiji famously protested that "stealing books cannot count as theft". Times have changed; in today's ever more computerized society we hear another version: "Software is so expensive, you'd be a fool not to pirate it." With piracy treated as normal, one can guess how few of the computers in Chinese schools run licensed software. Facing such entrenched habits, we feel deeply that promoting free software in China is not mainly a technical problem; the chief obstacles come from our traditional attitudes and the weak enforcement of laws and regulations.

Our daily life is already full of examples of piracy in use:

Recently an insider at the Shanghai Municipal Education Commission disclosed that in April, Microsoft (China) sent the commission a letter noting that most of Shanghai's primary and secondary schools were not on Microsoft Office's customer list, yet the Information Technology textbook was largely built around Microsoft Office. Microsoft concluded that these schools were running pirated Office, and on that basis demanded that the commission take measures and purchase licensed copies of Office.

Microsoft's move provoked a strong backlash from Shanghai's schools and the commission: Microsoft Office was uninstalled from every primary and secondary school machine in Shanghai in favor of the domestic suite WPS; from the new term, Shanghai's 2,000-plus primary and secondary schools would use a brand-new Information Technology textbook and accompanying teaching environment, built no longer on the Microsoft Office that had dominated Shanghai's IT education, but on Kingsoft's WPS.


Baidu MP3 search:

Baidu's fig leaf used to be: the piracy is done by others, we merely provide links to it. But now Baidu engages in outright hotlinking. The record companies make no money; the sites hosting the pirated files make no money; only high-tech Baidu profits. Everyone knows what the so-called MP3 search is for. Who would have thought it would escalate to this.

Xunlei downloads:

At bottom it hotlinks other sites' download URLs. Xunlei provides no servers of its own, yet the original hosts lose the ad revenue shown alongside their downloads. It is destructive competition: in the end nobody will be willing to offer free downloads, which harms users. But in today's environment, whoever captures the traffic captures the wealth, no matter the means.

P2P/BT downloads:

In itself a fine technology: used to distribute legitimate content it would solve the heavy-load problem of many simultaneous downloaders. Yet 99.9% of what spreads over P2P on today's internet is pirated content. A good technology has pushed people's morals further downhill.



Confucius said that at seventy one can "follow the heart's desire without overstepping the rules", because by then one has thoroughly understood the ways of the world: rules do not restrict freedom; they fundamentally protect everyone's freedom, including one's own long-term freedom. For open-source software, and free software in particular, we likewise need to fully appreciate the positive meaning behind the rules (that is, the licenses) and their enormous role in advancing society and civilization; then, whether in use or in further development, we too can follow our desires without overstepping the rules.

Lu Yongxiang, president of the Chinese Academy of Sciences, argued in a speech that while science and technology bring welfare to humanity, they can also bring harm if misused without control or guidance. In the 21st century, questions of scientific ethics will grow ever more prominent. Scientific and technological progress should serve all humanity and the lofty causes of world peace, development, and progress, not endanger humanity itself. Strengthening scientific ethics and morality requires tightly combining the natural sciences with the humanities and social sciences, going beyond science's cognitive rationality and technology's instrumental rationality to watch over scientific development from the height of humanistic reason, ensuring that science and technology always develop healthily along the correct track of serving humanity.

Consider how Bill Gates views piracy. Microsoft takes the longer and franker view, for Gates reportedly said, in effect: if you are going to steal, steal mine; once you are all addicted we will slowly come collect. In fact, the existence and spread of piracy greatly helped Microsoft build its monopoly, and the benefits Microsoft reaps from that monopoly may far outweigh the harm piracy does it. Of course, comparing films with Microsoft software may not be quite apt: the network externalities of Microsoft's software make winner-take-all easy in the software market, while nothing similar happens in the film market. By comparison, then, films and music are probably hurt far more by piracy.

On one hand, pirated Windows greatly accelerated the spread of PCs in China. But once everyone was used to Windows, much else was ignored, such as Linux; everything came to revolve around Windows, which greatly narrowed our base for innovation. The worldwide widening of the digital divide, and the increasingly serious monopoly of information products by a few multinationals under economic globalization, have drawn the concern of governments, scientists, NGOs, and thoughtful people in industry alike. Software piracy disrupts the software market that genuinely should be commercialized, and it shakes basic moral principles people ought to hold; at the same time, the steep price barriers of information products and the monopoly over some of them are a major cause of the ever-widening digital divide.

In China, the open-source movement looks more like a business than a cause. China has never had a sufficiently mature open-source community, yet it has plenty of open-source software companies. Before becoming an ideal worth fighting for, open source here first became a profit-driven commercial activity; these are the open-source movement's "Chinese characteristics".

At an earlier Sino-foreign exchange meeting, a domestic open-source expert presented a paper on how a Linux operating system can achieve compatibility with the Windows operating system. This is precisely the direction of current domestic open-source work, but it baffled the foreign experts present. Why would their Chinese colleagues choose, as a research direction for the open-source movement, compatibility with its very antithesis: a closed, proprietary system?
In the foreign experts' view, the point of the open-source movement is open code, shared information, and freedom of use; following in the footsteps of Windows or any closed proprietary system would only strip the movement of its original meaning and leave it forever unable to control its own direction and technical roadmap. The domestic experts countered that, as things stand, users' habits are already fixed: without good Windows compatibility, Linux and other open-source operating systems will simply never be accepted and used by the public.

The main customers of domestic open-source software companies are government users. Governments procure open-source software for essentially two reasons, software legalization and support for the domestic software industry, but most licensed domestic software sits idle after purchase; civil servants go on doing their daily work in pirated Windows and Office because the open-source software "doesn't match their habits". Hence domestic open-source vendors treat "compatibility with Microsoft" as their main direction.

Linus said: "Learning computing is easy; all you need is a second-hand computer and a Linux CD, and you can begin." This cheap way of learning has supplied the world with a great many software engineers, and their participation in commercial work has ultimately lifted the development of the whole industry.


As China's future hope, we should start with ourselves: use less pirated software and join the ranks of open source.





References:

[1] "Scientific Computing Free Software SCILAB and the Development of Education in China", Hu Baogang, written remarks at the 2003 Sino-French SCILAB Workshop

[2] "Stop Stealing, Support Open Source!", Allen Chen's Personal Site

[3] "Apache's Success Should Be Credited to Open Source", by forest

[4] "Science and the Humanities: the Deeper Meaning Behind the Conflict", Wenhui Daily

[5] "Open Source's Answer to Intellectual Property: Rules Under 'Openness'", Sheng Zhongliang

[6] "The 'Chinese Characteristics' of the Open-Source Movement", Internet Weekly, linux999.org

[7] "Baidu Not Only Pirates, It Hotlinks", Liu Ren's blog

[8] "The Economics of Piracy", China Humanities Community

[9] "To Open-Source or Not: Is That Even a Question?", Moonlight Software

[10] "Is Open-Source Software Really Impossible to Promote in China?", sugarcrm's personal space

[11] "Douban, Don't Turn Yourself into a Mere Reader", Gathering Grassroots Strength, Zeemoo's Den

[12] "Open Source's Answer to Intellectual Property: Rules Under 'Openness'", Chinabyte

[13] "The Fusion of Science and Humanities Is the Necessary Path for Training High-Level Talent", Yang Shuzi, Huazhong University of Science and Technology




Saturday, December 23, 2006

Research and Analysis of Several DNA Matching Algorithms


Abstract:

Finding and annotating characteristic subsequences in DNA and analyzing their genetic properties is a hot research topic. Locating many characteristic subsequences by comparison within DNA sequences millions of bases long requires efficient matching algorithms. This paper surveys several DNA matching algorithms and compares their strengths and weaknesses.



Problem description:

Comparing long DNA sequences against each other can reveal hidden genetic information, and searching DNA for characteristic sequences can help detect disease; both are very useful directions. For example, cancer can be detected by testing for mutated genes, and the correspondence between functions and genes can be found by locating fixed structural sequences.

DNA is a one-dimensional sequence of chemical units; it can be viewed as a huge linear tape whose elements come from {A, C, G, T}. Different combinations of elements encode different chemical structures, and an organism's sequence features can characterize its traits. Suppose we compare two sequences X and Y, and X is known to express some harmful trait; if Y is similar to X, we can guess that Y also expresses harmful information. One research goal, then, is to test whether a person's genes contain a defective gene fragment. [1]



Matching problems:

Problem 1: exact matching

Search X to determine whether it contains Y as a substring.


The simplest approach, and one guaranteed to find every occurrence, is brute force.

The algorithm [2] is as follows:

Begin
    initialize A, x, text
    n <- length[text], m <- length[x]
    s <- 0
    while s <= n - m
        do if x[1..m] = text[s+1..s+m]
               then print "pattern occurs at shift" s
           s <- s + 1
end


But this method is inefficient: its time complexity is O((n-m+1)m).
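A direct transcription of the brute-force scan into runnable Python (shifts are 0-based here; the inputs in the usage line are made-up samples):

```python
def naive_match(text: str, x: str) -> list[int]:
    """Slide the pattern over every possible shift, comparing character by character."""
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):        # O(n - m + 1) candidate shifts...
        if text[s:s + m] == x:        # ...each compared in O(m)
            shifts.append(s)
    return shifts

print(naive_match("ACGTACGT", "ACG"))  # → [0, 4]
```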


We can improve matching efficiency with the Boyer-Moore algorithm:

Begin
    initialize A, x, text
    n <- length[text], m <- length[x]
    F <- last-occurrence function of x
    G <- good-suffix function of x
    s <- 0
    while s <= n - m
        do j <- m
           while j > 0 and x[j] = text[s+j]
               do j <- j - 1
           if j = 0
               then print "pattern occurs at shift" s
                    s <- s + G(0)
               else s <- s + max[G(j), j - F(text[s+j])]
end
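Implementing the good-suffix function G takes some care; a minimal sketch of just the last-occurrence half of the idea (essentially the Boyer-Moore-Horspool simplification, not the full algorithm above) already shows where the speedup comes from:

```python
def horspool(text: str, x: str) -> list[int]:
    """Boyer-Moore-Horspool: compare right-to-left; on a mismatch, skip ahead
    using the last occurrence (in the pattern) of the text character currently
    aligned with the pattern's final position."""
    n, m = len(text), len(x)
    # For each character of x except the last, how far the pattern may shift
    # when that character is aligned under x's last position.
    shift = {c: m - 1 - i for i, c in enumerate(x[:-1])}
    hits, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and x[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
        s += shift.get(text[s + m - 1], m)  # unseen character: jump the whole pattern
    return hits

print(horspool("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # → [5]
```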




In real applications, however, exact matching alone is not enough; more often we need matching algorithms with error tolerance. To better understand approximate matching algorithms, we first give a few definitions:

Definition 1: edit distance [2]

The edit distance between strings X and Y is the number of basic operations needed to transform X into Y.

The basic operations are:

Substitution: a character of X is replaced by a character of Y.

Insertion: a character of Y is inserted into X, so X grows by one character.

Deletion: a character of X is deleted, so X shrinks by one character.


Definition 2: wildcard character

If a string contains the wildcard character, that position matches any character.


Definition 3: error-tolerant matching

The two strings being matched are allowed to differ in a certain number of characters. The amount of error can be measured by the edit distance.


The edit distance itself can be computed by dynamic programming:

OPT(i, j) = min[ α(x_i, y_j) + OPT(i-1, j-1), δ + OPT(i-1, j), δ + OPT(i, j-1) ]

where α(x_i, y_j) is the cost of matching x_i with y_j (0 when they are equal) and δ is the gap cost of an insertion or deletion.

For example, aligning "name" (horizontal) with "mean" (vertical), with gap cost δ = 2 and mismatch cost 1 between two vowels or two consonants and 3 between a vowel and a consonant, fills in the following table of OPT values:

N   8   6   5   4   6
A   6   5   3   5   5
E   4   3   2   4   4
M   2   1   3   4   6
-   0   2   4   6   8
    -   N   A   M   E

The highlighted cells in the original figure mark the optimal alignment path; the minimum alignment cost is the top-right entry, OPT(4,4) = 6.

But even this optimal algorithm still has time complexity O(mn).
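The recurrence can be sketched in a few lines of Python, using the cost scheme of the example table (gap δ = 2, mismatch 1 between two vowels or two consonants, 3 between a vowel and a consonant; those particular costs come from the example, not from the algorithm itself):

```python
def align_cost(x: str, y: str, gap: int = 2) -> int:
    """Minimum-cost alignment (weighted edit distance) by dynamic programming."""
    vowels = set("aeiou")

    def alpha(a: str, b: str) -> int:
        if a == b:
            return 0
        return 1 if (a in vowels) == (b in vowels) else 3

    m, n = len(x), len(y)
    opt = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        opt[i][0] = gap * i                      # delete all of x[:i]
    for j in range(n + 1):
        opt[0][j] = gap * j                      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            opt[i][j] = min(alpha(x[i - 1], y[j - 1]) + opt[i - 1][j - 1],
                            gap + opt[i - 1][j],
                            gap + opt[i][j - 1])
    return opt[m][n]

print(align_cost("name", "mean"))  # → 6, the corner value of the example table
```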

There are other algorithms for finding string alignments that include substitutions and gaps [3], but those methods require special hardware support.


Recent research has proposed several efficient methods:

One of them is SAX (Symbolic Aggregate approXimation). The main idea is to convert the time series into a string and then, using statistical regularities, quickly find the anomalous segments in the sequence.


For example:

For a regular time-series signal such as an ECG, where sudden changes are rare, the SAX method saves a great deal of computation and is very simple.


The reasoning behind the simplification: a series of approximate computations coarsens the granularity, cutting the amount of computation while still keeping good accuracy.



Here is how a time-series signal is turned into a string: choose suitable intervals and map the signal within each interval segment to a letter, as in the figure below.

Then count the frequencies of the resulting letter sequences and convert them into a bitmap.

For example, for the DNA sequence:

CCGTGCTAGGGCCACCTACCTTGGTCCGCCGCAAGCTCATCTGCGCGAACCAGAACGCCACCACCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCGACGATAAAGAAGAGAGTCGACCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACA


the frequencies are

C: 0.20

A: 0.26

T: 0.24

G: 0.30
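The frequency step is just symbol counting; a quick check in Python (reusing the sequence quoted above):

```python
from collections import Counter

seq = "CCGTGCTAGGGCCACCTACCTTGGTCCGCCGCAAGCTCATCTGCGCGAACCAGAACGCCACCACCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCGACGATAAAGAAGAGAGTCGACCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACA"

counts = Counter(seq)
freqs = {base: counts[base] / len(seq) for base in "ACGT"}
print({base: round(f, 2) for base, f in freqs.items()})
```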


Rendered as a bitmap:

In the same way, bitmaps can be built for two-letter combinations, or even for longer substrings.

Then, coloring the cells according to their frequencies gives the bitmaps below.

Now, to compare two time series for similarity, we only need to compare their two bitmaps, since similar sequences necessarily have similar bitmaps.



Another method is LSH (locality-sensitive hashing). LSH was proposed by Indyk and Motwani in 1998 and later optimized by Gionis et al. for high-dimensional geometric problems. Its aim is to reduce substring matching on strings to a more manageable exact-matching problem.

The principle is as follows:

Take two strings s1 and s2, both of length d over alphabet Σ, and fix some r < d. If s1 and s2 differ in at most r character positions, we call s1 and s2 similar.

To test whether two strings are similar, we construct the following random filter. Choose k positions i1, i2, ..., ik from the set {1...d}; to simplify the computation, positions may repeat. Define the function f: Σ^d -> Σ^k by

f(s) = <s[i1], s[i2], ..., s[ik]>

f is called an LSH function. We declare two strings similar if and only if f(s1) = f(s2).

For genuinely similar strings, f(s1) = f(s2) holds with probability at least (1 - r/d)^k.

So similar strings are easily recognized, and dissimilar strings are also told apart with high probability. It is possible, however, for a dissimilar pair to be classified as similar, when f(s1) = f(s2) happens to hold by chance.


To judge whether two DNA sequences are similar, just choose a good k and apply f, and the result comes out quickly.
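A minimal sketch of the LSH filter just described (the sampled positions and the helper names here are illustrative, not from any particular library):

```python
import random

def make_lsh(d: int, k: int, seed: int = 0):
    """Build one LSH function f: sample k positions (with repetition) from 0..d-1."""
    rng = random.Random(seed)
    positions = [rng.randrange(d) for _ in range(k)]

    def f(s: str) -> tuple:
        assert len(s) == d
        return tuple(s[i] for i in positions)

    return f

f = make_lsh(d=8, k=4)
print(f("ACGTACGT") == f("ACGTACGT"))  # → True: identical strings always collide
# Strings differing in at most r of d positions collide with
# probability at least (1 - r/d) ** k, e.g. r = 2, d = 8, k = 4:
print((1 - 2 / 8) ** 4)  # → 0.31640625
```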



Of course, besides SAX and LSH there are many other methods. These two struck me as the most distinctive, since they depart from the usual line of thought, and I think they can be applied in many other fields as well.




References:

[1] Algorithm design. Jon Kleinberg, Eva Tardos Pearson education 2006

[2] Pattern Classification. Richard O. Duda, Peter E. Hart, David G. Stork. 2004

[3] Huang, X. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337-357

[4] E. Keogh, J. Lin and A. Fu (2005). HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005), pp. 226 - 233., Houston, Texas, Nov 27-30, 2005.

[5] Li Wei, Eamonn Keogh and Xiaopeng Xi (2006) SAXually Explicit Images: Finding Unusual Shapes. ICDM 2006.

[6] Li Wei and Eamonn Keogh (2006) Semi-Supervised Time Series Classification. SIGKDD 2006

[7] Xiaopeng Xi, Eamonn Keogh, Christian Shelton, Li Wei & Chotirat Ann Ratanamahatana (2006). Fast Time Series Classification Using Numerosity Reduction. ICML.

[8] Kumar, N., Lolla N., Keogh, E., Lonardi, S. , Ratanamahatana, C. A. and Wei, L. (2005). Time-series Bitmaps: A Practical Visualization Tool for working with Large Time Series Databases . In proceedings of SIAM International Conference on Data Mining (SDM '05), Newport Beach, CA, April 21-23. pp. 531-535

[9] Ratanamahatana, C. A. and Keogh. E. (2004). Everything you know about Dynamic Time Warping is Wrong. Third Workshop on Mining Temporal and Sequential Data, in conjunction with the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), August 22-25, 2004 - Seattle, WA.

[10] Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. Proceedings of the Twentieth Annual Symposium on Computational Geometry, 2004.

[11] A. Gionis, P. Indyk, R. Motwani. Similarity Search in High Dimensions via Hashing. Proc. 25th VLDB Conference, 1999.

[12] Jeremy Buhler. Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics 17(5), 419-428, 2001.

[13] C. Yang. MACS: Music Audio Characteristic Sequence Indexing for Similarity Retrieval. IEEE Workshop, 2001.

[14] Piotr Indyk, Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. ACM, 1998.

[15] Jeremy Buhler. Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics 17(5), 419-428, 2001.


Friday, December 22, 2006

Thin computers -- the thing I want.


A thin computer is a small-form-factor, "energy-saving" computing device with no cooling fan and no hard disk for storage. The "motherboard part" of a thin computer draws less than 20 watts, which is next to nothing. A place where many thin computers gather can run in dead silence, without the hum that a roomful of fans fills the air with. So what is a thin computer good for?

A thin computer usually looks like a "small box". Inside is a small integrated circuit board carrying the CPU, ROM, memory (64 MB or less), a network interface card, and so on. How does a thin computer boot? How does it run? That is indeed the question.

Quite a few computer science graduates cannot explain clearly how a computer boots. They only know how to power the machine on and start an operating system (say, Windows), and that is that; what actually happens inside the computer (its memory), what "electrical states" arise, remains a mystery to them. Thin computers are even less ordinary and a bit more complicated: a thin computer cannot run on its own; it depends on the support of a network server. For this reason a thin computer is also called a "network computer": it must rely on the network to run.

About five or six years ago, Microsoft's so-called "Venus project" tried to launch exactly this kind of thing. In the following years, domestic vendors "followed suit", but the network computers they produced were still "dumb" and not thin at all, and even came with a cooling fan. Those network computers still had quite a few application modules integrated (burned in), such as an operating system and a browser, and could go "surfing" online on their own. Strictly speaking, such a "not fat" (or merely "slimmed-down") computer cannot be called a thin computer.

So where is the real thin computer? Can we run all the "application software" on the server, with the thin computer merely sending requests ("inputting" requests to the server) and then receiving its "service", displaying the server's output on the thin computer's screen? Can we slim the thin computer down to the bare minimum, leaving only a "computing skeleton" that still delivers rich computing functions while being extremely "energy-saving"? In the thin computer we envision, no small operating system or browser may be burned in beforehand; by and large, no system software may be fixed inside it at all. We want the thin computer to stay "lean" to the very end.

In 1999, Jim McQuillan launched the so-called "LTSP", an acronym for the Linux Terminal Server Project. Today, LTSP has gradually evolved into an add-on package for the Linux operating system. Install the LTSP package on a server, and all the ideas about thin computers described above can be realized. Each thin computer obtains its configuration via DHCP (Dynamic Host Configuration Protocol); the server then uses TFTP (Trivial File Transfer Protocol) to transfer a minimal Linux (kernel) to the thin computer, on which one can then enter any suitable request and carry out computing tasks. A thin computer can use "network boot", relying on the boot program resident in its ROM (read-only memory). Since 1998 most computer motherboards have integrated a NIC (network interface controller), so the BIOS provides PXE (Preboot eXecution Environment), and achieving this is not hard.
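The boot chain described above is driven by the server's DHCP configuration. A hypothetical ISC dhcpd.conf fragment for an LTSP server might look like this (the addresses and paths are illustrative placeholders, not from the original post):

```conf
# ISC dhcpd: answer PXE clients and point them at the TFTP boot image
subnet 192.168.0.0 netmask 255.255.255.0 {
    range 192.168.0.100 192.168.0.200;
    option root-path "/opt/ltsp/i386";   # NFS root for the thin clients
    next-server 192.168.0.1;             # TFTP server address
    filename "/ltsp/i386/pxelinux.0";    # boot loader sent over TFTP
}
```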

It is particularly worth pointing out that in recent years Ubuntu founder Mark Shuttleworth has personally contributed a great deal to LTSP, making LTSP a core technology of Ubuntu, and especially of the education edition, Edubuntu. The Edubuntu package lays a solid foundation for the so-called "computer classroom" (or any place where thin computers gather). Computer classrooms built on LTSP thin computers are the material basis of so-called "open-source education". Building Edubuntu computer classrooms and running open-source education benefits the country and the people; why not do it?


Thursday, December 21, 2006

Good replacement for MatLab -- Octave


GNU Octave is a high-level language, primarily intended for numerical
computations. It provides a convenient command line interface for
solving linear and nonlinear problems numerically, and for performing
other numerical experiments using a language that is mostly compatible
with Matlab. It may also be used as a batch-oriented language.

Trying to run Octave on a cluster.
I just wanted to try KVM in the Debian Linux running inside VMware on my laptop... but it fails.
It turns out my Centrino CPU doesn't support VT (virtualization technology)!
Who can give me money to buy the newest CPU for experiments?!

Communicating Sequential Processes (CSP)

Communicating Sequential Processes, or CSP, is a language for describing patterns of interaction. It is supported by an elegant, mathematical theory, a set of proof tools, and an extensive literature. The book Communicating Sequential Processes was first published in 1985 by Prentice Hall International (who have kindly released the copyright); it is an excellent introduction to the language, and also to the mathematical theory.
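CSP models a system as processes that interact only by passing messages over channels. A rough Python analogy of that interaction pattern (threads standing in for processes, a bounded Queue standing in for a channel; this illustrates the style, not CSP's formal semantics):

```python
import threading
import queue

def producer(channel: queue.Queue) -> None:
    for item in range(3):
        channel.put(item)      # a "send" event on the channel
    channel.put(None)          # sentinel: end of stream

def consumer(channel: queue.Queue, out: list) -> None:
    while (item := channel.get()) is not None:   # a "receive" event
        out.append(item * 2)

channel: queue.Queue = queue.Queue(maxsize=1)    # small buffer: rendezvous-like
received: list = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, received))
t1.start(); t2.start(); t1.join(); t2.join()
print(received)  # → [0, 2, 4]
```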



I will start reading the book soooooooon.

The communist society of computing~

I discovered that Plan 9 proposed, back in the 1980s, the ideal system architecture we want today.

Computing power and storage are concentrated at the center, and each person needs only a simple terminal. The money individuals save (and with no personal upgrades needed any more) can go to the center to buy more machines. The center is of course distributed, and everyone gets access to far more computing power.

How great is that. No more waiting all day for downloads like now: the data is all at the center, and you get it immediately.
Quantitative change leads to qualitative change

A task used to finish within 2 minutes, so a 5-minute timeout was set, and it worked fine all along.
Now the task has grown complex and needs 6 minutes, and mysterious errors appear: the job keeps being treated as failed.

Wednesday, December 20, 2006

Installing KVM!

How to create an image?
Damn! A hacker broke my username and password on the server, and started a UDP flood and an SSH scan.
How many hackers exist in the world!
Can I be a consultant?

I heard that IBM has a great consulting department. How can I prepare to join it?
Or Accenture is another option.

Can anyone recommend me to these corporations?
Try google adsense~

Have you noticed that there are ads in my blog now?

Is it coooooool?

I can tell how Google decides what kind of blog this is.

And more~~ I can earn a penny~ Will they pay a penny by check?

check it out~

Friday, December 15, 2006

OpenMOSIX success!

Two virtual machines linked up: running 2 programs on one machine, each at 100% CPU, maxed out the CPUs of both machines!! Awesome.

The steps are simple too: just boot from the CD!

A bolt from the blue!!! It turns out MOSIX and SSI do process-level migration, not thread-level. A single standalone program can't soak up the CPUs of all the machines.

Year-end review: is it time to switch to Linux?

2006.12.14, from IT168

link:http://news.csdn.net/n/20061214/99383.html

Your next operating system may well be one of the many free desktop Linux distributions, rather than Microsoft's Windows Vista. With Linux, you won't have to perform a risky software upgrade, and if your new computer's configuration isn't high enough, you won't have to settle for a feature-limited edition of Vista.
  The Linux operating system, whose proven ability to resist viruses is widely known, is gaining recognition from home and small-business users.
  Without large-scale media advertising, the various Linux distributions usually spread through user satisfaction and people's organic word of mouth on the internet.

  The desktop: the last fortress
  Linux already rules the server market and is pushing into all kinds of embedded systems, such as TiVo, mobile phones, PDAs, and routers. The desktop is the last fortress it needs to take.
  From a loyal Linux user's point of view, this operating system already has everything it takes to be a desktop OS for home and business users. Linux runs more stably than Windows, and costs less.
  Industry analysts say that whether in large enterprises, small and medium organizations, homes, or education, Linux can be one of the strongest operating-system candidates.

  Less risk
  For newcomers who have just started using computers, learning to operate Linux is not much different from learning Windows or Mac OS. For experienced users, migrating to Linux does not mean you have to give up your important Windows applications: many Linux programs are very similar to common Windows applications.
  In addition, you can run Windows applications through virtualization software such as VMware or Parallels, or through application compatibility tools such as Wine and CrossOver Office.
  Even when a Windows program itself cannot be carried over to Linux, the data it created can. Most open-source programs (OpenOffice is one example) can read and write files in the same formats as Microsoft Office documents, including Excel spreadsheets and PowerPoint files.

  The downsides
  Still, program compatibility can sometimes be an annoyance. In fact, it is the most common problem newcomers hit when switching to Linux.
  Sometimes they face this challenge: they need to re-familiarize themselves with the fine points of a different graphical interface; similar applications, features, or tools may have unfamiliar names or operations, and adjusting old habits may take some time.
  The need for support can be another headache when switching to Linux. Home and small-business users must develop the ability to support themselves, learning to use resources such as newsgroups and forums to solve the problems they run into.
  Among all current Linux distributions, Ubuntu, developed by the European company Canonical, is shrinking the newcomer's learning curve fastest. Ubuntu has risen rapidly on its ease of use and ease of configuration.
  Users can now download Ubuntu and try it for free, experiencing this Linux from a live CD without changing anything on their current Windows computer.
  Also, Linux's support for games is still weaker, though developers are working hard to solve this.
  For many people, the biggest reason to switch to Linux is freedom. Linux belongs to no one; at bottom it is a community project.
  As Linux distributions continue to become friendlier to install and use, surely more and more people will choose Linux.
Recommending a fun image search:
image search engine: RIYA
http://www.riya.com/index?btnSearch=riya
Why is it hard to do an IT startup on the Zhejiang University campus?

First, the infrastructure: they actually charge for internet access!
And it's tiered: 10 yuan a month for domestic traffic, 50 for international. What era is that kind of division from? Internet access should be free! And even after paying, the bandwidth is pitiful. Who knows how many people this meager bandwidth has hurt, unable to file online applications to companies and schools. Cut off from foreign sites, we lose many chances to touch the latest trends, and end up with only Baidu and Tudou. Mind you, those site models existed abroad long ago; had we known a day earlier, I believe ZJU students had every bit of the technical ability to build them. IP binding also severely limits experiments, say with virtual machines or clusters: how do you bind an IP for those?.. Everyone should be able to get online freely, with wireless covering the whole campus. On top of that, campus resources are hard to reach from off campus, and even entering campus costs money? All of this can be punched through technically, but it burns effort and the public can't enjoy it.
The labs seem a bit better, with free domestic and international access, but restrictions abound: ports are blocked, only HTTP is left open. So mail can't be fetched, and FTP, CVS and the like are unreachable. Downloading anything takes every trick in the book, and all your time is burned. Add an unstable network and rampant viruses, and we're embarrassed to call ourselves a CS lab.
Then there's the lack of exchange between labs and between universities: much work others have already done, yet we grope from scratch ourselves.
And try finding a warm place on campus with power and network access: you can't. A few of us wanted to brainstorm, and it ended in a fizzle. Though we've now found that the Yongqian tea bar seems workable.

Thursday, December 14, 2006

Still working hard to install OpenSSI

After fiddling for half a day, with a friend's help I discovered my Debian version was wrong: what's supported is sarge (stable), while my sources.list was using sid (unstable). No wonder all kinds of bizarre problems came up.

Now things are finally normal... normal enough to run into the problems other people have already asked about; still studying how to install it.

I also found many other similar projects, and plan to try them one by one.

Extra, extra: in Hangzhou you can buy a P3 500 MHz machine with 128 MB of RAM for 280 yuan. Once I manage to install OpenSSI, I'll buy eight of them to play with. If it's fun, I'll keep buying.

Wednesday, December 13, 2006

Have you heard of concurrency programming?
Functional programming?
For example, on multicore you have to think about concurrency;
it used to be all sequential programming.
I've been studying it lately,
and it feels like the direction of the future.
Functional programming was proposed back in the 1970s,
and many American universities make it required coursework:
Scheme.
I found it by feeling my way along, too:
starting from Erlang,
to functional programming and concurrency,
then reading some programming language design,
combined with Google-style compute centers; Amazon now offers the EC2 service.
You can google "Amazon EC2" and have a look,
a very interesting service.
Suddenly I realized that much of program optimization actually revolves around concurrency.

The directions of change: new hardware architectures (such as multicore), new hardware organization (clusters), new languages (functional programming, Lisp, Erlang), and compiler optimizations (to help parallelization).
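The multicore point is easy to see even from Python's standard library: the same map can run sequentially on one core or be spread across several (a toy sketch; any speedup depends on the machine and the workload):

```python
from multiprocessing import Pool

def square(n: int) -> int:
    return n * n

if __name__ == "__main__":
    data = list(range(8))
    sequential = [square(n) for n in data]   # one core, one item at a time
    with Pool(processes=4) as pool:          # several worker processes
        parallel = pool.map(square, data)
    print(parallel == sequential)  # → True: same result, different execution
```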
the future of microprocessors

copy from pdf

26 September 2005 QUEUE rants: feedback@acmqueue.com

The Future of Microprocessors
KUNLE OLUKOTUN AND LANCE HAMMOND, STANFORD UNIVERSITY
The performance of microprocessors that power modern computers has continued to increase exponentially over the years for two main reasons. First, the transistors that are the heart of the circuits in all processors and memory chips have simply become faster over time on a course described by Moore's law,1 and this directly affects the performance of processors built with those transistors. Moreover, actual processor performance has increased faster than Moore's law would predict,2 because processor designers have been able to harness the increasing numbers of transistors available on modern chips to extract more parallelism from software. This is depicted in figure 1 for Intel's processors.

[Figure 1: Intel performance over time; relative performance (log scale) by year, 1985-2003.]

An interesting aspect of this continual quest for more parallelism is that it has been pursued in a way that has been virtually invisible to software programmers. Since they were invented in the 1970s, microprocessors have continued to implement the conventional von Neumann computational model, with very few exceptions or modifications. To a programmer, each computer consists of a single processor executing a stream of sequential instructions and connected to a monolithic "memory" that holds all of the program's data. Because the economic benefits of backward compatibility with earlier generations of processors are so strong, hardware designers have essentially been limited to enhancements that have maintained this abstraction for decades. On the memory side, this has resulted in processors with larger cache memories, to keep frequently accessed portions of the conceptual "memory" in small, fast memories that are physically closer to the processor, and large register files to hold more active data values in an extremely small, fast, and compiler-managed region of "memory." Within processors, this has resulted in a variety of modifications designed to achieve one of two goals: increasing the number of instructions from the processor's instruction sequence that can be issued on every cycle, or increasing the clock frequency of the processor faster than Moore's law would normally allow. Pipelining of individual instruction execution into a sequence of stages has allowed designers to increase clock rates as instructions have been sliced into larger numbers of increasingly small steps, which are designed to reduce the amount of logic that needs to switch during every clock cycle. Instructions that once took a few cycles to execute in the 1980s now often take 20 or more in today's leading-edge processors, allowing a nearly proportional increase in the possible clock rate. Meanwhile, superscalar processors were developed to execute multiple instructions from a single, conventional instruction stream on each cycle. These function by dynamically examining sets of instructions from the instruction stream to find ones capable of parallel execution on each cycle, and then executing them, often out of order with respect to the original program. Both techniques have flourished because they allow instructions to execute more quickly while maintaining the key illusion for programmers that all instructions are actually being executed sequentially and in order, instead of overlapped and out of order. Of course, this illusion is not absolute. Performance can often be improved if programmers or compilers adjust their instruction scheduling and data layout to map more efficiently to the underlying pipelined or parallel architecture and cache memories, but the important point is that old or untuned code will still execute correctly on the architecture, albeit at less-than-peak speeds.

[Pull quote: Chip multiprocessors' promise of huge performance gains is now a reality.]

Unfortunately, it is becoming increasingly difficult for processor designers to continue using these techniques to enhance the speed of modern processors. Typical instruction streams have only a limited amount of usable parallelism among instructions,3 so superscalar processors that can issue more than about four instructions per cycle achieve very little additional benefit on most applications. Figure 2 shows how effective real Intel processors have been at extracting instruction parallelism over time. There is a flat region before instruction-level parallelism was pursued intensely, then a steep rise as parallelism was utilized usefully, followed by a tapering off in recent years as the available parallelism has become fully exploited. Complicating matters further, building superscalar processor cores that can exploit more than a few instructions per cycle becomes very expensive, because the complexity of all the additional logic required to find parallel instructions dynamically is approximately proportional to the square of the number of instructions that can be issued simultaneously. Similarly, pipelining past about 10-20 stages is difficult because each pipeline stage becomes too short to perform even a minimal amount of
logic, such as adding two integers together, beyond which the design of the pipeline is significantly more complex. In addition, the circuitry overhead from adding pipeline registers and bypass path multiplexers to the existing logic combines with performance losses from events that cause pipeline state to be flushed, primarily branches. This overwhelms any potential performance gain from deeper pipelining after about 30 stages. Further advances in both superscalar issue and pipelining are also limited by the fact that they require ever-larger numbers of transistors to be integrated into the high-speed central logic within each processor core; so many, in fact, that few companies can afford to hire enough engineers to design and verify these processor cores in reasonable amounts of time. These trends have slowed the advance in processor performance somewhat and have forced many smaller vendors to forsake the high-end processor business, as they could no longer afford to compete effectively.

Today, however, all progress in conventional processor core development has essentially stopped because of a simple physical limit: power. As processors were pipelined and made increasingly superscalar over the course of the past two decades, typical high-end microprocessor power went from less than a watt to over 100 watts. Even though each silicon process generation promised a reduction in power, as the ever-smaller transistors required less power to switch, this was true in practice only when existing designs were simply "shrunk" to use the new process technology. Processor designers, however, kept using more transistors in their cores to add pipelining and superscalar issue, and switching them at higher and higher frequencies. The overall effect was that exponentially more power was required by each subsequent processor generation (as illustrated in figure 3). Unfortunately, cooling technology does not scale exponentially nearly as easily. As a result, processors went from needing no heat sinks in the 1980s, to moderate-size heat sinks in the 1990s, to today's monstrous heat sinks, often with one or more dedicated fans to increase airflow over the processor. If these trends were to continue, the next generation of microprocessors would require very exotic cooling solutions, such as dedicated water cooling, that are economically impractical in all but the most expensive systems.

The combination of limited instruction parallelism suitable for superscalar issue, practical limits to pipelining, and a "power ceiling" limited by practical cooling limitations has limited future speed increases within conventional processor cores to the basic Moore's law improvement rate of the underlying transistors. This limitation is already causing major processor manufacturers such as Intel and AMD to adjust their marketing focus away from simple core clock rate. Although larger cache memories will continue to improve performance somewhat, by speeding access to the single "memory" in the conventional model, the simple fact is that without more radical changes in processor design, microprocessor performance increases will slow dramatically in the future. Processor designers must find new ways to effectively utilize the increasing transistor budgets in high-end silicon chips to improve performance in ways that minimize both additional power usage and design complexity. The market for microprocessors has become stratified into areas with different performance requirements, so it is useful to examine the problem from the point of view of these different performance requirements.
[Figure 2: Intel performance from ILP; relative performance per cycle by year, 1985-2003.]
THROUGHPUT PERFORMANCE IMPROVEMENT

With the rise of the Internet, the need for servers capable of handling a multitude of independent requests arriving rapidly over the network has increased dramatically. Since individual network requests are typically completely independent tasks, whether those requests are for Web pages, database access, or file service, they are typically spread across many separate computers built using high-performance conventional microprocessors (figure 4a), a technique that has been used at places like Google for years to match the overall computation throughput to the input request rate.4

As the number of requests increased over time, more servers were added to the collection. It has also been possible to replace some or all of the separate servers with multiprocessors. Most existing multiprocessors consist of two or more separate processors connected using a common bus, switch hub, or network to shared memory and I/O devices. The overall system can usually be physically smaller and use less power than an equivalent set of uniprocessor systems because physically large components such as memory, hard drives, and power supplies can be shared by some or all of the processors.

Pressure has increased over time to achieve more performance per unit volume of data-center space and per watt, since data centers have finite room for servers and their electric bills can be staggering. In response, the server manufacturers have tried to save space by adopting denser server packaging solutions, such as blade servers, and switching to multiprocessors that can share components. Some power reduction has also occurred through the sharing of more power-hungry components in these systems. These short-term solutions are reaching their practical limits, however, as systems are reaching the maximum component density that can still be effectively air-cooled. As a result, the next stage of development for these systems involves a new step: the CMP (chip multiprocessor).5

The first CMPs targeted toward the server market implement two or more conventional superscalar processors together on a single die.6,7,8,9 The primary motivation for this is reduced volume: multiple processors can now fit in the space where formerly only one could, so overall performance per unit volume can be increased. Some savings in power also occurs because all of the processors on a single die can share a single connection to the rest of the system, reducing the amount of high-speed communication infrastructure required, in addition to the sharing possible with a conventional multiprocessor. Some CMPs, such as the first ones announced from AMD and Intel, share only the system interface between processor cores (illustrated in figure 4b), but others share one or more levels of on-chip cache (figure 4c), which allows interprocessor communication between the CMP cores without off-chip accesses.

Further savings in power can be achieved by taking advantage of the fact that while server workloads require high throughput, the latency of each request is generally
[Figure 3: Intel power over time; power in watts (log scale) by year, 1985-2003.]
not as critical.10 Most users will not be bothered if their Web pages take a fraction of a second longer to load, but they will complain if the Web site drops page requests because it does not have enough throughput capacity. A CMP-based system can be designed to take advantage of this situation. When a two-way CMP replaces a uniprocessor, it is possible to achieve essentially the same or better throughput on server-oriented workloads with just half of the original clock speed. Each request may take up to twice as long to process because of the reduced clock rate. With many of these applications, however, the slowdown will be much less, because request processing time is more often limited by memory or disk performance than by processor performance. Since two requests can now be processed simultaneously, however, the overall throughput will now be the same or better, unless there is serious contention for the same memory or disk resources. Overall, even though performance is the same or only a little better, this adjustment is still advantageous at the system level. The lower clock rate allows us to design the system with a significantly lower power supply voltage, often a nearly linear reduction. Since power is proportional to the square of the voltage, however, the power required to obtain the original performance is much lower, usually about half (half of the voltage squared = a quarter of the power, per processor, so the power required for both processors together is about half), although the potential savings could be limited by static power dissipation and any minimum voltage levels required by the underlying transistors.

For throughput-oriented workloads, even more power/performance and performance/chip area can be achieved by taking the "latency is unimportant" idea to its extreme and building the CMP with many small cores instead of a few large ones. Because typical server workloads have very low amounts of instruction-level parallelism and many memory stalls, most of the hardware associated with superscalar instruction issue is essentially wasted for these applications. A typical server will have tens or hundreds of requests in flight at once, however, so there is enough work available to keep many processors busy simultaneously. Therefore, replacing each large, superscalar processor in a CMP with several small ones, as has been demonstrated successfully with the Sun Niagara,11 is a winning policy. Each small processor will process its request more slowly than a larger, superscalar processor, but this latency slowdown is more than compensated for by the fact that the same chip area can be occupied by a much larger number of processors, about four times as many, in the case
is a winning policy. Each small processor will process its request more slowly than a larger, superscalar processor, but this latency slowdown is more than compensated for by the fact that the same chip area can be occupied by a much larger number of processors—about four times as many, in the case CMP Implementation Options
main memory
L2 cache
CPU core 1
L1 I$ L1 D$
regs regs
regs regs
CPU core N
L1 I$ L1 D$
regs regs
regs regs
I/O
d) multithreaded, shared-cache
chip multiprocessor
main memory
L2 cache
L2 cache
CPU core 1
L1 I$ L1 D$
registers registers
CPU core N
L1 I$ L1 D$
I/O
c) shared-cache chip multiprocessor
main memory
L2 cache L2 cache
CPU core 1
L1 I$ L1 D$
registers registers
CPU core N
L1 I$ L1 D$
I/O
b) simple chip multiprocessor
main memory
CPU core
L1 I$ L1 D$
registers
I/O
a) conventional microprocessor
FIG 4 FIG 4
32 September 2005 QUEUE rants: feedback@acmqueue.com
of Niagara, which has eight single-issue SPARC processor cores in a technology that can hold only a pair of super- scalar UltraSPARC cores.
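The voltage and power arithmetic earlier in this section can be sanity-checked with a short calculation. This is a sketch of the article's own estimate, in which power is taken as proportional to the square of the supply voltage and the clock-rate factor is folded into the rough "about half" figure; the function name and the specific fractions are illustrative:

```python
def relative_power(voltage_fraction, cores=1):
    # Power per core is modeled as proportional to voltage squared,
    # following the estimate in the text; total scales with core count.
    return cores * voltage_fraction ** 2

# One core at full voltage and full clock: the baseline.
single_full_speed = relative_power(1.0, cores=1)

# Two cores at roughly half voltage, each at half clock: same overall
# throughput on a throughput-oriented workload.
dual_half_speed = relative_power(0.5, cores=2)

assert single_full_speed == 1.0
assert dual_half_speed == 0.5  # same throughput at about half the power
```

The quarter-power-per-core figure is exactly (1/2)^2, so doubling the core count brings the total back up to half of the baseline, matching the parenthetical estimate in the text.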
Taking this idea one step further, still more latency can be traded for higher throughput with the inclusion of multithreading logic within each of the cores.12,13,14 Because each core tends to spend a fair amount of time waiting for memory requests to be satisfied, it makes sense to assign each core several threads by including multiple register files, one per thread, within each core (figure 4d). While some of the threads are waiting for memory to respond, the processor may still execute instructions from the others. Larger numbers of threads can also allow each processor to send more requests off to memory in parallel, increasing the utilization of the highly pipelined memory systems on today's processors. Overall, threads will typically have a slightly longer latency, because there are times when all are active and competing for the use of the processor core. The gain from performing computation during memory stalls and the ability to launch numerous memory accesses simultaneously more than compensate for this longer latency on systems such as Niagara, which has four threads per processor or 32 for the entire chip, and Pentium chips with Intel's Hyperthreading, which allows two threads to share a Pentium 4 core.
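The latency-hiding effect can be sketched in software. In this illustrative analogy (not a model of Niagara's hardware threads), each "request" spends nearly all of its time stalled, modeled here as a sleep standing in for a memory stall, so running several on overlapping threads finishes the batch in roughly the time of one stall rather than the sum of all of them:

```python
import threading
import time

def handle_request(results, i):
    # Each request is dominated by a stall; the sleep stands in for
    # the time a hardware thread would spend waiting on memory.
    time.sleep(0.05)
    results[i] = i * 2

results = {}
start = time.perf_counter()
threads = [threading.Thread(target=handle_request, args=(results, i))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Eight 50 ms stalls overlap, so the batch takes on the order of 50 ms,
# not the ~400 ms a one-at-a-time loop would need.
print(f"{elapsed:.3f}s for 8 overlapped requests")
```

As in the hardware case, each individual request is no faster (slightly slower, if anything), but the aggregate finishes much sooner because stalls overlap.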
LATENCY PERFORMANCE IMPROVEMENT
The performance of many important applications is measured in terms of the execution latency of individual tasks instead of high overall throughput of many essentially unrelated tasks. Most desktop processor applications still fall in this category, as users are generally more concerned with their computers responding to their commands as quickly as possible than they are with their computers' ability to handle many commands simultaneously, although this situation is changing slowly over time as more applications are written to include many "background" tasks. Users of many other computation-bound applications, such as most simulations and compilations, are typically also more interested in how long the programs take to execute than in executing many in parallel.
Multiprocessors can speed up these types of applications, but it requires effort on the part of programmers to break up each long-latency thread of execution into a large number of smaller threads that can be executed on many processors in parallel, since automatic parallelization technology has typically functioned only on Fortran programs describing dense-matrix numerical computations. Historically, communication between processors was generally slow in relation to the speed of individual processors, so it was critical for programmers to ensure that threads running on separate processors required only minimal communication with each other. Because communication reduction is often difficult, only a small minority of users bothered to invest the time and effort required to parallelize their programs in a way that could achieve speedup, so these techniques were taught only in advanced, graduate-level computer science courses. Instead, in most cases programmers found that it was just easier to wait for the next generation of uniprocessors to appear and speed up their applications for "free" instead of investing the effort required to parallelize their programs. As a result, multiprocessors had a hard time competing against uniprocessors except in very large systems, where the target performance simply exceeded the power of the fastest uniprocessors available.
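The decomposition described above, splitting one long-latency computation into many smaller threads, can be sketched as follows. The helper name `parallel_sum` and the chunking scheme are ours, not from the article, and because of CPython's global interpreter lock this illustrates the structure of the decomposition rather than a real speedup for CPU-bound work:

```python
import threading

def parallel_sum(data, n_threads=4):
    # Break one long computation into n_threads smaller, independent
    # pieces, run them in parallel, then combine the partial results.
    chunk = (len(data) + n_threads - 1) // n_threads
    partials = [0] * n_threads

    def worker(k):
        partials[k] = sum(data[k * chunk:(k + 1) * chunk])

    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

assert parallel_sum(list(range(100_000))) == sum(range(100_000))
```

The programmer's burden is exactly what this sketch shows: choosing the split, keeping the pieces independent, and merging the results, none of which the compiler does automatically outside the dense-matrix Fortran cases mentioned above.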
With the exhaustion of essentially all performance gains that can be achieved for "free" with technologies such as superscalar dispatch and pipelining, we are now entering an era where programmers must switch to more parallel programming models in order to exploit multiprocessors effectively, if they desire improved single-program performance. This is because there are only three real "dimensions" to processor performance increases beyond Moore's law: clock frequency, superscalar instruction issue, and multiprocessing. We have pushed the first two to their logical limits and must now embrace multiprocessing, even if it means that programmers will be forced to change to a parallel programming model to achieve the highest possible performance.
Conveniently, the transition from multiple-chip systems to chip multiprocessors greatly simplifies the problems traditionally associated with parallel programming. Previously it was necessary to minimize communication between independent threads to an extremely low level, because each communication could require hundreds or even thousands of processor cycles. Within any CMP with a shared on-chip cache memory, however, each communication event typically takes just a handful of processor cycles. With latencies like these, communication delays have a much smaller impact on overall system performance. Programmers must still divide their work into parallel threads, but do not need to worry nearly as much about ensuring that these threads are highly independent, since communication is relatively cheap. This is not a complete panacea, however, because programmers must still structure their inter-thread synchronization correctly, or the program may generate incorrect results or deadlock, but at least the performance impact of communication delays is minimized.
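Cheap communication does not remove the need for correct synchronization. In this minimal Python sketch (the shared counter and thread counts are arbitrary), the lock is what keeps the read-modify-write on the shared counter from interleaving across threads and losing updates:

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        # The increment is a read-modify-write; without the lock, two
        # threads could read the same value and one update would be lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 40_000
```

The usual guard against the deadlock mentioned above is equally simple in principle: when multiple locks are needed, every thread acquires them in the same fixed order.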
Parallel threads can also be much smaller and still be effective—threads that are only hundreds or a few thousand cycles long can often be used to extract parallelism with these systems, instead of the threads millions of cycles long typically necessary with conventional parallel machines. Researchers have shown that parallelization of applications can be made even easier with several schemes involving the addition of transactional hardware to a CMP.15,16,17,18,19 These systems add buffering logic that lets threads attempt to execute in parallel, and then dynamically determines whether they are actually parallel at runtime. If no inter-thread dependencies are detected at runtime, then the threads complete normally. If dependencies exist, then the buffers of some threads are cleared and those threads are restarted, dynamically serializing the threads in the process.
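In software, the buffer-execute-validate-retry cycle of those transactional schemes can be sketched with a version-numbered cell: each "transaction" computes its result speculatively, commits only if no conflicting commit happened in the meantime, and otherwise discards its buffered work and re-executes. The class and function names here are illustrative, and real transactional hardware tracks read and write sets rather than a single version counter:

```python
import threading

class Versioned:
    # A shared cell whose version counter lets optimistic readers
    # detect that another thread committed first.
    def __init__(self, value):
        self.value, self.version = value, 0
        self.lock = threading.Lock()

def atomically(cell, update):
    while True:
        seen_version, seen_value = cell.version, cell.value
        new_value = update(seen_value)          # speculative, buffered work
        with cell.lock:
            if cell.version == seen_version:    # no conflict: commit
                cell.value = new_value
                cell.version += 1
                return
        # Conflict detected: discard the buffered result and re-execute,
        # which serializes the conflicting "transactions" dynamically.

cell = Versioned(0)

def run():
    for _ in range(1000):
        atomically(cell, lambda v: v + 1)

threads = [threading.Thread(target=run) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert cell.value == 4000
```

As in the hardware schemes, the programmer only marks work as potentially parallel; conflicts are discovered and resolved at runtime rather than reasoned about in advance.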
Such hardware, which is only practical on tightly coupled
parallel machines such as CMPs, eliminates the need for programmers to determine whether threads are parallel
as they parallelize their programs—they need only choose potentially parallel threads. Overall, the shift from conventional processors to CMPs should be less traumatic for programmers than the shift from conventional processors
to multichip multiprocessors, because of the short CMP communication latencies and enhancements such as transactional memory, which should be commercially available within the next few years. As a result, this paradigm
shift should be within the range of what is feasible for “typical” programmers, instead of being limited to graduate-level computer science topics.
HARDWARE ADVANTAGES
In addition to the software advantages now and in the future, CMPs have major advantages over conventional uniprocessors for hardware designers. CMPs require only a fairly modest engineering effort for each generation of processors. Each member of a family of processors requires only the stamping down of additional copies of the core processor and some modifications to the relatively slow logic connecting the processors together to accommodate the additional processors in each generation—and not a complete redesign of the high-speed processor core logic. Moreover, the system board design typically needs only minor tweaks from generation to generation, since externally a CMP looks essentially the same from generation to generation, even as the number of processors within it increases.
The only real difference is that the board will need to deal with higher I/O bandwidth requirements as the CMPs scale. Over several silicon process generations, the savings in engineering costs can be significant, because it is relatively easy to stamp down a few more cores each time. Also, the same engineering effort can be amortized across a large family of related processors. Simply varying
the numbers and clock frequencies of processors can allow essentially the same hardware to function at many different price/performance points.
AN INEVITABLE TRANSITION
As a result of these trends, we are at a point where chip multiprocessors are making significant inroads into the marketplace. Throughput computing is the first and most pressing area where CMPs are having an impact. This is because they can improve power/performance results right out of the box, without any software changes, thanks to the large numbers of independent threads that are available in these already multithreaded applications. In the near future, CMPs should also have an impact in the more common area of latency-critical computations. Although it is necessary to parallelize most latency-critical
software into multiple parallel threads of execution to really take advantage of a chip multiprocessor, CMPs make this process easier than with conventional multiprocessors,
because of their short interprocessor communication
latencies.
Viewed another way, the transition to CMPs is inevitable
because past efforts to speed up processor architectures
with techniques that do not modify the basic von Neumann computing model, such as pipelining and superscalar issue, are encountering hard limits. As a result, the microprocessor industry is leading the way to multicore architectures; however, the full benefit of these architectures will not be harnessed until the software industry fully embraces parallel programming. The art of multiprocessor programming, currently mastered by only a small minority of programmers, is more complex than programming uniprocessor machines and requires an understanding of new computational principles, algorithms,
and programming tools. Q
REFERENCES
1. Moore, G. E. 1965. Cramming more components onto integrated circuits. Electronics (April): 114–117.
2. Hennessy, J. L., and Patterson, D. A. 2003. Computer Architecture: A Quantitative Approach, 3rd Edition. San Francisco, CA: Morgan Kaufmann Publishers.
3. Wall, D. W. 1993. Limits of Instruction-Level Parallelism. WRL Research Report 93/6, Digital Western Research Laboratory, Palo Alto, CA.
4. Barroso, L., Dean, J., and Hölzle, U. 2003. Web search for a planet: the architecture of the Google cluster. IEEE Micro 23 (2): 22–28.
5. Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K., and Chang, K. 1996. The case for a single-chip multiprocessor. Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII): 2–11.
6. Kapil, S. 2003. UltraSPARC Gemini: dual CPU processor. In Hot Chips 15 (August), Stanford, CA; http://www.hotchips.org/archives/.
7. Maruyama, T. 2003. SPARC64 VI: Fujitsu's next-generation processor. In Microprocessor Forum (October), San Jose, CA.
8. McNairy, C., and Bhatia, R. 2004. Montecito: the next product in the Itanium processor family. In Hot Chips 16 (August), Stanford, CA; http://www.hotchips.org/archives/.
9. Moore, C. 2000. POWER4 system microarchitecture. In Microprocessor Forum (October), San Jose, CA.
10. Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., Qadeer, S., Sano, B., Smith, S., Stets, R., and Verghese, B. 2000. Piranha: a scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th International Symposium on Computer Architecture (June): 282–293.
11. Kongetira, P., Aingaran, K., and Olukotun, K. 2005. Niagara: a 32-way multithreaded SPARC processor. IEEE Micro 25 (2): 21–29.
12. Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B. 1990. The Tera computer system. In Proceedings of the 1990 International Conference on Supercomputing (June): 1–6.
13. Laudon, J., Gupta, A., and Horowitz, M. 1994. Interleaving: a multithreading technique targeting multiprocessors and workstations. Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems: 308–316.
14. Tullsen, D. M., Eggers, S. J., and Levy, H. M. 1995. Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd International Symposium on Computer Architecture (June): 392–403.
15. Hammond, L., Carlstrom, B. D., Wong, V., Chen, M., Kozyrakis, C., and Olukotun, K. 2004. Transactional coherence and consistency: simplifying parallel hardware and software. IEEE Micro 24 (6): 92–103.
16. Hammond, L., Hubbert, B., Siu, M., Prabhu, M., Chen, M., and Olukotun, K. 2000. The Stanford Hydra CMP. IEEE Micro 20 (2): 71–84.
17. Krishnan, V., and Torrellas, J. 1999. A chip multiprocessor architecture with speculative multithreading. IEEE Transactions on Computers 48 (9): 866–880.
18. Sohi, G., Breach, S., and Vijaykumar, T. 1995. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture (June): 414–425.
19. Steffan, J. G., and Mowry, T. 1998. The potential for using thread-level data speculation to facilitate automatic parallelization. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture (February): 2–13.
KUNLE OLUKOTUN is an associate professor of electrical engineering and computer science at Stanford University, where he led the Stanford Hydra single-chip multiprocessor research project, which pioneered multiple processors on a single silicon chip. He founded Afara Websystems to develop commercial server systems with chip multiprocessor technology. Afara was acquired by Sun Microsystems, and the Afara microprocessor technology is now called Niagara. Olukotun is involved in research in computer architecture, parallel programming environments, and scalable parallel systems.
LANCE HAMMOND is a postdoctoral fellow at Stanford University. As a Ph.D. student, Hammond was the lead architect and implementer of the Hydra chip multiprocessor. The goal of Hammond's recent work on transactional coherence and consistency is to make parallel programming accessible to the average programmer.
© 2005 ACM 1542-7730/05/0900 $5.00