Search Details｜The University of Electro-Communications

HIROKI HONDA

Department of Computer and Network Engineering	Professor
Cluster I (Informatics and Computer Engineering)	Professor

Researcher Information

Degree

Master of Engineering, Waseda University

Doctor of Engineering, Waseda University

Research Keyword

High-performance computing, parallel processing, parallelizing compilers, GPU computing

Field Of Study

Informatics, High-performance computing

Informatics, Software

Informatics, Computer systems

Educational Background

Mar. 1991
Waseda University, Graduate School, Division of Science and Engineering, Major in Electrical Engineering

Mar. 1986
Waseda University, Graduate School, Division of Science and Engineering, Major in Electrical Engineering

Mar. 1984
Waseda University, Faculty of Science and Engineering, Department of Electrical Engineering

Mar. 1980
Waseda University Senior High School

Member History

01 Apr. 2007 - 31 Mar. 2011
Steering Committee Member, IPSJ Special Interest Group on High Performance Computing

Jan. 2009 - 31 Dec. 2010
Computer Society Japan Chapter Chair, IEEE, Society

01 Apr. 2006 - 31 Mar. 2010
Steering Committee Member, IPSJ Special Interest Group on Computer Architecture

2007 - 2009
Tokyo Section Councilor, Institute of Electronics, Information and Communication Engineers (IEICE), Society

Jan. 2008 - Dec. 2008
Computer Society Japan Chapter Secretary, IEEE, Society

01 Apr. 2003 - 31 Mar. 2007
Secretary, IPSJ Special Interest Group on High Performance Computing

01 Apr. 2000 - 31 Mar. 2004
Steering Committee Member, IPSJ Special Interest Group on Computer Architecture

1996 - 1999
Member, Journal Editorial Committee, Information Processing Society of Japan (IPSJ), Society

Research Activity Information

Award

Jun. 2003
FY2003 IPSJ Yamashita SIG Research Award

Paper

CNFET-OCL: Open-Source Cell Libraries for Advanced CNFET Technologies.
Chenlin Shi; Shinobu Miwa; Tongxin Yang; Ryota Shioya; Hayato Yamaki; Hiroki Honda
IEEE Access, 12, 165335-165347, 2024, Peer-reviewed
Scientific journal

Multi-Level Packet Processing Caches
K. Tanaka; H. Yamaki; S. Miwa; H. Honda
The 2019 IEEE Symposium on Low-Power and High-Speed Chips and Systems (COOL Chips 22), 1-3, 2019, Peer-reviewed
International conference proceedings, English

Data Prediction for Response Flows in Packet Processing Cache
H. Yamaki; H. Nishi; S. Miwa; H. Honda
Proc. of the 55th Annual Design Automation Conference (DAC), ACM, DAC'18, 110, 1-6, 27 Jun. 2018, Peer-reviewed
International conference proceedings, English

Optimizing Memory Hierarchy within an Internet Router for High-Throughput and Energy-Efficient Packet Processing
K. Tanaka; H. Yamaki; S. Miwa; H. Honda
ACM Student Research Competition (in conjunction with the 51st Annual ACM/IEEE International Symposium on Microarchitecture) (poster presentation), poster, 2018, Peer-reviewed
International conference proceedings, English

Initial Study of Reconfigurable Neural Network Accelerators
Momoka Ohba; Satoshi Shindo; Shinobu Miwa; Tomoaki Tsumura; Hayato Yamaki; Hiroki Honda
2016 FOURTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING (CANDAR), IEEE, 707-709, 2016, Peer-reviewed, Neural Networks or NNs are widely used for many machine learning applications such as image processing and speech recognition. Since general-purpose processors such as CPUs and GPUs are energy inefficient for computing NNs, application-specific hardware accelerators for NNs (a.k.a. Neural Network Accelerators or NNAs) have been proposed to improve the energy efficiency. However, the existing NNAs are too customized for computing specific NNs, and do not allow users to change neuron models or learning algorithms. This limitation prevents machine-learning researchers from exploiting NNAs, so we are developing a general-purpose NNA including reconfigurable logic, which is called a reconfigurable NNA or RNNA. The RNNA is highly tuned for the NN computation but allows end users to customize the hardware to compute desired NNs. This paper introduces the RNNA architecture, and reports the performance analysis of the RNNA with an in-house cycle-level simulator.
International conference proceedings, English

Energy-Saving Task Scheduling for Manycore Processors with Coarse-Grain Voltage Domains
Yasutaka Wada; Masaaki Kondo; Hiroki Honda
IPSJ Journal, Information Processing Society of Japan, 8, 1, 34-50, 26 Mar. 2015, Peer-reviewed, Power/Energy efficiency is one of the most important issues for today's computer systems. To improve the efficiency, DVFS (Dynamic Voltage/Frequency Scaling) and PG (Power Gating) can be utilized on many current multicore/manycore processors. However, applying them to parallel applications imposes large burdens upon the programmers. In addition, it would become difficult to utilize per-core DVFS on a manycore processor because of the increasing number of cores, and DVFS can lose its effectiveness. This paper proposes an energy-aware task scheduling algorithm for a manycore processor with coarse grain voltage domains. The proposed scheme assigns tasks in an input parallel application considering performance and power efficiency of each core and DVFS efficiency of each voltage domain. This makes it possible to improve DVFS efficiency on a manycore processor with coarse grain voltage domains. The proposed scheme gives us better energy reduction with a heterogeneous manycore in comparison with the case that a conventional scheduling method is applied on a per-core DVFS enabled manycore.
Scientific journal, Japanese

Memory Hotplug for Energy Savings of HPC systems
S. Miwa; H. Honda
The International Conference for High Performance Computing, Networking, Storage and Analysis, poster, 2015, Peer-reviewed
International conference proceedings, English

OMPCUDA: OpenMP execution framework for CUDA based on omni OpenMP compiler
Ohshima, S.; Hirasawa, S.; Honda, H.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6132 LNCS, 161-173, 2010, Peer-reviewed
Scientific journal

A Disk Write Processing Method for Carrier-Grade OS
Takashi Ikebe; Naoki Uchida; Shoichi Hirasawa; Hiroki Honda
IPSJ Journal, Information Processing Society of Japan, 50, 2, 691-700, Feb. 2009, Peer-reviewed
Scientific journal, Japanese

Aspects of GPU for General Purpose High Performance Computing
Reiji Suda; Takayuki Aoki; Shoichi Hirasawa; Akira Nukada; Hiroki Honda; Satoshi Matsuoka
PROCEEDINGS OF THE ASP-DAC 2009: ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE 2009, IEEE, 216-+, 2009, We discuss hardware and software aspects of GPGPU, specifically focusing on NVIDIA cards and CUDA, from the viewpoints of parallel computing. The major weak points of GPU against newest supercomputers are identified to be and summarized as only four points: large SIMD vector length, small memory, absence of fast L2 cache, and high register spill penalty. As software concerns, we derive optimal scheduling algorithm for latency hiding of host-device data transfer, and discuss SPMD parallelism on GPUs.
International conference proceedings, English

Toward a Portable Programming Environment for Distributed High Performance Accelerators
Shoichi Hirasawa; Hiroki Honda
FIRST INTERNATIONAL WORKSHOP ON SOFTWARE TECHNOLOGIES FOR FUTURE DEPENDABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, IEEE COMPUTER SOC, 189-194, 2009, Accelerators with little power consumption per computation performance are beginning to widely spread for High Performance Computing use, instead of general-purpose CPUs with much power consumption. They are GPUs, processors of Cell architecture, and FPGA accelerators. While these processors have much higher computation performance than general-purpose CPUs, they need specific programming environment respectively when using them as distributed memory accelerators. We discuss a portable programming environment which can be used in common with distributed memory accelerators in this paper.
International conference proceedings, English

F-Omega: A Framework for Automatic Server Switching of Grid Applications
Hiromasa Watanabe; Shoichi Hirasawa; Hiroki Honda
IPSJ Transactions on Advanced Computing Systems (ACS), Information Processing Society of Japan, 1, 3, 54-66, Dec. 2008, Peer-reviewed, Running a grid application for a long time on a grid whose server performance varies requires appropriate dynamic server switching. Switching servers in response to changes in server availability, whether caused by operational schedules or by failures, has conventionally burdened users with several laborious tasks: monitoring the operation schedules of many servers, making the management of the server usage environment support dynamic switching, monitoring server usage, selecting an alternative server and instructing the application to switch to it, and developing grid applications that can change servers. In this paper, we propose F-Omega, a framework for automatic server switching. F-Omega newly implements automatic monitoring of server operation schedules, management of the server usage environment that supports dynamic server switching, automatic monitoring of server usage, automatic selection of an alternative server with automatic switching instructions, and support for developing grid applications that change servers dynamically, and it provides users with an automatic server switching system that integrates these functions with other existing ones. In an application experiment, the operational status of twelve servers changed 29 times in total over 2.8 days, and we confirmed that the application's servers were switched automatically and appropriately while the computation continued.
Scientific journal, Japanese

Unified programming environment for heterogeneous distributed parallel systems
Shoichi Hirasawa; Hiroki Honda
Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems, 24-31, 2008, Parallel execution environment, such as the multi-core CPU, a cluster, and a grid, has spread increasingly. The change from a homogeneous core based CPU and a shared memory to the distributed memory and the heterogeneous core based CPU is making system architecture complicated. The programming interface and programming model which are different in each parallel execution environment are used. Since this serves as a burden for users, it has barred the spread of parallel execution environment. In this paper, the execution model which treats such system architecture systematically is explored. This performs the unified programming interface for heterogeneous distributed memory system architecture. © 2008 IEEE.
International conference proceedings, English

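A Software Modification Method on General-Purpose Systems for Providing Network Services
Takashi Ikebe; Yasuro Kawarasaki; Naoki Uchida; Shoichi Hirasawa; Hiroki Honda
IEICE Transactions B (Japanese Edition), The Institute of Electronics, Information and Communication Engineers, J91-B, 1, 1-13, Jan. 2008, Peer-reviewed, In recent years, network services such as voice call services that are based on IP networks and use general-purpose systems, typified by COTS (commercial off-the-shelf) products, have been spreading rapidly. The dedicated systems used for voice call services such as ISDN and PSTN incorporate a variety of high-availability technologies, and services built on general-purpose systems do not yet reach the availability of services provided by dedicated systems. Meanwhile, many dedicated systems need to be replaced because of aging hardware and the difficulty of manufacturing maintenance parts, and replacement with general-purpose systems is desired. Replacing dedicated systems with general-purpose systems requires solving various problems. As one technique for providing services that demand high availability on general-purpose systems, this paper proposes a live-patch method that modifies the user processes and the kernel, the main software on the system, while they are running, thereby reducing the service downtime caused by software restarts. We also evaluate the impact of the proposed method on the real-time behavior of the service and show that it is applicable to voice call services.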
Scientific journal, English

A Common Description Method for SIMD Providing Code Performance Portability
中西悠; 渡辺啓正; 平澤将一; 本多弘樹
IPSJ Journal, Information Processing Society of Japan (IPSJ), 48, SIG13(ACS19), 95-105, Aug. 2007, Peer-reviewed, In recent years, many general-purpose processors have SIMD instruction set extensions for multimedia processing. By describing SIMD instructions explicitly using assembly language and intrinsic functions, a programmer can use SIMD instructions effectively. However, these description methods depend on a particular architecture, so they have a problem with code portability. In our study, we proposed a common description method for using SIMD instructions explicitly and developed a tool that converts the common description into source code that uses SIMD instructions effectively. Our method enables a programmer to use SIMD instructions explicitly while keeping code performance portability. We implemented some examples by our method and evaluated their performance and performance portability. As a result, we confirmed speedup and performance portability.
Scientific journal, Japanese

Common Description Method of SIMD Instructions for Providing Performance Portability
中西悠; 渡邊啓正; 平澤将一; 本多弘樹
IPSJ Symposium Series, 2007, 5, 11-18, 23 May 2007
Japanese

Runtime software modification method used on COTS system for high-availability network service
Takashi Ikebe; Yasuro Kawarasaki; Naoki Uchida; Shoichi Hirasawa; Hiroki Honda
NEW TECHNOLOGIES, MOBILITY AND SECURITY, SPRINGER, 229-+, 2007, Peer-reviewed, Generally, providing high-availability services such as a network service with COTS (commercial off-the-shelf) hardware and OS is very difficult because network services require frequent software modifications. Existing network service systems achieve high availability by using specialized systems. We present a live-patch method that enables online software modification without disrupting service on a COTS system. The live-patch method modifies user software and kernel software without rebooting by changing execution of the function to modified function. The evaluation shows the adaptability of the presented implementation as COTS systems on Linux and x86 CPU SMP machines.
International conference proceedings, English

Reduction of synchronous write response time on call control server
Takashi Ikebe; Naoki Uchida; Shoichi Hirasawa; Hiroki Honda
2007 AUSTRALASIAN TELECOMMUNICATION NETWORKS AND APPLICATIONS CONFERENCE, IEEE, 92-+, 2007, Peer-reviewed, Today, call control servers use COTS (commercial off-the-shelf) components such as Advanced Telecom Computing Architecture (ATCA) and Carrier Grade Linux. However, the adoption of COTS components causes an increase in the maximum response time of high-priority write system calls during synchronous state transitions and recording whenever I/O operation is congested by normal-priority I/O requests. Call control is a real-time operation; the delay of state transition makes service quality worse and sometimes causes fail-over of service. The delays may occur due to failure of software and hardware. However, even if there are no failures, the response time of high-priority write system calls is sporadically delayed due to insufficient consideration of I/O prioritization by the OS. We present reduction of bottlenecks of the file system operation and I/O scheduler mechanism in the kernel to reduce the maximum response time of high-priority write system calls. The evaluation demonstrates that the presented approach shortens the maximum response time of high-priority write system calls with sufficient reliability.
International conference proceedings, English

F-Omega: A framework for steering GridRPC applications
Hiromasa Watanabe; Shoichi Hirasawa; Hiroki Honda
E-SCIENCE 2007: THIRD IEEE INTERNATIONAL CONFERENCE ON E-SCIENCE AND GRID COMPUTING, PROCEEDINGS, IEEE COMPUTER SOC, 475-482, 2007, Peer-reviewed, Steering grid applications is needed so that they can run several days or several weeks without restarting their computation. Existing grid middleware, such as GridRPC middleware, have room for improvement in point of steering grid applications. For instance, to manage GridRPC-related resources remains as a complicated task for programmers. And, to monitor constraint information about future server usage is a laborious task for application users. GridRPC middleware has to assist in these tasks. In this paper we propose a framework F-Omega for these tasks. F-Omega provides programmers common application modules for automatic management of GridRPC-related resources. For users, F-Omega provides an automatic visualization system for constraint information about future server usage. Experimental results show that the proposed features of F-Omega mitigate programmers' and users' burdens of the above-mentioned tasks.
International conference proceedings, English

A Development Environment for High-Performance GridRPC Applications
小林孝嗣; 渡邊啓正; 本多弘樹
IPSJ Journal, 47, SIG12(ACS15), 218-228, Sep. 2006, Peer-reviewed
Scientific journal, Japanese

ABCLib_DRSSED: A parallel eigensolver with an auto-tuning facility
T Katagiri; K Kise; H Honda; T Yuba
PARALLEL COMPUTING, ELSEVIER SCIENCE BV, 32, 3, 231-250, Mar. 2006, Peer-reviewed, Conventional auto-tuning numerical software has the drawbacks of (1) fixed sampling points for the performance estimation, (2) inadequate adaptation to heterogeneous environments. To solve these drawbacks, we developed ABCLib_DRSSED, which is a parallel eigensolver with an auto-tuning facility. ABCLib_DRSSED has (1) functions based on the sampling points which are constructed with an end-user interface; (2) a load-balancer for the data to be distributed; (3) a new auto-tuning optimization timing called Before Execute-time Optimization (BEO). In our performance evaluation of the BEO, we obtained speedup factors from 10% to 90%, and 340% in the case of a failed estimation. In the evaluation of the load-balancer, the performance was 220% improved. (c) 2005 Elsevier B.V. All rights reserved.
Scientific journal, English

Development and implementation of an interactive parallelization assistance tool for OpenMP: iPat/OMP
M Ishihara; H Honda; M Sato
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG, E89D, 2, 399-407, Feb. 2006, Peer-reviewed, iPat/OMP is an interactive parallelization assistance tool for OpenMP. In the present paper, we describe the design concept of iPat/OMP, the parallelization sequence achieved by the tool and its current implementation status. In addition, we present an evaluation of the performance of the implemented functionalities. The experimental results show that iPat/OMP can detect parallelism and create an appropriate OpenMP directive for several for-loops.
Scientific journal, English

Evaluation of the acknowledgment reduction in a software-DSM system
Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba
PARALLEL PROCESSING AND APPLIED MATHEMATICS, SPRINGER-VERLAG BERLIN, 3911, 17-25, 2006, Peer-reviewed, We discuss the inter-process communication in software distributed shared memory (S-DSM) systems. Some S-DSM systems, such as TreadMarks and JIAJIA, adopt the user datagram protocol (UDP) which does not provide the reliable communication between the computation nodes. To detect a communication error and recover from it, therefore, an acknowledgment is used for every message transmission in the middleware layer. In this paper, first, we show that an acknowledgment is not necessarily required for each message transmission in the middleware layer. Second, a method to reduce the acknowledgment overhead for a page request is proposed. We implemented the proposed method in our S-DSM system Mocha. The performance of the method was measured with several benchmark programs on both a PC cluster and an SMP cluster.
Scientific journal, English

Macro-dataflow using software distributed shared memory
Hiroshi Tanabe; Hiroki Honda; Toshitsugu Yuba
2005 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), IEEE, 441-+, 2006, Peer-reviewed, Macro-dataflow processing, which exploits the parallelism among coarse-grain tasks (macrotasks) such as loops and subroutines, is considered promising to break the performance limits of loop parallelism. To realize macro-dataflow processing on distributed memory systems, "data reaching conditions," a method to make the sender-receiver pair of a data transfer determined at runtime, has previously been proposed. However, irregular data accesses induce extra data transfers, which lead to performance deterioration. This paper proposes an implementation method using software distributed shared memory, which enables on-demand data fetching. This paper describes the implementation using two well-accepted, page-based Software Distributed Shared Memory systems, TreadMarks and JIAJIA. Evaluation results on a PC cluster show the software distributed memory approach is as much as 25% faster than the data reaching conditions.
International conference proceedings, English

ABCLibScript: a directive to support specification of an auto-tuning facility for numerical software
T Katagiri; K Kise; H Honda; T Yuba
PARALLEL COMPUTING, ELSEVIER SCIENCE BV, 32, 1, 92-112, Jan. 2006, Peer-reviewed, We describe the design and implementation of ABCLibScript, which is a directive that supports the addition of an auto-tuning facility. ABCLibScript limits the function of auto-tuning to numerical computations. For example, the block length adjustment for blocked algorithms, loop unrolling depth adjustment and algorithm selection are crucial functions. To establish these three particular functions, we make three kinds of instruction operators, variable, unroll, and select, respectively. As a result of performance evaluation, we showed that a non-expert user obtained a maximum speedup of 4.3 times by applying ABCLibScript to a program compared to a program without ABCLibScript. (c) 2005 Elsevier B.V. All rights reserved.
Scientific journal, English

A time-to-live based reservation algorithm on fully decentralized resource discovery in Grid computing
S Tangpongprasit; T Katagiri; K Kise; H Honda; T Yuba
PARALLEL COMPUTING, ELSEVIER SCIENCE BV, 31, 6, 529-543, Jun. 2005, Peer-reviewed, We present an alternative algorithm of fully decentralized resource discovery in Grid computing, which enables the sharing, selection, and aggregation of a wide variety of geographically distributed computational resources. Our algorithm is based on a simple unicast request transmission that can be easily implemented. The addition of a reservation algorithm enables the resource discovery mechanism to find more available matching resources. The deadline for resource discovery time is decided with a time-to-live value. With our algorithm, only one resource is automatically decided for any request if multiple available resources are found on the forward path of resource discovery, resulting in no need to ask the user to manually select the resource from a large list of available matching resources. We evaluated the performance of our algorithms by comparing with the first-found-first-served algorithm. The experiment results show that the percentages of requests that can be supported by both algorithms are not different. However, it can improve the performance of either resource utilization or turnaround time, depending on how the resource is selected. The algorithm that finds the available matching resource whose attributes are closest to the required attribute can improve the resource utilization, whereas another one that finds the available matching resource which has the highest performance can improve the turnaround time. However, it is found that the performance of our algorithm relies on the density of resources in the network. Our algorithm seems to perform well only in an environment with enough resources, compared with the density of requests in the network. (c) 2005 Elsevier B.V. All rights reserved.
Scientific journal, English

Macro-Dataflow Processing Using Software Distributed Shared Memory
Hiroshi Tanabe; Hiroki Honda; Toshitsugu Yuba
IPSJ Journal, 46, SIG4(ACS9), 56-68, Mar. 2005, Peer-reviewed
Scientific journal, Japanese

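Design and Implementation of the SimCore/Alpha Functional Simulator
Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba
IEICE Transactions (Japanese Edition), The Institute of Electronics, Information and Communication Engineers, J88-D-I, 2, 143-154, Feb. 2005, Peer-reviewed, This paper describes the design and implementation of SimCore/Alpha Functional Simulator Version 2.0 (SimCore Version 2.0), a functional-level processor simulator. The main features of SimCore Version 2.0 are as follows: (1) it provides rich functionality as a functional-level simulator; (2) it is written in C++ and realized as a compact implementation of 2,800 lines; (3) the program loader function is separated; (4) global variables are eliminated to improve readability and functionality; (5) it provides a verification function; (6) it supports many platforms; (7) it achieves a 19% speedup over sim-fast of the SimpleScalar tool set, which provides similar functionality.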
Scientific journal, Japanese

The bimode++ branch predictor
Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba
Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems, 2005, 19-26, 2005, Modern wide-issue superscalar processors tend to adopt deeper pipelines in order to attain high clock rates. This trend increases the number of on-the-fly instructions in processors and a mispredicted branch can result in substantial amounts of wasted work. In order to mitigate these wasted works, an accurate branch prediction is required for the high performance processors. In order to improve the prediction accuracy, we propose the bimode++ branch predictor. It is an enhanced version of the bimode branch predictor. Throughout execution from the start to the end of a program, some branch instructions have the same result at all times. These branches are defined as extremely biased branches. The bimode++ branch predictor is unique in predicting the output of an extremely biased branch with a simple hardware structure. In addition, the bimode++ branch predictor improves the accuracy using the refined indexing and a fusion function. Our experimental results with benchmarks from SpecFP, SpecINT, multi-media and server area show that the bimode++ branch predictor can reduce the mispredict rate by 13.2% to the bimode and by 32.5% to the gshare. © 2005 IEEE.
International conference proceedings, English

Development and implementation of an interactive parallelization assistance tool for OpenMP: iPat/OMP
Makoto Ishihara; Hiroki Honda; Mitsuhisa Sato
Proc. of The 4th International Workshop on OpenMP: Experiences and Implementations, 1-1, Jan. 2005, Peer-reviewed
International conference proceedings, English

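Solving the N-Queens Problem Using a PC Cluster
Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba
IEICE Transactions (Japanese Edition), The Institute of Electronics, Information and Communication Engineers, J87-D-I, 12, 1145-1148, Dec. 2004, Peer-reviewed, The N-queens problem asks for the total number of ways to place N queens on an N×N board so that no two queens attack each other. We wrote a program that runs 11% to 18% faster as a sequential program than an existing program. We also parallelized it with MPI and, by computing on a PC cluster, obtained the solution for 24-queens for the first time in the world. The main findings from this experience are as follows: (1) optimizing memory references, control structures, and so on yields a speedup of 11% to 18% over Somers's program; (2) the master-worker scheme is effective for parallelization; (3) using the Hyper-Threading of the Pentium 4 yields a 30% speedup over not using it; (4) solving real problems requires care so that the system as a whole runs fast.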
Scientific journal, Japanese

Relis-G: A Remote Library Installation Mechanism for Computational Grids
Hiromasa Watanabe; Hiroki Honda; Toshitsugu Yuba; Yoshio Tanaka; Mitsuhisa Sato
IPSJ Journal, 45, SIG11(ACS7), 196-206, Oct. 2004, Peer-reviewed
Scientific journal, Japanese

Parallel Gram-Schmidt Re-orthogonalization with Data Redistribution
Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba
IPSJ Journal, Information Processing Society of Japan (IPSJ), 45, SIG 6 (ACS 6), 75-85, May 2004, Peer-reviewed, In this paper, we improve a parallel re-orthogonalization with Gram-Schmidt (G-S) method. A data re-distribution approach is used to solve the low parallelism problem for column-wise distribution (CWD) in the method. The proposed method is implemented to the inverse iteration method for computing eigenvectors. The inverse iteration method using the proposed method is evaluated with a PC cluster and three kinds of super-computers in Japan, which are the HITACHI SR8000/MPP, Fujitsu VPP800/63, and NEC SX-5/128M8. The results of performance evaluation indicated that the maximum speedup factor of 4 in Modified G-S method, and of 3.5 in Classical G-S method with respect to conventional methods using CWD were obtained.
Scientific journal, Japanese

A super instruction-flow architecture for high performance and low power processors
Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba
Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems, 10-19, 2004, Microprocessor performance has improved at about 55% per year for the past three decades. To maintain this performance growth rate, next generation processors must achieve higher levels of instruction level parallelism. However, it is known that a conditional branch poses serious performance problems in modern processors. In addition, as an instruction pipeline becomes deep and the issue width becomes wide, this problem becomes worse. The goal of this study is to develop a novel processor architecture which mitigates the performance degradation caused by branch instructions. In order to solve this problem, we propose a super instruction-flow architecture. The concept of the architecture is described. This architecture has a mechanism which processes multiple instruction-flows efficiently and tries to mitigate the performance degradation. Preliminary evaluation results with small benchmark programs show that the first generation super instruction-flow processor efficiently mitigates branch overhead. © 2004 IEEE.
International conference proceedings, English

Effect of auto-tuning with user's knowledge for numerical software
Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba
2004 Computing Frontiers Conference, Association for Computing Machinery, 12-25, 2004, This paper evaluates the effect of an auto-tuning facility with the user's knowledge for numerical software. We proposed a new software architecture framework, named FIBER, to generalize auto-tuning facilities and obtain highly accurate estimated parameters. The FIBER framework also provides a loop-unrolling function and an algorithm selection function to support code development by library developers needing code generation and parameter registration processes. FIBER offers three kinds of parameter optimization layers - install-time, before execute-time, and run-time. The user's knowledge is needed in the before execute-time optimization layer. In this paper, eigensolver parameters that apply the FIBER framework are described and evaluated in three kinds of parallel computers: the HITACHI SR8000/MPP, Fujitsu VPP800/63, and Pentium4 PC cluster. Our evaluation of the application of the before execute-time layer indicated a maximum speed increase of 3.4 times for eigensolver parameters, and a maximum increase of 17.1 times for the algorithm selection of orthogonalization in the computation kernel of the eigensolver.
International conference proceedings, English

Proposal of new RSVP which considers maximum waiting time for bandwidth reservation
T Ikebe; H Honda; T Yuba
ELECTRONICS AND COMMUNICATIONS IN JAPAN PART I-COMMUNICATIONS, SCRIPTA TECHNICA-JOHN WILEY & SONS, 87, 8, 71-81, 2004, Peer-reviewed, A conventional IP network is a connectionless, best effort network, but has difficulty guaranteeing the Quality of Service (QoS). To guarantee end-to-end QoS, the Internet Engineering Task Force (IETF) has proposed bandwidth reservation protocols such as the Resource ReSerVation Protocol (RSVP), but a protocol that guarantees all bandwidth reservation requests has not been realized. To address this, with the objective of implementing bandwidth reservation capable of reliably accepting the bandwidth reservation requests regardless of the size of the requested bandwidth, the authors proposed Waiting Bandwidth Reservation Communication, which is an extension of RSVP. By using this method, we verified that excellent results compared to conventional RSVP are obtained when all bandwidth reservation communications pass over a special path and the bandwidth reservation requests exceed the bandwidth available for reservation along that path. On the other hand, in a wide-area network such as the Internet, the problems are that all of the bandwidth reservation communications are not along special paths and unnecessary waits are generated when waiting for bandwidth reservations. With the objective of solving these problems, in this paper, we propose RSVP that considers the waiting time until the start of bandwidth reservation and demonstrate through simulations how the proposed method solves these problems. (C) 2004 Wiley Periodicals, Inc.
Scientific journal, English

Relis-G: Remote library install system for computational grids
H Watanabe; H Honda; T Yuba; Y Tanaka; M Sato
SEVENTH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND GRID IN ASIA PACIFIC REGION, PROCEEDINGS, IEEE COMPUTER SOC, 44-53, 2004, Peer-reviewed, When a grid user installs a library used by GridRPC and similar systems on servers distributed over a grid, the user must deal with complicated and laborious problems, such as entering commands for remote operation, creating an installation package of the library for heterogeneous server systems, avoiding redundant compilation, and observing the administration policy of each server. To let the user solve these problems easily, we propose the remote library install system Relis-G, which is highly portable and multipurpose in the grid. Besides automatic remote library installation, the system offers functions such as automatic creation of an installation package, automatic avoidance of redundant compilation, and automatic observance of the administration policy of each server. Moreover, operation tests confirmed that the system greatly reduces the user burden described above.
International conference proceedings, English

Interactive Parallelizing Assistance Tool for OpenMP iPat/OMP
Makoto Ishihara; Hiroki Honda; Toshitsugu Yuba; Mitsuhisa Sato
Proc. of Fifth European Workshop on OpenMP, 2-3, Sep. 2003, Peer-reviewed
International conference proceedings, English

Design and Preliminary Implementation of a Particle Simulation Machine for Efficient Short-range Interaction Computations
Ryo Takata; Kenji Kise; Hiroki Honda; Toshitsugu Yuba
IPSJ Trans. on Advanced Computing Systems, 44, SIG 6 (ACS1), 96-113, May 2003, Peer-reviewed
Scientific journal, English

SimAlpha version 1.0: Simple and readable alpha processor simulator
K Kise; H Honda; T Yuba
ADVANCES IN COMPUTER SYSTEMS ARCHITECTURE, SPRINGER-VERLAG BERLIN, 2823, 122-136, 2003, Peer-reviewed, We have developed a processor simulator SimAlpha Version 1.0 for research and education activities. Its design policy is to keep the source code readable (enjoyable and easy to read) and simple. SimAlpha is written in C++ and the source code consists of only 2,800 lines. This paper describes the software architecture of SimAlpha by referring to its source code. To show an example of SimAlpha in practical use, we present the ideal instruction-level parallelism of SPEC CINT95 and CINT2000 benchmarks measured with a modified version of SimAlpha.
Scientific journal, English

Grid計算環境における予定ガント図を用いたジョブスケジューリング
上田清詩; 本多弘樹; 弓場敏嗣
情報処理学会論文誌, 44, SIG1(HPS6), pp.93-102, Jan. 2003, Peer-reviewed
Scientific journal, Japanese

ホームベースソフトウェア分散共有メモリ上でMigratory Accessを効率良く処理する権限委譲プロトコル
城田裕介; 吉瀬謙二; 本多弘樹; 弓場敏嗣
情報処理学会論文誌, 44, SIG1(HPS6), pp.103-112, Jan. 2003, Peer-reviewed
Scientific journal, Japanese

分散メモリシステム上でのマクロデータフロー処理のためのデータ到達条件
本多弘樹; 上田哲平; 深川保; 弓場敏嗣
情報処理学会論文誌, 43, SIG6(HPS5), pp.45-55, Sep. 2002, Peer-reviewed
Scientific journal, Japanese

帯域予約開始までの待ち時間を考慮したRSVPの提案
池辺隆; 本多弘樹; 弓場敏嗣
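電子情報通信学会論文誌, The Institute of Electronics, Information and Communication Engineers, J85-B, 8, pp.1172-1181, Aug. 2002, Peer-reviewed, Conventional IP networks are connectionless, best-effort networks, which has made it difficult to guarantee QoS. To guarantee end-to-end QoS, the IETF has proposed bandwidth reservation protocols such as RSVP, but no protocol that guarantees every requested bandwidth reservation has been realized. To address this, the authors previously proposed the "Waiting Bandwidth Reservation Communication" method, an extension of RSVP, with the aim of realizing a bandwidth reservation scheme that reliably accepts reservation requests regardless of the amount of bandwidth requested. That method was shown to give better results than conventional RSVP when all bandwidth reservation communications pass through a specific path and the requests exceed the bandwidth that can be reserved on that path. In wide-area networks such as the Internet, however, not all bandwidth reservation communications always pass over specific paths, and the method generates unnecessary waits. To solve this problem, this paper proposes "RSVP that considers the waiting time until the start of bandwidth reservation" and shows through simulation that the proposed method resolves the problem.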
Scientific journal, Japanese
URL

DEM-1: A particle simulation machine for efficient short-range interaction computations
R. Takata; K. Kise; H. Honda; T. Yuba
Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2002, Institute of Electrical and Electronics Engineers Inc., 2002, IPDPS, 610-617, 2002, Peer-reviewed, This paper describes the architecture of a high performance, particle simulation machine, DEM-1 for short-range particle interaction computations. All existing particle simulation machines have specialized pipelines to calculate long-range particle interactions effectively. However, their ability to perform particle simulations efficiently diminishes with short-range interactions. The communication cost component of particle simulations will play a significant role in performance when the computation cost becomes O(N). DEM-1's three dimensional torus high-speed network reduces the communication cost while 2048 local processors perform the time integration. The elimination of pipeline bubbles in DEM-1 is achieved by specially designed cut off judgment units. Each specialized pipeline consists of dedicated data path supported by position vector prefetch dual ported memory. The performance of DEM-1 is presented with very large-scale Embedded Atom Method (EAM) molecular dynamics simulations.
International conference proceedings, English
DOI URL

Distributed shared memory with log based consistency for operations with commutative law or associative law
Hideaki Hirayama; Hiroki Honda; Toshitsugu Yuba
Systems and Computers in Japan, 32, 8, 10-19, 2001, Peer-reviewed, With progress in high-performance networks, cluster systems with high-performance and inexpensive work stations or personal computers are attracting much attention. For the cluster systems to become popular, it is necessary that it be easy to develop programs on them. Distributed Shared Memory is the key to achieving this objective. But Distributed Shared Memory cannot achieve high performance for all application programs. In particular, programs which frequently modify the same fields by multiple nodes cannot attain high performance with Distributed Shared Memory. In this paper, we propose Distributed Shared Memory with Log Based Consistency for such application programs, such as aggregation applications in the business application area. In this scheme, consistency is maintained by transferring logs among multiple nodes for operations with the commutative law or associative law. Distributed Shared Memory with Log Based Consistency can achieve much better performance than the traditional Distributed Shared Memory. Its performance is almost the same as that of SMP parallel computers.
Scientific journal, English
DOI URL

細粒度通信機構を持つ並列計算機EM-Xにおける共有メモリプログラムの効率的実行
坂根広史; 本多弘樹; 弓場敏嗣; 児玉祐悦; 山口喜教
情報処理学会論文誌, Information Processing Society of Japan (IPSJ), 41, SIG8(HPS2), 1-14, Nov. 2000, Peer-reviewed, In this paper, we discuss efficient parallel execution of shared memory programs on a physically distributed memory multiprocessor EM-X. The EM-X provides shared memory abstraction with a global address space and a remote memory access mechanism. For this approach, multithreading efficiently hides the latency caused by fine-grain communication, while the thread switching overhead still remains. To reduce the thread switching overhead and exploit locality of shared data, we have implemented a coherent local copy mechanism by software. Performance analyses show that a highly optimized implementation for a frequently shared access program greatly improves the performance, in spite of additional software overhead. We show that the tradeoffs between these two approaches provide a basis for the selection of a technique that is more appropriate for efficient executions of various applications on the EM-X.
Scientific journal, Japanese
URL

Efficient Execution Techniques of Shared Memory Programs on the EM-X Distributed Memory Multiprocessor
Hirofumi Sakane; Hiroki Honda; Toshitsugu Yuba; Yuetsu Kodama; Yoshinori Yamaguchi
695-704, Nov. 2000, Peer-reviewed
International conference proceedings, English

Distributed Shared Memory with Log Based Consistency for Operations with Commutative Law or Associative Law
Hideaki HIRAYAMA; Hiroki HONDA; Toshitsugu YUBA
電子情報通信学会論文誌, J83-D-I, 5, 449-458, May 2000, Peer-reviewed
Scientific journal, Japanese

Scalable data mining with log based consistency DSM for high performance distributed computing
H Hirayama; H Honda; T Yuba
SIXTH IEEE INTERNATIONAL CONFERENCE ON ENGINEERING OF COMPLEX COMPUTER SYSTEMS, PROCEEDINGS, IEEE COMPUTER SOC, 143-150, 2000, Peer-reviewed, Mining the large web based online distributed databases to discover new knowledge and financial gain is an important research problem. These computations require high performance distributed and parallel computing environments. Traditional data mining techniques such as classification, association, clustering can be extended to find new efficient solutions.
This paper presents the scalable data mining problem, proposes the use of software DSM (Distributed Shared Memory) with a new mechanism as an effective solution and discusses both the implementation and performance evaluation results.
It is observed that the overhead of a software DSM is very large for scalable data mining programs. A new Log Based Consistency (LBC) mechanism, especially designed for scalable data mining on the software DSM, is proposed to overcome this overhead. Traditional association rule based data mining programs frequently modify the same fields by count-up operations. In contrast, the LBC mechanism keeps up the consistency by broadcasting the count-up operation logs among the multiple nodes.
International conference proceedings, English

Distributed Shared Memory with Log Based Consistency for Scalable Data Mining
平山秀昭; 本多弘樹; 弓場敏嗣
305-308, Oct. 1999, Peer-reviewed
International conference proceedings, English

A Sandglass Type Parallelization Technique for Doacross Loop
高畠志泰; 本多弘樹; 大澤範高; 弓場敏嗣
Transactions of Information Processing Society of Japan, 40, 5, 2037-2043, May 1999, Peer-reviewed
Scientific journal, English

Performance measurements on sandglass-type parallelization of doacross loops
M Takabatake; H Honda; T Yuba
HIGH-PERFORMANCE COMPUTING AND NETWORKING, PROCEEDINGS, SPRINGER-VERLAG BERLIN, 1593, 663-672, 1999, Peer-reviewed, In this paper, we propose the sandglass-type parallelization technique for a doacross loop which has the characteristics of iteration-based parallelizing and software pipelining. We prove its effectiveness by comparing the sandglass-type to three well-known parallelization techniques: iteration-based, software pipelining, and a combination of doall-type parallel and sequential techniques. We conclude that the sandglass-type parallelization technique is the most effective among the techniques mentioned above in cases in which there are less than ten processing elements and the size of tasks with loop-carried dependences is smaller than the size of tasks lacking loop-carried dependence.
Scientific journal, English

OSCAR multi-grain architecture and its evaluation
H Kasahara; W Ogata; K Kimura; G Matsui; H Matsuzaki; M Okamoto; A Yoshida; H Honda
INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS, IEEE COMPUTER SOC, 106-115, 1998, OSCAR (Optimally Scheduled Advanced Multiprocessor) was designed to efficiently realize multi-grain parallel processing using static and dynamic scheduling. It is a shared memory multiprocessor system having centralized and distributed shared memories in addition to local memory on each processor with data transfer controller for overlapping of data transfer and task processing. Also, its Fortran multi-grain compiler hierarchically exploits coarse grain parallelism among loops, subroutines and basic blocks, conventional medium grain parallelism among loop-iterations in a Doall loop and near fine grain parallelism among statements. At the coarse grain parallel processing, data localization (automatic data distribution) has been employed to minimize data transfer overhead. In the near fine grain processing of a basic block, explicit synchronization can be removed by use of a clock level accurate code scheduling technique with architectural supports. This paper describes OSCAR's architecture, its compiler and the performance for the multi-grain parallel processing. OSCAR's architecture and compilation technology will be more important in future High Performance Computers and single chip multiprocessors.
International conference proceedings, English

A Proposal and Evaluation of The RBCQ Synchronization Mechanism and Compile Method
早川潔; 本多弘樹
Transactions of Information Processing Society of Japan, 39, 6, 1655, 1998, Peer-reviewed
Scientific journal, English

A MULTI-GRAIN PARALLELIZING COMPILATION SCHEME FOR OSCAR (OPTIMALLY SCHEDULED ADVANCED MULTIPROCESSOR)
H KASAHARA; H HONDA; A MOGI; A OGURA; K FUJIWARA; S NARITA
LECTURE NOTES IN COMPUTER SCIENCE, SPRINGER VERLAG, 589, 281-297, 1992, Peer-reviewed, This paper proposes a multi-grain parallelizing compilation scheme for Fortran programs. The scheme hierarchically exploits parallelism among coarse grain tasks, such as, loops, subroutines or basic blocks, among medium grain tasks like loop iterations and among near fine grain tasks like statements. Parallelism among the coarse grain tasks called the macrotasks is exploited by carefully analyzing control dependences and data dependences. The macrotasks are dynamically assigned to processor clusters to cope with run-time uncertainties, such as, conditional branches among the macrotasks and variation of execution time of each macrotask. The parallel processing of macrotasks is called the macro-dataflow computation. A macrotask composed of a Do-all loop, which is assigned onto a processor cluster, is processed in the medium grain in parallel by processors inside the processor cluster. A macrotask composed of a sequential loop or a basic block is processed on a processor cluster in the near fine grain by using static scheduling. A macrotask composed of subroutine or a large sequential loop is processed by hierarchically applying macro-dataflow computation inside a processor cluster. Performance of the proposed scheme is evaluated on a multiprocessor system named OSCAR. The evaluation shows that the multi-grain parallel processing effectively exploits parallelism from Fortran programs.
Scientific journal, English

A MULTI-GRAIN PARALLELIZING COMPILATION SCHEME FOR OSCAR (OPTIMALLY SCHEDULED ADVANCED MULTIPROCESSOR)
H KASAHARA; H HONDA; A MOGI; A OGURA; K FUJIWARA; S NARITA
LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, SPRINGER-VERLAG BERLIN, 589, 281-297, 1992, Peer-reviewed
International conference proceedings, English

A FORTRAN PARALLELIZING COMPILATION SCHEME FOR OSCAR USING DEPENDENCE GRAPH ANALYSIS
H KASAHARA; H HONDA; S NARITA
IEICE TRANSACTIONS ON COMMUNICATIONS ELECTRONICS INFORMATION AND SYSTEMS, IEICE-INST ELECTRON INFO COMMUN ENG, 74, 10, 3105-3114, Oct. 1991, Peer-reviewed, This paper proposes a Fortran parallelizing compilation scheme for a multiprocessor system named OSCAR. The scheme hierarchically exploits parallelism among coarse grain tasks, such as, loops, subroutines or basic blocks, among medium grain tasks like loop iterations and among near fine grain tasks like statements. Parallelism among the coarse grain tasks called the macrotasks is detected by analyzing a macro-flow graph which explicitly represents control flow and data dependences. The detected parallelism among the macrotasks is represented by a directed acyclic graph called a macrotask graph. Macrotasks in a macrotask graph are dynamically assigned to processor clusters to cope with run-time uncertainties. A macrotask composed of a Do-all loop or a Do-across loop, which is assigned onto a processor cluster, is processed in the medium grain in parallel by processors inside the processor cluster. A macrotask composed of a basic block is processed on a processor cluster in the near fine grain by using static scheduling. A macrotask composed of subroutine or a large sequential loop is processed by hierarchically applying macro-dataflow computation inside a processor cluster. Performance of the proposed scheme is evaluated on OSCAR. The evaluation shows that the hierarchical parallel processing scheme using dynamic and static scheduling effectively exploits parallelism from Fortran programs.
Scientific journal, English

Parallel processing scheme of a basic block in a fortran program on oscar
Hiroki Honda; Hironori Kasahara; Seinosuke Narita
Systems and Computers in Japan, 22, 11, 1-13, 01 Jan. 1991, With the development of the supercomputer with multiprocessors, the parallel processing of a Fortran program on the multiprocessor system is considered interesting. This paper proposes a parallel processing scheme for the Fortran program where the assignment unit (task) to the processor is the processing of an arithmetic substitution statement. The implementation and performance evaluation of the proposed scheme on the actual system are reported. In the proposed scheme, the arithmetic substitution statement in the basic block is defined as the task, and the precedence constraints among the tasks due to data dependencies are determined. Based on the derived constraints, the allocation of the tasks to the processors as well as the execution order are determined at the compiling stage, using the multiprocessor scheduling algorithm. Then the codes for the processors are generated and the parallel processing is executed. The proposed scheme was implemented on an actual system and the performance was evaluated. The effect of the parallel processing is manifested. In the conventional loop parallel processing, the effect of the parallel processing cannot be expected for the basic block in the loop or for the basic block of the scalar operation unit outside the loop. On the other hand, it was verified that the parallel processing of those blocks can be realized by applying the proposed method. Copyright © 1991 Wiley Periodicals, Inc., A Wiley Company
Scientific journal, English
DOI URL

PARALLEL PROCESSING SCHEME FOR A FORTRAN PROGRAM ON A MULTIPROCESSOR SYSTEM OSCAR
H HONDA; A MOGI; A OGURA; H KASAHARA; S NARITA
IEEE PACIFIC RIM CONFERENCE ON COMMUNICATIONS, COMPUTERS AND SIGNAL PROCESSING : CONFERENCE PROCEEDINGS, VOLS 1 AND 2, I E E E, 1, 9-12, 1991, Peer-reviewed
International conference proceedings, English

PARALLEL PROCESSING OF NEAR FINE-GRAIN TASKS USING STATIC SCHEDULING ON OSCAR (OPTIMALLY SCHEDULED ADVANCED MULTIPROCESSOR)
H KASAHARA; H HONDA; S NARITA
SUPERCOMPUTING 90, I E E E, COMPUTER SOC PRESS, 856-864, 1990, Peer-reviewed
International conference proceedings, English

MISC

実HPCアプリケーションを用いたマルチGPUにおける電力ばらつきの評価
郡司賢; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
2023, 情報処理学会研究報告(Web), 2023, HPC-188, 202302264623200051

A64FXプロセッサにおける電力・性能ばらつきの評価・分析
草場智也; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
2023, 情報処理学会研究報告(Web), 2023, HPC-188, 202302265328292538

並列アプリケーションのキャッシュミス数予測の評価
長谷川健人; 有馬海人; 三輪忍; 八巻隼人; 本多弘樹
2023, 情報処理学会研究報告(Web), 2023, HPC-188, 202302277125915058

Modeling Performance of Deep Learning for Image Recognition on a GPU Server
松下哲也; 三輪忍; 八巻隼人; 本多弘樹
2023, 電子情報通信学会技術研究報告(Web), 122, 451(CPSY2022 34-55), 2432-6380, 202302236656577507

Evaluation of Countermeasures Power Side-channel Attacks
下島航太; 三輪忍; 八巻隼人; 本多弘樹
2023, 電子情報通信学会技術研究報告(Web), 122, 451(CPSY2022 34-55), 2432-6380, 202302236713517810

Optimizing Hash Functions of Rabin-Karp Method for Multi-Pattern Matching with Multiple Pattern Length
鈴木想生; 八巻隼人; 三輪忍; 本多弘樹
2023, 電子情報通信学会技術研究報告(Web), 122, 451(CPSY2022 34-55), 2432-6380, 202302287281843795

TCAMを用いずにルータの最長一致検索に対応するキャッシュ-メモリ・システム
長田大樹; 八巻隼人; 三輪忍; 本多弘樹; 五島正裕
2023, 情報処理学会研究報告(Web), 2023, ARC-252, 202302240000482906

Traffic Load Balancing on aggregated links
平野愁也; 八巻隼人; 三輪忍; 本多弘樹
2023, 情報処理学会研究報告(Web), 2023, ARC-252, 202302266722204411

CUDAバージョンの違いがカーネルの実行時間と消費電力に与える影響の分析
吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-183, 202202242346420364

Link Congestion-Based Multipath Routing using In-Band Network Telemetry
荒巻慎太朗; 田中京介; 八巻隼人; 三輪忍; 本多弘樹
2022, 電子情報通信学会技術研究報告(Web), 122, 16(NS2022 8-22), 2432-6380, 202202251502143401

CPUおよびGPUの電力ばらつきを考慮したジョブスケジューリング手法の提案
小野賢人; 吉田幸平; 三輪忍; 坂本龍一; 八巻隼人; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-185, 202202239933175791

OpenMP/OpenACCハイブリッド並列化のためのコード変換フレームワークの提案
川崎真之; 大島聡史; 八巻隼人; 三輪忍; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-187, 202202284613935034

LULESHを対象とした関数コール回数予測
有馬海人; 長谷川健人; 三輪忍; 八巻隼人; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-187, 202202286265964027

A Fast and Secure VMI Mechanism for Malware Analysis
森, 瑞穂; 味曽野, 雅史; 八巻, 隼人; 三輪, 忍; 本多, 弘樹; 品川, 高廣
マルウェアの挙動や攻撃手法を解析する手段として，仮想マシン上のプログラムの内部状態を観察するVirtual Machine Introspection (VMI)という手法が用いられている．VMIには、主に外部のハイパーバイザから行うOut-of-the-box方式と仮想マシン内部から行うIn-the-box方式の2つがあるが，両者は解析時の動作速度の高速性と解析システムを保護・隠蔽する安全性の面でトレードオフの関係にある．そこで我々は，高速かつ安全なVMI機構としてFastVMIXを提案する．FastVMIXでは，マルウェアを解析する解析エージェントを仮想マシン内部に挿入することによってハイパーバイザへのコンテキストスイッチを減らしつつ，Intel CPUがサポートするVMFUNCのEPTP SwitchingとHuge Pageを用いた高速な動的メモリ保護変更機構により、マルウェアから解析エージェントのメモリ領域を保護・隠蔽する．また，準パススルー型ハイパーバイザを用いることで、仮想化のオーバーヘッド削減及び隠蔽度の向上を図る．本論文では，BitVisorをベースにFastVMIXを実装した結果を報告する．
As a means of quickly analyzing malware behavior and attack methods, a technique called Virtual Machine Introspection (VMI) is used to observe the internal state of programs on a virtual machine. A typical VMI system mainly takes either an out-of-the-box (i.e., with hypervisor) or in-the-box (i.e., within the virtual machine) approach; however, these two approaches involve a trade-off between the analysis speed and the security of protecting and hiding the analysis system. In this paper, we propose FastVMIX that realizes fast and secure VMI. FastVMIX reduces the number of context switches to the hypervisor during malware analysis by inserting an analysis agent in the target virtual machine, while protecting and hiding the agent's memory area by switching memory protection with EPTP switching of VMFUNC and huge pages supported by Intel CPUs. In addition, we used a para-pass-through hypervisor to reduce the overhead of virtualization and improve the degree of hiding. This paper reports several experimental results of FastVMIX built on BitVisor., 25 Nov. 2021, コンピュータシステム・シンポジウム論文集, 2021, 48-56, Japanese, 170000185943
URL

MPIアプリケーションの関数コール回数予測
有馬海人; 長谷川健人; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-178, 202102222474724022

MPIアプリケーションのキャッシュプロファイル予測
長谷川健人; 有馬海人; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-178, 202102245178693852

Table-Separate Packet Processing Caches for Routing/ARP/ACL/QoS
長田大樹; 田中京介; 八巻隼人; 三輪忍; 本多弘樹; 五島正裕
2021, 情報処理学会研究報告(Web), 2021, ARC-244, 202102212200475516

TensorFlowアプリケーション用GPUサーバにおけるNVDIMMの利用可能性の検討
松下哲也; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, ARC-244, 202102288818448896

Mesh TensorFlowを用いたモデル並列学習におけるCPU-GPU間のデータ転送最適化
横手宥則; 三輪忍; 八巻隼人; 本多弘樹
2021, 電子情報通信学会技術研究報告(Web), 120, 435(CPSY2020 50-69), 2432-6380, 202102235623871899

Wisteria/BDEC-01におけるNVIDIA A100 GPUの電力性能ばらつきの評価
提山春日; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-182, 202102219931012939

SDNコントローラにおける優先度付きキューを用いた高優先度パケットの処理高速化
高倉玲央; 八巻隼人; 三輪忍; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, ARC-246, 202102229681366351

深層学習における実行時ファイルステージング
樋口遼太郎; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-182, 202102244061258213

MPIにおける小規模実行時の通信トレース解析による大規模実行時の通信タイミング予測の評価
岡田悠希; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-182, 202102265589208738

ネットワーク機器における高速なGZIP復号のためのキャッシュ利用効率向上手法
黒川雄亮; 八巻隼人; 三輪忍; 本多弘樹
2020, 電子情報通信学会技術研究報告, 119, 429(DC2019 98-121)(Web), 0913-5685, 202002211528472983

動画トラフィック検査除外手法のSnortにおける実装
祐野雅範; 八巻隼人; 三輪忍; 本多弘樹
2020, 電子情報通信学会技術研究報告, 119, 429(DC2019 98-121)(Web), 0913-5685, 202002260646532043

パケット処理キャッシュにおけるパイプライン化とマルチポート化の評価
田中京介; 八巻隼人; 三輪忍; 本多弘樹
2019, 情報処理学会研究報告(Web), 2019, ARC-237, 201902240481337109

多頻度・順不同で到着するシーケンスデータの主キーごとの処理順序制約を満たすリアルタイム並列処理手法
山添高弘; 三輪忍; 本多弘樹
2019, 情報処理学会研究報告(Web), 2019, DBS-169, 201902235140320625

TSUBAME3.0における製造ばらつきを考慮したGPUの電力モデリングの高速化
大八木哲哉; 浅田風太; 三輪忍; 八巻隼人; 本多弘樹
2019, 情報処理学会研究報告(Web), 2019, HPC-172, 202002234496157254

テーブル検索回数の削減によるインターネットルータの高スループット化および省電力化
山下壮樹; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会技術研究報告, 119, 343(IA2019 48-58)(Web), 0913-5685, 202002264724684120

OpenFlowを用いた動画フローの非ミラーリングによるNIDS処理負荷の削減
高倉玲央; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会技術研究報告, 119, 343(IA2019 48-58)(Web), 0913-5685, 202002285135458100

ネットワーク機器上における高速なGZIP復号のためのキャッシュ利用効率向上手法の提案
黒川雄亮; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902210265635577

学習済み重みを利用した畳み込みニューラルネットワークの学習法の初期検討
横手宥則; 三輪忍; 井内悠太; 津邑公暁; 八巻隼人; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902220924867022

キャッシュを利用したOpenFlow通信の高速化
祐野雅範; 三輪忍; 八巻隼人; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902227184856037

GPUの電力ばらつきモデリング
浅田風太; 三輪忍; 本多弘樹; 八巻隼人
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902281110920126

ネットワークベースの攻撃に対応可能な高対話型ハニーポット
森瑞穂; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902217092133421

ON/OFFリンクにおける通信開始遅延を低減するためのプリウェイクアップ手法の検討
松山, 朋樹; 三輪, 忍; 八巻, 隼人; 本多, 弘樹
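The power consumption of recent supercomputers is approaching the limit of the power that can be supplied, so the power consumption of each hardware component in the system must be reduced further. ON/OFF links, which can place links that are not communicating into a low-power mode, are attracting attention as a power-saving technique for supercomputer interconnection networks. However, when a communication request arrives while a link is in low-power mode, the link must first be returned to normal mode, and the start of communication is delayed by the time this mode transition takes. This study therefore examines a method that returns a low-power link to normal mode ahead of the communication request (pre-wake-up), so that communication can start immediately once the data arrives., 13 Mar. 2018, 第80回全国大会講演論文集, 2018, 1, 123-124, Japanese, 170000176601, AN00349328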
URL

CNN計算の省メモリ化のためのカーネル・クラスタリング手法の検討—A Study of Kernel Clustering for Reducing Memory Footprint of CNN—コンピュータシステム ; 組込み技術とネットワークに関するワークショップETNET2018
松井優樹; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 479, 185-190, Japanese, 0913-5685, 40021521631, AA1123312X
URL

NVDIMMを用いたメモリスナップショットの解析システム—A System for Analyzing Memory Snapshot with NVDIMM—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2018
三須雅仁; 三輪忍; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 480, 107-112, Japanese, 0913-5685, 40021521593, AA1123312X
URL

ゲートウェイにおける攻撃パケットに着目したテーブル検索負荷削減手法の提案—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2018
愛甲達也; 八巻隼人; 三輪忍; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 480, 89-94, Japanese, 0913-5685, 40021521583, AA1123312X
URL

HSPICEを用いたシリコン回路とカーボンナノチューブ回路の比較評価—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2018
松尾駿; 三輪忍; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 480, 119-124, Japanese, 0913-5685, 40021522224, AA1123312X
URL

A System for Analyzing Memory Snapshot with NVDIMM
三須雅仁; 三輪忍; 八巻隼人; 本多弘樹
2018, 電子情報通信学会技術研究報告, 117, 480(DC2017 89-106), 0913-5685, 201802277611652387

A Study of Kernel Clustering for Reducing Memory Footprint of CNN
松井優樹; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
2018, 電子情報通信学会技術研究報告, 117, 480(DC2017 89-106), 0913-5685, 201802279006502398

プリウェイクアップ手法によるON/OFFリンクの消費エネルギー削減
松山朋樹; 三輪忍; 八巻隼人; 本多弘樹
2018, 情報処理学会研究報告(Web), 2018, HPC-165, 201802210749233668

1Tbps実現に向けたルータのメモリ階層の最適化
田中京介; 八巻隼人; 三輪忍; 本多弘樹
2018, 情報処理学会研究報告(Web), 2018, ARC-233, 201902221895431753

動画トラフィックに着目したNIDSにおける文字列探索処理負荷削減手法の提案
高徳真晴; 八巻隼人; 三輪忍; 本多弘樹
電子情報通信学会, Jul. 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 153, 177-183, Japanese, 0913-5685, 40021286172, AA1123312X
URL

パケット処理キャッシュにおける送信元IPアドレスに着目したミス削減手法に関する初期検討
八巻隼人; 愛甲達也; 三輪忍; 本多弘樹
電子情報通信学会, May 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 44, 55-62, Japanese, 0913-5685, 40021215515, AA1123312X
URL

マルチコアニューラルネットワークアクセラレータにおけるデータ転送のブロードキャスト化—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2017
大場百香; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 116, 511, 165-170, Japanese, 0913-5685, 40021158854, AA1123312X
URL

ジョブ実行中の計算ノードにおけるDIMM待機電力削減手法の実装と評価
石原雅也; 三輪忍; 八巻隼人; 本多弘樹
2017, 情報処理学会研究報告(Web), 2017, HPC-158, 201702274862027744

再構成可能なニューラルネットワークアクセラレータの提案と性能分析
大場百香; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
電子情報通信学会, Aug. 2016, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 116, 177, 235-242, Japanese, 0913-5685, 40020932410, AA1123312X
URL

リンクオフスレッショルドを有するON/OFFリンクの電力見積手法の初期検討
西郷雄斗; 三輪忍; 八巻隼人; 本多弘樹
2016, 情報処理学会研究報告(Web), 2016, HPC-155, 201602251172366586

メモリホットプラグを用いたメインメモリの省電力化に関する初期検討
石原雅也; 三輪忍; 八巻隼人; 本多弘樹
2016, 情報処理学会研究報告(Web), 2016, HPC-155, 201602256412247012

演算器におけるオペランド値を考慮したパワーゲーティングに関する初期検討
石川雄介; 小柴篤史; 坂本龍一; 和田康孝; 三輪忍; 近藤正章; 並木美太郎; 本多弘樹
2015, 電子情報通信学会技術研究報告, 115, 243(CPSY2015 45-60), 0913-5685, 201502202570022719

高性能計算環境向け電力配分自動最適化のためのコンパイラ環境の構築
和田康孝; 稲富雄一; 井上弘士; 三吉郁夫; 近藤正章; 本多弘樹
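In future HPC systems, power consumption is expected to be one of the largest factors constraining system design and effective performance, and research and development is under way on power-oriented HPC systems that treat the available power budget as the most important resource. To improve the effective performance of applications on such power-oriented systems, the power-performance knobs of CPUs, DRAM, accelerators, and other components must be adjusted under a given upper limit on power consumption (the power budget) so that power is distributed appropriately among the components. Doing so, however, requires knowing in advance the relationship between each power-performance knob and the execution performance of the target application, which takes considerable effort. In addition, the power allocation optimization methods proposed so far are diverse, and different information must be collected for each method applied. This report describes a compiler framework that solves these problems by automating performance/power analysis and power allocation for parallel applications., 一般社団法人情報処理学会, 21 Jul. 2014, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2014, 11, 1-8, Japanese, 110009808106, AN10463942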
URL

HPCシステムにおける電力・性能可視化ツールに関する比較検討
和田康孝; カオタン; 近藤正章; 本多弘樹
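In future HPC systems, power consumption is expected to be one of the largest factors constraining system design and effective performance. However, under the conventional design philosophy of guaranteeing that a system's peak power consumption does not exceed the power constraint, it is considered difficult to scale applications to future large-scale systems. Based on this view, we are researching and developing power-constraint-adaptive systems, which actively allow peak power consumption to exceed the constraint and adjust power-performance knobs appropriately to obtain high effective performance from limited power resources, together with the power management framework needed to realize them. To optimize applications efficiently on such systems, a mechanism is needed that collects power consumption information in addition to conventional profiles, execution traces, and hardware counter data, and visualizes it in a form that is easy for programmers to understand. In this report, we compared the power consumption information collection functions of existing performance analysis tools, mainly from the viewpoints of reducing measurement overhead, visualization, and portability across tools. As a result, compared with existing tools, it became possible to record power consumption information with low overhead in the highly portable Open Trace Format Version 2 (OTF2) format., 09 Dec. 2013, 研究報告計算機アーキテクチャ（ARC）, 2013, 31, 1-7, Japanese, 170000079478, AN10096105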
URL

GPUにおける走行時パワーゲーティング向けスレッドブロック割り当ておよびワープ発行制御手法
松本洋平; 近藤正章; 和田康孝; 本多弘樹
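In recent years, the increasing power consumption of GPUs has become a problem. To reduce GPU leakage power, this report considers applying power gating to GPU cores and to the SIMD execution units inside the cores, and proposes a thread block assignment method and a warp issue control method that improve the resulting leakage power reduction. The proposed power gating scheme controls the power supply at multiple granularities: GPU cores, SIMD units, and individual execution units within a SIMD unit. Specifically, when on-chip network stalls occur frequently, thread block assignment to some cores is stopped and the power supply to those GPU cores is cut off. In addition, because execution units can remain idle for long periods during stalls caused by waiting for memory accesses, the power supply is controlled at fine granularity, per SIMD unit and even per execution unit, according to execution unit utilization. Simulation-based evaluation showed that thread block assignment control and warp issue control increase the leakage energy reduction obtained from power gating while limiting performance degradation., 09 Dec. 2013, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2013, 15, 1-8, Japanese, 170000079426, AN10463942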
URL

GPUにおける走行時パワーゲーティング向けスレッドブロック割り当ておよびワープ発行制御手法
松本洋平; 近藤正章; 和田康孝; 本多弘樹
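In recent years, the increasing power consumption of GPUs has become a problem. To reduce GPU leakage power, this report considers applying power gating to GPU cores and to the SIMD execution units inside the cores, and proposes a thread block assignment method and a warp issue control method that improve the resulting leakage power reduction. The proposed power gating scheme controls the power supply at multiple granularities: GPU cores, SIMD units, and individual execution units within a SIMD unit. Specifically, when on-chip network stalls occur frequently, thread block assignment to some cores is stopped and the power supply to those GPU cores is cut off. In addition, because execution units can remain idle for long periods during stalls caused by waiting for memory accesses, the power supply is controlled at fine granularity, per SIMD unit and even per execution unit, according to execution unit utilization. Simulation-based evaluation showed that thread block assignment control and warp issue control increase the leakage energy reduction obtained from power gating while limiting performance degradation., 09 Dec. 2013, 研究報告計算機アーキテクチャ（ARC）, 2013, 15, 1-8, Japanese, 170000079462, AN10096105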
URL

HPCシステムにおける電力・性能可視化ツールに関する比較検討
和田康孝; カオタン; 近藤正章; 本多弘樹
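In future HPC systems, power consumption is expected to be one of the largest factors constraining system design and effective performance. However, under the conventional design philosophy of guaranteeing that a system's peak power consumption does not exceed the power constraint, it is considered difficult to scale applications to future large-scale systems. Based on this view, we are researching and developing power-constraint-adaptive systems, which actively allow peak power consumption to exceed the constraint and adjust power-performance knobs appropriately to obtain high effective performance from limited power resources, together with the power management framework needed to realize them. To optimize applications efficiently on such systems, a mechanism is needed that collects power consumption information in addition to conventional profiles, execution traces, and hardware counter data, and visualizes it in a form that is easy for programmers to understand. In this report, we compared the power consumption information collection functions of existing performance analysis tools, mainly from the viewpoints of reducing measurement overhead, visualization, and portability across tools. As a result, compared with existing tools, it became possible to record power consumption information with low overhead in the highly portable Open Trace Format Version 2 (OTF2) format., 09 Dec. 2013, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2013, 31, 1-7, Japanese, 170000079442, AN10463942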
URL

RAPLインタフェースを用いたHPCシステムの消費電力モデリングと電力評価
カオタン; 和田康孝; 近藤正章; 本多弘樹
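In future HPC systems, power consumption is expected to be one of the largest factors constraining system design and effective performance. Recognizing that it is difficult to scale applications to future large-scale systems under the conventional design philosophy of guaranteeing that peak power consumption during operation does not exceed the power constraint, we are researching and developing power-constraint-adaptive systems, which actively allow peak power consumption to exceed the constraint and obtain high effective performance by using limited power resources effectively while adjusting power-performance knobs appropriately, together with the power management framework needed to realize them. Such systems require an environment in which power consumption during application execution can be observed and power can be controlled flexibly. Recent Intel processors provide an interface called RAPL (Running Average Power Limit) for observing and controlling the power consumption of the processor and DRAM. In this report, we use RAPL to measure and control power consumption while running applications, and investigate the power measurement characteristics of computers used in HPC systems. To enable flexible measurement of whole-node power, we also model whole-node power using RAPL measurements. The experiments showed that RAPL makes it possible to observe the power consumption of the processor, DRAM, and the node with high accuracy., 一般社団法人情報処理学会, 23 Sep. 2013, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2013, 20, 1-8, Japanese, 110009606441, AN10463942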
URL
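
The measurement step described in the abstract above can be illustrated with a minimal sketch, which is not taken from the paper. It derives average power from two readings of a cumulative energy counter of the kind RAPL exposes. The function names, the 32-bit counter width, and the caller-supplied `joules_per_count` unit are assumptions made for illustration; on real hardware the unit and width should be read from the platform.

```python
# Illustrative only: names and the 32-bit width are assumptions, not the paper's code.
COUNTER_BITS = 32  # assumed width of the cumulative energy counter


def energy_delta_joules(before, after, joules_per_count):
    """Energy consumed between two raw counter readings.

    The modulo handles at most one counter wraparound between the readings.
    """
    counts = (after - before) % (1 << COUNTER_BITS)
    return counts * joules_per_count


def average_power_watts(before, after, seconds, joules_per_count):
    """Average power over the sampling interval, in watts."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return energy_delta_joules(before, after, joules_per_count) / seconds
```

Sampling has to be frequent enough that the counter wraps at most once per interval; otherwise the computed energy is an underestimate.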

DVFSドメインを考慮した低消費電力化タスクスケジューリング手法に関する検討
和田康孝; 近藤正章; 本多弘樹
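Various techniques are now used to reduce the power consumption of computer systems. In particular, exploiting the DVFS (Dynamic Voltage-Frequency Scaling) capability of processors is, along with applying power gating, an effective means of reducing power and energy consumption during program execution. Conventional DVFS scheduling methods have shown that energy consumption can be reduced while limiting the increase in program execution time, under the condition that DVFS can be applied to each core individually. In the future, however, the number of cores per chip is expected to grow, making per-core DVFS difficult. This paper reports the results of studying and evaluating a low-power task scheduling method for many-core processors that takes into account the case where DVFS can be applied only in units of multiple cores., 一般社団法人情報処理学会, 24 Jul. 2013, 情報処理学会研究報告. 計算機アーキテクチャ研究会報告, 2013, 22, 1-7, Japanese, 110009587864, AN10096105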
URL

電力モード協調によるプロセッサと主記憶の省電力化の検討
宮部創一; 近藤正章; 和田康孝; 本多弘樹
15 May 2013, 先進的計算基盤システムシンポジウム論文集, 2013, 147-147, Japanese, 170000076938
URL

GPUにおける細粒度パワーゲーティング向けスレッド発行制御手法の検討
松本洋平; 近藤正章; 和田康孝; 本多弘樹
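In recent years, the growing power consumption of GPUs has become a problem. This paper considers applying fine-grained power gating as a way to reduce the leakage power of the SIMD execution units in a GPU, and proposes thread issue control methods that increase the resulting leakage power savings. The assumed fine-grained power gating controls the power supply of each individual execution unit in the GPU's SIMD hardware. The proposed methods control the issue of threads within each warp in order to limit the power overhead incurred when switching between sleep mode and active mode as power is turned on and off. As thread issue control methods, we examine thread compaction, which concentrates thread execution on a subset of the execution units, and warp splitting, which divides one warp into two warps. In a preliminary simulation-based evaluation, thread compaction and warp splitting reduced leakage energy by 46% and 71%, respectively, and combining the two reduced leakage power by 74%., 19 Mar. 2013, 研究報告計算機アーキテクチャ（ARC）, 2013, 2, 1-7, Japanese, 110009552435, AN10096105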
URL
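
The idea behind thread compaction in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' simulator: each instruction is represented by a per-lane active mask, lanes that are never used are assumed to stay power-gated, and the function name `lanes_powered` is hypothetical.

```python
def lanes_powered(masks, compaction):
    """Number of SIMD lanes that must be powered on at some point.

    masks: list of equal-length sequences of booleans, one per issued
           instruction, where True marks a lane with an active thread.
    compaction: if True, active threads are packed onto the lowest lanes,
                so only as many lanes as the widest instruction are needed.
    """
    if not masks:
        return 0
    if compaction:
        # Packed execution: lanes 0..k-1 are used, k = largest active count.
        return max(sum(1 for on in mask if on) for mask in masks)
    # Unpacked execution: a lane is needed if any instruction uses it.
    width = len(masks[0])
    return sum(1 for lane in range(width) if any(mask[lane] for mask in masks))
```

For two instructions on a 4-lane unit with masks `[True, False, False, True]` and `[False, True, False, False]`, three lanes are touched without compaction and two with it, so one more lane can remain gated. The sketch ignores the cost of moving operands between lanes, which a real design would have to account for.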

マルチコア・プロセッサ向けのヘルパースレッドによるキャッシュ制御支援手法の検討
橋本崇浩; 井上功一; 近藤正章; 平澤将一; 本多弘樹
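In recent years, multicore processors, which place multiple cores on a single chip, have become mainstream. The number of cores is expected to keep growing, but parallel programs that can exploit many cores are currently limited, so making effective use of the additional cores is an important challenge. Multicore processors also often implement a shared cache memory in order to use cache capacity effectively. However, differences in access patterns and access intervals among threads can cause cache contention, in which highly reusable data is evicted from the cache. This study therefore examines a method that runs a dedicated thread, which assists replacement control of the shared cache, as a helper thread on an idle core, aiming to improve performance by mitigating cache contention. The helper thread obtains cache-miss information from threads running on other cores, predicts data reusability, and mitigates contention by making data with low reusability more likely to be evicted the next time a cache miss occurs in the same set. The evaluation confirmed that the proposed method can improve performance when contention in the shared cache is frequent. On the other hand, at present the software processing cannot keep up with the rate of cache-miss events, so the performance improvement is not large., 20 Mar. 2012, 研究報告計算機アーキテクチャ（ARC）, 2012, 13, 1-8, Japanese, 110008803085, AN10096105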
URL

複数GPU向けのCUDAコードを生成するOpenMP処理系の提案
長塚郁; 大島聡史; 平澤将一; 近藤正章; 本多弘樹
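The authors are developing "OMPCUDA", a system that translates OpenMP programs into CUDA programs. This paper describes the implementation of a feature in OMPCUDA that generates CUDA programs for multiple GPUs, and discusses evaluation results for the generated CUDA code., 19 Mar. 2012, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2012, 12, 1-8, Japanese, 110008803047, AN10463942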
URL

A Task Scheduling Method for Low-Energy Consumption on Heterogeneous Cluster Systems
山下良; 近藤正章; 平澤将一; 本多弘樹
Reducing the energy consumption of data centers is an increasingly important requirement for data-center operations. Because server hardware is replaced frequently, data centers are often heterogeneous, made up of a variety of machines. As a result, the energy needed to process a given task differs from server to server, and the energy needed to process a task set depends on how it is scheduled. This paper proposes an energy-reducing task scheduling method for heterogeneous server environments, targeting task sets with precedence constraints. Starting from the schedule produced by the conventional Heterogeneous Earliest Finish Time (HEFT) algorithm, the method re-allocates tasks to servers while taking into account processor power consumption in idle or standby mode, reducing the energy of task processing. The evaluation showed that, compared with HEFT, the proposed method reduces energy consumption in most of the evaluated cases without lengthening the schedule of the task set., 情報処理学会, 03 Mar. 2011, 研究報告計算機アーキテクチャ（ARC）, 2011, 3, 1-8, Japanese, 2186-2583, 110008583163, AN10096105
URL
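
Why idle power matters when tasks are re-allocated at a fixed schedule length can be shown with a small energy-accounting sketch. It is not the paper's algorithm: the two-state power model (one active and one idle power level per server), the dictionary keys, and the function name are assumptions made for illustration.

```python
def schedule_energy_joules(servers, makespan_s):
    """Total energy of a schedule under a simple two-state power model.

    servers: list of dicts with hypothetical keys
        "busy_s"   - seconds the server spends executing tasks
        "active_w" - power draw while executing, in watts
        "idle_w"   - power draw while idle or in standby, in watts
    makespan_s: schedule length in seconds; every server is assumed to be
        powered for the whole makespan.
    """
    total = 0.0
    for server in servers:
        busy = server["busy_s"]
        if not 0 <= busy <= makespan_s:
            raise ValueError("busy time must lie within the makespan")
        idle = makespan_s - busy
        total += server["active_w"] * busy + server["idle_w"] * idle
    return total
```

With a 10 s makespan, a server busy for 6 s at 100 W (40 W idle) and another busy for 4 s at 60 W (10 W idle) consume 1060 J in total. Moving 2 s of work from the first server to the second, assuming the work takes the same time there, gives 1040 J at the same makespan.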

A Task Allocation Strategy for Heterogeneous Server Systems with A Machine Learning Approach
穂園智哉; 近藤正章; 平澤将一; 本多弘樹
With the growing demand for web services and database processing, and the spread of cloud computing, data centers that operate large numbers of servers are becoming increasingly important. To meet increasing service demand, data centers must keep updating their hardware. However, all hardware is rarely replaced at once; it is usually replaced gradually, so the server environment tends to be heterogeneous. The MIPS rate a task achieves therefore depends on the server that runs it, and the server that maximizes the MIPS rate can vary with the characteristics of the task. This paper proposes a method that uses machine learning to predict the affinity between a task's characteristics and each server, and allocates tasks to servers based on that prediction to improve data-center throughput. We also examine the accuracy of the machine-learning prediction and the effectiveness of prediction-based task allocation in improving data-center throughput., 情報処理学会, 03 Mar. 2011, 研究報告計算機アーキテクチャ（ARC）, 2011, 12, 1-8, Japanese, 2186-2583, 110008583172, AN10096105
URL

A Software Cache Implementation for GPU
HIRASAWA SHOICHI; SHIMODA KAZUAKI; OHSHIMA SATOSHI; HONDA HIROKI
GPUs are attracting attention in high-performance computing. With CUDA, NVIDIA GPUs achieve very high peak performance in a range of applications through programming techniques that make effective use of the fast shared memory, but ease of programming and generality remain problems. This study proposes a software cache mechanism that uses part of the shared memory, which users otherwise manage explicitly in CUDA, as a cache for device memory. With this mechanism, data is transferred implicitly from device memory to shared memory, and general-purpose computation benchmark programs are accelerated., 情報処理学会, 23 Nov. 2009, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2009, 9, 1-10, Japanese, 0919-6072, 110007995455, AN10463942
URL

Parallelism abstraction of distributed heterogeneous computing environment
HIRASAWA SHOICHI; HONDA HIROKI
Parallel execution environments such as multicore CPUs, clusters, and grids are becoming increasingly widespread. The move from shared memory to distributed memory, and from homogeneous to heterogeneous multicore CPUs, has made system architectures more complicated. Each parallel execution environment uses its own programming interface and programming model, which burdens users and hinders the spread of these environments. In this paper, we treat distributed memory and heterogeneous architectures under a general programming model and consider how to realize a unified programming interface for distributed heterogeneous system architectures., Information Processing Society of Japan (IPSJ), 22 Nov. 2007, IPSJ SIG Notes, 2007, 115, 57-60, Japanese, 0919-6072, 110006533750, AN10096105
URL

POSIX Thread API for Cell Processor
MACHIDA SATOSHI; NAKANISHI YU; HIRASAWA SHOICHI; HONDA HIROKI
The Cell Broadband Engine (CBE), with its highly efficient computing power, is attracting attention. However, to draw out the performance of the Cell processor, a program must be written with the API provided for the Cell processor. This burdens programmers because the API is specific to the Cell processor. In this paper, we developed and evaluated a tool that converts source code written with POSIX threads into source code for the Cell processor. Experimental results show that the proposed tool enables programmers to create PPE/SPE source code for the Cell processor easily, without writing descriptions to control the Cell processor., Information Processing Society of Japan (IPSJ), 22 Nov. 2007, IPSJ SIG Notes, 2007, 115, 71-76, Japanese, 0919-6072, 110006533753, AN10096105
URL

F-Omega: A Framework for Grid RPC Application with Adaptive Server Use
WATANABE HIROMASA; HIRASAWA SHOICHI; HONDA HIROKI
Grid applications need flexibility so that they can keep executing even when they have to change computation nodes according to a user-specified schedule for resource usage. In this paper, we propose F-Omega, a framework for easily developing and executing flexible grid applications. F-Omega has the following features: an easy event-driven programming style for grid applications that conceals remote library management, automatic visualization of resource usage constraints, and a user interface for dynamic control of the resource usage of grid applications. Experimental results for F-Omega applications show that F-Omega enables programmers to create grid applications easily and keep their code simple and uncluttered, without implementing remote library management. F-Omega also enables users to control resource usage efficiently by reducing the burden of monitoring server constraints., Information Processing Society of Japan (IPSJ), 02 Mar. 2007, IPSJ SIG Notes, 2007, 17, 181-186, Japanese, 110006248439, AN10463942
URL

F-Omega: A Framework for Grid RPC Application with Adaptive Server Use
WATANABE HIROMASA; HIRASAWA SHOICHI; HONDA HIROKI
Grid applications need flexibility so that they can keep executing even when they have to change computation nodes according to a user-specified schedule for resource usage. In this paper, we propose F-Omega, a framework for easily developing and executing flexible grid applications. F-Omega has the following features: an easy event-driven programming style for grid applications that conceals remote library management, automatic visualization of resource usage constraints, and a user interface for dynamic control of the resource usage of grid applications. Experimental results for F-Omega applications show that F-Omega enables programmers to create grid applications easily and keep their code simple and uncluttered, without implementing remote library management. F-Omega also enables users to control resource usage efficiently by reducing the burden of monitoring server constraints., Information Processing Society of Japan (IPSJ), 02 Mar. 2007, IPSJ SIG Notes, 2007, 17, 181-186, Japanese, 0919-6072, 110006249890, AN10096105
URL

Assistance Tool for Trial-and-Error in Source-Code-Level Optimization on iPat/OMP
NAGANO YUSUKE; ISHIHARA MAKOTO; HIRASAWA SHOICHI; HONDA HIROKI
Source-code-level performance optimization is important in parallel programming. However, existing development environments have problems, such as reduced code readability and the large amount of work required for trial and error in source-code-level optimization. In this paper, we propose a new method for trial and error in source-code-level optimization, in which optimization is performed by directive-based interactive code transformation and undo. We apply the proposed method to the program restructuring function of iPat/OMP, an interactive parallelization assistance tool for OpenMP. The results of an application experiment show that our method solves the above-mentioned problems in source-code-level optimization., Information Processing Society of Japan (IPSJ), 01 Mar. 2007, IPSJ SIG Notes, 2007, 17, 67-72, Japanese, 0919-6072, 110006248420, AN10463942
URL

Assistance Tool for Trial-and-Error in Source-Code-Level Optimization on iPat/OMP
NAGANO YUSUKE; ISHIHARA MAKOTO; HIRASAWA SHOICHI; HONDA HIROKI
Source-code-level performance optimization is important in parallel programming. However, existing development environments have problems, such as reduced code readability and the large amount of work required for trial and error in source-code-level optimization. In this paper, we propose a new method for trial and error in source-code-level optimization, in which optimization is performed by directive-based interactive code transformation and undo. We apply the proposed method to the program restructuring function of iPat/OMP, an interactive parallelization assistance tool for OpenMP. The results of an application experiment show that our method solves the above-mentioned problems in source-code-level optimization., Information Processing Society of Japan (IPSJ), 01 Mar. 2007, IPSJ SIG Notes, 2007, 17, 67-72, Japanese, 0919-6072, 110006249871, AN10096105
URL

HTTP Connection Based RPC for World Wide Interactive Systems
HIRAYAMA HIDEAKI; HONDA HIROKI; YUBA TOSHITSUGU
Since the WWW (World Wide Web) was developed, the number of Internet users has increased dramatically. The WWW is now used mostly for retrieving information, but more complicated styles of distributed processing are increasing with the needs of EC (Electronic Commerce). In this paper, we propose a new scheme named "HTTP connection based RPC" for worldwide interactive application models such as groupware. The RPC has features such as designation of server applications by URL, connections from server applications to client applications, and synchronous/asynchronous RPC. It provides a worldwide, flexible distributed processing environment for interactive application models., Information Processing Society of Japan (IPSJ), 10 Jul. 2000, 情報処理学会研究報告インターネットと運用技術（IOT）, 2000, 62, 1-6, Japanese, 0919-6072, 110002929198, AA12326962
URL

道しるべ:自動並列化コンパイラ
本多弘樹
1998, 情報処理学会,学会誌, 39, 4, 358-361, Japanese, Peer-reviewed, Introduction other

並列処理のためのシステムソフトウェア：特集「並列処理のためのシステムソフトウェア」の編集にあたって
岩澤京子; 本多弘樹
15 Sep. 1993, 情報処理, 34, 9, Japanese, 170000001199, AN00116625
URL

実行開始条件による並列性検出手法 : ループへの拡張
本多弘樹; 笠原博徳
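This paper discusses a method for automatically detecting parallelism among macrotasks across an entire program, which is needed in a system that automatically performs parallel processing of Fortran programs at the macrotask (coarse-grain task) level., 01 Mar. 1993, 全国大会講演論文集, 46, 69-70, Japanese, 110002882800, AN00349328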
URL

A Data-Localization Scheme for Macro-Dataflow Computation
Yoshida Akimasa; Maeda Seiji; Ogata Wataru; Okamoto Masami; Honda Hiroki; Kasahara Hironori
This paper proposes a data-localization scheme for macro-dataflow computation, which automatically exploits parallelism among coarse-grain tasks such as loops, subroutines, and basic blocks in a Fortran program. Data localization means that data are transferred through local memory among tasks assigned to the same processor at run time. This method can reduce the data transfer overhead caused by common memory accesses. The proposed scheme employs an aligned loop decomposition method to allow a compiler to localize data among loops. The scheme has been implemented on an actual multiprocessor system named OSCAR, and the results of performance evaluation are also discussed., The Institute of Electronics, Information and Communication Engineers, 1993, IEICE technical report. Computer systems, CPSY93-23, Japanese, 110003180008, AN10013141

Near Fine Grain Parallel Processing on a Multiprocessor System Without Synchronization
尾形航; 岡本雅巳; 本多弘樹; 笠原博徳; 成田誠之助
The near fine grain parallel processing scheme using static scheduling algorithms has been proposed to process a Fortran basic block in parallel on a multiprocessor system. However, the scheme suffers from relatively large synchronization overhead, since synchronization code must be inserted into the parallel machine code to satisfy precedence constraints caused by data dependences among tasks. To cope with this problem, this paper proposes a parallel code generation scheme that removes all synchronization by raising scheduling precision and optimizing the execution timing of every instruction at the machine-clock level. It also reports the performance of parallel processing without synchronization, evaluated on an actual multiprocessor system, OSCAR., 22 Oct. 1992, 情報処理学会研究報告計算機アーキテクチャ（ARC）, 1992, 82, 149-156, Japanese, 170000021187, AN10096105
URL

Macro-dataflow Computation of FORTRAN Program on Multi-processor Super Computer
合田憲人; 岡本雅巳; 尾形航; 本多弘樹; 笠原博徳; 成田誠之助
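In conventional parallel processing of FORTRAN programs on shared-main-memory multiprocessors that couple a relatively small number of high-performance processors (multiprocessor supercomputers), only macrotasking (subroutine-level parallel processing) and microtasking (loop-level parallel processing) have been used. In addition, extracting coarse-grain parallelism from a program has in most cases been left to the user. This paper proposes a macro-dataflow processing method for FORTRAN programs on multiprocessor supercomputers. In this method, the compiler automatically divides the program into coarse-grain tasks (macrotasks), extracts parallelism among macrotasks, and generates dynamic scheduling code dedicated to each Fortran program, enabling efficient parallel processing with low overhead., 24 Feb. 1992, 全国大会講演論文集, 44, 25-26, Japanese, 110002888936, AN00349328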
URL

Parallel Processing of Fortran Subroutines on OSCAR
茂木章善; 本多弘樹; 笠原博徳
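The authors have been proposing a method for coarse-grain parallel processing (macro-dataflow) of Fortran programs on multiprocessor systems with multiple processor clusters. This paper describes subroutine parallel processing within this macro-dataflow processing. Fortran has two kinds of subprograms, subroutines and functions, but the same techniques apply to both, so this paper discusses subroutines only., 25 Feb. 1991, 全国大会講演論文集, 42, 74-75, Japanese, 110002887089, AN00349328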
URL

Fortranプログラム粗粒度タスク間の並列性検出手法
本多弘樹; 岩田雅彦; 笠原博徳
電子情報通信学会, Dec. 1990, 電子情報通信学会論文誌 D-1 情報・システム, 73, 12, p951-960, Japanese, 0915-1915, 40004780397, AN10071319
URL

PARALLEL PROCESSING OF NEAR FINE GRAIN TASKS ON OSCAR (Optimally Scheduled Advanced Multiprocessor)
笠原博徳; 本多弘樹; Premchaiswadi Wichian; 小椋章央; 茂木章善; 成田誠之助
This paper proposes a compilation scheme for parallel processing of near fine grain tasks, each of which consists of several operations or a statement, on a multiprocessor system called OSCAR (Optimally Scheduled Advanced Multiprocessor). The scheme generates optimized parallel machine code that minimizes synchronization overhead and data transfer overhead and makes optimal use of the registers of each processor, by using static multiprocessor scheduling algorithms that take data transfer among processors into account. The scheme can be combined effectively with the compilation scheme for macro-dataflow computation, which uses parallelism among coarse grain tasks such as loops, basic blocks, and subroutines, and with traditional loop concurrentization, which uses parallelism among medium grain tasks such as iterations. A compiler using the proposed scheme has been implemented on OSCAR, which was designed to take full advantage of static scheduling. The paper also describes a performance evaluation of the scheme on OSCAR., 18 Jul. 1990, 情報処理学会研究報告計算機アーキテクチャ（ARC）, 1990, 60, 97-102, Japanese, 170000021395, AN10096105
URL

An Implementation of Fortran Parallelizing Compiler on OSCAR
広田雅一; 本多弘樹; 笠原博徳
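The authors have been proposing a parallel processing method for FORTRAN programs on multiprocessor systems with multiple clusters. This paper reports on a partial implementation of that method on a single OSCAR processor cluster., 15 Mar. 1989, 全国大会講演論文集, 38, 1447-1448, Japanese, 110002895259, AN00349328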
URL

階層型マルチプロセッサシステムOSCAR上でのFortran並列処理手法
本多弘樹
1989, 並列処理シンポジウムJSPP'89論文集, 2, 251-258, 10006747349

Parallel Processing of the Solution of Ordinary Differential Equations Using Static Multiprocessor Scheduling Algorithms
KASAHARA HIRONORI; FUJII TOSHIHISA; HONDA HIROKI; NARITA SEINOSUKE
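This paper proposes an efficient parallel processing method for solving explicit ordinary differential equations. The computation required to solve ordinary differential equations by numerical integration consists of many arithmetic assignment statements (scalar assignment statements) with complex data dependences among them, a kind of computation for which efficient parallel processing has been difficult. Using static multiprocessor scheduling algorithms developed by the authors, the proposed method makes it possible to process such computations on an arbitrary number of processors in nearly minimal processing time. The method consists of task generation, optimal scheduling of tasks onto processors, and generation of efficient machine code from the scheduling results, and can handle various task granularities. Its effectiveness and practicality are verified on an experimental multiprocessor in which seven pairs of 8086 and 8087 processors are connected by a bus. The paper also shows, for the first time, that optimal scheduling, whose application to real parallel processing systems had been given up because of the difficulty of algorithm development and other reasons, is practical and actually enables parallel processing on a real multiprocessor system., Information Processing Society of Japan (IPSJ), 15 Oct. 1987, IPSJ Journal, 28, 10, 1060-1070, Japanese, 1882-7764, 110002724341, AN00116647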
URL

Books and other publications

bit別冊:並列プログラミング(第4章)
湯淺太一; 中田登志之; 関口智嗣; 妹尾義樹; 小西弘一; 安部広多; 石川裕; 本多弘樹; 白川友紀; 伊達博; 石塚辰美; 弓場敏嗣
Japanese, 共立出版, May 1998

はじめての並列プログラミング
湯淺太一; 中田登志之; 関口智嗣; 妹尾義樹; 小西弘一; 安部広多; 石川裕; 本多弘樹; 白川友紀; 伊達博; 石塚辰美; 弓場敏嗣
Japanese, Joint work, 1998

新しいOS (第2章)
石田晴久; 土居範久; 徳田英幸; 本多弘樹; 宇田川誠; 櫛木好明; 坂村健; 竹内郁雄; 上田尚純; 内田俊一; 清水謙多郎
Japanese, Joint work, 共立出版, Aug. 1989

Lectures, oral presentations, etc.

HPCアプリケーションの消費電力最適化に向けた性能・消費電力情報の統合手法
大坂隼平; 和田康孝; 近藤正章; 三吉郁夫; 本多弘樹
Oral presentation, Japanese, ハイパフォーマンスコンピューティングと計算科学シンポジウム HPCS2015, Domestic conference
20 May 2015

ヘテロジニアス構成でのGPUコンピューティングのためのワークサイズ自動調整手法の提案
竹本拓未; 和田康孝; 近藤正章; 本多弘樹
Oral presentation, Japanese, 情報処理学会ハイパフォーマンスコンピューティング研究会, International conference
02 Mar. 2015

誘導結合型三次元積層マルチコアプロセッサにおけるキャッシュ間通信手法の検討
松村正隆; 近藤正章; 松谷宏紀; 和田康孝; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会コンピューターシステム研究会, Domestic conference
30 Jan. 2015

次世代3次元実装メモリのメモリネットワーク構成に関する初期検討
佐々木沢; 和田康孝; 近藤正章; 本多弘樹
Oral presentation, Japanese, 情報処理学会第147回ハイパフォーマンスコンピューティング研究会, 情報処理学会, 北海道, Domestic conference
02 Dec. 2014

HPCシステムにおける性能プロファイリングツールの電力測定精度の評価
大坂隼平; 和田康孝; 近藤正章; 本多弘樹
Oral presentation, Japanese, 情報処理学会第147回ハイパフォーマンスコンピューティング研究会, 情報処理学会, Domestic conference
02 Dec. 2014

HPCシステムにおける電力・性能可視化ツールに関する比較検討
和田康孝; カオタン; 近藤正章; 本多弘樹
Public symposium, Japanese, 第21回ハイパフォーマンスコンピューティングとアーキテクチャの評価に関する北海道ワークショップ, 情報処理学会, 北海道大学, Domestic conference
17 Dec. 2013

GPUにおける走行時パワーゲーティング向けスレッドブロック割り当ておよびワープ発行制御手法
松本洋平; 近藤正章; 和田康孝; 本多弘樹
Public symposium, Japanese, 第21回ハイパフォーマンスコンピューティングとアーキテクチャの評価に関する北海道ワークショップ, 情報処理学会, 北海道大学, Domestic conference
16 Dec. 2013

使用コア数最適化とDVFSを用いたＧＰＵの省電力化手法の検討
藤原祐太; 松本洋平; 和田康孝; 近藤正章; 本多弘樹
Public symposium, Japanese, 第21回ハイパフォーマンスコンピューティングとアーキテクチャの評価に関する北海道ワークショップ, 情報処理学会, 北海道大学, Domestic conference
16 Dec. 2013

RAPLインタフェースを用いたHPCシステムの消費電力モデリングと電力評価
カオタン; 和田康孝; 近藤正章; 本多弘樹
Oral presentation, Japanese, 情報処理学会計算機アーキテクチャ研究会
Sep. 2013

複数GPU向けのCUDAコードを生成するOpenMP処理系の提案
長塚郁; 大島聡史; 平澤将一; 近藤正章; 本多弘樹
Oral presentation, Japanese, 情報処理学会研究報告,第133回ハイパフォーマンスコンピューティング研究発表会
Mar. 2012

GridRPC における計算ノードの動的な追加・切替を可能とする枠組
松本優人; 渡邊啓正; 平澤将一; 近藤正章; 本多弘樹
Oral presentation, Japanese, 情報処理学会研究報告,情報処理学会ハイパフォーマンスコンピューティング研究会
Feb. 2010

OMPCUDA:GPU向けOpenMPの実装
大島聡史; 平澤将一; 本多弘樹
Public symposium, Japanese, ハイパフォーマンスコンピューティングと計算科学シンポジウム HPCS2009, 情報処理学会, 東京
Jan. 2009

メッセージ通信型GPGPUプログラミング
大島聡史; 平澤将一; 本多弘樹
Oral presentation, Japanese, 情報処理学会,ハイパフォーマンスコンピューティング研究会
Mar. 2008

コードの性能可搬性を提供するSIMD向け共通記述方式
中西悠; 渡辺啓正; 平澤将一; 本多弘樹
Public symposium, Japanese, SACSIS2007, 情報処理学会, 東京
May 2007

F-Omega：サーバ稼動状況に適応するGridRPCアプリケーションの開発・実行フレームワーク
渡辺啓正; 平澤将一; 本多弘樹
Oral presentation, Japanese, 情報処理学会,ハイパフォーマンスコンピューティング研究会
Mar. 2007

iPat/OMPでのソースコードレベル最適化における試行錯誤支援ツール
永野悠介; 石原誠; 平澤将一; 本多弘樹
Oral presentation, Japanese, 情報処理学会,ハイパフォーマンスコンピューティング研究会
Mar. 2007

アプリケーションの配置を考慮したグリッドポータル構築ツール：GnsPortlets
佐々木耕; 渡辺啓正; 本多弘樹
Public symposium, Japanese, ハイパフォーマンスコンピューティングと計算科学シンポジウムHPCS2007, 情報処理学会, つくば
Jan. 2007

コードの性能可搬性を提供するSIMD向け共通記述方式
中西悠; 渡辺啓正; 本多弘樹
Oral presentation, Japanese, 情報処理学会,計算機アーキテクチャ研究会
Aug. 2006

高性能GridRPCアプリケーションの開発環境
小林孝嗣; 渡邊啓正; 本多弘樹
Public symposium, Japanese, 先進的計算基盤システムシンポジウムSACSIS2006, 情報処理学会, 大阪
May 2006

SMPクラスタ上でのタスク粒度を考慮した階層型粗粒度並列処理
角田昌芳; 田邊浩志; 本多弘樹
Oral presentation, Japanese, 情報処理学会,計算機アーキテクチャ研究会
Jan. 2006

コンパイラ研究の明日--アーキテクチャの進歩とともに
本多弘樹
Oral presentation, Japanese, 情報処理学会,計算機アーキテクチャ研究会
Jan. 2006

S-DSMシステムにおけるページ要求時の受信通知を削減する方式
吉瀬謙二; 田辺浩志; 多忠行; 片桐孝洋; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 先進的計算基盤システムシンポジウム SACSIS2005, 情報処理学会, つくば
May 2005

極端な偏りを利用するBimode++分岐予測器の提案
吉瀬謙二; 片桐孝洋; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会,計算機アーキテクチャ研究会
Jan. 2005

S-DSMシステムの受信通知オーバーヘッドを削減する方式
吉瀬謙二; 田辺浩志; 多忠行; 片桐孝洋; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 電子情報通信学会,コンピュータシステム研究会
Dec. 2004

OpenMPプログラム作成を支援する対話型ツールiPat/OMP
石原誠; 本多弘樹; 佐藤三久
Public symposium, Japanese, コンピュータシステム・シンポジウム, 情報処理学会, 東京
Nov. 2004

ソフトウェア分散共有メモリを用いたマクロデータフロー処理
田辺浩志; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 先進的計算基盤システムシンポジウムSACSIS2004, 情報処理学会, 札幌
May 2004

粗粒度並列化コンパイラCoCoの開発
池田倫久; Ngo Tau Van; 田中雅俊; 福岡岳穂; 片桐孝洋; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会,ハイパフォーマンスコンピューティング研究会
Apr. 2004

PCクラスタ上でのマクロデータフロー処理の評価
田辺浩志; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会,ハイパフォーマンスコンピューティング研究会
Mar. 2004

FIBER:汎用的な自動チューニング機能の付加を支援するソフトウェア構成方式
片桐孝洋; 吉瀬謙二; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会,ハイパフォーマンスコンピューティング研究会
Jun. 2003

マルチクラスタ向けソフトウェア分散共有メモリの提案
吉川克哉; 城田祐介; 吉瀬謙二; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会，研究報告
Mar. 2003

キャッシュラインの時間情報を利用するTime Based Load Filterの提案
檜田敏克; 吉瀬謙二; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会，研究報告
Mar. 2003

ソフトウェア分散共有メモリを用いたマクロデータフロー処理
田邊浩志; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会，研究報告
Mar. 2003

分散メモリシステム上でのOpenMPによるマクロデータフロー処理
深川保; 吉瀬謙二; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会，研究報告
Aug. 2002

分散メモリシステム上でのマクロデータフロー処理のためのデータ到達条件
本多弘樹; 上田哲平; 深川保; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウムJSPP2002論文集
May 2002

Grid計算環境における予定ガント図を用いたジョブスケジューリング
上田清詩; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウムJSPP2002論文集
May 2002

Migratory Access を対象とするホームベース分散共有メモリ
城田祐介; 吉瀬謙二; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウムJSPP2002論文集
May 2002

分散メモリシステム上でのマクロデータフロー処理の実現
上田哲平; 本多弘樹; 吉瀬謙二; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告
Mar. 2002

プログラマの意図により複数のキャッシュコヒーレンスプロトコル利用を可能とするソフトウェア分散共有メモリ
城田祐介; 吉瀬謙二; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告 ARC-144-2
Jul. 2001

大容量FPGAを用いたキャッシュ評価用ハードウェアエミュレータの検討
吉瀬謙二; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウム論文集 JSPP'01, 情報処理学会/電子情報通信学会
Jun. 2001

カットオフの短い相互作用の計算の高速化を目指した粒子シミュレーション用並列計算機DEM-1の提案
高田亮; 吉瀬謙二; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウム論文集 JSPP'01, 情報処理学会/電子情報通信学会
Jun. 2001

IPネットワークにおける待時式帯域予約通信方式の評価
池辺隆; 本多弘樹; 弓場敏嗣; 三木哲也
Oral presentation, Japanese, 電子情報通信学会技術研究報告 IN2001-1
May 2001

エージェントシステムにおけるスケジューリング機構の実装
伊藤元晴; 本多弘樹
Oral presentation, Japanese, 情報処理学会研究報告 2000-DPS-102
Mar. 2001

帯域確保要求を自動的にスケジューリングするルータでの帯域制御方式とその実装
本圖英承; 本多弘樹
Oral presentation, Japanese, 情報処理学会研究報告 2000-DPS-102
Mar. 2001

セキュアなRPCのためのコールバック型コネクション確立方式
平山秀昭; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, インターネットコンファレンス2000（IC2000）
Nov. 2000

グローバルコンピューティングシステムにおけるネットワークGantt図を用いたジョブスケジューリング手法の提案
上田清詩; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告2000-HPC-83-9
Oct. 2000

EM-XとMD Oneを統合化した粒子シミュレーション用並列計算機プロトタイプの構築
高田亮; 清水昭皓; 児玉祐悦; 坂根広史; 佐谷野健二; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告 2000-ARC-139-15
Aug. 2000

OpenMPによる粗粒度タスク並列実行方式
福岡岳穂; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告 2000-HPC-82-12
Aug. 2000

ワールドワイドなインターラクティブシステムのためのHTTPコネクション型RPCの検討
平山秀昭; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告 2000-DSM-18-1
Jul. 2000

可換／結合法則が成立する操作を対象としたログベース更新型分散共有メモリ
平山秀昭; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 電子情報通信学会論文誌
May 2000

ワークステーションクラスタ環境における分散共有記憶方式の高信頼化／高効率化の研究
弓場敏嗣; 本多弘樹; 平山秀昭; 大澤範高
Others, Japanese, 並列・分散処理研究推進機構・成果概要
Mar. 2000

ワークステーションクラスタ環境でのログベース更新型分散共有記憶方式
弓場敏嗣; 本多弘樹; 平山秀昭; 大澤範高
Others, Japanese, 並列分散処理研究推進機構・成果報告（CD-ROM）
Mar. 2000

並列計算機EM-Xの細粒度通信機構を用いた共有メモリベンチマークの実行
坂根広史; 本多弘樹; 弓場敏嗣; 児玉祐悦; 山口喜教
Oral presentation, Japanese, 情報処理学会研究報告，2000-ARC-137
Mar. 2000

OSレベルでのTCP のフロー制御によるQoSの実現
塩川泰広; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 分散システム／インターネット運用技術シンポジウム，情報処理学会
Feb. 2000

並列計算機用要素プロセッサの細粒度同期機構におけるキャッシュ方式の検討
坂根広史; 本多弘樹; 弓場敏嗣; 児玉祐悦; 山口喜教
Oral presentation, Japanese, 情報処理学会研究報告，99-ARC-134-2
Aug. 1999

階層並列構造と演算チェインニング機構を持つ粒子シミュレーション用並列計算機の提案
高田亮; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告，99-ARC-134-22
Aug. 1999

異機種分散システムにおけるLU分解プログラムの最適なデータ分割
高畠志泰; 持田善行; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウム（JSPP'99），情報処理学会シンポジウムシリーズ
Jun. 1999

細粒度並列処理向け要素プロセッサにおけるキャッシュメモリアーキテクチャ
坂根広史; 本多弘樹; 弓場敏嗣; 児玉祐悦; 山口喜教
Public symposium, Japanese, 並列処理シンポジウム（JSPP'99）,情報処理学会シンポジウムシリーズ
Jun. 1999

マルチスレッドアーキテクチャ用データキャッシュ-動的スレッドアソシアティブ方式-の評価
山崎真矢; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告99-ARC-132/99-OS-80/99-HPC-75
Mar. 1999

GPS時刻同期機構を用いた通信遅延測定ツールの開発
原三英; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 分散システム／インターネット運用技術シンポジウム '99
1999

分散インプリサイス計算における負荷の状態近似に基づく適応的なタスク移送方式
Amien Rusdiutomo; 佐藤直人; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 信学技報，CPSY97-177
1999

Doacrossループの並列化手法とその評価
高畠志泰; 本多弘樹; 大澤範高; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウムJSPP '98論文集
1998

分散インプリサイス計算のための双手動スケジューリング方式の予備評価
Amien Rusdiutomo; 佐藤直人; 本多弘樹; 弓場敏嗣
Public symposium, Japanese, 並列処理シンポジウムJSPP '98論文集
1998

細粒度並列計算機EM-Xにおけるキャッシュメモリアーキテクチャ
坂根広史; 児玉祐悦; 小池汎平; 山名早人; 山口喜教; 本多弘樹; 弓場敏嗣
Oral presentation, Japanese, 情報処理学会研究報告98-ARC-130
1998

Courses

MICS実験第二
The University of Electro-Communications

基礎プログラミング及び演習
The University of Electro-Communications

基礎プログラミング及び演習
電気通信大学

並列処理論第一
The University of Electro-Communications

基礎プログラミングおよび演習
The University of Electro-Communications

高性能コンピューティング論Ⅰ
The University of Electro-Communications

高性能コンピューティング論Ⅰ
電気通信大学

高性能コンピューティングⅠ
The University of Electro-Communications

高性能コンピューティングⅠ
電気通信大学

Affiliated academic society

IEEE

電子情報通信学会

情報処理学会

Research Themes

Software distributed shared memory systems with performance scalability for supercluster systems
YUBA Toshitsugu; HONDA Hiroki; KISE Kenji; KATAGIRI Takahiro; TSUKAMOTO Michiharu
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Scientific Research (B), A high-performance software distributed shared memory (DSM) system with performance scalability, named Mocha, was implemented as an experimental prototype to demonstrate the usefulness of supercluster systems. The current version of Mocha runs on a 32-node Linux cluster. The source code has been made publicly available on the web. A scheme for suppressing communication overhead, which is mainly caused by maintaining consistency among shared data located in distributed memories, was proposed and incorporated into Mocha, and its effectiveness was evaluated using several benchmark programs. A tool named S-Cat, for performance debugging of application programs and for constructing an efficient software DSM, was implemented; it provides functions such as visualizing the communication trace among processors during the execution of an application program. S-Cat was used in the efficient implementation of Mocha and in performance tuning of application programs on Mocha. The main research results are as follows: 1. Implementation of a low-communication-overhead software DSM, named Mocha. 2. Comparative study of different software DSMs on the same cluster. 3. Proposal of coarse-grain parallel processing on software DSM. 4. Construction of a visualization tool for software DSM, named S-Cat. 5. Proposal of parallel processing on a graphics processor and an off-the-shelf processor. Some technical issues in realizing high-performance software DSMs for supercluster systems were clarified. A "hierarchical paradigm" for controlling multiple clusters was proposed, but its usefulness could not be verified during this research period., 16300004
2004 - 2006

Study on advanced programming environment using OpenMP for a next generation high performance cluster system
SATO Mitsuhisa; ISHIKAWA Yutaka; MATSUOKA Satoshi; HONDA Hiroki; BOKU Taisuke; TAKAHASHI Daisuke
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, University of Tsukuba, Grant-in-Aid for Scientific Research (A), We have studied the OpenMP programming environment for the next generation 64-bit high-performance clusters, by using software distributed shared memory (SDSM) system to enable OpenMP program to run on the cluster. We have also developed a programming support system for OpenMP, and numerical libraries using OpenMP. 1.We ported the SCore cluster system software to 64-bit processor architectures. We conducted the performance evaluation of SCASH DSM system which runs on SCore. 2.We have designed and implemented a very portable SDSM system, SCASH-MPI which uses MPI as its communication layer. MPI is the most portable communication library supported for many kinds of high-speed communication network, so that this approach provide highly portability It allows the users to make use of wide address space in 64-bit processor. We found that the overhead of this implementation is just 6% comparing to the original SCASH. 3.We have designed a new SDSM system, FDSM, by using the access pattern analysis of applications. The access pattern is detected by a hardware mechanism provided by IA64, and is used for efficient communication. It achieves more performance than SCASH. 4.We have studied the optimization of OpenMP program running a DSM system of heterogeneous clusters. We found that the performance can be improved by the combination of the loop re-partitioning and the page migration. 5.We have designed and implemented the interactive tool, OMP/iPat, to support the programmer for OpenMP program developments. It allows the programmer to develop his OpenMP program interactively using the information from parallelism analysis by the compiler. 6.We have conducted the performance evaluation by using the OpenMP benchmark, SPEC-OMP. 
We have designed and implemented a parallel recursive FFT algorithm using OpenMP for IA-64 shared-memory multiprocessors., 14208026
2002 - 2004

Study on parallelizing compilers with a granularity tuning mechanism
YUBA Toshitsugu; YAMAGUCHI Yoshinori; KISE Kenji; HONDA Hiroki
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Scientific Research (B), We aimed to establish the fundamental technology for parallelizing compilers with a granularity tuning mechanism for efficient parallel processing. Such a compiler generates a parallel object program that executes in the shortest time on a parallel computer by matching the computer's hardware characteristics with the parallel properties of a given application program. The main results are as follows: 1. A new static parallelizing scheme is proposed in which a part of a target sequential program is divided into parallel tasks of optimal granularity, using the LogP model as an abstract parallel machine. Granularity tuning is applied to do loops, do-across loops, and recursive function calls, and evaluation studies are carried out on several different parallel computers. 2. An efficient execution mechanism for coarse-grain parallel processing is proposed for distributed-memory parallel computers. The mechanism is realized by transforming a given sequential program into a coarse-grain task graph with execution start conditions as well as data reaching conditions, as a function of the parallelizing compiler. 3. A parallelizing compiler with a coarse-grain parallel processing function is experimentally constructed. A conventional OpenMP compiler is used to translate the coarse-grain task graph, expressed with OpenMP primitives, into an executable parallel C program. 4. A novel software distributed shared memory (SDSM) is proposed, which reduces memory consistency overhead by reflecting the characteristics of application programs at the middleware level. The SDSM scheme is implemented in parallel machine environments such as a heterogeneous workstation cluster and an SMP-type personal computer cluster., 10480057
1998 - 2001

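Study on an automatic parallelizing compiler for hierarchically partitioned coarse-grain task parallel processing
HONDA Hiroki
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Encouragement of Young Scientists (A), Building on the coarse-grain task parallel processing scheme for Fortran programs that the applicant developed in earlier research, this study investigated how to realize hierarchical parallel processing and achieved its objectives, as summarized below. 1. Study of hierarchical coarse-grain parallel processing: Hierarchical parallel processing assigns coarse-grain tasks to processor clusters and performs fine-grain parallel processing using the processors within each cluster. In fine-grain parallel processing, tasks are assigned to processors statically at compile time by static scheduling, and this compile-time assignment is exploited to insert and optimize synchronization code. In coarse-grain parallel processing, on the other hand, a dynamic scheduling method that applies multiprocessor scheduling algorithms is used to handle the problem that the tasks to be executed are determined only at run time. Consequently, the synchronization information required for fine-grain parallel processing must be loaded into the hardware synchronization mechanism dynamically during program execution. This study devised a method for hierarchically combining coarse-grain parallel processing with fine-grain parallel processing that uses the one-PE, SBM, and RBCQ synchronization mechanisms. 2. Development of an automatic parallelizing compiler that realizes hierarchical parallel processing: An automatic parallelizing compiler implementing the above scheme was built. The compiler generates sequential intermediate code, performs parallelism analysis, and then generates parallel intermediate code. It also generates the synchronization code for fine-grain parallel processing and the dynamic scheduling code for coarse-grain parallel processing, which reduces run-time overhead. Because the compiler is a large piece of software, it was developed by several people using PCs, with equipment purchased under this grant. 3. Development and performance evaluation of a real multiprocessor system: A real multiprocessor system was developed that implements a hardware synchronization mechanism for efficiently realizing hierarchical parallel processing with this compiler, and the usefulness of the scheme was verified on it. 4. Summary of results and conference presentation: Part of the above was written up and published as an academic society paper., 09750405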
1997 - 1998

Study on Granularity Tuning Mechanism in Fine-Grain Parallel Processing
YUBA Toshitsugu; YAMAGUCHI Yoshinori; SATO Naohito; OSAWA Noritaka; HONDA Hiroki
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Scientific Research (B), This research aims to provide a granularity tuning mechanism for obtaining the maximum speed in parallel program execution on a distributed-memory parallel computer. The mechanism must exploit the parallelism in a program and divide the computation into subprograms (threads) of suitable granularity. Performance can be improved by statically allocating suitably sized threads to each processing element of a parallel computer, making efficient use of the hardware's potential. (1) Parallelizing compiler: We proposed a novel granularity tuning mechanism based on the LogP model, an abstract parallel computer model for analyzing the execution time of a parallel program. The mechanism was experimentally installed in a SISAL compiler for the dataflow computer EM-X, and evaluation studies were carried out by executing benchmark programs with do-all loops and do-across loops. (2) Performance debugging for parallel programs: A performance debugging system for parallel programs was developed, which shows the execution process of a parallel program in the form of a Gantt chart. A programmer can find performance bottlenecks by carefully checking the chart and can interactively change the execution sequence, and as a result the parallel granularity, in order to obtain better performance. We also proposed a 3D animation technique based on a dynamical model, which is applied to visualizing the progress of parallel discrete events. A 3D visualization tool was developed, and its use for parallel debugging was investigated., 07458055
1995 - 1997

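Study on an automatic parallelizing compiler for hierarchical parallel processing of coarse-grain tasks
HONDA Hiroki
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, University of Yamanashi, Grant-in-Aid for Encouragement of Young Scientists (A), Building on the coarse-grain task parallel processing scheme for Fortran programs that the applicant developed in earlier research, this study investigated how to realize hierarchical parallel processing. The results are as follows. (1) Study of combining fine-grain and coarse-grain parallel processing hierarchically: Hierarchical parallel processing assigns coarse-grain tasks to processor clusters and performs fine-grain parallel processing using the processors within each cluster. In fine-grain parallel processing, the fact that assignment is done at compile time is exploited to insert synchronization code. In coarse-grain parallel processing, on the other hand, a dynamic scheduling method is used to handle the problem that the tasks to be executed are determined only at run time. Consequently, synchronization information must be loaded into the hardware synchronization mechanism with low overhead during program execution. This study therefore devised the RBCQ synchronization mechanism, which requires little synchronization information. (2) Development of an automatic parallelizing compiler that realizes hierarchical parallel processing: An automatic parallelizing compiler implementing the above scheme was built. The target programming languages are C and High Performance Fortran. The compiler generates sequential intermediate code, performs parallelism analysis, and then generates parallel intermediate code. It also generates the synchronization code for fine-grain parallel processing and the dynamic scheduling code for coarse-grain parallel processing, which reduces run-time overhead. Because the compiler is a large piece of software, it was developed by several people using workstations, with equipment purchased under this grant. (3) Evaluation of effectiveness on a real multiprocessor system: Various application programs were processed in parallel using this scheme on a real multiprocessor system that the applicant had already developed, and the usefulness of hierarchical parallel processing by this compiler was verified. (4) Summary of results and conference presentation: Part of the above was written up and submitted to a conference, where it was accepted for publication: HAYAKAWA Kiyoshi; HONDA Hiroki, "Proposal and performance evaluation of the RBCQ synchronization mechanism and its synchronization scheme," Proceedings of the Joint Symposium on Parallel Processing JSPP '97, 08750425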
1996 - 1996

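Study on an automatic parallelizing compiler for supercomputers
NARITA Seinosuke; AIDA Kento; HONDA Hiroki; KASAHARA Hironori
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, Waseda University, Grant-in-Aid for General Scientific Research (B), Parallel processing of Fortran programs on shared-memory multiprocessor systems has conventionally relied on techniques such as multitasking and microtasking. Multitasking, however, has problems: it is difficult for users to specify parallelism, and the scheduling overhead from OS calls and the like is large. Microtasking is the most widely used loop parallelization technique, but there are still loops that cannot be parallelized because of complex data dependences spanning iterations or conditional branches out of the loop. To address these problems, the investigators proposed the macro-dataflow processing scheme. In this scheme, the compiler divides a program into coarse-grain tasks and automatically extracts the parallelism among them by analyzing the earliest executable condition of each coarse-grain task. Scheduling overhead is kept low by using scheduling routines that the compiler generates specifically for each source program. In addition, when macro-dataflow processing is performed, more efficient parallel processing becomes possible by placing each data item in the local memory of a processor so as to minimize data transfer (data localization). Performance evaluation on the prototype multiprocessor system OSCAR confirmed that macro-dataflow processing achieves effective parallel processing of coarse-grain tasks. Performance evaluation on commercial multiprocessor systems such as the Fujitsu VPP-500, Alliant FX/4, KSR1, and NEC Cenju-3 also showed that macro-dataflow processing can extract more parallelism than the conventional multitasking and microtasking techniques. These evaluations further confirmed that processing can be performed with lower overhead than the conventional techniques and that program execution speed improves., 05452354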
1993 - 1995

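Study on an automatic parallelizing compiler for realizing a hierarchical parallel processing method
HONDA Hiroki
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, University of Yamanashi, Grant-in-Aid for Encouragement of Young Scientists (A), Building on the coarse-grain task parallel processing scheme for FORTRAN programs that the applicant developed in earlier research, this study investigated how to realize hierarchical parallel processing across the entire program. The results are as follows. (1) Study of combining fine-grain and coarse-grain parallel processing hierarchically: Hierarchical parallel processing assigns coarse-grain tasks to processor clusters and performs fine-grain parallel processing using the processors within each cluster. In fine-grain parallel processing, tasks are assigned to processors statically at compile time by static scheduling, and this compile-time assignment is exploited to insert and optimize synchronization code. In coarse-grain parallel processing, on the other hand, a dynamic scheduling method that applies multiprocessor scheduling algorithms is used to handle the problem that the tasks to be executed are determined only at run time. Consequently, the synchronization information required for fine-grain parallel processing must be loaded into the hardware synchronization mechanism dynamically during program execution. This study devised a method for hierarchically combining coarse-grain parallel processing with fine-grain parallel processing that uses the one-PE and SBM synchronization mechanisms. (2) Development of an automatic parallelizing compiler that realizes hierarchical parallel processing: An automatic parallelizing compiler implementing the above scheme was built. The target programming languages are C and High Performance Fortran. The compiler generates sequential intermediate code, performs parallelism analysis, and then generates parallel intermediate code. It also generates the synchronization code for fine-grain parallel processing and the dynamic scheduling code for coarse-grain parallel processing, which reduces run-time overhead. Because the compiler is a large piece of software, it was developed by several people using workstations, with equipment purchased under this grant. (3) Evaluation of effectiveness on a real multiprocessor system: Various application programs were processed in parallel using this scheme on a real multiprocessor system that the applicant had already developed, and the effectiveness of hierarchical parallel processing by this compiler was verified. (4) Summary of results and conference presentation: The above results were compiled and presented at an academic conference., 06750378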
1994 - 1994

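Study on a hierarchical automatic parallel processing scheme for multiprocessor systems
HONDA Hiroki
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, University of Yamanashi, Grant-in-Aid for Encouragement of Young Scientists (A), This study concerned an automatic parallelizing compiler that realizes hierarchical parallel processing across the entire program by hierarchically combining the fine-grain and coarse-grain task parallel processing schemes developed in earlier basic research. The results are as follows. (1) Devising a scheme for combining fine-grain and coarse-grain parallel processing hierarchically: Hierarchical parallel processing assigns coarse-grain tasks to processor clusters and performs fine-grain parallel processing using the processors within each cluster. In fine-grain parallel processing, tasks are assigned to processors statically at compile time by static scheduling, and this compile-time assignment is exploited to insert and optimize synchronization code. In coarse-grain parallel processing, on the other hand, a dynamic scheduling method that applies multiprocessor scheduling algorithms is used to handle the problem that the tasks to be executed are determined only at run time. Consequently, the synchronization information required for fine-grain parallel processing must be loaded into the hardware synchronization mechanism dynamically during program execution. This study devised a synchronization scheme for fine-grain parallel processing that uses the SBM synchronization mechanism. (2) Development of an automatic parallelizing compiler that realizes hierarchical parallel processing: An automatic parallelizing compiler implementing the above scheme was built. The target programming languages are C and High Performance Fortran. The compiler generates sequential intermediate code, performs parallelism analysis, and then generates parallel intermediate code. It also generates the synchronization code for fine-grain parallel processing and the dynamic scheduling code for coarse-grain parallel processing, which reduces run-time overhead. Because the compiler is a large piece of software, it was developed by several people using workstations, with equipment purchased under this grant. (3) Evaluation of effectiveness on a real multiprocessor system: Various application programs were processed in parallel using this scheme on a real multiprocessor system that the applicant had already developed, and the usefulness of parallel processing by this compiler was verified. (4) Summary of results and conference presentation: The above results were compiled and presented at an academic conference. The travel expenses for this were paid from this grant., 05750336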
1993 - 1993

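Study on an automatic parallel processing scheme on multiprocessor systems
HONDA Hiroki
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, University of Yamanashi, Grant-in-Aid for Encouragement of Young Scientists (A), 04750316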
1992 - 1992