Search Details｜The University of Electro-Communications

Name
Shinobu MIWA

Affiliation, Position
Department of Computer and Network Engineering, Associate Professor
Cluster I (Informatics and Computer Engineering), Associate Professor

Researcher Information

Degree

Doctor of Informatics, Kyoto University

Research Keyword

high performance computing

parallel processing

computer architecture

Field Of Study

Informatics, High-performance computing

Informatics, Computer systems

Career

Aug. 2022 - Present
RIKEN, R-CCS, Visiting Researcher

Apr. 2016 - Present
The University of Electro-Communications, Graduate School of Informatics and Engineering, Associate Professor

Apr. 2023 - Mar. 2024
University of Maryland, Visiting Associate Professor, United States

Mar. 2023 - Mar. 2024
Georgetown University, Visiting Scientist, United States

Apr. 2017 - Mar. 2018
Lawrence Livermore National Laboratory, Visiting Scientists and Professionals

Feb. 2017 - Sep. 2017
The University of Tokyo, Graduate School of Information Science and Technology, Visiting Researcher

01 Mar. 2015 - 31 Mar. 2016
The University of Electro-Communications, Graduate School of Information Systems, Associate Professor

01 Apr. 2011 - 28 Feb. 2015
The University of Tokyo, Graduate School of Information Science and Technology, Assistant Professor

01 Jan. 2008 - 31 Mar. 2011
Tokyo University of Agriculture and Technology, Graduate School of Engineering, Project Assistant Professor

01 Apr. 2005 - 31 Dec. 2007
Kyoto University, Graduate School of Law, Research Associate

Educational Background

01 Apr. 2002 - 31 Mar. 2005
Kyoto University, Graduate School of Informatics, Department of Communication and Computer Engineering

01 Apr. 2000 - 25 Mar. 2002
Kyoto University, Graduate School of Informatics, Department of Communication and Computer Engineering

Apr. 1996 - Mar. 2000
Kyoto University, Faculty of Engineering, Informatics and Mathematical Science

01 Mar. 1996
Iwata Minami High School in Shizuoka

Member History

Oct. 2020 - Mar. 2023
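Review committee member, Grants-in-Aid for Scientific Research (KAKENHI), Early-Career Scientists (high-performance computing field), Government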

2020 - Mar. 2021
Associate Editor-in-Chief, Editorial committee of IPSJ Transactions on Advanced Computing Systems (ACS), Society

2015 - 2019
Editorial committee member, IEICE Transactions on Information and Systems, Society

2014 - 2017
Organization committee member, IPSJ SIG-ARC, Society

Apr. 2016
NEDO evaluation committee member, Government

2015 - 2016
Representative member, IPSJ, Society

2014 - 2015
Chief organizer, Editorial committee of IPSJ magazine (system working group), Society

2011 - 2015
Editorial committee member, IPSJ Transactions on Advanced Computing Systems, Society

2013 - 2014
Organizer, Editorial committee of IPSJ magazine (system working group), Society

2010 - 2013
Editorial committee member, Editorial committee of IPSJ magazine (hardware working group), Society

2009 - 2013
Editorial committee member, Editorial committee of IPSJ journal, Society

2011 - 2011
Writing committee member, Writing committee of "Knowledge-base" of IEICE, Society

2010 - 2010
Selection member, Best paper award selection working group of IPSJ, Society

2009 - 2009
Selection member, Best paper award selection working group of IPSJ, Society

Research Activity Information

Award

2019
Functionally-Predefined Kernel: a Way to Reduce CNN Computation
Best paper award for computers track in the 2019 IEEE PacRim, Y. Inouchi, H. Yamaki, S. Miwa, and T. Tsumura
International society

2010
Dalvik Accelerator: Mechanisms for High Performance Computing of Java Applications on Android Devices
Best Paper Award in Embedded System Symposium 2010, Atushi Ohta, Shinobu Miwa, Hironori Nakajo
Japan society

2010
Parallelizing Hilbert-Huang Transform and its Implementation on GPU
2nd Place in Free Programming Track of GPU Challenge 2010, Pulung Waskito, Shinobu Miwa, Yasue Mitsukura, Hironori Nakajo
Japan society

2008
Implementation of Multi SMT Processor on FPGA
Best Poster Award in SACSIS 2008, Yoshiyasu Ogasawara, Ippei Tate, Shinobu Miwa, Hironori Nakajo
Japan society

2003
Learning of Mobile Robot Navigation Tasks in Recurrent Neural Networks
Student Encouragement Award of Kansai-Section Convention of IPSJ 2003, Shinobu Miwa
Japan society

Paper

CACTI-CNFET: an Analytical Tool for Timing, Power, and Area of SRAMs with Carbon Nanotube Field Effect Transistors.
Shinobu Miwa; Eiichiro Sekikawa; Tongxin Yang; Ryota Shioya; Hayato Yamaki; Hiroki Honda
ASP-DAC, 1350-1356, 2025
International conference proceedings

Evaluating MPI Performance on SGX and Gramine
K. Shimojima, S. Miwa, H. Yamaki, and H. Honda
2024 IEEE International Conference on Cluster Computing (CLUSTER) (poster presentation), Sep. 2024

Post-Route Power Estimation: a Case Study of RIKEN-CGRA
C. Shi; B. Adhi; S. Miwa; K. Sano
2024 IEEE International Conference on Cluster Computing (CLUSTER) (poster presentation), Sep. 2024, Peer-reviewed

Power-Efficiency Variation on A64FX Supercomputers and its Application to System Operation
T. Kusaba; Y. Awaki; K. Yoshida; S. Miwa; H. Yamaki; T. Hanawa; H. Honda
2024 IEEE International Conference on Cluster Computing Workshop (CLUSTER Workshop), Sep. 2024, Peer-reviewed

Analyzing the impact of CUDA versions on GPU applications
Kohei Yoshida; Shinobu Miwa; Hayato Yamaki; Hiroki Honda
Parallel Computing, Elsevier BV, 120, 103081-103081, Jun. 2024, Peer-reviewed
Scientific journal

CNFET-OCL: Open-Source Cell Libraries for Advanced CNFET Technologies.
Chenlin Shi; Shinobu Miwa; Tongxin Yang; Ryota Shioya; Hayato Yamaki; Hiroki Honda
IEEE Access, 12, 165335-165347, 2024
Scientific journal

Analyzing the Performance Impact of HPC Workloads with Gramine+SGX on 3rd Generation Xeon Scalable Processors
Shinobu Miwa; Shin'Ichiro Matsuo
Lead, Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, ACM, 1850-1858, 12 Nov. 2023, Peer-reviewed
International conference proceedings

CNFET7: An Open Source Cell Library for 7-nm CNFET Technology
C. Shi; S. Miwa; T. Yang; R. Shioya; H. Yamaki; H. Honda
The 28th Asia and South Pacific Design Automation Conference (ASP-DAC), 763-768, Jan. 2023, Peer-reviewed
International conference proceedings, English

Analyzing Performance and Power-Efficiency Variations among NVIDIA GPUs
K. Yoshida; R. Sageyama; S. Miwa; H. Yamaki; H. Honda
The 51st International Conference on Parallel Processing (ICPP), 65, 1-12, Nov. 2022, Peer-reviewed
International conference proceedings, English

PredCom: A Predictive Approach to Collecting Approximated Communication Traces
Shinobu Miwa; Ignacio Laguna; Martin Schulz
IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers (IEEE), 32, 1, 45-58, 01 Jan. 2021, Peer-reviewed
Scientific journal, English

Footprint-Based DIMM Hotplug
Shinobu Miwa; Masaya Ishihara; Hayato Yamaki; Hiroki Honda; Martin Schulz
IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers (IEEE), 69, 2, 172-184, 01 Feb. 2020, Peer-reviewed
Scientific journal, English

RPC: An Approach for Reducing Compulsory Misses in Packet Processing Cache.
Hayato Yamaki; Hiroaki Nishi; Shinobu Miwa; Hiroki Honda
IEICE Trans. Inf. Syst., 103-D, 12, 2590-2599, 2020
Scientific journal

Evaluating Architecture-Level Optimization in Packet Processing Caches
K. Tanaka; H. Yamaki; S. Miwa; H. Honda
Computer Networks, Elsevier, 181, 107550, 1-10, 2020, Peer-reviewed
Scientific journal, English

Multi-Level Packet Processing Caches
K. Tanaka; H. Yamaki; S. Miwa; H. Honda
The 2019 IEEE Symposium on Low-Power and High-Speed Chips and Systems (COOL Chips 22), 1-3, 2019, Peer-reviewed
International conference proceedings, English

Functionally-Predefined Kernel: a Way to Reduce CNN Computation
Y. Inouchi; H. Yamaki; S. Miwa; T. Tsumura
The 2019 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim 2019), 1-6, 2019, Peer-reviewed
International conference proceedings, English

Evaluating the Impact of Energy Efficient Networks on HPC Workloads
G. Georgakoudis; N. Jain; T. Ono; K. Inoue; S. Miwa; A. Bhatele
26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), IEEE, (to appear), 1-10, 2019, Peer-reviewed
International conference proceedings, English

Power management framework for post-petascale supercomputers
Masaaki Kondo; Ikuo Miyoshi; Koji Inoue; Shinobu Miwa
Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project, Springer Singapore, 249-269, 06 Dec. 2018, Power consumption is a first class design constraint for developing future exascale computing systems. To achieve exascale system performance with realistic power provisioning of 20-30MW, we need to improve power-performance efficiency significantly compared to today's supercomputer systems. In order to maximize effective performance within a power constraint, investigating how to optimize power resource allocation to each hardware component or each job submitted to the system is necessary. We have been conducting research and development on a software framework for code optimization and system power management for the power-constraint adaptive systems. We briefly introduce the research efforts for maximizing application performance under a given power constraint, power-aware resource manager, and power-performance simulation and analysis framework for future supercomputer systems.
In book, English

Run-Time DFS/DCT Optimization for Power-Constrained HPC Systems
I. Miyoshi; S. Miwa; K. Inoue; M. Kondo
The International Conference on High Performance Computing in Asia-Pacific Region, poster, 2018, Peer-reviewed
International conference proceedings, English

Data Prediction for Response Flows in Packet Processing Cache
H. Yamaki; H. Nishi; S. Miwa; H. Honda
2018 55th ACM/EDAC/IEEE Design Automation Conference, ACM, 110, 110-6, 2018, Peer-reviewed
International conference proceedings, English

Optimizing Memory Hierarchy within an Internet Router for High-Throughput and Energy-Efficient Packet Processing
K. Tanaka; H. Yamaki; S. Miwa; H. Honda
ACM Student Research Competition (in conjunction with the 51st Annual ACM/IEEE International Symposium on Microarchitecture) (poster presentation), poster, 2018, Peer-reviewed
International conference proceedings, English

A Runtime Optimization Selection Framework to Realize Energy Efficient Networks-on-Chip.
Yuan He; Masaaki Kondo; Takashi Nakada; Hiroshi Sasaki; Shinobu Miwa; Hiroshi Nakamura
IEICE Transactions, The Institute of Electronics, Information and Communication Engineers, 99-D, 12, 2881-2890, 2016, Peer-reviewed,
Networks-on-Chip (or NoCs, for short) play important roles in modern and future multi-core processors as they are highly related to both performance and power consumption of the entire chip. Up to date, many optimization techniques have been developed to improve NoC's bandwidth, latency and power consumption. But a clear answer to how energy efficiency is affected with these optimization techniques is yet to be found since each of these optimization techniques comes with its own benefits and overheads while there are also too many of them. Thus, here comes the problem of when and how such optimization techniques should be applied. In order to solve this problem, we build a runtime framework to throttle these optimization techniques based on concise performance and energy models. With the help of this framework, we can successfully establish adaptive selections over multiple optimization techniques to further improve performance or energy efficiency of the network at runtime.

English

Initial Study of Reconfigurable Neural Network Accelerators
Momoka Ohba; Satoshi Shindo; Shinobu Miwa; Tomoaki Tsumura; Hayato Yamaki; Hiroki Honda
2016 FOURTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING (CANDAR), IEEE, poster, 707-709, 2016, Peer-reviewed, Neural Networks or NNs are widely used for many machine learning applications such as image processing and speech recognition. Since general-purpose processors such as CPUs and GPUs are energy inefficient for computing NNs, application-specific hardware accelerators for NNs (a.k.a. Neural Network Accelerators or NNAs) have been proposed to improve the energy efficiency. However, the existing NNAs are too customized for computing specific NNs, and do not allow to change neuron models or learning algorithms. This limitation prevents machine-learning researchers from exploiting NNAs, so we are developing a general-purpose NNA including reconfigurable logic, which is called a reconfigurable NNA or RNNA. The RNNA is highly tuned for the NN computation but allows end users to customize the hardware to compute desired NNs. This paper introduces the RNNA architecture, and reports the performance analysis of the RNNA with an in-house cycle-level simulator.
International conference proceedings, English

Evaluation of Task Mapping on Multicore Neural Network Accelerators
Satoshi Shindo; Momoka Ohba; Tomoaki Tsumura; Shinobu Miwa
2016 FOURTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING (CANDAR), IEEE, 415-421, 2016, Peer-reviewed, Deep neural networks are widely used for many applications such as image classification, speech recognition and natural language processing because of their high recognition rate. Since general-purpose processors such as CPUs and GPUs are not energy efficient for such neural networks, application-specific hardware accelerators for neural networks (a.k.a. neural network accelerators or NNAs) have been proposed to improve the energy efficiency. There are many studies to increase the energy efficiency of NNAs, but few studies focus on task allocation on the accelerators. This paper provides the first exploration of task mapping to cores within NNAs for the increased performance. Intuitively, a well-tuned task mapping has less amount of communication between cores. To confirm this assumption, we tested two types of task mappings that generate different amount of communication between cores on an NNA. Our experimental results show that the number of communication between cores strongly affects the execution cycle of the NNA and the most effective task mapping differs depending on the size of neural networks.
International conference proceedings, English

Subarray Level Power-Gating in STT-MRAM Caches to Mitigate Energy Impact of Peripheral Circuits
E. Arima; S. Miwa; T. Nakada; S. Takeda; H. Noguchi; S. Fujita; H. Nakamura
2015 52nd ACM/EDAC/IEEE Design Automation Conference, poster, 2015, Peer-reviewed
International conference proceedings, English

Runtime Multi-Optimizations for Energy Efficient On-chip Interconnections
Yuan He; Masaaki Kondo; Takashi Nakada; Hiroshi Sasaki; Shinobu Miwa; Hiroshi Nakamura
2015 33RD IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), IEEE, 455-458, 2015, Peer-reviewed, On-chip interconnection (or NoC) is a major performance and power contributor to modern and future multicore processors. So far, many optimization techniques have been developed to improve its bandwidth, latency and power consumption. But it is not clear how energy efficiency is affected since an optimization technique normally comes with overheads. This paper thus attempts to address when and how such optimization techniques should be applied and tuned to help achieve better energy efficiency. We firstly model the performance and energy impacts of representative NoC optimization techniques. These models help us more easily understand the consequences when applying these optimization techniques and their combinations under different circumstances. Moreover, based on such modeling, we propose and implement an adaptive control over these NoC optimization techniques to improve both performance and energy efficiency of the network. Our results show that, this proposal can achieve an average improvement of 26% and 57% on network performance and energy delay product, respectively.
International conference proceedings, English

Immediate Sleep: Reducing Energy Impact of Peripheral Circuits in STT-MRAM Caches
Eishi Arima; Hiroki Noguchi; Takashi Nakada; Shinobu Miwa; Susumu Takeda; Shinobu Fujita; Hiroshi Nakamura
2015 33RD IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), IEEE, 149-156, 2015, Peer-reviewed, Implementing last level caches (LLCs) with STT-MRAM is a promising approach for designing energy efficient microprocessors due to high density and low leakage power of its memory cells. However, peripheral circuits of an STT-MRAM cache still suffer from leakage power because large and leaky transistors are required to drive large write current to STT-MRAM element. To overcome this problem, we propose a new power management scheme called Immediate Sleep (IS). IS immediately turns off a subarray of an STT-MRAM cache if the next access is predicted to be not critical in performance. Thus, IS can effectively reduce leakage energy with little impact on performance. Our experimental results show that our technique can save the leakage energy of an STT-MRAM LLC by 32% compared to an STT-MRAM LLC with the conventional scheme at the same performance.
International conference proceedings, English

Memory Hotplug for Energy Savings of HPC systems
S. Miwa; H. Honda
The International Conference for High Performance Computing, Networking, Storage and Analysis, poster, 2015, Peer-reviewed
International conference proceedings, English

Profile-Based Power Shifting in Interconnection Networks with On/Off Links
Shinobu Miwa; Hiroshi Nakamura
PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, ASSOC COMPUTING MACHINERY, 37:1-37:11, 2015, Peer-reviewed, Overprovisioning hardware devices and coordinating their power budgets are proposed to improve the application performance of future power-constrained HPC systems. This coordination process is called power shifting. Meanwhile, recent studies have revealed that on/off links can save network power in HPC systems. Future HPC systems will thus adopt on/off links in addition to power shifting. This paper explores power shifting in interconnection networks with on/off links. Given that on/off links keep network power low at application runtime, we can transfer appreciable quantities of power budgets on networks to other devices before an application runs. We thus propose a profile-based power shifting technique that allows HPC users to transfer the power budget remaining on networks to other devices at the time of job dispatch. Experimental results show that the proposed technique appreciably improves application performance under various power constraints.
International conference proceedings, English

Low-power cache memory with state-of-the-art STT-MRAM for high-performance processors
Susumu Takeda; Hiroki Noguchi; Kumiko Nomura; Shinobu Fujita; Shinobu Miwa; Eishi Arima; Takashi Nakada; Hiroshi Nakamura
2015 INTERNATIONAL SOC DESIGN CONFERENCE (ISOCC), IEEE, 153-154, 2015, Peer-reviewed, This paper describes state-of-the-art STT-MRAM, which can drastically save energy consumption dissipated in cache memory system compared with conventional SRAM-based ones. This paper also presents how to build cache memory hierarchy with both the state-of-art STT-MRAM and SRAM to reduce cache energy consumption. The key point is "break-even-time aware memory design" based on normally-off operation. For further power reduction, an intelligent power management technique for the STT-MRAM-based cache is also discussed.
International conference proceedings, English

Area-Efficient Microarchitecture for Reinforcement of Turbo Mode
Shinobu Miwa; Takara Inoue; Hiroshi Nakamura
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG, E97D, 5, 1196-1210, May 2014, Peer-reviewed, Turbo mode, which accelerates many applications without major change of existing systems, is widely used in commercial processors. Since time duration or powerfulness of turbo mode depends on peak temperature of a processor chip, reducing the peak temperature can reinforce turbo mode. This paper presents that adding small amount of hardware allows microprocessors to reduce the peak temperature drastically and then to reinforce turbo mode successfully. Our approach is to find out a few small units that become heat sources in a processor and to appropriately duplicate them for reduction of their power density. By duplicating the limited units and using the copies evenly, the processor can show significant performance improvement while achieving area-efficiency. The experimental result shows that the proposed method achieves up to 14.5% of performance improvement in exchange for 2.8% of area increase.
Scientific journal, English

Performance estimation of high performance computing systems with Energy Efficient Ethernet technology.
Shinobu Miwa; Sho Aita; Hiroshi Nakamura
Computer Science - R&D, 29, 3-4, 161-169, 2014, Peer-reviewed

Power/performance Evaluation of EEE on real HPC environment
Shinobu Miwa; Sho Aita; Yuichiro Ajima; Toshiyuki Shimizu; Akira Asato; Hiroshi Nakamura
IPSJ Transactions on Advanced Computing Systems, Information Processing Society of Japan, 7, 4, 67-83, 2014, Peer-reviewed
Japanese

Evaluation of Core Hopping on POWER7
Shinobu Miwa; Charles R. Lefurgy
ACM SIGMETRICS Performance Evaluation Review, ACM, special issue, greenmetrics 2014, 11-16, 2014, Peer-reviewed
Scientific journal, English

Analysis of Performance Requirement of STT-MRAM Last Level Caches Considering Low CPU Load
Eishi Arima; Toshiya Komoda; Takashi Nakada; Shinobu Miwa; Hiroki Noguchi; Kumiko Nomura; Keiko Abe; Shinobu Fujita; Hiroshi Nakamura
IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, J97-A, 10, 629-647, 2014, Peer-reviewed
Japanese

Design Aid of Multi-core Embedded Systems with Energy Model
T. Nakada; K. Okamoto; T. Komoda; S. Miwa; Y. Sato; H. Ueki; M. Hayashikoshi; T. Shimizu; H. Nakamura
IPSJ Transactions on Advanced Computing Systems, Information Processing Society of Japan, 7, 3, 37-46, 2014, Peer-reviewed
Scientific journal, Japanese

Data-aware Power Management for Periodic Real-time Systems with Non-Volatile Memory
Takashi Nakada; Takuya Shigematsu; Toshiya Komoda; Shinobu Miwa; Hiroshi Nakamura; Yohei Sato; Hiroshi Ueki; Masanori Hayashikoshi; Toru Shimizu
2014 IEEE NON-VOLATILE MEMORY SYSTEMS AND APPLICATIONS SYMPOSIUM (NVMSA), IEEE, 1-6, 2014, Peer-reviewed, In real-time systems, power gating is widely adopted by processing cores but not working memory because of data loss. Meanwhile, new non-volatile memory technology, which is comparable to volatile memory, quickly emerges. Thus, in this paper, we propose data-aware power management for real-time systems with both volatile and non-volatile memories. Considering the trade-off between data migration energy overhead and energy reduction through power gating, we minimize energy consumption when the system is idle by appropriately selecting sleep modes and making decisions on data migration. Experimental results show that this approach can reduce energy consumption by up to 20%.
International conference proceedings, English

Fine-Grain Power-Gating on STT-MRAM Peripheral Circuits with Locality-aware Access Control
E. Arima; T. Nakada; S. Miwa; S. Takeda; H. Noguchi; S. Fujita; H. Nakamura
The Memory Forum, 1-5, 2014, Peer-reviewed
International conference proceedings, English

Normally-Off Computing Project : Challenges and Opportunities
Hiroshi Nakamura; Takashi Nakada; Shinobu Miwa
2014 19TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC), IEEE, special session 1S, 1, 1-5, 2014, Peer-reviewed, Normally-Off is a way of computing which aggressively powers off components of computer systems when they need not to operate. Simple power gating cannot fully take the chances of power reduction because volatile memories lose data when power is turned off. Recently, new non-volatile memories (NVMs) have appeared. High attention has been paid to normally-off computing using these NVMs. In this paper, its expectation and challenges are addressed with a brief introduction of our project started in 2011.
International conference proceedings, English

Power capping of CPU-GPU heterogeneous systems through coordinating DVFS and task mapping.
Toshiya Komoda; Shingo Hayashi; Takashi Nakada; Shinobu Miwa; Hiroshi Nakamura
2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 6-9, 2013, IEEE Computer Society, 349-356, 2013, Peer-reviewed

Lost Data Prefetch for Reduction of Performance Penalty Caused by Cache Power-off
Eishi Arima; Toshiya Komoda; Takashi Nakada; Shinobu Miwa; Hiroshi Nakamura
IPSJ Transactions on Advanced Computing Systems, Information Processing Society of Japan, 6, 3, 118-130, 2013, Peer-reviewed
Scientific journal, Japanese

Integrating Multi-GPU Execution in an OpenACC Compiler
Toshiya Komoda; Shinobu Miwa; Hiroshi Nakamura; Naoya Maruyama
2013 42ND ANNUAL INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP), IEEE, 260-269, 2013, Peer-reviewed, GPUs have become promising computing devices in current and future computer systems due to its high performance, high energy efficiency, and low price. However, lack of high level GPU programming models hinders the wide spread of GPU applications. To resolve this issue, OpenACC is developed as the first industry standard of a directive-based GPU programming model and several implementations are now available. Although early evaluations of the OpenACC systems showed significant performance improvement with modest programming efforts, they also revealed the limitations of the systems. One of the biggest limitations is that the current OpenACC compilers do not automate the utilization of multiple GPUs.
In this paper, we present an OpenACC compiler with the capability to execute single GPU OpenACC programs on multiple GPUs. By orchestrating the compiler and the runtime system, the proposed system can efficiently manage the necessary data movements among multiple GPUs memories. To enable advanced communication optimizations in the proposed system, we propose a small set of directives as extensions of OpenACC API. The directives allow programmers to express the patterns of memory accesses in the parallel loops to be offloaded. Inserting a few directives into an OpenACC program can reduce a large amount of unnecessary data movements and thus helps the proposed system drawing great performance from multi-GPU systems. We implemented and evaluated the prototype system on top of CUDA with three data parallel applications. The proposed system achieves up to 6.75x of the performance compared to OpenMP in the 1CPU with 2GPU machine, and up to 2.95x of the performance compared to OpenMP in the 2CPU with 3GPU machine. In addition, in two of the three applications, the multi-GPU OpenACC compiler outperforms the single GPU system where hand-written CUDA programs run.
International conference proceedings, English

Performance Modeling for Designing NoC-based Multiprocessors
Takashi Nakada; Shinobu Miwa; Keisuke Yano; Hiroshi Nakamura
RAPID SYSTEM PROTOTYPING: SHORTENING THE PATH FROM SPECIFICATION TO PROTOTYPE (RSP 2013), IEEE, 30-36, 2013, Peer-reviewed, Network-on-Chip (NoC) based multiprocessors have become popular as a scalable alternative to classical bus architectures. The performance evaluation of NoC-based multiprocessors is largely based on simulation. However, precise simulation is extremely slow. Additionally, there are many design parameters that affect the total performance. Therefore, it is practically impossible to use the precise simulation for the design space exploration purposes. To alleviate this problem, prototyping NoC systems and estimating their performances are critically important. In this paper, we present a generalized novel performance model that combined with the simulations for designing NoC-based multiprocessors. We revealed that the performance impact of cache and network latencies are dominant. Moreover, network congestion rarely happens under near appropriate configuration. Thus, the performance model is mainly constructed using the hardware parameters and the statistics that obtained from a simple cache simulation that is separated from the network behavior. The proposed performance model is used not only to obtain fast and accurate performance, but also to guide the NoC-based multiprocessor design space exploration. The accuracy of our approach and its practical use are illustrated through simulation. The results showed that proposed model can estimate performance with only 3.4% error on average and 21% at worst. We also confirmed that our evaluation framework can estimate 360 times faster than the brute force full system simulation.
International conference proceedings, English

McRouter: Multicast within a Router for High Performance Network-on-Chips
Yuan He; Hiroshi Sasaki; Shinobu Miwa; Hiroshi Nakamura
2013 22ND INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT), IEEE, 319-329, 2013, Peer-reviewed, The inevitable advent of the multi-core era has driven an increasing demand for low latency on-chip interconnection networks (or NoCs). Being a critical part of the memory hierarchy for modern chip multi-processors (CMPs), these networks face stringent design constraints to provide fast communication with tight power budget. Modern NoC's first-order concern is clearly its latency, while we also find that internal bandwidth of its routers is relatively plentiful; thus, we present a low latency router design utilizing a technique we call "multicast within a router" or McRouter, which allows productive utilization of remaining bandwidth inside a NoC router. McRouter allows a single cycle transfer of flits which shortens the communication latency when there is enough remaining bandwidth within the router. The key idea is to transmit a header flit to all possible output ports (multicast) so that it is always transmitted to the correct output port without relying on route computation. In addition, we find it is affordable with marginal power overhead while still being a stand-alone design by maintaining portability and modularity (unlike look-ahead routing based designs). Our evaluation with application traffic shows that McRouter helps achieving system speed-ups of 1.28, 1.17 and 1.05 over the conventional router (CR), the VSA router (VSAR) and the prediction router (PR), respectively.
International conference proceedings, English

D-MRAM Cache: Enhancing Energy Efficiency with 3T-1MTJ DRAM / MRAM Hybrid Memory
Hiroki Noguchi; Kumiko Nomura; Keiko Abe; Shinobu Fujita; Eishi Arima; Kyundong Kim; Takashi Nakada; Shinobu Miwa; Hiroshi Nakamura
DESIGN, AUTOMATION & TEST IN EUROPE, ASSOC COMPUTING MACHINERY, 1813-1818, 2013, Peer-reviewed, This paper describes a proposal of non-volatile cache architecture utilizing novel DRAM / MRAM cell-level hybrid structured memory (D-MRAM) that enables effective power reduction for high performance mobile SoCs without area overhead. Here, the key point to reduce active power is intermittent refresh process for the DRAM-mode. D-MRAM has advantage to reduce static power consumptions compared to the conventional SRAM, because there are no static leakage paths in the D-MRAM cell and it is not needed to supply voltage to its cells when used as the MRAM-mode. Besides, with advanced perpendicular magnetic tunnel junctions (p-MTJ), which decreases the write energy and latency without shortening its retention time, D-MRAM is capable of power reduction by replacing the traditional SRAM caches. Considering the 65-nm CMOS technology, the access latencies of 1MB memory macro are 2.2 ns / 1.5 ns for read / write in DRAM mode, and 2.2 ns / 4.5 ns in MRAM mode, while those of SRAM are 1.17 ns. The SPEC CPU2006 benchmarks have revealed that the energy per instruction (EPI) of the total cache memory can be dramatically reduced by 71 % on average, and the instruction per cycle (IPC) performance of the D-MRAM cache architecture degraded only by approximately 4 % on average in spite of its latency overhead.
International conference proceedings, English

Predict-more router: A low latency NoC router with more route predictions
Yuan He; Hiroshi Sasaki; Shinobu Miwa; Hiroshi Nakamura
Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013, IEEE Computer Society, 842-850, 2013, Peer-reviewed, Network-on-Chip (NoC) is a critical part of the memory hierarchy of emerging multicores. Lowering its communication latency while preserving its bandwidth is key to achieving high system performance. To date, one of the most effective methods for achieving this goal is the prediction router (PR). PR works by predicting the route an incoming packet may be transferred to; it speculatively allocates resources (virtual channels and the switch crossbar) to the packet and traverses the packet's flits using this predicted route in a single cycle without waiting for route computation; however, if the prediction misses, the packet is then processed in the conventional pipeline (in our work, four cycles) and the speculatively allocated router resources are wasted. Obviously, prediction accuracy affects the number of successful predictions, latency reduction and bandwidth consumption. We find that predictions hit around 65% of the time for most applications even under the best algorithm, so in such cases PR can accelerate at most about 65% of the packets while the remaining 35% consume extra router resources and bandwidth. To increase the prediction accuracy, we propose a technique that applies multiple prediction algorithms at the same time to one incoming packet. Such a prediction is more accurate. With this proposal, we design and implement the predict-more router (PmR). While effectively increasing the prediction accuracy, PmR also helps utilize the remaining bandwidth within the router more productively. When both PmR and PR are evaluated under their best algorithm(s), we find that PmR is over 15% higher in prediction accuracy than PR, which helps PmR outperform PR by 3.5% on average in system speed-up. We also find that although PmR creates more contention in prediction, this contention can be well resolved and is kept within the router, so neither router internal bandwidth nor link bandwidth is degraded by it. © 2013 IEEE.
International conference proceedings, English

Evaluation of a New Power-Gating Scheme Utilizing Data Retentiveness on Caches
Kyundong Kim; Seidai Takeda; Shinobu Miwa; Hiroshi Nakamura
IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG, E95A, 12, 2301-2308, Dec. 2012, Peer-reviewed, Caches are among the most leakage-consuming components in modern processors because of their massive number of transistors. To reduce the leakage power of caches, several techniques using power-gating (PG) have been proposed. Despite its high leakage saving, a side effect of PG for caches is the loss of data during sleep. If useful data is lost in sleep mode, it must be fetched again from a lower-level memory. This consumes a considerable amount of energy, which offsets the leakage saving. This paper proposes a new PG scheme that considers the data retentiveness of SRAM. After entering sleep mode, the data of an SRAM cell is not lost immediately and remains usable after its validity is checked. Therefore, we utilize the data retentiveness of SRAM to avoid the energy overhead of data recovery, which creates further opportunities for leakage saving. To check availability, we introduce simple hardware whose overhead is negligible. Our experimental results show that utilizing data retentiveness saves up to 32.42% more leakage than conventional PG.
Scientific journal, English

Communication Library to Overlap Computation and Communication for OpenCL Application.
Toshiya Komoda; Shinobu Miwa; Hiroshi Nakamura
26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, IPDPS 2012, Shanghai, China, May 21-25, 2012, IEEE Computer Society, 567-573, 2012, Peer-reviewed

Stepwise sleep depth control for run-time leakage power saving
Seidai Takeda; Shinobu Miwa; Kimiyoshi Usami; Hiroshi Nakamura
Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI, 233-238, 2012, Peer-reviewed, Recently, run-time sleep control schemes using multiple sleep modes have been studied. In those studies, each sleep mode has its own sleep depth. A deeper sleep mode provides higher leakage saving but incurs larger overhead energy. Using multiple modes helps save more leakage if an appropriate mode is selected, but the best mode depends on the idle period, whose length cannot be known in advance. Although implementations that realize different sleep depths have been well studied, little attention has been paid to how to select the best sleep depth dynamically during execution. This paper proposes a simple but novel sleep control scheme, called stepwise sleep depth control, which aims to select the best depth among the multiple sleep depths provided. Our scheme automatically applies deeper depths in a step-by-step manner after an idle state starts. It successfully reduces leakage energy while requiring only a small modification to the circuit implementation. This paper also proposes a methodology for optimizing the control parameters of our sleep control scheme according to program behavior and temperature. Experimental results show that stepwise sleep depth control applied to a body biasing circuit improves net leakage saving by up to 43% for FPAlu at 1.0GHz, 75°C compared to conventional reverse body biasing. Copyright 2012 ACM.
International conference proceedings, English

Efficient Leakage Power Saving by Sleep Depth Controlling for Multi-mode Power Gating
Seidai Takeda; Shinobu Miwa; Kimiyoshi Usami; Hiroshi Nakamura
2012 13TH INTERNATIONAL SYMPOSIUM ON QUALITY ELECTRONIC DESIGN (ISQED), IEEE, 625-632, 2012, Peer-reviewed, Power Gating (PG) and Body Biasing (BB) are effective schemes for saving leakage power during standby. At run time, however, their large overhead energy and latency for sleep control prevent the circuit from saving power in short idle times. To reduce those overheads, advanced PG and BB using a shallow sleep mode have been studied. Those circuits achieve leakage saving even in short idle times. The depth of the sleep mode trades off the overheads against the amount of saved leakage power; hence, deciding the depth of a shallow sleep is important for maximizing total leakage saving. However, the depth that achieves the best leakage saving depends heavily on run-time factors, such as application behavior and temperature. Thus, the conventional circuit has multiple shallow sleep modes and chooses an adequate mode at run time. However, this causes large overhead power because of the additional voltage generators for the shallow sleep modes. In this paper, we propose a sleep control scheme named Opt-static for run-time leakage saving. Our scheme uses only one shallow sleep mode, but its depth is reconfigurable. It successfully achieves leakage saving by adapting its depth to run-time factors. In addition, our scheme needs only one active voltage generator; hence, the overhead power associated with voltage generators is smaller than in the conventional circuit with multiple shallow sleep modes. Experimental results show that our scheme applied to Multi-mode PG achieves higher leakage saving than the conventional Multi-mode PG with two shallow sleep modes, although it does not take into account the overhead power of voltage generators.
International conference proceedings, English

A novel power-gating scheme utilizing data retentiveness on caches
Kyundong Kim; Seidai Takeda; Shinobu Miwa; Hiroshi Nakamura
Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI, 91-94, 2012, Peer-reviewed, Caches are among the most leakage-consuming components in modern processors because of their massive number of transistors. To reduce the leakage power of caches, several techniques using power-gating (PG) have been proposed. Despite its high leakage saving, a side effect of PG for caches is the loss of data during sleep. If useful data is lost in sleep mode, it must be fetched again from a lower-level memory. This consumes a considerable amount of energy, which offsets the leakage saving. This paper proposes a new PG scheme that considers the data retentiveness of SRAM. After entering sleep mode, the data of an SRAM cell is not lost immediately and remains usable after its validity is checked. Therefore, we utilize the data retentiveness of SRAM to avoid the energy overhead of data recovery, which creates further opportunities for leakage saving. To check availability, we introduce simple hardware whose overhead is negligible. We also examine the leakage saving potential of our approach. For both L1 data and instruction caches, our scheme yields leakage energy more than 2 times smaller than that of the conventional PG scheme. Copyright 2012 ACM.
International conference proceedings, English

Evaluation of GPU-Based Empirical Mode Decomposition for Off-Line Analysis
Pulung Waskito; Shinobu Miwa; Yasue Mitsukura; Hironori Nakajo
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG, E94D, 12, 2328-2337, Dec. 2011, Peer-reviewed, In off-line analysis, the demand for high-precision signal processing has introduced a new method called Empirical Mode Decomposition (EMD), which is used for analyzing complex sets of data. Unfortunately, EMD is highly compute-intensive. In this paper, we show a parallel implementation of Empirical Mode Decomposition on a GPU. We propose the use of a "partial+total" switching method to increase performance while keeping the precision. We also focus on reducing the computational complexity of the above method from O(N) on a single CPU to O(N/P log (N)) on a GPU. Evaluation results show that our single-GPU implementation using a Tesla C2050 (Fermi architecture) achieves a 29.9x speedup partially, and an 11.8x speedup in total, compared to a single Intel dual-core CPU.
Scientific journal, English

A Fine-Grained Runtime Power/Performance Optimization Method for Processors with Adaptive Pipeline Depth
Jun Yao; Shinobu Miwa; Hajime Shimada; Shinji Tomita
JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, SCIENCE PRESS, 26, 2, 292-301, Mar. 2011, Peer-reviewed, Recently, a method known as pipeline stage unification (PSU) has been proposed to alleviate the increasing energy consumption problem in modern microprocessors. PSU achieves high energy efficiency by employing a changeable pipeline depth, and its working scheme lends itself to fine-grained control. In this paper, we propose a dynamic method to study fine-grained program interval behaviors based on some easy-to-get runtime processor metrics. Using this method to determine the proper PSU configurations during program execution, we are able to achieve an average 13.5% energy-delay-product (EDP) reduction for SPEC CPU2000 integer benchmarks, compared to the baseline processor. This value is only 0.14% larger than that of theoretically ideal control. Our hardware synthesis results indicate that the proposed method can largely decrease the hardware overhead in both area and delay, compared to a previous program study method based on working set signatures.
Scientific journal, English

Hardware Acceleration for Java in Android Devices
Atsushi Ohta; Shinobu Miwa; Hironori Nakajo
IPSJ Transactions on Advanced Computing Systems, IPSJ, 42, 3, 115-132, 2011, Peer-reviewed
Scientific journal, Japanese

Dalvik Accelerator: A Mechanism for Fast Execution of Java Applications on Android Devices
太田淳; 三輪忍; 中條拓伯
組込みシステムシンポジウム (ESS2010), 2010, 10, 13-22, Oct. 2010, Peer-reviewed
Japanese

Accelerating Hilbert-Huang Transform using GPU
Pulung Waskito; Shinobu Miwa; Yasue Mitsukura; Hironori Nakajo
情報処理学会ハイパフォーマンスコンピューテング研究会報告, 2010-HPC-126, No.3, 1-8, Aug. 2010
Scientific journal, English

Selective Cache Allocation: A Technique for Improving Cache Utilization Efficiency in Multithreaded Environments
堀部悠平; 三輪忍; 塩谷亮太; 五島正裕; 中條拓伯
情報処理学会計算機アーキテクチャ研究会報告, 2010-ARC-190, No.1, 1-8, Aug. 2010
Research institution, Japanese

Parallelization of the Hilbert-Huang Transform and Its Acceleration on a GPU
Pulung Waskito; 三輪忍; 満倉靖恵; 中條拓伯
先進的計算基盤システムシンポジウム (SACSIS2010) ポスター・セッション, Vol.2010, No.5, 139-140, May 2010, Peer-reviewed
Japanese

Improving Cache Capacity Efficiency with Selective Cache Line Allocation
堀部悠平; 三輪忍; 塩谷亮太; 五島正裕; 中條拓伯
先進的計算基盤システムシンポジウム(SACSIS2010) ポスター・セッション, Vol.2010, No.5, 121-122, May 2010, Peer-reviewed
Scientific journal, Japanese

An Evaluation Environment Using a MIPS Simulator for the Dalvik Accelerator
太田淳; 茂手木貴彦; 三輪忍; 中條拓伯
先進的計算基盤システムシンポジウム (SACSIS2010) ポスター・セッション, Vol.2010, No.5, 113-114, May 2010, Peer-reviewed
Japanese

Reducing the Circuit Area of Register Map Tables Using a Small CAM
三輪忍; 張鵬; 横山弘基; 堀部悠平; 中條拓伯
先進的計算基盤システムシンポジウム (SACSIS2010)論文集, Vol.2010, No.5, 329-338, May 2010, Peer-reviewed
Japanese

Area-efficient Register Map Table Using a Cache
Shinobu Miwa; Peng Zhang; Hiroki Yokoyama; Yuhei Horibe; Hironori Nakajo
IPSJ Transactions on Advanced Computing Systems, IPSJ, 3, 3, 44-55, 2010, Peer-reviewed, The area of register map tables has been growing in recent processors with the spread of SMT. A map table is usually implemented with a multi-port RAM, as is a register file. To reduce the area of register files, a technique that uses a small cache has been proposed, but it has never been applied to map tables. Moreover, the technique has never been compared with general techniques for reducing the area of multi-port RAMs, such as multi-banking. We therefore applied the small-cache technique to map tables and compared it with multi-banking. This paper reports the results.
Scientific journal, Japanese

An Effective Replacement Policy Focusing on Lifetime of a Cache Line
H. Yokoyama; Y. Horibe; P. Zhang; S. Miwa; H. Nakajo
International Conference on Computer Design, 146-152, 2010, Peer-reviewed
International conference proceedings, English

Parallelizing Hilbert-Huang transform on a GPU
Pulung Waskito; Shinobu Miwa; Yasue Mitsukura; Hironori Nakajo
Proceedings - 2010 1st International Conference on Networking and Computing, ICNC 2010, 184-190, 2010, Peer-reviewed, In this paper, we show a parallel implementation of the Hilbert-Huang Transform on a GPU. This implementation focuses on reducing the computational complexity from O(N) on a single CPU to O(N/P log (N)) on a GPU, as well as on the use of a 'shared-global' switching method to increase performance. Evaluation results show that our single-GPU implementation using a Tesla C1060 achieves a 29.0x speedup in the best case, and a total speedup of 7.1x over all results, compared to a single Intel dual-core CPU. © 2010 IEEE.
International conference proceedings, English

Extraction of horns in a noisy environment by EMD
M. Nakanishi; Y. Mitsukura; T. Tanaka; S. Miwa; H. Nakajo
International Workshop on Nonlinear Circuits and Signal Processing, IIC-10, 71.73-78, 333-336, 2010, Peer-reviewed
English

Dynamic Switch of L1/L2 Cache Accesses on SMT Processors
Y. Ogasawara; S. Miwa; H. Nakajo
IPSJ Transactions on Advanced Computing Systems, IPSJ, 2, 3, 12-25, 2009, Peer-reviewed, An SMT processor aims for higher performance by sharing resources such as ALUs and cache memory among several threads. However, sharing cache memory causes inter-thread conflict misses on cache lines, which degrade performance. This paper enables direct access to the L2 cache and applies it under appropriate conditions to adjust the number of threads using the L1 cache, thereby reducing inter-thread conflicts. We propose and design two dynamic strategies for switching between L1 and L2 cache accesses: one uses the cache hit rate as the switching parameter, and the other switches the access mode for each set. In our evaluation, the proposed strategies improve performance over conventional cache access by a factor of 1.022 on average and up to 1.106. Furthermore, both strategies can be implemented with a small hardware increase of less than 3% of the whole chip, including the processor and cache memory.
Scientific journal, Japanese

An Instruction Scheduler for Dynamic ALU Cascading Adoption
J. Yao; K. Ogata; H. Shimada; S. Miwa; H. Nakashima; S. Tomita
IPSJ Transactions on Advanced Computing Systems, IPSJ, 2, 2, 30-47, 2009, Peer-reviewed, To reduce processor energy consumption under low-workload, low-clock-frequency execution, a possible solution is to use ALU cascading while keeping the supply voltage unchanged. This cascading scheme uses a single cycle to execute multiple ALU instructions that have a data dependence between them, and thus saves clock cycles over the whole execution. Since processor energy consumption is the product of power and execution time, ALU cascading is expected to help energy optimization for microprocessors operating at low frequency. To implement ALU cascading in a current superscalar processor, a specific instruction scheduler is required to wake up a pair of cascadable instructions simultaneously despite the data dependence between them. Furthermore, ALU cascading is applied only in the low-clock-frequency execution mode, so the instruction scheduler must also support standard scheduling for normal-clock-frequency execution. In this paper, we propose an instruction scheduling method that enables the additional wakeup features needed for ALU cascading without large hardware extensions. With this scheduler, the average IPC improvement is 3.7% in SPECint2000 and 6.4% in Mediabench, compared to the baseline execution. The delay of the additional hardware required for ALU cascading is also evaluated to study its complexity.
Scientific journal, Japanese

Dynamic Switching Techniques of Accessing L1/L2 Cache on an SMT Processor
Y. Ogasawara; P. Waskito; S. Miwa; H. Nakajo
International Conference on Computer Design, 171-177, 2009, Peer-reviewed
International conference proceedings, English

Improving Effectiveness of Pipeline Stage Unification via ALU Cascading
J. Yao; H. Shimada; K. Ogata; S. Miwa; S. Tomita
12th IEEE Symposium on Low-Power and High-Speed Chips, 423-436, 2009, Peer-reviewed
International conference proceedings, English

A Deterministic Branch Filtering Mechanism for Improving Branch Prediction Accuracy
三輪忍; 中條拓伯
情報処理学会計算機アーキテクチャ研究会報告(SWoPP 2008), 2008-ARC, 179, 61-66, Aug. 2008
Japanese

A dynamic control mechanism for pipeline stage unification by identifying program phases
Jun Yao; Shinobu Miwa; Hajime Shimada; Shinji Tomita
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG, E91D, 4, 1010-1022, Apr. 2008, Peer-reviewed, Recently, a method called pipeline stage unification (PSU) has been proposed to reduce the energy consumption of mobile processors by deactivating and bypassing some of the pipeline registers, thereby adopting shallower pipelines. It is designed to be an energy-efficient method, especially for processors under future process technologies. In this paper, we present a mechanism for the PSU controller that can dynamically predict a suitable configuration based on program phase detection. Our results show that the designed predictor achieves a PSU degree prediction accuracy of 84.0%, averaged over the SPEC CPU2000 integer benchmarks. With this dynamic control mechanism, we obtain an 11.4% Energy-Delay-Product (EDP) reduction in a processor that adopts a PSU pipeline, compared to the baseline processor, even after the application of complex clock gating.
Scientific journal, English

Low-Complexity Bypass Network Using Small RAM
Shinobu Miwa; Hironori Ichibayashi; Hidetsugu Irie; Masahiro Goshima; Hironori Nakajo; Shinji Tomita
International Conference on Computer Design, 48, SIG13(ACS19), 153-159, 2008, Peer-reviewed
English

Three quads: An interconnection network for interactive simulations
Tomoyuki Yoshimura; Keita Saito; Hajime Shimada; Shinobu Miwa; Yasuhiko Nakashima; Shin-ichiro Mori; Shinji Tomita
SYSTEMS MODELING AND SIMULATION: THEORY AND APPLICATIONS, ASIA SIMULATION CONFERENCE 2006, SPRINGER-VERLAG TOKYO, 362-+, 2007, Peer-reviewed, In this paper, we propose an interconnection network for medium-scale commodity clusters. This network was originally designed for the Visualization Subsystem of the Sensable Simulation System (Scube), which the authors have been developing. Scube is a 64-node PC-based cluster system in which each node is configured with a commodity GPU as the visualization accelerator. There are no dedicated special-purpose networks for numerical simulation and visualization; instead, a highly cost-effective interconnection network, which we call Three Quads, was designed for Scube. All the hardware components of this network are essentially small-scale commodity hardware designed for Gigabit Ethernet. The network configuration and its characteristics are discussed in this paper.
International conference proceedings, English

Low Complexity of Operand Bypasses Using Small RAM
Shinobu Miwa; Hironori Ichibayashi; Hidetsugu Irie; Masahiro Goshima; Shinji Tomita
IPSJ Transactions on Advanced Computing Systems, IPSJ, 48, SIG13, 58-69, 2007, Peer-reviewed, Because of the wire delay problem, units with long wires, such as register files and the bypass network, become critical. Pipelining is an effective technique to keep these units from becoming critical. However, pipelining the register files complicates the bypass network, which is unacceptable because the bypass network is already critical. A register cache has been proposed to resolve this problem. The register cache is a small buffer that caches the register files and is accessible in 1 cycle. If an instruction hits the register cache, the processor with the register cache behaves the same as a processor with non-pipelined register files, so the bypass networks of the two processors are the same. However, the processor with the register cache does not achieve high performance because of the large register cache miss penalty. We therefore propose a bypass buffer. A processor with the bypass buffer incurs no miss penalty because the buffer is not a cache. In this paper, we show that the processor with the bypass buffer achieves higher performance than the processor with the ideal register cache.
Scientific journal, Japanese

Optimal Pipeline Depth with Pipeline Stage Unification Adoption
J. Yao; S. Miwa; H. Shimada; S. Tomita
ACM SIGARCH Computer Architecture News, ACM, 35, 5, 3-6, 2007, Peer-reviewed
Scientific journal, English

Branch Filtering Mechanism with Path Information
Shinobu Miwa; Tomohisa Fukuyama; Hajime Shimada; Masahiro Goshima; Yasuhiko Nakashima; Shinichiro Mori; Shinji Tomita
IPSJ Transactions on Advanced Computing Systems, IPSJ, 47, SIG12, 108-118, 2006, Peer-reviewed
Japanese

An FPGA-based Visualization Accelerator : VisA Pro
Symposium on VLSI (GLSVLSI; oster presentation; Ma; S. Mori; D. Okamura; H. Shimada; S. Miwa; Y. Nakashima; S. Tomita
International Symposium on Advanced Reconfigurable Systems, poster, 2005, Peer-reviewed
International conference proceedings, English

Neural Network Simulation for Observing Memory Structures
津邑公暁; 三輪忍; 五島正裕; 富田眞治
計測自動制御学会システム工学部会研究会「人工生命の新しい潮流」計測自動制御学会, 111-114, Feb. 2000, Peer-reviewed
Research society, Japanese

MISC

Analysis of GPT-4o's Ability to Generate MPI Parallel Code
田中凛; 八巻隼人; 三輪忍; 本多弘樹
Mar. 2025, 情報処理学会研究報告 2025-HPC-198, 53, 1-7

Evaluation of the Impact of NUMA Configuration and Gramine Version on Intel SGX Performance
佐野拳紳; 下島航太; 八巻隼人; 本多弘樹; 三輪忍
Last, Mar. 2025, 情報処理学会研究報告 2025-HPC-198, 51, 1-8

An Encrypted MPI Communication Library for Parallel Computing Environments Using Intel SGX
下島航太; 八巻隼人; 本多弘樹; 松尾真一郎; 三輪忍
Last, Mar. 2025, 情報処理学会研究報告 2025-HPC-198, 48, 1-8

Accelerating a Neural Circuit Simulator on GPUs
鈴木嘉竜; 八巻隼人; 本多弘樹; 三輪忍
Last, Mar. 2025, 情報処理学会研究報告 2025-HPC-198, 43, 1-7

Evaluation of the Stability of CNFET-Based 4T SRAM
苗田友之助; 八巻隼人; 本多弘樹; 三輪忍
Last, Mar. 2025, 情報処理学会研究報告 2024-SLDM-208, 45, 1-8

A Multipath Load-Balancing Method Combining Global and Local Routing Control
佐藤翔; 三輪忍; 本多弘樹; 八巻隼人
Mar. 2025, 情報処理学会研究報告 2024-ARC-260, 42, 1-6

Organization of a Multi-Prefix-Length Cache for Routing Table Lookup
都地佑月; 五島正裕; 三輪忍; 本多弘樹; 八巻隼人
Mar. 2025, 情報処理学会研究報告 2024-ARC-260, 39, 1-8

A Study on the Use of TEEs to Prevent Information Leakage in Parallel Computing Systems
下島航太; 八巻隼人; 本多弘樹; 松尾真一郎; 三輪忍
Last, Sep. 2024, コンピュータセキュリティシンポジウム2024（ポスター）

Accuracy Evaluation of a Communication Latency Measurement Method Implemented on Linux
山口圭亮; 八巻隼人; 三輪忍; 本多弘樹
Jun. 2024, 情報処理学会研究報告 2024-ARC-257, 21, 1-6

Proposal of a Flow Table Miss Buffer for Open vSwitch
丸山颯斗; 八巻隼人; 三輪忍; 本多弘樹
Jun. 2024, 情報処理学会研究報告 2024-ARC-257, 9, 1-7

Job Execution Time Prediction with LSTM and Job Scheduling Using Both Predicted and Requested Execution Times
久保優也; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
Mar. 2024, 情報処理学会研究報告, 2023, HPC-193

A Cache-Memory System Supporting Longest Prefix Matching in Routers without TCAM
長田大樹; 八巻隼人; 三輪忍; 本多弘樹; 五島正裕
Aug. 2023, 情報処理学会研究報告(Web), 2023, ARC-252, 202302240000482906

Performance Analysis of Data Prefetchers on Processors with High-Bandwidth Memory
トウリン; 三輪忍; 塩谷亮太; 八巻隼人; 本多弘樹
Aug. 2023, 情報処理学会研究報告(Web), 2023, ARC-254, 202302264598073999

In-Network Content Caching in IP Networks
大河原幸哉; 八巻隼人; 三輪忍; 本多弘樹
Jul. 2023, 情報処理学会研究報告(Web), 2023, IOT-62, 202302266587998625

Bandwidth-Requirement-Based Dynamic Traffic Balancing Using INT for Multipath Routing
佐藤翔; 荒巻慎太朗; 八巻隼人; 三輪忍; 本多弘樹
Jul. 2023, 情報処理学会研究報告(Web), 2023, IOT-62, 202302267351118270

Parallelizing a Software Intrusion Detection System Using Multiple Snort Instances Specialized for Each Type of Inspection Target
小倉快将; 八巻隼人; 本多弘樹; 三輪忍
Jun. 2023, 情報処理学会研究報告(Web), 2023, ARC-253, 202302225450390001

Load Balancer for Parallel NIDS Using Multiple Devices with Different Processing Performance
八巻隼人; 三輪忍; 本多弘樹
Jun. 2023, 電子情報通信学会技術研究報告(Web), 123, 62(CPSY2023 1-7), 2432-6380, 202302236336502031

Evaluation of Power Variation across Multiple GPUs Using Real HPC Applications
郡司賢; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
Mar. 2023, 情報処理学会研究報告(Web), 2023, HPC-188, 202302264623200051

Evaluation and Analysis of Power and Performance Variation in A64FX Processors
草場智也; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
Mar. 2023, 情報処理学会研究報告(Web), 2023, HPC-188, 202302265328292538

Evaluation of Cache Miss Count Prediction for Parallel Applications
長谷川健人; 有馬海人; 三輪忍; 八巻隼人; 本多弘樹
Mar. 2023, 情報処理学会研究報告(Web), 2023, HPC-188, 202302277125915058

Modeling Performance of Deep Learning for Image Recognition on a GPU Server
松下哲也; 三輪忍; 八巻隼人; 本多弘樹
Mar. 2023, 電子情報通信学会技術研究報告(Web), 122, 451(CPSY2022 34-55), 2432-6380, 202302236656577507

Evaluation of Countermeasures against Power Side-channel Attacks
下島航太; 三輪忍; 八巻隼人; 本多弘樹
Mar. 2023, 電子情報通信学会技術研究報告(Web), 122, 451(CPSY2022 34-55), 2432-6380, 202302236713517810

Optimizing Hash Functions of Rabin-Karp Method for Multi-Pattern Matching with Multiple Pattern Length
鈴木想生; 八巻隼人; 三輪忍; 本多弘樹
Mar. 2023, 電子情報通信学会技術研究報告(Web), 122, 451(CPSY2022 34-55), 2432-6380, 202302287281843795

Traffic Load Balancing on aggregated links
平野愁也; 八巻隼人; 三輪忍; 本多弘樹
Mar. 2023, 情報処理学会研究報告(Web), 2023, ARC-252, 202302266722204411

Extending CACTI, a Power/Delay Simulator for SRAM, to Support CNFETs
関川栄一郎; 三輪忍; ヨウドウキン; 塩谷亮太; 八巻隼人; 本多弘樹
Jul. 2022, 情報処理学会研究報告(Web), 2022, ARC-249, 202202244716289108

Analysis of the Impact of CUDA Version Differences on Kernel Execution Time and Power Consumption
吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-183, 202202242346420364

Link Congestion-Based Multipath Routing using In-Band Network Telemetry
荒巻慎太朗; 田中京介; 八巻隼人; 三輪忍; 本多弘樹
2022, 電子情報通信学会技術研究報告(Web), 122, 16(NS2022 8-22), 2432-6380, 202202251502143401

Proposal of a Job Scheduling Method Considering Power Variation of CPUs and GPUs
小野賢人; 吉田幸平; 三輪忍; 坂本龍一; 八巻隼人; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-185, 202202239933175791

Proposal of a Code Transformation Framework for OpenMP/OpenACC Hybrid Parallelization
川崎真之; 大島聡史; 八巻隼人; 三輪忍; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-187, 202202284613935034

Predicting Function Call Counts for LULESH
有馬海人; 長谷川健人; 三輪忍; 八巻隼人; 本多弘樹
2022, 情報処理学会研究報告(Web), 2022, HPC-187, 202202286265964027

A Fast and Secure VMI Mechanism for Malware Analysis
森, 瑞穂; 味曽野, 雅史; 八巻, 隼人; 三輪, 忍; 本多, 弘樹; 品川, 高廣
As a means of quickly analyzing malware behavior and attack methods, a technique called Virtual Machine Introspection (VMI) is used to observe the internal state of programs on a virtual machine. A typical VMI system takes either an out-of-the-box approach (i.e., from an external hypervisor) or an in-the-box approach (i.e., from within the virtual machine); however, these two approaches involve a trade-off between analysis speed and the security of protecting and hiding the analysis system. In this paper, we propose FastVMIX, which realizes fast and secure VMI. FastVMIX reduces the number of context switches to the hypervisor during malware analysis by inserting an analysis agent into the target virtual machine, while protecting and hiding the agent's memory area from malware through a fast dynamic memory-protection switching mechanism that uses EPTP switching of VMFUNC and huge pages supported by Intel CPUs. In addition, we use a para-pass-through hypervisor to reduce the overhead of virtualization and improve the degree of hiding. This paper reports several experimental results of FastVMIX built on BitVisor., 25 Nov. 2021, コンピュータシステム・シンポジウム論文集, 2021, 48-56, Japanese, 170000185943

Predicting Function Call Counts of MPI Applications
有馬海人; 長谷川健人; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-178, 202102222474724022

Predicting Cache Profiles of MPI Applications
長谷川健人; 有馬海人; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-178, 202102245178693852

Table-Separate Packet Processing Caches for Routing/ARP/ACL/QoS
長田大樹; 田中京介; 八巻隼人; 三輪忍; 本多弘樹; 五島正裕
2021, 情報処理学会研究報告(Web), 2021, ARC-244, 202102212200475516

A Study on the Feasibility of Using NVDIMM in GPU Servers for TensorFlow Applications
松下哲也; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, ARC-244, 202102288818448896

Optimizing CPU-GPU Data Transfer in Model-Parallel Training with Mesh TensorFlow
横手宥則; 三輪忍; 八巻隼人; 本多弘樹
2021, 電子情報通信学会技術研究報告(Web), 120, 435(CPSY2020 50-69), 2432-6380, 202102235623871899

Evaluation of Power and Performance Variation of NVIDIA A100 GPUs on Wisteria/BDEC-01
提山春日; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-182, 202102219931012939

Accelerating High-Priority Packet Processing in SDN Controllers Using Priority Queues
高倉玲央; 八巻隼人; 三輪忍; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, ARC-246, 202102229681366351

Runtime File Staging for Deep Learning
樋口遼太郎; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-182, 202102244061258213

Evaluation of Predicting Communication Timing in Large-Scale MPI Runs by Analyzing Communication Traces from Small-Scale Runs
岡田悠希; 三輪忍; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, HPC-182, 202102265589208738

Power, Area, and Circuit Delay Evaluation of a Processor Synthesized with Carbon Nanotube Transistors
佐々木魁; 三輪忍; ヨウドウキン; 塩谷亮太; 八巻隼人; 本多弘樹
2021, 情報処理学会研究報告(Web), 2021, ARC-245, 202102251997160736

A Technique for Improving Cache Utilization Efficiency for Fast GZIP Decompression in Network Devices
黒川雄亮; 八巻隼人; 三輪忍; 本多弘樹
2020, 電子情報通信学会技術研究報告, 119, 429(DC2019 98-121)(Web), 0913-5685, 202002211528472983

Implementation in Snort of a Method for Excluding Video Traffic from Inspection
祐野雅範; 八巻隼人; 三輪忍; 本多弘樹
2020, 電子情報通信学会技術研究報告, 119, 429(DC2019 98-121)(Web), 0913-5685, 202002260646532043

Evaluation of Pipelining and Multi-Porting in Packet Processing Caches
田中京介; 八巻隼人; 三輪忍; 本多弘樹
2019, 情報処理学会研究報告(Web), 2019, ARC-237, 201902240481337109

A Real-Time Parallel Processing Method Satisfying Per-Primary-Key Ordering Constraints for Sequence Data Arriving Frequently and Out of Order
山添高弘; 三輪忍; 本多弘樹
2019, 情報処理学会研究報告(Web), 2019, DBS-169, 201902235140320625

Accelerating GPU Power Modeling That Accounts for Manufacturing Variation on TSUBAME3.0
大八木哲哉; 浅田風太; 三輪忍; 八巻隼人; 本多弘樹
2019, 情報処理学会研究報告(Web), 2019, HPC-172, 202002234496157254

Improving Throughput and Reducing Power of Internet Routers by Reducing the Number of Table Lookups
山下壮樹; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会技術研究報告, 119, 343(IA2019 48-58)(Web), 0913-5685, 202002264724684120

Reducing NIDS Processing Load by Not Mirroring Video Flows Using OpenFlow
高倉玲央; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会技術研究報告, 119, 343(IA2019 48-58)(Web), 0913-5685, 202002285135458100

Proposal of a Technique for Improving Cache Utilization Efficiency for Fast GZIP Decompression on Network Devices
黒川雄亮; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902210265635577

A Preliminary Study of a Training Method for Convolutional Neural Networks Using Pre-Trained Weights
横手宥則; 三輪忍; 井内悠太; 津邑公暁; 八巻隼人; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902220924867022

キャッシュを利用したOpenFlow通信の高速化
祐野雅範; 三輪忍; 八巻隼人; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902227184856037

GPUの電力ばらつきモデリング
浅田風太; 三輪忍; 本多弘樹; 八巻隼人
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902281110920126

ネットワークベースの攻撃に対応可能な高対話型ハニーポット
森瑞穂; 八巻隼人; 三輪忍; 本多弘樹
2019, 電子情報通信学会大会講演論文集(CD-ROM), 2019, 1349-144X, 201902217092133421

ON/OFFリンクにおける通信開始遅延を低減するためのプリウェイクアップ手法の検討
松山朋樹; 三輪忍; 八巻隼人; 本多弘樹
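In recent years, the power consumption of supercomputers has been approaching the limit of the power that can be supplied, so the power consumption of each hardware component in a system must be reduced further. As a power-saving technique for supercomputer interconnection networks, ON/OFF links, which can put links that are not communicating into a low-power mode, have attracted attention. However, when a communication request arrives while a link is in low-power mode, the link must first return to normal mode, and the start of communication is delayed by the time this mode transition takes. In this study, we therefore examine a method that returns a low-power link to normal mode ahead of the communication request (pre-wakeup) so that communication can start as soon as the data arrives., 13 Mar. 2018, 第80回全国大会講演論文集, 2018, 1, 123-124, Japanese, 201802252881916162, 170000176601, AN00349328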

CNN計算の省メモリ化のためのカーネル・クラスタリング手法の検討—A Study of Kernel Clustering for Reducing Memory Footprint of CNN—コンピュータシステム ; 組込み技術とネットワークに関するワークショップETNET2018
松井優樹; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 479, 185-190, Japanese, 0913-5685, 40021521631, AA1123312X

NVDIMMを用いたメモリスナップショットの解析システム—A System for Analyzing Memory Snapshot with NVDIMM—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2018
三須雅仁; 三輪忍; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 480, 107-112, Japanese, 0913-5685, 40021521593, AA1123312X

カーネルの類似性に基づく近似計算を行うCNNアクセラレータの検討—コンピュータシステム ; 組込み技術とネットワークに関するワークショップETNET2018
進藤智司; 松井優樹; 八巻隼人; 津邑公暁; 三輪忍
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 479, 179-184, Japanese, 0913-5685, 201802232250185293, 40021521629, AA1123312X

ゲートウェイにおける攻撃パケットに着目したテーブル検索負荷削減手法の提案—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2018
愛甲達也; 八巻隼人; 三輪忍; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 480, 89-94, Japanese, 0913-5685, 201802212922943301, 40021521583, AA1123312X

HSPICEを用いたシリコン回路とカーボンナノチューブ回路の比較評価—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2018
松尾駿; 三輪忍; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2018, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 480, 119-124, Japanese, 0913-5685, 201802283716598713, 40021522224, AA1123312X

Run-Time Performance-per-Watt Optimization by DFS/DCT Techniques
三吉郁夫; 三輪忍; 井上弘士; 近藤正章
2018, 情報処理学会研究報告(Web), 2018, HPC-163, 201802270411095874

A System for Analyzing Memory Snapshot with NVDIMM
三須雅仁; 三輪忍; 八巻隼人; 本多弘樹
2018, 電子情報通信学会技術研究報告, 117, 480(DC2017 89-106), 0913-5685, 201802277611652387

A Study of Kernel Clustering for Reducing Memory Footprint of CNN
松井優樹; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
2018, 電子情報通信学会技術研究報告, 117, 480(DC2017 89-106), 0913-5685, 201802279006502398

プリウェイクアップ手法によるON/OFFリンクの消費エネルギー削減
松山朋樹; 三輪忍; 八巻隼人; 本多弘樹
2018, 情報処理学会研究報告(Web), 2018, HPC-165, 201802210749233668

1Tbps実現に向けたルータのメモリ階層の最適化
田中京介; 八巻隼人; 三輪忍; 本多弘樹
2018, 情報処理学会研究報告(Web), 2018, ARC-233, 201902221895431753

高電力効率なCNNアクセラレータ実現に向けたカーネルクラスタリングの応用の検討 (コンピュータシステム)
進藤智司; 松井優樹; 八巻隼人; 津邑公暁; 三輪忍
電子情報通信学会, 26 Jul. 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 153, 65-73, Japanese, 0913-5685, 201702211184429190, 40021284637

高電力効率なCNNアクセラレータ実現に向けたカーネルクラスタリングの応用の検討 (ディペンダブルコンピューティング)
進藤智司; 松井優樹; 八巻隼人; 津邑公暁; 三輪忍
電子情報通信学会, 26 Jul. 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 154, 41-49, Japanese, 0913-5685, 40021286119

動画トラフィックに着目したNIDSにおける文字列探索処理負荷削減手法の提案
高徳真晴; 八巻隼人; 三輪忍; 本多弘樹
電子情報通信学会, Jul. 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 153, 177-183, Japanese, 0913-5685, 201702201099810955, 40021286172, AA1123312X

パケット処理キャッシュにおける送信元IPアドレスに着目したミス削減手法に関する初期検討
八巻隼人; 愛甲達也; 三輪忍; 本多弘樹
電子情報通信学会, May 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 117, 44, 55-62, Japanese, 0913-5685, 201702227542455634, 40021215515, AA1123312X

マルチコアニューラルネットワークアクセラレータにおけるデータ転送のブロードキャスト化—ディペンダブルコンピューティング ; 組込み技術とネットワークに関するワークショップETNET2017
大場百香; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
電子情報通信学会, Mar. 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 116, 511, 165-170, Japanese, 0913-5685, 201702248211492751, 40021158854, AA1123312X

ジョブ実行中の計算ノードにおけるDIMM待機電力削減手法の実装と評価
石原雅也; 三輪忍; 八巻隼人; 本多弘樹
2017, 情報処理学会研究報告(Web), 2017, HPC-158, 201702274862027744

電力性能推定を目的としたインターコネクト・シミュレータTraceRPの開発
小野貴継; 垣深悠太; 三輪忍; 井上弘士
2017, 情報処理学会研究報告(Web), 2017, HPC-161, 201702243126684906

再構成可能なニューラルネットワークアクセラレータの提案と性能分析 (コンピュータシステム)
大場百香; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
電子情報通信学会, 08 Aug. 2016, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 116, 177, 235-242, Japanese, 0913-5685, 201602282821771867, 40020932410

ニューラルネットワークアクセラレータにおけるコア間通信量最小化のためのタスク配置手法 (コンピュータシステム)
進藤智司; 大場百香; 津邑公暁; 三輪忍
電子情報通信学会, 08 Aug. 2016, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 116, 177, 243-250, Japanese, 0913-5685, 201602248836614037, 40020932412

リンクオフスレッショルドを有するON/OFFリンクの電力見積手法の初期検討
西郷雄斗; 三輪忍; 八巻隼人; 本多弘樹
2016, 情報処理学会研究報告(Web), 2016, HPC-155, 201602251172366586

メモリホットプラグを用いたメインメモリの省電力化に関する初期検討
石原雅也; 三輪忍; 八巻隼人; 本多弘樹
2016, 情報処理学会研究報告(Web), 2016, HPC-155, 201602256412247012

ヘテロジニアス・プロセッサの設計探索手法の初期検討
澁谷俊憲; 三輪忍; 塩谷亮太; 佐々木広; 八巻隼人; 本多弘樹
2016, 電子情報通信学会技術研究報告, 116, 177(CPSY2016 10-40), 0913-5685, 201602240060831500

TLBミスペナルティ削減のための大容量LLCの利用法に関する初期検討
有間英志; 三輪忍; 中田尚; 中村宏
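In recent years, advances in device technologies such as non-volatile memory and 3D stacking have been making it possible to integrate larger on-chip memories than ever before. Using such large memories as a last-level cache (LLC) has been proposed and shown to yield substantial performance gains. However, prior work on large LLCs has not sufficiently considered the impact of the TLB miss penalty. As the LLC grows, an increasing fraction of the data stored in it belongs to pages whose addresses are not present in the TLB. Accessing such data causes a TLB miss, which triggers an access to the corresponding page table entry in the cache or in main memory. Reducing the impact of this TLB miss penalty will become extremely important as LLC capacity continues to grow. In this study, we therefore aim to reduce the TLB miss penalty by optimizing the fraction of lines in a large LLC that hold page table entries, so that most page table accesses hit in the LLC. In this paper, we examine and evaluate a cache replacement algorithm for this purpose., 一般社団法人情報処理学会, 22 Jan. 2015, 研究報告計算機アーキテクチャ（ARC）, 2015, 7, 1-6, Japanese, 110009866537, AN10096105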

TLBミスペナルティ削減のための大容量LLCの利用法に関する初期検討(集積回路とアーキテクチャの協創「ロボット,ヒューマノイド,AI技術及び一般」)
有間英志; 三輪忍; 中田尚; 中村宏
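In recent years, advances in device technologies such as non-volatile memory and 3D stacking have been making it possible to integrate larger on-chip memories than ever before. Using such large memories as a last-level cache (LLC) has been proposed and shown to yield substantial performance gains. However, prior work on large LLCs has not sufficiently considered the impact of the TLB miss penalty. As the LLC grows, an increasing fraction of the data stored in it belongs to pages whose addresses are not present in the TLB. Accessing such data causes a TLB miss, which triggers an access to the corresponding page table entry in the cache or in main memory. Reducing the impact of this TLB miss penalty will become extremely important as LLC capacity continues to grow. In this study, we therefore aim to reduce the TLB miss penalty by optimizing the fraction of lines in a large LLC that hold page table entries, so that most page table accesses hit in the LLC. In this paper, we examine and evaluate a cache replacement algorithm for this purpose., 一般社団法人電子情報通信学会, 22 Jan. 2015, 電子情報通信学会技術研究報告. ICD, 集積回路, 114, 436, 37-42, Japanese, 0913-5685, 201502261898390010, 110010010030, AN10013276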

演算器におけるオペランド値を考慮したパワーゲーティングに関する初期検討
石川雄介; 小柴篤史; 坂本龍一; 和田康孝; 三輪忍; 近藤正章; 並木美太郎; 本多弘樹
2015, 電子情報通信学会技術研究報告, 115, 243(CPSY2015 45-60), 0913-5685, 201502202570022719

Organization Report for SWoPP Beppu 2015
H. Yamada; T. Ohkawa; Y. Katsu; S. Miwa; T. Endo; H. Tadano; Y. Takamiya; M. Kubota; M. Koibuchi; M. Goshima
2015, IPSJ Magazine, 56, 12, 1220-1223, Japanese, Introduction scientific journal

アクセスの局所性に着目したSTT-MRAMキャッシュの周辺回路の電源制御手法
有間英志; 野口紘希; 中田尚; 三輪忍; 武田進; 藤田忍; 中村宏
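The leakage power consumed by processors has increased as semiconductor feature sizes have shrunk. Cache leakage power in particular accounts for a large share of processor power consumption because caches occupy a large circuit area. To address this problem, recent work has attempted to apply non-volatile memories such as STT-MRAM to caches. However, in a cache built from STT-MRAM, the leakage power of the memory cells is negligibly small, but the leakage power of the peripheral circuits becomes large. A technique that reduces this leakage while limiting performance degradation is therefore needed. In this study, we propose a fine-grained power control technique for the peripheral circuits of STT-MRAM caches. Specifically, we control power on a per-subarray basis and cut the power supply to a subarray if it has not been accessed for a certain period of time. To further increase the power savings, we also propose a technique that improves the temporal and spatial locality of accesses to each subarray. Our evaluation shows that leakage power can be reduced by about 80% in a last-level cache that uses state-of-the-art STT-MRAM., 一般社団法人情報処理学会, 21 Jul. 2014, 研究報告計算機アーキテクチャ（ARC）, 2014, 11, 1-6, Japanese, 201502259588427288, 110009808091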

Implementation and Evaluation of Dalvik Accelerator Using FPGA
Yuki Oigo; Daisuke Yoshizane; Atsushi Ohta; Shinobu Miwa; Hironori Nakajo
On an Android device, a Java application is compiled to an intermediate code called Dalvik bytecode and then run on the Dalvik Virtual Machine. Such an execution model causes performance degradation and memory wastage. To solve this issue, we have proposed a Dalvik accelerator that directly executes the intermediate code with specialized hardware. Previous studies by our group presented the architecture and a preliminary evaluation of the generated code. In this study, we therefore implemented the accelerator together with a pipelined MIPS processor on an FPGA to evaluate the proposed mechanism. The results show that the proposed mechanism significantly improves performance for some programs. Here we report the detailed implementation and evaluation results., Information Processing Society of Japan (IPSJ), 12 May 2014, 情報処理学会研究報告(Web), 2014, 3, 1-8, Japanese, 0919-6072, 201502297254336906, 110009767029, AA12149313

電力制約下における蓄電池を用いたHPCシステムの性能向上
酒井崇至; 薦田登志矢; 三輪忍; 中村宏
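As HPC systems grow in scale, the rising construction and operating costs of power delivery and cooling systems have become a serious problem. To reduce these costs, design approaches have been proposed that raise the utilization of the power delivery system by overprovisioning it. Such approaches require a power management technique that imposes a power constraint on computing resources during operation. However, applying conventional power management techniques causes a large performance loss. In this paper, we therefore propose a power-shifting technique that uses batteries to shift power over time and improves performance under a power constraint. The proposed technique combines frequency control with the batteries in the UPS (uninterruptible power supply) units installed for power outages, and achieves high performance by shifting power from phases in which additional power yields little performance gain to phases in which additional power yields a large gain. Our evaluation shows that, compared with a conventional power-capping technique based on frequency control, the proposed technique improves performance by 4.5% on average for CPU applications and by 17.1% on average for GPU applications., 一般社団法人情報処理学会, 24 Feb. 2014, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2014, 25, 1-6, Japanese, 201502260905452860, 110009675739, AN10463942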

ロードバランスを考慮した電力制約下におけるCPUのDVFS制御
會田翔; 三輪忍; 中村宏
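In today's large-scale computing environments, the practical limit on available power has come into view, and power constraints now hinder performance improvement. In recent years, techniques for tuning the performance-power trade-off required in such environments have advanced. In this paper, we propose a runtime control technique that improves load balance across CPUs in order to improve power efficiency, and achieve a performance improvement of up to 21.3%., 一般社団法人情報処理学会, 24 Feb. 2014, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2014, 23, 1-8, Japanese, 201502243125090230, 110009675737, AN10463942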

物理メモリの増減による電力制約下でのHPCシステムの性能向上
米澤亮太; 會田翔; 三輪忍; 中村宏
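The scale of recent HPC systems is limited by their power consumption. To further raise the processing capability of HPC systems and reach exaflops, techniques that improve processing capability without increasing power consumption are essential. HPC systems are equipped with large amounts of physical memory, but memory usage varies by application, and cases that need all of the physical memory are thought to be rare. In this study, we therefore save memory power by changing the amount of physical memory in use according to the application. We then spend the saved power on raising the CPU frequency, which improves system performance under a power constraint. Our evaluation confirms a performance improvement of up to 27.8%., 一般社団法人情報処理学会, 24 Feb. 2014, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2014, 24, 1-8, Japanese, 201502206149132135, 110009675738, AN10463942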

ターボ・モード強化のための面積効率に優れたマイクロプロセッサとその設計手法
三輪忍; 井上聖等; 中村宏
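Turbo mode, which can accelerate many programs, is widely adopted in commercial processors. The maximum operating frequency available in turbo mode is determined through thermal tests with CPU-intensive programs, with a certain thermal margin so that thermal constraints are not violated. Therefore, if the temperature rise of the processor during program execution can be suppressed, a higher operating frequency can be used under the same thermal margin, which increases the performance benefit of turbo mode. In this paper, we describe a method that suppresses the processor's temperature rise by adding a small amount of hardware and thereby enhances turbo mode. We also propose a design methodology for such processors. In our evaluation, the proposed method achieved a performance improvement of up to 14.5% with a 2.8% increase in area., 一般社団法人情報処理学会, 16 Jan. 2014, 研究報告計算機アーキテクチャ（ARC）, 2014, 12, 1-10, Japanese, 201502290583202479, 110009658692, AN10096105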

ダーク・シリコン時代のプロセッサ・アーキテクチャに関する初期検討
三輪忍; 塩谷亮太; 佐々木広
2014, 情報処理学会研究報告(Web), 2014, ARC-211, 201502292808743566

回路資源の投入により電力効率を改善するプロセッサ・アーキテクチャ
三輪忍; 塩谷亮太; 佐々木広
2014, 情報処理学会研究報告(Web), 2014, ARC-212, 201502236873120625

Design Aid of Multi-core Embedded Systems with Energy Model
Nakada Takashi; Okamoto Kazuya; Komoda Toshiya; Miwa Shinobu; Sato Yohei; Ueki Hiroshi; Hayashikoshi Masanori; Shimizu Toru; Nakamura Hiroshi
Shifting to multi-core designs is a pervasive trend for overcoming the power wall, and it is a necessary move for embedded systems in our rapidly evolving information society. Meanwhile, the need to increase battery life and reduce maintenance costs for such embedded systems is critical. Therefore, a wide variety of power reduction techniques have been proposed and realized, including clock gating, DVFS, and power gating. To maximize the effectiveness of these techniques, task scheduling is key, but for multi-core systems it is very complicated because of the huge exploration space. This problem is a major obstacle to further power reduction. To cope with it, we propose a design method for embedded systems that minimizes their energy consumption under performance constraints. The method is based on clarifying the properties of the above-mentioned low-power techniques and their interactions. In more detail, we first establish energy models for these low-power techniques and our target systems. We then search for the best configuration by constructing an optimization problem, especially for applications that have a longer deadline than the execution interval. Finally, we propose an approximate solution using dynamic programming with lower computational complexity and compare it with a brute-force explicit solution. Our evaluations confirm that the proposed method finds a better configuration, which reduces total energy consumption by 32% compared with the manually optimized configuration, which uses only one core., Information and Media Technologies Editorial Board, 2014, IMT, 9, 4, 419-428, 1881-0896, 130004705277

Design Aid of Multi-core Embedded Systems with Energy Model
Nakada Takashi; Okamoto Kazuya; Komoda Toshiya; Miwa Shinobu; Sato Yohei; Ueki Hiroshi; Hayashikoshi Masanori; Shimizu Toru; Nakamura Hiroshi
Shifting to multi-core designs is a pervasive trend for overcoming the power wall, and it is a necessary move for embedded systems in our rapidly evolving information society. Meanwhile, the need to increase battery life and reduce maintenance costs for such embedded systems is critical. Therefore, a wide variety of power reduction techniques have been proposed and realized, including clock gating, DVFS, and power gating. To maximize the effectiveness of these techniques, task scheduling is key, but for multi-core systems it is very complicated because of the huge exploration space. This problem is a major obstacle to further power reduction. To cope with it, we propose a design method for embedded systems that minimizes their energy consumption under performance constraints. The method is based on clarifying the properties of the above-mentioned low-power techniques and their interactions. In more detail, we first establish energy models for these low-power techniques and our target systems. We then search for the best configuration by constructing an optimization problem, especially for applications that have a longer deadline than the execution interval. Finally, we propose an approximate solution using dynamic programming with lower computational complexity and compare it with a brute-force explicit solution. Our evaluations confirm that the proposed method finds a better configuration, which reduces total energy consumption by 32% compared with the manually optimized configuration, which uses only one core., Information Processing Society of Japan, 2014, IPSJ Online Transactions, 7, 0, 122-131, 1882-6660, 130004679148

Organization Report for SWoPP Niigata 2014
K. Nakajima; Y. Katsu; S. Miwa; R. Takano; T. Iwashita; T. Yoshikawa; H. Tadano; H. Matsutani
2014, IPSJ Magazine, 55, 12, 1415-1418, Japanese, Introduction scientific journal

Lost Data Prefetching to Reduce Performance Degradation Caused by Powering off Caches
Eishi Arima; Toshiya Komoda; Takashi Nakada; Shinobu Miwa; Hiroshi Nakamura
In current computer systems, cores and caches are powered off when the OS detects an idle state in order to reduce the processor's idle power consumption. However, powering off caches degrades performance because it causes data loss and additional cache misses. For this reason, caches, especially the last-level cache, are infrequently powered off in modern systems. To cope with this problem, we propose a novel prefetch scheme that restores such lost data before they are re-referenced. The experimental results show that much of the lost data can be restored before it is re-referenced. Hence, it is clear that this method suppresses performance degradation and increases the opportunities for powering off caches., 一般社団法人情報処理学会, 25 Sep. 2013, 情報処理学会論文誌コンピューティングシステム（ACS）, 6, 3, 118-130, Japanese, 1882-7829, 201502266912937258, 110009606665

周期実行システムにおける中間データに着目した電力制御手法
重松拓也; 薦田登志矢; 中田尚; 三輪忍; 佐藤洋平; 植木浩; 林越正紀; 清水徹; 中村宏
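Smart sensor systems are now used in a wide range of fields. However, it is difficult to secure an adequate power supply for every node, and nodes often have to run on batteries. In such situations there is a demand to minimize the number of battery replacements, so reducing the power consumption of the nodes themselves is strongly required. In particular, in sensor systems that perform sophisticated processing within the node, the power consumed by the microcontroller is a problem. Power gating has long been used to reduce microcontroller power consumption, and microcontrollers with a deeper sleep mode that also power-gates the working memory have recently appeared. Deep sleep yields large power savings, but because the contents of the working memory are lost, intermediate data that are still needed after wake-up must be saved to non-volatile memory. Saving intermediate data also costs additional energy, so deep sleep should not be used when this saving energy exceeds the energy reduction gained from deep sleep. In this paper, we therefore focus on the size and retention period of intermediate data and propose an algorithm that derives the optimal power control and data-saving method that minimizes the energy consumed by the microcontroller., 一般社団法人情報処理学会, 10 Sep. 2013, 研究報告組込みシステム（EMB）, 2013, 5, 1-8, Japanese, 0919-6072, 201502272328977191, 110009605215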

電力制約下におけるCPUとネットワークの電力制御協調手法
會田翔; 三輪忍; 中村宏
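In today's large-scale computing environments, the practical limit on available power has come into view, and power saving has become necessary for improving performance. This study focuses on saving power in the network, whose power consumption can no longer be ignored as network performance has increased in recent years. Energy Efficient Ethernet (EEE) has been standardized as a network power-saving technology. It reduces the power consumption of the PHY by moving a link into a low-power mode while no data is being sent or received. In this paper, assuming that EEE is deployed in a power-constrained large-scale computing environment, we propose a technique that improves application performance by efficiently distributing the power saved in the network to the CPUs of the compute nodes., 一般社団法人情報処理学会, 24 Jul. 2013, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2013, 1, 1-8, Japanese, 201502224001728886, 110009588121, AN10463942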

ALUローテーションによるスーパスカラプロセッサの性能向上
井上聖等; 三輪忍; 中田尚; 中村宏
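The rated voltage and operating frequency of a processor are set so that the junction temperature does not exceed a threshold, in order to guarantee that the semiconductor devices operate correctly. Because the junction temperature depends on the activity of hot-spot modules, lowering that activity while maintaining IPC allows a higher frequency to be used and improves processor performance. The ALU is one such hot spot: in a typical implementation, activity is heavily skewed across ALUs, so particular ALUs tend to become hot. In this paper, we propose a technique that provides many ALUs and uses them in a round-robin fashion, which reduces the activity of each ALU without degrading IPC. Our evaluation shows that the proposed technique improves performance by 10.4%., 19 Mar. 2013, 研究報告計算機アーキテクチャ（ARC）, 2013, 10, 1-8, Japanese, 110009552443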

バイパス専用ALUを用いる事による小面積高スループットプロセッサ
齋藤和明; 三輪忍; 中條拓伯
2013, 情報処理学会研究報告(CD-ROM), 2012, 5, 2186-2583, 201302246094880398

ALUローテーションによるスーパスカラプロセッサの性能向上
井上聖等; 三輪忍; 中田尚; 中村宏
2013, 情報処理学会研究報告(CD-ROM), 2012, 6, 2186-2583, 201302270889884373

NoC型メニーコア設計のための高速キャッシュシミュレーション
中田尚; 三輪忍; 中村宏
2013, 情報処理学会研究報告(CD-ROM), 2012, 5, 2186-2583, 201302294688642815

From the Editor
Shinobu Miwa
2013, IPSJ Magazine, 54, 7, 652-653, Japanese, Introduction scientific journal

Normally-off Computing - its opportunities and challenges -
Hiroshi Nakamura; Takashi Nakada; Shinobu Miwa
2013, IPSJ Magazine, 54, 7, 654-660, Japanese, Introduction scientific journal, 0447-8053, 201302264272913384

FX10におけるインタコネクト・コントローラの省電力化手法の初期検討
三輪忍; 會田翔; 安島雄一郎; 清水俊幸; 安里彰; 中村宏
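In recent supercomputers, the power that can be supplied has become the main factor determining the scale of the system. The realistic upper limit of the power that can be supplied is considered to be about two to three times the current level. Therefore, realizing an exaflops-class system requires further improvements in power efficiency in every module of the computer. In this paper, targeting the FX10 supercomputer, we examine power-saving techniques for the interconnect, which has received little research attention so far., 06 Dec. 2012, 研究報告計算機アーキテクチャ（ARC）, 2012, 5, 1-10, Japanese, 2186-2583, 201302204333308109, 110009490616, AN10096105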

FX10におけるインタコネクト・コントローラの省電力化手法の初期検討
三輪忍; 會田翔; 安島雄一郎; 清水俊幸; 安里彰; 中村宏
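In recent supercomputers, the power that can be supplied has become the main factor determining the scale of the system. The realistic upper limit of the power that can be supplied is considered to be about two to three times the current level. Therefore, realizing an exaflops-class system requires further improvements in power efficiency in every module of the computer. In this paper, targeting the FX10 supercomputer, we examine power-saving techniques for the interconnect, which has received little research attention so far., 06 Dec. 2012, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2012, 5, 1-10, Japanese, 110009490653, AN10463942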

バイパス専用ALUを用いる事による小面積高スループットプロセッサ
齋藤和明; 三輪忍; 中條拓伯
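Widening the issue width is an effective way to exploit instruction-level parallelism (ILP). However, widening the issue width increases the circuit area and power consumption of non-arithmetic circuits. To limit the growth in register file area in particular while still improving throughput, we focus on the bypass network, which supplies the most recent execution results, and propose the bypass-only ALU, an ALU with restricted register file access. With the proposed bypass-only ALU, throughput improved over a configuration with two ALUs by 8.0% on average on the SPEC CINT2006 benchmark set and by 5.7% on average on the SPEC CFP2006 benchmark set., 06 Dec. 2012, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2012, 12, 1-6, Japanese, 110009490660, AN10463942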

NoC型メニーコア設計のための高速キャッシュシミュレーション
中田尚; 三輪忍; 中村宏
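Many-core systems, which connect a large number of relatively small cores, are attracting attention for their high performance per watt. In particular, NoC-based many-core systems, in which cores are connected by an on-chip network (Network on Chip: NoC), are scalable many-core systems that can accommodate growing core counts. However, NoC-based many-core design involves a wide range of design parameters, and running detailed simulations while varying each parameter takes an enormous amount of time, which hinders efficient design. In this study, we therefore focus on the shared cache, which strongly affects the performance of NoC-based many-core systems, and separate its simulation from the simulation of the cores and the network. This lets us simulate shared-cache behavior quickly and greatly speed up performance prediction, enabling a coarse narrowing of the design space., 06 Dec. 2012, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2012, 15, 1-6, Japanese, 110009490663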

Hardware Implementation of a Dalvik Accelerator
吉實大輔; 太田淳; 三輪忍; 中條拓伯
10 Oct. 2012, 組込みシステムシンポジウム2012論文集, 2012, 225-226, Japanese, 170000072445

C-006 ユーザーの快適さを考慮した情報機器の動的電源制御(コンピュータシステム技術,C分野:ハードウェア・アーキテクチャ)
岩澤直弘; 薦田登志矢; 三輪忍; 中田尚; 中村宏
FIT(電子情報通信学会・情報処理学会)運営委員会, 04 Sep. 2012, 情報科学技術フォーラム講演論文集, 11, 1, 277-278, Japanese, 201202284513055620, 110009622702

周期実行システムにおける省電力スケジューリングの初期検討
岡本和也; 薦田登志矢; 中田尚; 三輪忍; 佐藤洋平; 植木浩; 林越正紀; 清水徹; 中村宏
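A smart sensor, a sensor equipped with a microprocessor, is a kind of periodic real-time system: it periodically samples its surroundings, applies simple processing to the sensed data, and sends the results to the main system. Unlike typical real-time systems, however, the sampling period of the input data and the period of data transmission (the deadline) do not necessarily match, and the latter is generally far longer than the former. Consequently, instead of processing data at the input interval, the system can store data in a buffer, start up and process the data once some has accumulated, and shut down when processing is complete. Such control is expected to yield a more power-efficient system than conventional control such as DVFS or dynamic power management. In this paper, we propose a model of a system that performs this control and compare it with existing control techniques. Our evaluation shows that energy consumption can be reduced by 79.6% compared with existing techniques., 03 Sep. 2012, 研究報告組込みシステム（EMB）, 2012, 4, 1-8, Japanese, 170000071971, AA12149313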

CPU/GPU間データ通信向け先読み機構の検討
薦田登志矢; 三輪忍; 中村宏
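GPU computing is becoming increasingly important, particularly in HPC. In a typical GPU computing system, a general-purpose CPU and a GPU have physically separate memories and are connected via the system bus. Until now, the data transfer overhead on the system bus has been handled by programmers, who optimize data transfers while taking application characteristics into account. However, because manually managing and optimizing data communication greatly reduces application development productivity, automating and automatically optimizing such data transfers is desirable. In this study, we therefore target the data transfers between system memory and graphics memory and propose a prefetching mechanism that automatically overlaps computation with transfers. The proposed system analyzes the application's data communication pattern at run time and predicts the data to be transferred next. The predicted data are prefetched into GPU memory behind the computation using asynchronous transfers. In this paper, we present the design and implementation of this prefetching mechanism and examine its potential for performance improvement through the results of preliminary evaluation experiments., 25 Jul. 2012, 研究報告計算機アーキテクチャ（ARC）, 2012, 25, 1-8, Japanese, 110009425037, AN10096105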

アイドル時キャッシュ電源遮断における性能ペナルティ削減手法の実装
有間英志; 薦田登志矢; 三輪忍; 中村宏
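The share of total power consumption taken up by the leakage power that a processor consumes while idle has been rising year by year as transistors shrink, and has become a problem. To reduce this leakage power, power gating, which cuts the power supply to cores during idle periods at the OS's discretion, is widely used. However, when the cache is powered off, the stored data are lost, so references to the lost data after power is restored cause cache misses that degrade performance. In this study, we therefore propose a technique that keeps the tag array powered even when the cache is powered off and uses the tags to restore the data after power is restored. To avoid useless data restoration, we also examine methods for identifying data that will be reused and carry out a preliminary evaluation., 25 Jul. 2012, 研究報告計算機アーキテクチャ（ARC）, 2012, 15, 1-7, Japanese, 110009425027, AN10096105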

レジスタ・ファイルと実行ユニットにおけるアクティビティ・マイグレーション
井上聖等; 三輪忍; 中田尚; 中村宏
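In recent chips, the heat generated by hot spots limits the frequency at which the chip can operate safely. Hence, lowering the temperature of hot spots allows the chip's operating frequency to be raised and improves processor performance. Activity migration is a technique that suppresses the temperature rise of a module while maintaining performance by moving the work performed in that module to another module with equivalent functionality. In this paper, we examine applying activity migration to the register file and the execution units, which are among the hot modules., 25 Jul. 2012, 研究報告計算機アーキテクチャ（ARC）, 2012, 11, 1-9, Japanese, 110009425023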

キャッシュの利用効率の向上に関する研究
浅見公輔; 倉田成己; 塩谷亮太; 三輪忍; 五島正裕; 坂井修一
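In a multithreaded execution environment, the cache is shared by multiple threads, so many conflicts occur and processor performance drops significantly. We studied a technique that alleviates conflicts in the shared cache and improves cache utilization., 06 Mar. 2012, 第74回全国大会講演論文集, 2012, 1, 61-62, Japanese, 201202216132290059, 170000089518, AN00349328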

情報機器の動的電源制御における起動時間隠蔽のためのリクエスト間隔予測手法
渡辺千洋; 三輪忍; 中村宏
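In personal information devices, dynamic power management, which detects the idle time of individual modules and powers them off, is used to reduce energy consumption. Conventional research on dynamic power management has focused only on the timing of power-off and has not considered the startup time at power-on. Conventional techniques use simple control in which startup begins when a request arrives while the power is off. Some modules take several seconds to start, and such power control is thought to cause stress for users. In this paper, we propose a technique that hides the delay required to restart a device through power control based on statistical prediction of the interval between requests., 一般社団法人情報処理学会, 06 Mar. 2012, 全国大会講演論文集, 2012, 1, 67-69, Japanese, 201202299830951582, 110009784863, AN00349328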

アイドル時のキャッシュ電源遮断による性能ペナルティとその削減手法 (集積回路・集積回路とアーキテクチャの協創 : ノーマリオフコンピューティングによる低消費電力化への挑戦)
有間英志; 薦田登志矢; 三輪忍; 中村宏
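The share of total power consumption taken up by the leakage power that a processor consumes while idle has been rising year by year as transistors shrink, and has become a problem. To reduce this leakage power, power gating, which cuts the power supply to cores while the processor is idle, is widely used in mobile and desktop processors. Current systems power off a core when the CPU idle time exceeds a certain threshold, but the problem is the increase in cache misses after wake-up caused by the loss of cached data during sleep mode. In this study, we propose a cache prefetching technique to avoid the increase in cache misses caused by the cache flush that occurs when a core enters sleep mode, and carry out a preliminary evaluation of its effectiveness., 一般社団法人電子情報通信学会, 19 Jan. 2012, 電子情報通信学会技術研究報告 : 信学技報, 111, 388, 9-14, Japanese, 0913-5685, 201202294075288205, 110009481165, AN10013276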

データ保持性を利用したキャッシュのパワーゲーティング手法 (集積回路・集積回路とアーキテクチャの協創 : ノーマリオフコンピューティングによる低消費電力化への挑戦)
金均東; 武田清大; 三輪忍; 中村宏
Caches consume a large amount of leakage power because of their large area and large number of transistors. To handle cache leakage power, several techniques using power gating (PG) have been proposed. Although PG can achieve high leakage savings, the energy overhead caused by discarding data is a major shortcoming. In this paper, we focus on the data retentiveness of PG, a property that previous work has not addressed. The voltage of an SRAM cell does not drop to zero immediately after PG, and this phenomenon helps relieve the energy overhead of data recovery. We also propose a circuit that exploits data retentiveness. Using oracle-knowledge control, we examined the leakage-saving potential of our proposal for L1 instruction and data caches. The results show that exploiting the retentiveness of PG has great potential for leakage savings., 一般社団法人電子情報通信学会, 19 Jan. 2012, 電子情報通信学会技術研究報告 : 信学技報, 111, 388, 1-7, English, 0913-5685, 110009481164, AN10013276

データ保持性を利用したキャッシュのパワーゲーティング手法
金均東; 武田清大; 三輪忍; 中村宏
Caches consume a large amount of leakage power because of their large area and large number of transistors. To handle cache leakage power, several techniques using power gating (PG) have been proposed. Although PG can achieve high leakage savings, the energy overhead caused by discarding data is a major shortcoming. In this paper, we focus on the data retentiveness of PG, a property that previous work has not addressed. The voltage of an SRAM cell does not drop to zero immediately after PG, and this phenomenon helps relieve the energy overhead of data recovery. We also propose a circuit that exploits data retentiveness. Using oracle-knowledge control, we examined the leakage-saving potential of our proposal for L1 instruction and data caches. The results show that exploiting the retentiveness of PG has great potential for leakage savings., 12 Jan. 2012, 研究報告計算機アーキテクチャ（ARC）, 2012, 1, 1-7, English, 170000068914, AN10096105

アイドル時のキャッシュ電源遮断による性能ペナルティとその削減手法
有間英志; 薦田登志矢; 三輪忍; 中村宏
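The share of total power consumption taken up by the leakage power that a processor consumes while idle has been rising year by year as transistors shrink, and has become a problem. To reduce this leakage power, power gating, which cuts the power supply to cores while the processor is idle, is widely used in mobile and desktop processors. Current systems power off a core when the CPU idle time exceeds a certain threshold, but the problem is the increase in cache misses after wake-up caused by the loss of cached data during sleep mode. In this study, we propose a cache prefetching technique to avoid the increase in cache misses caused by the cache flush that occurs when a core enters sleep mode, and carry out a preliminary evaluation of its effectiveness., 12 Jan. 2012, 研究報告計算機アーキテクチャ（ARC）, 2012, 2, 1-6, Japanese, 170000068915, AN10096105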

CPU/GPU間データ通信向け先読み機構の検討
薦田登志矢; 三輪忍; 中村宏
2012, 情報処理学会研究報告(CD-ROM), 2012, 3, 2186-2583, 201202222965513124

アイドル時キャッシュ電源遮断における性能ペナルティ削減手法の実装
有間英志; 薦田登志矢; 三輪忍; 中村宏
2012, 情報処理学会研究報告(CD-ROM), 2012, 3, 2186-2583, 201202239811755060

レジスタ・ファイルと実行ユニットにおけるアクティビティ・マイグレーション
井上聖等; 三輪忍; 中田尚; 中村宏
2012, 情報処理学会研究報告(CD-ROM), 2012, 3, 2186-2583, 201202268784259536

CMPにおけるキャッシュ・データを考慮したスレッド・スケジューリング手法の初期検討
三輪忍; 角崎宏一; 佐々木広; 中村宏
2012, 情報処理学会研究報告(CD-ROM), 2012, 1, 2186-2583, 201202239918739068

OSの電力管理下におけるラスト・レベル・キャッシュのリーク削減手法の比較
有間英志; 薦田登志矢; 三輪忍
2012, 回路とシステムワークショップ論文集(CD-ROM), 25, 202102243559229090

命令グループのワーキング・セットに着目したキャッシュ・マネジメント
浅見公輔; 倉田成己; 塩谷亮太; 三輪忍; 五島正裕; 坂井修一
2012, 情報処理学会研究報告(CD-ROM), 2012, 1, 2186-2583, 201202203624989886

命令グループごとのキャッシュ・パーティショニングの予備評価
浅見公輔; 倉田成己; 塩谷亮太; 三輪忍; 五島正裕; 坂井修一
2012, 情報処理学会研究報告(CD-ROM), 2012, 3, 2186-2583, 201202247876844837

周期実行システムにおける省電力スケジューリングの初期検討
岡本和也; 薦田登志矢; 中田尚; 三輪忍; 佐藤洋平; 植木浩; 林越正紀; 清水徹; 中村宏
2012, 情報処理学会研究報告(CD-ROM), 2012, 3, 2186-2583, 201202224576054281

Sleep Depth Controlling for Run-Time Leakage Power Saving
TAKEDA Seidai; MIWA Shinobu; NAKAMURA Hiroshi
The Institute of Electronics, Information and Communication Engineers, 08 Dec. 2011, Technical report of IEICE. ICD, 111, 352, 69-69, Japanese, 110009466843, AN10013276

メニーコアプロセッサにおける競合とスケーラビリティを考慮したスレッドスケジューリング
谷本輝夫; 佐々木広; 三輪忍; 中村宏
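This study aims to improve overall system performance when multiple parallel applications run simultaneously on a many-core system. In a many-core system, resources such as caches and memory are shared by multiple cores, so it is important to prevent contention between applications. In addition, because scalability differs from application to application, resources should be distributed appropriately according to scalability when multiple applications run simultaneously. In this study, we execute applications efficiently by dynamically controlling the number of cores assigned to each concurrently running application. In this paper, under the assumption that application behavior is constant, we implement a scheduler and show that it can detect scalability from run-time information and determine an appropriate number of cores to assign., 21 Nov. 2011, 研究報告計算機アーキテクチャ（ARC）, 2011, 31, 1-7, Japanese, 2186-2583, 201202275944812179, 110008713501, AN10096105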

メニーコアプロセッサにおける競合とスケーラビリティを考慮したスレッドスケジューリング
谷本輝夫; 佐々木広; 三輪忍; 中村宏
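This study aims to improve overall system performance when multiple parallel applications run simultaneously on a many-core system. In a many-core system, resources such as caches and memory are shared by multiple cores, so it is important to prevent contention between applications. In addition, because scalability differs from application to application, resources should be distributed appropriately according to scalability when multiple applications run simultaneously. In this study, we execute applications efficiently by dynamically controlling the number of cores assigned to each concurrently running application. In this paper, under the assumption that application behavior is constant, we implement a scheduler and show that it can detect scalability from run-time information and determine an appropriate number of cores to assign., 21 Nov. 2011, 研究報告ハイパフォーマンスコンピューティング（HPC）, 2011, 31, 1-7, Japanese, 110008713537, AN10463942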

OpenCLを用いたパイプライン並列プログラミングAPIの初期検討
薦田登志矢; 三輪忍; 中村宏
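Because of the limits on single-thread performance improvement and the problem of power constraints, using accelerators specialized for particular applications is becoming increasingly important. Accelerator use to date has mainly targeted applications that exploit data parallelism. However, especially in embedded systems, exploiting pipeline parallelism is important for improving application performance under a given power constraint. In this paper, assuming the use of accelerators in embedded systems, we propose a library for easily and flexibly building applications that exploit pipeline parallelism on systems containing accelerators. The proposed library uses OpenCL, which was established as a standard for accelerator programming, and applies software pipelining to realize pipeline-parallel processing on accelerators while providing a concise user interface for developing pipeline-parallel applications. An evaluation of a prototype system shows that pipelining can improve performance on accelerator devices while hiding system complexity, such as task scheduling and communication buffer management in pipeline-parallel processing, from the programmer., 21 Nov. 2011, 研究報告計算機アーキテクチャ（ARC）, 2011, 10, 1-7, Japanese, 110008713480, AN10096105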

Data Compression on Last Level Cache for Reducing Hardware Amount
横山弘基; 堀部悠平; 三輪忍; 中條拓伯
2011, 情報処理学会研究報告(CD-ROM), 2010, 6, 2186-2583, 201102267115554260

Proposal of a Hardware Scheme for Java Acceleration on Android Devices
太田淳; 三輪忍; 中條拓伯
2011, 情報処理学会論文誌トランザクション(CD-ROM), 2011, 1, 1882-7772, 201102227683431006

OpenCLを用いたパイプライン並列プログラミングAPIの初期検討
薦田登志矢; 三輪忍; 中村宏
2011, 情報処理学会研究報告(CD-ROM), 2011, 4, 2186-2583, 201202257832830168

A Thread Migration Method on CMP with Data Migration Methods
角崎宏一; 佐々木広; 三輪忍; 中村宏
2011, 電子情報通信学会技術研究報告, 111, 255(CPSY2011 25-41), 0913-5685, 201102217639090492

Selective Cache Line Allocation with Load/Store Instruction Address.
堀部悠平; 三輪忍; 塩谷亮太; 五島正裕; 中條拓伯
2011, 先進的計算基盤システムシンポジウム SACSIS 2011, 2011, 316-323, Japanese, Peer-reviewed, Summary national conference, 170000065684

Area-efficient Register Map Table Using a Cache
三輪忍; ZHANG Peng; 横山弘基; 堀部悠平; 中條拓伯
2010, 情報処理学会論文誌トランザクション(CD-ROM), 2010, 1, 1882-7772, 201102283679932405

Dynamic Switch Strategies of Accessing L1/L2 Cache for an SMT Processor
小笠原嘉泰; 三輪忍; 中條拓伯
2009, 情報処理学会論文誌トランザクション(CD-ROM), 2009, 1, 1882-7772, 200902227901550197

Feasibility of an Embedded Virtual Machine under Parallel or Distributed Processing Environment
矢野裕章; 中西正樹; 三輪忍; 中條拓伯
2009, 情報処理学会研究報告, ARC-181, 1(ARC-181 EMB-11), 75-80, 0919-6072, 200902202864117507

Fast Instruction Supply Method Using Scheduled Instruction Cache
三輪忍; 中條拓伯
2009, 情報処理学会研究報告(CD-ROM), 2009, 4, -, 2186-2583, 201002287475800281

Hardware prefetching to achieve high accuracy according to memory access patterns
堀部悠平; TYOU Hou; 小笠原嘉泰; 三輪忍; 中條拓伯
2009, 情報処理学会研究報告, 2009, 14(ARC-182 HPC-119), 91-96, 0919-6072, 200902292430255005

スケーラブルFPGAシステムにおけるハードウェア拡張プロトコル
中條拓伯; 坂本龍一; 三輪忍
2009, 信学技報(リコンフィギャラブルシステム研究会（RECONF）), 12/2-4, -

A Dynamic Instruction Scheduler for ALU Cascading
尾形幸亮; YAO Jun; 嶋田創; 三輪忍; 富田眞治
04 Jun. 2008, 情報処理学会シンポジウム論文集, 2008, 5, 105-114, Japanese, Peer-reviewed, 1344-0640, 200902244074006750

Branch Target Predictor Utilizing Context Base Value Predictor
平嶋哲朗; 嶋田創; 三輪忍; 富田眞治
2008, 情報処理学会研究報告, 2008, 39(ARC-178), 0919-6072, 200902269397485162

Development of Parallel Volume Rendering Accelerator VisA and its Preliminary Implementation
川原崇宏; 三輪忍; 嶋田創; 森眞一郎; 富田眞治
2008, 情報処理学会研究報告, 2008, 2(SLDM-133), 0919-6072, 200902255895102350

Branch Prediction Method with Compressed Path Information
三輪忍; 中條拓伯
2008, 情報処理学会シンポジウム論文集, 2008, 5, 1344-0640, 200902226911654601

FPGAにおけるマルチSMTプロセッサの実装
小笠原嘉泰; 館一平; 三輪忍; 中條拓伯
2008, 先進的計算基盤システムシンポジウムSACSIS (Symposium on Advanced Computing Systems and Infrastructures) 2008 論文集, Vol.2008, No.6, 1344-0640, 200902273969894830

Low-Complexity Operand Bypass Using Small RAM
三輪忍; 一林宏憲; 入江英嗣; 五島正裕; 富田眞治
23 May 2007, 情報処理学会シンポジウム論文集, 2007, 5, 265-274, Japanese, Peer-reviewed, 1344-0640, 200902213975140356

Mounting of remote-controlled framework in interactive simulation
橋本健介; 嶋田創; 三輪忍; 幡生安紀; 森眞一郎; 富田眞治
2007, 情報処理学会研究報告, 2007, 80(HPC-111), 0919-6072, 200902204555660347

The Dynamic Instruction Scheduler for ALU Cascading
尾形幸亮; YAO Jun; 三輪忍; 嶋田創; 富田眞治
2007, 情報処理学会研究報告, 2007, 55(ARC-173), 0919-6072, 200902227880794228

Examination of Force Sense Presentation model for Interactive Fluid Simulation
山口明徳; 三輪忍; 嶋田創; 森眞一郎; 富田眞治
2007, 日本バーチャルリアリティ学会大会論文集(CD-ROM), 12th, 1349-5062, 200902260859137265

A Digital Appliance Architecture Which Increases User Side Robustness
嶋田創; 三輪忍; 富田眞治
2007, 情報処理学会研究報告, 2007, 115(ARC-175), 0919-6072, 200902284704258100

High-Speed Calculation on a Surgical Simulator Considered the Sequence of Operation
依藤逸; 野田裕介; 吉田智一; 三輪忍; 粂直人; 嶋田創; 中尾恵; 森眞一郎; 富田眞治
2007, 情報処理学会研究報告, 2007, 80(HPC-111), 0919-6072, 200902287963359704

Selective Instruction Re-Issue Mechanism using Bit Vector
嶋田創; 三輪忍; 富田眞治
2007, 情報処理学会研究報告, 2007, 79(ARC-174), 0919-6072, 200902291575945585

Acceleration Method Considering the Sequence of Operation on a Surgical Simulator with PC Cluster
野田裕介; 依藤逸; 三輪忍; 粂直人; 嶋田創; 森眞一郎; 富田眞治
2007, 日本バーチャルリアリティ学会大会論文集(CD-ROM), 12th, 1349-5062, 200902248719470711

Instruction Steering for Clustered Superscalar Processor with Slack Prediction
福山智久; 三輪忍; 嶋田創; 五島正裕; 中島康彦; 森眞一郎; 富田眞治
We proposed an instruction criticality prediction technique based on the prediction of instruction slack. When the execution time of a program does not become longer even if an instruction of the program is delayed by s cycles, the maximum such s is referred to as the slack of the instruction. The slack value is stored in a prediction table and used as the predicted value the next time the instruction executes. This paper describes instruction steering for a clustered processor with slack prediction. The cluster that an instruction will use the next time is decided by the slack value obtained after the execution of the instruction. Evaluation results show that IPC is reduced by 10% in comparison with a non-clustered processor., 一般社団法人情報処理学会, 31 Jul. 2006, 情報処理学会研究報告, 2006, 88(ARC-169), 55-60, Japanese, 0919-6072, 200902293690458127, 110004824128, AN10096105
URL

Branch Filtering Mechanism with Path Trace
三輪忍; 福山智久; 嶋田創; 五島正裕; 中島康彦; 森眞一郎; 富田眞治
22 May 2006, 情報処理学会シンポジウム論文集, 2006, 5, 315-323, Japanese, 1344-0640, 200902213821786547
URL

Three Quads: A Versatile Interconnection Network for Medium Scale Commodity Cluster
YOSHIMURA TOMOYUKI; MIWA SHINOBU; SHIMADA HAJIME; NAKASHIMA YASUHIKO; MORI SHIN-ICHIRO; TOMITA SHINJI
In this paper, we propose an interconnection network for medium-scale commodity clusters. This network was originally designed for the visualization subsystem of the Sensable Simulation System (Scube), which the authors have been developing. Scube is a 64-node PC-based cluster system in which each node is configured with a commodity GPU as the visualization accelerator. There is no dedicated special-purpose network for the numerical simulation and visualization; however, a high cost-performance interconnection network was designed originally for Scube. All the hardware components for th..., 社団法人情報処理学会, 27 Feb. 2006, IPSJ SIG Notes, 2006, 20, 79-84, Japanese, 0919-6072, 200902209877159815, 110004668755, AN10096105
URL

DVIによる超高速単方向リンクを用いた並列ボリュームレンダリング(FPGAとその応用及び一般)
岡村大; 野田祐介; 三輪忍; 嶋田創; 中島康彦; 森眞一郎; 富田眞治
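Volume rendering is one of the visualization methods attracting attention amid the recent rapid growth in computing power. This paper introduces a system for parallel volume rendering of large-scale data that overcomes the drawback of the triplicated volume data in the previously proposed VisA. A ray-casting method adapted for hardware implementation and an ultra-high-speed communication path using the DVI interface achieve high responsiveness and a high frame rate., 社団法人情報処理学会, 17 Jan. 2006, 情報処理学会研究報告. SLDM, [システムLSI設計技術], 2006, 4, 97-100, Japanese, 0919-6072, 110004085803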

The parallel volume rendering using the superspeed unilateral link by the DVI.
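岡村大; 野田祐介; 三輪忍; 嶋田創; 中島康彦; 森真一郎; 富田真治
Volume rendering is one of the visualization methods attracting attention amid the recent rapid growth in computing power. This paper introduces a system for parallel volume rendering of large-scale data that overcomes the drawback of the triplicated volume data in the previously proposed VisA. A ray-casting method adapted for hardware implementation and an ultra-high-speed communication path using the DVI interface achieve high responsiveness and a high frame rate., 一般社団法人電子情報通信学会, 11 Jan. 2006, 電子情報通信学会技術研究報告, 105, 516, 97-100, Japanese, 0913-5685, 200902214626444360, 110004079428, AN10013141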
URL

DVIによる超高速単方向リンクを用いた並列ボリュームレンダリング(FPGAとその応用及び一般)
岡村大; 野田祐介; 三輪忍; 嶋田創; 中島康彦; 森眞一郎; 富田眞治
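Volume rendering is one of the visualization methods attracting attention amid the recent rapid growth in computing power. This paper introduces a system for parallel volume rendering of large-scale data that overcomes the drawback of the triplicated volume data in the previously proposed VisA. A ray-casting method adapted for hardware implementation and an ultra-high-speed communication path using the DVI interface achieve high responsiveness and a high frame rate., 社団法人電子情報通信学会, 11 Jan. 2006, 電子情報通信学会技術研究報告. RECONF, リコンフィギャラブルシステム, 105, 518, 43-46, Japanese, 0913-5685, 10017974718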

汎用GPUを用いた流体シミュレーションのプロトタイプ実装
橋本健介; 小松原誠; 嶋田創; 三輪忍; 幡生安紀; 森眞一郎; 富田眞治
2006, 電気関係学会北陸支部連合大会講演論文集(CD-ROM), 2006, 200902245307902383

Consideration for Speculative Rendering in PVR
SHINOMOTO YUKI; MIWA SHINOBU; SHIMADA HAJIME; MORI SHINICHIRO; NAKASHIMA YASUHIKO; TOMITA SHINJI
In this paper, we point out a problem in parallel volume rendering with commodity GPUs. Since rendering with the GPU and image composition with the CPU are processed independently, we can overlap these jobs. However, we cannot overlap them when a viewpoint is set intermittently. To utilize both the GPU and the CPU, we consider speculative rendering., 社団法人情報処理学会, 03 Aug. 2005, IPSJ SIG Notes, 2005, 80, 145-150, Japanese, 0919-6072, 200902292094461858, 110002775587, AN10096105

Instruction Scheduling for Low-Power Architecture with Slack Prediction
福山智久; 福田匡則; 三輪忍; 小西将人; 五島正裕; 中島康彦; 森真一郎; 富田真治
18 May 2005, 情報処理学会シンポジウム論文集, 2005, 5, 123-132, Japanese, 1344-0640, 200902220147223775
URL

Implementation of Asynchronous Graded State Machines through Learning
TSUDA Akihisa; NAGANO Takanobu; MIWA Shinobu; TSUMURA Tomoaki; GOSHIMA Masahiro; TOMITA Shinji
Recurrent Neural Networks (RNNs) can handle time series information and are classified as a powerful machine class, GSM. But it is difficult to lead RNNs to have continuous states and continuous transitions through learning. We proposed a way to build a GSM on an RNN through learning based on BPTT. We implemented a modulo counter on an RNN, which recognizes continuous input signals. We add an output-input edge with transmission delay to the RNN, and we apply EBPTT to a time-continuous RNN by declining to define teacher signals for particular terms., The Institute of Electronics, Information and Communication Engineers, 16 Aug. 2002, IEICE technical report. Computer systems, 102, 276, 59-64, Japanese, 0913-5685, 200902163414086169, 110003494006, AN10013141
URL

Implementation of a Sequential Circuit by Conductance-Based Neuron Model
TSUDA Akihisa; MIWA Shinobu; TSUMURA Tomoaki; GOSHIMA Masahiro; TOMITA Shinji
A Graded State Machine (GSM) is embodied by a Recurrent Neural Network (RNN). An RNN is generally constructed from neuron cells based on the formal neuron model. But the formal neuron model cannot handle time series information, so such an RNN is not strictly time-continuous. First, to find a neuron model that really can handle time series information, we made a comparative study of past neuron models and showed that a membrane potential-based neuron model had problems. Next, we proposed a new conductance-based neuron model. We prepared conductance-based neurons that behave like logic gates and implemented a sequential circuit with them. The circuit works time-continuously according to continuous input transitions., The Institute of Electronics, Information and Communication Engineers, 18 Jul. 2001, IEICE technical report. Computer systems, 101, 216, 31-38, Japanese, 0913-5685, 200902156871845346, 110003180648, AN10013141
URL

Implementation of GSM with Conductance-Based Neural Networks
MIWA Shinobu; TSUDA Akihisa; TSUMURA Tomoaki; GOSHIMA Masahiro; TOMITA Shinji
An RNN is a neural network with feedback loops and can handle time series information. It is thought that such a machine class, called GSM, is able to handle high-level information such as natural language. Generally, an RNN is constructed from neuron cells based on the formal neuron model. However, the formal neuron model cannot treat time series information by itself. This lack of ability would limit the capability of the whole RNN for time series information processing. We proposed a GSM with continuous states and continuous transitions. It can be implemented by an RNN constructed with conductance-based neurons. The conductance-based neuron model represents actual neuron behavior more faithfully than the formal neuron model. This model handles pulse series as inputs and outputs., The Institute of Electronics, Information and Communication Engineers, 18 Jul. 2001, IEICE technical report. Computer systems, 101, 216, 39-46, Japanese, 0913-5685, 200902153184407208, 110003180649, AN10013141
URL

Books and other publications

ロス・キニー論理回路
佐藤証; 三輪忍; 吉永努
Textbook, Japanese, Joint translation, 東京化学同人, 2021

Advanced Software Technologies for Post-Peta Scale Computing～The Japanese Post-Peta CREST Research Project～
M. Kondo; I. Miyoshi; K. Inoue; S. Miwa
Scholarly book, English, Contributor, Power Management Framework for Post-Petascale Supercomputers, Springer, 2018

Computer Systems - a Programmers' Perspective
Textbook, Japanese, Contributor, Translation for Sections 6.5-6.7 and 9.1-9.7, Maruzen Publishing, 2018

IT研究者のひらめき本棚～ビブリオ・トーク：私のオススメ
General book, Japanese, Contributor, 近代科学社, 2017

Lectures, oral presentations, etc.

Design Challenges of CNFET Processors
S. Miwa
Invited oral presentation, ARCHIDE: Workshop on Architecture Design Methodologies and Ecosystems for HPC and Scientific Edge Computing, Invited
Aug. 2024

高帯域幅メモリを有するプロセッサにおけるデータプリフェッチャの性能分析
滕林; 三輪忍; 塩谷亮太; 八巻隼人; 本多弘樹
Oral presentation, 情報処理学会ARC研究会
Aug. 2023

マルチパスルーティングにおけるINTを応用した帯域要求量ベースの動的トラフィック分散
佐藤翔; 荒巻慎太郎; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, 情報処理学会IOT研究会
Jul. 2023

IP網におけるIn-networkコンテンツキャッシュ
大河原幸哉; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, 情報処理学会IOT研究会
Jul. 2023

検査対象の種類ごとに特化したSnortを複数用いたソフトウェア侵入検知システムの並列化
小倉快将; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, 情報処理学会ARC研究会
May 2023

処理性能の異なる機器を複数台用いた並列NIDSに対するロードバランサ
八巻隼人; 三輪忍; 本多弘樹
Oral presentation, 電子情報通信学会CPSY研究会
May 2023

CNFET7: An Open Source Cell Library for 7-nm CNFET Technology
C. Shi; S. Miwa; T. Yang; R. Shioya; H. Yamaki; H. Honda
Oral presentation, Japanese, 電子情報通信学会VLD研究会
02 Mar. 2023

ソフトウェアベース電力サイドチャネル攻撃の対抗策の評価
下島航太; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会CPSY研究会
Mar. 2023

複数パターン長を有するマルチパターンマッチングにおけるラビン-カープ法のハッシュ関数最適化
鈴木想生; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会CPSY研究会
Mar. 2023

GPUサーバにおける画像認識を行う深層学習の性能モデリング
松下哲也; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会CPSY研究会
Mar. 2023

リンク集約におけるトラフィック負荷分散方式の検討
平野愁也; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 第244回ARC研究会
Mar. 2023

並列アプリケーションのキャッシュミス数予測の評価
長谷川健人; 有馬海人; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第188回HPC研究会
Mar. 2023

A64FXプロセッサにおける電力・性能ばらつきの評価・分析
草場智也; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第188回HPC研究会
Mar. 2023

実HPCアプリケーションを用いたマルチGPUにおける電力ばらつきの評価
郡司賢; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第188回HPC研究会
Mar. 2023

最長一致検索に対応する非TCAMキャッシュによるルータ宛先検索の高速化・省電力化
長田大樹; 八巻隼人; 三輪忍; 本多弘樹; 五島正裕
Oral presentation, Japanese, 第244回ARC研究会
Mar. 2023

SRAM の電力/遅延シミュレータCACTIのCNFETへの対応
関川栄一郎; 三輪忍; ヨウドウキン; 塩谷亮太; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第241回ARC研究会, Domestic conference
Jul. 2022

CPUおよびGPUの電力ばらつきを考慮したジョブスケジューリング手法の提案
小野賢人; 吉田幸平; 三輪忍; 坂本龍一; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第185回HPC研究会, Domestic conference
Jul. 2022

In-band Network Telemetryによるリンク混雑度に応じたマルチパス経路制御
荒巻慎太朗; 田中京介; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会NS研究会, Domestic conference
May 2022

Evaluation of Microprocessors Placed-and-Routed with CNFET
C. Shi; K. Sasaki; S. Miwa; T. Yang; R. Shioya; H. Yamaki; H. Honda
Oral presentation, Japanese, 第240回ARC研究会, Domestic conference
Mar. 2022

CUDAバージョンの違いがカーネルの実行時間と消費電力に与える影響の分析
吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第183回HPC研究会, Domestic conference
Mar. 2022

マルウェア解析のための高速かつ安全なVMI機構
森瑞穂; 味曽野雅史; 八巻隼人; 三輪忍; 本多弘樹; 品川高廣
Oral presentation, Japanese, コンピュータシステム・シンポジウム (ComSys'21), Domestic conference
2021

Wisteria/BDEC-01におけるNVIDIA A100 GPUの電力性能ばらつきの評価
提山春日; 吉田幸平; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第182回HPC研究会, Domestic conference
2021

深層学習における実行時ファイルステージング
樋口遼太郎; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第182回HPC研究会, Domestic conference
2021

MPIにおける小規模実行時の通信トレース解析による大規模実行時の通信タイミング予測の評価
岡田悠希; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第182回HPC研究会, Domestic conference
2021

テーブル分離パケット処理キャッシュを用いたルータテーブル検索の高効率化
長田大樹; 田中京介; 八巻隼人; 三輪忍; 本多弘樹; 五島正裕
Oral presentation, Japanese, The 5th cross-disciplinary Workshop on Computing Systems, Infrastructures, and Programming (xSIG2021), Domestic conference
2021

カーボンナノチューブトランジスタを用いて論理合成したプロセッサの電力／面積／回路遅延評価
佐々木魁; 三輪忍; ヨウドウキン; 塩谷亮太; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第237回ARC研究会, Domestic conference
2021

MPIアプリケーションの関数コール回数予測
有馬海人; 長谷川健人; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第178回HPC研究会, Domestic conference
2021

MPIアプリケーションのキャッシュプロファイル予測
長谷川健人; 有馬海人; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第178回HPC研究会, Domestic conference
2021

TensorFlowアプリケーション用GPUサーバにおけるNVDIMMの利用可能性の検討
松下哲也; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第236回ARC研究会, Domestic conference
2021

Mesh TensorFlowを用いたモデル並列学習におけるCPU-GPU間のデータ転送最適化
横手宥則; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会CPSY研究会, Domestic conference
2021

Routing/ARP/ACL/QoSごとのテーブル分離パケット処理キャッシュ
長田大樹; 田中京介; 八巻隼人; 三輪忍; 本多弘樹; 五島正裕
Oral presentation, Japanese, 第236回ARC研究会, Domestic conference
2021

ネットワーク機器における高速なGZIP復号のためのキャッシュ利用効率向上手法
黒川雄亮; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会CPSY研究会, Domestic conference
2020

動画トラフィック検査除外手法のSnortにおける実装
祐野雅範; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会CPSY研究会, Domestic conference
2020

多頻度・順不同で到着するシーケンスデータの主キーごとの処理順序制約を満たすリアルタイム並列処理手法
山添高弘; 三輪忍; 本多弘樹
Oral presentation, Japanese, 第169回DBS研究会, Domestic conference
2019

TSUBAME3.0における製造ばらつきを考慮したGPUの電力モデリングの高速化
大八木哲哉; 浅田風太; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第172回HPC研究会, Domestic conference
2019

OpenFlowを用いた動画フローの非ミラーリングによるNIDS処理負荷の削減
高倉玲央; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会IA研究会, Domestic conference
2019

テーブル検索回数の削減によるインターネットルータの高スループット化および省電力化
山下壮樹; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会IA研究会, Domestic conference
2019

キャッシュを利用したOpenFlow通信の高速化
祐野雅範; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 2019年電子情報通信学会総合大会, Domestic conference
2019

学習済み重みを利用した畳み込みニューラルネットワークの学習法の初期検討
横手宥則; 三輪忍; 井内悠太; 津邑公暁; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 2019年電子情報通信学会総合大会, Domestic conference
2019

ネットワークベースの攻撃に対応可能な高対話型ハニーポット
森瑞穂; 本多弘樹; 八巻隼人; 三輪忍
Oral presentation, Japanese, 2019年電子情報通信学会総合大会, Domestic conference
2019

GPUの電力ばらつきモデリング
浅田風太; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 2019年電子情報通信学会総合大会, Domestic conference
2019

ネットワーク機器上における高速なGZIP復号のためのキャッシュ利用効率向上手法の提案
黒川雄亮; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 2019年電子情報通信学会総合大会, Domestic conference
2019

パケット処理キャッシュにおけるパイプライン化とマルチポート化の評価
田中京介; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 第229回ARC研究会, Domestic conference
2019

DFS/DCT 制御による電力あたり性能の実行時最適化
三吉郁夫; 三輪忍; 井上弘士; 近藤正章
Oral presentation, Japanese, 第163回HPC研究会, Domestic conference
2018

ON/OFFリンクにおける通信開始遅延を低減するためのプリウェイクアップ手法の提案
松山朋樹; 三輪忍; 八巻隼人; 本多弘樹
Invited oral presentation, Japanese, 情報処理学会第80回全国大会, Domestic conference
2018

ゲートウェイにおける攻撃パケットに着目したテーブル検索負荷削減手法の提案
愛甲達也; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 第222回ARC研究会, Domestic conference
2018

HSPICEを用いたシリコン回路とカーボンナノチューブ回路の比較評価
松尾駿; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第222回ARC研究会, Domestic conference
2018

高電力効率なCNNアクセラレータ実現に向けたカーネルクラスタリングの応用の検討
進藤智司; 松井優樹; 八巻隼人; 津邑公暁; 三輪忍
Oral presentation, Japanese, 第222回ARC研究会, Domestic conference
2018

CNN計算の省メモリ化のためのカーネル・クラスタリング手法の検討
松井優樹; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会コンピュータシステム研究会, Domestic conference
2018

NVDIMMを用いたメモリスナップショットの解析システム
三須雅仁; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 電子情報通信学会コンピュータシステム研究会, Domestic conference
2018

プリウェイクアップ手法によるON/OFFリンクの消費エネルギー削減
松山朋樹; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第165回HPC研究会, Domestic conference
2018

1Tbps実現に向けたルータのメモリ階層の最適化
田中京介; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, 第225回ARC研究会, Domestic conference
2018

ジョブ実行中の計算ノードにおけるDIMM待機電力削減手法の実装と評価
石原雅也; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, 第158回HPC研究会, Domestic conference
2017

マルチコアニューラルネットワークアクセラレータにおけるデータ転送のブロードキャスト化
大場百香; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
Oral presentation, Japanese, ETNET, Domestic conference
2017

パケット処理キャッシュにおける送信元IPアドレスに着目したミス削減手法に関する初期検討
八巻隼人; 愛甲達也; 三輪忍; 本多弘樹
Oral presentation, Japanese, HotSpa, Domestic conference
2017

高電力効率なCNNアクセラレータ実現に向けたカーネルクラスタリングの応用の検討
進藤智司; 松井優樹; 八巻隼人; 津邑公暁; 三輪忍
Oral presentation, Japanese, SWoPP, Domestic conference
2017

動画トラフィックに着目したNIDSにおける文字列探索処理負荷削減手法の提案
高徳真晴; 八巻隼人; 三輪忍; 本多弘樹
Oral presentation, Japanese, SWoPP, Domestic conference
2017

電力性能推定を目的としたインターコネクト・シミュレータTraceRPの開発
小野貴継; 垣深悠太; 三輪忍; 井上弘士
Oral presentation, Japanese, 第161回HPC研究会, Domestic conference
2017

Energy-efficient computers by increasing hardware
S. Miwa
Invited oral presentation, Japanese, IPSJ-ONE, Domestic conference
12 Mar. 2016

ニューラルネットワークアクセラレータにおけるコア間通信量最小化のためのタスク配置手法
進藤智司; 大場百香; 津邑公暁; 三輪忍
Oral presentation, Japanese, SWoPP, Domestic conference
2016

再構成可能なニューラルネットワークアクセラレータの提案と性能分析
大場百香; 三輪忍; 進藤智司; 津邑公暁; 八巻隼人; 本多弘樹
Oral presentation, Japanese, SWoPP, Domestic conference
2016

ヘテロジニアス・プロセッサの設計探索手法の初期検討
澁谷俊憲; 三輪忍; 塩谷亮太; 佐々木広; 八巻隼人; 本多弘樹
Oral presentation, Japanese, SWoPP, Domestic conference
2016

メモリホットプラグを用いたメインメモリの省電力化に関する初期検討
石原雅也; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, SWoPP, Domestic conference
2016

リンクオフスレッショルドを有するON/OFFリンクの電力見積手法の初期検討
西郷雄斗; 三輪忍; 八巻隼人; 本多弘樹
Oral presentation, Japanese, SWoPP, Domestic conference
2016

Initial Study of Usage of Large LLC for Reduction of TLB Miss Penalty
E. Arima; S. Miwa; T. Nakada; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2015

Initial Study of Power Gating Considering Operand Values on Function Units
Y. Ishikawa; A. Koshiba; R. Sakamoto; Y. Wada; S. Miwa; M. Kondo; M. Namiki; H. Honda
Poster presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2015

Area-efficient microarchitecture for Reinforcement of Turbo Mode and its Design
Shinobu Miwa; Takara Inoue; Hiroshi Nakamura
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2014

Initial Study of Processor Architecture in the Dark Silicon Era
Shinobu Miwa; Ryota Shioya; Hiroshi Sasaki
Oral presentation, Japanese, SWoPP, Domestic conference
2014

Processor Architecture for Improved Energy Efficiency by Hardware Overprovisioning
Shinobu Miwa; Ryota Shioya; Hiroshi Sasaki
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2014

Power Shifting between Networks and CPUs in HPC Systems
Shinobu Miwa; Hiroshi Nakamura
Invited oral presentation, English, JST/CREST International Symposium on Post Petascale System Software, Invited, International conference
2014

Techniques of Power Saving for Processors in the Dark Silicon Era
Shinobu Miwa
Invited oral presentation, Japanese, Embedded System Symposium 2014, Invited, Domestic conference
2014

Power Management of Peripheral Circuits of STT-MRAM Caches Considering Locality of Cache Accesses
E. Arima; H. Noguchi; T. Nakada; S. Miwa; S. Takeda; S. Fujita; H. Nakamura
Oral presentation, Japanese, SWoPP, Domestic conference
2014

Implementation and Evaluation of Dalvik Accelerator with FPGA
Y. Oigo; D. Yoshizane; A. Ohta; S. Miwa; H. Nakajo
Oral presentation, Japanese, IPSJ SIG-EMB, Domestic conference
2014

Improving Performance of Power-constrained HPC Systems with Energy Storage Devices
T. Sakai; T. Komoda; S. Miwa; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-HPC, Domestic conference
2014

Improving Performance of Power-constrained HPC Systems by Increase/Decrease of Physical Memories
R. Yonezawa; S. Aita; S. Miwa; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-HPC, Domestic conference
2014

Load-balancing-aware CPU DVFS Control under Power Constraints
S. Aita; S. Miwa; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-HPC, Domestic conference
2014

Power Management Framework for Post-Petascale Supercomputers
M. Kondo; T. CAO; Y. He; Y. Wada; H. Honda; I. Miyoshi; Y. Inadomi; K. Fukazawa; K. Inoue; S. Miwa; H. Nakamura
Poster presentation, English, JST/CREST International Symposium on Post Petascale System Software, Invited, International conference
2014

Power Management of Periodic Execution Systems Focusing on Temporal Data
T. Shigematsu; T. Komoda; T. Nakada; S. Miwa; Y. Sato; H. Ueki; M. Hayashikoshi; T. Shimizu; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-EMB, Domestic conference
2013

Coordinating Power of CPUs and Networks in Power-constrained Systems
S. Aita; S. Miwa; H. Nakamura
Oral presentation, Japanese, SWoPP, Domestic conference
2013

Performance Improvement of Superscalar Processors by ALU Rotation
T. Inoue; S. Miwa; T. Nakada; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2013

Initial Study of Cache-aware Thread Scheduling on CMPs
Shinobu Miwa; Koichi Sumizaki; Hiroshi Sasaki; Hiroshi Nakamura
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2012

Initial Study of Power Saving Methods for Interconnection Controllers in FX10
Shinobu Miwa; Sho Aita; Yuichiro Ajima; Toshiyuki Shimizu; Akira Asato; Hiroshi Nakamura
Oral presentation, Japanese, HOKKE, Domestic conference
2012

Comparison of Leakage Reduction Techniques for Last Level Caches under OS Power Management
H. Arima; T. Komoda; S. Miwa; H. Noguchi; K. Nomura; K. Abe; S. Fujita; H. Nakamura
Oral presentation, Japanese, The 25th Workshop on Circuits and Systems, Domestic conference
2012

A Small Area and High Throughput Processor with Bypass-dedicated ALU
K. Saito; S. Miwa; H. Nakajo
Oral presentation, Japanese, HOKKE, Domestic conference
2012

Fast Cache Simulation for Design of Many-core Processors with NoC
T. Nakada; S. Miwa; H. Nakamura
Oral presentation, Japanese, HOKKE, Domestic conference
2012

Dynamic Power Management of IT Devices Considering User Comfort
N. Iwasawa; T. Komoda; S. Miwa; T. Nakada; H. Nakamura
Oral presentation, Japanese, FIT, Domestic conference
2012

Initial Study of Power-aware Task Scheduling on Periodic Execution Systems
K. Okamoto; T. Komoda; T. Nakada; S. Miwa; Y. Sato; H. Ueki; M. Hayashikoshi; T. Shimizu; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-EMB, Domestic conference
2012

Activity Migration on Register Files and Execution Units
T. Inoue; S. Miwa; T. Nakada; H. Nakamura
Oral presentation, Japanese, SWoPP, Domestic conference
2012

Preliminary Evaluation of Instruction-group-aware Cache Partitioning
K. Asami; N. Kurata; R. Shioya; S. Miwa; M. Goshima; S. Sakai
Oral presentation, Japanese, SWoPP, Domestic conference
2012

Implementation of Reduction of Performance Penalty Caused by Cache Power-off in Idle States
E. Arima; T. Komoda; S. Miwa; H. Nakamura
Oral presentation, Japanese, SWoPP, Domestic conference
2012

Study of Data Prefetching Method for Communications between CPU and GPU
T. Komoda; S. Miwa; H. Nakamura
Oral presentation, Japanese, SWoPP, Domestic conference
2012

Cache Management Focusing on Working Sets of Instruction Groups
K. Asami; N. Kurata; R. Shioya; S. Miwa; M. Goshima; S. Sakai
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2012

Study of Improvement of Cache Efficiency
K. Asami; N. Kurata; R. Shioya; S. Miwa; M. Goshima; S. Sakai
Oral presentation, Japanese, The 74th National Convention of IPSJ, Domestic conference
2012

Performance Penalty of Cache Power-off in Idle States and Its Reduction
E. Arima; T. Komoda; S. Miwa; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2012

Cache Power Gating Using Data Retentiveness
K. Kim; S. Takeda; S. Miwa; H. Nakamura
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2012

Selective Cache Line Allocation with Load/Store Instruction Addresses
Y. Horibe; S. Miwa; R. Shioya; M. Goshima; H. Nakajo
Oral presentation, Japanese, The 2011 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2011

Sleep Depth Controlling for Run-Time Leakage Power Saving
S. Takeda; S. Miwa; H. Nakamura
Oral presentation, Japanese, IEICE-ICD, Domestic conference
2011

Thread Scheduling on Many-core Processors Considering Contention and Scalability
T. Tanimoto; H. Sasaki; S. Miwa; H. Nakamura
Oral presentation, Japanese, HOKKE, Domestic conference
2011

Initial Study of API for Pipeline Parallel Programming with OpenCL
T. Komoda; S. Miwa; H. Nakamura
Oral presentation, Japanese, HOKKE, Domestic conference
2011

A Thread Migration Method on CMP with Data Migration Methods
K. Sumizaki; H. Sasaki; S. Miwa; H. Nakamura
Oral presentation, Japanese, IEICE-CPSY, Domestic conference
2011

Data Compression on Last Level Cache for Reducing Hardware Amount
H. Yokoyama; Y. Horibe; S. Miwa; H. Nakajo
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2011

Area-efficient Register Map Table Using Small Caches
Shinobu Miwa; Peng Zhang; Hiroki Yokoyama; Yuhei Horibe; Hironori Nakajo
Oral presentation, Japanese, The 2010 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2010

Dalvik Accelerator: a Mechanism for Fast Java Execution on Android Devices
A. Ohta; S. Miwa; H. Nakajo
Oral presentation, Japanese, Embedded System Symposium 2010, Domestic conference
2010

Parallelizing Hilbert-Huang Transform and its Implementation on GPU
Pulung Waskito; Shinobu Miwa; Yasue Mitsukura; Hironori Nakajo
Oral presentation, Japanese, The 2010 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2010

Improvement of Cache Efficiency by Selective Cache Line Allocation
Y. Horibe; S. Miwa; R. Shioya; M. Goshima; H. Nakajo
Oral presentation, Japanese, The 2010 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2010

Evaluation Environment with a MIPS Simulator for Dalvik Accelerator
A. Ohta; T. Motegi; S. Miwa; H. Nakajo
Oral presentation, Japanese, The 2010 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2010

Selective Cache Allocation: Efficient Cache Management in a Multi-threaded Environment
Y. Horibe; S. Miwa; R. Shioya; M. Goshima; H. Nakajo
Oral presentation, Japanese, SWoPP, Domestic conference
2010

Accelerating Hilbert-Huang Transform using GPU
P. Waskito; S. Miwa; Y. Mitsukura; H. Nakajo
Oral presentation, English, SWoPP, Domestic conference
2010

Horn Extraction with Empirical Mode Decomposition in Noisy Environment
M. Nakanishi; Y. Mitsukura; T. Tanaka; S. Miwa; H. Nakajo
Oral presentation, English, IEEJ, Domestic conference
2010

Fast Instruction Supply Method Using Scheduled Instruction Cache
Shinobu Miwa; Hironori Nakajo
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2009

Dynamic Switching of Accessing L1/L2 Caches on an SMT Processor
Y. Ogasawara; S. Miwa; H. Nakajo
Oral presentation, Japanese, The 2009 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2009

Hardware prefetching to achieve high accuracy according to memory access patterns
Y. Horibe; P. Zheng; Y. Ogasawara; S. Miwa; H. Nakajo
Oral presentation, Japanese, HOKKE, Domestic conference
2009

Feasibility of an embedded virtual machine under parallel or distributed processing environment
H. Yano; M. Nakanishi; S. Miwa; H. Nakajo
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2009

Deterministic Branch Filter Mechanism for Improving Branch Prediction Accuracy
Shinobu Miwa; Hironori Nakajo
Oral presentation, Japanese, SWoPP, Domestic conference
2008

Branch Prediction with Compressed Path Information
Shinobu Miwa; Hironori Nakajo
Oral presentation, Japanese, The 2008 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2008

Implementation of a Multi SMT Processor on FPGA
Y. Ogasawara; I. Tate; S. Miwa; H. Nakajo
Oral presentation, Japanese, The 2008 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2008

An Instruction Scheduler for ALU Cascading
K. Ogata; J. Yao; H. Shimada; S. Miwa; S. Tomita
Oral presentation, Japanese, The 2008 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2008

Branch Target Predictor Utilizing Context Base Value Predictor
T. Hirashima; H. Shimada; S. Miwa; S. Tomita
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2008

Low-complexity Bypass Networks Using Small RAM
Shinobu Miwa; Hironori Ichibayashi; Hidetsugu Irie; Masahiro Goshima; Shinji Tomita
Oral presentation, Japanese, The 2007 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2007

Development of Parallel Volume Rendering Accelerator VisA and its Preliminary Implementation
T. Kawahara; S. Miwa; H. Shimada; S. Mori; S. Tomita
Oral presentation, Japanese, IEICE-RECONF, Domestic conference
2007

Study of Haptic Display Model for Interactive Fluid Simulators
A. Yamaguchi; S. Miwa; H. Shimada; S. Mori; S. Tomita
Oral presentation, Japanese, VRSJ, Domestic conference
2007

A Digital Appliance Architecture Which Increases User Side Robustness
H. Shimada; S. Miwa; S. Tomita
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2007

A High-performance Surgery Simulator with a PC Cluster Considering Continuity of Surgery Operations
Y. Noda; S. Yorifuji; S. Miwa; N. Kume; H. Shimada; S. Mori; S. Tomita
Oral presentation, Japanese, VRSJ, Domestic conference
2007

Selective Instruction Re-Issue Mechanism using Bit Vector
H. Shimada; S. Miwa; S. Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2007

Mounting of remote-controlled framework in interactive simulation
K. Hashimoto; H. Shimada; S. Miwa; Y. Hatabu; S. Mori; S. Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2007

High-Speed Calculation on a Surgical Simulator Considered the Sequence of Operation
S. Yorifuji; Y. Noda; T. Yoshida; N. Kume; S. Miwa; H. Shimada; S. Mori; S. Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2007

The Dynamic Instruction Scheduler for ALU Cascading
K. Ogata; J. Yao; S. Miwa; H. Shimada; S. Tomita
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2007

Branch Filter Mechanism with Path Information
Shinobu Miwa; Tomohisa Fukuyama; Hajime Shimada; Masahiro Goshima; Yasuhiko Nakashima; Shinichiro Mori; Shinji Tomita
Oral presentation, Japanese, The 2006 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2006

Instruction Steering for Clustered Superscalar Processor with Slack Prediction
T. Fukuyama; S. Miwa; H. Shimada; M. Goshima; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2006

Implementation of High-performance Low-latency Data Communications with DVI-D and its Application to Parallel Image Composition
T. Kawahara; Y. Noda; S. Miwa; H. Shimada; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, The 2006 Kansai-section Convention of IPSJ, Domestic conference
2006

Preliminary Evaluation of a High-performance Surgery Simulator with Conjugate Gradient Method
Y. Noda; T. Yoshida; S. Miwa; H. Shimada; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, The 2006 Kansai-section Convention of IPSJ, Domestic conference
2006

Three Quads: A Versatile Interconnection Network for Medium Scale Commodity Cluster
T. Yoshimura; S. Miwa; H. Shimada; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, IPSJ SIG-ARC, Domestic conference
2006

Parallel Volume Rendering with Ultra High Performance Simplex Communication Links Based on DVI
D. Okamura; Y. Noda; S. Miwa; H. Shimada; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, IPSJ SIG-SLDM, Domestic conference
2006

Instruction Scheduling with Slack Prediction for Low Power Architecture
T. Fukuyama; M. Fukuda; S. Miwa; M. Konishi; M. Goshima; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, The 2005 Symposium on Advanced Computing Systems and Infrastructures, Domestic conference
2005

Improvement of Texture Accesses for Volume Rendering with General-purpose GPU
Y. Shinomoto; S. Miwa; H. Shimada; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, The 2005 Kansai-section Convention of IPSJ, Domestic conference
2005

Consideration for Speculative Rendering in PVR
Y. Shinomoto; S. Miwa; H. Shimada; Y. Nakashima; S. Mori; S. Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2005

Learning of Robot Navigation Tasks with Recurrent Neural Networks
Shinobu Miwa; Takanobu Nagano; Masahiro Goshima; Yasuhiko Nakashima; Shinji Tomita
Oral presentation, Japanese, The 2003 Kansai-section Convention of IPSJ, Domestic conference
2003

Implementation of Asynchronous Graded State Machines through Learning
Akihisa Tsuda; Shinobu Miwa; Tomoaki Tsumura; Masahiro Goshima; Shinji Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2002

Implementation of GSM with Conductance-Based Neural Networks
Shinobu Miwa; Akihisa Tsuda; Tomoaki Tsumura; Masahiro Goshima; Shinji Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2001

Implementation of a Sequential Circuit by Conductance-Based Neuron Model
Akihisa Tsuda; Shinobu Miwa; Tomoaki Tsumura; Masahiro Goshima; Shinji Tomita
Oral presentation, Japanese, SWoPP, Domestic conference
2001

Neural Network Simulation for Monitoring Memory Architecture
T. Tsumura; S. Miwa; M. Goshima; S. Tomita
Oral presentation, Japanese, SICE System, Domestic conference
2000

Courses

Mathematical Information and Computer Science Laboratory ⅡA/B
Apr. 2024 - Present
The University of Electro-Communications

Logic Circuit Design
2018 - Present
The University of Electro-Communications

Parallel Processing II
2016 - Present
The University of Electro-Communications

Mathematical Information and Computer Science Laboratory ⅡA/B
2018 - Mar. 2023
The University of Electro-Communications

Graduate Technical English
2019 - 2021
The University of Electro-Communications

High Performance Computing
2015 - 2017
The University of Electro-Communications

Mathematical Information and Computer Science Laboratory I
2016 - 2016
The University of Electro-Communications

Elements of Information Systems Fundamentals
2015 - 2016
The University of Electro-Communications

Affiliated academic society

IPSJ

IEICE

IEEE

Research Themes

A Study of Wire-Aware Processor Architecture and its Automatic Generation for Beyond CMOS
Shinobu Miwa; Ryota Shioya
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Scientific Research (B), 24K02913
01 Apr. 2024 - 31 Mar. 2028

Optimizing I/O Performance for Foundation Model Training using Hierarchical Storage
Kento Sato; Shinobu Miwa
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, Institute of Physical and Chemical Research, Grant-in-Aid for Scientific Research (C), 24K14974
01 Apr. 2024 - 31 Mar. 2027

Production of Memory-Bandwidth-Centric Computing
Shinobu Miwa; Ryota Shioya
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Challenging Research (Exploratory), 23K18461
Jun. 2023 - Mar. 2026

Development of Innovative Frameworks for Application Analysis in Post-Peta Scale Systems
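Shinobu Miwa
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (B), The University of Electro-Communications, Grant-in-Aid for Scientific Research (B), In FY2021 we developed prediction techniques for profiles and for traces. The status of each is summarized below.
For profile prediction, we focused on two kinds of runtime information contained in a profile, function call counts and cache miss counts, and developed a method to predict each of them. Specifically, we fit a model of the runtime information to profiles collected with a small number of cores and a small problem size, and then use the fitted model to predict the same runtime information for a run of the program with a large number of cores and a large problem size. As the prediction model, we developed a new model that combines multiple functions, such as linear and exponential functions. An evaluation with NPB confirmed that the method predicts with high accuracy when the results for a large problem size and a high degree of parallelism are predicted from runs with a small problem size and a low degree of parallelism. We also confirmed that the proposed method greatly reduces the cost of collecting profiles. TSUBAME3.0 was used for the experiments.
For trace prediction, we developed a timestamp prediction technique. Specifically, we analyzed the timestamp prediction method proposed in prior work and showed that its accuracy degrades for applications in which the number of calls to communication functions depends on the problem size as well as on the degree of parallelism. To solve this problem, we devised a new model that predicts the number of calls to communication functions from both the degree of parallelism and the problem size. Timestamp prediction with this model achieved higher accuracy than the prior method for applications in which the number of calls to communication functions also depends on the problem size., 20H04193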
01 Apr. 2020 - 31 Mar. 2024
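
The profile prediction idea summarized in the entry above (fit a model on small runs, then extrapolate to large ones) can be illustrated with a minimal sketch. It is not the project's actual model: it assumes only two candidate functions (linear and exponential), a single input (core count, ignoring problem size), selection by squared error on the small runs, and made-up measurements. All function and variable names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_candidates(cores, counts):
    """Fit each hypothetical candidate model; returns {name: predictor}."""
    a, b = fit_line(cores, counts)
    models = {"linear": lambda p: a + b * p}
    if all(c > 0 for c in counts):
        # y = c0 * exp(k * p) becomes a straight line after taking log(y).
        log_c0, k = fit_line(cores, [math.log(c) for c in counts])
        models["exponential"] = lambda p: math.exp(log_c0 + k * p)
    return models


def select_model(cores, counts):
    """Pick the candidate with the smallest squared error on the small runs."""
    models = fit_candidates(cores, counts)

    def sse(predict):
        return sum((predict(p) - y) ** 2 for p, y in zip(cores, counts))

    best = min(models, key=lambda name: sse(models[name]))
    return best, models[best]


# Made-up measurements from small runs (not data from the project).
small_cores = [1, 2, 4, 8]
call_counts = [1250, 1500, 2000, 3000]  # follows 1000 + 250 * cores

name, predict = select_model(small_cores, call_counts)
print(name, round(predict(64)))  # extrapolate to a 64-core run
```

With these illustrative numbers the linear candidate is selected, and the prediction for 64 cores is obtained without ever running at that scale, which is the source of the cost reduction the summary mentions.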

Extension of an Innovative Application Analysis Infrastructure for Post-Peta Scale
Shinobu Miwa
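Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research Fund for the Promotion of Joint International Research (Fostering Joint International Research (A)), The University of Electro-Communications, Fund for the Promotion of Joint International Research (Fostering Joint International Research (A)), Principal investigator, 22KK0182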
2022 - 2024

Resource Managers in Next-Generation Massively Parallel Environments
KDDI Foundation, Research Grant
Apr. 2020 - Mar. 2023

A Study of Ultrascaled Nanocarbon Processor Architecture
Miwa Shinobu
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research Grant-in-Aid for Challenging Research (Exploratory), The University of Electro-Communications, Grant-in-Aid for Challenging Research (Exploratory), The main contributions of this study are twofold. First, I successfully reproduced a development environment for CNFET processors that is substantially close to the environment used in previous work. This enables us to develop and evaluate architectures of CNFET processors. Second, using this environment, I uncovered the impact of CNFET on the performance, power consumption, and area of processors. This analysis provides a valuable guideline for optimizing processor architecture for CNFET., 18K19778
29 Jun. 2018 - 31 Mar. 2022

A Framework to Support Use of Trusted Execution Environment in High Performance Computing
JST, Principal investigator, Domestic joint research
2022

A Study of Performance and Memory Models of AI Applications on High Performance Computing Systems
Shinobu Miwa
KIOXIA, Principal investigator, Domestic joint research
2019 - 2021

A Study of Profile Prediction for MPI Parallel Applications Executed on Massively Parallel Processing Environments
Shinobu Miwa
Kayamori Foundation of Informational Science Advancement, Research Grant, Principal investigator
Nov. 2018 - Oct. 2020

Development of Frameworks of Power Management for Post Petascale Systems
Masaaki Kondo
JST, Coinvestigator, Domestic joint research
2012 - 2018

Development of Fundamental Technologies of Normally-off Computing
Shinobu Miwa
TOSHIBA, Principal investigator, Domestic joint research
2015 - 2016

Computation Model of Spatiotemporal Data Control
NAKAMURA HIROSHI; MIWA Shinobu
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research Grant-in-Aid for Challenging Exploratory Research, The University of Tokyo, Grant-in-Aid for Challenging Exploratory Research, Coinvestigator not use grants, Bottlenecks in both performance and power lie not in arithmetic computation but in memory access and in data transfer between memory and arithmetic units. To overcome this problem, a new computation model, Spatiotemporal Data Control, is proposed. This model can explicitly specify both the timing of data movement and computation and the physical location of data. This research also investigated how to optimize execution by using this new model. The proposed method is applied to a wide variety of computing systems, including three-dimensional integrated VLSIs, high-performance server systems, and sensor network systems. The preliminary experimental results show that the proposed method can improve performance or reduce power consumption., 25540018
Apr. 2013 - Mar. 2015

A Study of Heat-Spreading-Aware Processor
MIWA Shinobu
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research Grant-in-Aid for Young Scientists (B), The University of Tokyo, Grant-in-Aid for Young Scientists (B), Principal investigator, The performance of modern microprocessors is often constrained by chip temperature. Therefore, we studied a method that reduces chip temperature in order to improve microprocessor performance. We focused on Turbo Mode, which is widely used in modern microprocessors. Our technique uses Activity Migration to raise the maximum CPU frequency during Turbo Mode. We developed a method for applying Activity Migration to a microprocessor at fine spatial granularity, together with its design methodology. We evaluated our technique with a combination of simulators. The results show that our technique achieves a 14.5% performance improvement in exchange for a 2.8% area increase., 24700044
01 Apr. 2012 - 31 Mar. 2014

Industrial Property Rights

Router
Patent right, Y. He, S. Miwa, H. Nakamura, 111244, Date applied: 2013

Translator and Translation Method
Patent right, S. Miwa, A. Ohta, H. Nakajo, 234673, Date applied: 2010

Academic Contribution Activities

The 25th IEEE International Symposium on Cluster, Cloud and Internet Computing
Competition etc, Others, Dec. 2024 - May 2025