
Toru NAKASHIKA
Associate Professor, Department of Computer and Network Engineering
Associate Professor, Cluster I (Informatics and Computer Engineering)
Researcher Information
Educational Background
- 01 Apr. 2011 - 25 Sep. 2014
Kobe University, Graduate School of System Informatics, Department of Information Science
- 01 Apr. 2009 - 25 Mar. 2011
Kobe University, Graduate School of Engineering, Department of Computer Science and Systems Engineering
- 01 Apr. 2005 - 25 Mar. 2009
Kobe University, Faculty of Engineering, Department of Computer Science and Systems Engineering
- 25 Mar. 2005
Shiga Prefectural Zeze High School, general course
Member History
- Apr. 2023 - Present
Steering committee member, IEICE Technical Committee on Speech (SP)
- Apr. 2023 - Present
Steering committee member, IPSJ Special Interest Group on Spoken Language Processing (SIG-SLP)
- Feb. 2021 - Present
Technical Program Committee (TPC), Interspeech, ICASSP
- Apr. 2023 - Jun. 2025
Member, Journal Subcommittee of the Editorial Committee, Acoustical Society of Japan
- Apr. 2021 - Jun. 2025
Member, Public Relations and Digitization Committee, Acoustical Society of Japan
- Apr. 2023 - May 2025
Reviewer, Editorial Committee, Acoustical Society of Japan
- Apr. 2021 - Mar. 2023
Assistant secretary, ASJ Technical Committee on Speech Communication
- Apr. 2021 - Mar. 2023
Secretary, IPSJ Special Interest Group on Spoken Language Processing (SIG-SLP)
- Apr. 2021 - Mar. 2023
Assistant secretary, IEICE Technical Committee on Speech (SP)
- Feb. 2020 - Mar. 2020
Technical Program Committee (TPC), IJCAI-PRICAI 2020
- 15 Mar. 2018 - Present
Technical Program Committee (TPC), Interspeech
- Feb. 2018 - Mar. 2018
Technical Program Committee (TPC), ACM International Conference on Multimedia Retrieval (ICMR) 2018
Research Activity Information
Award
- Mar. 2024
IEICE Technical Committee on Speech (SP)
An investigation of speech recovery from EEG signals using Transformer (in Japanese)
SP Student Poster Award, Tomoaki Mizuno; Takuya Kishida; Natsue Yoshimura; Toru Nakashika
- Mar. 2024
Acoustical Society of Japan
Award for Contribution to Society Activities, Toru Nakashika
- Jun. 2021
IEICE Technical Committee on Speech (SP)
Unknown-speaker voice conversion by FaderNetVC with an added speaker feature extractor (in Japanese)
SP Student Poster Award, 井硲巧; Takuya Kishida; Toru Nakashika
- May 2020
Acoustical Society of Japan
Research on speech synthesis based on extended Boltzmann machines (in Japanese)
Itakura Prize Innovative Young Researcher Award
Japan society
- Sep. 2018
Acoustical Society of Japan
An investigation of an extended Boltzmann machine with a long short-term memory structure (in Japanese)
Awaya Prize Young Researcher Award
Japan society
- May 2016
Information Processing Society of Japan
Simultaneous speaker and phoneme recognition based on speech modeling with a three-way restricted Boltzmann machine (in Japanese)
SIGMUS Excellent Presentation Award
Japan society
- Jun. 2014
The Institute of Electronics, Information and Communication Engineers
Voice conversion using a speaker-dependent conditional restricted Boltzmann machine (in Japanese)
IEICE ISS Young Researcher's Award in Speech Field
Japan society
Paper
- Fast and Lightweight Non-Parallel Voice Conversion Based on Free-Energy Minimization of Speaker-Conditional Restricted Boltzmann Machine
Takuya Kishida; Toru Nakashika
IEICE Transactions on Information and Systems, Institute of Electronics, Information and Communication Engineers (IEICE), 2025, Peer-reviewed
Scientific journal
- An Investigation on the Speech Recovery from EEG Signals Using Transformer
Tomoaki Mizuno; Takuya Kishida; Natsue Yoshimura; Toru Nakashika
Last author, 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 1-6, 03 Dec. 2024, Peer-reviewed
International conference proceedings
- Gamma-VAE: Speech representation based on VAE assuming gamma distribution for both latent variables and observation
Nanako Imaichi; Toru Nakashika
Last author, 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 1-6, 03 Dec. 2024, Peer-reviewed
International conference proceedings, English
- DDPMVC: Non-parallel any-to-many voice conversion using diffusion encoder
Ryuichi Hatakeyama; Kohei Okuda; Toru Nakashika
Last author, 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 1-6, 03 Dec. 2024, Peer-reviewed
International conference proceedings, English
- SBERT-based Chord Progression Estimation from Lyrics Trained with Imbalanced Data
Mastuti Puspitasari; Takuya Takahashi; Gen Hori; Shigeki Sagayama; Toru Nakashika
Proceedings of the CMMR 2023, Nov. 2023, Peer-reviewed
- Controllable Automatic Melody Composition Model across Pitch/Stress-accent Languages
Takuya Takahashi; Shigeki Sagayama; Toru Nakashika
Proceedings of the CMMR 2023, Nov. 2023, Peer-reviewed
- Gamma Boltzmann Machine for Audio Modeling
Toru Nakashika; Kohei Yatabe
IEEE/ACM Transactions on Audio Speech and Language Processing, 29, 2591-2605, 2021, Peer-reviewed, This paper presents an energy-based probabilistic model that handles nonnegative data in consideration of both linear and logarithmic scales. In audio applications, magnitude of time-frequency representation, including spectrogram, is regarded as one of the most important features. Such magnitude-based features have been extensively utilized in learning-based audio processing. Since a logarithmic scale is important in terms of auditory perception, the features are usually computed with a logarithmic function. That is, a logarithmic function is applied within the computation of features so that a learning machine does not have to explicitly model the logarithmic scale. We think in a different way and propose a restricted Boltzmann machine (RBM) that simultaneously models linear- and log-magnitude spectra. RBM is a stochastic neural network that can discover data representations without supervision. To manage both linear and logarithmic scales, we define an energy function based on both scales. This energy function results in a conditional distribution (of the observable data, given hidden units) that is written as the gamma distribution, and hence the proposed RBM is termed gamma-Bernoulli RBM. The proposed gamma-Bernoulli RBM was compared to the ordinary Gaussian-Bernoulli RBM by speech representation experiments. Both objective and subjective evaluations illustrated the advantage of the proposed model.
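The linear-plus-log energy at the heart of the gamma-Bernoulli RBM can be checked with a one-variable toy sketch (assumed notation, not the paper's exact parameterization): an energy E(x) = beta*x - alpha*log(x) induces p(x) proportional to x^alpha * exp(-beta*x), a gamma distribution with shape alpha+1 and rate beta, whose normalizer is Gamma(alpha+1)/beta^(alpha+1).

```python
import math

# Hedged sketch: an energy that is linear in both x and log(x),
# E(x) = beta*x - alpha*log(x), induces p(x) ∝ exp(-E(x)) = x^alpha * exp(-beta*x),
# i.e. a gamma distribution with shape alpha+1 and rate beta.
def unnormalized_gamma_density(x, alpha, beta):
    return x ** alpha * math.exp(-beta * x)

def numeric_normalizer(alpha, beta, upper=200.0, n=200000):
    # trapezoidal integration of the unnormalized density over (0, upper);
    # the endpoints contribute nothing for alpha > 0 and a fast-decaying tail
    h = upper / n
    total = 0.0
    for i in range(1, n):
        total += unnormalized_gamma_density(i * h, alpha, beta)
    return total * h

alpha, beta = 2.5, 1.3
closed_form = math.gamma(alpha + 1) / beta ** (alpha + 1)
print(numeric_normalizer(alpha, beta), closed_form)  # the two values agree
```

That the normalizer has this closed form is exactly what makes the conditional distribution of the visible units tractable in such a model.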
Scientific journal
- Gamma Boltzmann Machine for Simultaneously Modeling Linear- and Log-amplitude Spectra
Toru Nakashika; Kohei Yatabe
Proceedings of APSIPA Annual Summit and Conference 2020, 471-476, Dec. 2020, Peer-reviewed
International conference proceedings, English
- Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra
Toru Nakashika
Proceedings of the Interspeech 2020, 2002-2006, Oct. 2020, Peer-reviewed
International conference proceedings, English
- Simultaneous Conversion of Speaker Identity and Emotion Based on Multiple-Domain Adaptive RBM
Takuya Kishida; Shin Tsukamoto; Toru Nakashika
Proceedings of the Interspeech 2020, 3431-3435, Oct. 2020, Peer-reviewed
International conference proceedings, English
- Many-to-Many Symbolic Multi-track Music Genre Transfer
Michel Pezzat; Hector Perez-Meana; Toru Nakashika; Mariko Nakano
Proceedings of the SoMeT 2020, 272-281, Sep. 2020, Peer-reviewed
International conference proceedings, English
- Non-parallel dictionary learning for voice conversion using non-negative Tucker decomposition
Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
EURASIP Journal on Audio, Speech, and Music Processing, Springer, DOI: 10.1186/s13636-019-0160-1, 1-11, 14 Aug. 2019, Peer-reviewed
Scientific journal, English
- Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech
Kentaro Sone; Toru Nakashika
IEICE TRANSACTIONS on Information and Systems, IEICE, E102-D, 8, 1546-1553, 01 Aug. 2019, Peer-reviewed
Scientific journal, English
- STFT spectral loss for training a neural speech waveform model
Shinji Takaki; Toru Nakashika; Xin Wang; Junichi Yamagishi
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), 7065-7069, May 2019, Peer-reviewed
International conference proceedings, English
- An extension of restricted Boltzmann machines that directly represents complex-valued observations and its application to speech signal processing (in Japanese)
Toru Nakashika
Journal of the Acoustical Society of Japan, Acoustical Society of Japan, 75, 3, 164-172, 01 Mar. 2019, Invited
Scientific journal, Japanese
- Complex-Valued Restricted Boltzmann Machine for Speaker-Dependent Speech Parameterization From Complex Spectra
Toru Nakashika; Shinji Takaki; Junichi Yamagishi
IEEE/ACM Transactions on Audio, Speech and Language Processing, IEEE/ACM, 27, 2, 244-254, 22 Oct. 2018, Peer-reviewed
Scientific journal, English
- An extension of the complex-valued RBM considering the autoregressive structure of speech spectral sequences (in Japanese)
Toru Nakashika; Shinji Takaki; Junichi Yamagishi
Autumn Meeting of the Acoustical Society of Japan, 1135-1138, Sep. 2018
Research society, Japanese
- Training of DNN-based speech waveform models based on spectral sequence error (in Japanese)
Shinji Takaki; Toru Nakashika; Junichi Yamagishi
Autumn Meeting of the Acoustical Society of Japan, 1131-1132, Sep. 2018
Research society, Japanese
- DNN-based Speech Synthesis for Small Data Sets Considering Bidirectional Speech-Text Conversion
Kentaro Sone; Toru Nakashika
Proceedings of the Interspeech 2018, 2519-2523, Sep. 2018, Peer-reviewed
International conference proceedings, English
- LSTBM: A Novel Sequence Representation of Speech Spectra Using Restricted Boltzmann Machine with Long Short-Term Memory
Toru Nakashika
Proceedings of the Interspeech 2018, 2529-2533, Sep. 2018, Peer-reviewed
International conference proceedings, English
- Bidirectional Voice Conversion Based on Joint Training Using Gaussian-Gaussian Deep Relational Model
Kentaro Sone; Shinji Takaki; Toru Nakashika
Proceedings of the Odyssey 2018, 261-266, Jun. 2018, Peer-reviewed
International conference proceedings, English
- Parallel-Data-Free Dictionary Learning for Voice Conversion Using Non-Negative Tucker Decomposition
Yuki Takashima; Hajime Yano; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), IEEE, 5294-5298, Apr. 2018, Peer-reviewed
International conference proceedings, English
- Non-parallel voice conversion based on NMF dictionary learning using non-negative Tucker decomposition (in Japanese)
Yuki Takashima; Hajime Yano; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings of the 2018 Spring Meeting of the Acoustical Society of Japan, 211-214, Mar. 2018
Research society, Japanese
- Complex spectral sequence modeling using a complex-valued restricted Boltzmann machine with a recurrent structure (in Japanese)
Toru Nakashika; Shinji Takaki; Junichi Yamagishi
120th Meeting of IPSJ SIG Spoken Language Processing (SIG-SLP), SLP-21, Feb. 2018
Symposium, Japanese
- Report on the international conference Interspeech 2017 (in Japanese)
高木 信二; 倉田 岳人; 郡山 知樹; 塩田 さやか; 鈴木 雅之; 玉森 聡; 俵 直弘; 中鹿 亘; 福田 隆; 増村 亮; 森勢 将雅; 山岸 順一; 山本 克
120th Meeting of IPSJ SIG Spoken Language Processing (SIG-SLP), SLP-14, Feb. 2018
Symposium, Japanese
- Deep relational model: A joint probabilistic model with a hierarchical structure for bidirectional estimation of image and labels
Toru Nakashika
IEICE Transactions on Information and Systems, Institute of Electronics, Information and Communication Engineers (IEICE), E101D, 2, 428-436, 01 Feb. 2018, Peer-reviewed, Two different types of representations, such as an image and its manually-assigned corresponding labels, generally have complex and strong relationships to each other. In this paper, we represent such deep relationships between two different types of visible variables using an energy-based probabilistic model, called a deep relational model (DRM), to improve the prediction accuracies. A DRM stacks several layers from one visible layer on to another visible layer, sandwiching several hidden layers between them. As with restricted Boltzmann machines (RBMs) and deep Boltzmann machines (DBMs), all connections (weights) between two adjacent layers are undirected. During maximum likelihood (ML)-based training, the network attempts to capture the latent complex relationships between the two visible variables with its deep architecture. Unlike deep neural networks (DNNs), 1) the DRM is a totally generative model that allows us to generate one visible variable given the other, and 2) the parameters can be optimized in a probabilistic manner. The DRM can also be fine-tuned using DNNs, as with deep belief net (DBN) or DBM pre-training. This paper presents experiments conducted to evaluate the performance of a DRM in image recognition and generation tasks using the MNIST data set. In the image recognition experiments, we observed that the DRM outperformed DNNs even without fine-tuning. In the image generation experiments, the images generated from the DRM were much more realistic than those from the other generative models.
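The bidirectional use of a single set of parameters described in this abstract can be imitated with a brute-force toy model (an illustrative sketch with assumed sizes and energy, far smaller and simpler than the DRM itself): one energy function over two visible binary vectors and a shared hidden layer yields both a recognition direction p(y|x) and a generation direction p(x|y).

```python
import itertools
import math
import random

# Toy joint energy model over two binary visible vectors x and y with a
# shared hidden layer h: E(x, h, y) = -x·W1·h - h·W2·y.  Because the model
# is generative, one parameter set gives both conditionals.
random.seed(0)
NX = NY = NH = 2
W1 = [[random.uniform(-1, 1) for _ in range(NH)] for _ in range(NX)]
W2 = [[random.uniform(-1, 1) for _ in range(NY)] for _ in range(NH)]

def energy(x, h, y):
    return -(sum(x[i] * W1[i][j] * h[j] for i in range(NX) for j in range(NH))
             + sum(h[j] * W2[j][k] * y[k] for j in range(NH) for k in range(NY)))

def states(n):
    return list(itertools.product([0, 1], repeat=n))

def unnorm_joint(x, y):
    # unnormalized p(x, y) with the hidden units summed out
    return sum(math.exp(-energy(x, h, y)) for h in states(NH))

def p_y_given_x(x):  # "recognition" direction
    w = [unnorm_joint(x, y) for y in states(NY)]
    z = sum(w)
    return [wi / z for wi in w]

def p_x_given_y(y):  # "generation" direction, same parameters
    w = [unnorm_joint(x, y) for x in states(NX)]
    z = sum(w)
    return [wi / z for wi in w]
```

Both conditionals are exact here because every configuration is enumerated; the point of the deep architecture in the paper is to make this bidirectionality scale beyond toy sizes.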
Scientific journal, English
- Improvement and evaluation of speech spectral modeling using the complex-valued RBM (in Japanese)
Toru Nakashika; Shinji Takaki; Junichi Yamagishi
Autumn Meeting of the Acoustical Society of Japan, 169-172, Sep. 2017
Symposium, Japanese
- Practice Process Analysis Using Score Matching Method Based on OBE-DTW and its Effects on Memorizing Musical Score
Toru Nakashika; Eriko Aiba
Proceedings of International Symposium on Performance Science 2017 (ISPS2017), 66-67, Sep. 2017, Peer-reviewed
International conference proceedings, English
- Speaker-adaptive-trainable Boltzmann machine and its application to non-parallel voice conversion
Toru Nakashika; Yasuhiro Minami
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, SPRINGER INTERNATIONAL PUBLISHING AG, DOI: 10.1186/s13636-017-0112-6, 1-10, Jun. 2017, Peer-reviewed, In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. Voice conversion is a technique where only speaker-specific information in the source speech is converted while keeping the phonological information unchanged. Most of the existing VC methods rely on parallel data-pairs of speech data from the source and target speakers uttering the same sentences. However, the use of parallel data in training causes several problems: (1) the data used for the training is limited to the pre-defined sentences, (2) the trained model is only applied to the speaker pair used in the training, and (3) a mismatch in alignment may occur. Although it is generally preferable in VC to not use parallel data, a non-parallel approach is considered difficult to learn. In our approach, we realize the non-parallel training based on speaker-adaptive training (SAT). Speech signals are represented using a probabilistic model based on the Boltzmann machine that defines phonological information and speaker-related information explicitly. Speaker-independent (SI) and speaker-dependent (SD) parameters are simultaneously trained using SAT. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and then voice-converted speech is obtained by combining the two. Our experimental results showed that our approach outperformed the conventional non-parallel approach regarding objective and subjective criteria.
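The decompose-swap-recombine conversion described above can be sketched with a purely linear stand-in for the Boltzmann-machine model (all matrices, biases, and names here are illustrative assumptions): each speaker s has speaker-dependent parameters (A_s, b_s), an observed frame is x = A_s·phi + b_s for speaker-invariant phonological features phi, and conversion recovers phi from the source speaker and re-renders it with the target speaker's parameters.

```python
# Toy linear analogue of decompose/swap/recombine voice conversion.
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def inv2(A):  # inverse of a 2x2 matrix
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# assumed speaker-dependent parameters (source and target)
A_src, b_src = [[2.0, 0.5], [0.0, 1.5]], [0.1, -0.2]
A_tgt, b_tgt = [[1.0, -0.3], [0.4, 2.0]], [0.0, 0.3]

def convert(x):
    # decompose: recover the speaker-invariant features phi from the source
    phi = matvec(inv2(A_src), [xi - bi for xi, bi in zip(x, b_src)])
    # recombine: re-render phi with the target speaker's parameters
    return [yi + bi for yi, bi in zip(matvec(A_tgt, phi), b_tgt)]

phi_true = [0.7, -1.2]
x = [xi + bi for xi, bi in zip(matvec(A_src, phi_true), b_src)]
y = convert(x)  # equals A_tgt·phi_true + b_tgt up to rounding
```

In the actual model the decomposition is probabilistic and learned jointly over many speakers; the linear version only shows why no parallel data is needed once such a factorization exists.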
Scientific journal, English
- Dysarthric speech recognition using an adaptive Gaussian-Gaussian RBM (in Japanese)
Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings of the 2017 Spring Meeting of the Acoustical Society of Japan, 95-98, Mar. 2017
Research society, Japanese
- Complex-valued RBM: a complex-valued extension of the restricted Boltzmann machine and its application to speech signals (in Japanese)
Toru Nakashika; Shinji Takaki; Junichi Yamagishi
Spring Meeting of the Acoustical Society of Japan, 219-222, Mar. 2017
Symposium, Japanese
- CAB: An energy-based speaker clustering model for rapid adaptation in non-parallel voice conversion
Toru Nakashika
Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), International Speech Communication Association, 3369-3373, 2017, Peer-reviewed, In this paper, a new energy-based probabilistic model, called CAB (Cluster Adaptive restricted Boltzmann machine), is proposed for voice conversion (VC) that does not require parallel data during training and requires only a small amount of speech data during adaptation. Most of the existing VC methods require parallel data for training. Recently, VC methods that do not require parallel data (called non-parallel VC) have also been proposed and are attracting much attention because, unlike conventional approaches, they do not require prepared or recorded parallel speech data. The proposed CAB model is aimed at statistical non-parallel VC based on cluster adaptive training (CAT). This extends the VC method used in our previous model, ARBM (adaptive restricted Boltzmann machine). The ARBM approach assumes that any speech signal can be decomposed into speaker-invariant phonetic information and speaker-identity information using the ARBM adaptation matrices of each speaker. VC is achieved by switching the source speaker's identity to that of the target speaker while retaining the phonetic information obtained by decomposition of the source speaker's speech. In CAB, in contrast, speaker identities are represented as cluster vectors that determine the adaptation matrices. As the number of clusters is generally smaller than the number of speakers, the number of model parameters can be reduced compared to the ARBM, which enables rapid adaptation to a new speaker. Our experimental results show that the proposed method performed better than the ARBM approach, particularly in adaptation.
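The cluster-vector parameterization can be sketched in a few lines (a toy under assumed notation): the per-speaker adaptation matrix is a weighted combination of K shared cluster matrices, so adapting to a new speaker means estimating only K weights instead of a full matrix.

```python
import random

# Sketch of the cluster-adaptive idea: A(s) = sum_k lam[k] * A_k, where the
# A_k are shared cluster matrices and lam is the speaker's cluster vector.
random.seed(1)
K, D = 3, 4  # number of clusters, feature dimensionality (assumed)
cluster_mats = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
                for _ in range(K)]

def speaker_matrix(lam):
    # weighted sum of the shared cluster matrices
    A = [[0.0] * D for _ in range(D)]
    for k, w in enumerate(lam):
        for i in range(D):
            for j in range(D):
                A[i][j] += w * cluster_mats[k][i][j]
    return A

lam_new = [0.2, 0.5, 0.3]  # only K numbers to estimate for a new speaker
A_new = speaker_matrix(lam_new)
print(len(lam_new), "adapted parameters instead of", D * D)
```

With a one-hot cluster vector the speaker matrix reduces to a single cluster matrix, which makes the parameter sharing explicit.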
International conference proceedings, English
- Complex-valued restricted Boltzmann machine for direct learning of frequency spectra
Toru Nakashika; Shinji Takaki; Junichi Yamagishi
Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), International Speech Communication Association, 4021-4025, 2017, Peer-reviewed, In this paper, we propose a new energy-based probabilistic model where a restricted Boltzmann machine (RBM) is extended to deal with complex-valued visible units. The RBM, which automatically learns the relationships between visible units and hidden units (but without connections within the visible or the hidden units), has been widely used as a feature extractor, a generator, a classifier, for pre-training of deep neural networks, etc. However, all the conventional RBMs have assumed the visible units to be either binary-valued or real-valued, and therefore complex-valued data cannot be fed to the RBM. In various applications, however, complex-valued data is frequently used; examples include complex spectra of speech, fMRI images, wireless signals, and acoustic intensity. For direct learning of such complex-valued data, we define a new model called the "complex-valued RBM (CRBM)," where the conditional probability of the complex-valued visible units given the hidden units forms a complex-Gaussian distribution. Another important characteristic of the CRBM is that it has connections between the real and imaginary parts of each visible unit, unlike the conventional real-valued RBM. Our experiments demonstrated that the proposed CRBM can directly encode complex spectra of speech signals without decoupling the imaginary part or phase from the complex-valued data.
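The key property, complex-valued visible units whose conditional distribution given the hidden units is a complex Gaussian, can be sketched as follows (shapes, the bias term, and the simple circularly-symmetric noise are assumptions for illustration, not the paper's exact definition):

```python
import random

# Sketch: complex-valued visible units with a complex-Gaussian conditional
# centred on W·h + b, so a complex spectrum can be fed to the model without
# separating magnitude and phase.
random.seed(0)
NV, NH = 3, 2  # visible (complex) and hidden (real/binary) sizes, assumed
W = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(NH)]
     for _ in range(NV)]
b = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(NV)]

def visible_mean(h):
    # mean of p(v|h): W·h + b, a complex vector
    return [sum(W[i][j] * h[j] for j in range(NH)) + b[i] for i in range(NV)]

def sample_visible(h, sigma=0.1):
    # draw circularly-symmetric complex noise around the conditional mean
    return [m + complex(random.gauss(0, sigma), random.gauss(0, sigma))
            for m in visible_mean(h)]

v = sample_visible([1.0, 0.0])  # one sampled complex "spectrum" frame
```

A fuller model would also couple the real and imaginary parts through the covariance, which is the CRBM characteristic the abstract highlights.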
International conference proceedings, English
- 3WRBM-based speech factor modeling for arbitrary-source and non-parallel voice conversion
Toru Nakashika; Yasuhiro Minami
European Signal Processing Conference (EUSIPCO 2016), 607-611, 28 Nov. 2016, Peer-reviewed, In recent years, voice conversion (VC) has become a popular technique since it can be applied to various speech tasks. Most existing approaches to VC must use aligned speech pairs (parallel data) of the source speaker and the target speaker in training, which makes them hard to apply in practice. Furthermore, VC methods proposed so far require the source speaker to be specified in the conversion stage, even though in many VC use cases we just want to obtain the speech of the target speaker from any other speaker. In this paper, we propose a VC method where it is not necessary to use any parallel data in the training, nor to specify the source speaker in the conversion. Our approach models a joint probability of acoustic, phonetic, and speaker features using a three-way restricted Boltzmann machine (3WRBM). Speaker-independent (SI) and speaker-dependent (SD) parameters in our model are simultaneously estimated under the maximum likelihood (ML) criterion using a speech set of multiple speakers. In the conversion stage, phonetic features are first estimated in a probabilistic manner given the speech of an arbitrary speaker, and then voice-converted speech is produced using the SD parameters of the target speaker. Our experimental results showed not only that our approach outperformed other non-parallel VC methods, but also that the performance of the arbitrary-source VC was close to that of traditional source-specified VC in our approach.
International conference proceedings, English
- An investigation of multimodal speech recognition using a factored three-way restricted Boltzmann machine (in Japanese)
Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings of the 2016 Autumn Meeting of the Acoustical Society of Japan, 109-112, Sep. 2016
Research society, Japanese - Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform
Zhaojie Luo; Jinhui Chen; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
The 9th ISCA Speech Synthesis Workshop (SSW), 153-158, Sep. 2016, Peer-reviewed
International conference proceedings, English
- Non-Parallel Training in Voice Conversion Using an Adaptive Restricted Boltzmann Machine
Toru Nakashika; Tetsuya Takiguchi; Yasuhiro Minami
IEEE/ACM Transactions on Audio, Speech and Language Processing, IEEE/ACM, 24, 11, 2032-2045, Aug. 2016, Peer-reviewed
Scientific journal, English
- Phone Labeling Based on the Probabilistic Representation for Dysarthric Speech Recognition
Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
American Journal of Signal Processing, 6, 1, 19-23, Jun. 2016, Peer-reviewed
Scientific journal, English
- An investigation of modeling that takes speaker characteristics and noise into account using a restricted Boltzmann machine (in Japanese)
Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings of the 2016 Spring Meeting of the Acoustical Society of Japan, 299-302, Mar. 2016
Research society, Japanese
- MODELING DEEP BIDIRECTIONAL RELATIONSHIPS FOR IMAGE CLASSIFICATION AND GENERATION
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, IEEE, 1327-1331, 2016, Peer-reviewed, This paper presents a novel probabilistic model that represents a joint probability of two visible variables with a deep architecture, called a deep relational model (DRM). The model stacks several layers from one visible layer on to another visible layer, sandwiching hidden layers between them. As with restricted Boltzmann machines (RBMs) and deep Boltzmann machines (DBMs), all connections (weights) between two adjacent layers are undirected. During the maximum-likelihood (ML)-based training, the network attempts to capture latent complex relationships between two visible variables (e.g., an image showing a certain number and its corresponding label) thanks to its deep architecture. Unlike deep neural networks, 1) the proposed DRM is a totally generative model, and 2) the weights can be optimized in a probabilistic manner. This paper presents and discusses the experiments conducted to evaluate our DRM's performance in recognition and generation tasks.
International conference proceedings, English
- SPEAKER ADAPTIVE MODEL BASED ON BOLTZMANN MACHINE FOR NON-PARALLEL TRAINING IN VOICE CONVERSION
Toru Nakashika; Yasuhiro Minami
2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, IEEE, 5530-5534, 2016, Peer-reviewed, In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. VC is a technique where only speaker-specific information in source speech is converted while keeping the phonological information unchanged. Most of the existing VC methods rely on parallel data-pairs of speech data from the source and target speakers uttering the same sentences. However, the use of parallel data in training causes several problems: 1) the data used for the training is limited to the pre-defined sentences, 2) the trained model is only applied to the speaker pair used in the training, and 3) a mismatch in alignment may happen. Although it is thus fairly preferable in VC not to use parallel data, a non-parallel approach is considered difficult to learn. In our approach, we realize the non-parallel training based on speaker-adaptive training (SAT). Speech signals are represented using a probabilistic model based on the Boltzmann machine that defines phonological information and speaker-related information explicitly. Speaker-independent (SI) and speaker-dependent (SD) parameters are simultaneously trained using SAT. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and then a voice-converted speech is obtained by mixing the two. Our experimental results showed that our approach unfortunately fell short of the popular conventional GMM-based method that used parallel data, but outperformed the conventional non-parallel approach.
International conference proceedings, English
- Selection of an Optimum Random Matrix Using a Genetic Algorithm for Acoustic Feature Extraction
Yuichiro Kataoka; Toru Nakashika; Ryo Aihara; Tetsuya Takiguchi; Yasuo Ariki
2016 IEEE/ACIS 15TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS), IEEE COMPUTER SOC, 983-988, 2016, Peer-reviewed, This paper describes a technique for selecting an optimum random matrix using a genetic algorithm for speech recognition based on random projections. Random projections have been suggested as a means of dimensionality reduction, where the original data are projected onto a subspace using a random matrix. Moreover, as we are able to produce various random matrices, it may be possible to find a transform matrix that is superior to conventional transformation matrices among random matrices. In this paper, a genetic algorithm is introduced to find an optimum random matrix. Its effectiveness is confirmed by word recognition experiments.
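The selection idea can be sketched without the genetic algorithm itself (a hypothetical one-shot variant: draw several candidate random projection matrices, score each by how well it preserves pairwise distances of some data, and keep the best; a GA would refine this search iteratively instead):

```python
import random

# One-shot selection among candidate random projection matrices.
random.seed(0)
D, K = 8, 3  # original and projected dimensionality (assumed)
data = [[random.gauss(0, 1) for _ in range(D)] for _ in range(20)]

def project(R, x):
    return [sum(r * xi for r, xi in zip(row, x)) for row in R]

def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def fitness(R):
    # negative total distortion of pairwise distances after projection
    err = 0.0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            err += abs(dist(project(R, data[i]), project(R, data[j]))
                       - dist(data[i], data[j]))
    return -err

candidates = [[[random.gauss(0, 1 / K ** 0.5) for _ in range(D)]
               for _ in range(K)] for _ in range(10)]
best = max(candidates, key=fitness)
```

In the paper the fitness is word recognition performance rather than distance distortion, and the GA mutates and recombines candidates across generations.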
International conference proceedings, English
- Generative Acoustic-Phonemic-Speaker Model Based on Three-Way Restricted Boltzmann Machine
Toru Nakashika; Yasuhiro Minami
17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), ISCA, 1487-1491, 2016, Peer-reviewed, In this paper, we discuss a way of modeling speech signals based on the three-way restricted Boltzmann machine (3WRBM) for automatically separating phonetic-related information and speaker-related information from an observed signal. The proposed model is an energy-based probabilistic model that includes three-way potentials of three variables: acoustic features, latent phonetic features, and speaker-identity features. We train the model so that it automatically captures the undirected relationships among the three variables. Once the model is trained, it can be applied to many tasks in speech signal processing. For example, given a speech signal, estimating speaker-identity features is equivalent to speaker recognition; on the other hand, estimated latent phonetic features may be helpful for speech recognition because they contain more phonetic-related information than the acoustic features. Since the model is generative, we can also apply it to voice conversion; i.e., we just estimate acoustic features from the phonetic features that were estimated given the source speaker's acoustic features, along with the desired speaker-identity features. In our experiments, we discuss the effectiveness of the speech modeling through speaker recognition, speech (continuous phone) recognition, and voice conversion tasks.
International conference proceedings, English
- Small-parallel exemplar-based voice conversion in noisy environments using affine non-negative matrix factorization
Ryo Aihara; Takao Fujii; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, SPRINGER INTERNATIONAL PUBLISHING AG, 2015:32, DOI: 10.1186/s13636-015-0075-4, 1-9, Nov. 2015, Peer-reviewed, The need for a large amount of parallel data is a large hurdle for the practical use of voice conversion (VC). This paper presents a novel framework of exemplar-based VC that only requires a small number of parallel exemplars. In our previous work, a VC technique using non-negative matrix factorization (NMF) for noisy environments was proposed. This method requires parallel exemplars (which consist of source exemplars and target exemplars that have the same texts uttered by the source and target speakers) for dictionary construction. In the framework of conventional Gaussian mixture model (GMM)-based VC, some approaches that do not need parallel exemplars have been proposed; however, in the framework of exemplar-based VC for noisy environments, such a method has never been proposed. In this paper, an adaptation matrix in an NMF framework is introduced to adapt the source dictionary to the target dictionary. This adaptation matrix is estimated using only a small parallel speech corpus. We refer to this method as affine NMF, and its effectiveness has been confirmed by comparing it with a conventional NMF-based method and a GMM-based method in noisy environments.
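The exemplar-based conversion that affine NMF builds on can be sketched as follows (simplified, with made-up dictionaries and no noise handling): a source frame is decomposed into non-negative activations over the source exemplar dictionary by multiplicative KL-NMF updates, and the same activations are applied to the paired target dictionary.

```python
import random

# Sketch of exemplar-based conversion with fixed paired dictionaries.
random.seed(0)
D, N = 5, 8  # feature dimensionality, number of exemplar pairs (assumed)
Ws = [[random.random() + 0.1 for _ in range(N)] for _ in range(D)]  # source dict
Wt = [[random.random() + 0.1 for _ in range(N)] for _ in range(D)]  # target dict

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def activations(x, iters=200):
    # multiplicative KL-NMF updates for h (non-negative) with Ws fixed
    h = [1.0] * N
    for _ in range(iters):
        recon = matvec(Ws, h)
        for k in range(N):
            num = sum(Ws[d][k] * x[d] / recon[d] for d in range(D))
            den = sum(Ws[d][k] for d in range(D))
            h[k] *= num / den
    return h

# synthetic "source frame" that lies in the span of the source dictionary
x = matvec(Ws, [0.3, 0.0, 0.9, 0.0, 0.1, 0.0, 0.0, 0.2])
y = matvec(Wt, activations(x))  # converted frame via shared activations
```

Affine NMF then adds a small-parallel-data adaptation matrix between the two dictionaries instead of requiring fully matched exemplars.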
Scientific journal, English
- Parallel-Data-Free, Many-to-Many Voice Conversion Using an Adaptive Restricted Boltzmann Machine
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
MLSLP 2015, 1-6, Sep. 2015, Peer-reviewed
International conference proceedings, English
- Voice conversion using RNN pre-trained by recurrent temporal restricted Boltzmann machines
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers Inc., 23, 3, 580-587, 01 Mar. 2015, Peer-reviewed, This paper presents a voice conversion (VC) method that utilizes the recently proposed probabilistic models called recurrent temporal restricted Boltzmann machines (RTRBMs). One RTRBM is used for each speaker, with the goal of capturing high-order temporal dependencies in an acoustic sequence. Our algorithm starts from the separate training of one RTRBM for a source speaker and another for a target speaker using speaker-dependent training data. Because each RTRBM attempts to discover abstractions to maximally express the training data at each time step, as well as the temporal dependencies in the training data, we expect that the models represent the linguistic-related latent features in high-order spaces. In our approach, we convert (match) features of emphasis for the source speaker to those of the target speaker using a neural network (NN), so that the entire network (consisting of the two RTRBMs and the NN) acts as a deep recurrent NN and can be fine-tuned. Using VC experiments, we confirm the high performance of our method, especially in terms of objective criteria, relative to conventional VC methods such as approaches based on Gaussian mixture models and on NNs.
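The pipeline described in the abstract, two speaker-dependent sequence models bridged by a mapping network and treated as one trainable chain, can be caricatured with deterministic stand-in layers (all shapes, functions, and names here are illustrative assumptions, not the RTRBM itself):

```python
import math
import random

# Conceptual sketch: conversion as the composition of a source-speaker
# encoder, a latent-space mapping network, and a target-speaker decoder.
random.seed(0)

def layer(n_in, n_out):
    # a fixed random tanh layer as a deterministic stand-in for each stage
    W = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return lambda x: [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                      for row in W]

encode_src = layer(4, 3)  # stand-in for the source-speaker RTRBM encoder
map_latent = layer(3, 3)  # NN matching source latents to target latents
decode_tgt = layer(3, 4)  # stand-in for the target-side decoder

def convert(frame):
    # because the chain is one composed network, it could be fine-tuned
    # end-to-end, which is the point the abstract makes
    return decode_tgt(map_latent(encode_src(frame)))

out = convert([0.1, -0.4, 0.8, 0.0])
```

In the actual method the two outer stages are trained generatively per speaker before the joint fine-tuning; the sketch only shows the composition structure.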
Scientific journal, English - Voice conversion using speaker-dependent conditional restricted Boltzmann machine
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, SPRINGER INTERNATIONAL PUBLISHING AG, 2015:8, DOI: 10.1186/s13636-014-0044-3, 1-12, Feb. 2015, Peer-reviewed, This paper presents a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain high-order speaker-independent spaces where voice features are converted more easily than those in an original acoustic feature space. The CRBM is expected to automatically discover common features lurking in time-series data. When we train two CRBMs for a source and target speaker independently using only speaker-dependent training data, it can be considered that each CRBM tries to construct subspaces where there are fewer phonemes and relatively more speaker individuality than the original acoustic space because the training data include various phonemes while keeping the speaker individuality unchanged. Each obtained high-order feature is then concatenated using a neural network (NN) from the source to the target. The entire network (the two CRBMs and the NN) can also be fine-tuned as a recurrent neural network (RNN) using the acoustic parallel data, since both the CRBMs and the concatenating NN have network-based representations with time dependencies. Through voice-conversion experiments, we confirmed the high performance of our method, especially in terms of objective evaluation, comparing it with conventional GMM, NN, RNN, and our previous speaker-dependent DBN approaches.
Scientific journal, English - Non-parallel voice conversion by speech modeling with latent phonetic information based on speaker-normalized training
NAKASHIKA Toru; TAKIGUCHI Tetsuya
Proceedings of the 2015 Autumn Meeting of the Acoustical Society of Japan, 223-236, 2015
Research society, Japanese - Parallel-data-free arbitrary-speaker voice conversion using an adaptive restricted Boltzmann machine
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2015 Spring Meeting of the Acoustical Society of Japan, 279-282, 2015
Research society, Japanese - Joint modeling of acoustic, phonetic, and speaker information using a constrained three-way restricted Boltzmann machine
NAKASHIKA Toru; TAKIGUCHI Tetsuya
IEICE Technical Report, 115, 346, 7-12, 2015
Symposium, Japanese - Phoneme label estimation using a deep Boltzmann machine
TAKASHIMA Yuki; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2015 Spring Meeting of the Acoustical Society of Japan, 3-6, 2015
Research society, Japanese - Voice conversion in noisy environments by non-negative matrix factorization using a small amount of parallel data
FUJII Takao; AIHARA Ryo; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2015 Spring Meeting of the Acoustical Society of Japan, 393-396, 2015
Research society, Japanese - A study of phoneme labeling based on Gaussian mixture models for dysarthric speech recognition
TAKASHIMA Yuki; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
IEICE Technical Report, 115, 99, 71-76, 2015
Symposium, Japanese - A study of phoneme labeling based on probabilistic representations for dysarthric speech recognition
TAKASHIMA Yuki; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2015 Autumn Meeting of the Acoustical Society of Japan, 1243-1246, 2015
Research society, Japanese - Content-based Image Retrieval Using Rotation-invariant Histograms of Oriented Gradients
Jinhui Chen; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
ICMR'15: PROCEEDINGS OF THE 2015 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ASSOC COMPUTING MACHINERY, 443-446, 2015, Peer-reviewed, Our research focuses on the question of feature descriptors for robust, effective computing, proposing a novel feature representation method, namely rotation-invariant histograms of oriented gradients (Ri-HOG), for image retrieval. Most existing HOG techniques are computed on a dense grid of uniformly spaced cells and use overlapping local contrast of rectangular blocks for normalization. However, we adopt annular spatial-bin cells and apply the radial gradient to attain gradient binning invariance for feature extraction. In this way, it significantly enhances HOG with regard to rotation-invariance ability and feature description accuracy. In experiments, the proposed method is evaluated on the Corel-5k and Corel-10k datasets. The experimental results demonstrate that the proposed method is much more effective than many existing image feature descriptors for content-based image retrieval.
International conference proceedings, English - SPARSE NONLINEAR REPRESENTATION FOR VOICE CONVERSION
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
2015 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), IEEE, 1-6, 2015, Peer-reviewed, In voice conversion, sparse-representation-based methods have recently been garnering attention because they are, relatively speaking, not affected by over-fitting or over-smoothing problems. In these approaches, voice conversion is achieved by estimating a sparse vector that determines which dictionaries of the target speaker should be used, calculated from the matching of the input vector and dictionaries of the source speaker. The sparse-representation-based voice conversion methods can be broadly divided into two approaches: 1) an approach that uses raw acoustic features in the training data as parallel dictionaries, and 2) an approach that trains parallel dictionaries from the training data. In our approach, we follow the latter and systematically estimate the parallel dictionaries using a joint-density restricted Boltzmann machine with sparse constraints. Through voice-conversion experiments, we confirmed the high performance of our method, comparing it with the conventional Gaussian mixture model (GMM)-based approach and a non-negative matrix factorization (NMF)-based approach, which is based on sparse representation.
International conference proceedings, English - NOISE-ROBUST VOICE CONVERSION USING A SMALL PARALLEL DATA BASED ON NON-NEGATIVE MATRIX FACTORIZATION
Ryo Aihara; Takao Fujii; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), IEEE, 315-319, 2015, Peer-reviewed, This paper presents a novel framework of voice conversion (VC) based on non-negative matrix factorization (NMF) using a small parallel corpus. In our previous work, a VC technique using NMF for noisy environments was proposed; it requires parallel exemplars (a dictionary), consisting of source and target exemplars with the same texts uttered by the source and target speakers. A large parallel corpus is used to construct a conversion function in NMF-based VC (in the same way as common GMM-based VC). In this paper, an adaptation matrix in an NMF framework is introduced to adapt the source dictionary to the target dictionary. This adaptation matrix is estimated using only a small parallel speech corpus. The effectiveness of this method is confirmed by comparison with a conventional NMF-based method and a GMM-based method in a noisy environment.
International conference proceedings, English - FEATURE EXTRACTION USING PRE-TRAINED CONVOLUTIVE BOTTLENECK NETS FOR DYSARTHRIC SPEECH RECOGNITION
Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), IEEE, 1411-1415, 2015, Peer-reviewed, In this paper, we investigate the recognition of speech uttered by a person with an articulation disorder resulting from athetoid cerebral palsy based on a robust feature extraction method using pre-trained convolutive bottleneck networks (CBNs). Generally speaking, the amount of speech data obtained from a person with an articulation disorder is limited because the burden on the speaker is large due to strain on the speech muscles. Therefore, a trained CBN tends toward overfitting for a small corpus of training data. In our previous work, experimental results showed that speech recognition using features extracted from CBNs outperformed conventional features. However, the recognition accuracy strongly depends on the initial values of the convolution kernels. To prevent overfitting in the networks, we introduce in this paper a pre-training technique using a convolutional restricted Boltzmann machine (CRBM). Through word-recognition experiments, we confirmed its superiority in comparison to convolutional networks without pre-training.
International conference proceedings, English - High-Order Sequence Modeling Using Speaker-Dependent Recurrent Temporal Restricted Boltzmann Machines for Voice Conversion
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings of the 15th Conference of the International Speech Communication Association (Interspeech 2014), 2278-2282, Sep. 2014, Peer-reviewed
International conference proceedings, English - Error Correction of Automatic Speech Recognition Based on Normalized Web Distance
E. Byambakhishig; K. Tanaka; R. Aihara; T. Nakashika; T. Takiguchi; Y. Ariki
Proceedings of the 15th Conference of the International Speech Communication Association (Interspeech 2014), 2852-2856, Sep. 2014, Peer-reviewed
International conference proceedings, English - Parallel Dictionary Learning Using a Joint Density Restricted Boltzmann Machine for Sparse-Representation-Based Voice Conversion
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Advances in Computer Science and Engineering, 12, 2, 101-117, Jun. 2014, Peer-reviewed
Scientific journal, English - Voice Conversion Based on Speaker-Dependent Restricted Boltzmann Machines
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG, E97D, 6, 1403-1410, Jun. 2014, Peer-reviewed, This paper presents a voice conversion technique using speaker-dependent Restricted Boltzmann Machines (RBMs) to build high-order eigen spaces of source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. We build a deep conversion architecture that concatenates the two speaker-dependent RBMs with neural networks, expecting that they automatically discover abstractions to express the original input features. Under this concept, if we train the RBMs using only the speech of an individual speaker that includes various phonemes while keeping the speaker individuality unchanged, it can be considered that there are fewer phonemes and relatively more speaker individuality in the output features of the hidden layer than in the original acoustic features. Training the RBMs for a source speaker and a target speaker, we can then connect and convert the speaker individuality abstractions using Neural Networks (NNs). The converted abstraction of the source speaker is then back-propagated into the acoustic space (e.g., MFCC) using the RBM of the target speaker. We conducted speaker-voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method and an ordinary NN.
Scientific journal, English - Voice conversion by NMF using speaker adaptation
FUJII Takao; AIHARA Ryo; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2014 Spring Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 421-424, Mar. 2014, This paper proposes an NMF-based voice conversion method using speaker adaptation. Our previously proposed NMF-based voice conversion methods presupposed parallel data in which the source and target speakers utter the same content, so a large amount of data for each target speaker had to be prepared in advance. We therefore propose generating the target-speaker dictionary from the source-speaker dictionary, using only a small amount of the target speaker's speech for dictionary adaptation. Evaluation experiments demonstrate the effectiveness of the proposed adaptation-based method.
Research society, Japanese - A parallel-dictionary learning method using restricted Boltzmann machines for voice conversion
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2014 Spring Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 415-416, Mar. 2014, This paper proposes a voice conversion method using a joint restricted Boltzmann machine (RBM) so that, in sparse-representation-based voice conversion, the construction and selection of parallel dictionaries are handled within a unified framework.
Research society, Japanese - Dysarthric speech recognition using convolutive bottleneck network features
YOSHIOKA Toshiya; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2014 Spring Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 237-240, Mar. 2014, Toward speech recognition for persons with articulation disorders, this paper reports recognition experiments using an acoustic model trained on dysarthric speech. To address the disorder-specific problem that utterances fluctuate easily due to muscle tension, it also proposes a feature extraction method using a CNN with a bottleneck structure (CBN).
Research society, Japanese - Hierarchical Sparse Representation for Object Recognition
NAKASHIKA Toru; OKUMURA Takeshi; TAKIGUCHI Tetsuya; ARIKI Yasuo
Transactions on Machine Learning and Artificial Intelligence, 2, 1, 46-60, Feb. 2014, Peer-reviewed
Scientific journal, English - Depth Spatial Pyramid: a Pooling Method for 3D-Object Recognition
NAKASHIKA Toru; HORI Takafumi; TAKIGUCHI Tetsuya; ARIKI Yasuo
Advances in Computer Science and Engineering, 12, 1, 15-30, 2014, Peer-reviewed
Scientific journal, English - A study of voice conversion using speaker-adaptive restricted Boltzmann machines
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
IEICE Technical Report, 114, 365, 165-170, 2014
Symposium, Japanese - Voice conversion in noisy environments by NMF using speaker adaptation
FUJII Takao; AIHARA Ryo; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2014 Autumn Meeting of the Acoustical Society of Japan, 345-348, 2014
Research society, Japanese - Voice conversion using speaker-dependent recurrent temporal restricted Boltzmann machines
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2014 Autumn Meeting of the Acoustical Society of Japan, 219-222, 2014
Research society, Japanese - Generating optimal random matrices for dysarthric speech feature extraction using a genetic algorithm
KATAOKA Yuichiro; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2014 Autumn Meeting of the Acoustical Society of Japan, 83-86, 2014
Research society, Japanese - A joint restricted Boltzmann machine for sparse-representation-based voice conversion
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
IEICE Technical Report, 114, 52, 343-348, 2014
Symposium, Japanese - Convolutive Bottleneck Network with Dropout for Dysarthric Speech Recognition
NAKASHIKA Toru; YOSHIOKA Toshiya; TAKIGUCHI Tetsuya; ARIKI Yasuo; DUFFNER Stefan; GARCIA Christophe
Transactions on Machine Learning and Artificial Intelligence, 2, 2, 46-60, 2014, Peer-reviewed
Scientific journal, English - VOICE CONVERSION BASED ON NON-NEGATIVE MATRIX FACTORIZATION USING PHONEME-CATEGORIZED DICTIONARY
Ryo Aihara; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 7944-7948, 2014, Peer-reviewed, We present in this paper an exemplar-based voice conversion (VC) method using a phoneme-categorized dictionary. Sparse-representation-based VC using non-negative matrix factorization (NMF) is employed for spectral conversion between different speakers. In our previous NMF-based VC method, source exemplars and target exemplars are extracted from parallel training data, having the same texts uttered by the source and target speakers. The input source signal is represented using the source exemplars and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. However, this exemplar-based approach needs to hold all the training exemplars (frames), and it may cause mismatching of phonemes between input signals and selected exemplars. In this paper, in order to reduce the mismatching of phoneme alignment, we propose a phoneme-categorized sub-dictionary and a dictionary selection method using NMF. By using the sub-dictionary, the performance of VC is improved compared to a conventional NMF-based VC. The effectiveness of this method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method and a conventional NMF-based method.
International conference proceedings, English - VOICE CONVERSION IN TIME-INVARIANT SPEAKER-INDEPENDENT SPACE
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 7939-7943, 2014, Peer-reviewed, In this paper, we present a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain time-invariant speaker-independent spaces where voice features are converted more easily than those in an original acoustic feature space. First, we train two CRBMs for a source and target speaker independently using speaker-dependent training data (without the need to parallelize the training data). Then, a small amount of parallel data is fed into each CRBM, and the high-order features produced by the CRBMs are used to train a concatenating neural network (NN) between the two CRBMs. Finally, the entire network (the two CRBMs and the NN) is fine-tuned using the acoustic parallel data. Through voice-conversion experiments, we confirmed the high performance of our method in terms of objective and subjective evaluations, comparing it with conventional GMM, NN, and speaker-dependent DBN approaches.
International conference proceedings, English - Probabilistic spectral envelope modeling of musical instruments within the non-negative matrix factorization framework for mixed music analysis
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Acoustical Science and Technology, Acoustical Society of Japan, 35, 4, 181-191, 2014, Peer-reviewed, Non-negative matrix factorization (NMF) has been one of the most useful techniques for musical signal analysis in recent years. In particular, supervised NMF, in which a large number of instrumental samples are used for the analysis, is garnering much attention with respect to analytical accuracy and speed. The accuracy, however, deteriorates if the system does not have enough samples. Therefore, in principle, such methods require as many samples as possible in order for the analysis to be accurate. In this paper, we propose an analysis method that 1) does not require the collection of a large number of training samples, and 2) combines the NMF and probabilistic approaches. In this approach, it is assumed that each instrumental category has a model-invariant feature, called a probabilistic spectral envelope (PSE). As an extension of a spectral envelope, this feature represents the probabilities of spectral envelopes belonging to the instrumental category in a two-dimensional (frequency-amplitude) space. The analysis of an input musical signal is carried out using a supervised NMF framework, where the basis matrix contains the optimum spectra that have been generated from pretrained PSEs. © 2014 The Acoustical Society of Japan.
Scientific journal, English - 3D-Object Recognition Based on LLC Using Depth Spatial Pyramid
Toru Nakashika; Takafumi Hori; Tetsuya Takiguchi; Yasuo Ariki
2014 22ND INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), IEEE COMPUTER SOC, 4224-4228, 2014, Peer-reviewed, Recently introduced high-accuracy RGB-D cameras are capable of easily providing high-quality three-dimensional information (color and depth information). The overall shape of an object can be understood by acquiring depth information. However, conventional methods that adopt this camera use depth information only to extract local features. To improve object recognition accuracy, in our approach the overall object shape is expressed by a depth spatial pyramid based on depth information. In more detail, multiple features within each subregion of the depth spatial pyramid are pooled. As a result, a feature representation including depth topological information is constructed. We use histograms of oriented normal vectors (HONV), designed to capture local geometric characteristics, as 3D local features and locality-constrained linear coding (LLC) to project each descriptor into its local-coordinate system. In image recognition experiments, the proposed method improved the recognition rate compared with conventional methods.
International conference proceedings, English - Dysarthric Speech Recognition Using a Convolutive Bottleneck Network
Toru Nakashika; Toshiya Yoshioka; Tetsuya Takiguchi; Yasuo Ariki; Stefan Duffner; Christophe Garcia
2014 12TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP), IEEE, 505-509, 2014, Peer-reviewed, In this paper, we investigate the recognition of speech produced by a person with an articulation disorder resulting from athetoid cerebral palsy. The articulation of the first spoken words tends to become unstable due to strain on speech muscles, and that causes a degradation in the performance of traditional speech recognition systems. Therefore, we propose a robust feature extraction method using a convolutive bottleneck network (CBN) instead of the well-known MFCC. The CBN stacks multiple various types of layers, such as a convolution layer, a subsampling layer, and a bottleneck layer, forming a deep network. Applying the CBN to feature extraction for dysarthric speech, we expect that the CBN will reduce the influence of the unstable speaking style caused by the athetoid symptoms. We confirmed its effectiveness through word-recognition experiments, where the CBN-based feature extraction method outperformed the conventional feature extraction method.
International conference proceedings, English - Voice conversion using speaker-dependent conditional restricted Boltzmann machines
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
IEICE Technical Report, IEICE, 113, 366, 83-88, Dec. 2013, This study proposes a voice conversion method that uses a conditional restricted Boltzmann machine (CRBM) for each speaker, aiming to form speaker-dependent spaces that suppress phonetic content and temporal variation relative to the original acoustic feature space while emphasizing speaker individuality, so that the source speaker's voice is more easily converted into the target speaker's. First, CRBMs for the source and target speakers are trained independently on training data prepared for each speaker (which need not be parallel). Next, the acoustic features of a small amount of parallel data are projected into each speaker-dependent high-order space through the corresponding CRBM (forward inference), and the resulting high-order features are converted from source to target with a neural network (NN). The features obtained from the NN conversion can then be mapped back to the original acoustic features by backward inference through the CRBM.
Symposium, Japanese - Voice conversion by non-negative matrix factorization based on dictionary selection
AIHARA Ryo; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2013 Autumn Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 1473-1476, Sep. 2013, This paper takes up the most common voice conversion task, speaker conversion with speech spectra as features, and proposes introducing dictionary selection to improve the accuracy of NMF-based voice conversion. Previously, all frames of the parallel data were used directly as dictionary bases, making the dictionary enormous, so the phoneme of a basis selected from the source-speaker dictionary did not necessarily match that of the input frame. This paper therefore builds sub-dictionaries by dividing the source- and target-speaker dictionaries into phoneme categories, recognizes the phoneme category using NMF, and performs voice conversion by mapping on the selected sub-dictionary.
Research society, Japanese - Voice conversion using deep learning that considers temporal dynamics
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2013 Autumn Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 1471-1472, Sep. 2013, This study proposes a voice conversion method within a deep learning framework that captures the temporal dynamics of speech using conditional restricted Boltzmann machines.
Research society, Japanese - Speech recognition for persons with articulation disorders using convolutional neural networks
YOSHIOKA Toshiya; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2013 Autumn Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 167-168, Sep. 2013, The proposed method builds convolutional neural networks (CNNs) whose input layer takes two-dimensional features obtained from speech spectrograms and whose output layer is a vector encoding the phoneme information of the input, and uses them for feature extraction.
Research society, Japanese - A study of mixed-music analysis with harmonic-structure matrices using specmurt
NISHIMURA Daiki; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2013 Spring Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 843-844, Mar. 2013, Most of the music we hear is a mixture in which various instruments sound at the same time, yet specmurt analysis can handle only the polyphony of a single instrument. We therefore extend conventional specmurt and propose a new method that analyzes, from a mixture of multiple instruments, the pitches separated by instrument.
Research society, Japanese - Sparseness Criteria of F0-Frequencies Selection for Specmurt-Based Multi-Pitch Analysis without Modeling Harmonic Structure
NISHIMURA Daiki; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Journal of Signal Processing, Research Institute of Signal Processing, 17, 2, 29-38, Mar. 2013, Peer-reviewed, This paper introduces a multi-pitch analysis method using specmurt analysis without modeling the common harmonic structure pattern. Specmurt analysis is based on the idea that the fundamental frequency distribution is expressed as a deconvolution of the observed spectrum by the common harmonic structure pattern. To analyze the fundamental frequency distribution, the common harm
Scientific journal, English - A study of voice conversion using low-dimensional representations obtained by deep belief nets
NAKASHIKA Toru; TAKASHIMA Ryoichi; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2013 Spring Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 517-520, Mar. 2013, This paper proposed a voice conversion method that combines DBNs with an NN to perform nonlinear conversion in a low-dimensional space from which speaker individuality has been removed. Subjective and objective evaluation experiments were conducted, and the method showed high accuracy in both.
Research society, Japanese - SPARSE REPRESENTATION FOR OUTLIERS SUPPRESSION IN SEMI-SUPERVISED IMAGE ANNOTATION
Toru Nakashika; Takeshi Okumura; Tetsuya Takiguchi; Yasuo Ariki
2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 2080-2083, 2013, Peer-reviewed, Recently, generic object recognition (automatic image annotation), which achieves human-like vision using a computer, is being looked to for use in robot vision, automatic categorization of images, and retrieval of images. For the annotation, semi-supervised learning, which incorporates a large amount of unsupervised training data (unlabeled data) along with a small amount of supervised data (labeled data), is expected to be an effective tool, as it reduces the burden of manual annotation. However, some unlabeled data in semi-supervised models contain outliers that negatively affect the parameter estimation in the training stage. Such outliers often cause the over-fitting problem, especially when a small amount of training data is used. In this paper, we propose a practical method to prevent over-fitting in semi-supervised learning, suppressing existing outliers by sparse representation. In our experiments, we obtained a 4-point improvement over the conventional semi-supervised methods SemiNB and TSVM.
International conference proceedings, English - Voice Conversion in High-order Eigen Space Using Deep Belief Nets
Toru Nakashika; Ryoichi Takashima; Tetsuya Takiguchi; Yasuo Ariki
14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, ISCA-INT SPEECH COMMUNICATION ASSOC, 369-372, 2013, Peer-reviewed, This paper presents a voice conversion technique using Deep Belief Nets (DBNs) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. DBNs have a deep architecture that automatically discovers abstractions to maximally express the original input features. If we train the DBNs using only the speech of an individual speaker, it can be considered that there is less phonological information and relatively more speaker individuality in the output features at the highest layer. Training the DBNs for a source speaker and a target speaker, we can then connect and convert the speaker individuality abstractions using Neural Networks (NNs). The converted abstraction of the source speaker is then brought back to the cepstrum space using an inverse process of the DBNs of the target speaker. We conducted speaker voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method.
International conference proceedings, English - A Combination of Hand-Crafted and Hierarchical High-Level Learnt Feature Extraction for Music Genre Classification
Julien Martel; Toru Nakashika; Christophe Garcia; Khalid Idrissi
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2013, SPRINGER-VERLAG BERLIN, 8131, 397-404, 2013, Peer-reviewed, In this paper, we propose a new approach for automatic music genre classification which relies on learning a feature hierarchy with a deep learning architecture over hand-crafted features extracted from an audio signal. Unlike the state-of-the-art approaches, our scheme uses an unsupervised learning algorithm based on Deep Belief Networks (DBN) learnt on block-wise MFCC (that we treat as 2D images), followed by a supervised learning algorithm for fine-tuning the extracted features. Experiments performed on the GTZAN dataset show that the proposed scheme clearly outperforms the state-of-the-art approaches.
International conference proceedings, English - High-frequency restoration using deep belief nets for super-resolution
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings - 2013 International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2013, 38-42, 2013, Peer-reviewed, Super-resolution technology, which restores high-frequency information given a low-resolved image, has attracted much attention in recent years. Various super-resolution algorithms have been proposed so far: example-based, sparse-coding-based, GMM (Gaussian Mixture Model), BPLP (Back Projection for Lost Pixels), and so on. Most of these statistical approaches rely on training (or just preparing) the correspondence relationships between low-resolved/high-resolved images. In this paper, we propose a novel super-resolution method that is based on a statistical model but does not require any pairs of low- and high-resolved images in the database. In our approach, Deep Belief Nets are used to restore high-frequency information from a low-resolved image. The idea is that, using only high-resolved images, the trained networks seek the high-order dependencies among the observed nodes (each spatial frequency: e.g., high and low frequencies). Experimental results show the high performance of our proposed method. © 2013 IEEE.
International conference proceedings, English - Multi-pitch analysis by specmurt using weighted-norm-based F0 frequency selection
NISHIMURA Daiki; NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2012 Autumn Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 781-784, Sep. 2012, This paper demonstrated the effectiveness of specmurt-based multi-pitch analysis that accounts for sparseness through a weighted norm, without modeling the common harmonic structure. The method can analyze polyphonic sounds without learning timbres and without knowledge such as the number of simultaneous notes.
Research society, Japanese - Automatic music genre classification by local-feature integration using convolutional neural networks
NAKASHIKA Toru; Garcia Christophe; TAKIGUCHI Tetsuya; ARIKI Yasuo
Proceedings of the 2012 Autumn Meeting of the Acoustical Society of Japan, Acoustical Society of Japan, 789-790, Sep. 2012, With the recent growth of computing, digital music content has increased explosively, making it difficult to organize and search music data on the web and on personal devices. Against this background, automatic music genre classification, which automatically clusters similar music, has been studied actively. Taking a map-based approach, this paper proposes a method that uses gray-level co-occurrence matrices (GLCMs), image features computed from each map, as features, and classifies music genres with convolutional neural networks (ConvNets) while integrating multiple GLCMs.
Research society, Japanese - Local-feature-map Integration Using Convolutional Neural Networks for Music Genre Classification
Toru Nakashika; Christophe Garcia; Tetsuya Takiguchi
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, ISCA-INT SPEECH COMMUNICATION ASSOC, 1750-1753, 2012, Peer-reviewed, A map-based approach, which treats 2-dimensional acoustic features using image analysis, has recently attracted attention in music genre classification. While this is successful at extracting local music patterns compared with other frame-based methods, in most works the extracted features are not sufficient for music genre classification. In this paper, we focus on appropriate feature extraction and proper classification by integrating automatically learnt image features. For the musical feature extraction, we build gray level co-occurrence matrix (GLCM) descriptors with different offsets from a short-term mel spectrogram. These feature maps are integratively classified using convolutional neural networks (ConvNets). In our experiments, we obtained a large improvement of more than 10 points in classification accuracy on the GTZAN database, compared with other ConvNets-based methods.
International conference proceedings, English - Constrained Spectrum Generation Using A Probabilistic Spectrum Envelope for Mixed Music Analysis
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 181-184, Oct. 2011, Peer-reviewed
International conference proceedings, English - GENERIC OBJECT RECOGNITION USING AUTOMATIC REGION EXTRACTION AND DIMENSIONAL FEATURE INTEGRATION UTILIZING MULTIPLE KERNEL LEARNING
Toru Nakashika; Akira Suga; Tetsuya Takiguchi; Yasuo Ariki
2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, IEEE, 1229-1232, 2011, Peer-reviewed, Recently, in generic object recognition research, a classification technique based on integration of image features is garnering much attention. However, with a classifying technique using feature integration, there are some features that may cause incorrect recognition of objects and a large amount of noise that causes a degradation in the recognition accuracy of image data. In this paper, we propose feature selection in an object area that is restricted by removing its background region, and multiple kernel learning (MKL) to weight each dimension, as well as the features themselves. This enables accurate and effective weighting since the weight is computed for each dimension using the selected feature. Experimental results indicate the validity of automatic feature selection. Classification performance is improved by using a background removing technique that utilizes saliency maps and graph cuts, and each dimensional weighting method using MKL.
International conference proceedings, English - Probabilistic Spectrum Envelope: Categorized Audio-features Representation for NMF-based Sound Decomposition
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, ISCA-INT SPEECH COMMUNICATION ASSOC, 1776-1779, 2011, Peer-reviewed, NMF (Non-negative Matrix Factorization) has been one of the most useful techniques for audio signal analysis in recent years. In particular, supervised NMF, in which a large number of samples is used for analyzing a signal, is garnering much attention in sound source separation or noise reduction research. However, because such methods require all the possible samples for the analysis, it is hard to build a practical system based on this method. In this paper, we propose a novel method of signal analysis that combines the NMF and probabilistic approaches. In this approach, it is assumed that each audio-source category (such as phonemes or musical instruments) has an environment-invariant feature, called a probabilistic spectrum envelope (PSE). At the start, the PSE of each category is learned using a technique based on Gaussian Process Regression. Then, the observed spectrum is analyzed using a combination of supervised NMF and Genetic Algorithm with pre-trained PSEs.
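The supervised-NMF step that this abstract builds on can be sketched as a generic multiplicative-update decomposition with a fixed, pre-trained basis W. This is a sketch under the Euclidean cost, not the paper's PSE or Genetic Algorithm machinery:

```python
import numpy as np

def supervised_nmf(V, W, n_iter=300, eps=1e-9):
    """Decompose V (freq x time) onto a fixed dictionary W (freq x bases).

    Only the activations H are updated (Lee-Seung multiplicative rule for
    the Euclidean cost), as in supervised NMF where W is pre-trained.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))  # non-negative random init
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # keeps H non-negative
    return H

# Toy check: a mixture of two known spectral bases is recovered
W = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
H_true = np.array([[1.0, 0.2], [0.3, 1.0]])
V = W @ H_true
H = supervised_nmf(V, W)
```

The multiplicative form guarantees H stays non-negative and the reconstruction error decreases monotonically, which is why it is the standard workhorse for such decompositions.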
International conference proceedings, English - Speech Synthesis by Modeling Harmonics Structure with Multiple Function
Toru Nakashika; Ryuki Tachibana; Masafumi Nishimura; Tetsuya Takiguchi; Yasuo Ariki
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, ISCA-INT SPEECH COMMUNICATION ASSOC, 945-+, 2010, Peer-reviewed, In this paper, we present a new approach for speech synthesis, in which speech utterances are synthesized using the parameters of a spectro-modeling function (Multiple function). With this approach, only harmonic parts are extracted from the phoneme spectrum, and the time-varying spectrum corresponding to the harmonics or sinusoidal components is modeled using the Multiple function. We introduce two types of the functions, and present the method to estimate the parameters of each function using the observed phoneme spectrum. In the synthesis stage, speech signals are generated from the parameters of the Multiple function. The advantage of this method is that it only requires a few speech synthesis parameters. We discuss the effectiveness of our proposed method through experimental results.
International conference proceedings, English - MATHEMATICAL MODELING OF HARMONIC-TIMBRE STRUCTURE WITH MULTI-BETA-DISTRIBUTION
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
2009 IEEE/SP 15TH WORKSHOP ON STATISTICAL SIGNAL PROCESSING, VOLS 1 AND 2, IEEE, 768-771, 2009, Peer-reviewed, Recently, a large amount of signal processing technology research on applications associated with music is being carried out. Sound synthesis, in particular, is one of the most interesting research themes. In this paper we propose a new approach to mathematically modeling harmonic-timbre structure with multi-beta-distribution (MBD). This probabilistic distribution has the advantage of enabling one to easily express varied timbre-structure with only a few parameters. We will define MBD itself, and present a method of estimating MBD parameters. Some experimental results are presented to discuss the performance of this method.
International conference proceedings, English
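The core idea of the multi-beta-distribution above, a weighted sum of beta densities over normalized frequency as a compact timbre envelope, can be illustrated as follows. This is an illustrative sketch with arbitrary example parameters, not the paper's exact parameterization or estimation method:

```python
import numpy as np
from math import gamma

def beta_pdf(x, a, b):
    # Beta(a, b) density on (0, 1); a, b are scalar shape parameters
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

def mbd_envelope(freqs, components):
    """Timbre-envelope model as a weighted mixture of beta densities.

    freqs: frequencies normalized to (0, 1); components: list of (weight, a, b).
    A handful of (weight, a, b) triples describes a varied spectral shape.
    """
    return sum(w * beta_pdf(freqs, a, b) for w, a, b in components)

# Two-component example: a low-frequency lobe plus a higher-frequency lobe
f = np.linspace(0.001, 0.999, 999)
env = mbd_envelope(f, [(0.7, 2.0, 8.0), (0.3, 6.0, 3.0)])
```

Because each component integrates to one over (0, 1), the mixture weights directly control how spectral energy is shared between the lobes, which is what makes the representation compact.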
MISC
- コーヒーブレイク 〜 あの日私は
中鹿 亘
日本音響学会, 01 Apr. 2022, 日本音響学会誌, 78, 4, 210-211, Japanese, Invited, Introduction scientific journal - Feature Extraction Using Adaptive Restricted Boltzmann Machine for Dysarthric Speech Recognition
高島 悠樹; 中鹿 亘; 滝口 哲也; 有木 康雄
電子情報通信学会, 01 Mar. 2017, 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 116, 477, 321-326, Japanese, 0913-5685, 40021161268, AN10013221 - Phone Labeling Based on Gaussian Mixture Model for Dysarthric Speech Recognition
高島 悠樹; 中鹿 亘; 滝口 哲也
日本音響学会, 18 Jun. 2015, 聴覚研究会資料 = Proceedings of the auditory research meeting, 45, 4, 275-280, Japanese, 1346-1109, 40020532927, AN00227138 - Voice Conversion Using Speaker Adaptive Restricted Boltzmann Machine
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Voice conversion (VC) is a technique where only speaker-specific information in source speech is converted while keeping phonological information. The technique can be applied to various tasks such as speaker-identity conversion, emotion conversion and aid to speaking for people with articulation disorders. Most of the existing VC methods rely on parallel data: pairs of speech data from source and target speakers uttering the same articles. However, this approach involves several problems; firstly, the data used for the training is limited to the pre-defined articles. Secondly, the use of the trained model is limited only to the speaker pair used in the training. In this paper, we propose a novel probabilistic model called an adaptive restricted Boltzmann machine (ARBM) for VC between arbitrary speakers without use of parallel data. This model consists of a visible-unit and a hidden-unit layer with the speaker-dependent connection. In this paper, we report our experimental results of arbitrary-speaker VC using our model, an ARBM., Information Processing Society of Japan (IPSJ), 08 Dec. 2014, IPSJ SIG Notes, 2014, 30, 1-6, Japanese, 110009850974, AN10442647 - A joint restricted Boltzmann machine for dictionary learning in sparse-representation-based voice conversion
中鹿 亘; 滝口 哲也; 有木 康雄
In voice conversion, sparse-representation-based methods have recently been garnering attention because they are, relatively speaking, not affected by over-fitting or over-smoothing problems. In these approaches, voice conversion is achieved by estimating a sparse vector that determines which dictionaries of the target speaker should be used, calculated from the matching of the input vector and dictionaries of the source speaker. The sparse-representation-based voice conversion methods can be broadly divided into two approaches: 1) an approach that uses raw acoustic features in the training data as parallel dictionaries, and 2) an approach that trains parallel dictionaries from the training data. Our approach belongs to the latter; we systematically estimate the parallel dictionaries using a restricted Boltzmann machine, a fundamental technology commonly used in deep learning. Through voice-conversion experiments, we confirmed the high performance of our method, comparing it with the conventional Gaussian mixture model (GMM)-based approach, and a non-negative matrix factorization (NMF)-based approach, which is based on sparse representation., 17 May 2014, 研究報告音楽情報科学(MUS), 2014, 66, 1-6, Japanese, 170000083787, AN10438388 - Speaker-dependent conditional restricted Boltzmann machine for voice conversion
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
In this paper, we present a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain time-invariant speaker-independent spaces where voice features are converted more easily than those in an original acoustic feature space. First, we train two CRBMs for a source and target speaker independently using speaker-dependent training data (without the need to parallelize the training data). Then, a small number of parallel data are fed into each CRBM and the high-order features produced by the CRBMs are used to train a concatenating neural network (NN) between the two CRBMs. Finally, the entire network (the two CRBMs and the NN) is fine-tuned using the acoustic parallel data. Through voice-conversion experiments, we confirmed the high performance of our method in terms of objective and subjective evaluations, comparing it with conventional GMM, NN, and speaker-dependent DBN approaches., The Institute of Electronics, Information and Communication Engineers, 19 Dec. 2013, IEICE technical report. Speech, 113, 366, 83-88, Japanese, 0913-5685, 110009903078, AN10013221 - Speaker-dependent conditional restricted Boltzmann machine for voice conversion
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
In this paper, we present a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain time-invariant speaker-independent spaces where voice features are converted more easily than those in an original acoustic feature space. First, we train two CRBMs for a source and target speaker independently using speaker-dependent training data (without the need to parallelize the training data). Then, a small number of parallel data are fed into each CRBM and the high-order features produced by the CRBMs are used to train a concatenating neural network (NN) between the two CRBMs. Finally, the entire network (the two CRBMs and the NN) is fine-tuned using the acoustic parallel data. Through voice-conversion experiments, we confirmed the high performance of our method in terms of objective and subjective evaluations, comparing it with conventional GMM, NN, and speaker-dependent DBN approaches., Information Processing Society of Japan (IPSJ), 12 Dec. 2013, IPSJ SIG Notes, 2013, 14, 1-6, Japanese, 110009646537, AN10442647 - 確率スペクトル包絡を用いた混合音解析における制約付きスペクトル生成法の検討
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Jul. 2011, IEICE Speech Committee, SP2011-50,pp. 51-56, Japanese, Report scientific journal - 確率スペクトル包絡に基づくNMF 基底生成モデルを用いた混合楽音解析
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Feb. 2011, IPSJ-SIGMUS, Vol.2011-MUS-89,No.18, pp. 1-6, Japanese, Report scientific journal - 基底の反復生成と教師ありNMFを用いた信号解析
NAKASHIKA Toru; TAKIGUCHI Tetsuya; ARIKI Yasuo
Dec. 2010, IEICE Speech Committee, SP2010-102,pp. 195-200, Japanese, Report scientific journal - 物体領域特徴の自動選定とマルチカーネル学習を用いた特徴統合による一般物体認識
NAKASHIKA Toru; SUGA Akira; TAKIGUCHI Tetsuya; ARIKI Yasuo
Jul. 2010, MIRU, OS8-2, pp. 1404-1411, Japanese, Report scientific journal - 多重ベータ混合モデルを用いた調波時間構造のモデル化による音声合成の検討
NAKASHIKA Toru; TACHIBANA Ryuki; NISHIMURA Masafumi; TAKIGUCHI Tetsuya; ARIKI Yasuo
Dec. 2009, 第11回音声言語シンポジウム, SP2009-93,No. 29,pp. 165-170, Japanese, Report scientific journal
Books and other publications
Lectures, oral presentations, etc.
- 2 種のラグ窓によるスペクトル平滑化を用いた F0 推定
越森 道貴; 嵯峨山 茂樹; 中鹿 亘
日本音響学会2024年春季研究発表会
Mar. 2024 - FaderNetworks を用いた F0 変換による歌唱技術の付与
後藤 純平; 中鹿 亘
日本音響学会2024年春季研究発表会
Mar. 2024 - 歌唱音声合成における F0 の自然性向上のための Diffusion-GAN モデルの検討
芦田 裕飛; 中鹿 亘
日本音響学会2024年春季研究発表会
Mar. 2024 - 拡散確率モデルを用いたノンパラレルな Any-to-many 声質変換
畠山 瑠一; 奥田 耕平; 中鹿 亘
日本音響学会2024年春季研究発表会
Mar. 2024 - 事前学習済みモデルによる埋め込み表現を組み込んだ音声編集モデルの検討
平本 佳弘; 中鹿 亘
日本音響学会2024年春季研究発表会
Mar. 2024 - 分類型半制限ボルツマンマシンによる全音程関係を考慮した和音認識
石川峻弥; 中鹿亘
日本音響学会2024年春季研究発表会
Mar. 2024 - Transformerを用いた脳波信号からの音声復元の検討
水野友暁; 岸田拓也; 吉村奈津江; 中鹿亘
第151回音声言語情報処理研究発表会
Mar. 2024 - 潜在変数と観測データにガンマ分布を仮定したVAEによる音声振幅スペクトル表現
今市夏菜子; 中鹿亘
第151回音声言語情報処理研究発表会
Mar. 2024 - 複数のラグ窓対を用いた音声基本周波数と周期性尺度の推定
越森道貴; 嵯峨山茂樹; 中鹿亘
第151回音声言語情報処理研究発表会
Mar. 2024 - DDPMVC: 連続時間拡散確率モデルを用いた非パラレル声質変換と評価
畠山瑠一; 奥田耕平; 中鹿亘
第151回音声言語情報処理研究発表会
Mar. 2024 - ベータ分布に基づくFaderNetを用いた音声印象変換の性能評価
釘本咲; 中鹿亘
日本音響学会2023年秋季研究発表会
Sep. 2023 - レイリー型制限ボルツマンマシンを用いた独立低ランク行列分析に基づくブラインド音源分離
古田翔太郎; 中鹿亘
日本音響学会2023年秋季研究発表会
Sep. 2023 - SiFiSinger: SiFi-GANを内包した歌唱音声合成
芦田裕飛; 中鹿亘
日本音響学会2023年秋季研究発表会
Sep. 2023 - FaderNetを用いた未知話者に対する音声印象変換
釘本咲; 中鹿亘
音学シンポジウム2023
Jun. 2023 - 入力特徴量で条件づけた拡散確率モデルによるパラレル声質変換
岸田拓也; 中鹿亘
第146回研究会音声言語情報処理研究会
Mar. 2023 - Speechsplit を用いたイントネーション・リズム・発音の矯正による外国語アクセント変換
許 誠; 岸田 拓也; 中鹿 亘
日本音響学会2023年春季研究発表会
Mar. 2023 - 振幅重み付けエネルギー関数を用いたボルツマンマシンによる位相復元
羽賀洋克; 矢田部浩平; 岸田拓也; 中鹿亘
日本音響学会2023年春季研究発表会
Mar. 2023 - Dual Diffusion Implicit Bridgesを用いた話者間の匿名性を担保した声質変換
奥田耕平; 岸田拓也; 中鹿亘
日本音響学会2023年春季研究発表会
Mar. 2023 - 条件付き制限ボルツマンマシンの平衡化傾向を利用したノンパラレル声質変換
岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2022年秋季研究発表会, Domestic conference
Sep. 2022 - 話者因子係数の量子化に基づく声色制御可能な話者変換
井硲 巧; 大西 弘太郎; 岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2022年秋季研究発表会, Domestic conference
Sep. 2022 - MoCoVC: モーメンタム対照表現学習によるノンパラレル声質変換
大西 弘太郎; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2022年秋季研究発表会, Domestic conference
Sep. 2022 - マルチモーダルVAEを用いた顔画像に基づく目標話者音声不要な声質変換
飯田紘崇; 岸田拓也; 中鹿亘
Oral presentation, Japanese, 日本音響学会2022年春季研究発表会, Domestic conference
Mar. 2022 - 時系列条件付きボルツマンマシンによる位相復元
羽賀洋克; 矢田部浩平; 岸田拓也; 中鹿亘
Oral presentation, Japanese, 日本音響学会2022年春季研究発表会, Domestic conference
Mar. 2022 - 印象表現語ラベルを用いたFaderNetworksに基づく音声印象変換
岡留有希; 大西弘太郎; 岸田拓也; 中鹿亘
Oral presentation, Japanese, 日本音響学会2022年春季研究発表会, Domestic conference
Mar. 2022 - TTSモデルにおけるアラインメントロバスト性向上のための非停滞化制約付きForward Attention
Zhou Yujin; 岸田拓也; 中鹿亘
Oral presentation, Japanese, 日本音響学会2022年春季研究発表会, Domestic conference
Mar. 2022 - 非可逆圧縮を用いた敵対的ニューラルボコーダのためのデータ拡張法
大西弘太郎; 中鹿亘
Oral presentation, Japanese, 日本音響学会2022年春季研究発表会, Domestic conference
Mar. 2022 - リズムスタイルを考慮したFader Networksに基づく外国語学習者の発音変換
王庭輝; 岸田拓也; 中鹿亘
Oral presentation, Japanese, 日本音響学会2022年春季研究発表会, Domestic conference
Mar. 2022 - 深層エネルギーベースモデルによる音声の音響特徴量の生成
岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2021年秋季研究発表会, Domestic conference
Sep. 2021 - 深層エネルギーベースモデルによる音声の音響特徴量の生成
井硲 巧; 岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2021年秋季研究発表会, Domestic conference
Sep. 2021 - 話者特徴抽出器を加えたFaderNetVCによる未知話者声質変換
井硲巧; 岸田拓也; 中鹿亘
Poster presentation, Japanese, 音学シンポジウム2021, Domestic conference
Jun. 2021 - VQVAEに基づくリアルタイム波形ベース声質変換の検討
大西 弘太郎; 中鹿 亘; 松本 光春
Oral presentation, Japanese, 日本音響学会2021年春季研究発表会, Domestic conference
Mar. 2021 - 条件付きボルツマンマシンによる位相復元の初期検討
羽賀 洋克; 矢田部 浩平; 岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2021年春季研究発表会, Domestic conference
Mar. 2021 - Attention RBMによる音声特徴量系列の符号化と生成
岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2021年春季研究発表会, Domestic conference
Mar. 2021 - Cluster ARBM を用いた話者・音韻相互作用分類による声質変換
岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2020年秋季研究発表会, Domestic conference
Sep. 2020 - HMelGAN: 階層的構造を導入した敵対的学習ネットワークに基づく高速ニューラルボコーダ
大西 弘太郎; 中鹿 亘; 松本 光春
Oral presentation, Japanese, 日本音響学会2020年秋季研究発表会, Domestic conference
Sep. 2020 - Speech chain を模倣したボルツマンマシンによるワンショット多対多声質変換の検討
岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2020年春季研究発表会, Domestic conference
Mar. 2020 - マルチタスクモデルを用いたdisentangleな学習による楽器音変換
荒川 賢也; 岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2020年春季研究発表会, Domestic conference
Mar. 2020 - 適応型 RBM を用いた音声情報の分離による話者と感情の同時変換
塚本 伸; 岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2020年春季研究発表会, Domestic conference
Mar. 2020 - 適応型RBMを用いたノンパラレル感情音声変換
塚本 伸; 岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2019年秋季研究発表会, Domestic conference
Sep. 2019 - Fader Networksを用いた楽器音変換
荒川 賢也; 岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2019年秋季研究発表会, Domestic conference
Sep. 2019 - 複素VAE: 音声の複素スペクトルを直接表現する新しい変分自己符号化器
中鹿 亘
Oral presentation, Japanese, 日本音響学会2019年秋季研究発表会, Domestic conference
Sep. 2019 - Speech chain VC: 音声コミュニケーションの言語-生理-音響連鎖を考慮する声質変換
岸田 拓也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2019年秋季研究発表会, Domestic conference
Sep. 2019 - Degree of Inharmonicity: Index to Evaluate Sustained Pedal Control
Toru Nakashika; Eriko Aiba
Poster presentation, English, International Symposium on Performance Science (ISPS) 2019, International conference
Jul. 2019 - パラレル制約付きVAEを用いた未知話者声質変換の検討
大西 弘太郎; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2019年春季研究発表会, Domestic conference
Mar. 2019 - セミパラレル手法による適応型 RBM を用いた声質変換の性能改善
塚本 伸; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2019年春季研究発表会, Domestic conference
Mar. 2019 - VAE を用いた多対多声質変換における音素識別制約の検討
木庭 慶人; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2019年春季研究発表会, Domestic conference
Mar. 2019 - スペクトル系列誤差に基づくDNN音声波形モデルの学習
高木 信二; 中鹿 亘; 山岸 順一
Oral presentation, Japanese, 日本音響学会2018年秋季研究発表会, Domestic conference
Sep. 2018 - 音声スペクトル系列の自己回帰性を考慮した複素RBMの拡張
中鹿 亘; 高木 信二; 山岸 順一
Oral presentation, Japanese, 日本音響学会2018年秋季研究発表会, Domestic conference
Sep. 2018 - DRMを用いた唇動画像と音声の双方向変換
塚本伸; 中鹿亘
Poster presentation, Japanese, 音学シンポジウム2018, Domestic conference
Jun. 2018 - RBMを用いた楽器音基底と演奏情報への分離による多重音解析
荒川賢也; 中鹿亘
Oral presentation, Japanese, 2018年度人工知能学会全国大会, Domestic conference
May 2018 - 長・短期記憶構造を持つ拡張ボルツマンマシンの検討
中鹿亘
Oral presentation, Japanese, 日本音響学会2018年春季研究発表会, Domestic conference
Mar. 2018 - 非負値タッカー分解による NMF 辞書学習に基づく非パラレル声質変換
高島悠樹; 矢野肇; 中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2018年春季研究発表会, Domestic conference
Mar. 2018 - GGDRMによる双方向変換を考慮したDNN声質変換のための事前学習法
曾根 健太郎; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2018年春季研究発表会, Domestic conference
Mar. 2018 - RBMを用いた楽器音基底と演奏情報への分離による多重音解析の検討
荒川 賢也; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2018年春季研究発表会, Domestic conference
Mar. 2018 - DRMを用いた唇動画像と音声の双方向変換の検討
塚本 伸; 曾根 健太郎; 中鹿 亘
Oral presentation, Japanese, 日本音響学会2018年春季研究発表会, Domestic conference
Mar. 2018 - リカレント構造を持つ複素制限ボルツマンマシンによる複素スペクトル系列モデリング
中鹿亘; 高木信二; 山岸順一
Oral presentation, Japanese, 第120回音声言語情報処理研究会, Domestic conference
Feb. 2018 - 国際会議Interspeech2017報告
高木 信二; 倉田 岳人; 郡山 知樹; 塩田 さやか; 鈴木 雅之; 玉森 聡; 俵 直弘; 中鹿 亘; 福田 隆; 増村 亮; 森勢 将雅; 山岸 順一; 山本 克彦
Oral presentation, Japanese, 第120回音声言語情報処理研究会, Domestic conference
Feb. 2018 - フェイクデータを用いた ARBM に基づく非パラレル声質変換手法の改善
中鹿亘
Oral presentation, Japanese, 日本音響学会2017年秋季研究発表会, Domestic conference
Sep. 2017 - 複素 RBM を用いた音声スペクトルモデリングの改良と評価
中鹿亘; 高木信二; 山岸順一
Oral presentation, Japanese, 日本音響学会2017年秋季研究発表会, Domestic conference
Sep. 2017 - GCDRMを用いたテキスト・音声の同時確率表現に基づく音声認識・合成器の同時構築
曾根健太郎; 中鹿亘
Oral presentation, Japanese, 日本音響学会2017年秋季研究発表会, Domestic conference
Sep. 2017 - Practice Process Analysis Using Score Matching Method Based on OBE-DTW and Its Effects on Memorizing Musical Score
Toru Nakashika; Eriko Aiba
Poster presentation, English, International Symposium on Performance Science (ISPS) 2017, International conference
Aug. 2017 - 複素RBM:制限ボルツマンマシンの複素数拡張と音声信号への応用と評価
中鹿亘; 高木信二; 山岸順一
Oral presentation, Japanese, 情報処理学会音声言語研究会技術研究報告, Domestic conference
Jul. 2017 - テキスト・音声間の双方向変換に基づくDNN音声認識・合成のための事前学習法
曾根健太郎; 中鹿亘; 南泰浩
Poster presentation, Japanese, 音学シンポジウム2017, Domestic conference
Jun. 2017 - 話者クラスタ適応学習可能な拡張制限ボルツマンマシンに基づく非パラレル声質変換
曾根健太郎; 中鹿亘; 南泰浩
Poster presentation, Japanese, 音学シンポジウム2017, Domestic conference
Jun. 2017 - クラスタ適応制限ボルツマンマシンを用いた話者クラスタリングと声質変換への応用
中鹿亘; 南泰浩
Oral presentation, Japanese, 第31回人工知能学会全国大会, Domestic conference
May 2017 - 適応型 Gaussian-Gaussian RBM を用いた構音障害者音声認識
高島悠樹; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2017年春季研究発表会, Domestic conference
Mar. 2017 - 構音障害者音声認識のための適応型 restricted Boltzmann machine を用いた特徴量抽出
高島悠樹; 中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
Mar. 2017 - 話者クラスタ適応学習可能な拡張制限ボルツマンマシンに基づく非パラレル声質変換
中鹿亘; 南泰浩
Oral presentation, Japanese, 日本音響学会2017年春季研究発表会, Domestic conference
Mar. 2017 - 複素RBM:制限ボルツマンマシンの複素数拡張と音声信号への応用
中鹿亘; 高木信二; 山岸順一
Oral presentation, Japanese, 日本音響学会2017年春季研究発表会, Domestic conference
Mar. 2017 - Simultaneous recognition of phone and speaker using three-way restricted Boltzmann machine
Toru Nakashika; Yasuhiro Minami
Poster presentation, English, The 5th Joint Meeting Acoustical Society of America and Acoustical Society of Japan, International conference
Nov. 2016 - 音響・音韻・話者ファクターを考慮したThree-way RBMによる話者・音素の同時認識
中鹿亘; 南泰浩
Oral presentation, Japanese, 日本音響学会2016年秋季研究発表会, Domestic conference
Sep. 2016 - Factored 3-Way Restricted Boltzmann Machine を用いたマルチモーダル音声認識の検討
高島悠樹; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2016年秋季研究発表会, Domestic conference
Sep. 2016 - Speech modeling using three-way restricted Boltzmann machine for simultaneous speaker-phoneme recognition
Toru Nakashika; Yasuhiro Minami
Poster presentation, Japanese, 音学シンポジウム2016, Domestic conference
May 2016 - Restricted Boltzmann Machine を用いた話者性・雑音を考慮したモデリングの検討
高島悠樹; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2016年春季研究発表会, Domestic conference
Mar. 2016 - 音響・音韻・話者情報を考慮したThree-Way Restricted Boltzmann Machineを用いた任意入力声質変換
中鹿亘; 南泰浩
Oral presentation, Japanese, 日本音響学会2016年春季研究発表会, Domestic conference
Mar. 2016 - 制約付きThree-Way Restricted Boltzmann Machineを用いた音響・音韻・話者情報の同時モデリング
中鹿亘; 滝口哲也
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
02 Dec. 2015 - 構音障害者音声認識のための確率表現に基づく音素ラベリングの検討
高島悠樹; 中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2015年秋季研究発表会, Domestic conference
Sep. 2015 - 遺伝的アルゴリズムを用いたランダム写像行列の選択
片岡悠一郎; 中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2015年秋季研究発表会, Domestic conference
Sep. 2015 - 話者正規化学習に基づく潜在的音韻情報を考慮した音声モデリングによる非パラレル声質変換
中鹿亘; 滝口哲也
Oral presentation, Japanese, 日本音響学会2015年秋季研究発表会, Domestic conference
Sep. 2015 - Modeling Deep Bidirectional Relationships for Image Classification and Generation
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Poster presentation, English, The 18th Meeting on Image Recognition and Understanding, Domestic conference
Jul. 2015 - 構音障害者音声認識のための混合正規分布に基づく音素ラベリングの検討
高島悠樹; 中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
Jun. 2015 - 適応型 Restricted Boltzmann Machine を用いたパラレルデータフリーな任意話者声質変換
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2015年春季研究発表会, Domestic conference
Mar. 2015 - 少量のパラレルデータを用いたNon-negative Matrix Factorizationによる雑音環境下の声質変換
藤井貴生; 相原龍; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2015年春季研究発表会, Domestic conference
Mar. 2015 - Deep Boltzmann Machine を用いた音素ラベル情報推定
高島悠樹; 中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2015年春季研究発表会, Domestic conference
Mar. 2015 - 話者適応型 Restricted Boltzmann Machine を用いた声質変換の検討
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
Dec. 2014 - 話者適応を用いたNMFによる雑音環境下の声質変換
藤井貴生; 相原龍; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2014年秋季研究発表会, Domestic conference
Sep. 2014 - 話者依存型 Recurrent Temporal Restricted Boltzmann Machine を用いた声質変換
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2014年秋季研究発表会, Domestic conference
Sep. 2014 - 遺伝的アルゴリズムを用いた 構音障害者の音声特徴量抽出に最適なランダム行列の生成
片岡悠一郎; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2014年秋季研究発表会, Domestic conference
Sep. 2014 - スパース表現に基づく声質変換のための結合型 restricted Boltzmann machine
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
May 2014 - 話者適応を用いたNMFによる声質変換
藤井貴生; 相原龍; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2014年春季研究発表会, Domestic conference
Mar. 2014 - 声質変換のための Restricted Boltzmann Machine を用いた パラレル辞書の学習法
中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2014年春季研究発表会, Domestic conference
Mar. 2014 - Convolutive Bottleneck Network 特徴量を用いた構音障害者の音声認識
吉岡利也; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2014年春季研究発表会, Domestic conference
Mar. 2014 - 話者依存型 Conditional Restricted Boltzmann Machine による声質変換
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
Dec. 2013 - 辞書選択に基づく非負値行列因子分解による声質変換
相原龍; 中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2013年秋季研究発表会, Domestic conference
Sep. 2013 - 時間変化を考慮した Deep Learning を用いた声質変換
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 日本音響学会2013年秋季研究発表会, Domestic conference
Sep. 2013 - Convolutional Neural Networksを用いた構音障害者のための音声認識
吉岡利也; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2013年秋季研究発表会, Domestic conference
Sep. 2013 - High-frequency Restoration using Deep Belief Nets for Super-resolution
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Poster presentation, Japanese, 画像の認識・理解シンポジウム (MIRU) 2013, Domestic conference
Jul. 2013 - RGB-D based 3D-Object Recognition by LLC using Depth Spatial Pyramid
Toru Nakashika; Takahiro Hori; Tetsuya Takiguchi; Yasuo Ariki
Poster presentation, Japanese, 画像の認識・理解シンポジウム (MIRU) 2013, Domestic conference
Jul. 2013 - Deep Belief Nets による低次元空間表現を用いた声質変換の検討
中鹿亘; 高島遼一; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2013年春季研究発表会, Domestic conference
Mar. 2013 - Specmurtを利用した調波構造行列による混合楽音解析の検討
西村大樹; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2013年春季研究発表会, Domestic conference
Mar. 2013 - Gray Level Co-occurrence Matrix を用いた時間・音高シフトに頑健な自動音楽ジャンル分類
中鹿 亘; Christophe Garcia; 滝口 哲也; 有木 康雄
Poster presentation, Japanese, 第15回日本音響学会関西支部若手研究者交流研究発表会, Domestic conference
Dec. 2012 - 重みつきノルム基準によるF0周波数選択を用いたSpecmurtによる多重音解析
西村大樹; 中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2012年秋季研究発表会, Domestic conference
Sep. 2012 - Convolutional Neural Networks を用いた局所特徴統合による 自動音楽ジャンル分類
中鹿亘; Garcia Christophe; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2012年秋季研究発表会, Domestic conference
Sep. 2012 - スパース性基準によるF0 周波数選択を用いたSpecmurt による多重音解析
西村 大樹; 中鹿 亘; 滝口 哲也; 有木 康雄
Poster presentation, Japanese, 日本音響学会2011年秋季研究発表会, Domestic conference
Sep. 2011 - 確率スペクトル包絡を用いた混合音解析における制約付きスペクトル生成法の検討
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
Jul. 2011 - スパース性を考慮したSpecmurtによる多重音解析
西村 大樹; 中鹿 亘; 滝口 哲也; 有木 康雄
Poster presentation, Japanese, 日本音響学会2011年春季研究発表会, Domestic conference
Mar. 2011 - 確率スペクトルを用いた基底生成モデルとNMFによる混合楽音解析
中鹿 亘; 滝口 哲也; 有木 康雄
Oral presentation, Japanese, 日本音響学会2011年春季研究発表会, Domestic conference
Mar. 2011 - 確率スペクトル包絡に基づくNMF 基底生成モデルを用いた混合楽音解析
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 第89回音楽情報科学研究会, Domestic conference
Feb. 2011 - 基底の反復生成と教師ありNMFを用いた信号解析
中鹿亘; 滝口哲也; 有木康雄
Oral presentation, Japanese, 電子情報通信学会技術研究報告, Domestic conference
Dec. 2010 - NMFと基底モデルを用いた多重楽音解析
中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2010年秋季研究発表会, Domestic conference
Sep. 2010 - 物体領域特徴の自動選定とマルチカーネル学習を用いた特徴統合による一般物体認識
中鹿亘; 須賀晃; 滝口哲也; 有木康雄
Oral presentation, Japanese, 画像の認識・理解シンポジウム (MIRU) 2010, Domestic conference
Jul. 2010 - 多重関数を用いた調波時間スペクトル形状のモデル化による音声合成
中鹿亘; 立花隆輝; 西村雅史; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2010年春季研究発表会, Domestic conference
Mar. 2010 - 多重ベータ混合モデルを用いた調波時間構造のモデル化による音声合成の検討
中鹿亘; 立花隆輝; 西村雅史; 滝口哲也; 有木康雄
Oral presentation, Japanese, 第11回音声言語シンポジウム, Domestic conference
Dec. 2009 - 多重ベータ分布を用いた音色形状の数理モデリングによる楽器音生成
中鹿亘; 滝口哲也; 有木康雄
Poster presentation, Japanese, 日本音響学会2009年秋季研究発表会, Domestic conference
Sep. 2009 - Mathematical Modeling of Harmonic-Timbre Structure with Multi-Beta-Distribution
Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Oral presentation, English, IEEE Statistical Signal Processing Workshop (SSP) 2009, International conference
Aug. 2009 - 多重ベータ分布による音色形状モデルを用いた 多重楽音の解析
中鹿 亘; 滝口 哲也; 有木 康雄
Oral presentation, Japanese, 日本音響学会2009年春季研究発表会, Domestic conference
Mar. 2009
Courses
- コンピュータサイエンス実験第二A
The University of Electro-Communications - コンピュータサイエンス実験第二A
電気通信大学 - コンピュータサイエンス実験第二B
The University of Electro-Communications - コンピュータサイエンス実験第二B
電気通信大学 - イノベイティブ総合コミュニケーションデザイン1
The University of Electro-Communications - イノベイティブ総合コミュニケーションデザイン1
電気通信大学 - 情報領域演習第二K演習
The University of Electro-Communications - 情報領域演習第二K演習
電気通信大学 - 情報領域演習第一P演習
The University of Electro-Communications - 情報領域演習第二Q演習
The University of Electro-Communications - 情報領域演習第二Q演習
電気通信大学 - コンピュータサイエンス実験第一
The University of Electro-Communications - コンピュータサイエンス実験第一
電気通信大学 - 情報領域演習第一P演習
The University of Electro-Communications - 情報領域演習第一P演習
電気通信大学 - Elements of Information Systems Fundamentals 2
The University of Electro-Communications - 情報システム基盤学基礎2
電気通信大学
Research Themes
- 非侵襲型脳波を用いた言語・非言語音声合成による次世代コミュニケーション技術の確立
中鹿 亘
日本学術振興会, 科学研究費助成事業, 電気通信大学, 基盤研究(A), 24H00715
01 Apr. 2024 - 31 Mar. 2029 - 深層エネルギーベースモデルによる創造的声質変換の研究
岸田 拓也; 中鹿 亘
日本学術振興会, 科学研究費助成事業, 愛知淑徳大学, 基盤研究(C), 23K11161
01 Apr. 2023 - 31 Mar. 2026 - 音響的分析と聞き手の心理評価に基づく表情豊かな英語スピーチ力の育成
Tomoko Yamashita
日本学術振興会, 科学研究費助成事業, 芝浦工業大学, 基盤研究(C), When giving a speech, a speaker conveys not only linguistic information but also paralinguistic information (intentionally expressed emotions, intentions, and attitudes), which makes the speech expressive to the listener. This study aims to clarify which acoustic features must appear in the speech of Japanese learners of English for paralinguistic information to be conveyed more effectively to listeners. The plan is to train learners' speech production using originally prepared speech scripts and model recordings, and to apply multivariate analysis to the acoustic features of pre- and post-training speech and to listeners' psychological evaluations, in order to identify which acoustic features are related to paralinguistic transmission. The findings are expected to be useful for teaching English speeches in English-language education. With the above final goal of developing English speech skills focused on emotional expression in mind, we first planned to collect information on situations and speech topics in which participants find it easy to express emotion, as a reference for creating the experimental speech scripts. Since the main experiment targets Japanese university students learning English, we surveyed Japanese university students about situations in which specific emotions arose, and prepared a questionnaire for this purpose. The principal investigator, Yamashita, drafted the questionnaire items, and co-investigator Fuyuno ran a pilot study with 11 students; no particular problems were found, confirming the questionnaire's usefulness. Ishii then conducted the main survey with 117 students. In the model-voice task, which aims to facilitate learners' English training, we will build model voices whose acoustic features correlate highly with paralinguistic evaluation scores, using recordings of trained native English speakers and paralinguistic labels for those recordings, and will use a voice conversion system to convert a learner's voice quality into the model voice. In this period, as preliminary work for model training, we recorded speech of English learners and native speakers in real environments, prepared the environment for acoustic analysis, and designed a voice conversion model. Specifically, for the latter, we are studying a method that applies FaderNetworks, which can extract latent features with a specified attribute of the input attenuated: features extracted from an input learner utterance with the accent attribute attenuated are passed through a decoder together with a native speaker's accent attribute, so as to synthesize speech that keeps the learner's speaker identity while carrying a native accent. A simple operation check was performed on a speaker conversion task., 20K00842
Apr. 2020 - Mar. 2025 - 音声スペクトルを対数的に表現する浅層ニューラルネットに関する研究
Toru Nakashika
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Scientific Research (C), In this study, as a new machine-learning framework suited to representing speech, we establish speech techniques that appropriately represent the data structure of log-amplitude spectra and phase, based on the restricted Boltzmann machine (RBM), a probabilistic shallow neural network. Unlike deep learning, an RBM is compact and interpretable and allows the probability distribution of the data to be assumed explicitly, so it can be expected to represent speech more appropriately. In the first year of the project, we focused on the representation of log-amplitude spectra and carried out model definition, implementation, evaluation experiments, and paper writing. Specifically, by defining an RBM whose visible units take both the amplitude spectrum and the log-amplitude spectrum, we derived that the conditional probability of the visible units given the hidden units is a gamma distribution whose two parameters are determined by the hidden units. In the evaluation experiments, the proposed model achieved more accurate encoding and decoding of speech than a conventional Gaussian RBM on amplitude spectra and a conventional Gaussian RBM on log-amplitude spectra, in terms of objective criteria based on PESQ and STOI and a subjective criterion based on a 5-point MOS for naturalness. Earlier studies had reported that a gamma distribution is better suited than a normal distribution for representing speech spectra; this work is also significant in offering a new interpretation of why: such a model can be viewed as simultaneously representing the probabilities of both the amplitude and the log-amplitude spectrum. One paper on these results was submitted to and accepted by the prestigious international journal IEEE/ACM Transactions on Audio, Speech, and Language Processing. In addition, nine domestic conference presentations and one patent application were made in connection with this research., 21K11957
Apr. 2021 - Mar. 2024 - Speech Representation Using Emotion-Speaker Controllable Probabilistic Model Based on Extended Boltzmann Distribution
Toru Nakashika
Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research, The University of Electro-Communications, Grant-in-Aid for Early-Career Scientists, Principal investigator, In speech signal processing, few methods have been established to simultaneously perform multiple different tasks such as speaker recognition and emotion recognition. In this research, we focused on the Boltzmann machine, which has the property of representing the relationships between various factors with its high potential ability, and examined the effectiveness of simultaneously realizing speaker recognition, emotion recognition, speaker conversion, and emotion conversion with it. From the experimental results, it was found that speaker recognition, emotion recognition, speaker conversion, and emotion conversion can be achieved using only a Boltzmann machine. We also revealed that the Boltzmann machine that simultaneously represents speakers and emotions outperformed the Boltzmann machine that represents either speakers or emotions in recognition and voice conversion accuracy., 18K18069
01 Apr. 2018 - 31 Mar. 2021 - 制限ボルツマンマシンの複素数拡張モデルにおける最適化アルゴリズムとMRI画像への応用
中鹿亘
中島記念国際交流財団, Principal investigator
01 Apr. 2018 - 制限ボルツマンマシンの複素数拡張と音声合成への応用
中鹿亘
電気通信普及財団, Principal investigator
01 Apr. 2017
Industrial Property Rights
- 声質変換装置、声質変換方法及びプログラム
Patent right, 大西弘太郎, 中鹿亘, 特願2021-026128, Date applied: 22 Feb. 2021, The University of Electro-Communications - 符号化装置、復号装置、パラメータ学習装置、およびプログラム
Patent right, 中鹿亘, 特願2019-150516, Date applied: 20 Aug. 2019, 国立大学法人電気通信大学 - 符号化装置、符号化方法およびプログラム
Patent right, 中鹿亘, 特願2018-31875, Date applied: 26 Feb. 2018, 国立大学法人電気通信大学 - 符号化装置、符号化方法およびプログラム
Patent right, 中鹿亘, 高木信二, 山岸順一, 特願2017-037640, Date applied: 28 Feb. 2017, 国立大学法人電気通信大学 - 声質変換装置、声質変換方法およびプログラム
Patent right, 中鹿亘, 特願2017-036109, Date applied: 28 Feb. 2017, 国立大学法人電気通信大学 - 声質変換装置、声質変換方法および声質変換プログラム
Patent right, 中鹿亘, 南泰浩, 特願2016-032488, Date applied: 23 Feb. 2016, 国立大学法人電気通信大学 - 声質変換方法および声質変換装置
Patent right, 中鹿亘, 滝口哲也, 有木康雄, 特願2015-114238, Date applied: 04 Jun. 2015, 国立大学法人神戸大学
Academic Contribution Activities
- 音学シンポジウム2023 現地世話人
Competition etc, Planning etc, 23 Jun. 2023 - 24 Jun. 2023 - 情報処理学会 第85回全国大会
Panel chair etc, 02 Mar. 2023 - 04 Mar. 2023 - 第140回音声言語情報処理研究会
Academic society etc, Planning etc, 中鹿 亘, 28 Feb. 2023 - 01 Mar. 2023 - Interspeech 2023
Peer review, 2023 - Interspeech 2022
Peer review etc, Peer review, Jul. 2022 - 音学シンポジウム2022
Academic society etc, Panel chair etc, 17 Jun. 2022, Managed the meeting as a secretary. - 第141回音声言語情報処理研究会
Academic society etc, Planning etc, 23 Mar. 2022 - 第139回音声言語情報処理研究会
Academic society etc, Planning etc, 01 Dec. 2021 - Interspeech 2021
Peer review etc, Peer review, Jul. 2021 - 音学シンポジウム2021
Academic society etc, Panel chair etc, 18 Jun. 2021, Assisted in running the meeting as vice chair of the executive committee. - 第137回音声言語情報処理研究会
Academic society etc, Planning etc, 18 Jun. 2021 - 日本音響学会2021年春季研究発表会
Academic society etc, Panel chair etc, 10 Mar. 2021, 第2会場 音声A/音声B/午後-後半B(16:00~17:45)[音声認識・合成I]