JTrans: Java software for text-to-speech alignment
What is this all about?
"Text-to-speech" alignment is a speech processing task related to the
well-known "speech recognition" task. Unlike speech recognition,
however, text-to-speech alignment assumes that the spoken text is
already known. All that remains is to align the words with the speech
signal, i.e., to find the millisecond at which every word starts in the
audio WAVE file and the millisecond at which it ends. This is very
useful, for example, to synchronize the lip animation of a cartoon
character with an actor's recording, to build "karaoke"-like
applications, or to quickly find the start of a target utterance in a
video database.
Alignment is easier than speech recognition thanks to the additional
information provided by the available (approximate) transcription. This
makes it more precise and more efficient than speech recognition on
many corpora where the sound quality is too poor for speech recognition
to succeed.
Have a look at the online applet!
If Java is installed, you can go to the demo to see a limited applet version of JTrans.
How to get it
Download the JTrans archive and unzip it.
You can then execute the file dist/jtrans.jar and test JTrans by loading the WAV file "culture.wav" and
the text file "culture.txt". Press the "Play" button to check that audio playback is
working; press the same button again to stop playback. Then use the menu "edit / parse text standard"
to analyze the text (you should see punctuation marks highlighted in orange).
Finally, press the "AutoAlign" button to start automatic alignment. After a few seconds, the aligned text becomes
underlined, and you can press the "Play" button again to start playback in "karaoke" mode.
The following example files are also included in the zip file: culture.wav and culture.txt.
If you want to develop with JTrans, you can get the source code from here:
git clone http://rapsodis.loria.fr/jtrans2.git jtrans2
(This MUST be done before unzipping jtrans2full.tgz. Otherwise, you may
first have to remove or rename the jtrans2 directory that was created
when unzipping jtrans2full.tgz, and then copy/untar all the files from
jtrans2full.tgz back into the jtrans2 directory created by Git.)
If you are interested in JTrans, please contact me: cerisara AT loria DOT fr
JTrans main features
- Depends on the Sphinx4 library for automatic alignment
- Optionally depends on the WEKA library for automatic
phonetisation, and on the TRITONUS and JAVALAYER libraries for MP3
support
- Integrates 3 levels of phonetisation: dictionary-based,
rule-based and decision-tree-based
- Exports to the Praat format
Tutorial: first steps for using JTrans
Setting up a project
Any project in JTrans requires two types of file: a text file and a
WAVE file.
The text file can be loaded with the "File - load text" menu, but the
recommended method is to open the text file in another editor (such as
Word, WordPad or Gedit) and copy/paste the text into the JTrans
window. The WAVE file must be loaded with the menu "File - load wav".
In theory, any type of WAVE file is supported (and even MP3). In case
of trouble, first try converting your speech file to an uncompressed
WAVE format (mono, 16 kHz, signed samples).
When the WAVE file is correctly loaded, you should see a spectrogram of
the speech stream in the lower panel of JTrans. When you click on the
"PLAY" button, you should also hear the speech. If you see the
spectrogram but cannot hear the speech, this may be due to an audio
driver issue. See the entry "I can see the spectrogram but I cannot
hear the sound" below to solve it.
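In case of trouble, it can help to check whether a file already matches the recommended format before converting it. The sketch below does this with the standard javax.sound.sampled API. The class name WavFormatCheck and the method isRecommended are made up for this example and are not part of JTrans; the sample size is not checked.

```java
import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

// Illustrative helper, not part of JTrans.
final class WavFormatCheck {

    /** True if the format is uncompressed signed PCM, mono, 16 kHz. */
    static boolean isRecommended(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getChannels() == 1
                && Math.abs(f.getSampleRate() - 16000f) < 0.5f;
    }

    public static void main(String[] args) throws Exception {
        // The path of the file to inspect is given on the command line.
        AudioFormat f = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
        System.out.println(f + (isRecommended(f) ? " : OK" : " : consider converting"));
    }
}
```

If the check fails, converting the file with any audio editor before loading it in JTrans is the simplest route.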
Analyzing the text
The second step consists in analyzing the text, i.e., identifying in
the text what is pronounced vs. what is not pronounced (punctuation
marks, comments, speaker identities...). This is done with the menu
"Edit - parse text standard". Once this is done, you should see the
unpronounced text highlighted in different colors. Note that the text
can no longer be modified directly after it has been analyzed. You can
still edit it later with the menu "edit - edit text", but this may
destroy all the alignments obtained so far.
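As a rough illustration of this step, the sketch below separates words (pronounced) from punctuation marks (not pronounced) with a regular expression. It is not the parser used by JTrans: the class name TextParseSketch, the token pattern and the "W:" / "P:" prefixes are assumptions made for the example, and comments or speaker identities are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch, not the JTrans parser.
final class TextParseSketch {

    // Group 1: a word (letters/digits, with inner apostrophes or hyphens).
    // Group 2: any single character that is neither a space, a letter nor a digit.
    private static final Pattern TOKEN = Pattern.compile(
            "([\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*)|([^\\s\\p{L}\\p{N}])");

    /** Returns the tokens, prefixed with "W:" (pronounced) or "P:" (not pronounced). */
    static List<String> parse(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group(1) != null ? "W:" + m.group(1) : "P:" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("Hello, world!"));
    }
}
```

For the input "Hello, world!", the two words are tagged "W:" and the comma and exclamation mark are tagged "P:".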
Aligning the text and speech
Once the text is analyzed, you can launch the automatic alignment
process: just click on "Auto-Align" and wait for a few minutes. The
first time you click on this button, it takes about 1 to 2 minutes
to load all the models. Subsequent runs are faster. You can stop this
process at any time by clicking on the "Stop it!" button.
As soon as some words are aligned, they are underlined in the JTrans
window, so you can see the progress of the automatic alignment.
Checking the alignment
You can check the alignment while the aligner is running, as soon as
you see some underlined text. You can then press the "Play" button,
which starts playback from the last position (or from the beginning)
in "karaoke" mode, i.e., the words that you are currently hearing
should be highlighted in grey. If the alignment looks fine, you can go
faster and skip some sentences by clicking on any underlined word: the
player then immediately stops, positions itself on the corresponding
speech segment, and resumes playback from there. You can thus navigate
through the corpus by clicking on aligned words. Note that as soon as
you click on a word, the spectrogram of the aligned speech is shown in
the lower panel, along with a timeline showing the word boundaries. If
you see that the alignment goes wrong, you can guide the automatic
aligner as explained in the next section.
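The "karaoke" highlighting described above amounts to finding which aligned word contains the current playback time. A minimal sketch, assuming the alignment is available as two arrays of word start and end times in milliseconds, sorted by time (KaraokeLookup and wordAt are hypothetical names, not JTrans code):

```java
import java.util.Arrays;

// Illustrative sketch, not JTrans code.
final class KaraokeLookup {

    /**
     * startMs[i] and endMs[i] are the boundaries of word i, in milliseconds,
     * sorted by time with distinct start times.
     * Returns the index of the word spoken at time tMs, or -1 if there is none.
     */
    static int wordAt(long[] startMs, long[] endMs, long tMs) {
        int i = Arrays.binarySearch(startMs, tMs);
        if (i < 0) {
            // Not an exact start time: take the last word starting before tMs.
            i = -i - 2;
        }
        return (i >= 0 && tMs < endMs[i]) ? i : -1;
    }

    public static void main(String[] args) {
        long[] start = {0, 500, 1200};
        long[] end = {450, 1200, 1800};
        System.out.println(wordAt(start, end, 600));
    }
}
```

A binary search keeps the lookup fast even for long recordings; a time that falls in a gap between two words simply returns -1, so nothing is highlighted.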
Correcting an erroneous alignment
First, you may want to stop the aligner ("Stop it!" button), and
possibly clear the existing alignment after a previous position that
is correct: to do so, click on a word that is correctly aligned, then
use the menu "process - clear alignment from selected word". You now
want to manually define an "anchor" that associates a position in the
audio stream with a position in the text. This can be done in several
ways:
- (i) You can play from the last aligned word, wait until you have
heard about a dozen words, and then Ctrl-click (holding the "Ctrl" key
while pressing the left mouse button) on a word in the text panel as
soon as you hear it.
- (ii) Another, more accurate option is to stop the player and use
the spectrogram panel instead: you can move through the audio file with
the "+1 sec." and "-1 sec." buttons; clicking at any position on the
spectrogram then plays the 1 second of speech that precedes this
position. You can thus try several positions until you hear a whole
word. The last audio position clicked is always kept in memory, even
though no "visual bar" is shown for now. You can then Ctrl-click on a
word in the text panel to associate this word with the last audio
position clicked.
- (iii) Same approach as in (ii), but Ctrl-click with the right mouse
button instead, which produces a basic "equally-spaced" alignment
from the last aligned word to the anchor.
With both (i) and (ii), as soon as an anchor is defined, the background
automatic process is run to find the best possible alignment between
the last aligned word and the anchor. But in some cases, it may be
better to just skip a very noisy audio segment with a right Ctrl-click.
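The "equally-spaced" alignment produced by a right Ctrl-click can be pictured as splitting the time span between the last aligned word and the anchor into equal parts, one per word. The sketch below shows this idea. The names EqualSpacing and boundaries, and the exact treatment of the two end points, are assumptions made for the example and may differ from what JTrans does.

```java
// Illustrative sketch, not JTrans code.
final class EqualSpacing {

    /**
     * Splits the interval [fromMs, toMs] into nWords segments of equal duration.
     * Returns nWords + 1 boundaries: word k spans bounds[k] to bounds[k + 1].
     */
    static long[] boundaries(long fromMs, long toMs, int nWords) {
        if (nWords <= 0 || toMs < fromMs) {
            throw new IllegalArgumentException("need nWords > 0 and toMs >= fromMs");
        }
        long[] bounds = new long[nWords + 1];
        for (int k = 0; k <= nWords; k++) {
            // Integer arithmetic: boundaries are rounded down to the millisecond.
            bounds[k] = fromMs + (toMs - fromMs) * k / nWords;
        }
        return bounds;
    }

    public static void main(String[] args) {
        // 4 words between the end of the last aligned word (1000 ms)
        // and the anchor (2000 ms).
        System.out.println(java.util.Arrays.toString(boundaries(1000, 2000, 4)));
    }
}
```

Such an alignment ignores the audio content entirely, which is why it suits very noisy segments where the automatic aligner would fail anyway.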
When an anchor is defined, you can re-launch the automatic alignment
process from this position with the "Auto-Align" button.
Accents are replaced by weird characters
This is an encoding issue. A good way to prevent it is to copy/paste
the text from another text editor (Word, WordPad...) into JTrans.
Otherwise, JTrans works best with UTF-8 encoded text in a UTF-8
environment.
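To see where the weird characters come from, the sketch below reproduces the effect in plain Java: UTF-8 bytes decoded with the wrong charset turn "é" into "Ã©", while decoding them explicitly as UTF-8 gives the text back. EncodingDemo and its methods are illustrative names only, and whether this is exactly the mismatch occurring in a given JTrans setup is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch, not JTrans code.
final class EncodingDemo {

    /** Decodes bytes explicitly as UTF-8 instead of relying on the platform default. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Reproduces the problem: UTF-8 bytes wrongly decoded as ISO-8859-1. */
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9"; // the French word for "summer"
        System.out.println(misdecode(word));
        System.out.println(decodeUtf8(word.getBytes(StandardCharsets.UTF_8)));
    }
}
```

When reading a text file in your own Java code, passing the charset explicitly avoids depending on the platform default.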
After parsing the text, the characters that are highlighted in yellow are not the punctuation marks!
This is an "end-of-line" issue: on Linux, end-of-lines are encoded with
a single byte "\n", while on Windows, they are encoded with two bytes
"\r\n". Again, you may want to try the copy/paste method as above, or
you may want to first save the text with "Unix-like" end-of-line
encoding (most editors, like Word, offer this option).
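If you prefer to fix the file programmatically, converting the end-of-lines takes a couple of string replacements. A minimal sketch (EolNormalizer and toUnix are hypothetical names, not part of JTrans):

```java
// Illustrative sketch, not JTrans code.
final class EolNormalizer {

    /** Converts Windows ("\r\n") and lone "\r" end-of-lines to Unix "\n". */
    static String toUnix(String text) {
        // Replace the two-character sequence first, so that it yields a single "\n".
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windowsText = "first line\r\nsecond line\r\n";
        System.out.println(toUnix(windowsText).contains("\r")); // prints false
    }
}
```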
I can see the spectrogram but I cannot hear the sound
This might be a sound driver issue. You should try closing all other
programs, to prevent concurrent access to your sound card, and then
open only JTrans again. If this does not work, you may want to select,
in JTrans, an alternative sound card driver: this can be done with
the menu "option - audio mixers", which brings up a new menu with all
possible sound drivers. If none works, you can try launching JTrans
with another Java virtual machine: in particular, OpenJDK is known
to perform better than the Sun JDK regarding sound management,
especially with PulseAudio on Linux.
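To check which sound drivers the Java virtual machine itself can see, independently of JTrans, you can list the available mixers with the standard javax.sound.sampled API. This is only a diagnostic sketch: ListMixers is a made-up name, and it is an assumption that the "audio mixers" menu of JTrans is built from the same list.

```java
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Mixer;

// Illustrative diagnostic, not JTrans code.
final class ListMixers {

    /** Returns the names of all audio mixers visible to this JVM. */
    static String[] mixerNames() {
        Mixer.Info[] infos = AudioSystem.getMixerInfo();
        String[] names = new String[infos.length];
        for (int i = 0; i < infos.length; i++) {
            names[i] = infos[i].getName();
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : mixerNames()) {
            System.out.println(name);
        }
    }
}
```

Running it with two different Java virtual machines shows whether one of them detects more devices than the other.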