Python – encoding mp3 from an audio stream of pyTTS

encoder, mp3, python, text-to-speech

I am working on text-to-speech, transforming text into mp3 audio files, using Python 2.5.

I use pyTTS as the Python text-to-speech module to transform text into .wav audio files (pyTTS cannot encode directly to mp3). After that, I encode these wav files to mp3 using the lame command-line encoder.
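
For reference, the wav-to-mp3 step currently looks roughly like this (simplified sketch; the filenames are just placeholders and lame is assumed to be on the PATH):

    import subprocess

    def wav_to_mp3(wav_path, mp3_path, bitrate=128):
        # call the lame command line encoder on an existing wav file
        subprocess.check_call(['lame', '-b', str(bitrate), wav_path, mp3_path])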

Now, the problem is that I would like to insert, at a particular point of an mp3 audio file (between two words), an external sound file (such as a warning sound) or, if possible, a generated warning sound.

My questions are:

1) I have seen that pyTTS can save the audio stream to a file or to a memory stream, using two functions:

tts.SpeakToWave(file, text) or tts.SpeakToMemory(text)

By exploiting the tts.SpeakToMemory(text) function and using PyMedia, I have been able to save an mp3 directly, but the mp3 file (when played back) sounds incomprehensible, like Donald Duck! 🙂
Here is a snippet of the code:

    import pymedia.audio.acodec as acodec   # PyMedia audio codec module

    params = {'id': acodec.getCodecID('mp3'), 'bitrate': 128000,
              'sample_rate': 44100, 'ext': 'mp3', 'channels': 2}

    # speak the text into a memory stream and grab the raw audio bytes
    m = tts.SpeakToMemory(p.Text)
    soundBytes = m.GetData()

    enc = acodec.Encoder(params)

    # encode the raw bytes to mp3 frames and write them to a file
    frames = enc.encode(soundBytes)
    f = file("test.mp3", 'wb')
    for frame in frames:
        f.write(frame)
    f.close()

I cannot understand where the problem is.
If this approach worked correctly, it would let me skip the wav-file transformation step.

2) As a second problem, I need to concatenate the mp3 audio file (obtained from the text-to-speech module) with a particular warning sound.

Obviously, it would be great if I could concatenate the audio memory stream of the text (coming out of the text-to-speech module) with the stream of a warning sound, before encoding the whole audio memory stream into a single mp3 file.

I have also seen that the tkSnack library can concatenate audio, but it cannot write mp3 files.

I hope I have been clear. 🙂

Many thanks for your answers to my questions.

Giulio

Best Solution

I don't think PyTTS produces default PCM data (i.e. 44100 Hz, stereo, 16-bit). You should check the format like this:

memStream = tts.SpeakToMemory("some text")
format = memStream.Format.GetWaveFormatEx()

...and hand it over correctly to acodec. For that, you can use the attributes format.Channels, format.BitsPerSample and format.SamplesPerSec.
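
Untested, but put together it could look roughly like this (assuming the pyTTS output is 16-bit PCM; if format.BitsPerSample is not 16, you would have to convert the samples first):

    import pymedia.audio.acodec as acodec

    memStream = tts.SpeakToMemory("some text")
    format = memStream.Format.GetWaveFormatEx()

    # build the encoder parameters from the actual wave format instead of
    # assuming 44100 Hz / stereo
    params = {'id': acodec.getCodecID('mp3'),
              'bitrate': 128000,
              'sample_rate': format.SamplesPerSec,
              'channels': format.Channels,
              'ext': 'mp3'}

    enc = acodec.Encoder(params)

    f = open('test.mp3', 'wb')
    for frame in enc.encode(memStream.GetData()):
        f.write(frame)
    f.close()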

As to your second question, if the sounds are in the same format, you should be able to simply pass them all to enc.encode, one after another.
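
Something along these lines (a rough sketch; firstPartBytes, warningBytes and secondPartBytes are just placeholder names for raw PCM chunks that must all share the same sample rate, channel count and sample width, and enc is the encoder from the snippet above):

    # encode the speech, the warning sound and the rest of the speech
    # back to back into one mp3 file
    f = open('speech_with_warning.mp3', 'wb')
    for chunk in (firstPartBytes, warningBytes, secondPartBytes):
        for frame in enc.encode(chunk):
            f.write(frame)
    f.close()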