WhisperStream User Guide

Getting Started

WhisperStream is a Windows application that provides powerful audio transcription and text-to-speech capabilities. When you first launch the application, you'll see the main menu with options for different operations.

Note: The application requires Python, Whisper, and FFmpeg to be installed. If these are not detected, the Diagnostics screen will help you install them.

Main Menu

The main menu is your starting point for all operations in WhisperStream. It provides quick access to all major features:

Single File Speech-to-Text: Transcribe a single audio file to text
Batch Speech-to-Text: Process multiple audio files in a batch operation
Single File Text-to-Speech: Convert text to speech for a single file
Batch Text-to-Speech: Convert multiple text files to speech
Exit: Close the application

Simply click on any button to open the corresponding form and begin your work.

Single File Speech-to-Text

This screen allows you to transcribe a single audio file into text using Whisper AI models.

Using Single File Speech-to-Text

Select an Audio File: Click the "Browse" button to select an audio file. Supported formats include WAV, MP3, M4A, OGG, FLAC, and MP4.
Choose a Model: Select a Whisper model from the dropdown:
- Tiny: Fastest processing, least accurate
- Base: Good balance of speed and accuracy (recommended for most users)
- Small: Better accuracy, moderate processing time
- Medium: High accuracy, slower processing
- Large: Best accuracy, slowest processing
Transcribe: Click the "Transcribe" button to start the transcription process. The status bar will show progress updates.
Review Results: Once complete, the transcribed text will appear in the output area.
Save Transcript: Click "Save" to save your transcript in various formats:
- Plain Text (.txt)
- SubRip Subtitles (.srt)
- WebVTT Subtitles (.vtt)
- CSV Data (.csv)
- Timestamped Text (.txt)

Menu Options

File Menu

Open Audio File: Browse and select an audio file
Load Transcript: Load an existing transcript file
Save Transcript: Save the current transcript
Export Transcript: Export as plain text without timing data
Exit: Close the form

Tools Menu

Text-to-Speech: Access TTS features (Single or Batch)
Batch Processing: Open the batch transcription form
Media Player: Open the media player for synchronized playback
Model Manager: Manage Whisper models
Settings: Configure application settings

Tip: The application remembers your last selected model, so you don't need to select it every time if you're processing multiple files with the same settings.

Batch Speech-to-Text

The Batch Speech-to-Text form allows you to process multiple audio files simultaneously, making it ideal for transcribing large collections of audio files.

Using Batch Speech-to-Text

Add Files: Click "Add Files" to select multiple audio files at once. All selected files will be added to the job queue.
Configure Parallel Processing: Set the maximum number of parallel jobs (1-8). Higher values process faster but use more system resources.
Start Processing: Click "Start Processing" to begin transcribing all queued files. The status bar shows overall progress.
Monitor Progress: The job grid displays:
- File name
- Status (Queued, Processing, Completed, Failed, Cancelled)
- Progress percentage
- File size
- Status message
- Duration
- Error messages (if any)
- Transcript text preview
Manage Jobs:
- Cancel Selected: Cancel selected jobs
- Retry Selected: Retry failed or cancelled jobs
- Remove Selected: Remove jobs from the queue
- Remove Completed: Clear completed jobs from the list
- Clear All: Remove all jobs (when not processing)
Export Results: Click "Export Results" to save all completed transcripts to a folder. You can choose to remove filler words during export.

Job Grid Features

Right-Click Context Menu

Right-click on any job in the grid to access:

Copy Transcript Text: Copy the transcript to clipboard
Copy File Path: Copy the source file path
Open Transcript in Notepad: View the transcript in Notepad

Keyboard Shortcuts

Ctrl+C: Copy transcript text of selected job
Delete: Remove selected jobs (when not processing)

Status Bar Information

The status bar at the bottom shows:

Total jobs
Queued jobs
Processing jobs
Completed jobs
Failed jobs
Overall progress percentage
Estimated time remaining

Note: Jobs are color-coded in the grid: Blue (Processing), Green (Completed), Red (Failed), Gray (Cancelled), Yellow (Retrying).

Tip: You can stop processing at any time using the "Stop Processing" button. Queued jobs will remain in the queue and can be restarted later.

Single File Text-to-Speech

Convert text into natural-sounding speech using voice cloning technology. This feature uses Chatterbox-TTS for high-quality voice synthesis.

Using Single File Text-to-Speech

Enter Text: Type or paste the text you want to convert to speech in the input area.
Select Voice Prompt (Optional):
- Click "Browse" to select a voice sample file (WAV format)
- The voice sample will be used to clone the voice characteristics
- You can also enter a text description of the desired voice
Choose Output Format: Select the audio format for the output file (WAV, MP3, etc.)
Generate Audio: Click "Generate" to start the TTS process. A save dialog will appear to choose where to save the output file.
Play Audio: Once generation is complete, click "Play Audio" to preview the generated speech. Click again to stop playback.

Additional Features

Test Installation

Click "Test Installation" to verify that Chatterbox-TTS is properly installed and configured. This helps diagnose any setup issues.

View Log

Click "View Log" to open the TTS operations log file, which contains detailed information about TTS operations and any errors that occurred.

Warning: If you receive an error that Chatterbox-TTS is not available, you can open the Model Manager from the error dialog to install it.

Tip: For best results, use a clear voice sample (3-10 seconds) with minimal background noise. The voice sample should contain natural speech in the language you want to generate.

Single File Text-to-Speech interface

Batch Text-to-Speech

Process multiple text files or text entries simultaneously, converting them all to speech audio files. This is ideal for creating large collections of audio content.

Setting Up Batch TTS

Select Output Directory: Click "Browse" next to the output directory field to choose where generated audio files will be saved.
Choose Output Format: Select the audio format (WAV, MP3, etc.) for all generated files.
Select Default Voice: Choose a default voice from the dropdown. This voice will be used for all jobs unless overridden.
Open Voices Folder: Click "Open Voices Folder" to access the directory where voice samples are stored. You can organize voices in subfolders.

Adding Jobs

Add Text Files

Click "Add Files" to select multiple text files. Each file becomes one TTS job. The entire file content is used as the text to convert.

Add Manual Entry

Click "Add Manual" to enter text manually. You can enter multiple paragraphs separated by blank lines, and each paragraph becomes a separate job.

Add Pasted Text

Click "Add Pasted" to paste multiple text entries. You can use two formats:

Simple format: One text entry per line
With voice specification: Use text|voice format to specify a different voice for each line

Example:

Hello world|subfolder/voice1
This is another line|voice2
This line uses the default voice

Job Management

The job grid displays:

Text: Preview of the text content
Voice: Voice file name being used
Status: Current processing status
Started: When processing began
Completed: When processing finished
Output File: Path to the generated audio file

Processing Jobs

Start Processing: Click "Start" to begin processing all queued jobs. Jobs are processed sequentially.
Monitor Progress: Watch the progress bar and status messages to track processing.
Cancel Processing: Click "Cancel" to stop processing. Current job will finish, but queued jobs will remain.

Managing Results

Clear Completed Jobs

Click "Clear" to remove all completed and failed jobs from the list, keeping only queued and processing jobs.

Export Results

Click "Export" to create a ZIP file containing all completed audio files. This makes it easy to share or archive your generated content.

Open Output Folder

Click "Open Output Folder" to open Windows Explorer at the output directory, making it easy to access your generated files.

Delete Jobs

Select one or more jobs and press Delete or right-click and select "Delete Selected Jobs" to remove them from the queue.

Note: Jobs are automatically saved to disk and will persist between application sessions. You can close the form and return later to see your job history.

Tip: Voice files can be organized in subfolders within the voices directory. The dropdown will show the relative path (e.g., "subfolder/voice1") for easy organization.

Batch Text-to-Speech interface

Media Player

The Media Player form provides synchronized audio playback with transcript editing, making it perfect for reviewing and editing transcriptions while listening to the original audio.

Opening the Media Player

You can open the Media Player from:

The Tools menu in the Single File Speech-to-Text form
Directly from the main menu (if available)

Using the Media Player

Open Audio File: Use File → Open Audio File or the toolbar button to load an audio file (WAV, MP3, M4A, FLAC, OGG, AAC).
Load Transcript (Optional): Use File → Load Transcript to load an existing transcript file (JSON or TXT format).
Playback Controls: The waveform display shows the audio waveform. Click on the waveform to seek to specific positions.
Edit Transcript: The transcript editor allows you to edit the text while the audio plays. Changes are synchronized with playback position.
Save Transcript: Use File → Save Transcript to save your edited transcript with timing information.
Export Transcript: Use File → Export Transcript to export as plain text without timing data.

Menu Options

File Menu

Open Audio File: Load an audio file for playback
Load Transcript: Load an existing transcript file
Save Transcript: Save the current transcript with timing data
Export Transcript: Export as plain text
Exit: Close the media player

Edit Menu

Find: Search for text within the transcript
Go to Time: Jump to a specific time position in the audio
Font: Change the font family and size for the transcript editor
Colors: Customize text colors and highlighting options

View Menu

Zoom In: Zoom in on the audio waveform for more detail
Zoom Out: Zoom out to see more of the audio waveform
Auto-scroll: Toggle automatic scrolling of the transcript to follow playback position

Features

Word-Level Synchronization

When enabled in settings, the transcript editor highlights words as they are spoken, providing precise synchronization between audio and text.

Spell Checking

Spell checking can be enabled in settings to help catch transcription errors while editing.

Confidence Colors

When enabled, text is color-coded based on transcription confidence levels, making it easy to identify potentially inaccurate sections.

Tip: You can resize the splitter between the waveform and transcript editor by dragging it to adjust the layout to your preference.

Settings

The Settings form allows you to configure all aspects of WhisperStream. Access it from the Tools menu in any form.

General Tab

Setting	Description
Output Folder	Default folder where transcribed files will be saved
Default Model	Whisper model to use by default (tiny, base, small, medium, large)
Language	Default language for the user interface
Max Recent Files	Maximum number of recently used files to remember (1-50)
Check for updates on startup	Automatically check for application updates when starting
Start with Windows	Launch WhisperStream automatically when Windows starts
Minimize to system tray	Minimize to system tray instead of taskbar

Transcription Tab

Setting	Description
Default Task	Default transcription task (transcribe or translate)
Default Language	Default language for transcription (auto-detect or specific language)
Enable speaker diarization	Identify different speakers in the audio
Remove filler words	Automatically remove filler words like "um", "uh", "like" from transcripts
Default Format	Default output format for transcription files

Audio Tab

Setting	Description
Default Volume	Default volume level for audio playback (0-100%)
Remember volume between sessions	Save and restore the last used volume level

UI Tab

Setting	Description
Theme	Visual theme (Auto, Light, Dark)
Font Family	Font for displaying transcribed text
Font Size	Font size for transcript text (8-72 points)
Show confidence colors	Color-code text based on transcription confidence levels
Enable spell check	Enable spell checking for transcribed text
Spell check language	Language for spell checking

Privacy Tab

Telemetry Settings

Configure anonymous usage data collection to help improve WhisperStream:

Enable telemetry: Master switch for all telemetry features
Send crash reports: Automatically send crash reports when errors occur
Send usage statistics: Send anonymous usage statistics
Send performance metrics: Send performance data to help optimize the application
Send interval: How often to collect telemetry data (1-168 hours)

Anonymous ID: A unique identifier used for telemetry. You can reset it to generate a new ID.

View Privacy Policy: Click the button to view the complete privacy policy online.

Advanced Tab

Custom Paths & URLs

Python Path: Custom path to Python executable (if not in system PATH)
FFmpeg Path: Custom path to FFmpeg executable (if not in system PATH)
Whisper Args: Additional command line arguments for Whisper
Python URL: Custom URL for Python installer download
FFmpeg URL: Custom URL for FFmpeg download
Use system runtime only: Use system-installed Python/FFmpeg instead of bundled versions

Logging

Enable debug logging: Create detailed log files for troubleshooting
Log path: Directory where log files are stored
Max log size: Maximum size of individual log files (1-100 MB)
Log retention: Number of days to keep log files (1-365 days)

Performance

Enable GPU: Enable GPU acceleration for transcription (if available)
GPU device: Which GPU to use for acceleration
Memory limit: Maximum memory usage for transcription (512-8192 MB)
Temp cleanup: Automatically delete temporary files older than specified days (1-30)

Saving Settings

OK: Save all settings and close the dialog
Cancel: Discard changes and close without saving
Apply: Apply settings without closing the dialog
Reset: Reset all settings to their default values

Note: Some settings require restarting the application to take full effect. The application will prompt you if a restart is needed.

Diagnostics

The Diagnostics form helps identify and resolve setup issues with required components like Python, Whisper, and FFmpeg.

Understanding Diagnostics

When you first launch WhisperStream or if there are setup issues, the Diagnostics form will display:

A list of any missing or improperly configured components
Recommended installation steps
Buttons to automatically install missing components

Available Actions

Install FFmpeg

If FFmpeg is not detected, click "Install FFmpeg" to download and install it automatically. A progress window will show the installation process with command-line output.

Cleanup

Click "Cleanup" to remove temporary files from failed downloads. This can help resolve installation issues and free up disk space.

Installation Progress

When installing components, a progress window displays:

Real-time command-line output
Progress percentage
Status messages
Error information (if any)

You can save the installation log for troubleshooting by clicking "Save Log" in the progress window.

Warning: Do not close the progress window during installation, as this may interrupt the process and leave your system in an inconsistent state.

Tip: If automatic installation fails, you can manually install the required components and configure their paths in the Advanced settings tab.

Tips & Tricks

Getting the Best Transcription Results

Audio Quality: Use clear audio with minimal background noise for best results
Model Selection: Start with "base" model for most use cases. Use "large" only when accuracy is critical and processing time is not a concern
Language Settings: Specify the language if you know it, rather than using auto-detect, for better accuracy
Speaker Diarization: Enable this feature when transcribing conversations with multiple speakers

Optimizing Performance

Batch Processing: Adjust the parallel jobs setting based on your system's capabilities. More parallel jobs = faster processing but higher CPU/memory usage
GPU Acceleration: If you have a compatible GPU, enable GPU acceleration in Advanced settings for significantly faster transcription
Memory Limits: Adjust memory limits in Advanced settings if you encounter out-of-memory errors

Text-to-Speech Best Practices

Voice Samples: Use 3-10 second voice samples with clear, natural speech
Text Formatting: Use proper punctuation and capitalization for better speech synthesis
Batch Processing: For large projects, use Batch TTS to process multiple files efficiently
Voice Organization: Organize voice files in subfolders within the voices directory for easy management

Workflow Tips

Save Frequently: Always save your transcripts after transcription to avoid losing work
Use Media Player: Use the Media Player to review and edit transcripts while listening to the original audio
Export Formats: Export in multiple formats (SRT for video editing, TXT for text processing, etc.)
Batch Operations: Use batch processing for large collections of files to save time

Troubleshooting

Common Issues

Transcription fails: Check that Python, Whisper, and FFmpeg are properly installed. Use Diagnostics to verify.
TTS not working: Verify Chatterbox-TTS installation using the "Test Installation" button
Slow processing: Try reducing parallel jobs, or enable GPU acceleration if available
Out of memory: Reduce memory limits or process fewer files in parallel
File access errors: Ensure audio files are not being used by another application

Getting Help

Check the log files in the logs directory for detailed error information
Use the Diagnostics form to identify setup issues
Enable debug logging in Advanced settings for more detailed troubleshooting information
View the TTS operations log for text-to-speech specific issues

Conclusion

This guide covers all the major features and screens in WhisperStream. The application is designed to be intuitive and user-friendly, with tooltips and help text available throughout the interface.

For the best experience:

Take time to explore the Settings to customize WhisperStream to your needs
Use batch processing for efficiency when working with multiple files
Take advantage of the Media Player for reviewing and editing transcripts
Keep your components (Python, Whisper, FFmpeg) up to date

Happy transcribing! We hope WhisperStream helps you efficiently convert between speech and text for all your projects.

WhisperStream User Guide

Table of Contents

Getting Started

Main Menu

Single File Speech-to-Text

Using Single File Speech-to-Text

Menu Options

File Menu

Tools Menu

Batch Speech-to-Text

Using Batch Speech-to-Text

Job Grid Features

Right-Click Context Menu

Keyboard Shortcuts

Status Bar Information

Single File Text-to-Speech

Using Single File Text-to-Speech

Additional Features

Test Installation

View Log

Batch Text-to-Speech

Setting Up Batch TTS

Adding Jobs

Add Text Files

Add Manual Entry

Add Pasted Text

Job Management

Processing Jobs

Managing Results

Clear Completed Jobs

Export Results

Open Output Folder

Delete Jobs

Media Player

Opening the Media Player

Using the Media Player

Menu Options

File Menu

Edit Menu

View Menu

Features

Word-Level Synchronization

Spell Checking

Confidence Colors

Settings

General Tab

Transcription Tab

Audio Tab

UI Tab

Privacy Tab

Telemetry Settings

Advanced Tab

Custom Paths & URLs

Logging

Performance

Saving Settings

Diagnostics

Understanding Diagnostics

Available Actions

Install FFmpeg

Cleanup

Installation Progress

Tips & Tricks

Getting the Best Transcription Results

Optimizing Performance

Text-to-Speech Best Practices

Workflow Tips

Troubleshooting

Common Issues

Getting Help

Conclusion