Overview
LaoZhang API provides audio processing capabilities, including Speech-to-Text (STT) and Text-to-Speech (TTS). Because it follows the unified OpenAI API format, you can build meeting transcription, subtitle generation, voice assistants, audiobook creation, and more with the same requests you would send to OpenAI.
🎙️ Intelligent Audio Processing: multi-language audio transcription, HD voice synthesis, and real-time streaming let AI truly “hear” and “speak” your content.
🌟 Key Features
- 🎯 Multiple Models: GPT-4o Transcribe, Whisper, TTS-1/HD and other professional audio models
- 🌍 Multi-language: Support for 50+ languages in audio transcription
- 🎤 High Quality: Standard and HD quality voice synthesis
- 🗣️ Multiple Voices: 6 different voice options available
- ⚡ Fast Response: High-performance processing with sub-second results
- 💰 Flexible Pricing: Pay per token or duration, cost-effective
📋 Supported Audio Models
Speech-to-Text (Transcription)
| Model Name | Model ID | Billing | Features |
|---|---|---|---|
| GPT-4o Transcribe ⭐ | gpt-4o-transcribe | Token | High accuracy, multi-language |
| GPT-4o Mini Transcribe | gpt-4o-mini-transcribe | Token | Fast and efficient, low cost |
| Whisper v1 | whisper-1 | Duration (seconds) | OpenAI Whisper model |
Text-to-Speech (TTS)
| Model Name | Model ID | Quality | Features |
|---|---|---|---|
| TTS-1 ⭐ | tts-1 | Standard | Fast generation, real-time apps |
| TTS-1 HD | tts-1-hd | HD Quality | Better audio, content creation |
Available Voice Options
- alloy - Neutral, clear and natural
- echo - Male voice, steady and strong
- fable - British accent, elegant
- onyx - Deep male voice, news/broadcast
- nova - Female voice, warm and friendly
- shimmer - Soft female voice, narration
🎙️ Speech-to-Text
1. Basic Example - cURL
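A cURL transcription call is a multipart/form-data POST carrying a `model` field and a `file` field. The sketch below builds that same request in standard-library Python so its structure is visible. It assumes the OpenAI-style path `/audio/transcriptions` under a base URL ending in `/v1`; the base URL and key are placeholders to replace with the values from your LaoZhang console.

```python
import os
import urllib.request
import uuid


def build_transcription_request(base_url, api_key, filename, audio_bytes,
                                model="gpt-4o-transcribe"):
    """Return (url, headers, body) for a multipart/form-data transcription upload.

    Equivalent in shape to `curl -F model=... -F file=@...`. base_url is a
    placeholder (assumption): the OpenAI-compatible URL ending in /v1.
    """
    boundary = uuid.uuid4().hex
    # Plain form field: the model ID
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode("utf-8")
    # File field header; the raw audio bytes follow it
    file_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; '
        f'filename="{os.path.basename(filename)}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"--{boundary}--\r\n".encode("utf-8")
    body = model_part + file_header + audio_bytes + b"\r\n" + closing
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(body)),
    }
    url = base_url.rstrip("/") + "/audio/transcriptions"
    return url, headers, body


def send_transcription_request(url, headers, body, timeout=120):
    """POST the prepared request and return the raw response text (JSON by default)."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Usage: read the file with `open("meeting.mp3", "rb").read()`, pass it with `"https://<your-gateway>/v1"` and your key to `build_transcription_request`, then hand the result to `send_transcription_request`.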
2. Python Example - Using OpenAI SDK
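A minimal sketch with the official `openai` Python package (v1.x). The environment variable names `LAOZHANG_API_KEY` and `LAOZHANG_BASE_URL` are hypothetical; the SDK only needs an API key and the gateway's base URL. `transcribe` takes the client as a parameter, so it can also be exercised with a stand-in object.

```python
import os


def make_client():
    """Create an OpenAI SDK client pointed at the gateway.

    LAOZHANG_API_KEY / LAOZHANG_BASE_URL are hypothetical variable names;
    use the key and base URL from your LaoZhang console.
    """
    from openai import OpenAI  # imported lazily so transcribe() works without the SDK installed

    return OpenAI(
        api_key=os.environ["LAOZHANG_API_KEY"],
        base_url=os.environ["LAOZHANG_BASE_URL"],
    )


def transcribe(client, path, model="gpt-4o-transcribe", **options):
    """Upload one audio file to the transcription endpoint and return the SDK result.

    Extra keyword options (e.g. language="en") are passed through unchanged.
    """
    with open(path, "rb") as audio_file:
        return client.audio.transcriptions.create(model=model, file=audio_file, **options)
```

Usage: `print(transcribe(make_client(), "meeting.mp3").text)`; with the default JSON response the transcript is in `.text`.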
3. Specify Language and Response Format
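The `language` and `response_format` options can be checked before a request is sent. This sketch validates them against the formats listed under Best Practices. One restriction is an assumption: in OpenAI's own API the timestamped formats (srt, vtt, verbose_json) have been limited to whisper-1, and the sketch assumes the gateway behaves the same way; verify before relying on it.

```python
# Response formats listed in this document's best-practices section.
RESPONSE_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}
# Assumption (mirrors OpenAI's API): timestamped formats need whisper-1.
WHISPER_ONLY_FORMATS = {"srt", "vtt", "verbose_json"}


def build_transcription_params(model, language=None, response_format="json"):
    """Return keyword arguments for client.audio.transcriptions.create()."""
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if response_format in WHISPER_ONLY_FORMATS and model != "whisper-1":
        raise ValueError(f"{response_format} is assumed to require whisper-1")
    params = {"model": model, "response_format": response_format}
    if language is not None:
        # ISO-639-1 codes are two lowercase letters, e.g. "zh", "en", "ja"
        if len(language) != 2 or not language.isalpha() or not language.islower():
            raise ValueError(f"expected an ISO-639-1 code, got {language!r}")
        params["language"] = language
    return params
```

Usage: `client.audio.transcriptions.create(file=f, **build_transcription_params("gpt-4o-transcribe", language="zh", response_format="text"))`.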
4. Using Whisper Model (Duration-based Billing)
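Because whisper-1 is billed by audio duration (seconds), it helps to know how long a file is before uploading it. For WAV input, Python's standard `wave` module reports the frame count and sample rate. Other formats (mp3, m4a, and so on) need a separate media tool, and the per-second rate and any rounding are set by the gateway, so both are left out of this sketch.

```python
import wave


def wav_duration_seconds(source):
    """Duration of a WAV file in seconds; source is a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_wav_duration(sources):
    """Sum the durations of several WAV files, e.g. a batch queued for whisper-1."""
    return sum(wav_duration_seconds(s) for s in sources)
```

The upload itself is the same call as above with `model="whisper-1"`.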
Supported Audio Formats
The following audio formats are supported (max file size 25 MB):
- mp3 - MP3 audio file
- mp4 - MP4 audio file
- mpeg - MPEG audio file
- mpga - MPEG audio file
- m4a - M4A audio file
- wav - WAV audio file
- webm - WebM audio file
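Checking the extension and size locally avoids a round trip that would only be rejected. This sketch uses the format list above; reading "25 MB" as 25,000,000 bytes is an assumption that errs on the strict side.

```python
import os

SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_UPLOAD_BYTES = 25_000_000  # assumption: "25 MB" read strictly as 25 * 10**6 bytes


def check_audio_file(path):
    """Return the file size in bytes, or raise ValueError if format or size is out of bounds."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"{path} is {size} bytes; split it into segments of at most 25 MB")
    return size
```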
🗣️ Text-to-Speech
1. Basic Example - cURL
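A cURL TTS call is a JSON POST with `model`, `input`, and `voice`; the response body is the audio itself (mp3 unless another `response_format` is requested). The sketch builds that request with the standard library, assuming the OpenAI-style `/audio/speech` path under a placeholder base URL ending in `/v1`.

```python
import json
import urllib.request


def build_speech_request(base_url, api_key, text, voice="alloy", model="tts-1"):
    """Build (but do not send) the JSON request a TTS cURL call would make."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio/speech",  # base_url is a placeholder (assumption)
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request, out_path, timeout=60):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request, timeout=timeout) as response, open(out_path, "wb") as f:
        f.write(response.read())
```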
2. Python Example - Generate Audio File
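With the OpenAI Python SDK (v1.x), `client.audio.speech.create(...)` returns a binary response whose `.content` holds the audio bytes. This sketch writes them to a file; `client` is assumed to be an `OpenAI(api_key=..., base_url=...)` instance configured for the gateway.

```python
def synthesize_to_file(client, text, out_path, voice="alloy", model="tts-1", **options):
    """Generate speech for text and save it; extra options (e.g. speed) pass through."""
    response = client.audio.speech.create(model=model, voice=voice, input=text, **options)
    with open(out_path, "wb") as f:
        f.write(response.content)  # raw audio bytes, mp3 by default
    return out_path
```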
3. Using HD Model
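Switching to HD changes only the `model` value (`tts-1-hd`). This sketch bundles that choice with checks on the voice list and the 4096-character input limit stated in this document; the helper name is hypothetical.

```python
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
MAX_INPUT_CHARS = 4096  # per-request text limit listed in this document


def build_speech_params(text, voice="alloy", hd=False):
    """Keyword arguments for client.audio.speech.create(); hd=True selects tts-1-hd."""
    model = "tts-1-hd" if hd else "tts-1"
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {sorted(VOICES)}")
    if not text.strip():
        raise ValueError("input text is empty")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input is {len(text)} characters; the limit is {MAX_INPUT_CHARS}")
    return {"model": model, "voice": voice, "input": text}
```

Usage: `client.audio.speech.create(**build_speech_params("Chapter one.", voice="fable", hd=True))`.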
4. Adjust Speech Speed
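Speed is a numeric `speed` parameter where 1.0 is normal. OpenAI's speech endpoint accepts 0.25 to 4.0; the sketch assumes the gateway uses the same range. The preset values are illustrative picks from inside the ranges given under Best Practices, not documented defaults.

```python
MIN_SPEED, MAX_SPEED = 0.25, 4.0  # OpenAI's accepted range (assumed identical on the gateway)

# Illustrative values chosen inside the Best Practices ranges
# (normal 1.0, fast broadcast 1.2-1.5, slow teaching 0.75-0.9).
SPEED_PRESETS = {"normal": 1.0, "broadcast": 1.3, "teaching": 0.85}


def with_speed(params, speed=1.0):
    """Return a copy of speech params with a validated speed (preset name or number)."""
    value = SPEED_PRESETS[speed] if isinstance(speed, str) else float(speed)
    if not MIN_SPEED <= value <= MAX_SPEED:
        raise ValueError(f"speed {value} is outside {MIN_SPEED}-{MAX_SPEED}")
    return {**params, "speed": value}
```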
5. Real-time Streaming Output
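Streaming lets you write (or play) audio as chunks arrive instead of waiting for the whole file. This sketch assumes openai-python v1.x, where `client.audio.speech.with_streaming_response.create(...)` is a context manager whose response exposes `iter_bytes()`; check your installed SDK version if the attribute is missing.

```python
def stream_speech_to_file(client, text, out_path, voice="alloy", model="tts-1",
                          chunk_size=4096):
    """Write audio chunks to out_path as they arrive; return the number of bytes written."""
    written = 0
    with client.audio.speech.with_streaming_response.create(
        model=model, voice=voice, input=text
    ) as response, open(out_path, "wb") as f:
        for chunk in response.iter_bytes(chunk_size=chunk_size):
            f.write(chunk)  # a player could consume the chunk here instead
            written += len(chunk)
    return written
```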
🎯 Common Use Cases
1. Meeting Transcription
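Meeting recordings often exceed the 25 MB upload limit. For WAV input, one approach is to split by a byte budget with the standard `wave` module and transcribe each segment in order. This is a sketch under assumptions: cuts fall at fixed positions and may split a word, the default budget leaves headroom below 25 MB for the WAV header, and naming an in-memory buffer via `.name` is how the SDK is assumed to pick the upload filename.

```python
import io
import wave


def split_wav(source, max_bytes=24_000_000):
    """Split a WAV (path or binary file object) into in-memory WAV segments.

    Each segment's audio data is at most max_bytes; a 44-byte header is added.
    """
    segments = []
    with wave.open(source, "rb") as wav:
        params = wav.getparams()
        frame_size = wav.getsampwidth() * wav.getnchannels()
        frames_per_segment = max(1, max_bytes // frame_size)
        while True:
            frames = wav.readframes(frames_per_segment)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as out:
                out.setparams(params)  # frame count in the header is patched on write
                out.writeframes(frames)
            buf.name = f"segment_{len(segments):03d}.wav"  # filename hint for the upload
            buf.seek(0)
            segments.append(buf)
    return segments


def transcribe_meeting(client, source, model="gpt-4o-transcribe", max_bytes=24_000_000):
    """Transcribe each segment in order and join the texts with newlines."""
    texts = []
    for segment in split_wav(source, max_bytes):
        result = client.audio.transcriptions.create(model=model, file=segment)
        texts.append(result.text)
    return "\n".join(texts)
```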
2. Video Subtitle Generation
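whisper-1 can return SRT directly with `response_format="srt"`. When you instead have timed segments (for example the `start`, `end`, and `text` fields of a verbose_json response) or need to shift the timing of segments from a split file, building SRT yourself is straightforward. The `offset` and `start_index` parameters are illustrative additions for merging segment files.

```python
def format_srt_timestamp(seconds):
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments, offset=0.0, start_index=1):
    """Convert [{'start', 'end', 'text'}, ...] into SRT text, shifted by offset seconds."""
    blocks = []
    for index, seg in enumerate(segments, start=start_index):
        start = format_srt_timestamp(seg["start"] + offset)
        end = format_srt_timestamp(seg["end"] + offset)
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```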
3. Multi-language Content Broadcasting
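The OpenAI-format speech request has no language parameter; the spoken language follows the input text. A broadcast in several languages is therefore one request per pre-translated script (translation, e.g. via the Chat Completions API, is outside this sketch). The output file naming and the optional per-language `voices` mapping are hypothetical choices.

```python
import os


def synthesize_broadcasts(client, scripts, out_dir, voice="nova", model="tts-1", voices=None):
    """scripts maps a language code to text already written in that language.

    Returns {language_code: path_to_mp3}. voices optionally overrides the voice per language.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs = {}
    for lang, text in scripts.items():
        chosen = (voices or {}).get(lang, voice)
        response = client.audio.speech.create(model=model, voice=chosen, input=text)
        path = os.path.join(out_dir, f"announcement_{lang}.mp3")
        with open(path, "wb") as f:
            f.write(response.content)
        outputs[lang] = path
    return outputs
```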
4. Audiobook Creation
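A book is far longer than the 4096-character request limit, so the text has to be chunked, preferably at sentence ends so pauses fall naturally. This sketch splits on common English and Chinese sentence punctuation (falling back to a hard cut for an over-long sentence) and uses fable with tts-1-hd, following the narration and content-creation guidance in this document. Joining the mp3 parts into one file is left to an audio tool.

```python
import os
import re

MAX_INPUT_CHARS = 4096


def chunk_text(text, limit=MAX_INPUT_CHARS):
    """Split text into chunks of at most limit characters, preferring sentence boundaries."""
    # Each piece is a sentence plus its trailing punctuation and whitespace.
    pieces = re.findall(r".+?(?:[.!?。！？]+\s*|$)", text, flags=re.S)
    chunks, current = [], ""
    for piece in pieces:
        while len(piece) > limit:  # a single sentence longer than limit: hard split
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:limit])
            piece = piece[limit:]
        if len(current) + len(piece) > limit:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


def synthesize_audiobook(client, text, out_dir, voice="fable", model="tts-1-hd"):
    """Synthesize each chunk to part_000.mp3, part_001.mp3, ...; return the paths in order."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunk_text(text)):
        response = client.audio.speech.create(model=model, voice=voice, input=chunk)
        path = os.path.join(out_dir, f"part_{index:03d}.mp3")
        with open(path, "wb") as f:
            f.write(response.content)
        paths.append(path)
    return paths
```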
💡 Best Practices
Speech-to-Text Optimization
- Audio Quality:
  - Sample rate ≥16 kHz recommended
  - Lower background noise improves accuracy
  - Clear voice recordings work best
- File Size:
  - Single file ≤25 MB
  - Split large files into segments
- Language Specification:
  - Specify the language for better accuracy
  - Use ISO-639-1 codes such as zh (Chinese), en (English), ja (Japanese)
- Response Format Selection:
  - json: Default format with full information
  - text: Plain text output
  - srt/vtt: Subtitles with timestamps
  - verbose_json: Detailed JSON with timestamps and word-level info
Text-to-Speech Optimization
- Voice Selection:
  - alloy/nova: General purpose
  - echo/onyx: News and broadcasting
  - fable/shimmer: Story narration
- Speed Adjustment:
  - Normal speed: 1.0
  - Fast broadcast: 1.2 - 1.5
  - Slow teaching: 0.75 - 0.9
- Text Optimization:
  - Max text length ≤4096 characters per request
  - Use punctuation to control pauses and intonation
  - Convert numbers and symbols to words
- Cost Control:
  - Use tts-1 for standard scenarios
  - Use tts-1-hd for high-quality needs
  - Choose the model based on your requirements
Error Handling
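Transient failures (rate limits, dropped connections, server errors) are worth retrying with exponential backoff; invalid requests such as an unsupported format or an oversized file are not. The OpenAI SDK already retries some errors itself (`OpenAI(max_retries=...)`). For an extra application-level layer, this generic sketch takes the exception types to retry; with openai-python v1.x you might pass `(openai.RateLimitError, openai.APIConnectionError, openai.InternalServerError)`. The delays are illustrative.

```python
import random
import time


def call_with_retries(fn, retriable=(ConnectionError, TimeoutError), max_attempts=4,
                      base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exception types with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ... with the defaults
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out simultaneous retries
```

Usage: `call_with_retries(lambda: transcribe(client, "meeting.mp3"), retriable=(openai.RateLimitError, openai.APIConnectionError))`. Exceptions not listed in `retriable` propagate immediately.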
📊 Performance Comparison
Speech-to-Text Models
| Model | Accuracy | Speed | Languages | Billing | Price |
|---|---|---|---|---|---|
| gpt-4o-transcribe | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 50+ | Token | $$ |
| gpt-4o-mini-transcribe | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 50+ | Token | $ |
| whisper-1 | ⭐⭐⭐⭐ | ⭐⭐⭐ | 50+ | Duration | $ |
Text-to-Speech Models
| Model | Quality | Speed | Naturalness | Price |
|---|---|---|---|---|
| tts-1 | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $ |
| tts-1-hd | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $$ |
🚨 Important Notes
- Privacy Protection: Don’t upload audio files with sensitive information
- Compliance: Follow relevant laws and regulations, avoid illegal uses
- Copyright Notice: Generated speech content should be marked as AI-generated
- File Limits: Max audio file 25 MB, max text 4096 characters
- Usage Restrictions: Do not use for impersonation or misinformation
🔗 Related Resources
- Chat Completions API - Learn more about the Chat API
- Pricing Information - View pricing details
💡 Tip: Start with gpt-4o-mini-transcribe or tts-1 for testing, then upgrade to premium models for production deployment.