first commit

Lorenzo Iovino 2025-05-23 14:45:22 +02:00
commit b6d47982c7
8 changed files with 1353 additions and 0 deletions

28  .env.sample  Normal file

@@ -0,0 +1,28 @@
# Whisper Parallel Configuration
# SSH Key Configuration
KEY_NAME=whisper-key
KEY_FILE=$HOME/.ssh/whisper-key.pem
SECURITY_GROUP=whisper-sg
# AWS Instance Configuration
INSTANCE_TYPE=g4dn.12xlarge
REGION=eu-south-1
AMI_ID=ami-059603706d3734615
# Video/Audio Processing
VIDEO_FILE=mio_video.mp4
START_MIN=0
END_MIN=0
SHIFT_SECONDS=0
SHIFT_ONLY=false
INPUT_PREFIX=
# GPU Configuration
GPU_COUNT=1
# Processing Options
NUM_SPEAKERS=
FIX_START=true
# API Tokens
HF_TOKEN=your_huggingface_token_here

53  .gitignore  vendored  Normal file

@@ -0,0 +1,53 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Environment variables
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
# Processed files
*.wav
*.mp3
*.mp4
*.srt
*.vtt
transcript*.*
!transcription-runner/mio_video.mp4
# Logs
*.log
# OS specific
.DS_Store
._.DS_Store
*.swp
*.swo
# IDE
.idea/
.vscode/
*.sublime-project
*.sublime-workspace
.ropeproject/

215  README.md  Normal file

@@ -0,0 +1,215 @@
# Transcription Runner with Multi-Chunk Processing and Parallel GPUs
This package lets you:
- Launch a GPU EC2 instance on AWS (g4dn.12xlarge)
- Split an `.mp4` video file into multiple chunks and transcribe them
- Automatically generate a transcript plus speaker diarization
- Download the output files
- Terminate the instance to keep costs down
- Apply a time shift to the transcript timestamps
- Configure every option through a `.env` file
---
## ✅ Prerequisites
### 1. **Install the AWS CLI**
If you have not installed the AWS CLI yet:
- On macOS with Homebrew:
```bash
brew install awscli
```
- On Linux (Debian/Ubuntu):
```bash
sudo apt update
sudo apt install awscli
```
### 2. **Configure the AWS CLI**
Once installed, run:
```bash
aws configure
```
Enter:
- Access key ID
- Secret access key
- Default region (e.g. `eu-south-1`)
- Output format: `json`
### 3. **Create an SSH key pair for EC2**
In a terminal, run:
```bash
aws ec2 create-key-pair --key-name whisper-key --query 'KeyMaterial' --output text > ~/.ssh/whisper-key.pem
chmod 400 ~/.ssh/whisper-key.pem
```
### 4. **Install netcat**
- On macOS with Homebrew:
```bash
brew install netcat
```
- On Linux (Debian/Ubuntu):
```bash
sudo apt install netcat
```
### 5. **Sign up on Hugging Face and get a token**
Go to: https://huggingface.co/settings/tokens
Create a token with read access to the models and copy its value.
### 6. **IAM role "WhisperS3Profile" with S3 access**
Make sure your AWS account has an IAM instance profile named "WhisperS3Profile" whose role grants S3 access; the script attaches it to the EC2 instance at launch.
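If the profile does not exist yet, it can be created once with the AWS CLI. The sketch below is illustrative (the role name `WhisperS3Role` and the broad `AmazonS3FullAccess` policy are assumptions; scope the permissions down as needed):
```bash
# Role that EC2 instances are allowed to assume
aws iam create-role \
  --role-name WhisperS3Role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'
# Grant S3 access (broad managed policy used here for simplicity)
aws iam attach-role-policy \
  --role-name WhisperS3Role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
# Instance profile with the exact name the script passes to run-instances
aws iam create-instance-profile --instance-profile-name WhisperS3Profile
aws iam add-role-to-instance-profile \
  --instance-profile-name WhisperS3Profile \
  --role-name WhisperS3Role
```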
### 7. **Configure the .env file**
Copy `.env.sample` to `.env` and adjust the values to your needs:
```bash
cp .env.sample .env
nano .env  # or use whatever editor you prefer
```
---
## ▶️ How to use
### Basic usage
```bash
chmod +x whisper_parallel.sh
./whisper_parallel.sh
```
### Configuration via the .env file
Edit the `.env` file with your parameters, then run:
```bash
./whisper_parallel.sh
```
### Passing parameters as environment variables (overrides .env)
```bash
VIDEO_FILE="mia_intervista.mp4" START_MIN=5 END_MIN=15 GPU_COUNT=4 ./whisper_parallel.sh
```
### Available parameters
These parameters can be set in the `.env` file or as environment variables:
| Parameter | Description | Default |
|-----------|-------------|---------|
| VIDEO_FILE | The video/audio file to transcribe | mio_video.mp4 |
| START_MIN | Start minute for cropping | 0 |
| END_MIN | End minute for cropping | 0 (to the end) |
| SHIFT_SECONDS | Shift timestamps by X seconds | 0 |
| GPU_COUNT | Number of chunks to split the audio into | 1 |
| NUM_SPEAKERS | Number of speakers, if known in advance | (auto) |
| DIARIZATION_ENABLED | Enable/disable speaker recognition | true |
| INSTANCE_TYPE | EC2 instance type | g4dn.12xlarge |
| REGION | AWS region | eu-south-1 |
| BUCKET_NAME | S3 bucket name | whisper-video-transcripts |
| HF_TOKEN | Hugging Face token for Pyannote | (required) |
| FIX_START | Prepend silence to improve capture of the first seconds | true |
| SHIFT_ONLY | Only apply the timestamp shift to existing files | false |
| INPUT_PREFIX | Prefix of the input files when SHIFT_ONLY is used | "" |
| WHISPER_MODEL | Whisper model to use | large |
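As a concrete example, a `.env` for a two-speaker interview processed in four chunks might look like this (values are purely illustrative):
```bash
VIDEO_FILE=intervista.mp4
START_MIN=0
END_MIN=0
GPU_COUNT=4
NUM_SPEAKERS=2
DIARIZATION_ENABLED=true
WHISPER_MODEL=large
INSTANCE_TYPE=g4dn.12xlarge
REGION=eu-south-1
BUCKET_NAME=whisper-video-transcripts
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```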
---
## 📦 Output
When the run finishes you will find these files in the current directory:
- `{file-name}_{start}_{end}_{random}.txt` → raw transcript
- `{file-name}_{start}_{end}_{random}_final.txt` → transcript with speaker labels
- `{file-name}_{start}_{end}_{random}.srt` → SRT subtitle file
- `{file-name}_{start}_{end}_{random}.vtt` → VTT file for web subtitles
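For example, a run on `conferenza.mp4` with the default crop settings might produce files like these (the 8-character suffix is random):
```
conferenza_0_0_a1b2c3d4.txt
conferenza_0_0_a1b2c3d4_final.txt
conferenza_0_0_a1b2c3d4.srt
conferenza_0_0_a1b2c3d4.vtt
```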
---
## 🚀 Multi-Chunk Mode
The current version of the script automatically splits the audio into several parts and processes them in parallel on the GPUs (a rough sketch of the idea follows the list below). This:
1. Improves memory usage for long files
2. Speeds up transcription of long recordings
3. Makes better use of the available hardware
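The chunking happens on the EC2 instance; conceptually it amounts to cutting the audio into equal slices with ffmpeg, one per GPU, roughly like this sketch (file names and the fixed chunk count are illustrative, not the script's actual internals):
```bash
# Split audio.wav into GPU_COUNT equal chunks (illustrative sketch only)
GPU_COUNT=4
DURATION=$(ffprobe -v error -show_entries format=duration -of csv=p=0 audio.wav)
CHUNK_LEN=$(python3 -c "print($DURATION / $GPU_COUNT)")
for i in $(seq 0 $((GPU_COUNT - 1))); do
  START=$(python3 -c "print($i * $CHUNK_LEN)")
  ffmpeg -y -i audio.wav -ss "$START" -t "$CHUNK_LEN" -c copy "chunk_${i}.wav"
done
```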
### Performance tips
1. **Instance choice**: g4dn.xlarge is enough for short files; use g4dn.12xlarge with multiple GPUs for long files
2. **Number of chunks**: for long files, splitting into more chunks helps keep memory usage under control
3. **Model**: for very long files, consider the "medium" or "base" model instead of "large"
---
## 🧪 Usage examples
### Configuration via .env
Edit the `.env` file with your parameters, then run:
```bash
./whisper_parallel.sh
```
### Transcribe a whole file
```bash
VIDEO_FILE="conferenza.mp4" ./whisper_parallel.sh
```
### Transcribe a specific portion
```bash
VIDEO_FILE="lezione.mp4" START_MIN=10 END_MIN=20 ./whisper_parallel.sh
```
### Split a long file into more chunks
```bash
VIDEO_FILE="intervista.mp4" GPU_COUNT=6 ./whisper_parallel.sh
```
### Disable diarization (transcription only)
```bash
VIDEO_FILE="audio.mp4" DIARIZATION_ENABLED=false ./whisper_parallel.sh
```
### Specify the number of speakers
```bash
VIDEO_FILE="intervista.mp4" NUM_SPEAKERS=2 ./whisper_parallel.sh
```
### Shift the timestamps of an existing transcription
```bash
SHIFT_ONLY=true SHIFT_SECONDS=30 INPUT_PREFIX="mia_trascrizione" ./whisper_parallel.sh
```
---
## 🔄 Advanced features
### Timestamp shifting
The script can shift the timestamps in the transcript files, which is useful when:
- You cut away an initial portion of the video
- You need to sync subtitles with an edited video
- You work with segments extracted from a longer video
### File types supported for shifting
- `.srt` (SubRip Text)
- `.vtt` (WebVTT)
- `.txt` (transcript with timestamps)
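For instance, with `SHIFT_SECONDS=30` an SRT cue is moved forward like this and written to a new file with the `_shifted` suffix (times are illustrative; shifts that would become negative are clamped to zero):
```
Before (mia_trascrizione.srt):          00:00:05,440 --> 00:00:08,300
After  (mia_trascrizione_shifted.srt):  00:00:35,440 --> 00:00:38,300
```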
---
## ☁️ Notes
- The EC2 instance is **terminated automatically** when the run finishes.
- Audio files are removed from the S3 bucket after download.
- Output file names include a random suffix to avoid collisions.
- If the script is interrupted, it still cleans up the AWS resources.
- The companion file `parallel_transcript.py` is required for processing on EC2.
## Technical details
- Uses FFmpeg for audio extraction
- Automatically creates the AWS security group and uses the default VPC when available
- Runs an automatic cleanup when the script exits
- Supports high-quality diarization via Pyannote/WhisperX
- Provides timestamp shifting for every output format
## Security
- The script creates a security group that allows SSH access from any IP (0.0.0.0/0)
- AWS credentials with EC2 and S3 permissions are required
- SSH keys are used for secure access to the instance
- The `.env` file contains sensitive data and should not be committed to version control (it is already listed in `.gitignore`)
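If you want to lock SSH down after the first run, you can replace the open rule with one restricted to your current public IP, for example (a sketch; the group name and region must match your `.env`):
```bash
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 revoke-security-group-ingress --group-name whisper-sg \
  --protocol tcp --port 22 --cidr 0.0.0.0/0 --region eu-south-1
aws ec2 authorize-security-group-ingress --group-name whisper-sg \
  --protocol tcp --port 22 --cidr "${MY_IP}/32" --region eu-south-1
```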

398  parallel_transcript.py  Normal file

@@ -0,0 +1,398 @@
import os
import argparse
import whisper
import torch
import time
import threading
import json
from pyannote.audio import Pipeline
from datetime import timedelta
import numpy as np
from pydub import AudioSegment
import math
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
def start_spinner():
def spin():
while not spinner_done:
print(".", end="", flush=True)
time.sleep(1)
global spinner_done
spinner_done = False
t = threading.Thread(target=spin)
t.start()
return t
def stop_spinner(thread):
global spinner_done
spinner_done = True
thread.join()
print("")
def extend_audio_beginning(input_audio, output_audio, silence_duration=2000):
"""Aggiunge un breve silenzio all'inizio dell'audio per catturare meglio i primi secondi"""
print(f"🔄 Aggiungendo {silence_duration/1000} secondi di silenzio all'inizio dell'audio...")
audio = AudioSegment.from_file(input_audio)
silence = AudioSegment.silent(duration=silence_duration) # 2 secondi di silenzio
extended_audio = silence + audio
extended_audio.export(output_audio, format="wav")
print(f"✅ Audio esteso salvato come {output_audio}")
return output_audio
def transcribe_audio(audio_path, model_size="large"):
"""Trascrivi l'audio con Whisper usando impostazioni ottimizzate"""
print(f"🔹 Trascrizione con Whisper ({model_size})...")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"⚙️ Uso dispositivo: {device.upper()}")
model = whisper.load_model(model_size).to(device)
# Impostazioni avanzate per migliorare il rilevamento del discorso
options = {
"language": "it",
"condition_on_previous_text": True, # Migliora la coerenza tra segmenti
"suppress_tokens": [-1], # Sopprime i tokens di silenzio
"initial_prompt": "Trascrizione di una conversazione tra tre persone." # Contestualizza
}
    # Enable word-level timestamps only if this Whisper version supports them,
    # checked via the function signature instead of running a throw-away transcription
    import inspect
    if "word_timestamps" in inspect.signature(whisper.transcribe).parameters:
        options["word_timestamps"] = True
        print("✅ Utilizzo timestamp a livello di parola")
    else:
        print("⚠️ Questa versione di Whisper non supporta i timestamp a livello di parola")
spinner = start_spinner()
start = time.time()
result = model.transcribe(audio_path, **options)
stop_spinner(spinner)
duration = time.time() - start
print(f"✅ Trascrizione completata in {round(duration, 2)} secondi")
# Salva anche i timestamp delle parole per post-processing
with open(f"{TRANSCRIPT_FILE}.words.json", "w", encoding="utf-8") as f:
json.dump(result, f, ensure_ascii=False, indent=2)
with open(TRANSCRIPT_FILE, "w", encoding="utf-8") as f:
f.write(result["text"])
return result["segments"]
def diarize_audio(audio_path, hf_token, num_speakers=None):
"""Diarizzazione audio con parametri ottimizzati per sovrapposizioni"""
print("🔹 Riconoscimento speaker (v3.1) con Pyannote...")
# Carica il modello senza tentare di modificare i parametri
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
spinner = start_spinner()
start = time.time()
# Utilizzo del numero di speaker se specificato
if num_speakers is not None:
print(f" Utilizzo {num_speakers} speaker come specificato")
diarization = pipeline(audio_path, num_speakers=num_speakers)
else:
diarization = pipeline(audio_path)
stop_spinner(spinner)
duration = time.time() - start
print(f"✅ Speaker identificati in {round(duration, 2)} secondi")
# Analizza gli speaker identificati
speakers = set()
for segment, _, speaker in diarization.itertracks(yield_label=True):
speakers.add(speaker)
print(f"👥 Identificati {len(speakers)} speaker: {', '.join(sorted(speakers))}")
# Salva la diarizzazione grezza per ispezione
with open(f"{OUTPUT_FILE}.diarization.json", "w", encoding="utf-8") as f:
segments = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
segments.append({
"start": segment.start,
"end": segment.end,
"speaker": speaker
})
json.dump(segments, f, indent=2)
return diarization
def format_time(seconds, srt=False):
"""Formatta il tempo in formato leggibile"""
td = timedelta(seconds=float(seconds))
hours, remainder = divmod(td.seconds, 3600)
minutes, seconds = divmod(remainder, 60)
milliseconds = round(td.microseconds / 1000)
if srt:
return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
else:
return f"{hours:d}:{minutes:02d}:{seconds:02d}.{milliseconds:03d}"
def find_overlapping_speech(diarization, threshold=0.5):
"""Identifica segmenti con sovrapposizione di parlato"""
overlap_segments = []
speaker_segments = {}
# Organizza i segmenti per speaker
for segment, _, speaker in diarization.itertracks(yield_label=True):
if speaker not in speaker_segments:
speaker_segments[speaker] = []
speaker_segments[speaker].append((segment.start, segment.end))
# Trova sovrapposizioni tra speaker diversi
speakers = list(speaker_segments.keys())
for i in range(len(speakers)):
for j in range(i+1, len(speakers)):
speaker1 = speakers[i]
speaker2 = speakers[j]
for seg1_start, seg1_end in speaker_segments[speaker1]:
for seg2_start, seg2_end in speaker_segments[speaker2]:
# Controlla se i segmenti si sovrappongono
if seg1_start < seg2_end and seg2_start < seg1_end:
overlap_start = max(seg1_start, seg2_start)
overlap_end = min(seg1_end, seg2_end)
overlap_duration = overlap_end - overlap_start
if overlap_duration >= threshold:
overlap_segments.append({
"start": overlap_start,
"end": overlap_end,
"speakers": [speaker1, speaker2],
"duration": overlap_duration
})
# Combina sovrapposizioni vicine
if overlap_segments:
overlap_segments.sort(key=lambda x: x["start"])
merged = [overlap_segments[0]]
for current in overlap_segments[1:]:
previous = merged[-1]
if current["start"] - previous["end"] < 0.5: # Meno di mezzo secondo di distanza
# Unisci gli intervalli
previous["end"] = max(previous["end"], current["end"])
previous["speakers"] = list(set(previous["speakers"] + current["speakers"]))
previous["duration"] = previous["end"] - previous["start"]
else:
merged.append(current)
overlap_segments = merged
return overlap_segments
def match_transcript_to_speakers(transcript_segments, diarization, min_segment_length=1.0, max_chars=150):
"""Abbina la trascrizione agli speaker con gestione migliorata delle sovrapposizioni"""
print("🔹 Combinazione transcript + speaker...")
# Trova le potenziali sovrapposizioni
overlaps = find_overlapping_speech(diarization)
if overlaps:
print(f" Rilevate {len(overlaps)} potenziali sovrapposizioni di parlato")
with open(f"{OUTPUT_FILE}.overlaps.json", "w", encoding="utf-8") as f:
json.dump(overlaps, f, indent=2)
# Combina segmenti brevi dello stesso speaker
combined_segments = []
current_segment = None
for segment, _, speaker in diarization.itertracks(yield_label=True):
start_time = round(segment.start, 2)
end_time = round(segment.end, 2)
# Skip se il segmento è troppo breve
if end_time - start_time < 0.2:
continue
# Se è il primo segmento o c'è un cambio di speaker
if current_segment is None or current_segment["speaker"] != speaker:
if current_segment is not None:
combined_segments.append(current_segment)
current_segment = {
"start": start_time,
"end": end_time,
"speaker": speaker
}
else:
# Estendi il segmento corrente
current_segment["end"] = end_time
# Aggiungi l'ultimo segmento
if current_segment is not None:
combined_segments.append(current_segment)
# Ora abbina il testo ai segmenti combinati
output_segments = []
counter = 1
for segment in combined_segments:
start_time = segment["start"]
end_time = segment["end"]
speaker = segment["speaker"]
# Skip segmenti troppo brevi dopo la combinazione
if end_time - start_time < min_segment_length:
continue
# Trova il testo che corrisponde a questo intervallo di tempo
text = ""
for s in transcript_segments:
# Se il segmento di testo si sovrappone al segmento speaker
if (s["start"] < end_time and s["end"] > start_time):
text += s["text"] + " "
text = text.strip()
if text:
# Controlla se questo segmento è in una sovrapposizione
is_overlap = False
overlap_speakers = []
for overlap in overlaps:
if (start_time < overlap["end"] and end_time > overlap["start"]):
is_overlap = True
overlap_speakers = overlap["speakers"]
break
# Formatta l'output in base alla presenza di sovrapposizione
if is_overlap and speaker in overlap_speakers:
speaker_text = f"[{speaker}+] " if len(overlap_speakers) > 1 else f"[{speaker}] "
else:
speaker_text = f"[{speaker}] "
# Crea il segmento completo con testo
output_segment = {
"start": start_time,
"end": end_time,
"speaker": speaker,
"speaker_text": speaker_text,
"text": text
}
output_segments.append(output_segment)
# Ora formatta e salva l'output finale
output = []
srt = []
vtt = ["WEBVTT\n"]
for i, segment in enumerate(output_segments, 1):
start_time = segment["start"]
end_time = segment["end"]
speaker_text = segment["speaker_text"]
text = segment["text"]
formatted_text = f"{speaker_text}({format_time(start_time)} - {format_time(end_time)}): {text}"
srt_text = f"{counter}\n{format_time(start_time, True)} --> {format_time(end_time, True)}\n{speaker_text}{text}"
vtt_text = f"{format_time(start_time)} --> {format_time(end_time)}\n{speaker_text}{text}"
output.append(formatted_text)
srt.append(srt_text)
vtt.append(vtt_text)
counter += 1
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
f.write("\n".join(output))
with open(SRT_FILE, "w", encoding="utf-8") as f:
f.write("\n\n".join(srt))
with open(VTT_FILE, "w", encoding="utf-8") as f:
f.write("\n\n".join(vtt))
print("✅ Output finale salvato:", OUTPUT_FILE, SRT_FILE, VTT_FILE)
def parse_timestamp(time_str):
"""Convert a timestamp string to seconds"""
# Handle both SRT (00:00:00,000) and standard format (00:00:00.000)
time_str = time_str.replace(',', '.')
hours, minutes, seconds = time_str.split(':')
hours = int(hours)
minutes = int(minutes)
seconds = float(seconds)
total_seconds = hours * 3600 + minutes * 60 + seconds
return total_seconds
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Trascrizione + Speaker Diarization avanzata")
parser.add_argument("--audio", help="File audio WAV", required=True)
parser.add_argument("--token", help="Token Hugging Face per Pyannote")
parser.add_argument("--model", default="large", help="Modello Whisper (tiny, base, medium, large)")
parser.add_argument("--no-diarization", action="store_true", help="Disabilita il riconoscimento speaker")
parser.add_argument("--output-prefix", default="transcript", help="Prefisso per i file di output")
parser.add_argument("--num-speakers", type=int, default=None, help="Numero di speaker se conosciuto in anticipo")
parser.add_argument("--fix-start", action="store_true", help="Aggiungi silenzio all'inizio per catturare meglio i primi secondi")
parser.add_argument("--min-segment", type=float, default=1.0, help="Lunghezza minima dei segmenti in secondi")
args = parser.parse_args()
# Use Hugging Face token from environment variable if not provided via argument
hf_token = args.token or os.getenv("HF_TOKEN")
if not hf_token:
raise ValueError("Token Hugging Face non fornito. Specificarlo con --token o nella variabile HF_TOKEN nel file .env")
# Use model from environment variable if available
model_size = os.getenv("WHISPER_MODEL", args.model)
# Use number of speakers from environment variable if available and not provided via argument
num_speakers = args.num_speakers
if num_speakers is None and os.getenv("NUM_SPEAKERS"):
try:
num_speakers = int(os.getenv("NUM_SPEAKERS"))
except ValueError:
pass
# Use fix-start from environment variable if available and not provided via argument
fix_start = args.fix_start
if not fix_start and os.getenv("FIX_START", "").lower() == "true":
fix_start = True
# Definizione dei nomi dei file di output
output_prefix = os.getenv("OUTPUT_PREFIX", args.output_prefix)
TRANSCRIPT_FILE = f"{output_prefix}.txt"
OUTPUT_FILE = f"{output_prefix}_final.txt"
SRT_FILE = f"{output_prefix}.srt"
VTT_FILE = f"{output_prefix}.vtt"
if not os.path.exists(args.audio):
raise ValueError(f"File audio {args.audio} non trovato")
# Aggiungi silenzio all'inizio se richiesto
input_audio = args.audio
if fix_start:
extended_audio = "extended_" + os.path.basename(args.audio)
input_audio = extend_audio_beginning(args.audio, extended_audio)
# Trascrivi l'audio
segments = transcribe_audio(input_audio, model_size)
# Esegui diarizzazione e abbina trascrizione a speaker
if not args.no_diarization:
diarization = diarize_audio(input_audio, hf_token, num_speakers)
match_transcript_to_speakers(segments, diarization, args.min_segment)
else:
print("🛑 Diarization disabilitata. Salvo solo la trascrizione.")
with open(OUTPUT_FILE, "w", encoding="utf-8") as f_out, open(SRT_FILE, "w", encoding="utf-8") as f_srt, open(VTT_FILE, "w", encoding="utf-8") as f_vtt:
f_vtt.write("WEBVTT\n\n")
for i, s in enumerate(segments, 1):
start = format_time(s['start'])
end = format_time(s['end'])
f_out.write(f"({start} - {end}): {s['text'].strip()}\n")
f_srt.write(f"{i}\n{format_time(s['start'], True)} --> {format_time(s['end'], True)}\n{s['text'].strip()}\n\n")
f_vtt.write(f"{start} --> {end}\n{s['text'].strip()}\n\n")
print(f"✅ Output salvato senza diarizzazione: {OUTPUT_FILE}, {SRT_FILE}, {VTT_FILE}")
# Rimuovi file audio esteso se creato
    if fix_start and os.path.exists(extended_audio):
os.remove(extended_audio)

10  requirements.txt  Normal file

@@ -0,0 +1,10 @@
openai-whisper
torch>=1.13.1
numpy>=1.20.0
pyannote.audio>=3.0.0
python-dotenv>=0.19.0
pydub>=0.25.1
tqdm>=4.64.0
matplotlib>=3.5.0
scikit-learn>=1.0.0
soundfile>=0.10.3

113  setup.sh  Executable file

@@ -0,0 +1,113 @@
#!/bin/bash
set -e # Exit on error
echo "🚀 Setting up Transcription Runner..."
# Check if Python is installed
if ! command -v python3 &> /dev/null; then
echo "❌ Python 3 not found. Please install Python 3 before proceeding."
exit 1
fi
# Check if pip is installed
if ! command -v pip3 &> /dev/null; then
echo "❌ pip3 not found. Please install pip3 before proceeding."
exit 1
fi
# Check if AWS CLI is installed
if ! command -v aws &> /dev/null; then
echo "⚠️ AWS CLI not found. Installing..."
if [[ "$OSTYPE" == "darwin"* ]]; then
# macOS
if command -v brew &> /dev/null; then
brew install awscli
else
echo "❌ Homebrew not found. Please install AWS CLI manually."
exit 1
fi
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
# Linux
if command -v apt-get &> /dev/null; then
sudo apt-get update
sudo apt-get install -y awscli
elif command -v yum &> /dev/null; then
sudo yum install -y awscli
else
echo "❌ Unable to detect package manager. Please install AWS CLI manually."
exit 1
fi
else
echo "❌ Unsupported OS. Please install AWS CLI manually."
exit 1
fi
fi
# Check if netcat is installed
if ! command -v nc &> /dev/null; then
echo "⚠️ netcat not found. Installing..."
if [[ "$OSTYPE" == "darwin"* ]]; then
# macOS
if command -v brew &> /dev/null; then
brew install netcat
else
echo "❌ Homebrew not found. Please install netcat manually."
exit 1
fi
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
# Linux
if command -v apt-get &> /dev/null; then
sudo apt-get update
sudo apt-get install -y netcat
elif command -v yum &> /dev/null; then
sudo yum install -y netcat
else
echo "❌ Unable to detect package manager. Please install netcat manually."
exit 1
fi
else
echo "❌ Unsupported OS. Please install netcat manually."
exit 1
fi
fi
# Create virtual environment
echo "🔨 Creating Python virtual environment..."
python3 -m venv venv
source venv/bin/activate
# Install dependencies
echo "📦 Installing dependencies..."
pip install -r requirements.txt
# Create .env file if it doesn't exist
if [ ! -f .env ]; then
echo "📝 Creating .env file from template..."
cp .env.sample .env
echo " Please edit the .env file with your configuration before running the scripts."
fi
# Make shell scripts executable
echo "🔑 Making scripts executable..."
chmod +x whisper_parallel.sh
# Set up AWS credentials if needed
if ! aws configure list &> /dev/null; then
echo "⚠️ AWS credentials not configured. Setting up..."
echo "Please enter your AWS credentials:"
aws configure
fi
# Check if AWS key pair exists
KEY_NAME=$(grep -E '^KEY_NAME=' .env 2>/dev/null | cut -d '=' -f2)
KEY_NAME=${KEY_NAME:-whisper-key}
if ! aws ec2 describe-key-pairs --key-names "$KEY_NAME" &> /dev/null; then
echo "🔑 Creating EC2 key pair..."
mkdir -p ~/.ssh
aws ec2 create-key-pair --key-name "$KEY_NAME" --query 'KeyMaterial' --output text > ~/.ssh/"$KEY_NAME".pem
chmod 400 ~/.ssh/"$KEY_NAME".pem
echo "✅ Key pair created: ~/.ssh/$KEY_NAME.pem"
fi
echo "✅ Setup complete! You can now run ./whisper_parallel.sh"
echo " Remember to edit the .env file with your configuration."

78  test_fix.py  Normal file

@@ -0,0 +1,78 @@
import json
def split_long_segments(segments, max_chars=150):
"""Split segments that are too long into smaller chunks."""
import re
new_segments = []
for segment in segments:
if "text" in segment and len(segment["text"]) > max_chars:
# Split text at sentence boundaries or by character count
sentences = re.split(r'(?<=[.!?]) +', segment["text"])
current_text = ""
start_time = segment["start"]
for sentence in sentences:
if len(current_text) + len(sentence) > max_chars and current_text:
# Calculate proportional time based on text length
portion = len(current_text) / len(segment["text"])
mid_time = segment["start"] + portion * (segment["end"] - segment["start"])
new_segments.append({
"start": start_time,
"end": mid_time,
"text": current_text.strip(),
"speaker": segment.get("speaker", ""),
"speaker_text": segment.get("speaker_text", f"[{segment.get('speaker', '')}] ") # Fixed line
})
start_time = mid_time
current_text = sentence
else:
current_text += " " + sentence if current_text else sentence
# Add the last part
if current_text:
new_segments.append({
"start": start_time,
"end": segment["end"],
"text": current_text.strip(),
"speaker": segment.get("speaker", ""),
"speaker_text": segment.get("speaker_text", f"[{segment.get('speaker', '')}] ") # Fixed line
})
else:
new_segments.append(segment)
return new_segments
# Create mock test data
test_segments = [
{
"start": 0.0,
"end": 10.0,
"speaker": "SPEAKER_00",
"speaker_text": "[SPEAKER_00] ",
"text": "This is a very long text that exceeds the maximum character limit. It should be split into multiple segments. This is another sentence to make sure we have enough text to split. And one more sentence to be really sure."
}
]
# Run the split_long_segments function
print("Testing split_long_segments...")
split_segments = split_long_segments(test_segments, max_chars=50)
print(f"Number of segments after splitting: {len(split_segments)}")
# Verify that all segments have the speaker_text field
all_have_speaker_text = all("speaker_text" in segment for segment in split_segments)
print(f"All segments have speaker_text field: {all_have_speaker_text}")
# Dump the result to inspect
print("\nSplit segments:")
print(json.dumps(split_segments, indent=2))
# Check if we can access speaker_text without error
try:
for segment in split_segments:
speaker_text = segment["speaker_text"]
print("\n✅ Successfully accessed speaker_text on all segments")
except KeyError as e:
print(f"\n❌ KeyError when accessing: {e}")

458  whisper_parallel.sh  Executable file

@@ -0,0 +1,458 @@
#!/bin/bash
# Load environment variables from .env file
if [ -f .env ]; then
echo "Loading environment variables from .env file..."
set -o allexport
source .env
set +o allexport
else
echo "Warning: .env file not found. Using default values."
fi
# === CONFIGURAZIONE ===
# These defaults will be used if not set in .env file
KEY_NAME=${KEY_NAME:-"whisper-key"}
KEY_FILE=${KEY_FILE:-"$HOME/.ssh/${KEY_NAME}.pem"}
SECURITY_GROUP=${SECURITY_GROUP:-"whisper-sg"}
INSTANCE_TYPE=${INSTANCE_TYPE:-"g4dn.12xlarge"} # Default instance type (4x T4 GPUs); check your vCPU quota in the region
REGION=${REGION:-"eu-south-1"}
AMI_ID=${AMI_ID:-"ami-059603706d3734615"}
VIDEO_FILE=${VIDEO_FILE:-"mio_video.mp4"}
ORIGINAL_FILENAME=$(basename "$VIDEO_FILE" | cut -d. -f1)
START_MIN=${START_MIN:-0} # Default value if not set
END_MIN=${END_MIN:-0} # Default value if not set
SHIFT_SECONDS=${SHIFT_SECONDS:-0} # Shift timestamps by this many seconds
SHIFT_ONLY=${SHIFT_ONLY:-false} # Set to true to only perform shifting on existing files
INPUT_PREFIX=${INPUT_PREFIX:-""} # Prefix for input files when using SHIFT_ONLY
GPU_COUNT=${GPU_COUNT:-1} # Numero di GPU da utilizzare (default: 1)
NUM_SPEAKERS=${NUM_SPEAKERS:-""} # Numero di speaker se conosciuto (opzionale)
FIX_START=${FIX_START:-"true"} # Aggiunge silenzio all'inizio per catturare i primi secondi
# === FUNZIONE PER SHIFT DEI TIMESTAMPS ===
shift_timestamps() {
local input_file=$1
local output_file=$2
local shift_by=$3
local file_ext="${input_file##*.}"
if [ "$file_ext" = "srt" ]; then
echo "🕒 Shifting SRT timestamps by $shift_by seconds..."
# SRT format: 00:00:05,440 --> 00:00:08,300
awk -v shift=$shift_by '
function time_to_seconds(time_str) {
split(time_str, parts, ",")
split(parts[1], time_parts, ":")
return time_parts[1]*3600 + time_parts[2]*60 + time_parts[3] + parts[2]/1000
}
function seconds_to_time(seconds) {
h = int(seconds/3600)
m = int((seconds-h*3600)/60)
s = int(seconds-h*3600-m*60)
ms = int((seconds - int(seconds))*1000)
return sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
}
{
if (match($0, /^([0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}) --> ([0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})$/)) {
            # Fixed-width timestamps: 12 chars each, separated by " --> " (5 chars)
            start_time = time_to_seconds(substr($0, RSTART, 12))
            end_time = time_to_seconds(substr($0, RSTART + 17, 12))
new_start = start_time + shift
new_end = end_time + shift
# Handle negative times (not allowed in SRT)
if (new_start < 0) new_start = 0
if (new_end < 0) new_end = 0
print seconds_to_time(new_start)" --> "seconds_to_time(new_end)
} else {
print $0
}
}' "$input_file" > "$output_file"
elif [ "$file_ext" = "vtt" ]; then
echo "🕒 Shifting VTT timestamps by $shift_by seconds..."
# VTT format: 00:00:05.440 --> 00:00:08.300
awk -v shift=$shift_by '
function time_to_seconds(time_str) {
split(time_str, parts, ".")
split(parts[1], time_parts, ":")
return time_parts[1]*3600 + time_parts[2]*60 + time_parts[3] + parts[2]/1000
}
function seconds_to_time(seconds) {
h = int(seconds/3600)
m = int((seconds-h*3600)/60)
s = int(seconds-h*3600-m*60)
ms = int((seconds - int(seconds))*1000)
return sprintf("%02d:%02d:%02d.%03d", h, m, s, ms)
}
{
if (match($0, /^([0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}) --> ([0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3})$/)) {
            # Fixed-width timestamps: 12 chars each, separated by " --> " (5 chars)
            start_time = time_to_seconds(substr($0, RSTART, 12))
            end_time = time_to_seconds(substr($0, RSTART + 17, 12))
new_start = start_time + shift
new_end = end_time + shift
# Handle negative times
if (new_start < 0) new_start = 0
if (new_end < 0) new_end = 0
print seconds_to_time(new_start)" --> "seconds_to_time(new_end)
} else {
print $0
}
}' "$input_file" > "$output_file"
elif [ "$file_ext" = "txt" ]; then
echo "🕒 Shifting timestamps in TXT by $shift_by seconds..."
# For text files, we need to handle timestamps in formats like [00:05.440]
awk -v shift=$shift_by '
function time_to_seconds(time_str) {
# Remove brackets
gsub(/[\[\]]/, "", time_str)
# Check format - either MM:SS.mmm or HH:MM:SS.mmm
if (split(time_str, parts, ":") == 2) {
# MM:SS.mmm format
mm = parts[1]
split(parts[2], sec_parts, ".")
ss = sec_parts[1]
ms = sec_parts[2] ? sec_parts[2] : 0
return mm*60 + ss + ms/1000
} else {
# HH:MM:SS.mmm format
hh = parts[1]
mm = parts[2]
split(parts[3], sec_parts, ".")
ss = sec_parts[1]
ms = sec_parts[2] ? sec_parts[2] : 0
return hh*3600 + mm*60 + ss + ms/1000
}
}
function seconds_to_time(seconds) {
h = int(seconds/3600)
m = int((seconds-h*3600)/60)
s = seconds-h*3600-m*60
# Format with up to 3 decimal places for milliseconds
if (h > 0) {
return sprintf("[%02d:%02d:%05.3f]", h, m, s)
} else {
return sprintf("[%02d:%05.3f]", m, s)
}
}
{
line = $0
# Match timestamps in the format [MM:SS.mmm] or [HH:MM:SS.mmm]
while (match(line, /\[[0-9]+:[0-9]+(\.[0-9]+)?\]/) || match(line, /\[[0-9]+:[0-9]+:[0-9]+(\.[0-9]+)?\]/)) {
time_str = substr(line, RSTART, RLENGTH)
time_sec = time_to_seconds(time_str)
new_time = time_sec + shift
if (new_time < 0) new_time = 0
new_time_str = seconds_to_time(new_time)
# Replace the timestamp
line = substr(line, 1, RSTART-1) new_time_str substr(line, RSTART+RLENGTH)
}
print line
}' "$input_file" > "$output_file"
else
echo "⚠️ Unsupported file extension for shifting: $file_ext"
cp "$input_file" "$output_file"
fi
}
# If we're only shifting timestamps, do that and exit
if [ "$SHIFT_ONLY" = "true" ]; then
if [ -z "$INPUT_PREFIX" ]; then
echo "❌ ERROR: When using SHIFT_ONLY=true, you must specify INPUT_PREFIX"
exit 1
fi
echo "🕒 Performing timestamp shifting by $SHIFT_SECONDS seconds..."
# Process each file type
for ext in txt srt vtt; do
# Check for regular transcript
if [ -f "${INPUT_PREFIX}.${ext}" ]; then
shift_timestamps "${INPUT_PREFIX}.${ext}" "${INPUT_PREFIX}_shifted.${ext}" $SHIFT_SECONDS
echo "✅ Created ${INPUT_PREFIX}_shifted.${ext}"
fi
# Check for final transcript
if [ -f "${INPUT_PREFIX}_final.${ext}" ]; then
shift_timestamps "${INPUT_PREFIX}_final.${ext}" "${INPUT_PREFIX}_final_shifted.${ext}" $SHIFT_SECONDS
echo "✅ Created ${INPUT_PREFIX}_final_shifted.${ext}"
fi
done
echo "✅ Timestamp shifting complete!"
exit 0
fi
# Generate random suffix
if command -v openssl > /dev/null 2>&1; then
RANDOM_SUFFIX=$(openssl rand -hex 4)
elif command -v md5sum > /dev/null 2>&1; then
RANDOM_SUFFIX=$(date +%s | md5sum | head -c 8)
elif command -v shasum > /dev/null 2>&1; then
RANDOM_SUFFIX=$(date +%s | shasum | head -c 8)
else
RANDOM_SUFFIX=$RANDOM$RANDOM
fi
AUDIO_FILE="${ORIGINAL_FILENAME}_${START_MIN}_${END_MIN}_${RANDOM_SUFFIX}.wav"
DIARIZATION_ENABLED=${DIARIZATION_ENABLED:-true}
HF_TOKEN=${HF_TOKEN:-""}
BUCKET_NAME=${BUCKET_NAME:-"whisper-video-transcripts"}
# Output file names with the same format
TRANSCRIPT_PREFIX="${ORIGINAL_FILENAME}_${START_MIN}_${END_MIN}_${RANDOM_SUFFIX}"
TRANSCRIPT_FILE="${TRANSCRIPT_PREFIX}.txt"
FINAL_TRANSCRIPT_FILE="${TRANSCRIPT_PREFIX}_final.txt"
SRT_FILE="${TRANSCRIPT_PREFIX}.srt"
VTT_FILE="${TRANSCRIPT_PREFIX}.vtt"
# === CONTROLLI PRELIMINARI ===
if [ ! -f "$KEY_FILE" ]; then
echo "❌ Chiave SSH non trovata in $KEY_FILE"
exit 1
fi
if [ ! -f "parallel_transcript.py" ]; then
echo "❌ File parallel_transcript.py non trovato"
exit 1
fi
if [ ! -f "$VIDEO_FILE" ]; then
echo "❌ File video $VIDEO_FILE non trovato"
exit 1
fi
# === CONVERTI MP4 IN WAV E APPLICA CROP PRIMA DELL'UPLOAD ===
echo "🎙️ Converto $VIDEO_FILE in $AUDIO_FILE con crop applicato..."
FFMPEG_CMD="ffmpeg -i \"$VIDEO_FILE\""
# Aggiungi parametri di crop se START_MIN o END_MIN sono impostati
if [ "$START_MIN" != "0" ] || [ "$END_MIN" != "0" ]; then
START_SEC=$((START_MIN * 60))
if [ "$END_MIN" != "0" ]; then
END_SEC=$((END_MIN * 60))
FFMPEG_CMD+=" -ss $START_SEC -to $END_SEC"
else
FFMPEG_CMD+=" -ss $START_SEC"
fi
echo "⏱️ Crop video da $START_MIN min a ${END_MIN:-fine} min"
fi
# Completa il comando ffmpeg con gli altri parametri necessari
FFMPEG_CMD+=" -ac 1 -ar 16000 -vn \"$AUDIO_FILE\" -y"
# Esegui il comando ffmpeg
eval $FFMPEG_CMD
echo "☁️ Controllo se l'audio è già presente su S3..."
AUDIO_UPLOADED=""
if ! aws s3 ls s3://$BUCKET_NAME/$AUDIO_FILE >/dev/null 2>&1; then
echo "⬆️ Carico $AUDIO_FILE su S3..."
aws s3 cp $AUDIO_FILE s3://$BUCKET_NAME/
AUDIO_UPLOADED="true"
else
echo "✅ Audio già presente su S3. Salto upload."
fi
# === CONTROLLA O CREA LA DEFAULT VPC ===
echo "🔍 Controllo default VPC nella regione $REGION..."
DEFAULT_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters Name=isDefault,Values=true --query "Vpcs[0].VpcId" --output text)
if [ "$DEFAULT_VPC_ID" = "None" ]; then
echo " Nessuna default VPC trovata. La creo..."
DEFAULT_VPC_ID=$(aws ec2 create-default-vpc --region $REGION --query "Vpc.VpcId" --output text)
echo "✅ Default VPC creata: $DEFAULT_VPC_ID"
else
echo "✅ Default VPC esistente: $DEFAULT_VPC_ID"
fi
# === CREA SECURITY GROUP SE NECESSARIO ===
aws ec2 describe-security-groups --group-names $SECURITY_GROUP --region $REGION &>/dev/null
if [ $? -ne 0 ]; then
echo " Creo security group $SECURITY_GROUP..."
aws ec2 create-security-group --group-name $SECURITY_GROUP --description "Whisper SG" --vpc-id $DEFAULT_VPC_ID --region $REGION
aws ec2 authorize-security-group-ingress --group-name $SECURITY_GROUP --protocol tcp --port 22 --cidr 0.0.0.0/0 --region $REGION
fi
# === AVVIA L'ISTANZA EC2 ===
echo "🚀 Avvio istanza EC2 GPU ($INSTANCE_TYPE con GPU)..."
INSTANCE_ID=$(aws ec2 run-instances \
--image-id $AMI_ID \
--instance-type $INSTANCE_TYPE \
--key-name $KEY_NAME \
--security-groups $SECURITY_GROUP \
--iam-instance-profile Name=WhisperS3Profile \
--block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50}}]' \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=whisper-runner}]" \
--region $REGION \
--query "Instances[0].InstanceId" \
--output text)
if [ -z "$INSTANCE_ID" ]; then
echo "❌ ERRORE: ID istanza non ottenuto. Verifica che l'AMI sia corretta per la regione $REGION."
exit 1
fi
echo "🆔 Istanza avviata: $INSTANCE_ID"
# === FUNZIONE DI CLEANUP IN CASO DI USCITA IMPROVVISA ===
function cleanup {
echo "🧨 Cleanup in corso..."
# Rimuove il file audio locale se esiste
if [ -f "$AUDIO_FILE" ]; then
echo "🧹 Rimuovo file audio locale $AUDIO_FILE..."
rm -f "$AUDIO_FILE"
echo "✅ File audio locale rimosso."
fi
# Rimuove l'audio da S3 se è stato caricato in questo script
if [ "$AUDIO_UPLOADED" = "true" ]; then
echo "🧹 Rimuovo $AUDIO_FILE da S3..."
aws s3 rm s3://$BUCKET_NAME/$AUDIO_FILE
echo "✅ File rimosso da S3."
fi
# Termina l'istanza EC2 se è stata avviata
if [ -n "$INSTANCE_ID" ]; then
echo "🧹 Termino l'istanza EC2 ($INSTANCE_ID)..."
aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION >/dev/null
# Aspetta la terminazione con timeout
echo "⏳ Aspetto la terminazione dell'istanza (max 60 secondi)..."
WAIT_TIMEOUT=60
WAIT_START=$(date +%s)
WAITING=true
while [ "$WAITING" = true ]; do
# Controlla lo stato dell'istanza
STATUS=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --region $REGION --query "Reservations[0].Instances[0].State.Name" --output text 2>/dev/null)
# Se lo stato è terminated o l'istanza non esiste più, esci dal ciclo
if [ "$STATUS" = "terminated" ] || [ "$STATUS" = "None" ]; then
echo "✅ Istanza terminata con successo."
WAITING=false
else
# Controlla se è scaduto il timeout
WAIT_ELAPSED=$(($(date +%s) - WAIT_START))
if [ $WAIT_ELAPSED -ge $WAIT_TIMEOUT ]; then
echo "⚠️ Timeout durante l'attesa della terminazione. L'istanza potrebbe essere ancora in fase di terminazione."
WAITING=false
else
# Aspetta un secondo prima di controllare di nuovo
sleep 2
echo -n "."
fi
fi
done
fi
}
# Esegui cleanup su qualsiasi uscita: normale, errore, o Ctrl+C
trap cleanup EXIT
echo "⏳ Attendo che sia pronta..."
aws ec2 wait instance-running --instance-ids $INSTANCE_ID --region $REGION
echo "🔐 Aspetto che l'istanza sia pronta per SSH..."
for i in {1..35}; do
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --region $REGION --query "Reservations[0].Instances[0].PublicIpAddress" --output text)
echo "🌍 IP pubblico: $PUBLIC_IP"
nc -zv $PUBLIC_IP 22 >/dev/null 2>&1
if [ $? -eq 0 ]; then
echo "✅ Porta 22 aperta, l'istanza è pronta!"
break
else
echo "⏳ Tentativo $i/35: porta 22 ancora chiusa. Riprovo tra 5s..."
sleep 5
fi
done
# === CARICA SCRIPT PYTHON SULL'ISTANZA ===
echo "📦 Carico script sulla macchina EC2..."
scp -o StrictHostKeyChecking=no -i $KEY_FILE parallel_transcript.py ubuntu@$PUBLIC_IP:/home/ubuntu/
scp -o StrictHostKeyChecking=no -i $KEY_FILE .env ubuntu@$PUBLIC_IP:/home/ubuntu/
scp -o StrictHostKeyChecking=no -i $KEY_FILE requirements.txt ubuntu@$PUBLIC_IP:/home/ubuntu/
echo "⚙️ Scarico audio da S3 ed eseguo trascrizione avanzata..."
ssh -t -i $KEY_FILE -o "SendEnv=TERM" ubuntu@$PUBLIC_IP "
# Prevent broken pipe errors
export PYTHONUNBUFFERED=1
set -e
cd /home/ubuntu
echo '⬇️ Download da S3...'
aws s3 cp s3://$BUCKET_NAME/$AUDIO_FILE /home/ubuntu/$AUDIO_FILE --region $REGION
echo '📦 File scaricato:'
ls -lh $AUDIO_FILE
echo '⚙️ Attivo ambiente virtuale...'
source whisper-env/bin/activate
# Installa PyDub se non presente
if ! pip list | grep -q pydub; then
echo '📦 Installo dipendenze mancanti...'
pip install pydub
fi
# Installa le dipendenze da requirements.txt
pip install -r requirements.txt
echo '🖥️ Informazioni GPU:'
nvidia-smi
echo 'Audio file: $AUDIO_FILE'
echo 'Token Hugging Face: $HF_TOKEN'
echo 'Diarization enabled: $DIARIZATION_ENABLED'
echo 'Numero di speaker: $NUM_SPEAKERS'
echo '✍️ Lancio trascrizione avanzata...'
CMD=\"python3 parallel_transcript.py --audio $AUDIO_FILE --token $HF_TOKEN \
--output-prefix $TRANSCRIPT_PREFIX\"
if [ \"$DIARIZATION_ENABLED\" = false ]; then
CMD+=\" --no-diarization\"
fi
if [ -n \"$NUM_SPEAKERS\" ]; then
CMD+=\" --num-speakers $NUM_SPEAKERS\"
echo '👥 Utilizzo numero di speaker specificato: $NUM_SPEAKERS'
fi
if [ \"$FIX_START\" = true ]; then
CMD+=\" --fix-start\"
echo '⏱️ Aggiunta correzione per i primi secondi'
fi
eval \$CMD
"
# === SCARICA I FILE ===
echo "⬇️ Scarico i file di output..."
scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}_final.txt . || echo "⚠️ Impossibile scaricare _final.txt (potrebbe non essere stato generato)"
scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.txt . || echo "⚠️ Impossibile scaricare .txt"
scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.srt . || echo "⚠️ Impossibile scaricare .srt"
scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.vtt . || echo "⚠️ Impossibile scaricare .vtt"
# Scarica anche i file JSON con dati aggiuntivi per debugging
scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.txt.words.json . 2>/dev/null || true
scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}_final.txt.diarization.json . 2>/dev/null || true
scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}_final.txt.overlaps.json . 2>/dev/null || true
echo "📄 File scaricati:"
ls -lh ${TRANSCRIPT_PREFIX}* 2>/dev/null || echo "⚠️ Nessun file trovato con il prefisso specificato"