commit b6d47982c7087f7d2f2d2df6822a225d6468ed1a
Author: Lorenzo Iovino
Date:   Fri May 23 14:45:22 2025 +0200

    first commit

diff --git a/.env.sample b/.env.sample
new file mode 100644
index 0000000..dac2cb6
--- /dev/null
+++ b/.env.sample
@@ -0,0 +1,28 @@
+# Whisper Parallel Configuration
+# SSH Key Configuration
+KEY_NAME=whisper-key
+KEY_FILE=$HOME/.ssh/whisper-key.pem
+SECURITY_GROUP=whisper-sg
+
+# AWS Instance Configuration
+INSTANCE_TYPE=g4dn.12xlarge
+REGION=eu-south-1
+AMI_ID=ami-059603706d3734615
+
+# Video/Audio Processing
+VIDEO_FILE=mio_video.mp4
+START_MIN=0
+END_MIN=0
+SHIFT_SECONDS=0
+SHIFT_ONLY=false
+INPUT_PREFIX=
+
+# GPU Configuration
+GPU_COUNT=1
+
+# Processing Options
+NUM_SPEAKERS=
+FIX_START=true
+
+# API Tokens
+HF_TOKEN=your_huggingface_token_here
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..2bad62d
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,53 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Environment variables
+.env
+.env.local
+.env.development.local
+.env.test.local
+.env.production.local
+
+# Processed files
+*.wav
+*.mp3
+*.mp4
+*.srt
+*.vtt
+transcript*.*
+!transcription-runner/mio_video.mp4
+
+# Logs
+*.log
+
+# OS specific
+.DS_Store
+._.DS_Store
+*.swp
+*.swo
+
+# IDE
+.idea/
+.vscode/
+*.sublime-project
+*.sublime-workspace
+.ropeproject/
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..229b8b8
--- /dev/null
+++ b/README.md
@@ -0,0 +1,215 @@
+# Transcription Runner with Multi-Chunk Processing and Parallel GPUs
+
+This package lets you:
+- Create a GPU EC2 instance on AWS (g4dn.12xlarge)
+- Split an `.mp4` video file into multiple chunks and transcribe them
+- Automatically generate a transcript plus speaker diarization
+- Download the output files
+- Terminate the instance to save costs
+- Apply a time shift to transcript timestamps
+- Configure every option easily through a `.env` file
+
+---
+
+## ✅ Prerequisites
+
+### 1. **Install the AWS CLI**
+If you haven't installed the AWS CLI yet:
+- On macOS with Homebrew:
+```bash
+brew install awscli
+```
+- On Linux (Debian/Ubuntu):
+```bash
+sudo apt update
+sudo apt install awscli
+```
+
+### 2. **Configure the AWS CLI**
+Once it is installed, run:
+```bash
+aws configure
+```
+Enter:
+- Access key ID
+- Secret access key
+- Default region (e.g. `eu-south-1`)
+- Output format: `json`
+
+### 3. **Create an SSH key pair for EC2**
+In a terminal, run:
+```bash
+aws ec2 create-key-pair --key-name whisper-key --query 'KeyMaterial' --output text > ~/.ssh/whisper-key.pem
+chmod 400 ~/.ssh/whisper-key.pem
+```
+
+### 4. **Install netcat**
+- On macOS with Homebrew:
+```bash
+brew install netcat
+```
+- On Linux (Debian/Ubuntu):
+```bash
+sudo apt install netcat
+```
+
+### 5. **Register on Hugging Face and get a token**
+Go to https://huggingface.co/settings/tokens, create a token with read access to models, and copy its value.
+
+### 6. **IAM role "WhisperS3Profile" with S3 access**
+Make sure your AWS account has an IAM instance profile named "WhisperS3Profile" with S3 access permissions.
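+
+If the profile doesn't exist yet, a minimal sketch along these lines can create it (the role name `WhisperS3Role`, the trust policy, and the broad `AmazonS3FullAccess` managed policy are assumptions — scope the permissions down to your bucket for real use):
+```bash
+# Role that EC2 instances can assume
+aws iam create-role --role-name WhisperS3Role \
+  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
+aws iam attach-role-policy --role-name WhisperS3Role \
+  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
+# Instance profile with the name the scripts expect
+aws iam create-instance-profile --instance-profile-name WhisperS3Profile
+aws iam add-role-to-instance-profile --instance-profile-name WhisperS3Profile --role-name WhisperS3Role
+```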
+
+### 7. **Configure the .env file**
+Copy `.env.sample` to `.env` and adjust the values to your needs:
+```bash
+cp .env.sample .env
+nano .env  # or use your preferred editor
+```
+
+---
+
+## ▶️ How to use
+
+### Basic method
+```bash
+chmod +x whisper_parallel.sh
+./whisper_parallel.sh
+```
+
+### Configuration via the .env file
+Edit the `.env` file with your parameters, then run:
+```bash
+./whisper_parallel.sh
+```
+
+### Passing parameters as environment variables (overrides .env)
+```bash
+VIDEO_FILE="mia_intervista.mp4" START_MIN=5 END_MIN=15 GPU_COUNT=4 ./whisper_parallel.sh
+```
+
+### Available parameters
+These parameters can be set in the `.env` file or as environment variables:
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| VIDEO_FILE | The video/audio file to transcribe | mio_video.mp4 |
+| START_MIN | Start minute for cropping | 0 |
+| END_MIN | End minute for cropping | 0 (to the end) |
+| SHIFT_SECONDS | Shift timestamps by X seconds | 0 |
+| GPU_COUNT | Number of chunks to split the audio into | 1 |
+| NUM_SPEAKERS | Number of speakers, if known in advance | (auto) |
+| DIARIZATION_ENABLED | Enable/disable speaker recognition | true |
+| INSTANCE_TYPE | EC2 instance type | g4dn.12xlarge |
+| REGION | AWS region | eu-south-1 |
+| BUCKET_NAME | S3 bucket name | whisper-video-transcripts |
+| HF_TOKEN | Hugging Face token for Pyannote | (required) |
+| FIX_START | Prepends silence to improve capture of the first seconds | true |
+| SHIFT_ONLY | Only apply the timestamp shift to existing files | false |
+| INPUT_PREFIX | Prefix of the input files when using SHIFT_ONLY | "" |
+| WHISPER_MODEL | Whisper model to use | large |
+
+---
+
+## 📦 Output
+
+When the run completes you will find these files in the current directory:
+- `{file-name}_{start}_{end}_{random}.txt` → raw transcript
+- `{file-name}_{start}_{end}_{random}_final.txt` → transcript with speakers
+- `{file-name}_{start}_{end}_{random}.srt` → SRT subtitle file
+- `{file-name}_{start}_{end}_{random}.vtt` → VTT file for web subtitles
+
+---
+
+## 🚀 Multi-chunk mode
+
+The current version of the script automatically splits the audio into several parts and processes them in parallel on the GPUs. This:
+1. Improves memory usage for long files
+2. Speeds up transcription of long recordings
+3. Makes better use of the available hardware
+
+### Performance tips
+
+1. **Instance choice**: g4dn.xlarge is enough for short files; use g4dn.12xlarge for long files with multiple GPUs
+2. **Number of chunks**: for long files, splitting into more chunks helps manage memory (see the ffmpeg sketch below)
+3. **Model**: for very long files, consider the "medium" or "base" model instead of "large"
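+
+For intuition, the chunk split itself amounts to cutting the audio at fixed intervals, which can be sketched with ffmpeg's segment muxer (the 600-second chunk length here is an arbitrary illustration, not a value taken from the scripts):
+```bash
+# Cut audio.wav into sequential 10-minute chunks: chunk_000.wav, chunk_001.wav, ...
+ffmpeg -i audio.wav -f segment -segment_time 600 -c copy chunk_%03d.wav
+```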
+
+---
+
+## 🧪 Usage examples
+
+### Configuration via .env
+Edit the `.env` file with your parameters, then run:
+```bash
+./whisper_parallel.sh
+```
+
+### Transcribe a whole file
+```bash
+VIDEO_FILE="conferenza.mp4" ./whisper_parallel.sh
+```
+
+### Transcribe a specific portion
+```bash
+VIDEO_FILE="lezione.mp4" START_MIN=10 END_MIN=20 ./whisper_parallel.sh
+```
+
+### Split a long file into more chunks
+```bash
+VIDEO_FILE="intervista.mp4" GPU_COUNT=6 ./whisper_parallel.sh
+```
+
+### Disable diarization (transcription only)
+```bash
+VIDEO_FILE="audio.mp4" DIARIZATION_ENABLED=false ./whisper_parallel.sh
+```
+
+### Specify the number of speakers
+```bash
+VIDEO_FILE="intervista.mp4" NUM_SPEAKERS=2 ./whisper_parallel.sh
+```
+
+### Shift the timestamps of an existing transcription
+```bash
+SHIFT_ONLY=true SHIFT_SECONDS=30 INPUT_PREFIX="mia_trascrizione" ./whisper_parallel.sh
+```
+
+---
+
+## 🔄 Advanced features
+
+### Timestamp shifting
+The script can shift the timestamps in transcription files, which is useful when:
+- You cut an initial portion of the video
+- You need to re-sync subtitles with an edited video
+- You work with segments extracted from a longer video
+
+An example of the files a shift run produces is shown after the list below.
+
+### File types supported for shifting
+- `.srt` (SubRip Text)
+- `.vtt` (WebVTT)
+- `.txt` (transcript with timestamps)
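+
+For each file it finds, a shift run writes a sibling with a `_shifted` suffix. The `SHIFT_ONLY` invocation shown earlier would, for example, produce:
+```
+mia_trascrizione.srt        ->  mia_trascrizione_shifted.srt
+mia_trascrizione_final.txt  ->  mia_trascrizione_final_shifted.txt
+```
+with every cue moved 30 seconds later (e.g. `00:00:05,440 --> 00:00:08,300` becomes `00:00:35,440 --> 00:00:38,300`).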
+
+---
+
+## ☁️ Notes
+
+- The EC2 instance is **terminated automatically** when the run finishes.
+- Audio files are removed from the S3 bucket after download.
+- Output file names include a random suffix to avoid collisions.
+- If the script is interrupted, AWS resources are still cleaned up.
+- Requires the companion file `parallel_transcript.py` for the processing on EC2.
+
+## Technical details
+
+- Uses FFmpeg for audio extraction
+- Automatically creates the AWS security group and uses the default VPC when available
+- Runs an automatic cleanup when the script terminates
+- Supports high-quality diarization via Pyannote/WhisperX
+- Provides timestamp shifting for all output formats
+
+## Security
+
+- The script creates a security group that allows SSH access from any IP (0.0.0.0/0)
+- AWS credentials with EC2 and S3 permissions are required
+- SSH keys are used for secure access to the instance
+- The `.env` file contains sensitive data and should not be committed to version control (it is already listed in `.gitignore`)
diff --git a/parallel_transcript.py b/parallel_transcript.py
new file mode 100644
index 0000000..5a713e0
--- /dev/null
+++ b/parallel_transcript.py
@@ -0,0 +1,398 @@
+import os
+import argparse
+import inspect
+import whisper
+import torch
+import time
+import threading
+import json
+from pyannote.audio import Pipeline
+from datetime import timedelta
+from pydub import AudioSegment
+from dotenv import load_dotenv
+
+# Load environment variables from .env file
+load_dotenv()
+
+spinner_done = False
+
+def start_spinner():
+    """Print a dot every second until stop_spinner() is called."""
+    def spin():
+        while not spinner_done:
+            print(".", end="", flush=True)
+            time.sleep(1)
+    global spinner_done
+    spinner_done = False
+    t = threading.Thread(target=spin)
+    t.start()
+    return t
+
+def stop_spinner(thread):
+    global spinner_done
+    spinner_done = True
+    thread.join()
+    print("")
+
+def extend_audio_beginning(input_audio, output_audio, silence_duration=2000):
+    """Prepend a short silence (in ms) so the first words are not clipped.
+
+    Note: this shifts all downstream timestamps forward by the same amount;
+    SHIFT_SECONDS can be used to compensate."""
+    print(f"🔄 Adding {silence_duration/1000} seconds of silence at the start of the audio...")
+    audio = AudioSegment.from_file(input_audio)
+    silence = AudioSegment.silent(duration=silence_duration)  # 2 seconds of silence by default
+    extended_audio = silence + audio
+    extended_audio.export(output_audio, format="wav")
+    print(f"✅ Extended audio saved as {output_audio}")
+    return output_audio
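+
+# Example (hypothetical standalone use): pad two seconds of silence, then, if
+# exact alignment with the original video matters, compensate in the outputs
+# with SHIFT_SECONDS=-2:
+#   padded = extend_audio_beginning("interview.wav", "padded_interview.wav", 2000)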
+
+def transcribe_audio(audio_path, model_size="large"):
+    """Transcribe the audio with Whisper using tuned settings."""
+    print(f"🔹 Transcribing with Whisper ({model_size})...")
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"⚙️ Using device: {device.upper()}")
+
+    model = whisper.load_model(model_size).to(device)
+
+    # Advanced settings to improve speech detection
+    options = {
+        "language": "it",
+        "condition_on_previous_text": True,  # Improves coherence across segments
+        "suppress_tokens": [-1],  # Suppresses silence tokens
+        "initial_prompt": "Trascrizione di una conversazione tra tre persone."  # Gives context (Italian, matching the audio language)
+    }
+
+    # Enable word-level timestamps if this Whisper version supports them
+    if "word_timestamps" in inspect.signature(whisper.transcribe).parameters:
+        options["word_timestamps"] = True
+        print("✅ Using word-level timestamps")
+    else:
+        print("⚠️ This Whisper version does not support word-level timestamps")
+
+    spinner = start_spinner()
+    start = time.time()
+    result = model.transcribe(audio_path, **options)
+    stop_spinner(spinner)
+
+    duration = time.time() - start
+    print(f"✅ Transcription completed in {round(duration, 2)} seconds")
+
+    # Also save the word timestamps for post-processing
+    with open(f"{TRANSCRIPT_FILE}.words.json", "w", encoding="utf-8") as f:
+        json.dump(result, f, ensure_ascii=False, indent=2)
+
+    with open(TRANSCRIPT_FILE, "w", encoding="utf-8") as f:
+        f.write(result["text"])
+
+    return result["segments"]
+
+def diarize_audio(audio_path, hf_token, num_speakers=None):
+    """Diarize the audio, optionally with a known number of speakers."""
+    print("🔹 Speaker recognition (v3.1) with Pyannote...")
+
+    # Load the model without trying to tweak its parameters
+    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
+
+    spinner = start_spinner()
+    start = time.time()
+
+    # Use the given number of speakers when specified
+    if num_speakers is not None:
+        print(f"ℹ️ Using {num_speakers} speakers as specified")
+        diarization = pipeline(audio_path, num_speakers=num_speakers)
+    else:
+        diarization = pipeline(audio_path)
+
+    stop_spinner(spinner)
+
+    duration = time.time() - start
+    print(f"✅ Speakers identified in {round(duration, 2)} seconds")
+
+    # Inspect the identified speakers
+    speakers = set()
+    for segment, _, speaker in diarization.itertracks(yield_label=True):
+        speakers.add(speaker)
+
+    print(f"👥 Identified {len(speakers)} speakers: {', '.join(sorted(speakers))}")
+
+    # Save the raw diarization for inspection
+    with open(f"{OUTPUT_FILE}.diarization.json", "w", encoding="utf-8") as f:
+        segments = []
+        for segment, _, speaker in diarization.itertracks(yield_label=True):
+            segments.append({
+                "start": segment.start,
+                "end": segment.end,
+                "speaker": speaker
+            })
+        json.dump(segments, f, indent=2)
+
+    return diarization
+
+def format_time(seconds, srt=False):
+    """Format a time value as H:MM:SS.mmm (or HH:MM:SS,mmm for SRT)."""
+    td = timedelta(seconds=float(seconds))
+    hours, remainder = divmod(td.seconds, 3600)
+    minutes, seconds = divmod(remainder, 60)
+    milliseconds = round(td.microseconds / 1000)
+
+    if srt:
+        return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
+    else:
+        return f"{hours:d}:{minutes:02d}:{seconds:02d}.{milliseconds:03d}"
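+
+# Worked examples of the formatter above:
+#   format_time(3661.5)           -> "1:01:01.500"
+#   format_time(3661.5, srt=True) -> "01:01:01,500"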
+
+def find_overlapping_speech(diarization, threshold=0.5):
+    """Identify segments where speech overlaps between speakers."""
+    overlap_segments = []
+    speaker_segments = {}
+
+    # Group the segments by speaker
+    for segment, _, speaker in diarization.itertracks(yield_label=True):
+        if speaker not in speaker_segments:
+            speaker_segments[speaker] = []
+        speaker_segments[speaker].append((segment.start, segment.end))
+
+    # Find overlaps between different speakers
+    speakers = list(speaker_segments.keys())
+    for i in range(len(speakers)):
+        for j in range(i+1, len(speakers)):
+            speaker1 = speakers[i]
+            speaker2 = speakers[j]
+
+            for seg1_start, seg1_end in speaker_segments[speaker1]:
+                for seg2_start, seg2_end in speaker_segments[speaker2]:
+                    # Check whether the two segments overlap
+                    if seg1_start < seg2_end and seg2_start < seg1_end:
+                        overlap_start = max(seg1_start, seg2_start)
+                        overlap_end = min(seg1_end, seg2_end)
+                        overlap_duration = overlap_end - overlap_start
+
+                        if overlap_duration >= threshold:
+                            overlap_segments.append({
+                                "start": overlap_start,
+                                "end": overlap_end,
+                                "speakers": [speaker1, speaker2],
+                                "duration": overlap_duration
+                            })
+
+    # Merge nearby overlaps
+    if overlap_segments:
+        overlap_segments.sort(key=lambda x: x["start"])
+        merged = [overlap_segments[0]]
+
+        for current in overlap_segments[1:]:
+            previous = merged[-1]
+            if current["start"] - previous["end"] < 0.5:  # Less than half a second apart
+                # Merge the intervals
+                previous["end"] = max(previous["end"], current["end"])
+                previous["speakers"] = list(set(previous["speakers"] + current["speakers"]))
+                previous["duration"] = previous["end"] - previous["start"]
+            else:
+                merged.append(current)
+
+        overlap_segments = merged
+
+    return overlap_segments
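+
+# Worked example of the merge step above: overlaps (10.0-11.2) and (11.5-12.0)
+# are 0.3 s apart (< 0.5), so they merge into a single (10.0-12.0) interval
+# whose speaker list is the union of the two.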
"end": end_time, + "speaker": speaker, + "speaker_text": speaker_text, + "text": text + } + output_segments.append(output_segment) + + # Ora formatta e salva l'output finale + output = [] + srt = [] + vtt = ["WEBVTT\n"] + + for i, segment in enumerate(output_segments, 1): + start_time = segment["start"] + end_time = segment["end"] + speaker_text = segment["speaker_text"] + text = segment["text"] + + formatted_text = f"{speaker_text}({format_time(start_time)} - {format_time(end_time)}): {text}" + srt_text = f"{counter}\n{format_time(start_time, True)} --> {format_time(end_time, True)}\n{speaker_text}{text}" + vtt_text = f"{format_time(start_time)} --> {format_time(end_time)}\n{speaker_text}{text}" + + output.append(formatted_text) + srt.append(srt_text) + vtt.append(vtt_text) + counter += 1 + + with open(OUTPUT_FILE, "w", encoding="utf-8") as f: + f.write("\n".join(output)) + with open(SRT_FILE, "w", encoding="utf-8") as f: + f.write("\n\n".join(srt)) + with open(VTT_FILE, "w", encoding="utf-8") as f: + f.write("\n\n".join(vtt)) + + print("✅ Output finale salvato:", OUTPUT_FILE, SRT_FILE, VTT_FILE) + +def parse_timestamp(time_str): + """Convert a timestamp string to seconds""" + # Handle both SRT (00:00:00,000) and standard format (00:00:00.000) + time_str = time_str.replace(',', '.') + + hours, minutes, seconds = time_str.split(':') + hours = int(hours) + minutes = int(minutes) + seconds = float(seconds) + + total_seconds = hours * 3600 + minutes * 60 + seconds + return total_seconds + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Trascrizione + Speaker Diarization avanzata") + parser.add_argument("--audio", help="File audio WAV", required=True) + parser.add_argument("--token", help="Token Hugging Face per Pyannote") + parser.add_argument("--model", default="large", help="Modello Whisper (tiny, base, medium, large)") + parser.add_argument("--no-diarization", action="store_true", help="Disabilita il riconoscimento speaker") + parser.add_argument("--output-prefix", default="transcript", help="Prefisso per i file di output") + parser.add_argument("--num-speakers", type=int, default=None, help="Numero di speaker se conosciuto in anticipo") + parser.add_argument("--fix-start", action="store_true", help="Aggiungi silenzio all'inizio per catturare meglio i primi secondi") + parser.add_argument("--min-segment", type=float, default=1.0, help="Lunghezza minima dei segmenti in secondi") + + args = parser.parse_args() + + # Use Hugging Face token from environment variable if not provided via argument + hf_token = args.token or os.getenv("HF_TOKEN") + if not hf_token: + raise ValueError("Token Hugging Face non fornito. 
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Advanced transcription + speaker diarization")
+    parser.add_argument("--audio", help="WAV audio file", required=True)
+    parser.add_argument("--token", help="Hugging Face token for Pyannote")
+    parser.add_argument("--model", default="large", help="Whisper model (tiny, base, medium, large)")
+    parser.add_argument("--no-diarization", action="store_true", help="Disable speaker recognition")
+    parser.add_argument("--output-prefix", default="transcript", help="Prefix for the output files")
+    parser.add_argument("--num-speakers", type=int, default=None, help="Number of speakers, if known in advance")
+    parser.add_argument("--fix-start", action="store_true", help="Prepend silence to better capture the first seconds")
+    parser.add_argument("--min-segment", type=float, default=1.0, help="Minimum segment length in seconds")
+
+    args = parser.parse_args()
+
+    # Use the Hugging Face token from the environment if not given as an argument
+    hf_token = args.token or os.getenv("HF_TOKEN")
+    if not hf_token:
+        raise ValueError("Hugging Face token not provided. Pass it with --token or set HF_TOKEN in the .env file")
+
+    # Use the model from the environment variable if available
+    model_size = os.getenv("WHISPER_MODEL", args.model)
+
+    # Use the number of speakers from the environment if not given as an argument
+    num_speakers = args.num_speakers
+    if num_speakers is None and os.getenv("NUM_SPEAKERS"):
+        try:
+            num_speakers = int(os.getenv("NUM_SPEAKERS"))
+        except ValueError:
+            pass
+
+    # Use fix-start from the environment if the flag was not given
+    fix_start = args.fix_start
+    if not fix_start and os.getenv("FIX_START", "").lower() == "true":
+        fix_start = True
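+
+    # Precedence, as implemented above: WHISPER_MODEL from the environment
+    # overrides --model, --num-speakers overrides NUM_SPEAKERS, and fix-start
+    # is enabled by either the --fix-start flag or FIX_START=true.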
+
+    # Output file names
+    output_prefix = os.getenv("OUTPUT_PREFIX", args.output_prefix)
+    TRANSCRIPT_FILE = f"{output_prefix}.txt"
+    OUTPUT_FILE = f"{output_prefix}_final.txt"
+    SRT_FILE = f"{output_prefix}.srt"
+    VTT_FILE = f"{output_prefix}.vtt"
+
+    if not os.path.exists(args.audio):
+        raise ValueError(f"Audio file {args.audio} not found")
+
+    # Prepend silence if requested
+    input_audio = args.audio
+    if fix_start:
+        extended_audio = "extended_" + os.path.basename(args.audio)
+        input_audio = extend_audio_beginning(args.audio, extended_audio)
+
+    # Transcribe the audio
+    segments = transcribe_audio(input_audio, model_size)
+
+    # Run diarization and match the transcript to speakers
+    if not args.no_diarization:
+        diarization = diarize_audio(input_audio, hf_token, num_speakers)
+        match_transcript_to_speakers(segments, diarization, args.min_segment)
+    else:
+        print("🛑 Diarization disabled. Saving the transcription only.")
+        with open(OUTPUT_FILE, "w", encoding="utf-8") as f_out, open(SRT_FILE, "w", encoding="utf-8") as f_srt, open(VTT_FILE, "w", encoding="utf-8") as f_vtt:
+            f_vtt.write("WEBVTT\n\n")
+            for i, s in enumerate(segments, 1):
+                start = format_time(s['start'])
+                end = format_time(s['end'])
+                f_out.write(f"({start} - {end}): {s['text'].strip()}\n")
+                f_srt.write(f"{i}\n{format_time(s['start'], True)} --> {format_time(s['end'], True)}\n{s['text'].strip()}\n\n")
+                f_vtt.write(f"{start} --> {end}\n{s['text'].strip()}\n\n")
+        print(f"✅ Output saved without diarization: {OUTPUT_FILE}, {SRT_FILE}, {VTT_FILE}")
+
+    # Remove the extended audio file if one was created
+    if fix_start and os.path.exists(extended_audio):
+        os.remove(extended_audio)
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..48b5481
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,10 @@
+openai-whisper
+torch>=1.13.1
+numpy>=1.20.0
+pyannote.audio>=3.0.0
+python-dotenv>=0.19.0
+pydub>=0.25.1
+tqdm>=4.64.0
+matplotlib>=3.5.0
+scikit-learn>=1.0.0
+soundfile>=0.10.3
\ No newline at end of file
diff --git a/setup.sh b/setup.sh
new file mode 100755
index 0000000..ef242fd
--- /dev/null
+++ b/setup.sh
@@ -0,0 +1,113 @@
+#!/bin/bash
+
+set -e  # Exit on error
+
+echo "🚀 Setting up Transcription Runner..."
+
+# Check if Python is installed
+if ! command -v python3 &> /dev/null; then
+    echo "❌ Python 3 not found. Please install Python 3 before proceeding."
+    exit 1
+fi
+
+# Check if pip is installed
+if ! command -v pip3 &> /dev/null; then
+    echo "❌ pip3 not found. Please install pip3 before proceeding."
+    exit 1
+fi
+
+# Check if AWS CLI is installed
+if ! command -v aws &> /dev/null; then
+    echo "⚠️ AWS CLI not found. Installing..."
+    if [[ "$OSTYPE" == "darwin"* ]]; then
+        # macOS
+        if command -v brew &> /dev/null; then
+            brew install awscli
+        else
+            echo "❌ Homebrew not found. Please install AWS CLI manually."
+            exit 1
+        fi
+    elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
+        # Linux
+        if command -v apt-get &> /dev/null; then
+            sudo apt-get update
+            sudo apt-get install -y awscli
+        elif command -v yum &> /dev/null; then
+            sudo yum install -y awscli
+        else
+            echo "❌ Unable to detect package manager. Please install AWS CLI manually."
+            exit 1
+        fi
+    else
+        echo "❌ Unsupported OS. Please install AWS CLI manually."
+        exit 1
+    fi
+fi
+
+# Check if netcat is installed
+if ! command -v nc &> /dev/null; then
+    echo "⚠️ netcat not found. Installing..."
+    if [[ "$OSTYPE" == "darwin"* ]]; then
+        # macOS
+        if command -v brew &> /dev/null; then
+            brew install netcat
+        else
+            echo "❌ Homebrew not found. Please install netcat manually."
+            exit 1
+        fi
+    elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
+        # Linux
+        if command -v apt-get &> /dev/null; then
+            sudo apt-get update
+            sudo apt-get install -y netcat
+        elif command -v yum &> /dev/null; then
+            sudo yum install -y netcat
+        else
+            echo "❌ Unable to detect package manager. Please install netcat manually."
+            exit 1
+        fi
+    else
+        echo "❌ Unsupported OS. Please install netcat manually."
+        exit 1
+    fi
+fi
+
+# Create virtual environment
+echo "🔨 Creating Python virtual environment..."
+python3 -m venv venv
+source venv/bin/activate
+
+# Install dependencies
+echo "📦 Installing dependencies..."
+pip install -r requirements.txt
+
+# Create .env file if it doesn't exist
+if [ ! -f .env ]; then
+    echo "📝 Creating .env file from template..."
+    cp .env.sample .env
+    echo "ℹ️ Please edit the .env file with your configuration before running the scripts."
+fi
+
+# Make shell scripts executable
+echo "🔑 Making scripts executable..."
+chmod +x whisper_parallel.sh
+
+# Set up AWS credentials if needed
+if ! aws configure list &> /dev/null; then
+    echo "⚠️ AWS credentials not configured. Setting up..."
+    echo "Please enter your AWS credentials:"
+    aws configure
+fi
+
+# Check if the AWS key pair exists
+KEY_NAME=$(grep '^KEY_NAME=' .env | cut -d '=' -f2)
+KEY_NAME=${KEY_NAME:-whisper-key}
+if ! aws ec2 describe-key-pairs --key-names "$KEY_NAME" &> /dev/null; then
+    echo "🔑 Creating EC2 key pair..."
+    mkdir -p ~/.ssh
+    aws ec2 create-key-pair --key-name "$KEY_NAME" --query 'KeyMaterial' --output text > ~/.ssh/"$KEY_NAME".pem
+    chmod 400 ~/.ssh/"$KEY_NAME".pem
+    echo "✅ Key pair created: ~/.ssh/$KEY_NAME.pem"
+fi
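+
+# Optional sanity check (illustrative): confirm AWS credentials and imports work
+#   aws sts get-caller-identity
+#   ./venv/bin/python -c "import whisper, torch, pyannote.audio"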
+
+echo "✅ Setup complete! You can now run ./whisper_parallel.sh"
+echo "ℹ️ Remember to edit the .env file with your configuration."
\ No newline at end of file
diff --git a/test_fix.py b/test_fix.py
new file mode 100644
index 0000000..5834461
--- /dev/null
+++ b/test_fix.py
@@ -0,0 +1,78 @@
+import json
+import re
+
+def split_long_segments(segments, max_chars=150):
+    """Split segments that are too long into smaller chunks."""
+    new_segments = []
+    for segment in segments:
+        if "text" in segment and len(segment["text"]) > max_chars:
+            # Split text at sentence boundaries or by character count
+            sentences = re.split(r'(?<=[.!?]) +', segment["text"])
+            current_text = ""
+            start_time = segment["start"]
+
+            for sentence in sentences:
+                if len(current_text) + len(sentence) > max_chars and current_text:
+                    # Calculate proportional time based on text length
+                    portion = len(current_text) / len(segment["text"])
+                    mid_time = segment["start"] + portion * (segment["end"] - segment["start"])
+
+                    new_segments.append({
+                        "start": start_time,
+                        "end": mid_time,
+                        "text": current_text.strip(),
+                        "speaker": segment.get("speaker", ""),
+                        "speaker_text": segment.get("speaker_text", f"[{segment.get('speaker', '')}] ")  # Fixed line
+                    })
+
+                    start_time = mid_time
+                    current_text = sentence
+                else:
+                    current_text += " " + sentence if current_text else sentence
+
+            # Add the last part
+            if current_text:
+                new_segments.append({
+                    "start": start_time,
+                    "end": segment["end"],
+                    "text": current_text.strip(),
+                    "speaker": segment.get("speaker", ""),
+                    "speaker_text": segment.get("speaker_text", f"[{segment.get('speaker', '')}] ")  # Fixed line
+                })
+        else:
+            new_segments.append(segment)
+
+    return new_segments
+
+# Create mock test data
+test_segments = [
+    {
+        "start": 0.0,
+        "end": 10.0,
+        "speaker": "SPEAKER_00",
+        "speaker_text": "[SPEAKER_00] ",
+        "text": "This is a very long text that exceeds the maximum character limit. It should be split into multiple segments. This is another sentence to make sure we have enough text to split. And one more sentence to be really sure."
+    }
+]
+
+# Run the split_long_segments function
+print("Testing split_long_segments...")
+split_segments = split_long_segments(test_segments, max_chars=50)
+print(f"Number of segments after splitting: {len(split_segments)}")
+
+# Verify that all segments have the speaker_text field
+all_have_speaker_text = all("speaker_text" in segment for segment in split_segments)
+print(f"All segments have speaker_text field: {all_have_speaker_text}")
+
+# Dump the result to inspect
+print("\nSplit segments:")
+print(json.dumps(split_segments, indent=2))
+
+# Check if we can access speaker_text without error
+try:
+    for segment in split_segments:
+        speaker_text = segment["speaker_text"]
+    print("\n✅ Successfully accessed speaker_text on all segments")
+except KeyError as e:
+    print(f"\n❌ KeyError when accessing: {e}")
diff --git a/whisper_parallel.sh b/whisper_parallel.sh
new file mode 100755
index 0000000..48b7db6
--- /dev/null
+++ b/whisper_parallel.sh
@@ -0,0 +1,458 @@
+#!/bin/bash
+
+# Load environment variables from the .env file. Values already present in the
+# environment take precedence, so command-line overrides keep working.
+if [ -f .env ]; then
+    echo "Loading environment variables from .env file..."
+    while IFS='=' read -r key value; do
+        # Skip blanks/comments and anything that is not a valid variable name
+        [[ "$key" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || continue
+        # Do not override variables that are already set
+        if [ -z "${!key+x}" ]; then
+            eval "export $key=\"$value\""
+        fi
+    done < .env
+else
+    echo "Warning: .env file not found. Using default values."
+fi
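+
+# Example: with the loader above, a variable exported on the command line
+# takes precedence over the value in .env:
+#   GPU_COUNT=4 VIDEO_FILE=talk.mp4 ./whisper_parallel.sh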
+
+# === CONFIGURATION ===
+# These defaults are used when a value is not set in the .env file
+KEY_NAME=${KEY_NAME:-"whisper-key"}
+KEY_FILE=${KEY_FILE:-"$HOME/.ssh/${KEY_NAME}.pem"}
+SECURITY_GROUP=${SECURITY_GROUP:-"whisper-sg"}
+INSTANCE_TYPE=${INSTANCE_TYPE:-"g4dn.12xlarge"}  # 4x T4 GPUs; pick a smaller type to stay within vCPU limits
+REGION=${REGION:-"eu-south-1"}
+AMI_ID=${AMI_ID:-"ami-059603706d3734615"}
+VIDEO_FILE=${VIDEO_FILE:-"mio_video.mp4"}
+ORIGINAL_FILENAME=$(basename "$VIDEO_FILE" | cut -d. -f1)
+START_MIN=${START_MIN:-0}            # Default value if not set
+END_MIN=${END_MIN:-0}                # Default value if not set
+SHIFT_SECONDS=${SHIFT_SECONDS:-0}    # Shift timestamps by this many seconds
+SHIFT_ONLY=${SHIFT_ONLY:-false}      # Set to true to only perform shifting on existing files
+INPUT_PREFIX=${INPUT_PREFIX:-""}     # Prefix for input files when using SHIFT_ONLY
+GPU_COUNT=${GPU_COUNT:-1}            # Number of GPUs to use (default: 1)
+NUM_SPEAKERS=${NUM_SPEAKERS:-""}     # Number of speakers, if known (optional)
+FIX_START=${FIX_START:-"true"}       # Prepends silence to capture the first seconds
+
+# === TIMESTAMP SHIFTING FUNCTION ===
+shift_timestamps() {
+    local input_file=$1
+    local output_file=$2
+    local shift_by=$3
+    local file_ext="${input_file##*.}"
+
+    if [ "$file_ext" = "srt" ]; then
+        echo "🕒 Shifting SRT timestamps by $shift_by seconds..."
+        # SRT format: 00:00:05,440 --> 00:00:08,300
+        awk -v shift=$shift_by '
+        function time_to_seconds(time_str) {
+            split(time_str, parts, ",")
+            split(parts[1], time_parts, ":")
+            return time_parts[1]*3600 + time_parts[2]*60 + time_parts[3] + parts[2]/1000
+        }
+
+        function seconds_to_time(seconds) {
+            h = int(seconds/3600)
+            m = int((seconds-h*3600)/60)
+            s = int(seconds-h*3600-m*60)
+            ms = int((seconds - int(seconds))*1000)
+            return sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
+        }
+
+        {
+            if (match($0, /^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$/)) {
+                # Both timestamps are fixed-width (12 chars) around the 5-char " --> "
+                start_time = time_to_seconds(substr($0, 1, 12))
+                end_time = time_to_seconds(substr($0, 18, 12))
+
+                new_start = start_time + shift
+                new_end = end_time + shift
+
+                # Handle negative times (not allowed in SRT)
+                if (new_start < 0) new_start = 0
+                if (new_end < 0) new_end = 0
+
+                print seconds_to_time(new_start)" --> "seconds_to_time(new_end)
+            } else {
+                print $0
+            }
+        }' "$input_file" > "$output_file"
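+
+        # Worked example: with shift_by=30 the cue line
+        #   00:00:05,440 --> 00:00:08,300
+        # is rewritten as
+        #   00:00:35,440 --> 00:00:38,300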
+
+    elif [ "$file_ext" = "vtt" ]; then
+        echo "🕒 Shifting VTT timestamps by $shift_by seconds..."
+        # VTT format: 00:00:05.440 --> 00:00:08.300
+        awk -v shift=$shift_by '
+        function time_to_seconds(time_str) {
+            split(time_str, parts, ".")
+            split(parts[1], time_parts, ":")
+            return time_parts[1]*3600 + time_parts[2]*60 + time_parts[3] + parts[2]/1000
+        }
+
+        function seconds_to_time(seconds) {
+            h = int(seconds/3600)
+            m = int((seconds-h*3600)/60)
+            s = int(seconds-h*3600-m*60)
+            ms = int((seconds - int(seconds))*1000)
+            return sprintf("%02d:%02d:%02d.%03d", h, m, s, ms)
+        }
+
+        {
+            if (match($0, /^[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}$/)) {
+                # Both timestamps are fixed-width (12 chars) around the 5-char " --> "
+                start_time = time_to_seconds(substr($0, 1, 12))
+                end_time = time_to_seconds(substr($0, 18, 12))
+
+                new_start = start_time + shift
+                new_end = end_time + shift
+
+                # Handle negative times
+                if (new_start < 0) new_start = 0
+                if (new_end < 0) new_end = 0
+
+                print seconds_to_time(new_start)" --> "seconds_to_time(new_end)
+            } else {
+                print $0
+            }
+        }' "$input_file" > "$output_file"
+
+    elif [ "$file_ext" = "txt" ]; then
+        echo "🕒 Shifting timestamps in TXT by $shift_by seconds..."
+        # For text files, handle bracketed timestamps such as [00:05.440]
+        awk -v shift=$shift_by '
+        function time_to_seconds(time_str) {
+            # Remove brackets
+            gsub(/[\[\]]/, "", time_str)
+
+            # Check format - either MM:SS.mmm or HH:MM:SS.mmm
+            if (split(time_str, parts, ":") == 2) {
+                # MM:SS.mmm format
+                mm = parts[1]
+                split(parts[2], sec_parts, ".")
+                ss = sec_parts[1]
+                ms = sec_parts[2] ? sec_parts[2] : 0
+                return mm*60 + ss + ms/1000
+            } else {
+                # HH:MM:SS.mmm format
+                hh = parts[1]
+                mm = parts[2]
+                split(parts[3], sec_parts, ".")
+                ss = sec_parts[1]
+                ms = sec_parts[2] ? sec_parts[2] : 0
+                return hh*3600 + mm*60 + ss + ms/1000
+            }
+        }
+
+        function seconds_to_time(seconds) {
+            h = int(seconds/3600)
+            m = int((seconds-h*3600)/60)
+            s = seconds-h*3600-m*60
+            # Format with 3 decimal places for milliseconds
+            if (h > 0) {
+                return sprintf("[%02d:%02d:%06.3f]", h, m, s)
+            } else {
+                return sprintf("[%02d:%06.3f]", m, s)
+            }
+        }
+
+        {
+            line = $0
+            out = ""
+            # Consume timestamps of the form [MM:SS.mmm] or [HH:MM:SS.mmm] left to
+            # right, appending the shifted value to "out" so the rewritten
+            # timestamp is never re-matched (re-scanning from the start of the
+            # line would loop forever, since the replacement matches the pattern too)
+            while (match(line, /\[[0-9]+:[0-9]+(:[0-9]+)?(\.[0-9]+)?\]/)) {
+                time_str = substr(line, RSTART, RLENGTH)
+                time_sec = time_to_seconds(time_str)
+
+                new_time = time_sec + shift
+                if (new_time < 0) new_time = 0
+
+                out = out substr(line, 1, RSTART-1) seconds_to_time(new_time)
+                line = substr(line, RSTART+RLENGTH)
+            }
+            print out line
+        }' "$input_file" > "$output_file"
+    else
+        echo "⚠️ Unsupported file extension for shifting: $file_ext"
+        cp "$input_file" "$output_file"
+    fi
+}
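+
+# Note: the TXT branch above assumes bracketed [MM:SS.mmm] timestamps; lines in
+# the "(H:MM:SS.mmm - H:MM:SS.mmm)" style that parallel_transcript.py writes
+# pass through unchanged.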
+
+# If we're only shifting timestamps, do that and exit
+if [ "$SHIFT_ONLY" = "true" ]; then
+    if [ -z "$INPUT_PREFIX" ]; then
+        echo "❌ ERROR: When using SHIFT_ONLY=true, you must specify INPUT_PREFIX"
+        exit 1
+    fi
+
+    echo "🕒 Performing timestamp shifting by $SHIFT_SECONDS seconds..."
+
+    # Process each file type
+    for ext in txt srt vtt; do
+        # Check for regular transcript
+        if [ -f "${INPUT_PREFIX}.${ext}" ]; then
+            shift_timestamps "${INPUT_PREFIX}.${ext}" "${INPUT_PREFIX}_shifted.${ext}" $SHIFT_SECONDS
+            echo "✅ Created ${INPUT_PREFIX}_shifted.${ext}"
+        fi
+
+        # Check for final transcript
+        if [ -f "${INPUT_PREFIX}_final.${ext}" ]; then
+            shift_timestamps "${INPUT_PREFIX}_final.${ext}" "${INPUT_PREFIX}_final_shifted.${ext}" $SHIFT_SECONDS
+            echo "✅ Created ${INPUT_PREFIX}_final_shifted.${ext}"
+        fi
+    done
+
+    echo "✅ Timestamp shifting complete!"
+    exit 0
+fi
+
+# Generate a random suffix for output names
+if command -v openssl > /dev/null 2>&1; then
+    RANDOM_SUFFIX=$(openssl rand -hex 4)
+elif command -v md5sum > /dev/null 2>&1; then
+    RANDOM_SUFFIX=$(date +%s | md5sum | head -c 8)
+elif command -v shasum > /dev/null 2>&1; then
+    RANDOM_SUFFIX=$(date +%s | shasum | head -c 8)
+else
+    RANDOM_SUFFIX=$RANDOM$RANDOM
+fi
+
+AUDIO_FILE="${ORIGINAL_FILENAME}_${START_MIN}_${END_MIN}_${RANDOM_SUFFIX}.wav"
+DIARIZATION_ENABLED=${DIARIZATION_ENABLED:-true}
+HF_TOKEN=${HF_TOKEN:-""}
+BUCKET_NAME=${BUCKET_NAME:-"whisper-video-transcripts"}
+
+# Output file names with the same format
+TRANSCRIPT_PREFIX="${ORIGINAL_FILENAME}_${START_MIN}_${END_MIN}_${RANDOM_SUFFIX}"
+TRANSCRIPT_FILE="${TRANSCRIPT_PREFIX}.txt"
+FINAL_TRANSCRIPT_FILE="${TRANSCRIPT_PREFIX}_final.txt"
+SRT_FILE="${TRANSCRIPT_PREFIX}.srt"
+VTT_FILE="${TRANSCRIPT_PREFIX}.vtt"
+
+# === PRELIMINARY CHECKS ===
+if [ ! -f "$KEY_FILE" ]; then
+    echo "❌ SSH key not found at $KEY_FILE"
+    exit 1
+fi
+
+if [ ! -f "parallel_transcript.py" ]; then
+    echo "❌ File parallel_transcript.py not found"
+    exit 1
+fi
+
+if [ ! -f "$VIDEO_FILE" ]; then
+    echo "❌ Video file $VIDEO_FILE not found"
+    exit 1
+fi
+
+# === CONVERT MP4 TO WAV AND APPLY THE CROP BEFORE UPLOAD ===
+echo "🎙️ Converting $VIDEO_FILE to $AUDIO_FILE with the crop applied..."
+FFMPEG_CMD="ffmpeg -i \"$VIDEO_FILE\""
+
+# Add crop parameters if START_MIN or END_MIN are set
+if [ "$START_MIN" != "0" ] || [ "$END_MIN" != "0" ]; then
+    START_SEC=$((START_MIN * 60))
+    if [ "$END_MIN" != "0" ]; then
+        END_SEC=$((END_MIN * 60))
+        FFMPEG_CMD+=" -ss $START_SEC -to $END_SEC"
+    else
+        FFMPEG_CMD+=" -ss $START_SEC"
+    fi
+    echo "⏱️ Cropping video from minute $START_MIN to ${END_MIN:-end}"
+fi
+
+# Complete the ffmpeg command with the remaining parameters
+FFMPEG_CMD+=" -ac 1 -ar 16000 -vn \"$AUDIO_FILE\" -y"
+
+# Run the ffmpeg command
+eval $FFMPEG_CMD
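+
+# Example of the resulting command for START_MIN=5 END_MIN=10 (names illustrative):
+#   ffmpeg -i "talk.mp4" -ss 300 -to 600 -ac 1 -ar 16000 -vn "talk_5_10_ab12cd34.wav" -y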
+
+echo "☁️ Checking whether the audio is already on S3..."
+AUDIO_UPLOADED=""
+if ! aws s3 ls s3://$BUCKET_NAME/$AUDIO_FILE >/dev/null 2>&1; then
+    echo "⬆️ Uploading $AUDIO_FILE to S3..."
+    aws s3 cp $AUDIO_FILE s3://$BUCKET_NAME/
+    AUDIO_UPLOADED="true"
+else
+    echo "✅ Audio already on S3. Skipping upload."
+fi
+
+# === CHECK FOR OR CREATE THE DEFAULT VPC ===
+echo "🔍 Checking for a default VPC in region $REGION..."
+DEFAULT_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters Name=isDefault,Values=true --query "Vpcs[0].VpcId" --output text)
+
+if [ "$DEFAULT_VPC_ID" = "None" ]; then
+    echo "➕ No default VPC found. Creating one..."
+    DEFAULT_VPC_ID=$(aws ec2 create-default-vpc --region $REGION --query "Vpc.VpcId" --output text)
+    echo "✅ Default VPC created: $DEFAULT_VPC_ID"
+else
+    echo "✅ Existing default VPC: $DEFAULT_VPC_ID"
+fi
+
+# === CREATE THE SECURITY GROUP IF NEEDED ===
+if ! aws ec2 describe-security-groups --group-names $SECURITY_GROUP --region $REGION &>/dev/null; then
+    echo "➕ Creating security group $SECURITY_GROUP..."
+    aws ec2 create-security-group --group-name $SECURITY_GROUP --description "Whisper SG" --vpc-id $DEFAULT_VPC_ID --region $REGION
+    aws ec2 authorize-security-group-ingress --group-name $SECURITY_GROUP --protocol tcp --port 22 --cidr 0.0.0.0/0 --region $REGION
+fi
+
+# === LAUNCH THE EC2 INSTANCE ===
+echo "🚀 Launching GPU EC2 instance ($INSTANCE_TYPE)..."
+INSTANCE_ID=$(aws ec2 run-instances \
+    --image-id $AMI_ID \
+    --instance-type $INSTANCE_TYPE \
+    --key-name $KEY_NAME \
+    --security-groups $SECURITY_GROUP \
+    --iam-instance-profile Name=WhisperS3Profile \
+    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50}}]' \
+    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=whisper-runner}]" \
+    --region $REGION \
+    --query "Instances[0].InstanceId" \
+    --output text)
+
+if [ -z "$INSTANCE_ID" ]; then
+    echo "❌ ERROR: no instance ID returned. Check that the AMI is valid for region $REGION."
+    exit 1
+fi
+
+echo "🆔 Instance launched: $INSTANCE_ID"
+
+# === CLEANUP FUNCTION FOR UNEXPECTED EXITS ===
+function cleanup {
+    echo "🧨 Cleaning up..."
+
+    # Remove the local audio file if it exists
+    if [ -f "$AUDIO_FILE" ]; then
+        echo "🧹 Removing local audio file $AUDIO_FILE..."
+        rm -f "$AUDIO_FILE"
+        echo "✅ Local audio file removed."
+    fi
+
+    # Remove the audio from S3 if this run uploaded it
+    if [ "$AUDIO_UPLOADED" = "true" ]; then
+        echo "🧹 Removing $AUDIO_FILE from S3..."
+        aws s3 rm s3://$BUCKET_NAME/$AUDIO_FILE
+        echo "✅ File removed from S3."
+    fi
+
+    # Terminate the EC2 instance if one was launched
+    if [ -n "$INSTANCE_ID" ]; then
+        echo "🧹 Terminating EC2 instance ($INSTANCE_ID)..."
+        aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION >/dev/null
+
+        # Wait for termination, with a timeout
+        echo "⏳ Waiting for the instance to terminate (max 60 seconds)..."
+        WAIT_TIMEOUT=60
+        WAIT_START=$(date +%s)
+
+        WAITING=true
+        while [ "$WAITING" = true ]; do
+            # Check the instance state
+            STATUS=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --region $REGION --query "Reservations[0].Instances[0].State.Name" --output text 2>/dev/null)
+
+            # Stop waiting once the instance is terminated or gone
+            if [ "$STATUS" = "terminated" ] || [ "$STATUS" = "None" ]; then
+                echo "✅ Instance terminated successfully."
+                WAITING=false
+            else
+                # Check whether the timeout expired
+                WAIT_ELAPSED=$(($(date +%s) - WAIT_START))
+                if [ $WAIT_ELAPSED -ge $WAIT_TIMEOUT ]; then
+                    echo "⚠️ Timed out waiting for termination. The instance may still be shutting down."
+                    WAITING=false
+                else
+                    # Wait briefly before checking again
+                    sleep 2
+                    echo -n "."
+                fi
+            fi
+        done
+    fi
+}
+
+# Run cleanup on any exit: normal, error, or Ctrl+C
+trap cleanup EXIT
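+
+# To watch the instance from another terminal while the script runs (illustrative):
+#   aws ec2 describe-instances --instance-ids <id> --region eu-south-1 \
+#     --query 'Reservations[0].Instances[0].State.Name'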
+
+echo "⏳ Waiting for the instance to be ready..."
+aws ec2 wait instance-running --instance-ids $INSTANCE_ID --region $REGION
+
+echo "🔐 Waiting for the instance to accept SSH..."
+for i in {1..35}; do
+    PUBLIC_IP=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --region $REGION --query "Reservations[0].Instances[0].PublicIpAddress" --output text)
+    echo "🌍 Public IP: $PUBLIC_IP"
+
+    if nc -zv $PUBLIC_IP 22 >/dev/null 2>&1; then
+        echo "✅ Port 22 open, the instance is ready!"
+        break
+    else
+        echo "⏳ Attempt $i/35: port 22 still closed. Retrying in 5s..."
+        sleep 5
+    fi
+done
+
+# === UPLOAD THE PYTHON SCRIPT TO THE INSTANCE ===
+echo "📦 Uploading scripts to the EC2 machine..."
+scp -o StrictHostKeyChecking=no -i $KEY_FILE parallel_transcript.py ubuntu@$PUBLIC_IP:/home/ubuntu/
+scp -o StrictHostKeyChecking=no -i $KEY_FILE .env ubuntu@$PUBLIC_IP:/home/ubuntu/
+scp -o StrictHostKeyChecking=no -i $KEY_FILE requirements.txt ubuntu@$PUBLIC_IP:/home/ubuntu/
+
+echo "⚙️ Downloading the audio from S3 and running the advanced transcription..."
+ssh -t -i $KEY_FILE -o "SendEnv=TERM" ubuntu@$PUBLIC_IP "
+    # Prevent broken pipe errors
+    export PYTHONUNBUFFERED=1
+    set -e
+    cd /home/ubuntu
+
+    echo '⬇️ Downloading from S3...'
+    aws s3 cp s3://$BUCKET_NAME/$AUDIO_FILE /home/ubuntu/$AUDIO_FILE --region $REGION
+
+    echo '📦 Downloaded file:'
+    ls -lh $AUDIO_FILE
+
+    echo '⚙️ Activating the virtual environment...'
+    source whisper-env/bin/activate
+
+    # Install PyDub if missing
+    if ! pip list | grep -q pydub; then
+        echo '📦 Installing missing dependencies...'
+        pip install pydub
+    fi
+
+    # Install the dependencies from requirements.txt
+    pip install -r requirements.txt
+
+    echo '🖥️ GPU information:'
+    nvidia-smi
+
+    echo 'Audio file: $AUDIO_FILE'
+    echo 'Hugging Face token: (set)'
+    echo 'Diarization enabled: $DIARIZATION_ENABLED'
+    echo 'Number of speakers: $NUM_SPEAKERS'
+    echo '✍️ Starting the advanced transcription...'
+    CMD=\"python3 parallel_transcript.py --audio $AUDIO_FILE --token $HF_TOKEN \
+        --output-prefix $TRANSCRIPT_PREFIX\"
+
+    if [ \"$DIARIZATION_ENABLED\" = false ]; then
+        CMD+=\" --no-diarization\"
+    fi
+
+    if [ -n \"$NUM_SPEAKERS\" ]; then
+        CMD+=\" --num-speakers $NUM_SPEAKERS\"
+        echo '👥 Using the specified number of speakers: $NUM_SPEAKERS'
+    fi
+
+    if [ \"$FIX_START\" = true ]; then
+        CMD+=\" --fix-start\"
+        echo '⏱️ Added the fix for the first seconds'
+    fi
+
+    eval \$CMD
+"
+
+# === DOWNLOAD THE OUTPUT FILES ===
+echo "⬇️ Downloading the output files..."
+scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}_final.txt . || echo "⚠️ Could not download _final.txt (it may not have been generated)"
+scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.txt . || echo "⚠️ Could not download .txt"
+scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.srt . || echo "⚠️ Could not download .srt"
+scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.vtt . || echo "⚠️ Could not download .vtt"
+
+# Also fetch the JSON files with extra data for debugging
+scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}.txt.words.json . 2>/dev/null || true
+scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}_final.txt.diarization.json . 2>/dev/null || true
+scp -i $KEY_FILE ubuntu@$PUBLIC_IP:/home/ubuntu/${TRANSCRIPT_PREFIX}_final.txt.overlaps.json . 2>/dev/null || true
+
+echo "📄 Downloaded files:"
+ls -lh ${TRANSCRIPT_PREFIX}* 2>/dev/null || echo "⚠️ No files found with the given prefix"
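+
+# If you cropped with START_MIN, you can realign the subtitles to the original
+# video afterwards (illustrative follow-up run, using the prefix printed above):
+#   SHIFT_ONLY=true SHIFT_SECONDS=$((5 * 60)) INPUT_PREFIX="talk_5_10_ab12cd34" ./whisper_parallel.sh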