I have a use-case where I want to:
- Locate all PII in any given audio file (done: using GPT/similar models)
- Transcribe the audio and mask all that PII in the text output (done: using whisper/similar models)
- Mask the PII portions of the original audio with beeps (remaining)
The core problem is that a typical transcription model doesn't return the start/end time of each spoken word, so it's very difficult to map the PII found in the transcript back to positions in the audio.
Has anyone figured out a way to solve this? On-prem models or API-based services are both fine; some direction is what I'm looking for.
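For anyone in the same spot: some ASR tools do expose word-level timing, e.g. openai-whisper's `transcribe(..., word_timestamps=True)` returns per-word `start`/`end` times, and forced aligners can do the same for an existing transcript. Once you have those timestamps, the beep step is just overwriting sample ranges. Below is a minimal, self-contained sketch of that last step; the `words` list and `pii` set are hypothetical stand-ins for what your transcription and PII-detection steps would produce, and the audio is a dummy silent buffer rather than a real recording:

```python
import math

def beep_mask(samples, rate, spans, freq=1000.0, amp=0.4):
    """Overwrite each (start_sec, end_sec) span with a sine-wave beep."""
    out = list(samples)
    for start_s, end_s in spans:
        lo = int(start_s * rate)
        hi = min(int(end_s * rate), len(out))
        for i in range(lo, hi):
            out[i] = amp * math.sin(2 * math.pi * freq * (i - lo) / rate)
    return out

# Hypothetical word-level timestamps, shaped like whisper's
# word_timestamps=True output (one dict per word).
words = [
    {"word": "my",      "start": 0.0, "end": 0.2},
    {"word": "number",  "start": 0.2, "end": 0.6},
    {"word": "is",      "start": 0.6, "end": 0.8},
    {"word": "5551234", "start": 0.8, "end": 1.4},
]
pii = {"5551234"}  # words your PII detector flagged (assumed)

# Collect the time spans of flagged words.
spans = [(w["start"], w["end"]) for w in words if w["word"] in pii]

rate = 16000
audio = [0.0] * (rate * 2)  # 2 s of silence standing in for real audio
masked = beep_mask(audio, rate, spans)
```

In practice you'd load/save the real samples with something like `pydub` or the stdlib `wave` module, and you may want to pad each span by ~50-100 ms on each side, since word timestamps are approximate.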
Source: https://stackoverflow.com/questions/781 ... tion-other