
https://transcript.fish

I have been working on this podcast transcription project for a couple months and it's been super rewarding.

I listen to a podcast called No Such Thing As A Fish[0], where some researchers talk about the favorite facts they learned that week. Then they riff on them and are generally smart and funny. I listened to the series so many times that I decided I wanted to listen to the show on shuffle, not at the episode level, but at the fact level.

Since I have been playing around with whisper.cpp in Python, this seemed like a perfect way to combine some technologies I've been wanting to play with.

I ran whisper[1] over the entire podcast and transcribed all the episodes. I had to do this multiple times because I kept messing up. It eventually took like 7 straight days of my M1 processing to get through ~490 episodes using the medium.en model.
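For anyone curious what a run like that might look like, here is a minimal sketch using the faster-whisper library linked in [1]. It is an assumption that this matches the actual pipeline; the row layout, the `int8` compute type, and the function names are hypothetical choices for illustration only.

```python
from typing import Iterable, List, Tuple

# Hypothetical row layout: (episode number, start seconds, end seconds, text)
Row = Tuple[int, float, float, str]


def segments_to_rows(episode: int, segments: Iterable) -> List[Row]:
    """Turn transcription segments (objects with .start, .end, .text) into rows."""
    return [
        (episode, round(seg.start, 2), round(seg.end, 2), seg.text.strip())
        for seg in segments
    ]


def load_model():
    """Load the medium.en model once so it can be reused for every episode."""
    # Third-party dependency, imported lazily: pip install faster-whisper
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions, not the settings used in the post
    return WhisperModel("medium.en", device="cpu", compute_type="int8")


def transcribe_episode(model, audio_path: str, episode: int) -> List[Row]:
    """Transcribe one audio file and return rows ready to insert into SQLite."""
    # transcribe() returns a lazy generator of segments plus an info object
    segments, _info = model.transcribe(audio_path)
    return segments_to_rows(episode, segments)
```

Loading the model once and passing it in keeps the per-episode cost down to the transcription itself, which matters when a full run takes days.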

4 million words and an 800 MB SQLite database later, I got the transcriptions done and have put up a nice site for searching through the data.

Now I just need to figure out the rest. Breaking it up into facts. Getting the audio working. Highlighting and linking to words, phrases, etc.

Some cool info about the process so far:

1. The SQLite database is chunked up and stored as static files, and the frontend queries the static files directly using HTTP range requests, so it only downloads a couple hundred KB when querying.

2. I've been properly using the free version of ChatGPT 3.5 to help me write Python and SQL. It's been pretty game changing as I feel basically no pain from not knowing what I'm doing.
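To make the chunking in point 1 a bit more concrete, here is a rough sketch of how a read at some byte offset of the logical database could be mapped onto chunk files and `Range` headers. The chunk size and the file naming scheme are made up for the example and are not taken from the actual site.

```python
from typing import List, Tuple

# Assumed chunk size; the real deployment may use a different value
CHUNK_SIZE = 10 * 1024 * 1024


def chunk_name(index: int) -> str:
    """Hypothetical naming scheme for the static chunk files."""
    return f"db.sqlite3.{index:03d}"


def plan_range_requests(
    offset: int, length: int, chunk_size: int = CHUNK_SIZE
) -> List[Tuple[str, str]]:
    """Map a read of `length` bytes at `offset` onto (file, Range header) pairs."""
    if offset < 0 or length <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0, length and chunk_size must be > 0")

    requests = []
    pos = offset
    end = offset + length  # exclusive end of the requested region
    while pos < end:
        index = pos // chunk_size           # which chunk file holds this byte
        start_in_chunk = pos % chunk_size   # position inside that file
        take = min(end - pos, chunk_size - start_in_chunk)
        last = start_in_chunk + take - 1    # HTTP byte ranges are inclusive
        requests.append((chunk_name(index), f"bytes={start_in_chunk}-{last}"))
        pos += take
    return requests


# A single 4 KiB page at offset 12288 needs only one small request
print(plan_range_requests(12288, 4096))
```

A read that straddles a chunk boundary simply turns into two requests, one per file, which is why only the pages a query actually touches ever get downloaded.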

The code is here: https://github.com/noman-land/transcript.fish

Please help if you know how to get whisper speaker diarization working!! I would really appreciate the help.

Also, any tips on super efficient ways to index[2] or search[3] my database would be helpful. Indexing matters a lot when querying the database in ranges like this... I have learned...
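One way to check whether a query is hitting an index (and therefore touching only a few pages, each of which is a network fetch in this setup) is `EXPLAIN QUERY PLAN`. The sketch below uses an invented `segments` table and index name, not the project's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema for illustration; the real tables may look different
conn.executescript(
    """
    CREATE TABLE segments (
        id      INTEGER PRIMARY KEY,
        episode INTEGER NOT NULL,
        start_s REAL    NOT NULL,
        text    TEXT    NOT NULL
    );
    CREATE INDEX idx_segments_episode ON segments (episode, start_s);
    """
)
conn.executemany(
    "INSERT INTO segments (episode, start_s, text) VALUES (?, ?, ?)",
    [(1, 0.0, "hello"), (1, 2.5, "and welcome"), (2, 0.0, "another episode")],
)


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


sql = "SELECT text FROM segments WHERE episode = ? ORDER BY start_s"
plan = query_plan(conn, sql, (1,))
print(plan)  # a SEARCH using the index is good; a full SCAN reads every page
texts = [row[0] for row in conn.execute(sql, (1,))]
```

Because the index covers both `episode` and `start_s`, the lookup and the ordering can come from the same index walk instead of a scan plus a sort.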

[0] https://www.nosuchthingasafish.com/

[1] https://github.com/guillaumekln/faster-whisper

[2] https://github.com/noman-land/transcript.fish/blob/master/db/...

[3] https://github.com/noman-land/transcript.fish/blob/master/sr...



This is fantastic. I'm a huge fan of the No Such Thing As A Fish podcast.

Are you part of Club Fish? I bet the folks in the Discord channel would love this.


Oh yeah, I'm in there. Come say hi! I just shared it with them a couple days ago and they already pinned it! Felt pretty great.


Naive comment here but you could check out using a vector database for semantic searches. Check out chromadb.


I've tried a couple and probably prefer Marqo to the rest. Definitely the most user-friendly pick of the bunch.


Thanks!


Thanks for the suggestion. Vector databases have been on my list of things to check out so this is timely.



