Issue
I am slowly working on a project where it would be very useful if the computer could find where in an mp3 file a certain sample occurs. I would restrict this problem to a fairly exact snippet of the audio, not, for example, the chorus of a song on a different recording by the same band, which would become more of a machine learning problem. I am thinking that if the sample has no noise added and comes from the same file, it should somehow be possible to locate the time at which it occurs without machine learning, just like grep can find the lines in a text file where a word occurs.
In case you don’t have an mp3 lying around, you can set up the problem with some music available on the net which is in the public domain, so nobody complains:
curl https://web.archive.org/web/20041019004300/http://www.navyband.navy.mil/anthems/ANTHEMS/United%20Kingdom.mp3 --output godsavethequeen.mp3
It’s a minute long:
exiftool godsavethequeen.mp3 | grep Duration
Duration : 0:01:03 (approx)
Now cut out a bit between 30 and 33 seconds (the bit which goes la la la la..):
ffmpeg -ss 30 -to 33 -i godsavethequeen.mp3 gstq_sample.mp3
Both files are now in the folder:
$ ls -la
-rw-r--r-- 1 cardamom cardamom 48736 Jun 23 00:08 gstq_sample.mp3
-rw-r--r-- 1 cardamom cardamom 1007055 Jun 22 23:57 godsavethequeen.mp3
For some reason exiftool seems to overestimate the duration of the sample:
$ exiftool gstq_sample.mp3 | grep Duration
Duration : 6.09 s (approx)
..but I suppose it’s only approximate like it tells you.
This is what I am after:
$ findsoundsample gstq_sample.mp3 godsavethequeen.mp3
start 30 end 33
I am happy if it is a bash script or a Python solution, even one using some kind of Python library. Sometimes if you use the wrong tool, the solution might work but look horrible, so whichever tool is more suitable. This is a one-minute mp3; I have not yet thought about performance, just about getting it done at all, but I would like some scalability, e.g. finding ten seconds somewhere in half an hour.
Have been looking at the following resources as I try to solve this myself:
How to recognize a music sample using Python and Gracenote?
https://github.com/craigfrancis/audio-detect
https://madmom.readthedocs.io/en/latest/introduction.html
https://github.com/aubio/aubio (aubionset is a good candidate)
https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
Solution
As suggested in Carson’s answer, processing the audio gets a lot easier once the files are converted to the .wav format.
You may do so using Wernight’s answer on reading mp3 in Python:
ffmpeg -i godsavethequeen.mp3 -vn -acodec pcm_s16le -ac 1 -ar 44100 -f wav godsavethequeen.wav
ffmpeg -i gstq_sample.mp3 -vn -acodec pcm_s16le -ac 1 -ar 44100 -f wav gstq_sample.wav
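If you would rather trigger the conversion from Python than type it for every file, the same call can be wrapped with subprocess. This is only a sketch: the names ffmpeg_wav_command and to_wav are made up for illustration, it assumes ffmpeg is on the PATH, and the -y flag (overwrite the output without asking) is an addition that the commands above do not use.

```python
import subprocess


def ffmpeg_wav_command(src, dst, rate=44100):
    """Build the ffmpeg call shown above: mono, 16-bit PCM, fixed sampling rate."""
    return [
        "ffmpeg",
        "-y",  # overwrite dst without prompting (addition, not in the commands above)
        "-i", src,
        "-vn",
        "-acodec", "pcm_s16le",
        "-ac", "1",
        "-ar", str(rate),
        "-f", "wav",
        dst,
    ]


def to_wav(src, dst, rate=44100):
    """Run ffmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_wav_command(src, dst, rate), check=True)
```

Calling to_wav("gstq_sample.mp3", "gstq_sample.wav") would then correspond to the second command above.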
Then, finding the position of the sample is mostly a matter of locating the peak of the cross-correlation function between the source (godsavethequeen.wav in this case) and the sample to look for (gstq_sample.wav). In essence, this finds the shift at which the sample looks the most like the corresponding portion of the source. This can be done in Python using scipy.signal.correlate.
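To see the idea in isolation before touching real audio, here is a toy sketch on synthetic data. The 1000 Hz rate, the random-noise "source" and the two-second offset are all made-up values chosen so the result is easy to check; only scipy.signal.correlate and the peak-to-offset arithmetic carry over to the real files.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
rate = 1000                                   # hypothetical sampling rate in Hz
source = rng.standard_normal(5 * rate)        # 5 s of noise standing in for the song
true_start = 2 * rate                         # the excerpt begins 2 s into the source
snippet = source[true_start:true_start + rate].copy()  # 1 s excerpt

# "full" mode returns one value per possible shift of the snippet
z = signal.correlate(source, snippet, mode="full")
peak = int(np.argmax(np.abs(z)))

# in "full" mode, index len(snippet) - 1 corresponds to zero shift
start_index = peak - (len(snippet) - 1)
print("start {} s".format(start_index / rate))
```

Because the excerpt is an exact copy of part of the source, the correlation at the true shift is far larger than at any other shift, so the recovered start_index equals true_start.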
A small Python script that does this for the two .wav files could look like:
import sys
import numpy as np
from scipy import signal
from scipy.io import wavfile
snippet_path = sys.argv[1]
source_path = sys.argv[2]
# read the sample to look for
rate_snippet, snippet = wavfile.read(snippet_path)
snippet = np.array(snippet, dtype='float')
# read the source
rate, source = wavfile.read(source_path)
source = np.array(source, dtype='float')
# resample such that both signals are at the same sampling rate (if required)
if rate != rate_snippet:
    num = int(np.round(rate * len(snippet) / rate_snippet))
    snippet = signal.resample(snippet, num)
# compute the cross-correlation
z = signal.correlate(source, snippet)
# the peak marks the shift at which the snippet best matches the source
peak = np.argmax(np.abs(z))
start = (peak - len(snippet) + 1) / rate
end = peak / rate
print("start {} end {}".format(start, end))
Note that for good measure I’ve included a check to make sure both .wav files have the same sampling rate (and resample as needed), but you could alternatively make sure they are always the same when converting them from .mp3 format by passing the -ar 44100 argument to ffmpeg.
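As a quick illustration of what that resampling step does, the sketch below brings a test tone from 22050 Hz up to 44100 Hz using the same num formula as the script. The 440 Hz tone and the two rates are arbitrary choices for the demo, not values from the files above.

```python
import numpy as np
from scipy import signal

rate_snippet, rate = 22050, 44100   # made-up rates for the demo
tone_hz = 440.0

# one second of a 440 Hz tone sampled at the lower rate
t_low = np.arange(rate_snippet) / rate_snippet
snippet = np.sin(2 * np.pi * tone_hz * t_low)

# same formula as in the script: keep the duration, change the sample count
num = int(np.round(rate * len(snippet) / rate_snippet))
resampled = signal.resample(snippet, num)

# the same tone sampled directly at the higher rate, for comparison
t_high = np.arange(rate) / rate
reference = np.sin(2 * np.pi * tone_hz * t_high)

print(len(snippet), "->", len(resampled))
print("matches reference:", np.allclose(resampled, reference, atol=1e-6))
```

scipy.signal.resample works in the Fourier domain and treats the signal as periodic, which suits this whole-number-of-cycles tone; on real audio the edges of the resampled snippet may be slightly distorted, which is one more reason to fix the rate with -ar 44100 at conversion time.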
Answered By – SleuthEye
This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0