With about 6 months of Python experience, I decided to write a custom backend for the Internet radio I run, interfacing it with an IRC bot to report songs and let listeners manipulate the playlist. supybot and a very convenient hack of ezstream made this a fairly easy task, considering.
However, I have encountered a problem. One of my bot's features is a song request command that looks up its parameters in an index dictionary, into which every artist/title/album was entered as a key. If there is a match, it places the song at the front of the playlist (selecting randomly from a list if necessary).
Now my listeners are requesting the ability to do substring matches, e.g. 'lonely hearts' should match "Sgt. Pepper's Lonely Hearts Club Band 40th Anniversary Remastered Collectors Edition", and ideally 'police' would match both the British rock sensation and NWA's "Fuck Tha Police". As well as being convenient, this is critical for some songs with foreign-language titles that most users can't type out completely.
Is there a way to do substring matches in a dictionary (other than iterating through the entire keyset, which, at 5000 elements and growing, would probably be unacceptable performance-wise)? Alternatively, is there another data structure I ought to be using instead? I considered trying a proper database module when designing this feature, but opted for a dictionary because I thought they were easier and more flexible. This seems to have become ironic.
On an unrelated note, are lambda functions any more efficient or just a way to flex one's geek muscles?
If you define "substring" fairly rigidly as, say, "a group of non-whitespace characters bounded at each end by whitespace" (i.e. whole words), and limit substring searching to whole words only (no phrases, no partial words), then you could augment your main dictionary with another dictionary of all the words (an index, if you will). "Substring" searches (e.g. "lonely") would then be a matter of looking up the index. (Each time a song is added to the playlist, you'd need to generate index data for it; and right at the start you'd need to index everything already in there.)
Of course, the limitation of this is that you can only do substring matches on whole words (e.g. "lonely", not "onel"), and can't do substring matches on phrases ("lonely hearts") -- although these could be faked by looking up the index for each word in the phrase and returning only the results that match all words ("lonely" and "hearts").
This is coming close to implementing your own database, so if you need much more than this you're probably better off investing the effort in moving to a database rather than reinventing one (at which point, depending on your chosen database, you get real substring indexing and so on for "free").
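Very roughly, something like this (an untested sketch; the dictionary and function names are made up, so adapt them to however you actually store your songs):
# Sketch of a whole-word index. Assumes `songs_by_key` is your existing
# dictionary mapping the full artist/title/album string to the song(s).
from collections import defaultdict

def build_word_index(songs_by_key):
    index = defaultdict(set)
    for key in songs_by_key:
        for word in key.lower().split():
            index[word].add(key)
    return index

def lookup(phrase, index, songs_by_key):
    # every word in the phrase must appear in the key ("lonely" and "hearts")
    words = phrase.lower().split()
    if not words:
        return []
    keys = set(index.get(words[0], ()))
    for word in words[1:]:
        keys &= index.get(word, set())
    return [songs_by_key[key] for key in keys]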
If you want to search by complete words, you'll need to implement a keyword index (like >>2 says). This is how most search engines work.
If you need sub-word accuracy, you'll probably want to look at the BLAST or FASTA algorithms for ideas.
The ugly and lazy way to do it is to keep a really long string in sync with the dictionary and run a processed regex on it.
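That is, roughly (untested, names made up, and it assumes no key contains a newline):
# Sketch of the "one long string" hack: keep every dictionary key on its
# own line in a single string, then let the regex engine do the scanning.
import re

haystack = "\n".join(songs_by_key)   # regenerate whenever the dict changes

def substring_lookup(query, haystack, songs_by_key):
    pattern = re.compile("^.*%s.*$" % re.escape(query), re.MULTILINE | re.IGNORECASE)
    return [songs_by_key[key] for key in pattern.findall(haystack)]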
Those solutions sound a little hackish, so I'd rather migrate to a proper DBM. The next question is: will a lightweight one work for keyword and phrase searching (sub-word accuracy probably isn't necessary, though it would be nice), or would I need an industrial-grade one? Python's anydbm module looks pretty useless, so I'm hoping SQLite or something else that uses files/libraries rather than servers/daemons will do the job.
Realistically, this is an IRC bot. It's not like you're going to be dealing with a lot of queries. You could run it on nearly anything.
True, but responsiveness can be an issue, as this bot doesn't thread particularly well. I don't want it to lock up and make the other bot functions unresponsive when someone makes a request; I already have a log grep function which does that nicely. (I might post that code here sometime, but I don't think it can be optimized any further; it's just the large size of the log files causing the problem.)
I guess I'll start work on a keyword index and see how it performs. It'll be nice dealing with only a single dictionary instead of five of them, anyway.
SQLite is the way! In fact, since Python 2.5 it is integrated with Python; if you use 2.4 or earlier, just download the sqlite3 module and the SQLite client (all platforms). Here's a quick example.
# open the database, or create it if it doesn't exist
import sqlite3
conn = sqlite3.connect('path/to/database.anyextension')
# let's assume the tables are already there, ok?
cur = conn.cursor()
pattern = "%lonely heart%"  # any string will do
cur.execute("select file from repo where title like '" + pattern + "'")
print cur.fetchall()
# I didn't run the above code, might not run as is
You're one step away from an SQL injection vulnerability there. NEVER, EVER build SQL statements with string concatenation.
Once again, with emphasis: NEVER, EVER build SQL statements with string concatenation.
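For the record, the sqlite3 module supports parameterized queries; pass the value separately and the driver handles the quoting. A minimal sketch, reusing the cursor from the snippet above:
# same query as above, but with a ? placeholder instead of concatenation
pattern = "%lonely heart%"
cur.execute("select file from repo where title like ?", (pattern,))
print cur.fetchall()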
>>2's suggestion is working out pretty well. The continued existence of the radio's host server is uncertain, though, so it might become a moot point. Also, I became very leery of SQLite after it corrupted two of my databases, and I've heard from numerous programmer friends that it's unstable under load and does such things often... not that it matters now, but I'm unlikely to use it in the future.
To give this thread a continued purpose: that bot's log searching function is getting really slow. I suspect it's more due to the sheer size of the log directory (120 MB) than inefficiency, but does anyone see anything that could stand to be optimized?
def _blockIterator(self, f):
    data = f.read(32768)
    while data:
        # Be warned! If no newline is found within a 32768-byte block,
        # the unpacking below will raise ValueError!
        yielded, remaining = data.rsplit('\n', 1)
        yield yielded + '\n'
        data = remaining + f.read(32768 - len(remaining))
def _matchWord(self, f, what):
    #n = re.compile(r"\n(.*\b%s\b.*)\t.+\t[^\n]+" % re.escape(what))
    n = re.compile(".*%s.*" % what)
    return sum((n.findall(data) for data in self._blockIterator(f)), [])
def grep(self, irc, msg, args, channel, reverse, maxmatch, query):
    """[<channel>] [<reverse>] [<max-matches>] <quote/Perl-regex>

    Searches the logs of the specified channel for a given string or
    regular expression. <channel> is only necessary if the message isn't
    sent in the channel itself.
    """
    if maxmatch is None:
        maxmatch = 5
    if maxmatch > 5 and msg.args[0].startswith("#"):
        irc.error("I refuse to spam the channel with more than 5 lines. Please PM the command to me.")
        return
    results = []
    self.log.info('log grepping with args %s, %s, %i, %s.', channel, reverse, maxmatch, query)
    for root, dirs, files in os.walk("/home/keine/logs/ChannelLogger/%s/" % channel):
        if reverse:
            files.reverse()
        for name in files:
            f = open(os.path.join(root, name), "r")
            matches = self._matchWord(f, query)
            f.close()
            results.extend(matches)
            if len(results) >= maxmatch:
                break
    if reverse:
        results.reverse()
    for result in results[0:maxmatch]:
        irc.reply(result)
grep = wrap(grep, ["channel", optional(("literal", "reverse")), optional("int"), "text"])
Another option which is unlikely to be suggested by most. There are several free implementations of MUMPS (long an ANSI standard language) and MUMPS makes this sort of thing rather easy as all data elements (called 'globals') are stored on disk in sorted order. MUMPS is quite fast (all other things being equal, there are reports of it being rather faster than SQL) and isn't a resource hog.
Doing what you need will probably take no more than a page or two of MUMPS, as it is purpose designed for just such things. Be aware, though, that MUMPS is not really suited for using disk structures other than its own. Your existing data will have to be swept into MUMPS globals for it to be able to use them. Not so hard to do, really.
The chief problem might be to interface from Python to MUMPS; I suppose a bit of shell code or some such might be sufficient. *nixen will have lots of small tools hanging around which can help. If you're on Windows, you might consider the Cygwin.org package as a way to get some of those tools into your Windows box.
See Kevin O'Kane's MUMPS implementation (at the University of Northern Iowa) and Sanchez's MUMPS implementation (more traditional, as it includes the self-modifying 'feature' that has been part of MUMPS for a long time; O'Kane leaves it out). See the en.wikipedia.org article for additional pointers.
Please, no flames for violating language design safety conventions. MUMPS was first specified in 1966, so few of those who would respond with horror (oh, the horror!) had even been born then.