Net-Dredge : reverse mode search engine
Posted: 01/02/2010 by Ron Scheckelhoff
Search engine development : it ain't easy
Who in their right mind would attempt to design a search engine? Maybe someone who has grown tired of looking at the same few hundred items that are (with minor rotation) provided as the result-set for any given set of search-terms that have been submitted to most any of the major search engines?
The design of such an animal (a search engine) carries with it many more nuanced problem angles than any random initial designer (such as myself) would have guessed (at the onset of the process).
The old saw is "garbage in / garbage out", and you won't find a dump with more garbage in it than the internet.
The trick is to get a handle on the garbage, controlling the stream of sewage as it filters through a "treatment plant" on it's
way to the "collector" database. In the beginning, realizing that the jumbled up mass arriving at my collector port would
create big problems, I created two databases - one to service the collector(s) (called "zorb"), and one to serve out the
collected search engine
entries to clients (the search-data database, called "net-dredge").
By "garbage", I do not mean "foul language", or "r rated". Instead, I mean that the incoming data stream is so undisciplined, jumbled, non-standardized, and filled with dangerous character sequences in non-text "binary" representations that it boggles the mind.
An initial question on the design list was related to the determination of what language would be used to code the "filters"? As an answer to that question, consider the following code, which is part of the
script that handles the delivery of search-data results to clients:
lstTempKeys = self.getkeywords(strPostedData)
if (lstTempKeys != None):
if (len(lstTempKeys) > 0):
self.strSelectList = u"select distinct d.caipaddress, d.catitle, d.cadomainname, d.caurls,
d.cameaningfulwords, d.cngoodwordcount, d.cameta, d.caparag1, d.cnfkzorb1 "
self.intClockStart = time.clock()
for keyword in lstTempKeys:
self.strSearchWord = "'" + keyword + "'"
if (len(self.strSearchWord) > 0):
self.strSQLJoin += self.strSelectList + u" from domainnames d, searchwords s
where d.cnfkzorb1 = s.cnfk and s.caword = %s" % (self.strSearchWord)
if (intIndex < (len(lstTempKeys))):
if (len(self.strSQLJoin) > 1):
self.strSQLJoin += " intersect "
else:
self.strSQLJoin += " limit 100;"
curPgSQL2.execute(self.strSQLJoin)
lstRows = curPgSQL2.fetchall()
self.intClockEnd = time.clock()
Most readers will recognize the script in the snippet as Python. Why Python? Well, for starters, "C/C++" might take too long, and in any case I know Python better than Perl. Secondly, Python has lists. Thirdly, Python has lists. Lastly, with third party "just in time" (jit) python compilers I should be able to be on parity with other scripting languages or with Java. Python's threading and the application of some load balancing software should make for reasonable performance for a moderate volume engine. We could hope for so much. Did I mention Python lists?
Python lists rock. If you look through my other articles, you will find I repeat that phrase often. Look at the code snippet example. It takes incoming client search-terms in a list (made simple by other lists, not shown). In a short number of lines of code, it provides the resulting search engine response to the client (in the lstRows list). It's easy, simple and intuitive. In the seventeen lines that comprise the snippet, there are six references to lists. 'nuff said.
Dirty Fingers
So, we have defined a strategy to cover the first big hurdle in the process towards the development of a complete search engine.
We haven't mapped out a complete
solution, but we have gotten our fingers dirty. (Actually, the author's fingers were stained by his first attempt to create
a search engine
design (the fourcaloriespider project))
Strange SQL
Some may laugh at the "intersect" clause in the SQL statement that is buried within the snippet code shown (above).
Such persons would have experienced
even more boisterous laughing effects had they seen some of the precursor codings ....
The original version of the snippet used a correlated subquery which, expectedly, produced poor results when connected to the 17,000 web-site index (coupled with a search engine keyword-list table that probably contains around 100,000 unique values). I am using the (relatively) small database for initial tests, but of course it will be given additional bulk as time goes by ... The correlated subquery clocked in at (around 30 seconds), while the latest incarnation (using "intersect") brings back the query result in less than a hundred milli-seconds (most of the time).
Want a peek at the latest test search engine? You can see it at http://www.net-dredge.com or http://www.netdredge.com (couldn't decide which domain sounds better)
if
it happens to be running (during development it may be frequently up and down, as I add to it, break it, and then unbreak it :-) .
Another snippet, showing Python lists in action once again, is shown below:
def getkeywords(self,strKeys):
lstTempKeys = re.split(" ", strKeys)
lstTemp = []
for item in lstTempKeys:
if str(item) not in self.lstTrivial:
lstTemp.append(item)
return lstTemp
Yes, it could use more validation, but the beauty of the list is evident. Notice the lstTrivial list. Since both the collector and net-dredge back-end databases are PostgreSQL, there is no full-text search capability built into the databases. The trivial list lightens the load as such terms as (and,or,but,the etc) are removed. Of course this means that (at least for now) quoted string searches will not work.
A little code from the opposite side (the collector) :
Another snippet from within the "collector" scripts, pulls urls into the database. (Admittedly, this could be done more cleanly with regular expression syntax and the re (regular expression) object ...
def geturls(self, strData):
lstUrls = []
intBailOut = 10
strTemp = strData
intPos = 0
intPos2 = 0
intCount = 0
strHttp = "http://"
while (1):
intPos = strTemp.find(strHttp, intPos)
if (intPos != (-1)):
intPos += len(strHttp)
intEndPos = strTemp.find("/",intPos)
intEndPos2 = strTemp.find('"',intPos)
if ((intEndPos2 != (-1)) and (intEndPos2 < intEndPos)):
intEndPos = intEndPos2
if ((intEndPos - intPos) > 0):
lstUrls.append(strTemp[intPos:intEndPos])
intCount += 1
if (intCount > intBailOut):
break
else:
break
return (lstUrls)
The previous snippet serves to show Python lists in action, once more ... "lstUrls.append" is gorgeous stuff (especially with that sliced string-list "strTemp[:]" ).
In Python, strings are lists too. ... well, sort of ....
The following snippet can be found within one of the "collector" scripts, and it pulls domain names into the database.
curPgSQL.execute('''select cnpkzorblineitem, caipaddress
from zorb1
''')
lstRows = curPgSQL.fetchall()
if (lstRows != None):
for item in lstRows:
try:
self.strPKey = str(item[0])
self.strIPAddress = str(item[1])
self.strDomainName = self.resolveIPtoDomain(self.strIPAddress)
if (len(self.strDomainName) != 0):
self.strSQL = u"begin; insert into domainnames (caipaddress,cadomainname,cnfkzorb1) values ('%s' , '%s', %s ); commit;"
% (self.strIPAddress, self.strDomainName, self.strPKey)
curPgSQL.execute(self.strSQL)
except Exception, e:
print "Exception Error:", e
Another snippet from the "collector side" of the project, which is referenced by the domain name function ...
def resolveIPtoDomain(self, strRowIP):
try:
# Note gethostbyaddr returns tuple, first element 0 is host
strRowDomainName = str(gethostbyaddr(strRowIP)[0])
return strRowDomainName
except Exception, e:
return ""
The previous snippet gives insight into why the search engine is called a "reverse mode" search engine. The reverse IP (Dns resolved) domain
name is extracted by the function.
Culling through the Garbage ...
As one plays with raw internet (http server generated) data for a while, one can get a sense for it's outer parameters, and
generate some "rules of thumb" ... For instance,
to find 17,000 sites with index pages that actually contain standard-form html titles, one needs to scan 43,000 sites. A significant
number of sites do not even contain basic html page standards like <html>, <body>, and <head> tags.
A not small number of sites
contain only text, with no html encoding whatsoever! . Another oft-seen bit of crud that will arrive at the collector port
is the infamous "shell code", which is nothing but specially prepared pieces of binary data honed to break into http servers with
known exploit vulnerabilities . I guess the main "rule of thumb" is that all of them will be broken.
The Code Page nightmare ... In the U.S., we have always taken the arrogant approach to coding text characters. We assume that an "o" with an umlaut is just as good without. Some Germans take exception to our ideas about code pages, and how we have always thought they are unnecessary. Forunately, Python comes to the rescue on code page issues, providing a wide range of object modules and global functions to work around the international text issues. Believe me, such issues come to the fore when you are developing a search engine. Every conceivable away to encode text will appear at your collector port, and many times the server will provide no clue as to what it is supposed to be.
Theoretically, an http client may instruct the http server as to which code set is required. Many servers blythely ignore such requests. With some embarrassment, I must admit that the FourCalorieHttp server does the same. Someday, perhaps after I finish this search engine project, I may revisit that problem.
I guess my referral to "code pages" is a give-away for my age. Hey - old talk for an old man. Got a problem with that?
To be continued ...
Attribution: This site and it's authors have absolutely no affiliation with, and are not endorsed by FreeBSD, Debian, Microsoft, Sun, Adobe, Haiku-OS, or any other entities mentioned on this opinion page.
More info about Haiku operating system from the source at: http://www.haiku-os.org
More info about FreeBSD operating system from the source at: http://www.freebsd.org
More info on Sun's VirtualBox environment at: http://www.virtualbox.org
1 More info on Sun's Solaris 10 operating system at: http://www.sun.com
More info about Flash, as described in this editorial, at: http://www.adobe.com
Note: The use of any of the information on these pages is entirely at the user's risk. No guarantee of accuracy is suggested or implied, and this information has been collated primarily for the use of Datazygte staff. The information is what it is, it is "as is", and nothing more. No license is given for redistribution or commercial use, and no use beyond the satisfaction of intellectual curiosity is expected. Neither the company nor the individual contributors may be held liable for any direct, indirect, or consequential damages, however caused, and on any theory of liability, arising out of or as a result of contact with any information on this site ...