Product Description - PhpDig Search Engine
HTTP spidering
PhpDig follows HREF links, as any web browser would, to find the pages to index. Links can also appear in image maps (AREA tags), in frames, or in simple JavaScript such as window.open() or window.location(). PhpDig supports redirections. It builds its index only by following links; it does not traverse directories or database tables to find content.
By default, PhpDig does not leave the domain you define for indexing. The user chooses various index options, including a parameter to extend indexing to subdomains and a parameter to limit indexing to a specific directory.
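The scope rules above can be illustrated with a minimal Python sketch. It is not PhpDig's own code (PhpDig is written in PHP); the function name in_scope and its parameters root_host, allow_subdomains, and path_prefix are hypothetical names chosen for this example.

```python
from urllib.parse import urlsplit


def in_scope(url, root_host, allow_subdomains=False, path_prefix="/"):
    """Return True if url falls inside the configured indexing scope.

    root_host, allow_subdomains and path_prefix are illustrative names,
    not PhpDig configuration keys.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    root_host = root_host.lower()

    same_host = host == root_host
    # A subdomain must end with ".<root_host>", so "notexample.com"
    # is not mistaken for a subdomain of "example.com".
    is_subdomain = host.endswith("." + root_host)
    if not (same_host or (allow_subdomains and is_subdomain)):
        return False

    # An empty path is treated as the site root.
    path = parts.path or "/"
    return path.startswith(path_prefix)
```

For example, in_scope("http://blog.example.com/", "example.com") is False unless allow_subdomains=True is passed, and path_prefix="/docs/" rejects any URL outside that directory.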
You can limit indexing so that the maximum number of links found is ((X * Y) + 1), where X is the links setting and Y is the depth setting. Alternatively, you can index just one page, or you can set options to index a greater number of pages.
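As a quick worked example of the limit formula, the following Python snippet simply evaluates ((X * Y) + 1). The function and parameter names are made up for illustration and do not correspond to PhpDig settings.

```python
def max_links(links, depth):
    """Evaluate the documented upper bound ((X * Y) + 1)."""
    return (links * depth) + 1


# With X = 5 links and Y = 3 levels of depth: (5 * 3) + 1 = 16
print(max_links(5, 3))
```

Arithmetically, the bound is 1 whenever either value is 0.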
Any HTML content is indexed, from static HTML pages to dynamic pages generated by, say, PHP scripts. PhpDig checks the MIME type of each document. Indexing can be automated with a cron job.
Full-text indexing
PhpDig indexes all words of a document, but you can exclude common words by listing them in a text file. Underscores and other characters can be part of a word. Words in the title can be given greater weight when ranking results.
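The idea of skipping common words and weighting title words can be sketched in Python as follows. This is an illustration under stated assumptions, not PhpDig's actual algorithm: the word pattern, the title_weight value of 3, and the function name index_words are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, digits, or underscores.
WORD_RE = re.compile(r"[A-Za-z0-9_]+")


def index_words(title, body, common_words, title_weight=3):
    """Return a Counter mapping each indexed word to its weight."""
    weights = Counter()
    for word in WORD_RE.findall(title.lower()):
        if word not in common_words:
            weights[word] += title_weight  # title words count more
    for word in WORD_RE.findall(body.lower()):
        if word not in common_words:
            weights[word] += 1
    return weights


common = {"the", "of"}  # would normally be loaded from a text file
weights = index_words("Hamlet the play", "the prince hamlet speaks of user_name", common)
print(weights["hamlet"])  # 3 from the title + 1 from the body = 4
```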
Note that the MySQL FULLTEXT index is different from the PhpDig full-text indexing. The MySQL FULLTEXT index is a table index used with MyISAM tables. PhpDig does full-text indexing of page content but does not use the MySQL FULLTEXT index for searches.
Indexed file types
PhpDig indexes HTML and text files by itself. It can also index PDF, MS-Word, MS-Excel, and MS-PowerPoint files if you install external binaries on the server for this purpose. PhpDig is configured to use the catdoc, xls2csv, pstotext or pdftotext, and ppt2text programs.
- You can find catdoc and xls2csv at this link: http://www.45.free.net/~vitus/ice/catdoc/
- You can find pstotext at this link: http://research.compaq.com/SRC/virtualpaper/pstotext.html
- You can find pdftotext at this link: http://public.planetmirror.com/pub/xpdf/
- You can query for ppt2text at this link: http://www.google.com/search?q=ppt2text
The author of PhpDig does not offer support for the binary programs. Contact the authors of those programs if you have trouble with compiling and/or installing them.
Of course, you can use other binary programs to extract text from PDF, MS-Word, MS-Excel, and MS-PowerPoint files.
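To make the external-binaries idea concrete, here is a small Python sketch that maps a file extension to a text-extraction command line. It is only an illustration: the EXTRACTORS table and build_command function are hypothetical and are not PhpDig configuration, and the sketch assumes that each program prints the extracted text to standard output when called as shown. Check each program's own documentation before relying on these arguments.

```python
import os

# Hypothetical mapping from file extension to an argv template.
# "{path}" is replaced by the document to convert.
EXTRACTORS = {
    ".doc": ["catdoc", "{path}"],
    ".xls": ["xls2csv", "{path}"],
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" asks pdftotext to write to stdout
}


def build_command(path):
    """Return the argv list for extracting text from path, or None."""
    ext = os.path.splitext(path)[1].lower()
    template = EXTRACTORS.get(ext)
    if template is None:
        return None  # no external program configured for this type
    return [arg.replace("{path}", path) for arg in template]


print(build_command("report.pdf"))
```

The resulting list could then be passed to subprocess.run(command, capture_output=True) on a server where the corresponding binary is installed.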
To demonstrate the external binaries feature, you can search Hamlet (tragedy, Shakespeare, from MS-Word format) or L'Avare (comedy, Molière, from PDF format).
Other features
PhpDig tries to read a robots.txt file at the server web root, and considers META robots tags too. The last-modified header value is stored in the database to avoid redundant indexing. Also, the META revisit-after tag is considered.
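For readers unfamiliar with robots.txt handling, the following Python sketch shows the general technique using the standard library's urllib.robotparser. It does not reproduce PhpDig's own parser; the user agent string "ExampleSpider" and the sample rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt, user_agent="ExampleSpider"):
    """Build a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() takes the lines of a robots.txt file already in memory,
    # so this sketch needs no network access.
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


rules = """User-agent: *
Disallow: /private/
"""
allowed = make_robot_checker(rules)
print(allowed("http://www.domain.com/index.html"))
print(allowed("http://www.domain.com/private/a.html"))
```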
PhpDig can spider sites served on a port other than the default 80, but spidering https:// sites on port 443 may meet with limited success. Sites that are password protected with a .htaccess file can be indexed if you give the robot a valid username and password, such as http://username:password@www.domain.com, but be careful!
This .htaccess related feature could let an unauthorized user read protected information, and the username and password are sent in plain text. It is recommended that you create a specific instance of PhpDig, protected by the same credentials as the restricted site, and index within the protected area.
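The plain-text warning can be illustrated with a short Python sketch. It assumes the protected site uses HTTP Basic authentication, in which the credentials embedded in the URL are sent in an Authorization header that is only Base64-encoded, not encrypted. The function name basic_auth_header is hypothetical.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(url):
    """Return the Basic Authorization header value for url, or None."""
    parts = urlsplit(url)
    if parts.username is None:
        return None  # no credentials embedded in the URL
    credentials = f"{parts.username}:{parts.password or ''}"
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("http://username:password@www.domain.com")
# Anyone who sees the header can reverse the encoding:
print(base64.b64decode(header.split()[1]).decode("utf-8"))
```

Because the encoding is trivially reversible, anyone able to observe the request over plain HTTP can recover the username and password.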
If desired, PhpDig can store the textual content of indexed documents in files. In this case, relevant extracts from found pages are displayed in the search results with highlighted search keys. Otherwise, a chunk of text as specified in the config file is stored in a database table.
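A minimal Python sketch of extracting a snippet around a search key and highlighting it is shown below. It is not how PhpDig implements the feature; the function name highlight_extract, the width parameter, and the choice of <b> tags for highlighting are assumptions made for the example.

```python
import re


def highlight_extract(text, keys, width=40):
    """Return a snippet around the first matching key, with keys in <b> tags."""
    if not keys:
        return text[: 2 * width]
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text[: 2 * width]  # fall back to the start of the text
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    snippet = text[start:end]
    # Wrap every occurrence of a key inside the snippet.
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", snippet)


print(highlight_extract("To be, or not to be, that is the question.", ["question"]))
```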
Display templates
PhpDig comes with a template system that lets the search page fit into the look of an existing site. Making a template consists only of inserting a few XML-like tags into an HTML page. See the templates that came with PhpDig for examples. Also, see section 8 for further information about different templating options, and see section 9 for how to insert PhpDig into a website.
Limits
Because indexing is time consuming, PHP must not run in safe_mode and the server that performs the indexing must not time out. The PHP allow_url_fopen option must also be enabled. These restrictions do not affect search queries.
You can try to circumvent safe_mode, should it be enabled, by a) using distant indexing with MySQL TCP connection and FTP connection, or b) launching the indexing process in a shell command such as a cron job.
Spidering and indexing are a bit slow, as a decent amount of processing is needed to index pages. On the other hand, search queries are fast enough, even in a somewhat extended context. You may find that indexing from a shell, for example via cron, makes the process somewhat faster.