Indexing Tools
These command-line scripts process MARC records and RDF documents. They are located on jarry in the /usr/local/patacriticism/indexer directory. The examples in this document assume that this is your working directory.
RDF Indexer (rdf_indexer.rb)
The RDF Indexer takes a directory of RDF files, parses them, and indexes their content into a specified Solr index. It writes a report of any parsing errors to a file named 'report.txt', which can be found in the directory of the supplied RDF files after the script has run. It also writes a list of links to a file called 'link_data.txt', which can be used as input to the link checker, described below.
The rdf_indexer script can be run directly but the safest way to execute it is with these two shortcuts:
./run_indexer or ./run_indexer_fulltext
These scripts act on whatever RDF is present in the indexer/rdf directory or its child directories.
Both of these scripts update only the staging environment; they do not affect the production environment. The first script runs the indexer but does not index the full text of documents for which the archive has provided a full text URL. Indexing is much faster without this step, and the step is not always necessary. The second script runs the indexer and also indexes the full texts provided.
Both of these scripts run in the background automatically and produce little output to the console. To follow their progress, execute the following command:
tail -f indexer.log
This prints new entries from the indexer log as they are written, so you can watch the indexer work.
When you are satisfied with the results on staging, you can index the RDF to the production index with the following command:
./run_indexer_production
This performs a full-text index into the production environment.
Strip Byte Order Marks (BOM)
From the indexer directory, run this script:
ruby script/remove_bom <input directory> <output directory>
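For readers unfamiliar with byte order marks, the following is a minimal sketch of what stripping a UTF-8 BOM involves. It is not the source of script/remove_bom; the names strip_bom and strip_bom_in_dir are hypothetical, and it assumes the files carry the UTF-8 BOM (bytes EF BB BF) rather than a UTF-16 or UTF-32 one.

```ruby
require "fileutils"

# UTF-8 byte order mark: the three bytes EF BB BF at the very start of a file.
UTF8_BOM = "\xEF\xBB\xBF".b

# Returns a binary copy of +bytes+ without a leading UTF-8 BOM.
# Input that has no BOM is returned unchanged.
def strip_bom(bytes)
  data = bytes.b
  data.start_with?(UTF8_BOM) ? data.byteslice(UTF8_BOM.bytesize..-1) : data
end

# Hypothetical helper: copies every regular file in input_dir to output_dir,
# removing a leading BOM along the way. Subdirectories are not visited.
def strip_bom_in_dir(input_dir, output_dir)
  FileUtils.mkdir_p(output_dir)
  Dir.glob(File.join(input_dir, "*")).each do |path|
    next unless File.file?(path)
    File.binwrite(File.join(output_dir, File.basename(path)),
                  strip_bom(File.binread(path)))
  end
end
```

Working on raw bytes rather than decoded text keeps the sketch independent of each file's declared encoding.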
Link Checker (link_checker.rb)
Verifying the integrity of the links provided by archives can be time-consuming, so it is kept separate from the workflow of indexing RDF files. The link checker performs this function. By default it takes the link_data.txt file as input, so if you have just run the rdf_indexer, all you need to do to check the links is run:
./run link_checker > link.report
This generates a file called 'link.report', which lists the HTTP response codes for the links. To interpret these codes, refer to the status code definitions in RFC 2616, hosted by the W3C (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).
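As a quick aid to reading the codes, the sketch below maps a status code to the five response classes named in section 10 of RFC 2616. The function name http_status_class is hypothetical and is not part of the link checker, and the layout of link.report itself is not assumed here.

```ruby
# Maps an HTTP status code (Integer or String) to its RFC 2616 response class.
# Anything that is not a three-digit code in the 1xx-5xx range is "Unknown".
def http_status_class(code)
  text = code.to_s.strip
  return "Unknown" unless text.match?(/\A\d{3}\z/)

  case text.to_i / 100
  when 1 then "Informational"
  when 2 then "Successful"
  when 3 then "Redirection"
  when 4 then "Client Error"
  when 5 then "Server Error"
  else "Unknown"
  end
end

puts http_status_class(404)
```

In practice, 2xx codes indicate a working link, 3xx codes a link that has moved, and 4xx or 5xx codes a link that needs attention.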
To get a list of available command line options:
./run link_checker -h
link_checker [options]
    -g, --get          Use GET instead of HEAD to test URLs (HEAD default)
    -s, --slow         Pause between links to prevent overloading the remote host.
    -f, --file [file]  location of the link data file. (link_data.txt default)
    -h, --help         Show this usage statement
Tips:
If you are seeing "Unable to Connect" errors, try the -g option.
If the target server appears to be overloaded by the rate of requests, try the -s option.
To run the link checker as a background process, append an '&' to the end of any of these commands. The background process will continue to run even if you log off:
./run link_checker > link.report &
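The difference between the HEAD and GET modes can be sketched with Ruby's standard Net::HTTP library. This is an illustration under stated assumptions, not the code of link_checker.rb: the names build_request and check_link, the pause parameter, and the text returned on failure are all hypothetical.

```ruby
require "net/http"
require "uri"

# Builds a HEAD request by default, or a GET request when use_get is true.
# HEAD asks only for headers, which is cheaper, but some servers reject it.
def build_request(url, use_get: false)
  uri = URI.parse(url)
  request_class = use_get ? Net::HTTP::Get : Net::HTTP::Head
  request_class.new(uri.request_uri)
end

# Sends the request and returns the status code as a string, e.g. "200".
# An optional pause (in seconds) throttles successive checks.
def check_link(url, use_get: false, pause: 0)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https") do |http|
    http.request(build_request(url, use_get: use_get))
  end
  sleep(pause) if pause > 0
  response.code
rescue StandardError => e
  # Assumed failure text; the real tool's wording may differ.
  "Unable to Connect (#{e.class})"
end
```

A call such as check_link("http://example.org/", use_get: true, pause: 1) would correspond roughly to combining the -g and -s options.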
MARC Tools (marc_tools.rb)
The MARC Tools script is two tools in one. It has a scanner mode, which scans the supplied MRC files and produces a report of their content, and an index mode, which indexes the MARC records into a specified Solr index. By default, both modes look for MRC files in the indexer/marc/data directory. All MRC files found in this directory or its child directories are processed.
Scanner Tool
The Scanner tool analyzes the provided MRC files and reports how well they map onto the genre mappings provided. The genre mappings are stored in indexer/script/lib/nines_mappings.rb. The script writes its report to stdout. The example below redirects the report to a file called report.txt.
script/marc_tools.rb -t scan > report.txt
Indexer Tool
The Indexer Tool will index the provided MRC files into the provided Solr index. Assuming the MRC files are in the indexer/marc/data directory, the following command will index them into NINES staging:
script/marc_tools.rb -t index
Another handy feature is the -v (verbose) option. Verbose mode outputs each document in MARC format along with the resulting Solr index document, so you can examine exactly what is being placed in the index. It is best used on a subset of the data at a time, as it generates a tremendous amount of output. The following command:
script/marc_tools.rb -t index -v > index.report
produces a text file called index.report with an entry like this for each MARC record:
Marc Record
===========
LEADER 02154cam 2200385 4500
001 GLAD82
005 20071012072901.0
008 830810 1875 mau eng u
010 $a 07003651//r
035 $a (CU)ocm02879262
903 $a 2 $b PS 01828 A1 1875
903 $a 2 $b F 00855.1 H327TA COPY 1
903 $a 2 $b F 00855.1 H327TA COPY 2
100 10 $a Harte, Bret, $d 1836-1902.
245 10 $a Tales of the Argonauts, $b and other sketches. $c By Bret Harte
260 0 $a Boston, $b J. R. Osgood and company, $c 1875
300 $a 2 p. l., 283 p. $c 19 cm
500 $a First edition.
505 0 $a The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian
510 4 $a BAL $c 7280
700 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC
752 $a United States $b Massachusetts $d Boston $9 (1875)
950 $l MAIN $s B 4 103 186 $z Main Stack $a PS1828 $b .A1 1875
902 $a NRLF
950 $l BANC $s V 5 857 $z Bancroft $d \x\ $a F855.1 $b .H327ta copy 1 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in purple cloth boards*uAcc. no. 32378\ $p Bookplate of Charles Awtood Kofoid\
902 $a NRLF
950 $l BANC $s V 5 856 $z Bancroft $d \x\ $a F855.1 $b .H327ta copy 2 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in red-brown cloth boards; author's inscription tipped in\
902 $a NRLF
900 $a Bret-Gart, Frensis, $d 1836-1902
900 $a Gart, Bret, $d 1836-1902
900 $a Harte, Francis Bret, $d 1836-1902
900 $a Chart, Bret, $d 1836-1902
900 $a Harte, Bret, $d 1839-1902
954 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC
953 $a United States $b Massachusetts $d Boston $9 (1875)
Solr Document
=============
text: 2 PS 01828 A1 1875 2 F 00855.1 H327TA COPY 1 2 F 00855.1 H327TA COPY 2 Harte, Bret, 1836-1902. Tales of the Argonauts, and other sketches. By Bret Harte Boston, J. R. Osgood and company, 1875 2 p. l., 283 p. 19 cm First edition. The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian BAL 7280 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875) MAIN B 4 103 186 Main Stack PS1828 .A1 1875 NRLF BANC V 5 857 Bancroft \x\ F855.1 .H327ta copy 1 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in purple cloth boards*uAcc. no. 32378\ Bookplate of Charles Awtood Kofoid\ NRLF BANC V 5 856 Bancroft \x\ F855.1 .H327ta copy 2 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in red-brown cloth boards; author's inscription tipped in\ NRLF Bret-Gart, Frensis, 1836-1902 Gart, Bret, 1836-1902 Harte, Francis Bret, 1836-1902 Chart, Bret, 1836-1902 Harte, Bret, 1839-1902 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875)
role_AUT: Harte, Bret,
date_label: 1875
title: Tales of the Argonauts, and other sketches
agent: Harte, Bret,
archive: bancroft
uri: lib://bancroft/GLAD82
year: 1875
batch: MARC-2007-12-18T00-00-00-05-00
role_PBL: J. R. Osgood and company,
genre: Citation
type: A
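Comparing the two halves of the entry, the Solr 'text' field reads like the MARC field contents with the subfield codes ($a, $d, $4 and so on) removed. The sketch below reproduces that flattening for a single field. It is inferred from the sample output only and is not taken from marc_tools.rb; the names subfield_values and flatten_field are hypothetical, and the rules for which fields are included in 'text' are not modeled.

```ruby
# Splits the body of a MARC field (the part after the tag and indicators)
# on subfield codes such as "$a" or "$4" and returns the bare values.
def subfield_values(field_body)
  field_body.split(/\s*\$[a-z0-9]\s+/).map(&:strip).reject(&:empty?)
end

# Joins the subfield values with single spaces, as in the sample 'text' field.
def flatten_field(field_body)
  subfield_values(field_body).join(" ")
end

puts flatten_field("$a Harte, Bret, $d 1836-1902.")
```

Checking a few fields this way against the verbose report is a quick method of confirming that a record's content reached the index intact.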
For a list of commands:
script/marc_tools.rb -h
Usage: marc_tools [options]
    -d, --dir [PATH]          Location of .MRC files. Defaults to marc/data
    -x, --debug               debug mode
    -v, --verbose             turn on verbose logging
    -a, --archive [archive]   archive code for the indexed material
    -t, --tool [tool]         scan, index, or extract MARC data. Default is scan
    -u, --solr [URL]          Specify a NINES Solr URL default is http://localhost:8989/solr
    -o, --ouput [filename]    target file to output extracted records. Default is extracted.mrc
    -h, --help                Show this usage statement
When everything is working properly, index to production. Be sure to specify the archive name, as in this command for indexing the Bancroft collection:
script/marc_tools.rb -t index -a bancroft -u http://localhost:8983/solr
Delete Archive (delete_archive.rb)
To remove an archive from a Solr index on staging, you can use the following command:
script/delete_archive.rb -a <archive name>
This removes the archive from the Solr index but not from the Collex DB. Be sure to reset the Rails cache to see your change take effect (see Resetting the Rails Cache below).
Resetting the Rails Cache
When deploying to staging or production, it is now necessary to perform an extra step after indexing to ensure that the records appear properly on the web site. After indexing the RDF or MARC records:
In Staging Environment:
cd /usr/local/patacriticism/staging-web/current
rake tmp:cache:clear
In Production Environment:
cd /usr/local/patacriticism/production-web/current
rake tmp:cache:clear
In Dev Environment:
cd /var/www/apps/collex/current
rake tmp:cache:clear
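If the cache reset is run often, the three environments can be captured in a small lookup. The sketch below only builds the command string from the directories listed above and does not execute anything. The names APP_DIRS and cache_clear_command are hypothetical.

```ruby
# Application directories per environment, as listed in this document.
APP_DIRS = {
  "staging"    => "/usr/local/patacriticism/staging-web/current",
  "production" => "/usr/local/patacriticism/production-web/current",
  "dev"        => "/var/www/apps/collex/current"
}.freeze

# Returns the shell command that clears the Rails cache for an environment.
# Raises ArgumentError for an environment not listed in APP_DIRS.
def cache_clear_command(env)
  dir = APP_DIRS.fetch(env) { raise ArgumentError, "unknown environment: #{env}" }
  "cd #{dir} && rake tmp:cache:clear"
end

puts cache_clear_command("staging")
```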