Difference between revisions of "Indexing Tools"

From ARC Wiki
When you are satisfied with the results on staging, you can index the RDF to the production index with the following command:
 
 
./run_indexer_production
 
 
This will perform a full text indexing to the production environment.
 
 
 

Revision as of 15:13, 9 October 2009

These command line scripts allow for the processing of MARC records and RDF documents. These scripts are located on jarry in the /usr/local/patacriticism/indexer directory. Examples of use in this document assume you are in this directory.

RDF Indexer (rdf_indexer.rb)

The RDF Indexer reads a directory of RDF files, parses them, and indexes their content into a specified SOLR index. It outputs a report of any parsing errors it encountered. This report is named 'report.txt' and can be found in the directory of the supplied RDF files after the script has run. A list of the links found is also written to a file called 'link_data.txt'. This file can be used as input to the link checker, described below.

The rdf_indexer script can be run directly but the safest way to execute it is with these two shortcuts:

./run_indexer or ./run_indexer_fulltext

These scripts act on whatever RDF is present in the indexer/rdf directory or its child directories.
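
To preview which files sit under the rdf directory and its child directories before starting a run, a short Ruby sketch like the one below may help. The helper name rdf_files is hypothetical, and matching the extension without regard to case is an assumption; the indexer's own file discovery may differ.

```ruby
# Hypothetical pre-flight helper, not part of the indexer itself.
# Lists every file under `root` (including child directories) whose
# extension is .rdf, compared case-insensitively (an assumption).
def rdf_files(root)
  Dir.glob(File.join(root, '**', '*'))
     .select { |path| File.file?(path) && File.extname(path).downcase == '.rdf' }
     .sort
end

# Example: print the candidate files when run from the indexer directory.
puts rdf_files('rdf') if __FILE__ == $PROGRAM_NAME
```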

Both of these scripts update only the staging environment; they do not affect the production environment. The first script runs the indexer but does not index the full text of documents for which the archive has provided a full text URL. Skipping this step makes indexing much faster, and the full text is not always needed. The second script runs the indexer and indexes the full texts provided.

Both of these scripts run in the background automatically and produce little output to the console. To follow their progress, execute the following command:

tail -f indexer.log

This follows the indexer log so you can watch the indexer work.

Strip Byte Order Marks (BOM)

From the indexer directory, run this script:

ruby script/remove_bom <input directory> <output directory>

Please be sure that the files all have a .RDF extension.
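
For readers unfamiliar with what is being removed: a UTF-8 byte order mark is the three bytes EF BB BF at the very start of a file. The sketch below illustrates the idea only; it is not the source of the remove_bom script, the name strip_bom is hypothetical, and UTF-16 or UTF-32 marks are deliberately not handled.

```ruby
# Illustrative sketch only; the real remove_bom script may work differently.
UTF8_BOM = "\xEF\xBB\xBF".b

# Returns the content as a binary string with a leading UTF-8 BOM removed.
# Content without a BOM is returned unchanged (as a binary copy).
def strip_bom(bytes)
  data = bytes.b
  data.start_with?(UTF8_BOM) ? data.byteslice(3..-1) : data
end
```

A file could be cleaned with File.binwrite(out_path, strip_bom(File.binread(in_path))), where in_path and out_path are placeholders.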

Link Checker (link_checker.rb)

Verifying the integrity of links provided by archives can be time-consuming, so it is kept separate from the workflow of indexing RDF files. The link checker performs this function. By default it takes the link_data.txt file as input, so if you have just run the rdf_indexer, all you need to do to check the links is run:

./run link_checker > link.report

This generates a file called 'link.report' which reports the HTTP response code for each link. To interpret these codes, refer to section 10 of the HTTP/1.1 specification (RFC 2616), hosted by the W3C: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
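
As a quick aid to reading the codes, the sketch below groups a status code into the five classes the HTTP specification defines by leading digit. The layout of link.report is not shown here, so the function takes a bare code rather than parsing the report; the name status_class is hypothetical.

```ruby
# Maps an HTTP status code (Integer or numeric String) to its class.
def status_class(code)
  case Integer(code.to_s, 10)
  when 100..199 then :informational
  when 200..299 then :success
  when 300..399 then :redirection
  when 400..499 then :client_error
  when 500..599 then :server_error
  else :unknown
  end
end
```

In practice, codes in the client_error and server_error classes are the ones most likely to indicate a broken link worth reporting back to the archive.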

To get a list of available command line options:

./run link_checker -h

   link_checker [options]
   -g, --get                        Use GET instead of HEAD to test URLs (HEAD default)
   -s, --slow                       Pause between links to prevent overloading the remote host.
   -f, --file [file]                location of the link data file. (link_data.txt default)
   -h, --help                       Show this usage statement

Tips: If you are seeing "Unable to Connect" errors, try the -g option. If the target server appears to be overloaded by the rate of requests, try the -s option. To run the link checker as a background process, append an '&' to the end of any of the commands. The background process will continue to run even if you log off:

./run link_checker > link.report &


MARC Tools (marc_tools.rb)

The MARC Tools script is two tools in one. It has a scanner mode, which scans the supplied MRC files and produces a report of their content, and an index mode, which indexes the MARC records into a specified SOLR index. By default, both modes look for MRC files in the indexer/marc/data directory. All MRC files found in this directory or its child directories will be processed.

Scanner Tool

The Scanner tool analyzes the provided MRC files and reports how well they map onto the genre mappings provided. The genre mappings are stored in indexer/script/lib/nines_mappings.rb. The script writes its report to stdout. The example below redirects the report to a file called report.txt.

script/marc_tools.rb -t scan > report.txt

Indexer Tool

The Indexer Tool will index the provided MRC files into the provided Solr index. Assuming the MRC files are in the indexer/marc/data directory, the following command will index them into NINES staging:

script/marc_tools.rb -t index

Another handy feature is the -v or verbose output option. Verbose mode will output a report of each document in MARC format and the resulting SOLR index document. This allows you to examine exactly what is being placed in the index. This feature is best used on a subset of the data at a time as it generates a tremendous amount of output. The following command:

script/marc_tools.rb -t index -v > index.report

Produces a text file called index.report with entries like this for each MARC Record:

  Marc Record
  ===========
  LEADER 02154cam  2200385   4500
  001 GLAD82
  005 20071012072901.0
  008 830810 1875    mau                 eng u
  010    $a    07003651//r 
  035    $a (CU)ocm02879262 
  903    $a 2 $b PS 01828 A1 1875 
  903    $a 2 $b F  00855.1 H327TA COPY 1 
  903    $a 2 $b F  00855.1 H327TA COPY 2 
  100 10 $a Harte, Bret, $d 1836-1902. 
  245 10 $a Tales of the Argonauts, $b and other sketches. $c By Bret Harte 
  260 0  $a Boston, $b J. R. Osgood and company, $c 1875 
  300    $a 2 p. l., 283 p. $c 19 cm 
  500    $a First edition. 
  505 0  $a The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went    home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian 
  510 4  $a BAL $c 7280 
  700 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC 
  752    $a United States $b Massachusetts $d Boston $9 (1875) 
  950    $l MAIN $s B 4 103 186  $z Main Stack   $a PS1828 $b .A1 1875 
  902    $a NRLF 
  950    $l BANC $s V 5 857  $z Bancroft     $d \x\ $a F855.1 $b .H327ta copy 1 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in purple cloth boards*uAcc. no. 32378\ $p Bookplate of Charles Awtood Kofoid\ 
  902    $a NRLF 
  950    $l BANC $s V 5 856  $z Bancroft     $d \x\ $a F855.1 $b .H327ta copy 2 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in red-brown cloth boards; author's inscription tipped in\ 
  902    $a NRLF 
  900    $a Bret-Gart, Frensis, $d 1836-1902 
  900    $a Gart, Bret, $d 1836-1902 
  900    $a Harte, Francis Bret, $d 1836-1902 
  900    $a Chart, Bret, $d 1836-1902 
  900    $a Harte, Bret, $d 1839-1902 
  954 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC 
  953    $a United States $b Massachusetts $d Boston $9 (1875) 
  Solr Document
  =============
  text: 2 PS 01828 A1 1875 2 F  00855.1 H327TA COPY 1 2 F  00855.1 H327TA COPY 2 Harte, Bret, 1836-1902. Tales of the Argonauts, and    other sketches. By Bret Harte Boston, J. R. Osgood and company, 1875 2 p. l., 283 p. 19 cm First edition. The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian BAL 7280 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875) MAIN B 4 103 186  Main Stack   PS1828 .A1 1875 NRLF BANC V 5 857  Bancroft     \x\ F855.1 .H327ta copy 1 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in purple cloth boards*uAcc. no. 32378\ Bookplate of Charles Awtood Kofoid\ NRLF BANC V 5 856  Bancroft     \x\ F855.1 .H327ta copy 2 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in red-brown cloth boards; author's inscription tipped in\ NRLF Bret-Gart, Frensis, 1836-1902 Gart, Bret, 1836-1902 Harte, Francis Bret, 1836-1902 Chart, Bret, 1836-1902 Harte, Bret, 1839-1902 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875) 
  role_AUT: Harte, Bret,
  date_label: 1875
  title: Tales of the Argonauts, and other sketches
  agent: Harte, Bret,
  archive: bancroft
  uri: lib://bancroft/GLAD82
  year: 1875
  batch: MARC-2007-12-18T00-00-00-05-00
  role_PBL: J. R. Osgood and company,
  genre: Citation
  type: A
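
Because verbose reports grow large, it can help to pull out just the Solr fields for inspection. The sketch below is one way to do that, assuming the report follows the layout of the sample above ("Marc Record" and "Solr Document" header lines, then "key: value" lines). The function name parse_solr_documents is hypothetical, and values are collected in arrays in case a field repeats.

```ruby
# Sketch: collect the "Solr Document" sections of a verbose report.
# Returns an array of hashes mapping field name => array of values.
def parse_solr_documents(report_text)
  docs = []
  current = nil
  report_text.each_line do |line|
    stripped = line.strip
    case stripped
    when 'Marc Record'
      current = nil                 # MARC lines are skipped
    when 'Solr Document'
      current = {}
      docs << current
    when /\A=+\z/, ''
      next                          # underline rows and blank lines
    else
      next if current.nil?
      # Split on the first ": " only, since values may contain colons.
      key, value = stripped.split(': ', 2)
      (current[key] ||= []) << value if value
    end
  end
  docs
end
```

For example, parse_solr_documents(File.read('index.report')).map { |d| d['uri'] } would list the uri values of the indexed records.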


For a list of commands:

script/marc_tools.rb -h

   Usage: marc_tools [options]
     -d, --dir [PATH]                 Location of .MRC files. Defaults to marc/data
     -x, --debug                      debug mode
     -v, --verbose                    turn on verbose logging
     -a, --archive [archive]          archive code for the indexed material
     -t, --tool [tool]                scan, index, or extract MARC data. Default is scan
     -u, --solr [URL]                 Specify a NINES Solr URL default is http://localhost:8989/solr
     -o, --ouput [filename]           target file to output extracted records. Default is extracted.mrc
     -h, --help                       Show this usage statement

When everything is working properly, index to Production. Be sure to specify the archive name, such as this command for indexing the Bancroft collection:

 script/marc_tools.rb -t index -a bancroft -u http://localhost:8983/solr 

Delete Archive (delete_archive.rb)

To remove an archive from a Solr index on staging, you can use the following command:

script/delete_archive.rb -a <archive name>

This will remove the archive from the Solr index but not from the Collex DB. Be sure to reset the Rails cache (see below) for your change to take effect.


Resetting the Rails Cache

When deploying to staging or production, it is now necessary to perform an extra step after indexing to ensure that the records appear properly on the web site. After indexing the RDF or MARC records:

In Staging Environment:

cd /usr/local/patacriticism/staging-web/current

rake tmp:cache:clear

In Production Environment:

cd /usr/local/patacriticism/production-web/current

rake tmp:cache:clear

In Dev Environment:

cd /var/www/apps/collex/current

rake tmp:cache:clear
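
The three variants above differ only in the directory, so they can be summarized in a small lookup. The sketch below only builds the shell command string; it does not run anything. The names WEB_ROOTS and cache_clear_command are hypothetical, while the paths and the rake task are the ones listed above.

```ruby
# Application directories per environment, as listed above.
WEB_ROOTS = {
  staging: '/usr/local/patacriticism/staging-web/current',
  production: '/usr/local/patacriticism/production-web/current',
  dev: '/var/www/apps/collex/current'
}.freeze

# Builds the shell command that clears the Rails cache for an environment.
def cache_clear_command(environment)
  root = WEB_ROOTS.fetch(environment) do
    raise ArgumentError, "unknown environment: #{environment.inspect}"
  end
  "cd #{root} && rake tmp:cache:clear"
end
```

Calling cache_clear_command(:staging) yields the two staging steps joined with &&, which can be pasted into a shell after indexing.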