Indexing Tools

From ARC Wiki
These command-line scripts process MARC records and RDF documents. They are located on jarry in the /usr/local/patacriticism/indexer directory; the examples in this document assume you are working from that directory.
  
== RDF Indexer (rdf_indexer.rb) ==
The RDF Indexer works on a directory containing RDF files, parses them, and indexes their content into a specified Solr index. It outputs a report of any parsing errors encountered; this report, named 'report.txt', is written to the directory of the supplied RDF files after the script runs. A schedule of links is also output to a file called 'link_data.txt', which can be fed to the link checker described below.
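The core loop is straightforward: walk the RDF directory, attempt to parse each file, and collect failures into report.txt. Below is a minimal sketch of that error-reporting pass using Ruby's bundled REXML parser; the real rdf_indexer's parsing and Solr submission are more involved, so treat this as an illustration only.

```ruby
require 'rexml/document'

# Walk a directory of RDF files, attempt to parse each one, and collect
# failures into report.txt -- a sketch of the indexer's error-reporting
# pass (using REXML as a stand-in parser), not the real rdf_indexer code.
def collect_parse_errors(rdf_dir)
  errors = {}
  Dir.glob(File.join(rdf_dir, '**', '*.rdf')).sort.each do |path|
    begin
      REXML::Document.new(File.read(path))
    rescue REXML::ParseException => e
      errors[path] = e.message.lines.first.to_s.strip
    end
  end
  # The indexer writes its findings to report.txt in the RDF directory.
  File.write(File.join(rdf_dir, 'report.txt'),
             errors.map { |path, msg| "#{path}: #{msg}" }.join("\n"))
  errors
end
```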
 
  
The rdf_indexer script can be run directly but the safest way to execute it is with these two shortcuts:
  
'''./run_indexer'''
or
'''./run_indexer_fulltext'''
  
These scripts act on whatever RDF is present in the indexer/rdf directory or its child directories.
  
Both of these scripts update only the staging environment; they do not affect production. The first runs the indexer but skips the full text of documents for which the archive has provided a full-text URL; this is advantageous because indexing is much faster without that step, and the step is not always necessary. The second runs the indexer and also indexes the full texts provided.
  
Both of these scripts run in the background automatically and produce little output to the console. To follow their progress, execute the following command:
 
  
'''tail -f indexer.log'''
This follows the indexer log so you can watch its progress.
When you are satisfied with the results on staging, you can index the RDF to the production index with the following command:
'''./run_indexer_production'''
This performs a full-text indexing run against the production environment.
== Link Checker (link_checker.rb) ==
The process of verifying the integrity of links provided by archives can be time-intensive and is therefore separated from the workflow of indexing RDF files; the link checker performs this function. By default it takes the link_data.txt file as input, so if you have just run the rdf_indexer, all you need to do to check the links is run:
'''./run link_checker > link.report'''
 
 
This generates a file called 'link.report', which lists the HTTP response code for each link. To interpret these codes, refer to the HTTP/1.1 specification (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).
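As a rough guide, the status classes you will see in link.report break down as follows; a small Ruby helper (hypothetical, not part of link_checker.rb) that captures the interpretation:

```ruby
# Rough interpretation of the HTTP status classes reported in link.report.
# Hypothetical helper for illustration, not part of link_checker.rb.
def interpret_status(code)
  case code.to_i
  when 200..299 then 'success: link is good'
  when 300..399 then 'redirect: link works but points elsewhere'
  when 400..499 then 'client error: link is likely broken (e.g. 404 Not Found)'
  when 500..599 then 'server error: remote host failed; retry later'
  else               'unrecognized status'
  end
end
```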
 
 
 
To get a list of available command line options:
 
 
'''./run link_checker -h'''
 
 
 
    link_checker [options]
 
    -g, --get                        Use GET instead of HEAD to test URLs (HEAD default)
 
    -s, --slow                      Pause between links to prevent overloading the remote host.
 
    -f, --file [file]                location of the link data file. (link_data.txt default)
 
    -h, --help                      Show this usage statement
 
 
 
Tips: If you are seeing "Unable to Connect" errors, try the -g option. If the target server appears to be overloaded by the rate of requests, try the -s option. To run the link checker as a background process, append an '&' to the end of the command; the background process will continue to run even if you log off:
 
 
 
'''bash /usr/local/jruby-1.0.2/bin/jruby script/link_checker.rb > link.report &'''
 
 
 
 
 
== MARC Tools (marc_tools.rb) ==
 
 
 
The MARC Tools script is two tools in one: a scanner mode, which scans the supplied MRC files and produces a report of their content, and an index mode, which indexes the MARC records into a specified Solr index. By default, both modes look for MRC files in the indexer/marc/data directory; all MRC files found in this directory or its child directories will be processed.
 
 
 
=== Scanner Tool ===
 
The Scanner tool analyzes the provided MRC files and reports how well they map onto the supplied genre mappings. The genre mappings are stored in indexer/script/lib/nines_mappings.rb. The report is written to stdout; the example below redirects it to a file called report.txt.
 
 
 
'''script/marc_tools.rb -t scan > report.txt'''
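Conceptually, the mappings are a lookup table from MARC genre terms to the controlled genre list, and the scan checks each record's terms against it. The entries below are made-up examples — the real table lives in indexer/script/lib/nines_mappings.rb:

```ruby
# Conceptual shape of the genre-mapping table the scanner checks against.
# These entries are illustrative only; the real table is in
# indexer/script/lib/nines_mappings.rb.
GENRE_MAPPINGS = {
  'fiction' => 'Fiction',
  'poetry'  => 'Poetry',
  'letters' => 'Correspondence'
}.freeze

# Report which of the supplied MARC genre terms have a mapping.
def scan_genres(terms)
  mapped, unmapped = terms.partition { |t| GENRE_MAPPINGS.key?(t.downcase) }
  { mapped: mapped, unmapped: unmapped }
end
```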
 
 
 
=== Indexer Tool ===
 
The Indexer Tool will index the provided MRC files into the provided Solr index. Assuming the MRC files are in the indexer/marc/data directory, the following command will index them into NINES staging:
 
 
 
'''script/marc_tools.rb -t index'''
 
 
 
Another handy feature is the -v (verbose) option. Verbose mode outputs a report of each document in MARC format along with the resulting Solr index document, letting you examine exactly what is being placed in the index. Because it generates a tremendous amount of output, this feature is best used on a subset of the data at a time. The following command:
 
 
 
'''script/marc_tools.rb -t index -v > index.report'''
 
 
 
Produces a text file called index.report with an entry like this for each MARC record:
 
 
 
  Marc Record
 
  ===========
 
  LEADER 02154cam  2200385  4500
 
  001 GLAD82
 
  005 20071012072901.0
 
  008 830810 1875    mau                eng u
 
  010    $a    07003651//r
 
  035    $a (CU)ocm02879262
 
  903    $a 2 $b PS 01828 A1 1875
 
  903    $a 2 $b F  00855.1 H327TA COPY 1
 
  903    $a 2 $b F  00855.1 H327TA COPY 2
 
  100 10 $a Harte, Bret, $d 1836-1902.
 
  245 10 $a Tales of the Argonauts, $b and other sketches. $c By Bret Harte
 
  260 0  $a Boston, $b J. R. Osgood and company, $c 1875
 
  300    $a 2 p. l., 283 p. $c 19 cm
 
  500    $a First edition.
 
  505 0  $a The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went    home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian
 
  510 4  $a BAL $c 7280
 
  700 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC
 
  752    $a United States $b Massachusetts $d Boston $9 (1875)
 
  950    $l MAIN $s B 4 103 186  $z Main Stack  $a PS1828 $b .A1 1875
 
  902    $a NRLF
 
  950    $l BANC $s V 5 857  $z Bancroft    $d \x\ $a F855.1 $b .H327ta copy 1 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in purple cloth boards*uAcc. no. 32378\ $p Bookplate of Charles Awtood Kofoid\
 
  902    $a NRLF
 
  950    $l BANC $s V 5 856  $z Bancroft    $d \x\ $a F855.1 $b .H327ta copy 2 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in red-brown cloth boards; author's inscription tipped in\
 
  902    $a NRLF
 
  900    $a Bret-Gart, Frensis, $d 1836-1902
 
  900    $a Gart, Bret, $d 1836-1902
 
  900    $a Harte, Francis Bret, $d 1836-1902
 
  900    $a Chart, Bret, $d 1836-1902
 
  900    $a Harte, Bret, $d 1839-1902
 
  954 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC
 
  953    $a United States $b Massachusetts $d Boston $9 (1875)
 
 
 
  Solr Document
 
  =============
 
  text: 2 PS 01828 A1 1875 2 F  00855.1 H327TA COPY 1 2 F  00855.1 H327TA COPY 2 Harte, Bret, 1836-1902. Tales of the Argonauts, and    other sketches. By Bret Harte Boston, J. R. Osgood and company, 1875 2 p. l., 283 p. 19 cm First edition. The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian BAL 7280 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875) MAIN B 4 103 186  Main Stack  PS1828 .A1 1875 NRLF BANC V 5 857  Bancroft    \x\ F855.1 .H327ta copy 1 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in purple cloth boards*uAcc. no. 32378\ Bookplate of Charles Awtood Kofoid\ NRLF BANC V 5 856  Bancroft    \x\ F855.1 .H327ta copy 2 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in red-brown cloth boards; author's inscription tipped in\ NRLF Bret-Gart, Frensis, 1836-1902 Gart, Bret, 1836-1902 Harte, Francis Bret, 1836-1902 Chart, Bret, 1836-1902 Harte, Bret, 1839-1902 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875)
 
  role_AUT: Harte, Bret,
 
  date_label: 1875
 
  title: Tales of the Argonauts, and other sketches
 
  agent: Harte, Bret,
 
  archive: bancroft
 
  uri: lib://bancroft/GLAD82
 
  year: 1875
 
  batch: MARC-2007-12-18T00-00-00-05-00
 
  role_PBL: J. R. Osgood and company,
 
  genre: Citation
 
  type: A
 
 
 
 
 
For a list of commands:
 
 
 
'''script/marc_tools.rb -h'''
 
 
 
    Usage: marc_tools [options]
 
      -d, --dir [PATH]                Location of .MRC files. Defaults to marc/data
 
      -x, --debug                      debug mode
 
      -v, --verbose                    turn on verbose logging
 
      -a, --archive [archive]          archive code for the indexed material
 
      -t, --tool [tool]                scan, index, or extract MARC data. Default is scan
 
      -u, --solr [URL]                Specify a NINES Solr URL default is http://localhost:8989/solr
 
      -o, --ouput [filename]          target file to output extracted records. Default is extracted.mrc
 
      -h, --help                      Show this usage statement
 
 
 
When everything is working properly, index to production. Be sure to specify the archive name; for example, this command indexes the Bancroft collection:
 
 
 
<pre> script/marc_tools.rb -t index -a bancroft -u http://localhost:8983/solr </pre>
 
 
 
== Delete Archive (delete_archive.rb) ==
 
 
 
To remove an archive from a Solr index on staging, you can use the following command:
 
 
 
'''script/delete_archive.rb -a <archive name>'''
 
 
 
This removes the archive from the Solr index but not from the Collex DB. Be sure to reset the Rails cache to see your change take effect (see below).
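Under the hood, removing an archive from Solr amounts to a delete-by-query on the archive field, followed by a commit. The update message the script sends is presumably along these lines — this is the standard Solr delete-by-query format, not delete_archive.rb's actual code, and the 'archive' field name is taken from the Solr documents shown above:

```ruby
# Build the standard Solr delete-by-query update message for one archive.
# Sketch of what delete_archive.rb presumably sends to the Solr update
# handler; the 'archive' field matches the indexed documents shown above.
def delete_archive_message(archive_name)
  "<delete><query>archive:#{archive_name}</query></delete>"
end
```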
 
 
 
 
 
== Resetting the Rails Cache ==
 
 
 
When deploying to staging or production, an extra step is now necessary after indexing to ensure that the records appear properly on the web site. After indexing the RDF or MARC records:
 
 
 
In Staging Environment:
 
 
 
'''cd /usr/local/patacriticism/staging-web/current'''
 
 
 
'''rake tmp:cache:clear'''
 
 
 
In Production Environment:
 
 
 
'''cd /usr/local/patacriticism/production-web/current'''
 
 
 
'''rake tmp:cache:clear'''
 

Latest revision as of 19:09, 21 March 2014

As of summer 2013, previous versions of this indexing tools page are officially out of date. As ARC has moved from SVN to Git, we have a new indexing workflow that involves GitLab.

Below is a generalized sketch of the indexing workflow that all ARC nodes currently use. For more information on the indexing process, please see the [https://github.com/collex Collex GitHub code repository].

== Prerequisites ==

Make sure you have access to or have installed the following programs:

#Terminal or Command Prompt
#SourceTree (http://www.sourcetreeapp.com)
#Oxygen (http://www.oxygenxml.com)
#An SSH key for GitLab (run "ssh-keygen" on your computer from the command prompt)
#Install git (http://git-scm.com/book/en/Getting-Started-Installing-Git)
##Have Homebrew? Just run: brew install git

Open the following programs to begin indexing: Terminal, SourceTree (bookmarks view), and Oxygen.

Want to know what your SSH key is?

cat ~/.ssh/id_rsa.pub


== Indexing Workflow ==

The overall steps for reindexing a resource in ARC are:

#Get the new RDF into a folder on your local computer
##Via email
##Via the staging server
#Get a copy of the old RDF from GitLab on your local computer
##If this is a new archive
##If this is an existing archive which you don't already have on your local computer
##If this is an existing archive which you do have on your local computer
#Compare the new RDF to the old RDF and overwrite old with new using oXygen
#Commit the changes and push from your local computer to GitLab
#Pull the changes from GitLab to the ARC staging index
#Test the changes on the ARC staging index
##(Harvest text, if any)
#Push the changes from the staging index to the staging site
#Push the changes from the staging site to the production site
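Steps 2, 4, and 5 above are ordinary git operations. Assuming each archive's RDF lives in its own GitLab repository (the URL and directory names below are placeholders, not real ARC locations), the git legwork can be sketched as:

```ruby
require 'fileutils'

# Sketch of the git legwork behind steps 2, 4, and 5 above, assuming each
# archive's RDF lives in its own GitLab repository. All arguments are
# placeholders, not real ARC locations.
def update_archive_rdf(repo_url, checkout_dir, new_rdf_dir)
  if Dir.exist?(File.join(checkout_dir, '.git'))
    # Existing archive already on this machine: just update it.
    system('git', '-C', checkout_dir, 'pull', '-q', exception: true)
  else
    # New archive (or new machine): clone it from GitLab.
    system('git', 'clone', '-q', repo_url, checkout_dir, exception: true)
  end
  # Overwrite the old RDF with the new. (Step 3, the comparison, is done by
  # hand in oXygen; a blind copy stands in for it here.)
  FileUtils.cp_r(Dir.glob(File.join(new_rdf_dir, '*.rdf')), checkout_dir)
  system('git', '-C', checkout_dir, 'add', '-A', exception: true)
  system('git', '-C', checkout_dir, 'commit', '-q', '-m', 'Update RDF', exception: true)
  system('git', '-C', checkout_dir, 'push', '-q', exception: true)
end
```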