To create objects in Collex, you need a set of source files. Currently, RDF and MARC format files can be used to create indexes. Most indexing tasks can be handled from the command line in the collex/web folder. There is an older project in the SVN repository called "indexer" that is mostly deprecated. TODO: move the remaining useful pieces into Collex so "indexer" isn't needed at all.
As of summer 2013, previous versions of this indexing tools page are officially out of date. As ARC has moved from SVN to Git, we have a new indexing workflow that involves GitLab.
  
== Solr architecture for indexing ==
Below is a generalized sketch of the workflow for indexing that all ARC nodes currently use. For more information on the indexing process, please see the [https://github.com/collex Collex GitHub code repository].
  
Normally, Collex uses a Solr index named "resources". On the indexing machine, additional indexes are created: one for each archive, and one called "merged" that contains all of the individual archives. On the indexing machine, "resources" is the reference index; the indexing tasks create and modify the "archive_*" indexes. The testing tasks generally compare the resources index with all of the archive_* indexes and report the differences, so you know exactly what effect your indexing work has. When you are satisfied, you merge all of the archives into the merged index and deploy it.
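For example, you can check which of these indexes (Solr cores) exist on the indexing machine through Solr's core admin handler. This is only an illustrative sketch; the host and port (localhost:8983) are assumptions and may differ on your installation.

<pre>
# List the Solr cores on the indexing machine (host and port are assumptions):
curl 'http://localhost:8983/solr/admin/cores?action=STATUS'
# The response should list cores such as "resources", "merged", and the "archive_*" cores.
</pre>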

== Prerequisites ==

<b>Make sure you have access to or have installed the following programs</b>:
#Terminal or Command Prompt
#SourceTree (http://www.sourcetreeapp.com)
#Oxygen (http://www.oxygenxml.com)
#An SSH key for GitLab (run "ssh-keygen" on your computer from the command prompt)
#Git (http://git-scm.com/book/en/Getting-Started-Installing-Git)
##Have Homebrew? Just run: brew install git

Open the following programs to begin indexing: Terminal, SourceTree (bookmarks view), and Oxygen.

<b>Want to know what your SSH key is?</b>
:'''cat ~/.ssh/id_rsa.pub'''

== Indexing Workflow ==

<b>The overall steps for reindexing a resource in ARC are:</b>
#Get the new RDF into a folder on your local computer
##Via email
##Via the staging server
#Get a copy of the old RDF from GitLab onto your local computer
##If this is a new archive
##If this is an existing archive that you don't already have on your local computer
##If this is an existing archive that you do have on your local computer
#Compare the new RDF to the old RDF and overwrite the old with the new using Oxygen
#Commit the changes and push them from your local computer to GitLab
#Pull the changes from GitLab to the ARC Staging Index
#Test the changes on the ARC Staging Index
##(Harvest text, if any)
#Push the changes from the Staging Index to the Staging Site
#Push the changes from the Staging Site to the Production Site

== Indexing RDF ==

Instructions for processing new RDF follow.

=== Guidelines ===

* This works best if all the RDF in a folder belongs to the same archive and, conversely, if the archive only appears in that one folder. An exception: if there are a large number of RDF files for a particular archive, it is best to split them into subfolders. For instance, for archiveX, create a folder named archiveX, and under that create as many folders as necessary called archiveX/1, archiveX/2, etc.

* Never start an archive name with "exhibit_".

* It is not required, but it makes maintenance easier if archive names use only lower-case letters, numbers, and the underscore "_", and no other characters.

* Maintenance will be easier if the folder name is the same as the archive name.

* You may nest folders as deep as you like for your own organization.

* Some tools tend to break if there are too many files in a single folder; conversely, some tools tend to break if the files are too large. 250K is not too large a file, and 400 is not too many files in a single folder. The actual limits depend upon the individual computer's resources.

* The main index is called "resources". As you index, you will be creating separate indexes for each archive called "archive_XXX". When you are satisfied that the new archive is correct, you will merge the new archive into the main index. At the time you merge, you should be sure that your RDF files are all checked into source control, so there is a clear record of what went into each index.

=== Indexing a new archive ===
1) Initial setup: You should have a local copy of Collex downloaded from source control, a local copy of all the RDF downloaded from source control, and a copy of solr_1.4 downloaded from source control. In addition, you should have Collex running on your local machine for testing purposes. (Note that you can do all the indexing on the indexing server instead of your local machine. The advantage of using your local machine to gather the RDF is that it can be more convenient, and it forces you to check the RDF into source control before the final indexing.) The RDF folder, the Collex folder, and the solr_1.4 folder must be children of the same folder. In other words, if you are using a folder called "collex", then you would have the folders "collex/web", "collex/solr_1.4", and "collex/rdf", as in the sketch below.
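For reference, the layout described in step 1 looks roughly like this (the archiveX subfolder is just an example):

<pre>
collex/
  web/        <- the Collex application; run the rake tasks from here
  solr_1.4/   <- the Solr checkout
  rdf/        <- all of the RDF, one subfolder per archive
    archiveX/ <- example archive folder
</pre>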
 
2) On your local machine, create a folder under the rdf folder with the same name as the archive and put the new RDF in it. (The rdf folder is the folder where you checked out all the RDF from source control.)
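A minimal sketch of step 2, assuming a Unix-like shell, a hypothetical archive named archiveX, and new RDF delivered to ~/Downloads/archiveX_rdf (adjust the names and paths to your situation):

<pre>
# From the top-level folder that contains rdf/ (all names below are hypothetical):
mkdir -p rdf/archiveX
cp ~/Downloads/archiveX_rdf/*.rdf rdf/archiveX/
</pre>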
 
3) Open a terminal, and cd to the folder containing the local copy of Collex.
 
 
4) The rest of these instructions assume that the folder you created is rdf/XXX, and all the RDF in that folder specify an archive of XXX. (When a folder is requested as a parameter, it is always relative to the rdf folder. Also, the folder is recursively searched, so that any subfolders are also processed.)
 
 
 
5) rake folder=XXX solr_index:find_duplicate_objects (This analyzes the new folder to see if two RDF documents contain the same URI. To be more thorough but much slower, just do rake solr_index:find_duplicate_objects, and you will search the entire rdf tree for duplicate documents.)
 
 
 
6) rake folder=XXX solr_index:index_rdf_for_debugging
 
 
 
7) Study collex/web/log/XXX_indexer.log and collex/web/log/XXX_report.txt for errors.
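One quick way to scan those logs for problems, assuming a Unix-like shell (paths as given above; adjust them if your working directory differs, and note that the exact wording of the messages depends on the indexer, so also skim the files directly):

<pre>
grep -in "error" collex/web/log/XXX_indexer.log
grep -in "error" collex/web/log/XXX_report.txt
</pre>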
 
 
 
8) Fix the errors and repeat step 6. (Note: if a URI is deleted or changed during this step, also run: rake archive='archive_XXX' solr_index:clear_reindexing_index.)
 
 
 
9) Open the local copy of Collex in a browser, go to the admin page, click "use test index" to see what it will look like. (It will be slower and the relevancy order will be wrong, but the objects will appear. Also there won't be any full text unless that was done locally.)
 
 
 
10) Fix errors and repeat from step 6 until satisfied.
 
 
 
11) Check the RDF into source control.
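Since ARC now uses Git and GitLab, checking the RDF in typically looks something like the following sketch (the archive name, branch, and commit message are placeholders, and your rdf checkout may be organized differently):

<pre>
cd rdf
git add archiveX
git commit -m "Add new RDF for archiveX"
git push origin master
</pre>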
 
 
 
12) Go to the indexing computer. (This step can be tested on your local computer if the archive is free culture, but it still has to be done on the indexing computer as the final step.)
 
 
 
13) Check out the new RDF from source control.
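With Git, this usually just means updating the rdf checkout that already exists on the indexing computer (a sketch; it assumes the repository has already been cloned there):

<pre>
cd rdf
git pull
</pre>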
 
 
 
14) rake index=archive_XXX solr_index:clear_reindexing_index (This is only needed if there were previous failed attempts.)
 
 
 
15) rake folder=XXX solr_index:index_rdf_with_fulltext
 
 
 
16) Study Collex/log/XXX_indexer.log and Collex/log/XXX_report.txt for errors.
 
 
 
17) Fix the errors and repeat step 15.
 
 
 
18) To get a dump of exactly what was added (and see additional error messages):
 
<pre>
  rake archive=XXX solr_index:compare_indexes
  rake archive=XXX solr_index:compare_indexes_text
</pre>
 
 
 
19) To see exactly what is in a particular document, do: rake uri="YYY" solr_index:examine_solr_document
 
 
 
20) At this point, you should have a good index of the archive, but it is not part of the main index. To put it in the main index, rake archive=XXX solr_index:merge_archive
 
 
 
21) Test the index on the indexing computer until satisfied.
 
 
 
22) To package the index so that it can be installed on the production server, run rake index=resources solr:zip (your new index will be in the ~ folder). Alternatively, you can package and send it in one step with: '''rake dest=nines@nines.org solr:send_index_to_server'''.
 
 
 
23) Be sure that the latest RDF is checked into Source Control.
 
 
 
24) To tag the current RDF and MARC files as the files that created the current index, rake label=XXX solr_index:tag_rdf_and_marc
 
 
 
== Reindexing ==
 
If the schema changes, or updated RDF is delivered, or a bug is fixed, then you may have to reindex an archive. To do that, you can run the task:
 
 
 
<pre>
cd collex/web
rake archive=ARCHIVE,FOLDER solr_index:reindex_and_test_one_archive
</pre>
 
 
 
Where ARCHIVE is the name of the archive and FOLDER is the path of the folder that the RDF resides in. The folder path is relative to the rdf folder, so to reindex rossetti, you would type "rake archive=rossetti,rossetti solr_index:reindex_and_test_one_archive".
 
 
 
There are a number of tools which help with reindexing in solr_index.rake, so peruse the source to see what is available.
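You can also ask rake itself for an overview of the available tasks (a sketch; it assumes the tasks have description strings and that you run it from collex/web):

<pre>
cd collex/web
rake -T | grep solr_index
</pre>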
 
 
 
==Strip Byte Order Marks (BOM)==

From the indexer directory, run this script:
 
 
 
'''ruby script/remove_bom <input directory> <output directory>'''
 
 
Please be sure that the files all have a .RDF extension.
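If you want to double-check that no files still contain a UTF-8 byte order mark after running the script, something like the following works in a bash shell (the directory name is hypothetical):

<pre>
# List files that still contain the UTF-8 BOM byte sequence:
grep -rl $'\xef\xbb\xbf' rdf/archiveX
</pre>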
 
 
 
== Link Checker (link_checker.rb) ==
 
 
 
The process of verifying the integrity of links provided by archives can be time-intensive and is therefore separated from the workflow of indexing RDF files. The link checker performs this function. By default it takes the link_data.txt file as input, so if you have just run the rdf_indexer, all you need to do to check the links is run:
 
 
 
'''./run link_checker > link.report'''
 
 
 
This generates a file called 'link.report' which reports the HTTP Response Codes for the links. To interpret these codes, refer to the W3C specification. (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html)
 
 
 
To get a list of available command line options:
 
 
'''./run link_checker -h'''
 
 
 
    link_checker [options]
    -g, --get                        Use GET instead of HEAD to test URLs (HEAD default)
    -s, --slow                       Pause between links to prevent overloading the remote host.
    -f, --file [file]                location of the link data file. (link_data.txt default)
    -h, --help                       Show this usage statement
 
 
 
Tips: If you are seeing "Unable to Connect" errors, try the -g option. If the target server appears to be overloaded by the rate of requests, try the -s option. To run the link checker as a background process, append an '&' to the end of the command. The background process will continue to run even if you log off:
 
 
 
'''./run link_checker > link.report &'''
 
 
 
 
 
== MARC Tools (marc_tools.rb) ==
 
 
 
NOTE: THE MARC TOOLS ARE NOT GENERIC. TO INDEX NEW MARC RECORDS YOU WILL HAVE TO MODIFY THE RUBY SCRIPTS.
 
 
 
TODO: The MARC tools can be made more generic, and there should be an easy path for expanding them to new archives.
 
 
 
All MARC files should be in the path collex/marc. You should keep all the MARC records that you are using under source control.
 
 
 
The MARC Tools script is two tools in one. It has a scanner mode, in which it scans the supplied MRC files and produces a report of their content, and an indexer mode, which indexes the MARC records into a specified Solr index.
 
 
 
=== Scanner Tool ===
 
The Scanner tool will analyze the provided MRC files and report how well they map onto the provided genre mappings. The genre mappings are stored in indexer/script/lib/nines_mappings.rb. The script generates a report, which is written to stdout. The example below redirects the report to a file called report.txt.
 
 
 
<pre>
cd collex/web
script/marc_tools.rb -t scan > report.txt
</pre>
 
 
 
=== Indexer Tool ===
 
The Indexer Tool will index the provided MRC files into the provided Solr index. The following command will index them into NINES staging:
 
 
 
<pre>
cd collex/web
rake archive=XXXX solr_index:reindex_marc
</pre>
 
 
 
Where XXXX is either "bancroft" or "lilly". To index other MARC records, first modify the ruby source to support a new archive, then run the above command with that archive name.
 
 
 
You will also have to modify solr_index.rb. Don't forget to update the line "args[:federation] = 'NINES'" to reflect your federation's name.
 
 
 
A handy feature is the verbose output option. Verbose mode outputs a report of each document in MARC format along with the resulting Solr index document, which allows you to examine exactly what is being placed in the index. This feature is best used on a subset of the data at a time, as it generates a tremendous amount of output. Change "verbose => false" to "verbose => true" in solr_index.rb to see this output.
 
 
 
Then you'll see output containing:
 
 
 
  Marc Record
 
  ===========
 
  LEADER 02154cam  2200385  4500
 
  001 GLAD82
 
  005 20071012072901.0
 
  008 830810 1875    mau                eng u
 
  010    $a    07003651//r
 
  035    $a (CU)ocm02879262
 
  903    $a 2 $b PS 01828 A1 1875
 
  903    $a 2 $b F  00855.1 H327TA COPY 1
 
  903    $a 2 $b F  00855.1 H327TA COPY 2
 
  100 10 $a Harte, Bret, $d 1836-1902.
 
  245 10 $a Tales of the Argonauts, $b and other sketches. $c By Bret Harte
 
  260 0  $a Boston, $b J. R. Osgood and company, $c 1875
 
  300    $a 2 p. l., 283 p. $c 19 cm
 
  500    $a First edition.
 
  505 0  $a The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went    home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian
 
  510 4  $a BAL $c 7280
 
  700 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC
 
  752    $a United States $b Massachusetts $d Boston $9 (1875)
 
  950    $l MAIN $s B 4 103 186  $z Main Stack  $a PS1828 $b .A1 1875
 
  902    $a NRLF
 
  950    $l BANC $s V 5 857  $z Bancroft    $d \x\ $a F855.1 $b .H327ta copy 1 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in purple cloth boards*uAcc. no. 32378\ $p Bookplate of Charles Awtood Kofoid\
 
  902    $a NRLF
 
  950    $l BANC $s V 5 856  $z Bancroft    $d \x\ $a F855.1 $b .H327ta copy 2 $g Non-circulating; may be used only in The Bancroft Library. $t Contact Bancroft Library for availability. $q Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. $f Bound in red-brown cloth boards; author's inscription tipped in\
 
  902    $a NRLF
 
  900    $a Bret-Gart, Frensis, $d 1836-1902
 
  900    $a Gart, Bret, $d 1836-1902
 
  900    $a Harte, Francis Bret, $d 1836-1902
 
  900    $a Chart, Bret, $d 1836-1902
 
  900    $a Harte, Bret, $d 1839-1902
 
  954 10 $a Honeyman, Robert B. $4 asn $5 CU-BANC
 
  953    $a United States $b Massachusetts $d Boston $9 (1875)
 
 
 
  Solr Document
 
  =============
 
  text: 2 PS 01828 A1 1875 2 F  00855.1 H327TA COPY 1 2 F  00855.1 H327TA COPY 2 Harte, Bret, 1836-1902. Tales of the Argonauts, and    other sketches. By Bret Harte Boston, J. R. Osgood and company, 1875 2 p. l., 283 p. 19 cm First edition. The Rose of Tuolumne.--A passage in the life of Mr. John Oakhurst.--Wan Lee, the pagan.--How old man Plunkett went home.--The fool of Five Forks.--Baby Sylvester.--An episode of Fiddletown.--A Jersey centenarian BAL 7280 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875) MAIN B 4 103 186  Main Stack  PS1828 .A1 1875 NRLF BANC V 5 857  Bancroft    \x\ F855.1 .H327ta copy 1 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in purple cloth boards*uAcc. no. 32378\ Bookplate of Charles Awtood Kofoid\ NRLF BANC V 5 856  Bancroft    \x\ F855.1 .H327ta copy 2 Non-circulating; may be used only in The Bancroft Library. Contact Bancroft Library for availability. Bancroft Library closed Summer to mid-Fall 2008. For construction information: http://bancroft.berkeley.edu/info/move/; 510 642-3781. Bound in red-brown cloth boards; author's inscription tipped in\ NRLF Bret-Gart, Frensis, 1836-1902 Gart, Bret, 1836-1902 Harte, Francis Bret, 1836-1902 Chart, Bret, 1836-1902 Harte, Bret, 1839-1902 Honeyman, Robert B. asn CU-BANC United States Massachusetts Boston (1875)
 
  role_AUT: Harte, Bret,
 
  date_label: 1875
 
  title: Tales of the Argonauts, and other sketches
 
  agent: Harte, Bret,
 
  archive: bancroft
 
  uri: lib://bancroft/GLAD82
 
  year: 1875
 
  batch: MARC-2007-12-18T00-00-00-05-00
 
  role_PBL: J. R. Osgood and company,
 
  genre: Citation
 
 
 
== Delete Archive ==
 
 
 
To remove an archive from a Solr index on staging, you can use the following command:
 
 
 
<pre>
  cd collex/web
  rake index=archive_* solr_index:clear_reindexing_index
</pre>
 
