Daniel, not sure if everything is clear now, but the way you are validating a WARC file is seriously flawed! I strongly suggest you use an existing WARC tool to do the actual parsing/validating.
I've seen more and more people on JHove's mailing list asking how to create their own modules, so I decided to write a small how-to guide on writing your own module.
Perhaps you'd like to give it a shot. I'll also post it on the mailing list.
================================================================================
This is a step-by-step tutorial that will enable you to compile and run a
custom-made module for Harvard's identification tool JHove [1]. This is not
a tutorial on Java programming! For a thorough explanation of this tool and
extensive documentation, please see:
http://hul.harvard.edu/jhove/documentation.html
================================================================================
= Step 1 =
Download and unzip JHove 1.1f [2]. The unzipped folder will be called
JHOVE_HOME from now on.
================================================================================
= Step 2 =
In this example I will construct a very elementary ARC module and will be using
Heritrix' [3] ARCUtils class, so download Heritrix 1.14 [4] and unzip it. After
unzipping, locate the file 'heritrix-1.14.1.jar' (it might be a different
version) and place it in the directory 'JHOVE_HOME/bin'.
================================================================================
= Step 3 =
Create a folder 'JHOVE_HOME/bin/test' and create a new file in it called
'ArcModule.java'. Paste the following contents in that file:
package test;

import java.io.IOException;
import java.io.InputStream;

import edu.harvard.hul.ois.jhove.ModuleBase;
import edu.harvard.hul.ois.jhove.RepInfo;
import org.archive.io.arc.ARCUtils;

public class ArcModule extends ModuleBase {

    // Descriptive values handed to the ModuleBase constructor.
    // NAME is the module name you pass to JHove with the -m option.
    private static final String NAME = "ARC-hul";
    private static final String RELEASE = "0.1";
    private static final int[] DATE = {2008, 11, 11};
    private static final String[] FORMAT = {"ARC"};
    private static final String COVERAGE = null;
    private static final String[] MIMETYPE = {"application/arc"};
    private static final String WELLFORMED = "...";
    private static final String VALIDITY = null;
    private static final String REPINFO = "...";
    private static final String NOTE = null;
    private static final String RIGHTS = "GNU LGPL";

    public ArcModule() {
        // The final argument (false) means the module reads from a plain
        // InputStream and does not need random access to the file.
        super(NAME, RELEASE, DATE, FORMAT, COVERAGE, MIMETYPE, WELLFORMED,
                VALIDITY, REPINFO, NOTE, RIGHTS, false);
        // Optionally set some Agent information: see the other Modules how
        // this can be done.
    }

    @Override
    public int parse(InputStream stream, RepInfo info, int parseIndex) {
        info.setModule(this);
        boolean wellFormed = false;
        try {
            // Let Heritrix decide whether the stream is a compressed ARC.
            if (ARCUtils.testCompressedARCStream(stream)) {
                wellFormed = true;
            }
        } catch (IOException e) {
            // A read error leaves the file marked as not well-formed.
            e.printStackTrace();
        }
        info.setWellFormed(wellFormed);
        // Returning 0 tells JHove that no further parse pass is needed.
        return 0;
    }
}
================================================================================
= Step 4 =
Compile this ArcModule by opening a shell (command prompt) and cd-ing to
'JHOVE_HOME/bin' and executing the following command:
*nix & Mac OS:
javac -cp .:JhoveApp.jar:heritrix-1.14.1.jar test/ArcModule.java
Windows:
javac -cp .;JhoveApp.jar;heritrix-1.14.1.jar test\ArcModule.java
(Note, if you're using JDK 1.4, replace '-cp' with '-classpath')
You shouldn't get any messages if all goes well.
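
In case the two variants of the command look arbitrary: the only real difference is the classpath separator, which is ':' on *nix and Mac OS and ';' on Windows. Java exposes it as File.pathSeparator. Below is a minimal sketch that prints the classpath for whatever platform it runs on; the class and method names are made up for this example and are not part of JHove.

```java
import java.io.File;

class ClasspathDemo {
    // Joins classpath entries with the separator of the platform this runs on.
    static String buildClasspath(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // The same three entries as in the javac command above.
        System.out.println(buildClasspath(".", "JhoveApp.jar", "heritrix-1.14.1.jar"));
    }
}
```

If your Heritrix jar has a different version number, adjust the last entry accordingly.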
================================================================================
= Step 5 =
Open the file 'JHOVE_HOME/conf/jhove.conf' and add the following snippet
directly beneath the line
<bufferSize>?????</bufferSize>
(where ????? is a number):
<module>
<class>test.ArcModule</class>
</module>
Save the file.
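
If you want to double-check that the entry ended up in the right place, a few lines of Java can list the module classes found in a configuration. This is only a sketch using the XML parser that ships with the JDK. The sample fragment in main is simplified and hypothetical (the root element name and the bufferSize value are assumptions; your real jhove.conf contains more elements), and the class name ConfModuleLister is made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ConfModuleLister {
    // Returns the text of every <class> element found inside a <module> element.
    static List<String> moduleClasses(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> classes = new ArrayList<>();
            NodeList modules = doc.getElementsByTagName("module");
            for (int i = 0; i < modules.getLength(); i++) {
                NodeList names = ((Element) modules.item(i)).getElementsByTagName("class");
                for (int j = 0; j < names.getLength(); j++) {
                    classes.add(names.item(j).getTextContent().trim());
                }
            }
            return classes;
        } catch (Exception e) {
            // Malformed XML or a parser problem: report it as a bad argument.
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Simplified, hypothetical fragment; not a complete jhove.conf.
        String sample =
                "<jhoveConfig>"
              + "<bufferSize>131072</bufferSize>"
              + "<module><class>test.ArcModule</class></module>"
              + "</jhoveConfig>";
        System.out.println(moduleClasses(sample));
    }
}
```

To run it against your own file, read the contents of 'JHOVE_HOME/conf/jhove.conf' into a String and pass that to moduleClasses; 'test.ArcModule' should appear in the list.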
================================================================================
= Step 6 =
Create a folder called 'JHOVE_HOME/arcs' and copy two compressed ARC files into
it. If you don't have any compressed ARC files lying around, you can
download two small ones [5]. The file 'A.arc.gz' is a valid compressed ARC file,
while 'B.arc.gz' is a copy of 'A.arc.gz' from which I removed the ARC header,
making it an invalid ARC file.
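
A side note on what "compressed" means here: the files are gzip-compressed, and every gzip stream starts with the two bytes 0x1f 0x8b. The sketch below checks only that. It is NOT what ARCUtils does and it says nothing about the ARC content itself, so it cannot replace the module; it is merely a quick way to see whether a file is gzip data at all. The class name GzipSniffer and its helper methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

class GzipSniffer {
    // True if the stream starts with the two-byte gzip magic number 0x1f 0x8b.
    static boolean looksLikeGzip(InputStream in) {
        try {
            int first = in.read();
            int second = in.read();
            return first == 0x1f && second == 0x8b;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: gzip-compresses a byte array in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = "not compressed".getBytes();
        System.out.println(looksLikeGzip(new ByteArrayInputStream(gzip(plain))));
        System.out.println(looksLikeGzip(new ByteArrayInputStream(plain)));
    }
}
```

To try it on a real file, pass a java.io.FileInputStream for the file instead of the in-memory streams used in main.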
================================================================================
= Step 7 =
Open a shell, cd to 'JHOVE_HOME/bin' and execute the following command:
*nix & Mac OS:
java -cp .:JhoveApp.jar:heritrix-1.14.1.jar Jhove -c ../conf/jhove.conf -m ARC-hul ../arcs
Windows:
java -cp .;JhoveApp.jar;heritrix-1.14.1.jar Jhove -c ..\conf\jhove.conf -m ARC-hul ..\arcs
This causes JHove to scan everything in the 'JHOVE_HOME/arcs' folder and run
it through your newly created ArcModule. The output will look as follows:
Jhove (Rel. 1.1, 2008-02-21)
Date: 2008-11-14 22:29:51 CET
RepresentationInformation: .../jhove/arcs/A.arc.gz
ReportingModule: ARC-hul, Rel. 0.1 (2008-11-11)
LastModified: 2008-08-24 20:23:20 CEST
Size: 130870
Status: Well-Formed and valid
RepresentationInformation: .../jhove/arcs/B.arc.gz
ReportingModule: ARC-hul, Rel. 0.1 (2008-11-11)
LastModified: 2008-11-14 21:53:15 CET
Size: 116136
Status: Not well-formed
This is the expected result: A is valid and B is not.
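
If you scan a folder with many files, reading the report by eye gets tedious. Here is a minimal sketch that pairs each 'RepresentationInformation' line with the 'Status' line that follows it. It assumes the plain-text output format shown above and nothing more; the class name JhoveOutputSummary is made up for this example and is not part of JHove.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class JhoveOutputSummary {
    private static final String FILE_KEY = "RepresentationInformation:";
    private static final String STATUS_KEY = "Status:";

    // Maps each reported file to the status line that follows it.
    static Map<String, String> statusByFile(String output) {
        Map<String, String> result = new LinkedHashMap<>();
        String currentFile = null;
        for (String rawLine : output.split("\\R")) {
            String line = rawLine.trim();
            if (line.startsWith(FILE_KEY)) {
                currentFile = line.substring(FILE_KEY.length()).trim();
            } else if (line.startsWith(STATUS_KEY) && currentFile != null) {
                result.put(currentFile, line.substring(STATUS_KEY.length()).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened copy of the output shown above.
        String output = String.join("\n",
                "RepresentationInformation: .../jhove/arcs/A.arc.gz",
                "Size: 130870",
                "Status: Well-Formed and valid",
                "RepresentationInformation: .../jhove/arcs/B.arc.gz",
                "Size: 116136",
                "Status: Not well-formed");
        statusByFile(output).forEach((file, status) ->
                System.out.println(file + " -> " + status));
    }
}
```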
================================================================================
= Final remarks =
As I said, this is not a programming tutorial, nor is it the best way to
validate ARC files: more metadata should be extracted from the file. But I
leave that to you. This was only a guide to show you how to get started on
writing and running your own modules. You can have a look at the source
of the existing modules to see the "best practices" w.r.t. writing a module.
Best of luck!
Regards,
Bart.
================================================================================
= References =
[1] http://hul.harvard.edu
[2] http://hul.harvard.edu/jhove/download.html
[3] http://crawler.archive.org
[4] http://sourceforge.net/project/showfiles.php?group_id=73833&package_id=73980
[5] http://iruimte.nl/arcs