This is based on experiences from our ESRT collaboration (N. Brown and M. Van Moer). If the Text Analytics Gateway ever gets off the ground, some of this may become obsolete, hopefully in a good way.

Acquiring Data

Usually the PI takes care of this, but I have some recommendations because I think JSTOR complicated matters with the way they handed data over.

Recommendations

  • Document the exact query done on the JSTOR website. Maybe take a screenshot of the first page of results and keep track of the number of matching docs.

  • When requesting multiple queries, see if they can send them as separate bundles instead of mixed together. If that’s not an option, consider submitting each query as a separate request. I was anxious about having to recreate a query, and the recreation also ate up SUs.

  • Confirm that the queries you did and that they did are the same, as far as keywords and connectives like AND, OR, groupings with (), etc.

  • See if they can return English-language-only results. Unfortunately, language does not appear to be in the metadata, which leads to all the German-language items. We can cull those out because they seem to be thrown into one or two topics, but that’s still compute time being wasted on something JSTOR should be able to accommodate.

Copying Data to/from an XSEDE Resource

Globus is the preferred method, and it may become the only option in the near future. There’s now a web interface available at:

http://www.globus.org 

You may have to install something extra on your laptop, but after that, everything goes through the Globus website. Since JSTOR was willing to ship stuff directly to you, this should work. If you ever get into something like HathiTrust data where they want a specific method like rsync and they don’t want the data to go through other locations, etc., then that would be something you would have to set up with the specific XSEDE site.

Preprocessing

General Overview

Setting aside this particular project for a second, what I know about text preprocessing all hinges on deleting anything that’ll probably just contribute to statistical noise:

  • Stopwords - prepositions, conjunctions, etc.

  • Punctuation - though apostrophes might be important in contractions; e.g., you might want to capture “can’t”, but if the apostrophe is removed, that leaves “can t”, and so “can” might be over-counted. Likewise, hyphenated words have to be handled with care.

  • Personal names - first names seem to be commonly removed. Last names might be more of a judgement call.

  • Roman numerals, and possibly Arabic numerals.

  • Short words - this can be a general catch-all, since it covers a lot of stopwords, but it might also catch the trailing ends of contractions left over from punctuation removal.

  • Stemming - verb tenses: “walk”, “walks”, “walking”, “walked” all being reported as “walk-”. I gather there are more sophisticated ways to handle the various adjectival and adverbial forms, too.

  • Removing non-English documents.

  • Checking for OCR errors. This may or may not be an issue. I asked Michael Black about this and he said (a) that they’re probably not statistically significant and (b) that there are ways to tell Mallet to ignore any word counts below a threshold or any unique occurrences, which should clear these out.

  • Removing front matter, back matter - I just remembered this when I started to prepare the Liberalism query for inferencing. The OCR data from these journal articles often contains the front matter and back matter (publisher address, volume no., date, etc.) in the document data. IMHO this should be in the metadata, but since the scanning is an automated process, that’s probably not going to happen.


Depending upon which of these are important to your project, they’ll have to be done with a separate program outside of LDA; e.g., stemming needs to be done with a dedicated stemming program or library. I would recommend consulting with Mike Black or someone more versed in this area. On the one hand, irrelevant words have a tendency to congregate into a specific topic, which could then be ignored. On the other hand, that still consumes a topic that could maybe capture something more interesting.
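Just to make the flavor of this concrete, here is a minimal shell sketch of a few of the steps above (lowercasing, stripping everything but letters and apostrophes, then dropping stopwords and short words). This is not the project’s preprocessor.py; it assumes one document per line in citizen.txt and a one-word-per-line stopwords.txt.

# Minimal sketch only, NOT the project's preprocessor.py.
# Lowercase, replace everything except letters/apostrophes/newlines with spaces,
# then drop stopwords and words shorter than 3 characters.
tr '[:upper:]' '[:lower:]' < citizen.txt \
  | tr -c "[:alpha:]'\n" ' ' \
  | awk 'NR==FNR { stop[$1]; next }
         { out = ""
           for (i = 1; i <= NF; i++)
             if (!($i in stop) && length($i) > 2)
               out = out $i " "
           print out
         }' stopwords.txt - \
  > citizen_preprocessed.txt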

Project specific

The data from JSTOR came in JSON (JavaScript Object Notation) format. Essentially, this is just a way to combine metadata with the document text. JSTOR may have other options for delivery; it might be useful to see what those are.

With JSON-formatted data, the preprocessing and analysis need to be done only on the document text, not the metadata, so the document text has to be stripped out and placed into metadata-free files. These then have to be put into a format which Mallet can import. Mallet assumes either:

  1. A directory containing each document in a separate file, where the title of the document is the filename, or

  2. A single file containing all the documents, one document per line. (Which implies that paragraph breaks, hyphenated line breaks, etc., have been removed.)

Option 2 might be required for most large corpora. This is because most file systems don’t handle having more than 1000 or so files in any particular directory very well. (This is an open area of research.) The other issue is that some of these files may have hundreds of thousands of lines, i.e., documents, and are too big to be processed efficiently. Therefore, a balance between the number of files and file size has to be found. For this project, splitting each query into 5000-line files gave a workable balance.
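For the 5000-line splitting, plain GNU split does the chunking. This is just a sketch of the idea, not the actual split.job; the -d and --additional-suffix flags produce intersectionality_00.txt, intersectionality_01.txt, and so on.

$ split -l 5000 -d --additional-suffix=.txt intersectionality.txt intersectionality_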

Additionally, it might be worthwhile at this stage to recreate the queries, since JSTOR sent them randomly mixed up.

Ideally, then, preprocessing at this stage does any or all of the steps listed in the General Overview section, plus reformats the text for Mallet importing. (Or there could be a sequence of smaller scripts that does each stage separately, called from a control script.)

What actually happened was a mish-mash of these stages as I was getting my head around what was required. I can say for sure what did NOT happen - stemming, removal of personal names, removal of non-English texts, and handling of OCR errors.

To reconstruct the queries, I modified a script that Rob wrote, createTxtFiles.py, so it would work on Greenfield (and I might have renamed it…). This script grabbed the OCR data (the scanned document text) and the DOI from the JSON, putting them into two separate files (a rough sketch of the idea follows the list below):

  1. A file named after the Query, e.g., citizen.txt, intersectionality.txt, etc., and

  2. A file named mapfile.txt which just contained the DOIs. The key here is to make sure to never alter the ordering in these two files; otherwise the correspondence between them is messed up.
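For illustration only, here is a rough sketch of that extraction using jq rather than Python; the field names .doi and .ocrText are hypothetical stand-ins for whatever the actual JSTOR JSON uses.

# Hypothetical field names; appending to both files in the same loop keeps the
# line ordering in the map file and the OCR file in sync.
for f in jstor_json/*.json; do
    jq -r '.doi' "$f" >> mapfile.txt
    jq -r '.ocrText | gsub("\\s+"; " ")' "$f" >> citizen.txt
done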

Second, I ran a preprocessing script, preprocessor.py, launched via the job script preprocess.job. This needed to be done on any text that was going to be either trained on or inferred.

Recommendations

  • Getting a thorough, adaptable, modifiable preprocessing workflow going is something that might make a great SPIN project, because lots of it could be figured out on a regular computer and with smaller datasets. Then when it needed to be ported to a supercomputer, you could ask for ECSS support to do that. What I’m leaving you with is unfortunately a hodge-podge of Python scripts and Linux commands.

  • I also needed to write a script, grabTitles.py, which took the doi from the mapfile and found the actual title in the json metadata. This was for checking that the titles of the docs used for training looked okay.

Importing

Preprocessing is not enough. Mallet also expects files to be in “Mallet format,” which is, confusingly, the Java serialization data format; it is completely different from, and has nothing to do with, JSON. Importing needs to be done with a call to Mallet, separate from training or inferencing. All of the preprocessed data needs to be imported, whether for training or inferencing. Due to space limitations, it may not be possible to do the whole corpus at once, which means the files will have to be split, or the first N lines copied, or something similar. What I usually did to grab, say, the first 50,000 lines from the intersectionality query was to run something like this at the Linux command line:

$ head -n 50000 intersectionality.txt > first50K_intersectionality.txt

I did the actual importing as a separate job which called the Mallet command mallet import-file with various options; I used the most basic ones. In theory, one could do stopword removal here instead of as a separate preprocessing step. I don’t know how this handles punctuation or other types of preprocessing.
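For reference, the call amounted to something along these lines. This is only a sketch with the basic options, not the exact contents of the import job, and the file names are placeholders. (Note that mallet import-file by default reads the first two whitespace-delimited tokens on each line as the instance name and label, and the rest of the line as the document text.)

# Basic import; --keep-sequence is required for topic modeling, and
# --remove-stopwords is one place stopword removal could be folded in.
mallet import-file \
    --input first50K_intersectionality.txt \
    --output first50K_intersectionality.mallet \
    --keep-sequence \
    --remove-stopwords

# When importing documents that will later be inferred rather than trained on,
# reuse the training set's pipe so the vocabularies line up:
#     --use-pipe-from first50K_intersectionality.mallet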

Recommendations

  • Double-check the mallet import-file options inside mallet-import.job, in particular, see if importing can do more of the preprocessing steps.

Training

Training is what builds the model for later inferring. The training set should consist of documents with a known or suspected quality, e.g., intersectionality for this project, or authorship and subject matter for Ruby’s project.

To do the training I called mallet-train.job. (This was also written for Greenfield/PBS; unless we re-train on Bridges, I probably won’t rewrite it for Bridges/SLURM.)

The meat of this is a call to mallet train-topics with a bunch of options. The options I used were a combination of what Methcyborg did, recommendations from Paul Rodriguez, and things I discovered were needed by looking at the options and the output. In particular, you must use --inferencer-filename to output an inferencer. Allegedly Mallet doesn’t need this, but I couldn’t get later inferring runs to work without it. --num-threads needs to be set to take advantage of multithreading; training is by far the most compute-intensive part of the process, so without multithreading it’s not feasible to finish in a reasonable amount of time and it will burn through all the SUs.
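For reference, a sketch of what such a call can look like. The option values here are illustrative, not the exact settings used in mallet-train.job.

# Illustrative values only; adjust paths, topic count, and thread count to the run.
mallet train-topics \
    --input first50K_intersectionality.mallet \
    --num-topics 100 \
    --num-iterations 1000 \
    --num-threads 14 \
    --optimize-interval 10 \
    --output-topic-keys topic_keys.txt \
    --output-doc-topics doc_topics.txt \
    --inferencer-filename inferencer.mallet   # needed by the later inferring runs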

Recommendations

  • Use off-the-shelf Mallet. Unless you can always have a Java programmer on staff, it’s going to be difficult to use a custom program like Methcyborg. Even someone like me, who knows a little Java, also has to get their head around what the custom program is doing and why. It took additional work to get it running on something like Greenfield, because not only did Mallet have to be installed, but also Eclipse. Additionally, the output format for some of the files was non-standard (this might have been carried over from Mike Black), which means you couldn’t necessarily send that output to someone else; they wouldn’t know what to do with it.

  • Double check the mallet train-topics options. There are a ton of them. It’s possible that the default values are perfectly fine, but it might be good to have a team member who can confidently answer questions about the settings if a question ever comes up.

  • The --num-iterations option needs to be investigated more thoroughly. I haven’t found any good information on the web about what it should be set to. Mike used 1000 for Ruby’s project; I’ve seen examples on the web that use 30,000. The only thing for certain is that the higher the number, the longer the run will take. While training is running, it prints out LL/token numbers every 10 iterations or so (there might be an option to set how often that comes out). I don’t know exactly what that number means, but apparently once it settles down, that’s where the number of iterations can be cut off. So, doing this over, I’d start with 1000 iterations, then plot the LL/token values in Excel and see if the plot asymptotically approaches some number (see the small helper sketch below). If it does, try going down to 500, say; if not, go up to 2000, and then dial back, etc. This is apparently highly sensitive to the input corpus, too, so every new project may need to investigate it a little.
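A small helper for that LL/token check, assuming the training job’s output was captured to a log file (the name train.log is made up): Mallet prints lines containing “LL/token:” during training, so the values can be pulled out and dumped into Excel or a plotting tool.

$ grep "LL/token:" train.log | awk '{print $NF}' > ll_per_token.txt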

Inferencing

During training, an inferencer file should have been output. If it wasn’t, the training will have to be redone.

The documents that will be inferred against the topic model will need to go through the same preparation as above:

  1. Splitting into manageable sizes (~5000 lines w/1 doc/line)

  2. Preprocessing

    1. Recombining map_file and preprocessed ocr data

  3. Importing into mallet format

  4. Inferencing against the topic model

  5. Post processing

Inferring is a serial operation, meaning any parallelization needs to be set up and handled explicitly. The first step is to preprocess the documents we want to infer against the training set. This is essentially the same process as above, but with the additional step of splitting each query into files of 5000 documents or so, so that we can parcel out each of these 5K collections to independent inference runs.

Running multiple serial jobs in parallel is sometimes called task parallelism. It is often used when the problem is embarrassingly parallel. This is supported on Bridges using SLURM job arrays or “job packing.” Other machines may support task parallelism differently. This is something which is definitely appropriate to contact XSEDE help about when migrating to a new resource.

mallet infer-topics also has a --num-iterations option, which should probably default to the same number of iterations used for training. In our case, 5000 was taking too long, so I did some experiments and decided that 2500 was acceptable. This was based on comparing the document vectors and seeing to how many decimal places they agreed per element.
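To make the job-array approach concrete, here is a hedged sketch, not the actual mallet-infer-array.job; the array size, time limit, and file names are assumptions.

#!/bin/bash
# Sketch of a SLURM job array for the task-parallel inference runs (NOT the real
# mallet-infer-array.job). One array task handles one 5000-document split file.
#SBATCH --job-name=mallet-infer
#SBATCH --array=0-13
#SBATCH --ntasks=1
#SBATCH --time=08:00:00

SPLIT=$(printf "intersectionality_%02d.mallet" "$SLURM_ARRAY_TASK_ID")

mallet infer-topics \
    --inferencer inferencer.mallet \
    --input "$SPLIT" \
    --output-doc-topics "doc_topics_${SLURM_ARRAY_TASK_ID}.txt" \
    --num-iterations 2500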

The outputs of inferring are document vectors.

Recommendations

  • The essential part is to remember to have the training step output the inferencer file.

  • Similarly to the training step, the number of iterations should be investigated to see how small it can be and still give acceptable results.

  • Another concern is random job failures, see Appendix C.

Post-processing

  • Prepping for transfer, analysis, and visualization. It’s probably best to put everything into a compressed archive format (such as tar.gz), as in the example below.
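For example (the file names are placeholders):

$ tar -czf inference_results.tar.gz doc_topics_*.txt topic_keys.txt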

Transferring from XSEDE

  • With Globus, at this point, you will need to set up an endpoint for your desktop.


Appendix A: Parallelism

Broadly, there are two categories of parallelism you’re likely to encounter on an HPC machine. The first is the type encountered in building the training set, where Mallet itself is given a setting to use multiple cores; all of the parallelization is internal to Mallet. (Mallet does this through multi-threading, which is loosely in the same family as things like OpenMP, MPI, CUDA, etc., i.e., these all require that the application be explicitly programmed to use them.)


The second is task parallelism. In this case, multiple instances of an application are run independently of each other. This has to be controlled by the operating system or the job control system. On XSEDE machines and most clusters, you’ll have to go through the job control system, usually either PBS Torque or SLURM. They have slightly different syntaxes and capabilities, and they may also be configured slightly differently from machine to machine. Greenfield, for example, used PBS Torque, but Bridges uses SLURM. SLURM can do task parallelism in a variety of ways. I initially tried using job arrays, but later PSC recommended job packing. Packing was much more efficient in its usage of SUs, but it requires more fiddling to get all the jobs to run and has a few other drawbacks, so I returned to using job arrays. The scripts in Appendix B are a mix of job arrays and job packing.
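For comparison with the job-array sketch in the Inferencing section, here is a rough sketch of what job packing looks like. The resource numbers and file names are assumptions, not the project’s actual settings, and newer SLURM versions may want slightly different step options.

#!/bin/bash
# Sketch of "job packing": several serial Mallet runs launched concurrently as job
# steps inside one SLURM allocation, with a final wait so the batch script doesn't
# exit until every step is done.
#SBATCH --ntasks=4
#SBATCH --time=08:00:00

for i in 0 1 2 3; do
    srun --ntasks=1 --exclusive mallet infer-topics \
        --inferencer inferencer.mallet \
        --input "intersectionality_0${i}.mallet" \
        --output-doc-topics "doc_topics_${i}.txt" \
        --num-iterations 2500 &
done
wait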

Appendix B: Job Scripts

recreate-query.job - this attempts to recreate the six queries over the JSTOR JSONs. For each query, two files are generated: a “map_file.txt” containing the DOIs and a “${query}.txt” containing the OCR data from the JSON. There is a 1:1 correspondence between the map file and the OCR data file for each query, so they should have exactly the same number of lines. These are then split into manageably sized files for Mallet.

split.job - this is for splitting the OCR data into manageable 5000 line chunks on Bridges. This is a serial job and the length/SUs depends on the size of the query.

preprocess-array.job - This is for running the preprocessing Python script on each of the 5000 line split files. The output is a file ready for importing into Mallet’s format. Uses preprocessor.py to do the actual preprocessing. The output is a text file with an “importable-” prefix.

mallet-import-array.job - This is for converting preprocessed, plain-text files into Mallet’s format, which is a binary format for serialized-Java objects (not to be confused with JSON - JavaScript Object Notation, which is completely different.)

mallet-train.job - For training a data set and creating a topic model for later inferencing.

mallet-infer-array.job - For inferencing an imported .mallet file against a training set created with mallet-train.job. The output can be varied, but is usually csv’s of document vectors. (Note that the default mallet document vector format differs from the one Mike Black would output on Ruby’s project.)

There are alternative versions of some of these for specialized situations. E.g., serial versions which run only one specific instance, packed job versions as an alternative to job arrays (which may be useful on different machines), etc.

Appendix C: Resiliency

Unfortunately, for whatever reason, it’s not that uncommon for a job to fail. This could be due to (ordered loosely from most to least likely):

  • Not requesting enough job time

  • Errors/typos in the script

  • Problems with copying data to/from scratch

  • Resource contention among parallel tasks

  • Hardware issues

  • Miscellaneous, undiagnosable, transient errors


These are cutting-edge machines and, especially early in their life cycles, not all the kinks are going to be worked out. Here’s where this came into play on this project: for example, I’d submit a packed job of 14 task-parallel preprocessors. For whatever reason, one of them would fail and only 13 of the split files would be preprocessed. I then had to go back and resubmit jobs for any of the failures (see the sketch below for one way to spot them). Most of the time, these would work on the second submission. If they failed a second time, I would double the requested time. If they failed a third time, I’d make a note of it and return to those particular jobs later. Sometimes they’d run right away the next day; when that happens, I tend to think there was something flaky going on with the machine.
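One way to spot the failures, given the “importable-” naming convention described in Appendix B (a sketch, not one of the actual scripts):

# List the split files whose preprocessed "importable-" output never appeared,
# so only the failed tasks need to be resubmitted.
for f in intersectionality_*.txt; do
    [ -f "importable-$f" ] || echo "missing: $f"
done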

If you decide to pursue establishing a more cohesive workflow (would make a good ECSS/ESRT project) then this is something that should be addressed.

Appendix D: Document overlap among queries

When reconstructing the queries through Python, any given document could have ended up in multiple queries. E.g., when looking at the Great Society query, 33,030 DOIs were found in the JSONs. Of those, 2,313 were also in the training set and 10,906 were also in the full Intersectionality query.


 

Query            Reconstructed count   Less training set   Reduction   Less full Intersectionality query   Reduction
Great Society                 33,030              30,717          7%                              22,124         33%
Citizen                      495,918             477,070          4%                             406,893         18%
Consumerism                  676,839             658,554          3%                             591,168         13%
Law                        1,047,684           1,001,298        4.5%                             827,777         21%
Liberalism                   111,259             106,387        4.5%                              88,168         21%

 

My guess is that this is one of the reasons JSTOR handed over all the queries as one bundle: they probably removed any redundancies in order to save space.

This would not have had any effect on the training. I believe it should be all right to remove the documents that overlap with the training set from the visualizations. A more subtle question might be how to handle the documents which overlapped with the full Intersectionality query but weren’t in the training set. Also, what about documents that were in both Great Society and Law, etc.? I think the main thing is to not present the queries as being completely separate.


Appendix E: Visualization

The final visualization for comparing the queries uses a style of plot called a tetrahedral plot. These are a specific application of barycentric coordinates, which were originally invented (possibly by Moebius in the early 1800s - the same guy who came up with the Moebius strip) as a way to solve problems in triangle geometry using coordinates specific to the triangle. This allows doing generalized math on the triangle without having to know specific Cartesian coordinates.

The main idea behind barycentric coordinates is that points interior to the triangle can be given coordinates based on the triangle’s vertices. So for some triangle ABC, a point in the interior will have a barycentric coordinate of (a, b, c), where a, b, and c are weights applied to A, B, and C, respectively.
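In symbols (a sketch of the standard definition, in LaTeX notation):

P = aA + bB + cC, \qquad a + b + c = 1, \qquad a, b, c \ge 0

i.e., the point is a weighted average of the three vertices, and it lies inside the triangle exactly when all three weights are non-negative.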

These are usually normalized so that a + b + c = 1 (as in the line above), and this is when things clicked. Mallet outputs a document vector of 100 values (100 because that’s the number of topics we picked), and these also all sum to 1. So, provided it’s possible to pick 1, 2, or 3 topics of interest, we can do some kind of barycentric plotting. A failing of this approach is that it breaks down if there are more than 3 topics of interest, say if we also wanted to look at age/ageism.

With 3 topics, we can now do TopicA + TopicB + TopicC + Other = 1 and plot to a tetrahedron using the same barycentric coordinate idea as used for triangles, just extended to 3D.
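So, writing a Mallet document vector as theta over the 100 topics, each document lands at the tetrahedral point (again in LaTeX notation):

(\theta_A, \theta_B, \theta_C, \theta_{Other}), \qquad \theta_{Other} = 1 - \theta_A - \theta_B - \theta_C

A document dominated by one of the three chosen topics plots near that topic’s corner, while a document that is almost entirely “Other” plots near the fourth vertex.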

