On Building A Library

John David Pressman

I was introduced to Isaac Asimov’s Foundation series in the summer between my freshman and sophomore years of high school. Certain books are so widely agreed to be excellent that you’ll have heard of them, at least vaguely, long before you read them. Foundation was in this category for me, and what compelled me to read it over anything else I might have done at that time was its immediate availability in my local school library. However, it would be a misnomer to say I ‘checked out’ the book, as this library has no checkout system, no tracking, and no index of what books are available.

To paint a picture for you, the ‘library’ is actually a disorganized wall of books stretching around the east wing of the school. There is no Dewey Decimal, no organization by subject or category. To find a book you like you must crouch down and, head tilted sideways, walk along the shelves scanning titles until you come upon something worthwhile. It actually took me a great deal of time to find Second Foundation after somebody misplaced it somewhere in the rows of babel.

It was for this lack of care that the library largely remained (and remains) unused. I’d walk by it every day for class and never once did I see anybody peruse its shelves. This bothered me for several reasons. One was that I had already found one good book on the shelves and knew that there were probably others, but I didn’t want to go through every book just to find the few I would be interested in. Another was that the school basically had no library besides this to speak of, and in fact there had been whispers of trying to create one for years.

The donor(s) of the books presumably donated with the faith that their books would be used. To have them lying around is a betrayal of that. I don’t remember quite when the idea first surfaced in my mind, but at some point I started pondering how one might turn these books into a proper library. It occurred to me that the biggest problem was that there was no way of even knowing what we had. I spent a further stretch of time pondering how one might catalog the books. Eventually the solution came to me when I sat down during one of my ‘Computer App’1 periods at a table in the East Wing. I remember naively pulling out a piece of paper and a pen and writing down the title and author of about four books before realizing there would be no way I could do it that way in a reasonable amount of time. Obviously libraries had to have a way of managing this, so I examined the books I had on the table. Inside the cover and on the back I noticed an ‘ISBN’ number, which my intuition told me was probably some kind of unique identifier for the book.

ISBN Numbers

My intuition was correct. International Standard Book Numbers, or ‘ISBN’ for short, are issued by registration agencies around the world and uniquely identify a book or publication. There are also competing standards such as the Library of Congress Control Number. The vast majority of books cataloged in the library have an ISBN, so that’s what I will discuss.

International Standard Book Numbers have either ten or thirteen digits. The tenth or thirteenth digit is a ‘check’ digit that allows you to verify the integrity of the other digits. Usually the barcode on the back of the book is an encoding of that book’s ISBN. The font used to print the number below the barcode is OCR-A, which is optimized for machine recognition programs. In general this means that in 2014 you shouldn’t need to catalog a library by hand, as a simple barcode scanner hooked to a laptop will relieve you of the burden.

Cataloging the library by hand is exactly what I did, largely because I couldn’t imagine any better way. I really wish that when I’d learned about ISBNs I’d dug a little deeper into their conventions instead of running off to implement a solution. I’d assumed that the barcodes on the back of a book were just data assigned by the company that sold the book for inclusion in their inventory system. If I’d known it was the ISBN I’d have paid fifty bucks for a barcode reader and saved myself the hassle.

Instead what I did was significantly more complicated. I took a picture of the ISBN of every book, then transcribed the numbers from the photos by typing them out.

Photography

This was easier said than done. For most of high school I carried a camera and took photos of whatever things struck me as interesting. (You can find some of my photo albums here.) But photographing the back cover of each library book could take a very long time. I started out by taking a table in the project area and dragging a stack of books over to the table so I could photograph them, then carrying the stack of books back to the shelves. This proved to be slow and laborious. Eventually I recruited volunteers and set up a team. I am deeply thankful for their help and wish I could remember their names so as to thank them properly. One person would retrieve books from the shelf, I would sit at the table and photograph them, and a third person would take the books back to the shelf. In this way the work was sped up considerably but still took a week or two to complete. At times it seemed like we would never finish, and between this portion of the project and the transcription I was forced to learn the patience needed to get the results I wanted.

Eventually the entire library had been photographed. Still undecided on whether I would transcribe the numbers by hand or use OCR software to try to read them, I let the photos sit on my hard drive untouched. Schoolwork and other concerns intruded. Besides a mental note every time I saw the shelves that I still hadn’t finished my work, the project faded from view.

Transcription

By the time I got around to transcribing the ISBN numbers from the photos, this ‘Library Project’ had become my Senior Project. In the state of Washington you need to do 40 hours of community service as part of your graduation requirements; what you do for these 40 hours is called your Senior Project. If at this stage you’re wondering why I didn’t just use a character recognition program to transcribe the numbers for me, the reason is that transcribing them with OCR seemed like it might be error prone and finicky (I did not know the numbers were OCR optimized), and I needed the time sink anyway. Resolving to transcribe them is one thing; actually doing it is another. Modern windowing systems generally don’t seem optimized for a workflow where you are manipulating an application in one window at the same time that you’re manipulating a second application in another window. What this meant in practice was that I had to go through the same keyboard shortcuts to navigate between windows over and over.

Though I eventually settled on a direct transcription from the photos, I did try experimenting with alternate methods such as recording myself saying the numbers at a steady pace and then typing to keep up with the recording. It turned out in practice to be less effective than just reading the numbers off the photos and typing them. My final setup was:

  1. The Emacs Text Editor

  2. An image viewer

  3. XFCE windowing system for Linux

My workflow was to open the text editor and ‘pin’ it in the corner, using XFCE to set it so that it would always be the top window. I’d then open the image viewer to the folder containing all the photos. I would use alt-tab to switch between the text editor and the image viewer, flipping to the next image and then switching to Emacs so I could transcribe the ISBN number into a text file. Between sessions I would end the text file with a note telling me what image I was on so that I could quickly get back to work at a later time.

From a psychological perspective, sitting down and copying stale book number after stale book number is very difficult. It requires a good deal of focus to check and double check that you’re putting the number in correctly. This is in fact one of the things the check digit is supposed to help you with, but I didn’t understand that a check digit was available.2

Checking My Work

“Well how many of the ISBN numbers you typed in are correct?”

If your library database has false entries in it you’re going to be sending people on wild goose chases looking for books. The ISBN check digit can seem a little confusing at first, but the mathematics involved are very simple. To start, let’s take a valid ISBN-10 number:

ISBN 0-385-42075-7

As its name implies, ISBN-10 has ten digits in it. The last digit, 7:

ISBN 0-385-42075-7
                 ^

Is the checkdigit for this number. Checkdigits can be 0 through 10, with ten being represented by the symbol ‘X’. The checkdigit is calculated through an algorithm given the first nine numbers. This is so that if any of the first nine numbers are corrupted the checkdigit will not match up, allowing you to spot the problem. The algorithm for determining the checkdigit is as follows:

Given the first nine numbers, you multiply them by a descending integer ‘weight’, like so:

(0 x 10) (3 x 9) (8 x 8) (5 x 7) (4 x 6) (2 x 5) (0 x 4) (7 x 3) (5 x 2) 
   0       27      64      35      24      10       0      21      10

The final number is the checkdigit, and it’s an integer between zero and ten such that the sum of all ten digits multiplied by their weights (the checkdigit itself has a weight of one) is a multiple of eleven. (I.e. if you were to divide the result by eleven you would get a whole number with no remainder.)

In this case the nine products above sum to 191, which does not yet include our checkdigit. This gets us most of the way there, but how do we determine what the checkdigit actually is?

The short answer is you take 191 modulo our base, in this case eleven, and get four, which you then subtract from eleven to get seven, which is our checkdigit. (If the remainder were zero the checkdigit would simply be zero.) If that makes perfect sense to you feel free to skip to the next section; otherwise the long explanation follows:

When we divide two numbers, say 4 by 2, there are three parts to the division operator:

4 / 2 = 2

These are the dividend, the divisor and the quotient, which are the number to be divided, the number we are dividing by, and the result, respectively.

dividend / divisor = quotient

In a case like 4 / 2 = 2, two ‘goes into’ four evenly two times, giving us our quotient. However if you consider the case 5 / 2:

5 / 2 = 2 remainder 1

Two does not go into five evenly; we have a number left over, which is called the remainder. The operation of finding the remainder given the division of two numbers is called modulo.

So 191 modulo our base, eleven, is the remainder left over once eleven has gone into 191 evenly as many times as it can. This happens to be four. What this means in practice is that (mod 191 11) tells us we are four past the last multiple of eleven, which necessarily means that we are four closer to the next multiple of eleven than we were at the last one. Since you must take eleven steps from 187 (191 - 4 = 187) to reach the next multiple, 198, we know that four of those steps have already been taken and can therefore subtract them from the eleven we have to go.

11 - 4 = 7

Hence our checkdigit.
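
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, not the checker used for the project, and the function name isbn10_check_digit is made up for the example.

```python
def isbn10_check_digit(first_nine):
    """Return the ISBN-10 check character for a string of nine digits."""
    # Weights run from 10 down to 2 across the first nine digits.
    total = sum(int(digit) * weight
                for digit, weight in zip(first_nine, range(10, 1, -1)))
    # Distance to the next multiple of eleven; the outer modulo turns 11 into 0.
    check = (11 - total % 11) % 11
    # A check value of ten is written as the symbol X.
    return 'X' if check == 10 else str(check)


# The worked example: the products sum to 191, 191 % 11 is 4, and 11 - 4 is 7.
print(isbn10_check_digit("038542075"))
```

The outer modulo covers the one case the subtraction alone gets wrong: when the nine products already sum to a multiple of eleven, the check digit is zero rather than eleven.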

There’s a different algorithm for ISBN-13s. The ISBN-13 checkdigit algorithm alternates between multiplying each digit in the ISBN by one and by three, and the checkdigit is chosen so that the sum of all thirteen products is a multiple of ten. In the example below the products sum to 130:

978-1-4165-0778-9

(9 x 1) (7 x 3) (8 x 1) (1 x 3) (4 x 1) (1 x 3) (6 x 1) (5 x 3) (0 x 1) (7 x 3) (7 x 1) (8 x 3) (9 x 1)
   9      21       8       3       4       3       6      15       0      21       7      24       9
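
The ISBN-13 version of the calculation could be sketched the same way. Again this is a minimal illustration with a made-up function name, isbn13_check_digit, rather than code from the project.

```python
def isbn13_check_digit(first_twelve):
    """Return the ISBN-13 check digit for a string of twelve digits."""
    # The 1st, 3rd, 5th... digits get weight 1; the 2nd, 4th, 6th... get weight 3.
    total = sum(int(digit) * (3 if position % 2 else 1)
                for position, digit in enumerate(first_twelve))
    # Distance to the next multiple of ten; the outer modulo turns 10 into 0.
    return str((10 - total % 10) % 10)


# For 978-1-4165-0778-9 the first twelve products sum to 121, so the check digit is 9.
print(isbn13_check_digit("978141650778"))
```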

But Wait, There’s More!

Knowing how to do this in theory is one thing, but manually going through every ISBN and applying this algorithm would be tedious, time consuming and impractical, which is of course why we won’t be doing that. Instead we’ll write a computer program to do it for us.

Python is great for quick script jobs like this:

# This program takes a filepath and tests for two properties on each line of the text file given:
# 1. That the line is in fact the valid length for ISBN-10 or ISBN-13.
# 2. That the line passes a check function to ensure it is a valid ISBN number.
# It outputs lines that did not meet expectations

import argparse 

def main():
  """
     Open the text file given from the command line.
     Read its contents.
     Go through each line of the file and test if it is a valid ISBN number, if yes ignore if no print it.
     """
  parser = argparse.ArgumentParser(description="Checks a file of newline separated ISBN, SBN, and LCCN numbers for consistency.", prog="IsbnChecker")
  parser.add_argument('file')

  arguments = parser.parse_args()

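  FilePath = arguments.file
  # The with block closes the file once every line has been checked.
  with open(FilePath, 'r') as IsbnFile:
    for line in IsbnFile:
      # Strip only the newline so the last line is not truncated when the file lacks a trailing newline.
      IsbnCheck(line.rstrip('\n'))
  return 1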
  
def IsbnCheck(isbn):
  """Take a line of text and determine if it is a valid ISBN number."""
  tests = [0,0,0] # Create a list to store what tests the ISBN number passed.
  if CheckLength(isbn):
    tests[0] = 1
    
  if tests[0] == 1 and CheckContent(isbn): 
    tests[1] = 1 

  if tests [1] == 1 and CheckDigit(isbn):
    tests[2] = 1

  if 0 in tests: 
    print(isbn + " " + str(tests[0]) + " " + str(tests[1]) + " " + str(tests[2]))
    return 0
  else:
    return 1
        
     
def CheckLength(isbn):
  """Check that the line given is in fact the proper length for a ISBN-10 or ISBN-13 number."""
  if len(isbn) == 10 or len(isbn) == 13:
    return 1
  else:
    return 0

def CheckContent(isbn):
  """Check that the content of the ISBN number is consistent beyond length."""
  IsbnChars = {'0','1','2','3','4','5','6','7','8','9','x','X'}
  for character in isbn:
    if character not in IsbnChars:
      return 0
  return 1
 
def CheckDigit(isbn):
  """Implements the checkdigit algorithm to determine if the line is mismatched with its checksum digit."""
  # If ISBN-10, do the ISBN-10 check algorithm.
  if len(isbn) == 10:
    weight = 10
    base = 11
  # Convert the isbn numbers last digit to 10 if it's an X.
    try:
      isbn_10 = [(number == 'x' or number == 'X') and 10 or int(number) for number in isbn]
    except ValueError:
      return 0
  # Calculate the weighted sum of each number in the ISBN-10.
    summation = 0
    for number in isbn_10:
      if weight > 1:
        summation = summation + (number * weight)
        weight -= 1
  # Valid only if the weighted sum including the 10th digit (weight 1) is a multiple of eleven.
    return (summation + isbn_10[9]) % base == 0
  # If ISBN-13, do the ISBN-13 check algorithm.
  elif len(isbn) == 13:
    base = 10  
  # Convert the isbn numbers last digit to 10 if it's an X.
    try:
      isbn_13 = [(number == 'x' or number == 'X') and 10 or int(number) for number in isbn]
    except ValueError:
      return 0
  # Calculate the alternately multiplied sum of each number in the ISBN-13
    summation = 0
    index = 2
    for number in isbn_13[::2]:
      if index == 14:
        break
      summation += isbn_13[(index - 2)]
      summation += (isbn_13[(index - 1)] * 3)
      index += 2
  # Valid only if the weighted sum including the 13th digit (weight 1) is a multiple of ten.
    return (summation + isbn_13[12]) % base == 0
    
main()

In the course of developing this program it was necessary to have known-good test data to work with. This can be obtained by extracting it from the Internet Archive’s Open Library dumps. The ISBN numbers are under the “Editions” collection. As this is a several gigabyte file I have provided a sample of the numbers for the reader to work with here.
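
For readers who would rather pull the numbers out of the dump themselves, something like the following might work. It is a sketch under two assumptions that should be checked against the dump you download: that each line is tab separated with the JSON record in the last column, and that editions store their numbers in lists under the keys "isbn_10" and "isbn_13". The function name and the sample line are made up for illustration.

```python
import json


def isbns_from_dump_line(line):
    """Return the ISBNs recorded in one line of an editions dump."""
    # Assumed layout: tab-separated columns with the JSON record in the last one.
    record = json.loads(line.rstrip("\n").split("\t")[-1])
    # Assumed keys: "isbn_10" and "isbn_13", each holding a list of strings.
    return record.get("isbn_10", []) + record.get("isbn_13", [])


# A fabricated line in the assumed format, not real dump data.
sample_line = "/type/edition\t/books/EXAMPLE\t1\t2014-01-01T00:00:00\t" + json.dumps(
    {"title": "Example", "isbn_10": ["0385420757"]})
print(isbns_from_dump_line(sample_line))
```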

Book Identifiers Besides ISBN Numbers

Running this program on the numbers I transcribed (provided here) gives eighteen ‘wrong’ entries:

 1  06710006827 0 0 0
 2	0671890184 1 1 0
 3	553287737 0 0 0
 4	345032322150 0 0 0
 5	00182745 0 0 0
 6	015200392 0 0 0
 7	0878422263 1 1 0
 8	0965058010 1 1 0
 9	425043029 0 0 0
10	345242114150 0 0 0
11	060803452 0 0 0
12	425034704 0 0 0
13	670051047 0 0 0
14	0345243757150 1 1 0
15	00028762 0 0 0
16	345243102150 0 0 0
17	0760058768  0 0 0
18	0316544963 1 1 0

Thirteen are the wrong length to be an ISBN-10 or an ISBN-13. The other five did not pass the checksum algorithm. Before we continue it’s important to elaborate on a few cases where an entry is not a valid ISBN-10 or ISBN-13 but is still the correct identifier for the given book:

  1. It’s a Library of Congress Control number.

  2. It’s a Standard Book Number. (As opposed to an International Standard Book Number)

  3. It’s an Invalid ISBN.

The Library of Congress Control Number is a standard used by the Library of Congress for their cataloging system, and was at one point the de facto standard for book indexing in the United States. An LCCN is 12 characters long; a full description of the format can be found here. Standard Book Numbers were the system that eventually evolved into the International Standard Book Number. A Standard Book Number is nine characters long and can be converted into an International Standard Book Number by adding a “0” to the front.
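
The Standard Book Number conversion is simple enough to sketch. This is a minimal illustration rather than the patched checker, and both function names are made up. The leading zero is multiplied by the weight ten and so adds nothing to the weighted sum, which means the existing check digit still applies to the ten-character result.

```python
def sbn_to_isbn10(sbn):
    """Prefix a nine-character SBN with '0' to get the equivalent ISBN-10."""
    if len(sbn) != 9:
        raise ValueError("an SBN is expected to be nine characters long")
    return "0" + sbn


def isbn10_is_valid(isbn):
    """True if the weighted sum of all ten characters is a multiple of eleven."""
    values = [10 if ch in "xX" else int(ch) for ch in isbn]
    return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0


# Two nine-character entries from the list of 'wrong' numbers above.
print(sbn_to_isbn10("553287737"), isbn10_is_valid(sbn_to_isbn10("553287737")))
print(sbn_to_isbn10("015200392"), isbn10_is_valid(sbn_to_isbn10("015200392")))
```

The first entry passes once the zero is added, while the second still fails its check digit.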

The ISBN check system relies on publishers generating their ISBN numbers according to the standard. Sometimes they do not do this, and you get ISBN numbers that fail the check but are nevertheless the number printed on a given book. I couldn’t find a simple way to determine whether a failing ISBN number is a publisher’s error or a transcription mistake. Presumably you would need a list of all the invalid ISBN numbers in circulation, and such a list does not seem to be available on the open web.

So what to do about Standard Book Numbers and LCCNs? Obviously their existence means our program is incomplete because it doesn’t handle them, and ideally we would patch the program to handle both. However, LCCNs are considerably more confusing, so instead we can content ourselves with fixing the program to work with Standard Book Numbers. The patched version is available here and yields five ‘corrected’ entries, leaving thirteen:

 1	06710006827 0 0 0
 2	0671890184 1 1 0
 3	345032322150 1 1 0
 4	00182745 0 0 0
 5	015200392 1 1 0
 6	0878422263 1 1 0
 7	0965058010 1 1 0
 8	345242114150 1 1 0
 9	0345243757150 1 1 0
10	00028762 0 0 0
11	345243102150 1 1 0
12	0760058768  0 0 0
13	0316544963 1 1 0

I actually didn’t write the check program until I’d already grabbed the data, but obviously one so inclined could either yank these entries or write another function into the checker for LCCNs. I’ll go ahead and move on to getting metadata from the ISBNs.

What To Do With The ISBNs

Of course the numbers aren’t all that useful by themselves. In order to get a library catalog worth searching through you need a source of metadata to go with the ISBN numbers. The source I used was isbndb.com. They provide 500 free API requests a day, meaning it takes three days to download the entire library’s worth of metadata. (There is also a paid option for larger volumes of requests.) The original program I used to grab this data was painful to use: it involved command line utilities such as curl and wrote the output to a file that later had to be (annoyingly) parsed back into Python data structures.

Python actually provides native functionality for what those command line utilities did, and through the use of command line arguments a lot of the pain points of the old program can be avoided. For those reasons, instead of pasting in the listing for the program I originally used, I’ll write a new one from scratch, this time going through it line by line for readers:

The first thing to do is state what our program does, essentially this program:

# This program takes a file of newline separated ISBN numbers and generates API calls from them
# The API calls are passed to urllib, which uses them to retrieve JSON metadata for each ISBN number
# The metadata is then put into a list and stored as JSON in an output file given by -o or 'isbn-metadata.json'

To start we want to import the libraries that we’ll need while writing the program. Links to the docs for each library are included; in actual practice you would only write the part in square brackets:

# We import urllib to make the api calls to isbndb.com

import [urllib.request](https://docs.python.org/3/library/urllib.request.html)

# We import os.path to check if a given output path exists

import [os.path](https://docs.python.org/3.4/library/os.path.html)

# We import json so that we can output our metadata as a JSON document

import [json](https://docs.python.org/3.4/library/json.html)

# We import argparse to define command line arguments for our program

import [argparse](https://docs.python.org/3.4/library/argparse.html)

# We import time for a delay in API requests

import [time](https://docs.python.org/3.4/library/time.html)

Next we define the main function and write a docstring:

def main():
  """Generate api calls, retrieve documents, put those documents into a JSON file."""

We define the command line arguments for our program:

  argparser = argparse.ArgumentParser(description="Grab metadata for a list of ISBN numbers.")

  argparser.add_argument('file', help="The file of ISBN numbers to read from.")
  argparser.add_argument('-o', '--output', default="isbn-metadata.json", help="The filepath to output JSON data to.")
  argparser.add_argument('-s', '--skip', default=0, type=int, help="How many entries to skip before starting.")
  argparser.add_argument('-r', '--requests', default=500, type=int, help="The number of requests to make from the starting line, default 500.")
  argparser.add_argument('-k', '--key', help="The API key tied to the users account.")

  arguments = argparser.parse_args()
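
To see how these options behave without running the whole program, the parser can be handed a list of arguments directly. The sketch below repeats the argument definitions in a helper named build_parser, which is made up for this example, as are the file name isbns.txt and the key EXAMPLE-KEY.

```python
import argparse


def build_parser():
    """Recreate the command line options described above."""
    parser = argparse.ArgumentParser(description="Grab metadata for a list of ISBN numbers.")
    parser.add_argument('file')
    parser.add_argument('-o', '--output', default="isbn-metadata.json")
    parser.add_argument('-s', '--skip', default=0, type=int)
    parser.add_argument('-r', '--requests', default=500, type=int)
    parser.add_argument('-k', '--key')
    return parser


# A second-day run: skip the 500 lines already fetched on day one.
args = build_parser().parse_args(['isbns.txt', '--skip', '500', '--key', 'EXAMPLE-KEY'])
print(args.file, args.output, args.skip, args.requests, args.key)
# prints: isbns.txt isbn-metadata.json 500 500 EXAMPLE-KEY
```

Options that are left off the command line fall back to their defaults, which is why the output path and request count appear even though neither was given.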

Next we define a function readprevious to determine whether or not a previous instance of the output file has already been created by this command, in which case the JSON data is read in from it and appended to before writing back out.

def readprevious(filepath):
  """Check if filepath exists, if true read in data and return previous session."""
  if os.path.exists(filepath):
    # The with block closes the file even when the JSON fails to parse.
    with open(filepath, 'r') as previous:
      try:
        session = json.load(previous)
      except ValueError:
        raise ValueError("Output file already exists but does not appear to be valid JSON.")
    return session
  else:
    return []

We define a list ‘metadata’ and use that to store our entries in, to initially create it the readprevious function is used so that it is either empty or contains the previous session of metadata:

  # Read any previous session from the output file, not from the file of ISBN numbers.
  metadata = readprevious(arguments.output)

The file given on the command line to take ISBN numbers from is assigned to the variable isbns:

  isbns = open(arguments.file)

From this file the number of lines given by the skip option are read, so that you can resume from a previous session by starting from a line other than zero.

  # Read past the number of lines given by the skip option
  for line in range(arguments.skip):
    isbns.readline()
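
The same skip can be written with itertools.islice from the standard library. This is just an alternative sketch; the helper name remaining_isbns and the sample data are made up for illustration.

```python
import io
from itertools import islice


def remaining_isbns(isbn_file, skip):
    """Yield the stripped lines of isbn_file after discarding the first `skip`."""
    for line in islice(isbn_file, skip, None):
        yield line.strip()


# An in-memory stand-in for the file of ISBN numbers.
sample = io.StringIO("0385420757\n0553287737\n9781416507789\n")
print(list(remaining_isbns(sample, 2)))
# prints: ['9781416507789']
```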

The meat of the program sets the variable ‘key’ to the key supplied on the command line, then makes as many requests as were specified on the command line (500 by default), reading a line from the file of ISBN numbers each time and generating the API request from it. urllib returns a document as a series of bytes that need to be decoded to plaintext so that the json library can read the entry as Python objects. Finally that JSON entry is appended to a large list of JSON entries:

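  key = arguments.key
  for request in range(arguments.requests):
    # strip() removes the trailing newline so it does not end up in the URL
    isbn = isbns.readline().strip()
    if not isbn:
      break  # stop at a blank line or the end of the file
    entry = urllib.request.urlopen("http://isbndb.com/api/v2/json/{key}/book/{isbn}".format(key=key, isbn=isbn))
    entry_bytes = entry.read()
    entry_plaintext = entry_bytes.decode('utf-8')
    entry_json = json.loads(entry_plaintext)
    metadata.append(entry_json)
    time.sleep(1)  # pause between requests; the one-second value is an assumption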
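
The two transformations in that loop, building the URL and turning the returned bytes into Python objects, can be tried without touching the network. In this sketch the helper names build_request_url and decode_entry are made up, and the response bytes are a fabricated stand-in rather than real isbndb.com output.

```python
import json


def build_request_url(key, isbn):
    """Fill the URL pattern from the article with a key and a cleaned-up ISBN."""
    # strip() drops the trailing newline that readline() leaves on each line.
    return "http://isbndb.com/api/v2/json/{key}/book/{isbn}".format(
        key=key, isbn=isbn.strip())


def decode_entry(entry_bytes):
    """Turn the raw bytes of a response into Python objects."""
    return json.loads(entry_bytes.decode('utf-8'))


print(build_request_url("EXAMPLE-KEY", "0385420757\n"))
# A made-up response body, only to show the bytes -> text -> objects steps.
print(decode_entry(b'{"data": [{"title": "Example"}]}'))
```

Keeping these steps in small functions also makes it easy to test them separately from the request itself.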

Finally to finish up the output file is opened and the json metadata written to it:

  # The with block makes sure the output is flushed and the file closed.
  with open(arguments.output, "w") as outfile:
    json.dump(metadata, outfile)
  return 1

Don’t forget to call the main function:

main()

The final program listing is available here.

Some notes on the use of this program:

The program checks the -o option to see if an output file has already been generated; if it has, the new entries are appended to that file’s contents. This makes it easier to work with the data once you need to put the JSON into a database.

When specifying how many lines to skip on the command line, you want to skip as many requests as you have made in all of your previous sessions combined.

The key is the one given with your signup; it is necessary to make requests from the isbndb.com database.

What Next?

Now that you have this data, there are many possible methods of turning it into a complete library system. I’ll go over what I did in part two.

  1. Computer App was basically study period with a computer bent. If I didn’t have it I’d have probably never done this.

  2. One could set up this workflow such that every time you entered an ISBN number it would be automatically checked for integrity. In that case you would be immediately notified of a mistake and could expect a final error rate close to 0%.
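
The check-as-you-type workflow from the second note might look something like the sketch below. It is not something I used on the project; the names is_valid_isbn and transcribe are made up, and the validation is a compact restatement of the check digit rules described earlier. A check digit catches any single mistyped digit, though not every possible combination of mistakes.

```python
DIGITS = set("0123456789")


def is_valid_isbn(text):
    """Check length, characters and check digit for an ISBN-10 or ISBN-13."""
    if len(text) == 10 and set(text[:9]) <= DIGITS and text[9] in "0123456789xX":
        values = [int(ch) for ch in text[:9]]
        values.append(10 if text[9] in "xX" else int(text[9]))
        return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0
    if len(text) == 13 and set(text) <= DIGITS:
        return sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(text)) % 10 == 0
    return False


def transcribe(entries):
    """Split typed entries into accepted ISBNs and rejected ones to re-check."""
    accepted, rejected = [], []
    for entry in entries:
        entry = entry.strip()
        (accepted if is_valid_isbn(entry) else rejected).append(entry)
    return accepted, rejected


# The second entry has its last two digits swapped and is caught immediately.
print(transcribe(["0385420757\n", "0385420775\n", "9781416507789\n"]))
```

Passing sys.stdin as the entries would let the same function check numbers as they are typed or scanned, one per line.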