Problems with Optical Character Recognition – OCR and Digitized Online Newspapers
Newspapers and Their Bounty
Newspapers are chock full of information inherently valuable to the family historian. First and foremost are vital records: notices of births, marriages, deaths and divorces. Secondarily, there are property sales, bankruptcies, news articles, disasters, sports events, graduation exercises and prizes, scandals, letters to the editor, and anything else identifying a person or family in a specific time and place.
To search broadly enough to catch every reference to a particular person, cast a wider net than a single search for Michael Henry Breitenstein Sr. or his wife, Elizabeth Breitenstein. Try any of the following:
- Michael Breitenstein
- M. Breitenstein
- Mike Breitenstein
- Michael H. Breitenstein
- Michael Henry Breitenstein Sr.
- Mr. Breitenstein
- Elizabeth Breitenstein
- Elisabeth Steinmetz Breitenstein
- E. Breitenstein
- Lisette Breitenstein
- Mrs. Breitenstein
- Lizzie Breitenstein
- Mrs. M. Breitenstein
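A list like the one above can be built mechanically rather than typed out each time. The following is only a minimal sketch in Python. The function name `name_variants` and its parameters are hypothetical, and it covers only initials, nicknames, titles and a suffix. Maiden names and alternate spellings such as Elisabeth still have to be added by hand.

```python
def name_variants(surname, given, middle=None, nicknames=(), titles=(), suffix=None):
    """Return search strings for one person, most general first."""
    variants = [surname]  # surname alone, useful when the surname is rare
    variants += [f"{title} {surname}" for title in titles]
    # Full given name, its initial, then any nicknames
    firsts = [given, f"{given[0]}."] + list(nicknames)
    variants += [f"{first} {surname}" for first in firsts]
    if middle:
        variants.append(f"{given} {middle[0]}. {surname}")
        full = f"{given} {middle} {surname}"
        variants.append(f"{full} {suffix}" if suffix else full)
    # Drop duplicates while preserving order
    return list(dict.fromkeys(variants))


mike = name_variants("Breitenstein", "Michael", middle="Henry",
                     nicknames=["Mike"], titles=["Mr."], suffix="Sr.")
lizzie = name_variants("Breitenstein", "Elizabeth",
                       nicknames=["Lisette", "Lizzie"],
                       titles=["Mrs.", "Mrs. M."])

for variant in mike + lizzie:
    print(variant)
```

Each string printed is one search to try. The surname-only entry comes first because, for a rare surname, it may make the rest unnecessary.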
In this case permutations of Beth were not used for Elizabeth; her nickname Lisette leads one to search Liz combinations instead. OCR can only read and index what the newspaper published. Except for this specific family, the surname Breitenstein is not common in Louisville. Nearly every Breitenstein in Kentucky, except those in Covington and Cincinnati, was connected, so rather than search for an incredible number of combinations and permutations, use the surname only, with no first name. This will not work for families with the surnames Smith or Brown, or for surnames used as street, building or company names. With OCR every instance is identified, including advertisements. Other OCR idiosyncrasies include problems with headlines in all capitals, apostrophes and hyphens.
Compare “George Coogle Loses Barn” with “GEORGE COOGLE LOSES BARN.” A search catches the surname Coogle in the first form but not in the second. The search results list two matches of “Coogle” on page 5. However, in the Courier Journal article of 20 Nov. 1904, page 5, these phrases appear:
- GEORGE COOGLE LOSES BARN
- George Coogle’s Loss.
- home of George Coo-gle
- Mr. Coogle’s family
Yet the page's own search box reports:
coogle in the current document
0 documents with 0 instances
This feels like an anomaly, and I can’t explain it: the initial search finds two (2) instances of the word Coogle, but the page search lists zero (0). A human reading of the page reveals the four instances listed above. None of them is exactly Coogle: two are Coogle’s, one is COOGLE and one is Coo-gle. In addition, the surname Coogle continually prompts Google and other search engines to ask whether you would really rather have Google instead of Coogle. It is a good example of artificial intelligence, AI, gone awry.
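To see why an exact match can miss all four printed forms, here is a small Python sketch. It is purely illustrative and is not how any particular newspaper site indexes its pages. The `normalize` and `count_matches` helpers are hypothetical. They fold case, drop a possessive ending and remove hyphens before comparing words.

```python
import re


def normalize(token):
    """Reduce a printed token to a bare lowercase word for comparison."""
    token = token.lower()
    token = token.replace("’", "'")       # treat curly and straight apostrophes alike
    token = re.sub(r"'s\b", "", token)    # drop a possessive ending
    token = re.sub(r"[^a-z]", "", token)  # drop hyphens and other punctuation
    return token


def count_matches(query, text):
    """Count whitespace-separated tokens that equal the query once normalized."""
    target = normalize(query)
    return sum(1 for tok in text.split() if normalize(tok) == target)


phrases = [
    "GEORGE COOGLE LOSES BARN",
    "George Coogle’s Loss.",
    "home of George Coo-gle",
    "Mr. Coogle’s family",
]
page = " ".join(phrases)

exact = sum(1 for tok in page.split() if tok == "Coogle")  # strict, case-sensitive
loose = count_matches("Coogle", page)                      # after normalization

print(exact, loose)
```

The strict comparison finds none of the four phrases, while the normalized comparison finds all of them. A real page adds a further complication this sketch ignores: a word hyphenated at a line break is split across two lines of OCR text.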
In Louisville in the 1910s, the Geo. F. Breitenstein Machine Co. ran the same advertisement in multiple issues of the Kentucky Irish American. In the 1890s and 1900s, Theodore Breitenstein, a.k.a. The Breitenstein, a baseball pitcher, was in the sports news on and off all those summers, making double plays and getting hits. Though he is not related in America to the Louisville Breitenstein family, he appears in all the newspapers, Kentucky and otherwise. He was from a St. Louis, Missouri family.
An obituary of Catherine Zeitz mentioned her daughter Catherine Breitenstein as a survivor. OCR indexing does not identify all instances of a surname or phrase. Hyphens can be anywhere, and page breaks, line breaks, columnar issues, dropped capitals, captions and italics all make it hard for OCR to do its search work, as do dirty, ink-filled printing leads.
Archival and heritage newspapers
In 1909 M. Breitenstein won two awards at the Kentucky State Fair, reported by The Courier Journal in an article entitled “Vegetable Awards Bring Out Merit. Judges Discover Some Record–Breaking Displays in Placing Ribbons,” published Thurs., 16 Sept. 1909, on page 3. In Onion Sets, ring 934, one-half bushel white, he won second place, a $2 prize, while for Irish Potatoes, ring 946, one-half bushel any other variety, he won first prize, a $3 award. This particular issue has been filmed, digitized and indexed online as part of the National Digital Newspaper Program (NDNP), Kentucky Edition, by the University of Kentucky Libraries. M. Breitenstein, a.k.a. Michael Henry Breitenstein Sr., was farming the old farm in Okolona off Durrett Rd.; its long driveway is now known as Breitenstein Lane. This farm was previously farmed by Jacob Breitenstein, Mike’s father, subsequently by Herman and Emil Breitenstein, two of Mike’s sons, and finally by Emil Breitenstein Sr. Knowing that Mike entered and won two prizes at the Kentucky State Fair that year suggests that further research into other years’ winners may yield more information.
The Courier Journal, Thurs., 16 Sept. 1909, p. 3, col. 4, (Louisville, Jefferson Co., Ky.), Products of Kentucky Gardens at the State Fair.
A review of the articles in the Kentuckiana Digital Library Newspaper Collection turned up fewer than ten items with a Henritze connection. Several had to do with the death of Judge T. L. Henritze, one with the wedding of his daughter Mary Helen Henritze and one with the graduation of his youngest child from Millersburg Military Institute.
Bourbon News, Fri., 24 May 1918, p. 5, col. 2, (Paris, Bourbon Co., Kentucky), Social and Personal.
This article includes a list, alphabetical by surname, of the graduates of the class of 1918 of Millersburg Military Institute, which makes it extremely likely that some of these are the same young men who appear in a photograph I have of Fred Cleveland Henritze’s baseball team.
Online newspaper research is wonderful and so much more immediate than the traditional route: figuring out the titles of possible newspapers in the counties needed; identifying which repositories hold which titles and issues; checking the preservation status (microfilmed, scanned, indexed); ordering the film on interlibrary loan (ILL); and then paging through each issue, reading articles until an obituary, wedding or birth notice turns up, all of which requires previous knowledge of the time frame. Browsing can take place in online newspapers too, and I heartily recommend it, but the OCR search capabilities are amazing and narrow searching time tremendously.
Various online newspapers are available from:
- Google News Archives
- Historical Newspapers Online
- HighBeam.com – newspapers but more of a current aspect
- News and Newspapers online from the University of North Carolina
- Library of Congress Chronicling America
None of these lists are comprehensive yet and may never become comprehensive. I have no plans to create a comprehensive list. The last time I did that, it took forever, turned into a book, and consumed enormous amounts of time and energy.
I thought about exploring two of those pay-per-view services with trial runs prior to Christmas and asking for the better one as a present. This was a very bad idea for several reasons.
- No one has time to add a chore the week before Christmas. What was I thinking?
- If you are going to start a yearly subscription, begin it in a month that doesn’t have a ton of bills: not near Christmas, yearly property taxes, IRS payments or semiannual insurance payments. Leaving those six months out leaves six other months available in a “not breaking the bank” way. If you like the service, you will want to convert it to the yearly fee instead of the monthly fee.
- During the Christmas holiday break it was impossible to reach either company in a timely fashion to cancel during the trial period. One, FamilyLink, was understanding, canceled the billing and agreed to let me have another trial another time. The other, HighBeam, was not understanding, and I expect to have to chat again with their customer service.
Once you have figured out how to search a specific newspaper instead of all those in each company’s archives, search that specific newspaper for an address or a combination of first names. William and Mary won’t work, but Pamela and Diane might. Browse near dates that should offer information. For instance, with a month and year of death, search the most likely 35 days for an obituary. If the obituary says married 62 years last June, search the month and year for the wedding, and the 25th, 30th, 40th, 50th and 60th anniversaries, starting with the 60th.
The first list of hits is the easy one. Since M. Breitenstein won two awards in 1909 at the Kentucky State Fair, I fully expect that earlier and later Courier Journal articles about the State Fair winners, from 1900 to 1922, may be fruitful. To find those articles, search near the same date in other years or search the names of other winners. I haven’t done that yet. It is possible that he won in only one year or didn’t take the time to enter in other years, but frugal as he was, his farm was very close to the fairgrounds, and he wouldn’t have had to take off even a full day to be there. He had nine sons who could have “babysat” the entries. There may be more awards articles.
The same logic applies to graduations. If a Roman Catholic family with girls born every two years has one graduation from Cathedral High School in Baltimore, surrounding years should be checked even if the surname is not indexed. It is likely that all the girls in the family went to the same high school. Graduation exercises are a fantastic place to identify middle names.
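The date arithmetic behind the obituary and anniversary searches described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The function names are hypothetical, and the 35-day window is assumed to start on the first of the month of death. The obituary date in the usage lines is invented for the example.

```python
from datetime import date, timedelta


def obituary_window(year, month, days=35):
    """First and last dates to browse, given only a month and year of death.

    Assumption: the window starts on the 1st of the month of death.
    """
    start = date(year, month, 1)
    return start, start + timedelta(days=days - 1)


def wedding_month(obit_date, years_married, anniversary_month):
    """Work back from a phrase like 'married 62 years last June'.

    'Last June' is taken as the most recent June on or before the obituary date.
    """
    if obit_date.month >= anniversary_month:
        last_anniversary_year = obit_date.year
    else:
        last_anniversary_year = obit_date.year - 1
    return last_anniversary_year - years_married, anniversary_month


def anniversary_years(wedding_year, milestones=(60, 50, 40, 30, 25)):
    """Years in which milestone anniversaries fell, 60th first."""
    return [wedding_year + m for m in milestones]


# Hypothetical obituary dated 15 Sept. 1950 saying "married 62 years last June"
window = obituary_window(1950, 9)
wed_year, wed_month = wedding_month(date(1950, 9, 15), 62, 6)
milestone_years = anniversary_years(wed_year)

print(window)
print(wed_year, wed_month)
print(milestone_years)
```

Each milestone year is then searched in the wedding month. If the obituary itself appeared in the anniversary month, "last June" is ambiguous, and both candidate years are worth checking.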