Forgotten Books

21 October 2013

I’ve just been given membership to an online research site that you might want to join, too. It’s called Forgotten Books, an online warehouse of over a million books dating back to the 1500s, and all the way up to the 1940s, all image-over-text editions, fully searchable and readable and downloadable in numerous formats. It’s not free, but it is affordable, and superior to Google Books in several respects.

I was given a free lifetime membership if I blogged about the site (no matter what I said, good or bad). But that benefit wouldn’t be worth anything if the site wasn’t worth using. And it definitely is of some use. And that’s worth knowing about. It has some defects that need fixing (and its management is working on those). But it has uses as well. I’ll summarize my thoughts on both counts.

The Basic Plus Side

Its million-plus books going all the way back to the 1500s are browsable and searchable, images and all. Continuing membership is very affordable. And with it not only can you skim and read countless old books, you can also run searches on them, and build statistical charts with the results–similar to text-analysis features I taught and used back at Columbia University when I worked for the Digital Texts Service at the campus library, where I helped numerous researchers compile their own digital texts, search routines, and statistics. And now anyone can do it, online, for cheap. They’ve already compiled the search indexes, so building searches and statistics does not involve the long processing delays you would face doing this in the raw (for more on how they built their indexes, and their limitations and capabilities, see here). You can also search for images in these books (of which there must be millions).

You pay the site not only for the tools and services they provide, but also visual builds. Although the text of these books is public domain, the editions, reconstruction, formatting, and presentation are proprietary (e.g. the images are their reproductions, so they still have photography rights). So if you wanted to reuse images or page views, beyond fair use, you would still need to negotiate permissions through them. Otherwise…

You can grab and read texts as PDFs, Kindle files, and any number of other formats. If you use the online book reader at the site, it has a bookmarks and notes function, much like a Kindle. Dedicated mobile apps for the site are also in development. The number of books you can read or download per month is limited by the level of membership you subscribe to, but you can preview a large portion of every book without limit. And all the other features are unlimited use. They also sell paperback editions of most (and soon all) of their books–if you want to have a hard copy.

You can see along the left margin at the site the many categories of books they have. I was most intrigued by their massive collection of esoteric titles: afterlife and immortality, alchemy, astrology, ciphers and codes, ESP and psychic phenomena, freemasonry and secret societies, magic and witchcraft, theosophy, unexplained phenomena. Thousands of titles. Think of the kinds of data (as well as entertainment) you can reap from a collection like that.

The research ability in early American history is profound as well, so studying what people really were saying around the American Revolution or the Civil War is right there for you to search through. Other interesting subjects include books in philosophy, religion, ancient history, early science, languages (including dictionaries from other centuries, valuable for studying words as they have changed meaning; and books in languages other than English, including Spanish, Latin, Italian, French, and German), as well as fiction, and more. They also have a collection of administrative records (genealogy data, audits and surveys, minutes and reports).

You can start learning all about the Forgotten Books service and what you can do with it here.

Experimenting and Limitations

The service offers as examples of what you can do with it:

With this valuable research information, we can tell you virtually anything about anything, from the most commonly used word in fiction books published in 1765, to the book with the most images of cats in the first 20 pages. Or perhaps some more useful information, such as a list of every word in the English language in order of usage frequency.

Experts in this kind of textual analytics will want to know how clean the OCR is. So I asked. They estimate that on average they might have about a 2% error rate, and that agrees with what I saw (some items, e.g. organizational minutes, are worse than others, e.g. commercially printed novels). But they actually know where most of the errors are (every failure to match their master word list is flagged) and are continually working to clean them up (so the database will continue to get more accurate over time). And because the books are presented as image-over-text, those errors will only matter for searches and analytics, not for reading the text, since you can simply see the page as it was scanned, regardless of whether the OCR got the text right.

In our correspondence about this, the creator of the site informed me that in other OCR projects like Google Books…

Older books have more OCR errors, therefore the frequency of all words will be skewed, showing for every word that its frequency increases over time. [But w]hat is really increasing is the quality of the text. Forgotten Books’ word data does not have this fundamental flaw because non-dictionary words were excluded from the calculations in order to correct for this error.

Their ongoing project to manually improve the text will improve statistical results even further.
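To illustrate the correction described in that quote, here is a minimal sketch in Python. It is not Forgotten Books' actual code: the tiny `DICTIONARY`, the function name `relative_frequency`, and both sample token lists are made up for illustration. It assumes the correction simply drops non-dictionary tokens from both the word count and the total before computing a frequency.

```python
from collections import Counter

# Hypothetical stand-in for a master dictionary (assumption, not the site's list).
DICTIONARY = {"the", "church", "and", "school", "met", "in", "albany", "adjourned"}

def relative_frequency(tokens, word, dictionary=None):
    """Share of tokens equal to `word`.

    If `dictionary` is given, tokens outside it (likely OCR debris) are
    dropped from both the count and the total before dividing.
    """
    if dictionary is not None:
        tokens = [t for t in tokens if t in dictionary]
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)

# The same ten-word sentence as a clean scan and as a poor scan (made-up data).
clean_scan = ["the", "church", "and", "the", "school",
              "met", "in", "albany", "and", "adjourned"]
poor_scan = ["the", "church", "aud", "tbe", "schooi",
             "met", "iu", "albany", "and", "adjonrned"]

clean = relative_frequency(clean_scan, "the")                    # 2 of 10 tokens
naive_old = relative_frequency(poor_scan, "the")                 # 1 of 10 tokens
filtered_old = relative_frequency(poor_scan, "the", DICTIONARY)  # 1 of 5 valid tokens
```

In this toy case the unfiltered poor scan makes "the" look half as frequent as it is, while the filtered figure matches the clean scan. Real data will not line up so neatly, since OCR errors do not hit every word equally.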

To test the site out I decided to experiment.

I went into “administrative records > minutes and reports” and found 6344 titles there. There is an “about” page explaining what kind of works are in that category, and I can list titles alphabetically or by popularity or relevance–but not yet by date of publication. Although that feature is definitely on the way. Its absence is a significant limitation to a historian, but they assure me it won’t be absent for long. Word searches, however, can be limited to specific dates or date ranges, but many of the texts have not been date-coded properly, and so right now you will get dirty search results by date–for example, sometimes a book from 1977 will end up in your “before 1830” filtered search list. I’m assured they are fixing this as well, with manual verification and correction of every title and date (it’s a top priority), so searches will get cleaner over time.

Two search features are worth noting. The default “book” search (i.e. without using the drop-down menu to search a specific data field like “title”) combines the results of title, author, and the 100 most common words in the book (excluding stopwords and matching plurals), which is convenient. Meanwhile, the drop-down menu allows other options such as the “page search,” which allows you to search every page in every book in the library at once. For example, here are the results for the phrase “black magic” (I apologize if that doesn’t load for non-subscribers).
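For readers curious how a default search like that could work, here is a rough Python sketch. It is an assumption-laden illustration, not the site's implementation: the short `STOPWORDS` set, the crude `singular` rule for matching plurals, and the names `book_search_terms` and `matches` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "it"}

def singular(word):
    """Crude plural matching: drop a trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def words_of(text):
    """Lowercase alphabetic tokens in `text`."""
    return re.findall(r"[a-z]+", text.lower())

def book_search_terms(title, author, body, top_n=100):
    """Index terms for one book: title and author words plus the
    `top_n` most common body words, stopwords excluded."""
    body_words = [singular(w) for w in words_of(body) if w not in STOPWORDS]
    common = [w for w, _ in Counter(body_words).most_common(top_n)]
    meta = [singular(w) for w in words_of(f"{title} {author}")]
    return set(meta) | set(common)

def matches(query, terms):
    """True if every query word (after plural folding) is an index term."""
    return all(singular(w) in terms for w in words_of(query))

# Made-up body text; top_n is lowered to 3 so the cutoff is visible.
terms = book_search_terms(
    "The Astral Plane", "C. W. Leadbeater",
    "The planes of nature. Astral planes and astral inhabitants.",
    top_n=3,
)
```

With these inputs, a query for "astral planes" or "Leadbeater" would match the book, while "inhabitants" would not, because that word falls outside the three most common body words kept in the index.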

In the “about” page for the category I chose to experiment with we learn interesting things like this:

The Woman’s Committee, United States Council of National Defence: An Interpretative Report by Emily Newell Blair is particularly interesting. This is a report on the First World War (1914-1918) and is particularly notable because the author, Emily Newell Blair, was a renowned feminist and suffragist, providing a feminine and very different perspective from all the male dominated governance of the time. She was also a fantastic writer in her own right and this free ebook of her report should be enjoyed by anyone with an interest in the history of this war a century ago.

And that’s just one gem among thousands to find here.

But when I went exploring in the six thousand entries in this category I found some limitations in the navigation. If you know what you are looking for, it’s easy to find (using the general search functions rather than the category browsing). But if you want to peruse or narrow entries within a category, it’s a bit awkward. They are aware of these issues, and improvements are on the way. But current users should be aware of this. Searches can be run by words in the title, but that doesn’t help when you want to make sure you are searching every book in a specific subject. Right now there is only category browsing, and no category search filter. But that should be an easy feature to add, and they’ve told me they are working on it.

From the first few of the six thousand or so entries I picked one at random, Minutes 1851 by Reformed Church In America Particular Synod Of New York, and noted that the description section is very dirty OCR (but there is a warning note saying to expect that). For example, it begins “Five copies of the Minutee of the last sesfiion of the Partaculftr Synod ef Jklbany were received and laid on the table. Article VL Corrrspondbnck. Nothing oceurred. Article VII. Classical Rkports.” So you can see the character recognition wasn’t doing too well there. But when I went to read the minutes, of course, it was all image-over-text, so I had little difficulty. Still, you can see how searching the text will be of limited reliability if the underlying text is as dirty as the description paragraph.

In this case that dirtiness is to be expected, since, being the published minutes of such a long time ago, the inking of the type is a bit poor for modern scanners to handle, although the human eye does fine. You will get cleaner OCR from commercially printed books, even from the same period.
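The flagging against a master word list mentioned earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `MASTER_WORDS` is a hypothetical miniature word list, `flag_suspect_tokens` is an invented name, and the real pipeline is surely more involved. The sample is the first sentence of the garbled description quoted above.

```python
import re

# Hypothetical miniature master word list (assumption); a real one
# would be a full dictionary.
MASTER_WORDS = {
    "five", "copies", "of", "the", "minutes", "last", "session",
    "particular", "synod", "albany", "were", "received", "and",
    "laid", "on", "table",
}

def flag_suspect_tokens(text, master_words):
    """Return (tokens, suspects): every alphabetic token in `text`,
    and the subset that fails to match the master word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    suspects = [t for t in tokens if t not in master_words]
    return tokens, suspects

sample = ("Five copies of the Minutee of the last sesfiion of the "
          "Partaculftr Synod ef Jklbany were received and laid on the table.")

tokens, suspects = flag_suspect_tokens(sample, MASTER_WORDS)
# Fraction of tokens that failed to match: a rough proxy for the OCR error rate.
suspect_rate = len(suspects) / len(tokens)
```

On this one sentence the sketch flags five of 22 tokens. A check like this misses errors that happen to produce real words, and it flags legitimate names and archaic spellings, so the resulting rate is only a rough estimate.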

But back to my test. The minutes I browsed contain tables of data, as well as plain news items and interesting remarks, so you can imagine the wealth of historical and cultural information you could dig up here or glean in general. For example, these minutes contained a complete breakdown of all the churches and pastors and schools under that synod (many dozens), a census of members and students for each, cash receipts and outlays, and more. Plus church trials and rulings and appeals regarding the dismissal of pastors and accusations against them. And more.

I then went into the “afterlife and immortality” section and found 180 titles there. I picked one early on, The Astral Plane: Its Scenery, Inhabitants, and Phenomena by C. W. Leadbeater, published in 1900. I found it full of confident descriptions of the astral plane and its contents (by a theosophist, of course). Tons of cultural assumptions to explore and marvel at in there. Likewise interesting alternative religious ideas. And so on. Did you know there are really seven astral planes, each higher than the next? Hence we’re told we must not call the astral plane the fourth dimension. That would be an error. Indeed.

I then went into “philosophy -> metaphysics” and found Philosophy: What Is It? by F. B. Jevons, published in 1914, which is interestingly described as (sic):

One of the branches of the Workers Educational Association expressed a desire to know what Philosophy is; thereby assuming that Philosophy is a concern of the average man and of practical life, and should not be the monopoly of the professed student. Of the truth of this j.view there can be no doubt, and this book consists of the five lectures which, were given by way of an attempt, not so much to answer their question as to bring out the meaning of the question. Hence the interrogative form of the title PhUosophy: what is it? The attempt was necessarily made, in the discussion of the question, to avoid technical terms as far as possible. Without technical terms it is impossible, it may be said, to go very far in the discussion.

Notice the cleaner OCR here (as compared with the minutes before).

Conclusions

Overall Forgotten Books is superior to Google Books (which faces similar defects anyway). The ways you can employ it, the size of the collection, the variety of ways you can read books downloaded or viewed from it, the ability to bookmark and annotate, and the commitment to improve even what flaws it has over time (which I expect will go more rapidly the more subscribers they get), all combine with its relative affordability to make this a site at least worth taking a look at.

11 Comments

steele on October 21, 2013 at 1:50 pm

Richard,

Thanks for the review, I like Google Books but I will definitely check this out. I am glad to see there is some competition to the Google behemoth, it is always a good thing.

You and I may disagree on things but being able to read books online is something I think many people take for granted but I find it great to be able to access these resources online which I think we can both agree on.

Thanks

Erik
F [is for failure to emerge] on October 21, 2013 at 5:11 pm

Definitely cool, especially considering what is happening to important books and editions thereof left to the care of malnourished systems of libraries.
lpetrich on October 22, 2013 at 8:02 am

Looks like a great resource. Ideally, one would want *every* surviving publication digitized, and I don’t know offhand how close we are to that goal for publications before the last few centuries.
- Richard Carrier on October 22, 2013 at 9:05 am
  
  Corporations killed that dream with the Mickey Mouse law.
Candice Ginson on December 11, 2013 at 11:14 pm

Forgotten Books often offers life membership at regular intervals which is more worthwhile than their regular and add-on membership tariff.

Have a look at what they offered very recently:

Life Membership Options

Casual Reader: 10 eBooks / Month, $49.95 (Available)
Fast Reader: 25 eBooks / Month, $69.95 (Available)
Book Lover: 100 eBooks / Month, $89.95 (Available)
Collector: 500 eBooks / Month, $99.95 (Available)
Librarian: 1000 eBooks / Month, $109.95 (Available)

So, it is worthwhile to wait for it to return, while you get a free book a day by simply registering on their website, and choose the smallest package if you want really something urgently.

Not everyone is given a free life membership with [no?] download limit, like Richard Carrier 🙁

I wonder why Richard Carrier failed to mention whether Forgotten Books obviates need or supplants for the truly sought after perennial titles within ocean of deservedly forgotten and outdated books.

http://www.scholarsbooklist.com
http://www.mosleyfacsimiles.com
http://www.willowsreprints.com
http://www.abebooks.com/books/RareBooks/collection-expensive-reprint-publisher/facsimile-editions.shtml

and other scholarly important antiquarian book alternatives not available in softcopy

I would appreciate open feedback here by Richard Carrier and Forgotten Books here, if not the full emails exchanged with Richard Carrier for knowing and pursuing deeper the scope and depth of utility perspectives they discussed. I prefer responses here than to myself in private.

May this thread be supported by all, with the hope of paid life memberships during New Year 2014 eve from forgottenbooks.org !
- Richard Carrier on December 13, 2013 at 10:43 am
  
  I don’t understand your question.
Chris Moss on December 19, 2013 at 12:51 pm

I’m not sure how hard you’ve worked for your subscription to Forgotten Books. You conclude unequivocally that “Forgotten Books is superior to Google Books,” but on very flimsy grounds.

Consider:
Number of books: FB has 1,000,000; GB has (on a recent count) 30,000,000.

Quality of OCR: you’ve accepted FB’s metric without question. Whether it’s true or not I don’t know.

Cost: You need to subscribe to read most FB books ($3/m upwards). GB are free if there are no copyright issues, but limited outside the US otherwise.

Convenience: FB does offer Kindle but only as images (afaics); GB puts much stuff on archive.org where multiple formats are available. (archive.org claims 5m books).

Now I’m not knocking another entry in the ebook market – I’m all for it. I guess that FB is a profitable enterprise without advertising. But I don’t think you’ve proved anything.
Satyamev Jayate on January 5, 2014 at 6:30 am

I agree with Chris and Candice completely, but wish to add about this quite bit of a fraud website due to the fact they are copyrighting out-of-copyright many trash, some treasure and few rare titles by simply adding a few self-captioned pages in front or rear end of the scanned book from public libraries.

They were fair and proper up until recently when they were offering their service at USD 49 for life; which is reasonable price for the offering but lately they have been profiteering with solicited endorsements traded for free life memberships to “big” names as influencers, with a pricing model that is not in public interest for third world countries with disadvantaged exchange rate vis a vis USD.

Their stock is entirely dependent on public librarians and public endowments; currently used for private enrichment of their founder David Forsythe, which can’t hold a candle to Google Books at all, by any stretch of imagination.

It is a worthwhile service only at the price and type that they have now stopped offering – life membership model.

Preserving human knowledge lawfully, comprising mostly of out of print and out of copyright held in trust for humanity till posterity, is a project of Google Books is grand and noble with free access. But doing it in a quite a surreptitious and not at all strategic venture withholding books from humanity till posterity, in an odious re-copyrighting project with usurious rate of service disproportionate to the true source of those books, is despicable.

Appended is the free book a day email from forgottenbooks.org on 1st Jan 2012, which is self-explanatory.

Legal eagles must launch proper class action case against forgottenbooks for usurping public interest for private gain.

What do the other commentators think, apart from the original blogger, who has missed these vital points completely ?

In support of free knowledge or affordable access to yesteryear cumulative knowledge of humanity,

Satyamev Jayate,
INDIA

———- Forwarded message ———-
From: Forgotten Books
Date: Sun, Jan 1, 2012 at 6:04 AM
Subject: Free Book of the Day – Sunday 1st
To: bhatpm@gmail.com

Dear Subscriber,

Today is Sunday 1st of January. Your free book for today is:
Beowulf

To download your free high-quality copy, please visit the following link:
http://www.forgottenbooks.org/freebie.php?book=60358884GA22682

You have only 24 hours to download the book before the link expires. If you miss it, you can still download the free copy, or purchase membership. Membership to Forgotten Books costs only $49 for life, you only pay once.

Thank you for your continued support,
Forgotten Books Team

To unsubscribe:
http://www.forgottenbooks.org/unsubscribe.php
LEN on June 24, 2014 at 9:32 am

“You pay the site not only for the tools and services they provide, but also visual builds. Although the text of these books is public domain, the editions, reconstruction, formatting, and presentation are proprietary . . .”

This is actually not true. Forgotten Books takes things that are not public domain and sells them in violation of copyright law. There are things that they are selling on this site that are available elsewhere for free (though copyrighted) on the www.

There is no email or contact info on the Forgotten Books web site and there should be–that way they could be informed that they are in violation of the law and the spirit of the law. Perhaps if you are in touch with them you might ask them about this fact.
- Richard Carrier on June 26, 2014 at 11:11 am
  
  None of that is true.
  
  I don’t think you actually understand how copyright law works.
  
  And they are easily contacted.
Wilfredo Mendez on November 3, 2014 at 5:14 am

The forgottenbooks.com and forgottenbooks.org are down.

Any idea as to:

– What happened to them? Why did they shut down the site?

– When or if they will re-launch the site?