Extracting Hacker News Book Recommendations with the ChatGPT API
Created: 2023-10-12. I have written a follow-up with some details about the process used to generate this list: Extracting HN Book Recommendations with the ChatGPT API, Vol. II
I love books, and I enjoy reading through the Hacker News (HN) book recommendation threads. On HN, there are almost 200 stories so far this year that have the separate word “book” in the title and aren’t linked to another page. I wondered what the most commonly recommended or mentioned books are, and mainly whether SICP or PCL would be the top recommendation.
After reading about the man who categorised his favourite podcast into Dewey Decimal classes using GPT, I was aware that the GPT API could be used to categorise data and output the information in JSON format. So, using the HN data fetched from the Hacker News API, I took the subset of stories that seem to be book recommendation threads and extracted book titles, authors and URLs from the text using calls to the Chat Completions API.
Here are the top 50 book recommendations:
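A minimal sketch of that kind of story filter, for illustration only. It assumes each story is a dict with the `title` and `url` fields used by the Hacker News API, and that a story with no `url` counts as "not linked to another page". The sample stories are made up, and whether plurals such as "books" should also match is a judgment call; this version matches only the exact word.

```python
import re

# Matches "book" only as a separate word, so "Facebook" and "books" do not count.
BOOK_WORD = re.compile(r"\bbook\b", re.IGNORECASE)

def is_book_thread(story: dict) -> bool:
    """Return True for stories with no outbound link whose title contains "book"."""
    title = story.get("title") or ""
    has_link = bool(story.get("url"))
    return bool(BOOK_WORD.search(title)) and not has_link

# Made-up sample stories, shaped like Hacker News API items.
stories = [
    {"title": "Ask HN: What book changed your life?", "text": "..."},
    {"title": "Facebook open-sources a new tool", "url": "https://example.com"},
    {"title": "A book review", "url": "https://example.com/review"},
]
book_threads = [s for s in stories if is_book_thread(s)]
```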
Edit: Some corrections:
- Dune was by Frank, not Brian, Herbert.
- The Meditations listed is by Marcus Aurelius, not Descartes.
- Calculus is by Michael Spivak, although that count also includes 14 mentions of Calculus by Apostol. (Whoops.)
Thanks to the Hacker News commenters who pointed out these errors. My SQL query should have returned the most common author for each title, instead of min(author).
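To illustrate the difference, here is a small sketch using SQLite from Python. The `books` table, its two columns, and the rows in it are hypothetical stand-ins, not the real schema or data. The first query reproduces the min(author) behaviour, which picks the alphabetically first author; the second counts each (title, author) pair and keeps the most frequent author per title. It relies on window functions, so it needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
# Made-up rows: one wrong attribution per title.
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("dune", "Frank Herbert"),
        ("dune", "Frank Herbert"),
        ("dune", "Brian Herbert"),
        ("meditations", "Marcus Aurelius"),
        ("meditations", "Marcus Aurelius"),
        ("meditations", "Descartes"),
    ],
)

# min(author) returns the alphabetically first author, not the usual one.
min_authors = conn.execute(
    "SELECT title, MIN(author) FROM books GROUP BY title ORDER BY title"
).fetchall()

# Count each (title, author) pair, then keep the most frequent author per title
# while still reporting the total number of mentions of the title.
query = """
WITH counts AS (
    SELECT title, author, COUNT(*) AS n
    FROM books
    GROUP BY title, author
),
ranked AS (
    SELECT title, author,
           SUM(n) OVER (PARTITION BY title) AS mentions,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY n DESC, author) AS rk
    FROM counts
)
SELECT title, author, mentions FROM ranked WHERE rk = 1 ORDER BY title
"""
top_authors = conn.execute(query).fetchall()
```

Note that this still groups by title alone, so different books that share a title are merged into one row.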
Update, 2023-10-12: The top 50 list has been revised; JavaScript and Calculus were there in error. Each of those entries was actually several different books with the same name. For example, JavaScript: The Good Parts was renamed to ‘javascript’ by my data processing code, but there are other books with the same title, and there are a number of books called “Calculus”. I have de-duped those, and they dropped off the top 50 as a result.
Some things I discovered while doing this project:
- When the API doesn’t return valid JSON, it is usually because ChatGPT is saying things like “I apologize for the confusion…” or “You’re welcome! If you have any more questions, feel free to ask.” in response to an HN comment that just says “thanks” or asks a question.
- I designed the prompt so that I can discard responses with empty titles, because I was unable to get ChatGPT to stop including mentions of authors without the title of a particular book.
- Processing 57k comments cost about $40 using the GPT-3.5 Turbo API.
- Even with a temperature of 0, GPT’s results vary from call to call. Others have noticed this effect (HN discussion). It’s not just GPT-4 that is non-deterministic: GPT-3.5 Turbo exhibits greater variability compared to earlier GPT-3 models.
- It can identify links in the text, but I had to remove the HTML tag and leave just the URL; otherwise GPT would pick up the truncated link text instead of the URL.
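The link clean-up in the last point could look something like the sketch below. It assumes comment text arrives as HTML in which links are `<a href="...">` tags with truncated link text and HTML-escaped characters in the `href`; the sample comment is made up. A regular expression is a simplification here, and a real HTML parser would be more robust.

```python
import html
import re

# An anchor tag with its href captured; the (possibly truncated) link text is dropped.
ANCHOR = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)

def replace_anchors_with_urls(comment_html: str) -> str:
    """Replace each <a href="...">text</a> with the bare, unescaped URL."""
    return ANCHOR.sub(lambda m: html.unescape(m.group(1)), comment_html)

# Made-up comment with a truncated link text.
comment = (
    'See <a href="https:&#x2F;&#x2F;example.com&#x2F;a-very-long-book-url" '
    'rel="nofollow">https:&#x2F;&#x2F;example.com&#x2F;a-very-lo...</a> for details.'
)
cleaned = replace_anchors_with_urls(comment)
```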
Here’s an example of the JSON output by ChatGPT for this comment. It got everything wrong except the link, but it shows the format of the data:
[
  {
    "match": "Hitchhiker's Guide Vms Unsupported Undocumented Can Go Away At Any Time Feature",
    "title": "The Hitchhiker's Guide to the Galaxy",
    "author": "Douglas Adams",
    "link": "http://www.amazon.com/Hitchhikers-Guide-Vms-Unsupported-Undocumented-Can-Go-Away-At-Any-Time-Feature/dp/1878956000"
  }
]
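Given that format, handling a response might look roughly like this sketch: anything that is not valid JSON (such as an apology or a “You’re welcome!”) is treated as containing no books, and entries with an empty title are dropped. The function name `parse_books` is a hypothetical helper, not part of any library.

```python
import json

def parse_books(response_text: str) -> list[dict]:
    """Parse one model response into a list of book entries with non-empty titles."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        # Chatty replies such as "I apologize for the confusion..." end up here.
        return []
    if not isinstance(data, list):
        return []
    return [
        item
        for item in data
        if isinstance(item, dict) and str(item.get("title") or "").strip()
    ]

chatty = parse_books("You're welcome! If you have any more questions, feel free to ask.")
author_only = parse_books('[{"match": "any Plath", "title": "", "author": "Sylvia Plath"}]')
valid = parse_books('[{"match": "SICP", "title": "Structure and Interpretation of Computer Programs", "author": "", "link": ""}]')
```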
Edit: Someone asked for the prompt, so here it is:
prompt = [
            {"role": "system", "content": "Assistant that identifies book titles and authors in the following document and shows the words you match to a book title from. Some titles may be abbreviated, please expand the abbreviated title. If the document talks about an author but doesn't mention a book, leave \"title\" blank. If you know who the author is, provide the author. Don't include the book's subtitle. If the text is asking for a recommendation, without mentioning a book, then return an empty array. Provide your answer in a json array."},
            {"role": "user", "content": 'Wren\'s Explosion https://www.amazon.com/gp/395, and any Plath."'},
            {"role": "assistant", "content": '''[{"match":"'Wren's Explosion","title":"Explosion","author":"P.C. Wren","link":"https://www.amazon.com/gp/395"}, {"match":"any Plath","title":"", "author":"Sylvia Plath"}]'''},
            {"role":"user", "content":"3-days free trial isn't freemium."},
            {"role":"assistant", "content":"[]"},
            {"role":"user", "content": "Miranda Hamilton"},
            {"role":"assistant", "content": '''[{"match":"Miranda Hamilton","title":"Hamilton","author":"Lin-Manuel Miranda, Jeremy McCarter","link":""}]'''},
        ]
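As a rough sketch of how a prompt like this can be used, each HN comment is appended as a final user message and sent with a temperature of 0. The helper `build_request`, the shortened `few_shot` list, and the sample comment are illustrative assumptions, and the model name `gpt-3.5-turbo` is inferred from the model mentioned earlier. The API call itself is left as a comment so the block runs without a key.

```python
def build_request(few_shot: list, comment_text: str, model: str = "gpt-3.5-turbo") -> dict:
    """Return keyword arguments for one Chat Completions call."""
    # The few-shot messages stay fixed; only the last user message changes.
    messages = [*few_shot, {"role": "user", "content": comment_text}]
    return {"model": model, "messages": messages, "temperature": 0}

# Shortened stand-in for the full prompt list shown above.
few_shot = [
    {"role": "system", "content": "Assistant that identifies book titles and authors ..."},
    {"role": "user", "content": "3-days free trial isn't freemium."},
    {"role": "assistant", "content": "[]"},
]

# Made-up comment text.
params = build_request(few_shot, "I'd recommend SICP to anyone learning to program.")

# With the openai package (version 1 or later) the call would look roughly like:
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   response = client.chat.completions.create(**params)
#   text = response.choices[0].message.content
```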
The Data
Because I enjoy working with data and think you might find it interesting to analyse the results, here’s the raw data produced by GPT, sorted by title. Note that there’s a match column in there, which includes an excerpt of the comment where the book was identified. I also normalised the book titles by lowercasing them, removing ‘the’ if present at the start, and removing any subtitles. This enabled me to query the top books without missing too many items due to inconsistency in the names that GPT came up with.
Here is the input data in zipped CSV format; it expands out to a 24 MB file.
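The normalisation steps described above can be sketched as follows. Treating everything after the first colon as the subtitle is an assumption made for this example, and `normalise_title` is a hypothetical name.

```python
import re

def normalise_title(title: str) -> str:
    """Lowercase a title, drop its subtitle, and drop a leading "the"."""
    t = title.strip().lower()
    # Assumption: the subtitle is whatever follows the first colon.
    t = t.split(":", 1)[0].strip()
    # Remove "the" only when it is the first word.
    t = re.sub(r"^the\s+", "", t)
    return t

examples = [
    "JavaScript: The Good Parts",
    "The Hitchhiker's Guide to the Galaxy",
    "The Pragmatic Programmer: From Journeyman to Master",
]
normalised = [normalise_title(t) for t in examples]
```

The first example comes out as just "javascript", which shows how different books can collapse into the same normalised title.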
Note
I have added Amazon affiliate links to the Amazon URLs in the tables above, mainly as a learning exercise.