Data sources and webscraping

This functionality is built into a Python library: https://gitlab.com/vindarel/bookshops

It also provides a command-line tool.

Where do we get the data about books and CDs from?

We fetch detailed information about books from the internet, wherever we can find it. This data is incomplete for a professional bookstore, which should subscribe to a Dilicom account instead. It is meant for tests and for individuals.

How to import an ods LibreOffice sheet

The import is available on the command line only and is still a work in progress.

The ods (or csv) file can take one of these forms:

  • it has a row with an “isbn” column and a “quantity” column (this is the easiest and most precise way);
  • it has a row containing the names of the columns. In that case, it must have a “title” column or an “isbn” one;
  • it contains only data, with no row declaring the column names. In that case, we use a settings.py file to declare them.
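The three forms above can be told apart by looking at the header row. The following is only a minimal sketch for a csv file: the function name `detect_form`, the returned labels, and the assumption that the header is the first row are all hypothetical, and the actual logic in odslookup.py may differ.

```python
import csv
import io


def detect_form(rows):
    """Classify a sheet given as a list of rows (lists of strings).

    Hypothetical helper, not part of the library. It assumes the
    column names, if any, are in the first row.
    """
    if not rows:
        return "data_only"
    # Compare labels case-insensitively, ignoring surrounding spaces.
    header = {cell.strip().lower() for cell in rows[0]}
    if {"isbn", "quantity"} <= header:
        return "isbn_quantity"
    if "title" in header or "isbn" in header:
        return "named_columns"
    # No recognizable label: the columns must be declared elsewhere
    # (the page mentions a settings.py file for this case).
    return "data_only"


sample = "ISBN,Quantity\n9782070360024,3\n"
rows = list(csv.reader(io.StringIO(sample)))
print(detect_form(rows))
```

Checking the “isbn” and “quantity” pair first mirrors the order of the list: it is the most precise form, so it should win when both tests would match.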

In short:

make odsimport odsfile=myfile.ods

This functionality relies on two scripts:

  • search/datasources/odslookup/odslookup.py is responsible for extracting the data from your ods file and fetching the data for each row. It returns a big list of dictionaries with, supposedly, all the information we need to register a Card in the database. When it fetches results, it must check that they are accurate. Beware of false positives!
  • scripts/odsimport.py calls the script above and adds everything to the database. It adds the cards with their quantity, and creates places, publishers and distributors if needed.

Both scripts contain more information if you want to develop them further (for example to cache HTTP requests, or to store and retrieve a set of results).

The ods file needs at least the following information, with the corresponding English or French label (case is not important):

  • the card’s title (“title”, “titre”),
  • the publisher (“éditeur”),
  • the distributor (will be the publisher by default),
  • its discount (“remise”),
  • the public price (first column with “price” or “prix” in it),
  • the quantity (“stock”, “quantité”).
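To illustrate how such labels could be matched regardless of case, here is a minimal sketch. The names `LABELS` and `map_columns` are hypothetical and not taken from the scripts. Only the labels listed above are included, and the distributor is left out because no label is given for it here.

```python
import unicodedata

# Field name -> accepted labels, taken from the list above.
LABELS = {
    "title": ("title", "titre"),
    "publisher": ("éditeur",),
    "discount": ("remise",),
    "quantity": ("stock", "quantité"),
}


def map_columns(header):
    """Return {field: column index} for a header row (hypothetical helper)."""
    mapping = {}
    for index, cell in enumerate(header):
        # Normalize accents and ignore case and surrounding spaces.
        label = unicodedata.normalize("NFC", cell).strip().lower()
        for field, names in LABELS.items():
            if label in names and field not in mapping:
                mapping[field] = index
        # Price: the first column whose label contains "price" or "prix".
        if "price" not in mapping and ("price" in label or "prix" in label):
            mapping["price"] = index
    return mapping


print(map_columns(["Titre", "Éditeur", "Remise", "Prix public", "Stock"]))
```

The price rule is a substring test rather than an exact match, so a header such as “Prix public” is accepted, and only the first matching column is kept.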

There’s a little test suite:

   cd search/datasources/odslookup
   make test

Planned additions: the category and historical information.

Note

Known limitations:

  • the script will include a few false positives. It cannot tell the difference between “a title t.1” and “a title t.2”.
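One possible way to reduce this kind of false positive is to compare volume markers before accepting a result. This is only a sketch of the idea, not something the script does: the names `VOLUME_RE`, `volume_number` and `same_volume` are hypothetical, and the pattern only covers a few common spellings (“t.1”, “tome 2”, “vol. 3”).

```python
import re

# Matches volume markers such as "t.1", "t. 2", "tome 3" or "vol. 4".
VOLUME_RE = re.compile(r"\b(?:t|tome|vol)\.?\s*(\d+)\b", re.IGNORECASE)


def volume_number(title):
    """Return the volume number found in a title, or None."""
    match = VOLUME_RE.search(title)
    return int(match.group(1)) if match else None


def same_volume(wanted, found):
    """True when both titles carry the same volume number (or none)."""
    return volume_number(wanted) == volume_number(found)


print(same_volume("a title t.1", "a title t.2"))
```

A result whose title fails this check could then be flagged for manual review instead of being imported silently. Titles that legitimately contain such a pattern would still be misread, so this remains a heuristic.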
