This functionality is built into a Python library: https://gitlab.com/vindarel/bookshops
It also provides a shell command-line tool.
We get the detailed information about books and CDs from the following websites:
Apart from Discogs, which provides a public API, we extract the data from the other sites with web scraping.
Web sites are written using HTML, which means that each web page is a structured document. Sometimes it would be great to obtain some data from them and preserve the structure while we’re at it. Web sites don’t always provide their data in comfortable formats such as csv or json.
This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and gather the data that you need in a format most useful to you while at the same time preserving the structure of the data. (excerpt of xxx)
To scrape websites, we have to fire an HTTP request, get the response, parse it and extract the interesting fields.
We have to construct a URL that the website understands as a search request. Observe the URL:
This one is pretty long but very interesting. In particular, we can notice the “quicksearch” field, where our search terms are separated by a + sign, as URL parameters require.
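Such a search URL can be built with the standard library; a minimal sketch, where the base URL and the "quicksearch" parameter name are only illustrative:

```python
# Sketch: build a search URL whose terms end up joined by "+",
# as URL-encoded query parameters require.
# The base URL and the "quicksearch" parameter name are illustrative.
from urllib.parse import urlencode

def build_search_url(base, terms):
    # urlencode uses quote_plus by default, so spaces become "+"
    query = urlencode({"quicksearch": " ".join(terms)})
    return "{}?{}".format(base, query)

url = build_search_url("https://www.example-bookshop.com/search", ["hunger", "games"])
print(url)  # https://www.example-bookshop.com/search?quicksearch=hunger+games
```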
Note
Sometimes the URL is obfuscated. In that case, if studying the POST parameters doesn’t help, we’ll need to use mechanize.
The HTTP connection is done with the python-requests library. It is as simple to use as:
import requests
response = requests.get(url)
then we can explore the response properties, like:
response.status_code
response.text
Parsing is done with BeautifulSoup4.
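A minimal parsing sketch. The HTML and the CSS classes are invented for the example; each real scraper has to find the right selectors in the site’s own markup:

```python
from bs4 import BeautifulSoup

# Invented HTML, standing in for a search-results page.
html = """
<ul class="results">
  <li class="book"><a href="/book/1">The Hunger Games</a>
      <span class="price">9.99</span></li>
  <li class="book"><a href="/book/2">Catching Fire</a>
      <span class="price">10.99</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
books = []
for item in soup.select("li.book"):
    link = item.find("a")
    books.append({
        "title": link.get_text(strip=True),
        "details_url": link["href"],
        "price": item.select_one(".price").get_text(strip=True),
    })

print(books[0]["title"])  # The Hunger Games
```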
Note
Often, the page is partly rendered by JavaScript calls to the server. As a result, the HTML we get with requests from our Python script isn’t the same as the one displayed in the browser (when we view its source).
A common tool to get the HTML after JavaScript execution is Selenium. However, we don’t necessarily need it (and haven’t so far). Indeed, we can study what calls the page makes to which API endpoints and call them ourselves, or simply get the HTML of a book’s details page.
Often, we don’t get all the data we want about a single book from the list of results (we always want its price and ISBN). We can get it in a second pass, by scraping its details page.
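The second pass can be sketched as follows. The HTTP fetch is stubbed with canned HTML (and the markup is invented) so the example runs offline; a real scraper would fetch the details page with requests:

```python
from bs4 import BeautifulSoup

# Stub standing in for requests.get(url).text, so the sketch runs offline.
# The URLs and markup are invented for the example.
PAGES = {
    "/book/1": """<div id="book">
                    <h1>The Hunger Games</h1>
                    <span class="isbn">9780439023481</span>
                    <span class="price">9.99</span>
                  </div>""",
}

def fetch(url):
    return PAGES[url]

def scrape_details(url):
    # Second pass: visit the book's details page to get the
    # fields missing from the results list (here, price and isbn).
    soup = BeautifulSoup(fetch(url), "html.parser")
    return {
        "isbn": soup.select_one(".isbn").get_text(strip=True),
        "price": float(soup.select_one(".price").get_text(strip=True)),
    }

book = {"title": "The Hunger Games", "details_url": "/book/1"}
book.update(scrape_details(book["details_url"]))
print(book["isbn"])  # 9780439023481
```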
No easy answer! But it looks like it is.
Take deDE/buchlentner/buchlentnerScraper.py as an example.
See also the BaseScraper.
But first, contact us!
The website should:
Unit tests and end-to-end tests.
They can be slow to run, because we are waiting on HTTP requests. We do not use a cache for the end-to-end tests.
To run end to end tests (“live tests”), go to the datasources directory and run:
make testscrapers
These tests are defined for each scraper. They use a base class in utils/baseScraper.py. The expected results are defined in each scraper’s test_scraper.yaml. This YAML file defines a list of books we expect to find in the scraping results. The base test fires a search, filters the results (on the title, the EAN and the price, which are expected to be identical) and then checks more fields (publishers, authors, etc.). It also tests that the postSearch method returns what is expected.
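The filtering idea can be sketched like this. The fixture below is invented in the spirit of a test_scraper.yaml, not the actual schema, and the results are canned rather than scraped live:

```python
import yaml

# An invented fixture in the spirit of a test_scraper.yaml:
# a list of books we expect to find in the scraping results.
fixture = yaml.safe_load("""
- title: The Hunger Games
  ean: "9780439023481"
  price: 9.99
""")

# Canned data, standing in for live search results.
results = [
    {"title": "The Hunger Games", "ean": "9780439023481", "price": 9.99},
    {"title": "Catching Fire", "ean": "9780439023498", "price": 10.99},
]

def matches(expected, found):
    # Filter on the fields that must be identical: title, ean, price.
    return all(found[key] == expected[key] for key in ("title", "ean", "price"))

for expected in fixture:
    assert any(matches(expected, found) for found in results)
print("ok")
```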
TODO: run tests periodically.
We use requests_cache to automatically cache the HTTP requests.
TODO: give an option to bypass it.
See the list on gitlab.
Integrate pages that need JavaScript, with Selenium. It’s easy, but it needs more processing, so let’s try to avoid it first. (Ask us: we’re doing it for Foyles.com.uk.)
For sites whose URL is not guessable, use mechanize.
Study how XPath could help shorten the code and the creation of new scrapers.
Test with continuous integration on GitlabCI.
It’s on the command line only and is still a work in progress.
The ods (or csv) file can be of these forms:
In short:
make odsimport odsfile=myfile.ods
This functionality relies on two scripts:
There’s more info in them if you want to develop (and want to cache HTTP requests or store and retrieve a set of results).
The ods file needs at least the following information, with the corresponding English or French label (case is not important):
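Recognizing a column from either label, case-insensitively, can be sketched as below. The accepted labels are illustrative guesses, not the library’s actual list:

```python
# Sketch: map a column header, in English or French and in any case,
# to a canonical field name. The accepted labels are illustrative.
LABELS = {
    "title": ["title", "titre"],
    "publisher": ["publisher", "editeur"],
    "price": ["price", "prix"],
}

def normalize_header(label):
    label = label.strip().lower()
    for field, accepted in LABELS.items():
        if label in accepted:
            return field
    return None  # unknown column

print(normalize_header("TITRE"))  # title
print(normalize_header("Price"))  # price
```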
There’s a little test suite:
cd search/datasources/odslookup
make test
Upcoming info: the category and historical information.
Note
Known limitations: