
Paul Lombard

Abstract: .info will just be an experiment domain dealing with scraping unique sources of data into a MySQL database and staggering the output. I really want to avoid using WordPress for this and just code everything from scratch. So, various sources of data exist out there which engines haven't indexed. PDFs, for example. Using utilities and a Python scraper, I could scrape large amounts of text into rows of data and output those at random intervals. I think a good goal is about 10,000 'posts'. Not sure if I should stick to a theme, or just mix the subject matter. Also, either with naturally occurring keywords, or with randomly inserted ones, I'd like to create an IA to feed other sites. Possibly just insert my name in random places and link back to my main property.

Update: 25 February 2015
Right... .info, right? Ok, well, I did a simple inurl: search to find books on consulting:

PDF inurl: search

Easy enough... when you open up a random one and do an exact-match search on text from the top of the PDF, it's indexed, but text from lower down isn't... which is an encouraging sign. Perhaps engines only bother with the first x% of the text in PDFs and then stop indexing? Dunno... I'd have to look into it further.

This is from the first couple of pages and seems to be indexed properly: (check the first result, ignore the error message)

This snippet from lower down isn't indexed as a contiguous string...

So, now... I'd go looking for instances of 'consultant'... at every x-th occurrence in the text, I'd insert "SEO Consultant" and link back to my main property. Who knows what that'll do, if anything, but it's fun and games.
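The every-x-th-occurrence substitution could be sketched with awk. Everything here is an assumption: the interval (3), the target URL, and the sample input line, none of which the post pins down.

```shell
# Stand-in for a line of converted PDF text (assumption)
printf 'A consultant met a consultant who knew a consultant.\n' > book.txt

# Replace every 3rd occurrence of "consultant" with linked anchor text
awk 'BEGIN { n = 0 }
{
  out = ""; line = $0
  while (match(line, /consultant/)) {
    n++
    hit = substr(line, RSTART, RLENGTH)
    if (n % 3 == 0) hit = "<a href=\"http://example.com\">SEO Consultant</a>"
    out = out substr(line, 1, RSTART - 1) hit
    line = substr(line, RSTART + RLENGTH)
  }
  print out line
}' book.txt > book_linked.txt
cat book_linked.txt
```

Because the counter lives in awk, it carries across lines of the file, so the interval is global rather than per-line.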

Next Time: I'll run some bash utilities against the PDF, convert it, and store the result in a flat file for later cycling... probably do the echoing through a cron job... avoiding CMSs for now.

Update: 25 February 2015
Ok... Turns out this is pretty simple. I first had to sudo apt-get install the xxx package to get the right utility. From there, it's a simple one-liner to convert between the formats, and you end up with a .txt file. Then there's a lovely unix tool called split <<< this is why I love unix/linux. You simply aim it at your file and tell it what level of granularity you want, and it creates the different files. mkdir to make things neat and put it all in there.
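A runnable sketch of the convert-and-chop step. The unnamed package is my guess: pdftotext from poppler-utils is the usual Debian tool for this, but the post doesn't say which utility was used, so the conversion is shown as a comment and simulated with generated text.

```shell
# The likely one-liner (pdftotext ships in poppler-utils -- an assumption):
#   sudo apt-get install poppler-utils
#   pdftotext book.pdf book.txt
# Simulate its output so the chunking step below is runnable:
seq 1 1000 | sed 's/^/line /' > book.txt

# mkdir to keep things neat, then chop into ~200-line chunks
mkdir -p parts
split -l 200 book.txt parts/post_
ls parts | wc -l
```

With 1000 lines and 200 lines per chunk, split emits five files named parts/post_aa through parts/post_ae.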

The next step is to set up a small shell script to loop over each file in the folder and echo its contents into a file of the same name. The problem is, the split utility creates pretty random file names. I suppose what I can do is grab the first x words and rename the file that way. The loop will also create an XML sitemap, and we can look at pinging Google at a later stage once each post goes up.
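The rename-by-first-words idea plus the sitemap loop could look something like this. The domain, the three-word slug rule, and the stand-in chunk file are all assumptions.

```shell
# Stand-in for one of split's randomly named chunks (assumption)
mkdir -p parts
printf 'Consulting Basics And More\n' > parts/post_aa

BASE="http://example.info"   # hypothetical domain
{
  printf '<?xml version="1.0" encoding="UTF-8"?>\n'
  printf '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
  for f in parts/post_*; do
    # slug = first three words, lowercased, dash-joined
    slug=$(tr 'A-Z' 'a-z' < "$f" | tr -cs 'a-z0-9' ' ' \
           | awk '{ print $1 "-" $2 "-" $3; exit }')
    mv "$f" "parts/$slug.txt"
    printf '  <url><loc>%s/%s</loc></url>\n' "$BASE" "$slug"
  done
  printf '</urlset>\n'
} > sitemap.xml
```

One caveat: two chunks starting with the same three words would produce the same slug, and mv would silently overwrite the first, so a real run would want a collision check.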

Oh, I forgot to mention that sed helped me with...


Update: 4 March 2015
I can't get my f*cking cronjobs to run. It's got something to do with setting the PATH of the bash shell on the Debian server. I can't really carry on with this until I can run scripts. :(
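The usual cause is that cron runs jobs with a minimal default PATH (on Debian, /usr/bin:/bin), so any command outside those directories isn't found. A crontab fragment along these lines would likely fix it; the script path and schedule are hypothetical.

```shell
# In crontab -e -- set PATH explicitly at the top of the crontab:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# then the job line; script name and hourly schedule are assumptions
0 * * * * /home/paul/bin/publish_next.sh >> /var/log/publish.log 2>&1
```

The alternative is to use absolute paths for every command inside the script itself; redirecting stderr to a log, as above, also makes cron failures visible instead of silent.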

Update: 17 March 2015
Ok, screw it! Here you go, Googlebot... here is the link with all the links, so crawl! Hope this doesn't mess with the domain.

Update: 16 June 2015
Did a few loops and chopped the old doc into parts.

Update: 29 October 2015
The National Archives "provide for a National Archives and Records Service; the proper management and care of the records of governmental bodies; and the preservation and use of a national archival heritage; and to provide for matters connected therewith." The archive contains a range of documents, but what seems really appealing is the sound archives and oral history. You could potentially do some interesting things with that, provided that you can somehow traverse their records digitally.

Also... seems like there is a way of searching the archives online:

I love the fact that we have the abbreviation NASA and another one, "NAAIRS"... what a bunch of ..., um...