Extracting Data from Web Site [Archive] - Glock Talk

N2DFire
01-22-2005, 16:20
Alright, I know there should be an easy way to do this but for the life of me I can't find it.

I (or actually my g/f) have a textbook that comes with a web site offering additional study aids. One of these aids is a section of Flash Cards. She also has a nifty little flash card program for her PDA that will take a .txt file (in the proper format) and display the entries as flash cards.

What we want to do is somehow extract the data from the web site's flash cards and put it into a text file.

I'm comfortable enough with VB.NET text file manipulation (StreamReader) that this shouldn't be a problem. However, I can't get to the dumb HTML files to open them (because StreamReader won't accept a URL), and I can't seem to find a good program to copy the web site to my computer HD.

The web site is set up such that there is a flashcard page containing a lot of JavaScript that makes the system work. Under that, there are sub folders for each chapter:

/Chapter1
.
.
.
/ChapterXX

In each chapter folder there are card files:
/card1.html
/card2.html
.
.
.
/cardXX.html

Every card file has the following format:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!--
Each flashcard file contains term and definition for one card
as JavaScript variables. Page is written dynamically via
JavaScript function. All logic and style resides in shared files.
State variables in parent frameset determine how/when a single card
appears.
-->
<html>
<head>
<title>Card</title>
<script language="JavaScript">
//data for this card
var term = "adenohypophysis "
var def = "The anterior lobe of the pituitary gland."
var audio = "none"
</script>
<script language="JavaScript" src="../card.js"></script>
<LINK REL="Stylesheet" TYPE="text/css" HREF="../card.css">
</head>
<body bgcolor="#ffffff" background="../card.gif">
<script language="JavaScript">
//write the card
writeCard()
</script>
</body>
</html>

This page is http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card1.html

I can call each card page up on its own, but because the system was written as a frameset, the code required to make it display properly is not present.

What I need in a nutshell is a way to extract the value of var term & var def from each card file so that I can then write them out into a formatted .txt file for the PDA flashcard program.

Any help with accessing the online HTML files via VB.NET, or a good program to cache them to my HD so I can do it the "old" way I know, would be greatly appreciated.

Thanks in Advance

Edited to fix URL

Deathwind
01-22-2005, 19:02
Here's some quick-n-dirty Perl code that will do what you want:

#!/usr/bin/perl -w

use LWP::Simple;

$url = 'http://place.with.the.stuff.com/card.html';

# grab the card page
$content = get($url);

# the term and definition live in JavaScript variables, so just pattern-match them out
if($content =~ m/.*var.term.=.\"(.*)\".*/) {
    print("$1\n");
}
if($content =~ m/.*var.def.=.\"(.*)\".*/) {
    print("$1\n");
}


Should be pretty easy to modify it to loop over the card numbers as well.
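
For example, a version that loops over the card numbers might look something like this (the URL and the card count here are just placeholders - plug in the real ones):

#!/usr/bin/perl -w

use LWP::Simple;

# placeholders - substitute the real chapter URL and however many cards it has
my $base  = 'http://place.with.the.stuff.com/chapter1/card';
my $cards = 20;

for my $num (1 .. $cards) {
    my $content = get($base . $num . '.html');
    next unless defined $content;    # skip a card that didn't download

    if($content =~ m/.*var.term.=.\"(.*)\".*/) {
        print("Q: $1\n");
    }
    if($content =~ m/.*var.def.=.\"(.*)\".*/) {
        print("A: $1\n");
    }
    print("\n");
}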

Or, if you have an aversion to Perl (although it's great for short little junk with text like this), HTTrack (http://www.httrack.com/) or wget (http://www.interlog.com/~tcharron/wgetwin.html) are my preferred website downloaders.

N2DFire
01-22-2005, 22:11
Deathwind - Thanks for the reply. I dunno Perl, but it's high time I started learning it, I guess, so I'll give that a look-see.

Also, I tried the Windows version of HTTrack and gave it "http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/flashcards.html" as the starting page and (I thought) told it to get everything below it; however, it only retrieves the 7 or so files in that folder and will not recurse into the ../chapterXX sub folders.

Any pointers on how to make HTTrack work and I'll have the problem solved. I can write the VB.NET to read the files locally; I just haven't yet figured out how to read them off the web.

David_G17
01-22-2005, 22:22
just skimmed over the post, but would "wget" solve your problem?

(downloading the site locally)

N2DFire
01-22-2005, 23:27
David_G17
I am having the same issues with wget as with HTTrack - I'm too dumb to force it to get the contents of the ../ChapterXX folders.

I tried "wget http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/flashcards.html -m" it downloaded & built the directory structure to that point and all the files in the flashcards folder but nothing beneath that.

N2DFire
01-23-2005, 16:18
YEA PERL !!!!

I did it (well, sort of). It still has an error that I need to trap/fix, and it's not the best means of looping, but all in all, not bad for a first-timer, I don't think.


#!/usr/bin/perl -w

use LWP::Simple;

open(FD, ">A&P_Chapter18.txt");

$X = 1;
$url = 'http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card' . $X . '.html';
$content = get($url);

while($content ne '') {
    # print $X;

    if($content =~ m/.*var.term.=.\"(.*)\".*/) {
        # print ("Q: $1\n");
        print FD ("Q: $1\n");
    }

    if($content =~ m/.*var.def.=.\"(.*)\".*/) {
        # print ("A: $1\n");
        print FD ("A: $1\n");
    }

    print FD (" \n");

    $X += 1;
    $url = 'http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card' . $X . '.html';
    $content = get($url);
}
close(FD);


It would have taken me forever to get the text-matching part, though. Many, many thanks to Deathwind for the starting point.
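
I think the error is just the "uninitialized value" warning you get when get() comes back empty on the card number past the last one. I haven't tried it yet, but I suspect restructuring the loop something like this would take care of it:

#!/usr/bin/perl -w

use LWP::Simple;

open(FD, ">A&P_Chapter18.txt") or die "Can't open output file: $!";

$X = 1;
while (1) {
    $url = 'http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card' . $X . '.html';
    $content = get($url);
    last unless defined $content;    # get() returns undef once we run past the last card

    if($content =~ m/.*var.term.=.\"(.*)\".*/) {
        print FD ("Q: $1\n");
    }
    if($content =~ m/.*var.def.=.\"(.*)\".*/) {
        print FD ("A: $1\n");
    }
    print FD (" \n");
    $X += 1;
}
close(FD);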