Re: Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 10:44 pm
by tyteen4a03
VirtLands wrote:Congrats, you're such a brain with that code (which I don't even know what is displayed there); it's some kind of Python, I guess?
I can see bits and pieces of webpages being slowly digested by that spider thing you created.
(The only programming I know is with Blitz.)
I suppose in the future you can ultimately make some (offline) search program for us, where when you input an author or level name, it returns all kinds of stuff.
I'm optimistic. I think you'll finish this in about a week.
Keep up the good work.
tyteen4a03 wrote:The spider is working...
The big bit displayed was the topics, with the format "topicid": (topicname, topicStarterUID). It will be passed on to other parts to crawl the posts inside each topic.
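Roughly, that structure looks something like this in Python (the IDs and names below are made up purely for illustration):
Code: Select all
# A minimal sketch of the mapping described above; the IDs and names are made up.
topics = {
    "1234": ("Example topic title", "42"),   # "topicid": (topicname, topicStarterUID)
    "1240": ("Another example topic", "7"),
}
# Each entry is later handed off to the part that crawls the posts inside the topic.
for topic_id, (topic_name, starter_uid) in topics.items():
    print(topic_id, topic_name, starter_uid)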
The offline search part is actually going to be in the new website - I guess you can call that offline search

(but the most accurate term is off-site search)
Re: Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 11:02 pm
by VirtLands
Maybe you can make an offline version too - one that you can give us to download...
Re: Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 11:17 pm
by tyteen4a03
VirtLands wrote:tyteen4a03 wrote:The offline search part is actually going to be in the new website - I guess you can call that offline search

(but the most accurate term is off-site search)
Maybe you can make an offline version too - one that you can give us to download; its info will be finite (dated up to a point).
All offline programs work faster than online ones.
Now that you've got the bot code off to a good start, I was wondering...
How will you prevent duplication of search efforts? What I mean is, how will you keep from searching the same links over and over?
For example, Link A links to several others (perhaps 10), and those links will eventually link back to Link A, in some strange round loop.
(Well, I'm going to lala land, will be back in a while...)
It won't, because the database backend for this bot has a unique key feature - it prevents the same post/topic/profile from being archived over and over again. The bot also does not follow links - it simply grabs the information it needs and then moves on.
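The unique key idea works roughly like the sketch below - SQLite stands in here for illustration only and is not necessarily the actual backend:
Code: Select all
# Sketch only: SQLite stands in for whatever database backend the bot actually uses.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (postID TEXT UNIQUE, topicID TEXT, content TEXT)")

def archive_post(post_id, topic_id, content):
    # Because postID is UNIQUE, re-inserting the same post is silently ignored,
    # so crawling a post twice can never create a duplicate row.
    db.execute("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)", (post_id, topic_id, content))
    db.commit()

archive_post("100", "5", "first crawl")
archive_post("100", "5", "second crawl of the same post")  # ignored
print(db.execute("SELECT COUNT(*) FROM posts").fetchone()[0])  # prints 1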
An offline version will depend on whether I can spare the time to work on a GUI application (which has been a bit of a failure for me). Development will also depend on whether MS wants to use this website to handle future level submissions - if that's the case, another application wouldn't be necessary. If the community really does want this application, work on it will start after the website's development finishes.
Posted: Fri Aug 03, 2012 6:54 pm
by tyteen4a03
Another status update...
I am now grabbing posts from the forum, and it required some special configuration... anybody who can guess what I did gets a (virtual) cookie.

Posted: Fri Aug 03, 2012 7:08 pm
by StinkerSquad01
Well, you changed the date to the post number(?).
Posted: Fri Aug 03, 2012 7:25 pm
by tyteen4a03
StinkerSquad01 wrote:Well, you changed the date to the post number(?).
It's not the post number...
(hint: Compare my picture to what you see on the main page)
Unix Time Stamps
Posted: Fri Aug 03, 2012 7:33 pm
by VirtLands
Posted: Fri Aug 03, 2012 7:51 pm
by StinkerSquad01
I was thinking that as well..
Posted: Sat Aug 04, 2012 8:23 am
by tyteen4a03
VirtLands wrote:
My wildest guess is that you've conveniently converted the date-time into
a special numeric format for easier sorting.
Does anyone see a pattern here? (I can't).
Fri Aug 03, 2012 11:33 am -- 1343947903 -- VirtLands
Fri Jul 13, 2012 8:17 am -- 1342196228
Sat Jul 14, 2012 5:15 pm -- 1342314949
Fri Aug 03, 2012 6:51 am -- 1344005460
Thu Aug 02, 2012 2:30 pm -- 1343946627
Thu Jul 19, 2012 5:25 pm -- 1342747557 -- hex:5008B3A5
I'll get back to you on this. 
They are Unix timestamps. They represent seconds since the Unix Epoch (1st January 1970, 00:00:00 GMT) and are a very convenient time format, since you can turn them into any displayable format.
(And yes, they can also be used for sorting.)
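For illustration, here is one of the timestamps quoted above converted with Python's standard library:
Code: Select all
# Converting a Unix timestamp (seconds since 1970-01-01 00:00:00 GMT) into readable forms.
from datetime import datetime

ts = 1342196228
print(datetime.utcfromtimestamp(ts))                           # 2012-07-13 16:17:08 (UTC)
print(datetime.utcfromtimestamp(ts).strftime("%a %b %d, %Y"))  # Fri Jul 13, 2012
# The forum displays the same timestamp in each viewer's local timezone.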
Unix Time Format
Posted: Sun Aug 05, 2012 4:46 am
by VirtLands
Re: Unix Time Format
Posted: Sun Aug 05, 2012 4:57 am
by tyteen4a03
VirtLands wrote:Aha. I was so close. 
( I was hunting for date-time formats, and it never occurred to me that it's Unix. )
Out of curiosity, I may compare Unix time to other formats to see 'complexity' vs 'convenience'.
Let us know of your eventual progress.
Here's some coffee to keep you going. 
Thanks! I've been having insomnia lately and can't focus on anything I do.
Here's a code snippet to show you the current progress. It might not be the tidiest, but it at least works.
This snippet requires knowledge of Python OOP and of the Scrapy and Beautiful Soup libraries.
Code: Select all
def parseTopics(self, response):
    soup = BeautifulSoup(response.body)
    # Find topic information
    topics = []
    for link, profileBit in zip(soup.find_all("a", attrs={"class": "topictitle"}),
                                soup.find_all("span", attrs={"class": "name"})):
        # Make sure Announcements and Stickies are not scanned twice,
        # while making sure they do get scanned at least once
        if link["href"].split("=")[1] in self.topics_to_ignore:
            continue
        aTopic = Topic()
        aTopic["forumID"] = response.meta["forumid"]
        aTopic["topicID"] = link["href"].split("=")[1]
        aTopic["topicName"] = link.string
        aTopic["posterID"] = profileBit.a["href"].split("=")[2]
        topics.append(aTopic)
        if link.previous_sibling.string in ["Announcement:", "Sticky:"]:  # We've scanned this topic before, let's skip it in the future
            self.topics_to_ignore.append(link["href"].split("=")[1])
    # Figure out if there's tomorrow
    hasMultiplePages = soup.find_all("td", align="right", valign="bottom", nowrap="nowrap")
    if hasMultiplePages:
        hasNextPage = hasMultiplePages[0].find_all("a", text="Next")
        if hasNextPage:
            yield Request((self.root_domain + "/" + hasNextPage[0]["href"]),
                          callback=self.parseTopics,
                          meta={"forumid": response.meta["forumid"]})
    for t in topics:
        if t["topicID"] not in ignoreList:  # ignoreList is defined elsewhere in the spider
            yield Request((self.root_domain + "/" + "viewtopic.php?t=" + t["topicID"]),
                          callback=self.parsePosts,
                          meta={"topicID": t["topicID"]})
Code is complex
Posted: Sun Aug 05, 2012 5:35 am
by VirtLands
Re: Code is complex
Posted: Sat Aug 11, 2012 6:30 pm
by tyteen4a03
VirtLands wrote:Hmmm, this post hasn't been updated in a while. Could be it's turning into a cobweb site.
Yes, I haven't had time to work on it for a while.
Here's a very important bit of the spider - post scraping. This 70-line code scrapes post information and attachments, while initiating the scraping of post content (I will explain later why that's a separate process), user profiles, and (of course) the next page.
It's 2:30 AM here now, I need to go to sleep.
Code: Select all
def parsePosts(self, response):
    """
    Parse post content.
    """
    soup = BeautifulSoup(response.body)

    def aName(tag):
        return tag.name == "a" and isinstance(tag["name"], int)

    def aHref(tag):
        return tag.name == "a" and tag["href"].startswith("profile.php?mode=viewprofile&u=")

    def spanClass(tag):
        return tag.name == "span" and tag["class"] == "postdetails" and tag.string.startswith("Posted: ")

    def spanClassPostBody(tag):
        return tag.name == "span" and \
            tag["class"] == "postbody" and not \
            tag.string.startswith("<br />_________________<br />")

    def attachURL(tag):
        return tag.name == "a" and tag["href"].startswith("download.php?id=")

    def determineContentFetchMode(postid):
        if soup.find("a", href="posting.php?mode=editpost&p=" + postid):
            return "edit"
        elif soup.find("a", href="posting.php?mode=quote&p=" + postid):
            return "quote"
        else:
            return "raw"

    # Find posts information
    posts = []
    attachments = []
    for (pid, userid, username, posttime, content) in zip(
        soup.find_all(aName),  # Post ID
        soup.find_all(aHref),  # User ID
        (span.b.string for span in soup.find_all("span", attrs={"class": "name"})),  # Username
        soup.find_all(spanClass),  # Post time
        soup.find_all(spanClassPostBody),  # Post body
    ):
        aPost = Post()
        aPost["postID"] = pid
        aPost["topicID"] = response.meta["topicID"]
        aPost["posterID"] = userid.strip("profile.php?mode=viewprofile&u=")
        aPost["postTime"] = posttime.strip("Posted: ")[0:9]  # Timestamps are always 10 digits
        attachTable = content.find_next_sibling("table", attrs={"class": "attachtable"})
        # Attachment?
        if attachTable:
            anAttachment = Attachment()
            anAttachment["postID"] = pid
            anAttachment["originalFilename"] = attachTable.find("span", attrs={"class": "gen"})  # The original name
            anAttachment["displayFilename"] = attachTable.find(attachURL)
            attachments.append(anAttachment)
        # Initiate Post content scraping
        mode = determineContentFetchMode(pid)
        if mode != "raw":
            yield Request((self.root_domain + "/" + "posting.php?mode=" +
                           ("editpost" if mode == "edit" else "quote") +
                           "&p=" + pid),
                          callback=self.parsePostContent,
                          meta={"postID": pid, "content": None, "mode": mode})
        # Initiate User scraping
        if username not in self.users_scanned:
            yield Request((self.root_domain + "/" + userid),
                          callback=self.parseUser)
        posts.append(aPost)
    yield posts
    # Figure out if there's tomorrow
    hasMultiplePages = soup.find("td", align="left", valign="bottom", colspan=2)
    if hasMultiplePages:
        hasNextPage = hasMultiplePages.find("a", text="Next")
        if hasNextPage:
            yield Request((self.root_domain + "/" + hasNextPage["href"]),
                          callback=self.parsePosts,
                          meta={"topicID": response.meta["topicID"]})
crawlers and stuff
Posted: Sat Aug 11, 2012 9:30 pm
by VirtLands
Teleport Pro
Posted: Sat Aug 11, 2012 11:26 pm
by VirtLands
Posted: Sun Aug 12, 2012 3:55 am
by tyteen4a03
I don't want to work with regexes (they are a pain in the butt), and all those customizations just hurt my brain.
And the code I'm writing is open-source (well, it will be soon; I haven't had time to upload it to GitHub yet).
For now, the best help would be to grab me coffee.

Zoolz
Posted: Mon Aug 13, 2012 2:55 am
by VirtLands
I had an idea: if you finish this project and wish to share your hard-earned data with us, then you can upload it to one of the following.
Create an account on Zoolz: http://goo.gl/6D4uT
or
SkyDrive: https://skydrive.live.com/
Zoolz link for downloaded http://www.pcpuzzle.com/forum/
Posted: Mon Aug 13, 2012 4:51 am
by VirtLands

Uploaded in zipped form, 240 MB (incomplete, but substantial).
Zoolz link: http://zlz.me/yq5i9
Re: Zoolz link for downloaded http://www.pcpuzzle.com/forum/
Posted: Mon Aug 13, 2012 12:24 pm
by jdl
VirtLands wrote:Surprisingly it only amounts to 240 MB, yet it states "complete".
{ 8878 files in 1058 folders }, maybe it's just a very, very compact format.

I think I found out why. I just tested your download, and for example, for the old Off-Topic forum it says "Goto page 1, 2, 3 ... 41, 42, 43", but pages 4-40 have not been archived. It appears that the "in-between" pages of all the subforums were not downloaded at all.
Posted: Mon Aug 13, 2012 1:11 pm
by tyteen4a03
search.php does not work because it's a dynamic page.
I also have my own storage space - I don't use cloud file services.
I also want to clarify why writing my own spider is better than archiving all the pages and then mining data out of them - it saves disk space. Because pages are scanned and mined on the fly, almost no disk space is needed to store unnecessary HTML files (which would be a lot of overhead).
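As a rough sketch of that on-the-fly idea (not the actual project code - the names here are illustrative), a Scrapy item pipeline can write each scraped item straight to a database while the downloaded HTML is thrown away:
Code: Select all
# Rough sketch, not the actual project code: scraped items go straight to storage,
# and the page's HTML is discarded once it has been parsed.
import sqlite3

class StorePostsPipeline(object):
    def open_spider(self, spider):
        self.db = sqlite3.connect("archive.db")
        self.db.execute("CREATE TABLE IF NOT EXISTS posts "
                        "(postID TEXT UNIQUE, topicID TEXT, postTime TEXT)")

    def process_item(self, item, spider):
        self.db.execute("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)",
                        (item.get("postID"), item.get("topicID"), item.get("postTime")))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()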
offline and so fine
Posted: Mon Aug 13, 2012 7:29 pm
by VirtLands
I forgot to mention (c):
(a) It gets stuck when it wanders onto http://www.midnightsynergy.com/
(b) The search function ( http://www.pcpuzzle.com/forum/search.php ) doesn't work.
(c) Contains no attachments, and therefore no levels & customMedia
______________________________________________________
Thanks to JDL for his studying of the download.
I thought there was something fishy about it only being 240 MB.
Posted: Mon Aug 13, 2012 11:36 pm
by tyteen4a03
Yes, I blacklisted the login page.
The spider mines the topic lists, post lists, user profiles, and attachments of specific forums.
Offline Explorer Enterprise
Posted: Thu Aug 16, 2012 2:20 am
by VirtLands
I see.

-----------------------------------------------------------------------
Re: Offline Explorer Enterprise
Posted: Thu Aug 16, 2012 3:02 am
by tyteen4a03
VirtLands wrote:I see.
Well, I tried the demo of Offline Explorer Enterprise, and tried so many ways to set up "URL omissions" so that it won't log me out of http://www.pcpuzzle.com/forum/
But I could never get it to download attachments; I can only get it to download the regular HTML stuff (+ images).
I basically gave up on the attachments option.
Looks like we'll have to get someone to download all
the attachments for us. Any volunteers? 
-----------------------------------------------------------------------
So, how much progress have you made with your data mining ?
You need to be logged in to download attachments.
Today's the last day that I'm not free. Work will start tomorrow.
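For anyone wondering how a spider gets past that, one way a Scrapy spider can log into a phpBB board before crawling looks roughly like this (the form field names are a guess at the login page, not necessarily what this bot does):
Code: Select all
# Sketch only: log in first so that attachment links (download.php) become reachable.
# The "username"/"password" field names are assumptions about phpBB's login form.
import scrapy

class LoggedInSpider(scrapy.Spider):
    name = "loggedin"
    start_urls = ["http://www.pcpuzzle.com/forum/login.php"]

    def parse(self, response):
        # Fill in and submit the login form found on the page.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "YOUR_NAME", "password": "YOUR_PASSWORD"},
            callback=self.after_login)

    def after_login(self, response):
        # Scrapy keeps the session cookie automatically, so protected pages can now be fetched.
        yield scrapy.Request("http://www.pcpuzzle.com/forum/index.php",
                             callback=self.parse_forum)

    def parse_forum(self, response):
        pass  # normal crawling of topics, posts and attachments continues from here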
Web Boomerang
Posted: Fri Aug 17, 2012 5:53 pm
by VirtLands

[ Time to start worrying about the sharks, they're coming after you.. ]
I changed my mind about Web Boomerang. It's awful.
Posted: Fri Aug 17, 2012 6:25 pm
by tyteen4a03
That site still exists? o.o
Anyway, work has resumed and I expect another update very soon. Hopefully I will be able to put the spider to work for the first time.
SID
Posted: Fri Aug 17, 2012 7:40 pm
by VirtLands
I'm temporarily back on the 'rack. (HTTrack)
I just learned what a SID is:
Whenever one logs onto a forum (such as this), we are provided
with a SID.
For example, my SID is:
sid=786b9ecdf00278##################
Aha! Did you really think I'd tell you my SID? (I covered it with #'s.)
So, folks, never give out your SID.
Definition and examples of SID:
http://kb.iu.edu/data/aotl.html
A SID contains:
User and group security descriptors
48-bit ID authority
Revision level
Variable sub-authority values
____________________________________________
Posted: Sat Aug 18, 2012 4:45 am
by tyteen4a03
SID means PHP Session ID. Every time you visit another page, it is refreshed in the database. There is absolutely no harm in giving out your PHP Session ID, as hackers can't really do anything with it.
(But if an exploit has been found in the software, this will not be the case - I won't explain that here.)
Re: HTTrack project 20GB
Posted: Sat Aug 18, 2012 5:06 am
by LexieTheFox