Threading NodeBox execution?
Posted by Cedric on Aug 02, 2008

Hi.
I am trying to make a graph, but each node makes an HTTP query to a database to retrieve some of its information. When I run my code, NodeBox freezes until completion. I tried to thread my code using the threading module, but I am rather a newbie on this topic. Here is a summary of what I am doing:
Any idea/help would be greatly appreciated. Thanks.

Hi Cedric,
There are several points to touch on here.
First, you may want to have a look at the animation tutorial, which introduces the draw() command. Code you put in draw() is called several times per second - this allows you to create animations, or do different things at specific times and events (e.g. the user clicks the mouse, web content is done downloading).
If you don't use an animation, NodeBox first executes all of the code (this is the "freeze" you mention) and only then draws the result.
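For instance, here is a minimal sketch of an animation - the moving circle is just an arbitrary example:

size(300, 300)
speed(30) # draw() is called 30 times per second.
x = 0
def draw():
    global x
    # Each frame, nudge the circle a little to the right.
    x += 2
    oval(x, 140, 20, 20)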
Next, you may want to use the Web library. It offers support for asynchronous downloads (in the background), so you don't have to deal with threading yourself.
Here's a short example. It keeps downloading the current URL until it is ready, at which point it adds the URL to the graph and starts downloading the next one.
size(500, 500)

web = ximport("web")
urls = web.yahoo.search("nodebox")
i = 0 # The current URL to retrieve.
download = web.url.retrieve(urls[i], asynchronous=True)

# Clear the cache for live downloads each time the script runs.
#web.clear_cache()

graph = ximport("graph")
g = graph.create()
g.add_node("root")

speed(30)
def draw():
    global download, i
    # Once we are done downloading a URL,
    # add it to the graph and start downloading the next.
    # The retrieved content is stored in download.data.
    if download != None and download.done:
        html = download.data
        # Parse the stuff you need from the "html" string.
        g.add_edge(urls[i], "root")
        g.layout.refresh()
        if i < len(urls)-1:
            i += 1
            download = web.url.retrieve(urls[i], asynchronous=True)
        else:
            # There is nothing left to download.
            download = None
    g.draw()
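The point is that draw() never blocks: it just polls download.done each frame and moves on, so the canvas stays responsive while the download happens in the background.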
Dear Tom. Thanks very much for your help. The web module looks really interesting. I tweaked my code to follow your suggestions, but I stumbled upon the following problems:
The webpages I would like to retrieve are written in XML. It is simpler for me to use this than to parse a non-standard HTML webpage (which could nonetheless be done, of course, by writing a new abstraction class). No dedicated XML methods seem to be provided in this otherwise excellent module.
However, I discovered that I simply cannot use the web module behind a proxy. Therefore, I cannot even test whether the web module could be tweaked for XML...
For the moment, I have found elsewhere how to thread my webpage queries, but I will keep following this thread if you reply.
Cedric
You can retrieve all sorts of stuff with web.url.retrieve(). It will simply yield the file contents as a string. If there's XML inside, you can use Python's xml.dom.minidom module to parse the string, for example.
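For example, a quick sketch - the URL and the "record" tag here are just placeholders:

from xml.dom import minidom
web = ximport("web")
# Retrieve the XML source and parse it into a DOM tree.
xml = web.url.retrieve("http://www.example.org/data.xml").data
dom = minidom.parseString(xml)
# Print the text inside every (hypothetical) "record" tag.
for node in dom.getElementsByTagName("record"):
    print node.firstChild.data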
Proxies should not be able to influence the Web library. Are you receiving an error message?
I tried web.url.retrieve() but never managed to get anything. So I tried with just a small piece of code and used web.is_url(), which answered False (all the other is_xxx() methods also answered False). I looked inside the module itself, and the little is_url() definition states that it simply checks whether it can connect. So I concluded that it was simply failing to connect. My university's proxy is a recurrent issue, so I thought that was the source of the problem.
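In essence, the test was as simple as this (the URL is just a placeholder):

web = ximport("web")
print web.is_url("http://nodebox.net")
# Behind the university proxy this prints False;
# the other is_xxx() checks answer False as well.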
Maybe I have somewhat overlooked the problem, since I already had a WebPage class using urllib2 that was able to connect.
I was also wondering if web.url.retrieve() is able to retrieve non-static pages. I guess so, of course, but just wondering. The URL I am trying to retrieve looks like:
http://www.server.org/abs/bibcode&format=short_xml
I'll give it another try today. Thanks for your help! I'll definitely continue to explore NodeBox's capabilities; it is already helping me in my research.
Cedric
Back at work, and therefore behind a restrictive proxy, I ran some tests. Here is a simple piece of code, with an example of a URL I am trying to retrieve:
web = ximport('web')
html = web.url.retrieve('http://adsabs.harvard.edu/abs/1976PASP...88..917C&data_type=SHORT_XML')
print html, html.data

When run the first time, I get the following message:
/Users/cedric/Library/Application Support/NodeBox/web/url.py:414: Warning: in web.url.URLAccumulator for http://adsabs.harvard.edu/abs/1976PASP...88..917C&data_type=SHORT_XML
For subsequent executions, I get only:
So it returns an object, but its data attribute is empty.
Hm... the comment textbox on this website is very sensitive to any less-than and greater-than signs used in HTML tags. The object I got is this, with the problematic signs removed:
web.url.URLAccumulator instance at 0x1749a288
Well, I was able to get data from the URL in your example, so no problem there... Can you try the following:
from urllib2 import urlopen
print urlopen("http://adsabs.harvard.edu/abs/1976PASP...88..917C&data_type=SHORT_XML").read()

and tell me if that yields an error?
With no proxy it works, of course. Behind the proxy, it freezes until a timeout is raised:
Traceback (most recent call last):
  File "nodebox/gui/mac/__init__.pyo", line 358, in _execScript
  File "<string>", line 2, in <module>
  File "urllib2.pyo", line 121, in urlopen
  File "urllib2.pyo", line 374, in open
  File "urllib2.pyo", line 392, in _open
  File "urllib2.pyo", line 353, in _call_chain
  File "urllib2.pyo", line 1100, in http_open
  File "urllib2.pyo", line 1075, in do_open
URLError: <urlopen error (60, 'Operation timed out')>

I'm sending you an (old) piece of code I am using to connect through a proxy. It may help.
import urllib
from xml.dom import minidom

class ADSPage(urllib.FancyURLopener):

    def __init__(self, url=None, proxy=False):
        self.url = url
        self.proxy = proxy
        self.__config()

    def __config(self):
        if self.proxy:
            proxy_map = readConnectionConfig()
            urllib.FancyURLopener.__init__(self, proxy_map)
        else:
            urllib.FancyURLopener.__init__(self)
        self.addheader('User-Agent', 'Mozilla/5.0')

    def get(self):
        self._query()
        self.xml_dom = minidom.parseString(self.content)

    def _query(self):
        f = self.open('http://%s' % (self.url))
        self.content = f.read()
        f.close()

Then you can write a standard parsing method that reads self.xml_dom. I have an external function readConnectionConfig() that reads a config file and provides the necessary proxy map:
{'http': 'http://www.myproxyserver.fr:3128'}
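A typical call then looks like this, using one of the documents above as an example (note that the class prepends http:// itself):

page = ADSPage('adsabs.harvard.edu/abs/1976PASP...88..917C&data_type=SHORT_XML', proxy=True)
page.get()
# The parsed DOM is now available for further processing.
print page.xml_dom.toxml()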
Hi Cedric,
I've added a set_proxy() command to the latest release of the web library. Could you test if that works for you?
web = ximport('web')
web.set_proxy('http://www.myproxyserver.fr:3128', type='http')
html = web.url.retrieve('http://adsabs.harvard.edu/abs/1976PASP...88..917C&data_type=SHORT_XML')
print html, html.data
Dear Tom.
Sorry for the late reply; I just came back from vacation. Yes, it works for me behind my university proxy with this simple command (adjusting the proxy address, of course). Great job. Thanks.