Kuarepoti-Dju

kuarepoti-dju - the josef multilingual blogosphere

23.1.2006

Advanced Python Hacking

Categories: — josef @ 16:06

While preparing for an exam and thinking about which kind of course to offer for students at one of the upcoming events at University, a hopelessly non-performing recursive Fibonacci calculation made me think about caching function values in a generic way for constant functions and, as an extension, eventually constant functions. It resulted in a cache module for Python.

Consider the following code snipped:


#!/usr/bin/env python
import cache
def fib(n):
if n < 3:
tmp = 1
else:
tmp = fib(n - 1) + fib(n - 2)
return tmp

Now, calling fib(35) takes many minutes to complete. Already fib(20) takes about half a minute and more than 150049 function calls!
(The absolute time is worsened a lot by the benchmark redirections - but still the number of calls should indicate that the recursive calculation is not a good idea.)

Now, let’s enable the cache:


cache.cache(fib, globals())

That’s all! fib(35) finishes in less than a second, needing only 47 invocations, of which 22 yield cache hits and 25 yield cache misses.
Using a backing file means that cache values can be reused from previous runs - which is a nice feature for long running programs so crashes won’t harm.


cache.backingfile("__cache.file")

Now a single iteration is sufficient for calculating fib(35)!

Another advanced usage is that of an eventually constant function. One of this kind is a download function, which lets you use the cache module as a WWW proxy:


#!/usr/bin/env python
import cache
import random
# Simulate an error-prone file download
def download(file):
x = random.randint(0, 1)
return (None, “xyz")[x]
cache.cacheretry(download, globals(), None)
cache.verbose(1)
print “– Download:”
print download("x")
print download("x")
print download("x")
cache.stats()

With the cacheretry statement, the eventual constness is made known to the cache module.
Of course, several different Python methods can be cached at once.

Download the file cache.py to use it right away. It comes with quite a bit of documentation (at least compared to my usual habit of not documenting source code at all).

16.1.2006

Frischfleisch

Categories: — josef @ 16:38

Wer ein aufmerksamer freshmeat.net-Leser ist (geht das heutzutage überhaupt noch?), der hat sicherlich die neuesten Tarbälle gesehen, die ich in den letzten Tagen unter’s Volk gestreut habe:

Wichtige Hacks kurz vor der Prüfungsperiode. Enjoy…

5.1.2006

RR&S: DNS, /etc/hosts and you

Categories: — josef @ 16:07

Resolving host names, that is, finding a connection-level address from their human-readable representation, is a standard task in client applications which need to connect to a server. This is quite a complex task: Information about such mappings might be spread, and new additions like IPv6 or internationalized domain names (IDN) are appearing once in a while, although admittedly not often.

These properties make it desirable to use a ready-made library to perform hostname lookups. But of course, there’s plenty of choice, so which possibilities exist out there? Let’s look at the most common ones: System library functions (libc, resolv, anl), adns, firedns and Qt.

resolv:
A fairly simple interface to DNS servers is provided by the resolver function res_query(), which has its root in the BIND sources and shipped with BSD 4.3 initially. The usage takes a name, constants for the class (always ns_c_in for Internet resources) and type (always ns_t_a for hostname lookups), as well as a string buffer provided by the application. For IPv6 however, ns_t_aaaa would have to be used additionally.
Before using the call, res_init() should be called, although this is not a requirement. The call to res_query() might block the application, which is a big drawback. Another issue for the real-world usage is that this library really only supports DNS queries, therefore not taking the contents of /etc/hosts into account, which severely limits its usability for network-less development machines, for instance.
Note that the glibc version is modified a bit so that no global state is kept within the library, thus enabling resolv to be used with threads. Note also that the returned IP address is really a string, which might or might not have to be converted to a network address structure using inet_pton(), which again has the drawback of not being IPv6-transparent, albeit it can support this address family.

libc:
Probably the most well-known “classic” function is gethostbyname(). It appeared in BSD 4.3 and has also been a POSIX standard function. The call is synchronous, and supports looking up IPv6 addresses. It queries the /etc/hosts file, DNS and other name servers like NIS, depending on the global configuration in /etc/host.conf. This function is reasonably easy to use (except for a few OS incompatibilities which might occur), but is not acceptable for use with interactive programs since it will block the execution thread. To overcome this limitation, a second thread could be created manually, but gethostbyname() is not threadsafe since it uses a library-internal static buffer, so gethostbyname_r had to be introduced to let the application programmer specify a buffer.

Note that IPv6 support was not present initially, for which the substitute function getipnodebyname() has been introduced, according to RFC 2553. This function was obsoleted soon, both by adding address family support to a variant called gethostbyname2(), which is glibc-only, and other functions.

Finally, gethostbyname() has been superseeded by getaddrinfo(), or GAI for short. This function does a little more than resolving hostnames, for instance it can resolve service names as well. Otherwise, it is similar to the gethostbyname2_r() call with IPv6 support and application-provided threadsafe buffers, but also including the undesirable blocking behaviour. This function has also been named in RFC 2553.

anl:
This is a not-yet-standardised extension to GAI, and provides a function called getaddrinfo_a() and several other support functions. As the name suggests, it supports asynchronous hostname lookups, and is present so far in recent versions of the glibc library, where it relies on the AIO layer. The features are similar to getaddrinfo(). Of note is that the notification might happen in two flavours, either by thread creation or by being signalled. Both are not really optimal for all kinds of applications, although it’s not terribly hard to handle the notifications. These are the drawback costs one pays for letting the system library do the polling for the results.

adns:
One of the oldest resolver libraries which natively support asynchronous operation, in addition to synchronous mode. The adns library is initialized with a global state variable, while for each query or status check a structure is initialized by allocating memory. The main functions to use are adns_init(), adns_submit(), adns_processany() and adns_check(). A few things are noteworthy. First, it embeds well with applications since the internally used file descriptor is made available, which means notification can happen within the regular execution thread, for instance using select(). Second, a lot of information is provided, in many cases in form of a struct or union so that the relevant parts can be accessed easily, like the resolved hostname as a string and as a network address structure, depending on the further processing needs.
One drawback is shared with the resolv library, however: The file /etc/hosts is not used, an accessible DNS server is required for the library to work. Another one is that IPv6 is not supported yet.

firedns:
Looking at the big picture, firedns is similar to the adns library, although there are a few differences. For example, IPv6 is supported, although no transparent support is given - similar to the resolv library, both IPv4 and IPv6 queries have to be done separately. Integration happens similarly using a file descriptor which must be polled. A drawback shared with adns is that only real DNS queries are performed.

Qt:
A special contender in this comparison is the Qt library, which is a library for creating C++ applications, providing both GUI and non-GUI classes and a fairly complete platform-independent programming platform. In Qt3, the QDns class was available, which took the hostname and optionally the wanted resource type, and then processed the query asynchronously until a result was available. Then, a Qt signal was sent, which basically means calling back a C++ method (or Slot in qt speak), in which a QValueList contained entries of type QHostAddress, which in turn could be passed further along to Qt networking functions or reveal the IP address in string format. IPv6 addresses and the /etc/hosts file are supported by Qt. In Qt4, QDns was renamed to QHostInfo, which provides a simplyfied interface, abstracting away from DNS terminology such as resource types. Also, support for IDN (via Punycode) was introduced.

Conclusion:
It is difficult to really find the “best of breed” when all methods of resolving have their drawbacks. For Qt-based applications, using the Qt resolver is the most compelling and most sensible way of getting the work done. For all others, especially graphical clients to network services, it gets really difficult.

Let’s do a simple score table:

libc/resolv: hosts -1, ipv6 +/-0, idn -1, nonblocking -1, score = -3
libc/gethostbyname: hosts +1, ipv6 +1, idn -1, nonblocking -1, score = 0
libc/getaddrinfo: hosts +1, ipv6 +1, idn +1, nonblocking -1, score = 2
libc/anl: hosts +1, ipv6 +1, idn +1, nonblocking +1, score = 4
adns: hosts -1, ipv6 -1, idn -1, nonblocking +1, score = -2
firedns: hosts -1, ipv6 +/-0, idn -1, nonblocking +1, score = -1
Qt: hosts +1, ipv6 +1, idn +1, nonblocking +1, score = 4

There is a clear distinction between the favorites and the others, but depending on the usage, this might not tell much - for instance, C programmers will not be able to use Qt!

For this reason, the GGZ library for networking and other programming aids, libggz, has taken a twofold approach. It provides a function named ggz_resolvename() and a function to set a result callback. If libggz has been configured with support for the anl library (providing gethostyname_a()), this callback is called whenever the results are available, or otherwise directly after a blocking call, transparently to the programmer.
Since libggz has a convenience method for creating sockets, called ggz_make_socket(), the developer is even relieved from resolving the socket first by setting up a callback for ggz_make_socket(), which if activated already holds the file descriptor, ready to be used on either IPv4 or IPv6 connections. The only remaining drawback would be missing IDN support, although even this is covered for all getaddrinfo() calls with some IDN flags when using recent glibc versions (>= 2.3.4) which are compiled with IDN support.

Otherwise this is really the recommended way to go for application developers: No blocking calls (when configured correctly), inclusion of /etc/hosts, IPv6 support and convenient API.

2.1.2006

RR&S: Temporary files and transactional file handling

Categories: — josef @ 19:44

Temporary file handling on POSIX-ish systems is a mine field - there are a lot
of standard (yet harmful) functions available, and close to none standard (and
harmless) ones, with mkstemp() and tmpfile() probably being the exceptions.

But does harmlessness also bring usefulness? Let’s look at what a current
modus operandi of creating temporary files is. A file shall be created,
no matter what its name is, but the resulting name should be known for
continued processing - maybe, because it is registered somewhere.

The tmpfile() function only exists for anonymous temporary files, thus it
is out of question here.
The current method for solving the would thus be:
- using mkstemp(), providing a template
- calling fdopen() on the file descriptor
- registering the modified template as file name
Alternatively, using mktemp() and open() with O_EXCL would work, too.
Note that the recommendation against mktemp() due to security issues is not
valid when using O_EXCL (unless on NFS), but it might still involve an
inconvenient race.

To refine the only remaining alternative, the following library function
is proposed.


FILE *f = tmpatomic(const char *directory, char *template, int templatesize, int backingfile);

The parameters are as follows:

- directory: Directory where the final file shall be created.
- template: Modifyable string buffer which will contain the filename. If templatesize
is set to zero, template must contain a certain C format string.
- templatesize: Template buffer size in bytes, or zero for format-string.
One byte needs to be reserved for the string terminator in the former case.
- backingfile: If set to 1, the file is really created as a temporary file first,
and only renamed to its final filename after calling fclose().

This sounds a bit over-engineered, but it will actually help with the most common
use cases.
Design considerations are:
- With the help of the backingfile switch, atomicity for file creation is achieved,
without requiring the programmer to handle the rename() call.
- The flexible template handling guarantees that tempfile creation will not fail
until the maximum possible file name length has been reached, and in the case
of providing a format string (like “tmp-%s-%i") more control can be executed
over which name and random characters will be used.
Additionally, one can exclude the randomness entirely in case only transactional
file creation is wanted.
- The implementation can hide the O_EXCL problem and NFS special-casing as follows:
Both the backing file and the temp file are “safely” opened, that is, using O_EXCL
in the normal case and using symbolic link refcounting on NFS.
Since ownership of both is then guaranteed exclusively, the renaming can be
performed without interception. Of special convenience would be that renaming
could take place across partition boundaries, as to hide file system-specific
details about mount points.
Note that without any advanced file system support, the final file can only
be opened upon the fclose() call, which might then fail with EEXIST.
In this case, a facility to make the name of the backing file available would
be needed, or to tell the truth, the whole backing file concept would not
work and two distinct functions for temp file creation and renaming would
be needed again. This does not invalidate the other advantages though.
Alternatively, if the application can live with a zero-byte final file
in the case of a crash, this issue could be considered solved, too.

In summary, here are two useful applications:

tmpatomic("/tmp", “XXX", 3, 0);

- A classic temporary file

tmpatomic("/var/lib/myapp", “game-%i", 0, 1);

- A file which is created temporarily and then moved atomically to its destination

tmpatomic("/home/user", “some.conf", 0, 1);

- Deterministic transactional file creation

It is not easy to define what the root of the evil is which makes such complex
designs necessary. The inability of some file systems to “reserve” names until
the files are either created or a cleanup run is done (after a crash) is certainly
part of the issue.
Likewise, one should think out of the box to let the file naming be part of flose()
instead of fopen(): No matter where the file is created first, it should be stored
safely afterwards. This resembles the backing file principle, and from current
practice it is known, that in the case of name conflicts a routine to select
an alternative name would have to be called anyway - which could simply be another
random name in this case, but requires to give a (modifyable) buffer to a function
like tmpatomicclose().

Rants, Ramblings & Solutions

Categories: — josef @ 19:42

This is a new article series devoted to the brain-dead-ness of POSIX interfaces, Unix legacy constructs and the like. I simply felt like writing about those issues would help the cause more (i.e. cause people to think about it) than hurt it (i.e. discredit me). That being said, everything is mostly written during the night, when I’m really sleepy at times and without internet access for further validation, and does not inhibit paper publication quality.

This is a permanent entry; below the individual articles will be linked in the future. Note that I’m going to abbreviate the series title with RR&S.

Powered by WordPress