Channel: Woboq

Introduction to Lock-free Programming with C++ and Qt

This blog post is an introduction to lock-free programming. I'm writing it because it is the prerequisite to understanding my next post. This was also the content of my presentation at Qt Developer Days 2011.

Lock-free programming is the design of algorithms and data structures that do not acquire locks or mutexes.

When different threads in your program need to access the same data, you must ensure that the data is always in a coherent state when used. One way to achieve that is locking. A thread acquires a mutex before writing the data. That thread may touch the data structure and leave it in an inconsistent state while it holds the mutex, but this is not a problem because other threads cannot access the data at that time: they block waiting for the mutex to be released. While a thread is waiting, the OS schedules another thread or process, or lets the CPU core rest.

What is wrong with Mutexes?

Mutexes are perfectly fine. But you have a problem if there is lock contention. If you want your algorithm to be fast, you want to use the available cores as much as possible instead of letting them sleep. A thread can hold a mutex and be de-scheduled by the OS (for example because of a page fault or because its time slice is over). Then all the threads that want to acquire this mutex are blocked. And if you have a lot of blocking, the OS also needs to do more context switches, which are expensive because they clear the caches.

Other problems may arise in real-time programming (priority inversion, convoying, ...). Mutexes also cannot be used in signal handlers.

Another example: suppose you want to split your program into different processes so that the whole application does not crash if one process crashes. (This is what modern browsers do, by running the rendering of the page in a separate process.) But if a process crashes while holding a shared lock, you are in big trouble, as this will most likely cause a deadlock in the main application.

So how can we do it without locking?

Modern CPUs have something called atomic operations. There are libraries that have APIs that let you use those atomic operations. Qt has two classes: QAtomicInt and QAtomicPointer. Other libraries or languages might have different primitives, but the principles are the same.

QAtomic API

I won't go into the details of the API here, as you can read the documentation of QAtomicInt and QAtomicPointer.

But here are the highlights: Both classes have a similar API. They wrap an int or a pointer and allow you to perform atomic operations on it. There are 3 main operations: fetchAndAdd, fetchAndStore, and testAndSet. Each is available in 4 variants, one for each memory ordering.

The one used here is testAndSet. It is also called Compare and Swap in the literature.
Here is a non-atomic implementation to illustrate its semantics:

bool QAtomicInt::testAndSet(int expectedValue, int newValue) {
    if (m_value != expectedValue)
        return false;
    m_value = newValue;
    return true;
}

What it does: it changes the wrapped value only if the current value is the expected value; otherwise it does not touch it and returns false.

It is atomic, meaning that if two threads operate on the value at the same time, it stays consistent.

Of course, it is not really implemented like this, as that would not be atomic. It is implemented using assembly instructions. Qt's atomic classes are one of the very few places inside Qt implemented with assembly on each platform.

Memory Ordering

Today's CPUs have what is called out-of-order execution. At each clock cycle, the CPU may fetch several instructions (say 6) from memory, decode them, and store them in a pool of instructions. The circuitry in the CPU computes the dependencies between the instructions and feeds the execution units with the instructions in the best possible order, making the most efficient use of the CPU. So the instructions, and in particular the loads and stores, may be executed in an order different from the one in the original program. The CPU is allowed to do that as long as it gives the same result on that thread.

However, we want to make sure that the ordering is preserved when we play with atomic operations. Indeed, if the memory is written in a different order, the data structure may be in an inconsistent state when other threads read the memory.

Here is an example:

  QAtomicPointer<int> p;
  int x;
  //...
  x = 4;
  p.fetchAndStoreRelease(&x);

It is important that when another thread sees p pointing to &x, the value of x is already 4. Otherwise, that thread could follow the pointer and read a stale value of x.

This is done by adding the proper memory fence to the program. Memory fences are special instructions that tell the CPU not to reorder across them. We have 4 kinds of orderings:

Acquire
No reads or writes that appear after the atomic operation will be moved before it.
Release
The opposite of Acquire: no reads or writes that appear before the atomic operation will be moved after it.
Ordered
A mix of the two previous orderings: nothing can be moved before or after. This is the safest, and the one to use if you don't know which one to pick.
Relaxed
No memory fences are added.

The ordering hints are part of the function names because on some architectures a single assembly instruction performs both the atomic operation and the fence.

The fences only keep the CPU from reordering. They say nothing about the compiler, which might also reorder everything. Qt makes sure the compiler does not reorder by using 'volatile' accesses.

Lock-free Stack

We will here design a stack that works without locking:

struct Node { Node *next; /* ... payload ... */ };

class Stack {
    QAtomicPointer<Node> head;
public:
    Stack() : head(0) {}
    void push(Node *n) {
        do {
            n->next = head;  // point the new node at the current head
        } while (!head.testAndSetOrdered(n->next, n));  // retry if head changed
    }
    Node *pop() {
        Node *n;
        do {
            n = head;
            if (!n)
                return 0;    // the stack is empty
        } while (!head.testAndSetOrdered(n, n->next));  // retry if head changed
        return n;
    }
};

I'll use drawings to show how the code works:

A Stack

It is basically implemented as a linked list: each node has a pointer to the next node, and we have a pointer to the first node called head.

Push

Push 1

In this example, two threads want to push a node onto the stack. Both threads have already executed the line n->next = head and will soon execute the atomic operation that will change head from the former head (B) to n (C or D).

Push 2

In this image, we see that Thread 2 was first, and D is now on the stack.

Push 3

The testAndSet in Thread 1 will fail: head is not B anymore. head is not changed, meaning that node D is still on the stack.

Thread 1 is notified that the testAndSet has failed and will then retry with the new head, which is now D.

Push 4

Benchmark

So you can try yourself with this little example: lockfreestack.cpp
(Download the file in a new directory and do qmake -project && qmake && make)

This program first pushes 2 million nodes onto the stack using 4 threads, measuring the time it takes. Once all the threads have finished pushing, it pops all those nodes using 4 threads and measures how long that takes.

The program also contains a version of the stack that uses QMutex (in the #if 0 block).

Results: (on my 4 core machine)

              Push (ms)   Pop (ms)   Total (real / user / sys) (ms)
With QMutex   3592        3570       7287 / 4180 / 11649
Lock-free     185         237        420 / 547 / 297

Not bad: the lock-free stack is more than 15 times faster. As you can see, there is much less contention (the real time is smaller than the user time) in the lock-free case, while with the mutex, a lot of time is spent blocking.

The ABA problem

OK, there is actually a big bug in our implementation. It works well in the benchmark because we push all the nodes and then pop them all. No thread pushes and pops nodes at the same time. In a real application, we might want a stack that works even if threads are pushing and popping nodes concurrently.

But what is the problem exactly? Again, I'll use some images to show:

ABA Problem 1

In this example, Thread 1 wants to pop a node. It takes the address of A and will do a testAndSet to change head atomically from A to B. But it is de-scheduled by the OS just before the atomic operation, while another thread executes.

ABA Problem 2

While Thread 1 is sleeping, Thread 2 also pops a node, so A is not on the stack anymore.

If Thread 1 woke up now, its atomic operation would fail because head is no longer equal to A. But Thread 1 does not wake up, and Thread 2 continues...

ABA Problem 3

Thread 2 has pushed a node (C). Again, if Thread 1 woke up now, there would not be any problem, but it still does not wake up.

ABA Problem 4

And Thread 2 pushes A back onto the stack.

ABA Problem 5

But now Thread 1 wakes up and executes the testAndSet, which succeeds because head is A again. This is a problem because C is now leaked.

It could have been even worse if Thread 2 had popped the node B.

Solutions to the ABA Problem

Every problem has solutions. Showing them in detail is outside the scope of this article; I will just give some hints to orient your research on the web:

  • Adding a serial number to the pointer, incremented each time a node is popped. It can be stored in the least significant bits of the pointer (given that it points to an aligned address), but that might not be enough bits. Instead of using pointers, one can use indexes into an array, leaving more bits for the serial number.
  • Hazard pointers: each thread publishes the pointer it is reading in a list readable by all threads. Those lists are then checked before reusing a node.
  • Double- or multiple-word compare-and-swap (which can be built from single-word compare-and-swap).

Conclusions

As you can see, developing lock-free algorithms requires much more thinking than writing blocking algorithms. So keep your mutexes, unless you have a lot of lock contention or are looking for a challenge.

A question I often get is whether it would not be better to lock, in order to let the other threads work, instead of entering what might appear to be a spin lock. But in reality it is not a spin lock: the atomic operations succeed much more often than they fail, and they only fail if another thread made progress.

If you need help in your Qt applications regarding threads and locking, maybe Woboq can help you.


Internals of QMutex in Qt 5

You may want to read this blog if you want to understand the internals of QMutex or if you are interested in lock-free algorithms and want to discover one. You do not need to read or understand this article if you just want to use QMutex in your application. I found implementing QMutex interesting and therefore I want to share the knowledge.

I am going to show and explain a simplified version of the real code. If you want to see the real code, you can just browse the Qt source code. This article will help you to understand the concepts.

In order to better understand this article, you need to know some basics of lock-free programming. If you understand what the ABA problem is and know how to make a lock-free stack, you are probably fine. Otherwise I recommend reading my previous post as an introduction.

The contents of this blog entry are taken from the last part of my presentation at Qt Developer Days 2011.

QMutex

Before understanding how something works, one needs to understand what it does first. Here is the basic interface we are going to analyze.

class QMutex {
public:
    void lock();
    void unlock();
    bool tryLock();
private:
    //...
};

What it does is unsurprising. If you don't guess what those functions do, you can read it in the QMutex documentation.

Motivation

We want QMutex to be as efficient as possible. QMutex in Qt 4.8 is already quite fast because it is already using atomic operations. So don't expect differences in speed between Qt 4.8 and Qt 5. But what we improved is more from a memory perspective. We want QMutex to use as little memory as possible, so you can put more QMutex in your objects to have more lock granularity.

sizeof(QMutex) == sizeof(void*).

That condition was already true in Qt 4. But Qt 4 has lots of hidden costs because it is using pimpl, as almost every class in Qt does. Constructing a QMutex in Qt 4 allocates and initializes a QMutexPrivate which itself initializes some platform specific primitives. So we have a memory overhead of over 120 bytes per mutex and also the cost of initialization/destruction. In Qt 5, there are no more hidden costs per mutex.

Another good point is that it is now a POD. This is good because QMutex is often used to lock global resources, for which a global mutex seems the obvious choice. But, in Qt, we want to avoid global objects. Indeed, global objects should be avoided, especially in library code, because their order of initialization is unspecified and they slow down start-up even if that part of the library is not used. But as PODs do not need to be dynamically initialized, they can safely be used as global objects.

QMutex singletonMutex; // static global object !!!  (OK only for POD)
MySingleton *MySingleton::self() {
    QMutexLocker locker(&singletonMutex);
    if (!m_self)
        m_self = new MySingleton;
    return m_self;
}

That code is dangerous in Qt 4, because another global object in another compilation unit might use the singleton, and the mutex could then be used before it is constructed.

Summary of the QMutex changes from Qt 4.8 to Qt 5

  • QMutex uses much less memory.
  • The construction and destruction is faster.
  • It is now suitable as a global object.
  • The cost of locking or unlocking should not have changed.

Overview

class QMutex {
public:
    // API ...
private:
    QAtomicPointer<QMutexPrivate> d_ptr;
};

d_ptr is the only data member. The trick is that it is not always a pointer to an actual QMutexPrivate. It may hold a magic value:

Value of d_ptr   Meaning
0x0              The mutex is unlocked
0x1              The mutex is locked and no other threads are waiting
Other address    An actual pointer to a QMutexPrivate

You should be familiar with a pointer having 0 (or 0x0) as a value (also called NULL). 0x1 is a special value that represents an uncontended locked mutex.

So the principle of a lock is to try to atomically change the value of d_ptr from 0x0 to 0x1. If this succeeds, we have the lock; otherwise, we need to wait.

bool QMutex::tryLock() {
    return d_ptr.testAndSetAcquire(0x0, 0x1);
}

void QMutex::lock() {
    while(!d_ptr.testAndSetAcquire(0x0, 0x1)) {
        // change the value of d_ptr to a QMutexPrivate and call wait on it.
        // ... see below
    }
}

The function bool QAtomicPointer::testAndSetAcquire(expected, newValue) changes the value of the pointer to newValue only if the previous value is expected, and returns true on success. It does this atomically: if another thread changes the value behind its back, the operation fails and returns false.

Unlock

Let us start with the code:

void QMutex::unlock() {
    Q_ASSERT(d_ptr); // cannot be 0x0 because the mutex is locked.

    QMutexPrivate *d = d_ptr.fetchAndStoreRelease(0x0);
    if (d == 0x1)
        return; // no threads are waiting for the lock

    d->wake(); // wake up all threads
    d->deref();
}

The function fetchAndStoreRelease atomically exchanges the value of d_ptr with 0x0 and returns the previous value. This exchange is atomic: we are guaranteed to get the old value, and no thread can have put a different value in between.

The returned value cannot be 0, because our mutex was locked (we have the lock, since we are unlocking it). If it was 0x1, no threads were waiting: we are finished. Otherwise, we have a pointer to QMutexPrivate and threads are waiting, so we need to wake all those threads up. They will then shortly try to lock the mutex again. Waking all the threads while only one can acquire the mutex is a waste of CPU. The real code inside Qt only wakes up one thread.

You will also notice a deref(). This will de-reference the QMutexPrivate and release it if needed.

Memory Management

In lock-free programming, memory management is always a difficult topic. We use reference counting to make sure the QMutexPrivate is not released while a thread is trying to use it.

It is important to see why a QMutexPrivate can never be deleted. Consider this code:

d_ptr->ref();

This is not an atomic operation. The ref() itself is atomic, but there is a step before: the code first reads the value of d_ptr and puts the memory address in a register. Another thread may unlock the mutex and release the QMutexPrivate before the call to ref(). We would end up accessing freed memory.

QMutexPrivate objects are never deleted. Instead, they are put in a pool to be reused (a lock-free stack). We expect the total number of QMutexPrivate objects to be rather small: a QMutexPrivate is only needed while a thread is waiting, and there cannot be more waiting threads than threads in the application.

Here is how the QMutexPrivate looks:

class QMutexPrivate {
    QAtomicInt refCount;
public:
    bool ref() {
        int c;
        do {
            c = refCount;
            if (c == 0)       // do not reference a QMutexPrivate that
                return false; //   has already been released.
        } while (!refCount.testAndSetAcquire(c, c + 1));
        return true;
    }
    void deref() {
        if (!refCount.deref())
            release();
    }

    /* Release this QMutexPrivate by pushing it on the
       global lock-free stack */
    void release();

    /* Pop a mutex private from the global lock-free stack
       (or allocate a new one if the stack was empty)
       And make sure it is properly initialized
       The refCount is initially 1
      */
    static QMutexPrivate *allocate();

    /* Block until wake is called.
       If wake was already called, returns immediately */
    void wait();

    /* Wake all the other waiting threads. All further calls to wait()
       will return immediately */
    void wake();

    // ... platform specifics
};

Lock

Now, we have got enough information to look at the implementation of the locking.

void QMutex::lock() {
    // Try to atomically change d_ptr from 0 to 1
    while (!d_ptr.testAndSetAcquire(0x0, 0x1)) {
        // If it did not succeed, we did not acquire the lock
        // so we need to wait.

        // We make a local copy of d_ptr to operate on a
        // pointer that does not change behind our back.
        QMutexPrivate *copy = d_ptr;

        // It is possible that d_ptr was changed to 0 before we did the copy.
        if (!copy)     // In that case, the mutex is unlocked,
            continue;  // so we can try to lock it again

        if (copy == 0x1) {
            // There is no QMutexPrivate yet, we need to allocate one
            copy = QMutexPrivate::allocate();
            if (!d_ptr.testAndSetOrdered(0x1, copy)) {
                // d_ptr was not 0x1 anymore, either the mutex was unlocked
                // or another thread has put a QMutexPrivate.
                // either way, release the now useless allocated QMutexPrivate,
                // and try again.
                copy->deref();
                continue;
            }
        }

        // Try to reference the QMutexPrivate
        if (!copy->ref())  // but if it fails because it was already released,
            continue;      // the mutex has been unlocked, so try again

        // Now, it is possible that the QMutexPrivate had been released,
        // but re-used again in another mutex. Hence, we need to check that
        // we are really holding a reference to the right mutex.
        if (d_ptr != copy) {
            copy->deref();
            continue;
        }

        // From this point we know that we have the right QMutexPrivate and that
        // it won't be released or change behind our back.
        // So we can wait.
        copy->wait();

        // The mutex has been unlocked, we can release the QMutexPrivate
        copy->deref();
        // and retry to lock.
    }
}

I hope the comments were self-explanatory.

To the Real Code

The simplified code shown here has some limitations that have been solved in the real implementation inside Qt:

  • QMutex has a recursive mode with the same API. (The Qt 5 implementation supports it, but recursive mutexes are more expensive than non-recursive ones.)
  • This simplification wakes all the threads waiting on a mutex instead of just one. That is not optimal, since only one of them is going to acquire the lock.
  • It is not fair: a thread waiting for a long time might never acquire the mutex while two other threads keep exchanging the mutex between each other.
  • There is also QMutex::tryLock(int timeout), which blocks only for a limited amount of time if the mutex cannot be acquired.

This code is also not used on Linux. On Linux, QMutex uses futexes, with the d_ptr serving as the futex word.

If you are interested, you can look at the actual code: the files qmutex.cpp, qmutex.h and qmutex_p.h, available on Gitorious.

On Reviewing a Patch in Qt

While I was working on Qt at Nokia I spent a lot of time reviewing patches from colleagues.

Peer reviewing is an important step, and there is a reason why we do it in Qt.
Code in libraries (such as Qt) requires much greater care than application code.
The main reason is that you need to maintain compatibility for your users. You want to avoid regressions in your library, so that code that works continues to work, even if the user relied on some hack or trick. For the user of the library, being able to upgrade without too much effort has great value. A subtle behaviour difference in the library might introduce a bug in the application that is really hard to track down in the application's millions of lines of old code.

Reviewing changes starts by looking for obvious mistakes and incompatibilities with the project guidelines:

  • Is the code style correct?
  • Is the code (auto)tested?
  • Are the new APIs properly documented?
  • Does the code follow the project policy? (Thread safety, no static global objects, no exported symbols without prefix, No use of features not supported by a supported compiler, ...)
But all of this is the least important part of reviewing, because the submitter is already supposed to know those policies, which should be documented for any software project.

The hard part of the review is seeing the things that the author of the patch could not possibly know.
It takes someone already familiar with the code to do the review: someone who knows how the code interacts with the other parts, someone who knows the rationale behind some parts of the code.
Even if you are familiar with the code, you almost always need to open the IDE and browse the existing code and the code it interacts with. Reviewing by just looking at the patch may be possible for a trivial compilation fix, but in almost all cases of non-trivial code, you will need to open the relevant files in your IDE.

There are so many things to check and to be aware of: code calling virtual functions, code used for use cases other than the one in the bug report, ...
You also need to think hard about what the patch might break and what the behaviour changes are.

As an example, let us take someone who makes a change in QComboBox to fix a usability glitch on Windows. Someone not familiar with QComboBox might not know that it behaves very differently depending on the style. The reviewer is the one likely to spot that the change would "break" another style.

Another example is someone who identifies that a bug is caused by a wrong condition in an if clause. Changing it seems to fix the problem. But why was that condition there? If the code is not commented and the history shows it is very old, no relevant information can be deduced. The reviewer should be the one who knows that this condition tests a very specific use case that might or might not still be required.

Inside Nokia we used to have an internal Pastebin website for putting up patches. Nowadays the Qt Project is using a much more advanced process for reviewing.

Signals and Slots in Qt5

Qt5 alpha has been released. One of the features I have been working on is a new syntax for signals and slots. This blog entry will present it.

Previous syntax

Here is how you would connect a signal to a slot:

connect(sender, SIGNAL(valueChanged(QString,QString)),
        receiver, SLOT(updateValue(QString)) );

What really happens behind the scenes is that the SIGNAL and SLOT macros will convert their argument to a string. Then QObject::connect() will compare those strings with the introspection data collected by the moc tool.

What's the problem with this syntax?

While working fine in general, we can identify some issues:

  • No compile time check: All the checks are done at run-time by parsing the strings. That means if you do a typo in the name of the signal or the slot, it will compile but the connection will not be made, and you will only notice a warning in the standard output.
  • Since it operates on strings, the type names in the slot must exactly match those of the signal, and they also need to be the same in the header and in the connect statement. This means it won't work nicely if you want to use typedefs or namespaces.

New syntax: using function pointers

In the upcoming Qt5, an alternative syntax exists. The former syntax will still work, but you can now also use this new way of connecting your signals to your slots:

connect(sender, &Sender::valueChanged,
        receiver, &Receiver::updateValue );

Which one is the more beautiful is a matter of taste. One can quickly get used to the new syntax.

So apart from the aesthetic point of view, let us go over some of the things that it brings us:

Compile-time checking

You will get a compiler error if you misspelled the signal or slot name, or if the arguments of your slot do not match those from the signal.
This might save you some time while you are doing some re-factoring and change the name or arguments of signals or slots.

An effort has been made, using static_assert, to produce nice compile errors if the arguments do not match or if you forgot Q_OBJECT.

Automatic type conversion of arguments

Not only can you now use typedefs or namespaces properly, but you can also connect signals to slots that take arguments of different types, provided an implicit conversion is possible.

In the following example, we connect a signal that has a QString as a parameter to a slot that takes a QVariant. It works because QVariant has an implicit constructor that takes a QString.

class Test : public QObject
{
    Q_OBJECT
public:
    Test() {
        connect(this, &Test::someSignal, this, &Test::someSlot);
    }
signals:
    void someSignal(const QString &);
public:
    void someSlot(const QVariant &);
};

Connecting to any function

As you might have seen in the previous example, the slot was declared as plain public, not as a slot. Qt will call the function pointer of the slot directly, and no longer needs moc introspection for it. (It still needs it for the signal.)

But we can also connect to any function or functor:

static void someFunction() {
    qDebug() << "pressed";
}
// ... somewhere else
    QObject::connect(button, &QPushButton::clicked, someFunction);

This can become very powerful when you associate that with boost or tr1::bind.

C++11 lambda expressions

Everything documented here works with plain old C++98. But if you use a compiler that supports C++11, I really recommend you use some of the language's new features. Lambda expressions are supported by at least MSVC 2010, GCC 4.5, and Clang 3.1. For the last two, you need to pass -std=c++0x as a flag.

You can then write code like:

void MyWindow::saveDocumentAs() {
    QFileDialog *dlg = new QFileDialog();
    dlg->open();
    QObject::connect(dlg, &QDialog::finished, [=](int result) {
        if (result) {
            QFile file(dlg->selectedFiles().first());
            // ... save document here ...
        }
        dlg->deleteLater();
    });
}

This allows you to write asynchronous code very easily.

So what now?

It is time to try it out. Check out the alpha and start playing. Don't hesitate to report bugs.

QStringLiteral explained

QStringLiteral is a new macro introduced in Qt 5 to create a QString from a string literal. (String literals are the strings inside double quotes, written directly in the source code.) In this blog post, I explain its inner workings and implementation.

Summary

Let me start by giving a guideline on when to use it: If you want to initialize a QString from a string literal in Qt5, you should use:

  • In most cases: QStringLiteral("foo") if the literal will actually be converted to a QString
  • QLatin1String("foo") if it is used with a function that has an overload for QLatin1String (such as operator==, operator+, startsWith, replace, ...)

I have put this summary at the beginning for those who don't want to read the technical details that follow.

Read on to understand how QStringLiteral works.

Reminder on how QString works

QString, like many classes in Qt, is an implicitly shared class. Its only member is a pointer to the 'private' data. The QStringData is allocated with malloc, with enough room after it to hold the actual string data in the same memory block.

// Simplified for the purpose of this blog
struct QStringData {
  QtPrivate::RefCount ref; // wrapper around a QAtomicInt
  int size; // size of the string
  uint alloc : 31; // amount of memory reserved after this string data
  uint capacityReserved : 1; // internal detail used for reserve()

  qptrdiff offset; // offset to the data (usually sizeof(QStringData))

  inline ushort *data()
  { return reinterpret_cast<ushort *>(reinterpret_cast<char *>(this) + offset); }
};

// ...

class QString {
  QStringData *d;
public:
  // ... public API ...
};

The offset locates the data relative to the QStringData. In Qt 4, it used to be an actual pointer. We'll see why this was changed.

The actual data in the string is stored in UTF-16, which uses 2 bytes per character.

Literals and Conversion

String literals are the strings that appear directly in the source code, between quotes.
Here are some examples (suppose action, string, and filename are QString):

    o->setObjectName("MyObject");
    if (action == "rename")
        string.replace("%FileName%", filename);

In the first line, we call the function QObject::setObjectName(const QString&). There is an implicit conversion from const char* to QString, via its constructor. A new QStringData is allocated with enough room to hold "MyObject", and then the string is copied and converted from UTF-8 to UTF-16.

The same happens in the last line where the function QString::replace(const QString &, const QString &) is called. A new QStringData is allocated for "%FileName%".

Is there a way to prevent the allocation of QStringData and copy of the string?

Yes, one solution to avoid the costly creation of a temporary QString object is to have overloads of common functions that take a const char* parameter.
So we have these overloads for operator==:

bool operator==(const QString &, const QString &);
bool operator==(const QString &, const char *);
bool operator==(const char *, const QString &);

The overloads do not need to create a new QString object for our literal and can operate directly on the raw char*.

Encoding and QLatin1String

In Qt 5, the default encoding assumed for char* strings was changed to UTF-8. But many algorithms are much slower with UTF-8 than with plain ASCII or Latin-1.

Hence you can use QLatin1String, a thin wrapper around char* that specifies the encoding. There are overloads taking QLatin1String for functions that can operate on the raw Latin-1 data directly, without conversion.

So our first example now looks like:

    o->setObjectName(QLatin1String("MyObject"));
    if (action == QLatin1String("rename"))
        string.replace(QLatin1String("%FileName%"), filename);

The good news is that QString::replace and operator== have overloads for QLatin1String. So that is much faster now.

In the call to setObjectName, we avoided the conversion from UTF-8, but we still have an (implicit) conversion from QLatin1String to QString which has to allocate the QStringData on the heap.

Introducing QStringLiteral

Is it possible to avoid the allocation and copy of the string literal even for the cases like setObjectName? Yes, that is what QStringLiteral is doing.

This macro tries to generate the QStringData at compile time, with all fields initialized. It will even be located in the .rodata section, so it can be shared between processes.

We need two language features to do that:

  1. The possibility to generate UTF-16 at compile time:
    On Windows we can use the wide char L"String". On Unix we are using the new C++11 Unicode literal: u"String". (Supported by GCC 4.4 and clang.)
  2. The ability to create static data from expressions.
    We want to be able to put QStringLiteral everywhere in the code. One way to do that is to put a static QStringData inside a C++11 lambda expression. (Supported by MSVC 2010 and GCC 4.5) (And we also make use of the GCC __extension__ ({ }) )

Implementation

We will need a POD structure that contains both the QStringData and the actual string data. Its exact layout depends on the way we generate the UTF-16.


/* We define QT_UNICODE_LITERAL_II and declare the qunicodechar
   depending on the compiler */
#if defined(Q_COMPILER_UNICODE_STRINGS)
   // C++11 unicode string
   #define QT_UNICODE_LITERAL_II(str) u"" str
   typedef char16_t qunicodechar;
#elif __SIZEOF_WCHAR_T__ == 2
   // wchar_t is 2 bytes  (condition a bit simplified)
   #define QT_UNICODE_LITERAL_II(str) L##str
   typedef wchar_t qunicodechar;
#else
   typedef ushort qunicodechar; // fallback
#endif

// The structure that will contain the string.
// N is the string size
template <int N>
struct QStaticStringData
{
    QStringData str;
    qunicodechar data[N + 1];
};

// Helper class wrapping a pointer that we can pass to the QString constructor
struct QStringDataPtr
{ QStringData *ptr; };


#if defined(QT_UNICODE_LITERAL_II)
// QT_UNICODE_LITERAL is needed because of macro expansion rules
# define QT_UNICODE_LITERAL(str) QT_UNICODE_LITERAL_II(str)
# if defined(Q_COMPILER_LAMBDA)

#  define QStringLiteral(str) \
    ([]() -> QString { \
        enum { Size = sizeof(QT_UNICODE_LITERAL(str))/2 - 1 }; \
        static const QStaticStringData<Size> qstring_literal = { \
            Q_STATIC_STRING_DATA_HEADER_INITIALIZER(Size), \
            QT_UNICODE_LITERAL(str) }; \
        QStringDataPtr holder = { &qstring_literal.str }; \
        const QString s(holder); \
        return s; \
    }()) \

# elif defined(Q_CC_GNU)
// Use the GCC __extension__ ({ }) trick instead of a lambda
// ... <skipped> ...
# endif
#endif

#ifndef QStringLiteral
// no lambdas, not GCC, or GCC in C++98 mode with 4-byte wchar_t
// fallback, return a temporary QString
// source code is assumed to be encoded in UTF-8
# define QStringLiteral(str) QString::fromUtf8(str, sizeof(str) - 1)
#endif

Let us simplify this macro a bit and look at how it would expand:

o->setObjectName(QStringLiteral("MyObject"));
// would expand to: 
o->setObjectName(([]() {
        // We are in a lambda expression that returns a QString

        // Compute the size using sizeof, (minus the null terminator)
        enum { Size = sizeof(u"MyObject")/2 - 1 };

        // Initialize. (This is static data initialized at compile time.)
        static const QStaticStringData<Size> qstring_literal =
        { { /* ref = */ -1, 
            /* size = */ Size, 
            /* alloc = */ 0, 
            /* capacityReserved = */ 0, 
            /* offset = */ sizeof(QStringData) },
          u"MyObject" };

         QStringDataPtr holder = { &qstring_literal.str };
         QString s(holder); // call the QString(QStringDataPtr&) constructor
         return s;
    }()) // Call the lambda
  );

The reference count is initialized to -1. A negative value is never incremented or decremented, because the data is read-only.

One can see why it is so important to have an offset (qptrdiff) rather than a pointer to the string (ushort*) as in Qt 4. It is impossible to put pointers in the read-only section, because pointers may need to be relocated at load time: each time an application or library is loaded, the OS would need to rewrite all the pointer values using the relocation table.

Results

For fun, we can look at the assembly generated for a very simple call to QStringLiteral. We can see that there is almost no code, and how the data is laid out in the .rodata section.

We notice the overhead in the binary. The string takes twice as much memory since it is encoded in UTF-16, and there is also a header of sizeof(QStringData) = 24 bytes. This memory overhead is the reason why it still makes sense to use QLatin1String when the function you are calling has an overload for it.

QString returnAString() {
    return QStringLiteral("Hello");
}

Compiled with g++ -O2 -S -std=c++0x (GCC 4.7) on x86_64

    .text
    .globl  _Z13returnAStringv
    .type   _Z13returnAStringv, @function
_Z13returnAStringv:
    ; load the address of the QStringData into %rdx
    leaq    _ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal(%rip), %rdx
    movq    %rdi, %rax
    ; copy the QStringData from %rdx to the QString return object
    ; allocated by the caller.  (the QString constructor has been inlined)
    movq    %rdx, (%rdi)
    ret
    .size   _Z13returnAStringv, .-_Z13returnAStringv
    .section    .rodata
    .align 32
    .type   _ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal, @object
    .size   _ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal, 40
_ZZZ13returnAStringvENKUlvE_clEvE15qstring_literal:
    .long   -1   ; ref
    .long   5    ; size
    .long   0    ; alloc + capacityReserved 
    .zero   4    ; padding
    .quad   24   ; offset
    .string "H"  ; the data. Each .string adds a terminating '\0'
    .string "e"
    .string "l"
    .string "l"
    .string "o"
    .string ""
    .string ""
    .zero   4

Conclusion

I hope that now that you have read this, you have a better understanding of where to use QStringLiteral and where not to.
There is another macro, QByteArrayLiteral, which works on exactly the same principle but creates a QByteArray.

C++11 in Qt5


C++11 is the name of the current version of the C++ standard, which brings many new features to the language.

Qt 4.8 was the first version of Qt that started to make use of some of the new C++11 features in its API. I wrote a blog post about C++11 in Qt 4.8 before the 4.8 release which I won't repeat here.

In Qt5, we make use of even more features. I will go into detail about some of them in this post.

Lambda expressions for slots

Lambda expressions are a new C++11 syntax for declaring anonymous functions. They make it possible to pass small functions as parameters without having to declare them explicitly.
The previous way of writing functors, using operator() in a struct, required a lot of boilerplate code.

It was already possible in Qt 4.8 to use lambda expressions in some QtConcurrent functions. Now it is even possible to use them as slots using the new connect syntax.

Remember the times when you had to write a one-line function just for a slot? Now you can write it right in place. It is much more readable to have the function directly where you reference it:

 connect(sender, &Sender::valueChanged, [=](const QString &newValue) {
        receiver->updateValue("senderValue", newValue);
    });

Lambda functions are supported since MSVC 2010, GCC 4.5, and clang 3.1.

Unicode literal

In C++11, you can generate UTF-16 by writing u"MyString". This is used by Qt to implement QStringLiteral which is a macro that initializes the QString at compile time without run-time overhead.

    QString someString = QStringLiteral("Hello");

See my previous blog about QStringLiteral.

Constant expressions: constexpr

The new C++11 constexpr keyword annotates inline functions that can be computed at compile time. In Qt 5, we introduced Q_DECL_CONSTEXPR, which is defined to constexpr when the compiler supports it, and to nothing otherwise.

We also annotated the Qt functions where it made sense (for example in QFlags), which allows them to be used in constant expressions.

enum SomeEnum { Value1, Value2, Value3 };
Q_DECLARE_OPERATORS_FOR_FLAGS(QFlags<SomeEnum>)
// The previous line declares
// Q_DECL_CONSTEXPR QFlags<SomeEnum> operator|(SomeEnum, SomeEnum) {...}

int someFunction(QFlags<SomeEnum> value) {
    switch (value) {
        case SomeEnum::Value1:
            return 1;
        case SomeEnum::Value2:
            return 2;
        case SomeEnum::Value1 | SomeEnum::Value3:
        // Only possible with C++11, and because the QFlags operators are constexpr.
        // Previously this line would call
        //        QFlags<SomeEnum> operator|(SomeEnum, SomeEnum)
        // which would have caused an error, because only compile-time constants
        // are allowed as case statements

            return 3;
        default:
            return 0;
    }
}

(Notice also that I used SomeEnum:: in front of the values, which is allowed in C++11 but was not allowed before.)

static_assert

C++11 helps produce better error messages when something wrong can be detected at compile time, using static_assert. In Qt5 we introduced the macros Q_STATIC_ASSERT and Q_STATIC_ASSERT_X, which use static_assert if it is available, and some template tricks otherwise.

Qt already uses these macros quite a lot in order to provide better compiler errors when the API is not used properly.

Override and final

Have you ever had code that did not work because you had a typo in the name of a virtual function you re-implemented? (Or forgot that damn const at the end)

You can now annotate the functions that are meant to override virtual functions with Q_DECL_OVERRIDE.

That macro is substituted with the new override attribute if the compiler supports it, and with nothing otherwise. If you compile your code with a C++11-enabled compiler, you will get an error if you made a typo or changed a virtual method while refactoring.

class MyModel : public QStringListModel {
    //...
protected:
     Qt::ItemFlags flags (const QModelIndex & index)  Q_DECL_OVERRIDE;
};

And because we forgot the const, this will produce errors such as:

mymodel.h:15: error: `Qt::ItemFlags MyModel::flags(const QModelIndex&)`
 marked override, but does not override

There is also Q_DECL_FINAL, which is substituted with the final attribute, specifying that a virtual function cannot be overridden.

Deleted member

The new macro Q_DECL_DELETE expands to = delete for compilers that support deleted functions. This is useful for giving better compiler errors and avoiding common mistakes.

Deleted functions explicitly remove functions that would otherwise be created automatically by the compiler (such as a default constructor or a copy assignment operator). Deleted functions may not be called, and a compile error is raised if they are used.

We use it for example in the Q_DISABLE_COPY macro. Before, the trick was to keep those members private, but the error message was not as good.

Rvalue References and Move Constructors

In the Qt 4.8 article I already explained what rvalue references are about.

Due to a change in the internals of the reference counting of our shared classes in Qt5, it was possible to add move constructors to many of them.

Conclusion

MSVC enables its C++11 features by default and does not require any special flags, but GCC and Clang require -std=c++0x.

By default, Qt5 itself will be compiled with the C++11 flags on compilers that need it.

If you use qmake, you can add that line to your .pro file (Qt5):

CONFIG += c++11

(In Qt4, it should be something like: gcc:CXXFLAGS += -std=c++0x)

And now you can enjoy all the nice features of C++11. (It is worth doing even just to be able to use auto.)

UTF-8 processing using SIMD (SSE4)


SIMD ("Single Instruction, Multiple Data") is a class of instructions present in many CPUs today. On Intel CPUs, for example, they are known under the SSE acronym. These instructions enable more parallelism by operating simultaneously on multiple data.

In this blog post I will present a method for converting UTF-8 text to UTF-16 using SSE4 compiler intrinsics. My goal is also to introduce you to the SIMD intrinsics, if you are not familiar with them yet.

About Intrinsics

CPUs can parallelize computations by using special instructions that operate simultaneously on multiple values. Large registers contain vectors of values: the SSE registers are 128-bit vectors that can contain either 2 double, 4 float, 4 int, 8 short, or up to 16 char values. A single instruction can then add, subtract, multiply, ... all of those values and put the result in another vector register.
This is called SIMD, for Single Instruction, Multiple Data.

Compilers can sometimes automatically vectorize the contents of a simple loop, but they are not really good at it. If the loop is too complex or an iteration depends on the previous one, the compiler will fail to vectorize, even if it would be obvious to a human. The vectorization therefore has to be done manually.

This is where compiler intrinsics help us. They are special functions, handled directly by the compiler, which emit the instruction we want.

The problem with intrinsics is that they are not really portable. They target a specific architecture. This means you need to write your algorithms for the different CPU architectures you target. Fortunately, the main compilers use the same intrinsics so they are portable across compilers and operating systems.

You are supposed to detect support for the instruction set at run time. What is usually done is to write several variants of the function for the different target architectures, and set a function pointer to the best variant for the architecture the program is running on.

The best documentation I could find is the reference on MSDN. You will have to browse it a bit to see what the available intrinsics are.

The Task: Converting UTF-8 to UTF-16

I got interested in SIMD when Benjamin was using them to speed up QString::fromLatin1, which converts a char* encoded in Latin-1 to UTF-16 (QString's internal encoding). After he explained to me how it works, I was able to speed up some of the drawing code inside Qt.
However in Qt5, we changed the default string encoding from Latin-1 to UTF-8.

UTF-8 can represent all of Unicode, while Latin-1 can only represent the alphabets of a few Latin-based languages. UTF-8 parsing is also much more difficult to vectorize, because a single code point can be encoded in a variable number of bytes.

This table explains the correspondence between the bits in UTF-8 and UTF-16:

UTF-8                                    UTF-16
ASCII (1 byte):
  0aaaaaaa                               00000000 0aaaaaaa
Basic Multilingual Plane (2 or 3 bytes):
  110bbbbb 10aaaaaa                      00000bbb bbaaaaaa
  1110cccc 10bbbbbb 10aaaaaa             ccccbbbb bbaaaaaa
Supplementary Planes (4 bytes):
  11110ddd 10ddcccc 10bbbbbb 10aaaaaa    110110uu uuccccbb  110111bb bbaaaaaa
                                         (with uuuu = ddddd - 1)

I was wondering if we could improve QString::fromUtf8 using SIMD.
Most strings are just ASCII, which is easy to vectorize, but I wanted to go ahead and vectorize the more complicated cases as well.
However, 4-byte sequences (Supplementary Planes) are very rarely used (the most used languages and useful symbols are already in the Basic Multilingual Plane) and also more complicated, so I did not vectorize that case for this blog post. This is left as an exercise for the reader ;-)

Previous work

UTF-8 processing is a very common task, so I was expecting the problem to be solved already. Currently, however, most libraries and applications still have a scalar implementation. Few use vectorization, and then only for ASCII.

There is a library called u8u16. It uses a clever technique of separating the bits into individual vectors: it processes 128 characters at a time in 8 different vectors.
I wanted to try a different approach and work directly on a single vector. Working with 16 characters at a time allows speeding up the processing of smaller strings.

The Easy Case: ASCII

I'm going to start by explaining the simple case when we only have ASCII. This is very easy to vectorize, because for ASCII, one byte in UTF-8 is always one byte in UTF-16, with \0 bytes in between.
ASCII is also a very common encoding, so it makes sense to have a special case for it.

I will start by showing the code and then explain it.

Our function takes a pointer src to the UTF-8 buffer of length len, and a pointer dst to a buffer of size at least len*2 where we will store the UTF-16 output. The function returns the size of the UTF-16 output.

int fromUtf8(const char *src, int len, unsigned short *dst) {
  /* We will process the input 16 bytes at a time,
     so the length must be at least 16. */
    while(len >= 16) {

        /* Load 128 bit into a vector register. We use the 'loadu' intrinsic
           where 'u' stands for unaligned. Loading aligned data is much faster,
           but here we don't know if the source is aligned */
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));

        /* Detect ASCII: check whether any byte has its high bit set. */
        if (!_mm_movemask_epi8(chunk)) {

            // unpack the first 8 bytes, padding with zeros
            __m128i firstHalf = _mm_unpacklo_epi8(chunk, _mm_set1_epi8(0));
            // and store to the destination
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), firstHalf); 

            // do the same with the last 8 bytes
            __m128i secondHalf = _mm_unpackhi_epi8 (chunk, _mm_set1_epi8(0));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst+8), secondHalf);

            // Advance
            dst += 16;
            src += 16;
            len -= 16;
            continue;
        }

        // handle the more complicated case when the chunk contains multi-bytes
        // ...
    }
    // handle the few remaining bytes using a classical serial algorithm
    // ...
}

The type __m128i represents a vector of 128 bits. It is most likely going to be stored in an SSE register.
Since we don't know the alignment, we load the data with _mm_loadu_si128. There is also _mm_load_si128, which assumes the memory is aligned on 16 bytes. Because it is faster to operate on aligned data, some algorithms start with a prologue that handles the first bytes until the rest is aligned. I did not do that here, but we could optimize the ASCII processing further by doing so.

Once we have loaded 16 bytes of data into chunk, we need to check if it contains only ASCII. Each ASCII byte has its most significant bit set to 0. _mm_movemask_epi8 creates a 16-bit mask out of the most significant bits of the 16 bytes in the vector. If that mask is 0, we have only ASCII in the vector.

The next operation is to expand with zeroes, using what is called unpacking. The instructions _mm_unpackhi_epi8 and _mm_unpacklo_epi8 interleave two vectors. Because unpacking two 16-byte vectors would produce 32 bytes of result, there are two instructions: one interleaving the 8 low bytes of each input and one interleaving the 8 high bytes.
_mm_set1_epi8(0) creates a vector filled with zeroes.

Note that vectors are in little endian, so I show the least significant bytes first

Now, we can just store the result with _mm_storeu_si128. We use the unaligned operation again.

General Algorithm

Now that I have introduced you to the ASCII decoding process, we can move on to the general case, in which we might have multiple bytes for one character.

The algorithm proceeds with the following steps:

  1. First, we classify each byte, so we know where the sequences start and how long they are. This information is stored in mask vectors, so that the operations we do later apply only to a certain category of bytes.
  2. Then, we operate on the data to find the actual Unicode value behind each character. The values are stored at the location of the last byte of each sequence, in two different vectors: one for the low bits and one for the high bits.
  3. Then we need to shuffle the vectors to remove the possible "gaps" left by multi-byte sequences.
  4. Finally, we can store the result and advance the pointers.

I will explain every step in more detail.

In our example, I will decode the string x≤(α+β)²γ², which is represented in UTF-8 as x\e2\89\a4(\ce\b1+\ce\b2)\c2\b2\ce\b3\c2\b2. The "≤" is represented by 3 bytes, the Greek letters by two bytes each, and the "²" also by two bytes. The total length is 17 bytes, so the last byte will be truncated when loaded into the vector.

Classification

At the end of this step, we will have the following vectors:

  • mask tells which bit needs to be set to 0 in order to only keep the bits that are going to end up in the Unicode value. Bits of mask that are set will be unset from the data.
  • count is zero if the corresponding byte is ASCII or a continuation byte. If the byte starts a sequence, it is the number of bytes in that sequence.
  • counts represents (for each byte) how many bytes are remaining in this sequence. It is 0 for ASCII, 1 for the last byte in a sequence, and so on.

We start by finding count and mask. I do that by computing a vector I called state, in which the 5 upper bits are the mask and the 3 lower bits are the count.

We will use the _mm_cmplt_epi8 instruction. It compares two vectors of 16 signed chars and returns a mask vector where each byte is 0xff or 0x00, depending on the result of the comparison of the corresponding chars.
We can then use this mask as input to the _mm_blendv_epi8 SSE4 instruction. It takes three arguments: two vectors and a mask. It returns a vector that takes the bytes of the second input where the most significant bit of the corresponding mask byte is one, and the bytes of the first input otherwise. In other words, for each of the 16 bytes of the vectors, we have:
output[i] = mask[i] & 0x80 ? input2[i] : input1[i]

The problem with _mm_cmplt_epi8 is that it only works on signed bytes. That is why we add 0x80 to everything, to simulate an unsigned comparison.

    // The state for an ASCII or continuation byte: mask the most significant bit.
    __m128i state = _mm_set1_epi8(0x0 | 0x80);

    // Add an offset of 0x80 to work around the fact that we don't have
    // unsigned comparison
    __m128i chunk_signed = _mm_add_epi8(chunk, _mm_set1_epi8(0x80));

    // If a byte is greater than or equal to 0xc0, it is the start of a
    // multi-byte sequence of at least 2 bytes.
    // We use 0xc2 for error detection, see later.
    __m128i cond2 = _mm_cmplt_epi8( _mm_set1_epi8(0xc2-1 -0x80), chunk_signed);

    state = _mm_blendv_epi8(state , _mm_set1_epi8(0x2 | 0xc0),  cond2);

    __m128i cond3 = _mm_cmplt_epi8( _mm_set1_epi8(0xe0-1 -0x80), chunk_signed);

    // We could optimize the case when there is no sequence longer than 2 bytes,
    // but I did not do it in this version.
    //if (!_mm_movemask_epi8(cond3)) { /* process max 2 bytes sequences */ }

    state = _mm_blendv_epi8(state , _mm_set1_epi8(0x3 | 0xe0),  cond3);
    __m128i mask3 = _mm_slli_si128(cond3, 1);

    // This version does not handle 4-byte sequences, so if there is one,
    // we break and process the bytes one by one.
    __m128i cond4 = _mm_cmplt_epi8(_mm_set1_epi8(0xf0-1 -0x80), chunk_signed);
    if (_mm_movemask_epi8(cond4)) { break; }

    // separate the count and mask from the state vector
    __m128i count =  _mm_and_si128(state, _mm_set1_epi8(0x7));
    __m128i mask = _mm_and_si128(state, _mm_set1_epi8(0xf8));

From count I can compute counts. I use _mm_subs_epu8, which subtracts two vectors of unsigned bytes with saturation (so if you underflow, you get 0). I then shift the result by one byte and add it to count. Then we do the same again, but subtracting 2 and shifting the result by two bytes.

_mm_slli_si128 shifts a vector left by the given number of bytes. Notice again that we work in little endian; that is why the shift goes right in the picture but left in the code.

    // subtract 1, shift by 1 byte, and add
    __m128i count_subs1 = _mm_subs_epu8(count, _mm_set1_epi8(0x1));
    __m128i counts = _mm_add_epi8(count, _mm_slli_si128(count_subs1, 1));

    // subtract 2, shift by 2 bytes, and add
    counts = _mm_add_epi8(counts, _mm_slli_si128(
                    _mm_subs_epu8(counts, _mm_set1_epi8(0x2)), 2));

Processing

The goal of this step is to end up with the final Unicode byte values. They are split into two vectors (one for the high bytes and one for the low bytes). We keep them at the location of the last byte of each sequence, which means that there will be "gaps" in the vectors.

There is no instruction that shifts the whole vector by an arbitrary number of bits; it is only possible to shift by an integral number of bytes, with _mm_slli_si128 or _mm_srli_si128. The instructions that shift by an arbitrary number of bits only work on vectors with 16- or 32-bit elements, not 8-bit ones. We will use what we have: first shift by one byte (into chunk_right), then shift and mask to get what we want.

    // Mask away our control bits with ~mask (and not)
    chunk = _mm_andnot_si128(mask , chunk);
    // from now on, we only have useful bits

    // shift by one byte on the left
    __m128i chunk_right = _mm_slli_si128(chunk, 1);

    // If counts == 1,  compute the low byte using 2 bits from chunk_right
    __m128i chunk_low = _mm_blendv_epi8(chunk,
                _mm_or_si128(chunk, _mm_and_si128(
                    _mm_slli_epi16(chunk_right, 6), _mm_set1_epi8(0xc0))),
                _mm_cmpeq_epi8(counts, _mm_set1_epi8(1)) );

    // in chunk_high, only keep the bits if counts == 2
    __m128i chunk_high = _mm_and_si128(chunk ,
                                       _mm_cmpeq_epi8(counts, _mm_set1_epi8(2)));
    // and shift that by 2 bits on the right.
    // (no unwanted bits are coming in because of the previous mask)
    chunk_high = _mm_srli_epi32(chunk_high, 2);

    // Add the bits from the bytes for which counts == 3;
    // mask3 was computed earlier from cond3, shifted by one byte
    chunk_high = _mm_or_si128(chunk_high, _mm_and_si128(
        _mm_and_si128(_mm_slli_epi32(chunk_right, 4), _mm_set1_epi8(0xf0)),
                              mask3));

Shuffle

Now we need to get rid of the gaps in our strings. We start by computing, for each byte, the number of bytes it needs to be shifted by. Notice that count_subs1 is the number of bytes to skip for a given sequence. We accumulate all those numbers in order to get, for each byte, the total number of bytes to shift. Then we reset to zero all the bytes that are supposed to go away.

    __m128i shifts = count_subs1;
    shifts = _mm_add_epi8(shifts, _mm_slli_si128(shifts, 2));
    shifts = _mm_add_epi8(shifts, _mm_slli_si128(shifts, 4));
    shifts = _mm_add_epi8(shifts, _mm_slli_si128(shifts, 8));

    // keep only if the corresponding byte should stay
    // that is, if counts is 1 or 0  (so < 2)
    shifts  = _mm_and_si128 (shifts , _mm_cmplt_epi8(counts, _mm_set1_epi8(2)));

We then shift the shifts vector itself, so that its meaning becomes the number of bytes separating a byte's final location from its original location.

The maximum we can shift by is 10 (if everything is a 3-byte sequence), so we can do the computation in 4 steps.
At each step, we shift the interesting bit of the shifts vector into the most significant position, so it can control the blend operation.


        shifts = _mm_blendv_epi8(shifts, _mm_srli_si128(shifts, 1),
                                 _mm_srli_si128(_mm_slli_epi16(shifts, 7) , 1));

        shifts = _mm_blendv_epi8(shifts, _mm_srli_si128(shifts, 2),
                                 _mm_srli_si128(_mm_slli_epi16(shifts, 6) , 2));

        shifts = _mm_blendv_epi8(shifts, _mm_srli_si128(shifts, 4),
                                _mm_srli_si128(_mm_slli_epi16(shifts, 5) , 4));

        shifts = _mm_blendv_epi8(shifts, _mm_srli_si128(shifts, 8),
                                 _mm_srli_si128(_mm_slli_epi16(shifts, 4) , 8));

Now we know, for each byte of the final vector, its location in the original vector. With that information, we can use _mm_shuffle_epi8 (SSSE3) to shuffle the vector and remove the "gaps". Then we can unpack the result and store it, and we are almost done.

    __m128i shuf = _mm_add_epi8(shifts,
            _mm_set_epi8(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0));

    // Remove the gaps by shuffling
    __m128i shuffled_low = _mm_shuffle_epi8(chunk_low, shuf);
    __m128i shuffled_high = _mm_shuffle_epi8(chunk_high, shuf);

    // Now we can unpack and store
    __m128i utf16_low = _mm_unpacklo_epi8(shuffled_low, shuffled_high);
    __m128i utf16_high = _mm_unpackhi_epi8(shuffled_low, shuffled_high);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), utf16_low);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst+8) , utf16_high);

Error Detection

So far we have only considered valid UTF-8, but a conformant UTF-8 decoder must also handle broken input. It should detect the following error cases:

  1. Not enough, misplaced or too many continuation characters.
  2. Overlong forms: a sequence that decodes to a value which could have been encoded in a shorter sequence. For example, the quote must be encoded in its ASCII form (\22) and not as a multi-byte sequence (\c0\a2 or \e0\80\a2). Not forbidding those forms can lead to security problems, e.g. if quotes were escaped in the UTF-8 input string but appear again once decoded.
  3. Invalid or reserved Unicode code points (for example, the ones reserved for UTF-16 surrogates).

Since errors are not the common case, what I do when one is detected is to break out and let the rest be handled by the sequential algorithm (just like when we encounter 4-byte sequences).


    // ASCII characters (and only them) should have the
    // corresponding byte of counts equal 0.
    if (asciiMask ^ _mm_movemask_epi8(_mm_cmpgt_epi8(counts, _mm_set1_epi8(0))))
            break;

    // The difference between a byte of counts and the next one should be
    // negative, zero, or one. Any other value means there are not enough
    // continuation bytes.
    if (_mm_movemask_epi8(_mm_cmpgt_epi8(_mm_sub_epi8(_mm_slli_si128(counts, 1),
                counts), _mm_set1_epi8(1))))
            break;

    // For 3-byte sequences, we check the high byte to reject
    // overlong sequences (0x00-0x07) and UTF-16 surrogates (0xd8-0xdf)
    __m128i high_bits = _mm_and_si128(chunk_high, _mm_set1_epi8(0xf8));
    if (!_mm_testz_si128(mask3,
                _mm_or_si128(_mm_cmpeq_epi8(high_bits,_mm_set1_epi8(0x00)) ,
                            _mm_cmpeq_epi8(high_bits,_mm_set1_epi8(0xd8))) ))
        break;

    // Check for a few more invalid code points using range comparison and _mm_cmpestrc
    const int check_mode = _SIDD_UWORD_OPS | _SIDD_CMP_RANGES;
    if (_mm_cmpestrc( _mm_cvtsi64_si128(0xfdeffdd0fffffffe), 4,
                     utf16_high, 8, check_mode) |
        _mm_cmpestrc( _mm_cvtsi64_si128(0xfdeffdd0fffffffe), 4,
                      utf16_low, 8, check_mode))
            break;

Advance Pointers

Now we are almost done; we just need to advance the source and destination pointers. The source pointer usually advances by 16, unless the chunk ends in the middle of a sequence; in that case, we need to roll back by 1 or 2 bytes. To find out, we look at the end of the counts vector.
We can then see how much to advance the destination by looking at the end of the shifts vector.

    int c = _mm_extract_epi16(counts, 7);
    int source_advance = !(c & 0x0200) ? 16 : !(c & 0x02) ? 15 : 14;

    int s = _mm_extract_epi32(shifts, 3);
    int dst_advance = source_advance - (0xff & (s >> 8*(3 - 16 + source_advance)));

    dst += dst_advance;
    src += source_advance;

Putting it all together

Now we are done, you can see the result there

I interleaved the operations to limit the dependencies between instructions that are next to each other, in order to achieve better pipelining. I am surprised that the compiler did not do that for me.

Benchmarks Results

                          Long Chinese text (8KiB)  36 bytes of Chinese text  Long ASCII (24KiB)  20 ASCII bytes
Iconv                     62000ns                   380ns                     165000ns            230ns
Qt (scalar)               43000ns                   180ns                     55000ns             67ns
u8u16                     20000ns                   300ns                     8000ns              37ns
This solution (SSE4.1)    22000ns                   110ns                     15000ns             40ns

We beat u8u16 on small strings but it is better on large strings because it processes more bytes at the same time. We are still better than any scalar processing.

I added some debug output in the implementation of QString::fromUtf8 in Qt 4.8 and ran a few KDE applications translated into French. The average size of the strings that go through this function is 9 characters. More than 50% of the strings are less than 2 characters long, 85% are less than 16, and 13% are between 16 and 128 characters long. 98% of the strings were composed only of ASCII. The rest mainly contained 2-byte sequences and very few 3-byte sequences. There were no 4-byte sequences.
Also consider that the processing of UTF-8 is probably far from being the bottleneck of any application that does something useful.

So was my exercise useless? Maybe in practice, but I think there is still something to learn from this that can be applied to another algorithm that might be a bottleneck.

Browsing C++ Source Code on the Web


As a developer, I write code, but I also read a lot of code. Often the code of other people. It is easy to get lost among all the functions, objects or files. What calls what? What does this function exactly do? Where is this variable modified? It is important that as much information as possible can be presented or easily accessible.

Good IDEs are good at displaying that information and ease the navigation in the source code. But often, I want to browse source code which I don't necessarily have on my hard drive. While I look at code on the web, I am very disturbed by the poor browsing experience. This is why I developed an online code browser.

Introducing the Woboq Codebrowser

Using clang as a parser, I was able to imitate the UI of KDevelop (my favorite IDE). There is also a Qt Creator skin, for those who like it sober.

[Screenshots: KDevelop style and Qt Creator style]

The idea is a bit based on LXR or Mozilla's DXR. But the goal is more to give better visualization and browsing experience rather than a search tool.

Semantic highlighting

Once you have tasted semantic highlighting, you cannot go back to code that is not properly highlighted.

In this screenshot one can see the KDevelop color scheme: green for types, dark yellow for members of the current class. Local variables get light colors, with a different color per variable, which lets you quickly recognize what is what. Virtual methods are in italic.

When you hover over a symbol, all its uses get highlighted.

Navigation

Clicking on a symbol brings you to its definition. You can browse code like you browse the web: open links in new tabs or go back in your browser history.

Tooltips

[Screenshot: a tooltip]

That's where it gets interesting: most symbols have a tooltip showing the type of the symbol, the names of a function's arguments, and even their default values. It will also show what it thinks might be the documentation for a given symbol, directly in the tooltip.

You can also see the locations where a given function or variable is used, and clicking one takes you to the right location.

Browse code now!

Woboq is hosting the code of some projects we find useful: Qt5 (which contains V8, JavaScriptCore, ...), Qt4, Qt Creator, KDE (most of the modules, plus some extras like Quassel), LLVM/Clang, the Linux kernel, and glibc.

So if you ever wanted to look deeper into the implementation of one of the programs you use, visit http://code.woboq.org.

Technical Details

The HTML pages are generated by a tool that uses the clang library from the LLVM project. It parses the code as if it were compiling it, visits the AST to annotate every token, and builds an index of the uses and definitions.

It currently handles C++ and C. I did not spend much time on templates or on the preprocessor (macros), so it is still not as good as KDevelop in this respect.

Contact us if you are interested in using it for your own sources.

Update: The source code and licence plan are now available.


How Qt Signals and Slots Work


Qt is well known for its signals and slots mechanism. But how does it work? In this blog post, we will explore the internals of QObject and QMetaObject and discover how signals and slots work under the hood.

In this blog article, I show portions of Qt5 code, sometimes edited for formatting and brevity.

Signals and Slots

First, let us recall what signals and slots look like by showing the official example.

If you read this article from the RSS feed, you may want to open it at its original URL to get properly formatted code.

The header looks like this:

class Counter : public QObject
{
    Q_OBJECT
    int m_value;
public:
    int value() const { return m_value; }
public slots:
    void setValue(int value);
signals:
    void valueChanged(int newValue);
};

Somewhere in the .cpp file, we implement setValue()

void Counter::setValue(int value)
{
    if (value != m_value) {
        m_value = value;
        emit valueChanged(value);
    }
}

Then one can use this Counter object like this:

  Counter a, b;
  QObject::connect(&a, SIGNAL(valueChanged(int)),
                   &b, SLOT(setValue(int)));

  a.setValue(12);  // a.value() == 12, b.value() == 12

This is the original syntax that has almost not changed since the beginning of Qt in 1992.

But even if the basic API has not changed since the beginning, its implementation has been changed several times. New features have been added and a lot happened under the hood. There is no magic involved and this blog post will show you how it works.

MOC, the Meta Object Compiler

The Qt signals/slots and property system are based on the ability to introspect the objects at runtime. Introspection means being able to list the methods and properties of an object and have all kinds of information about them such as the type of their arguments.
QtScript and QML would have hardly been possible without that ability.

C++ does not offer introspection support natively, so Qt comes with a tool to provide it. That tool is MOC. It is a code generator (and NOT a preprocessor like some people call it).

It parses the header files and generates an additional C++ file that is compiled with the rest of the program. That generated C++ file contains all the information required for the introspection.

Qt has sometimes been criticized by language purists because of this extra code generator. I will let the Qt documentation respond to this criticism. There is nothing wrong with code generators, and the MOC is of great help.

Magic Macros

Can you spot the keywords that are not pure C++ keywords? signals, slots, Q_OBJECT, emit, SIGNAL, SLOT. Those are known as the Qt extension to C++. They are in fact simple macros, defined in qobjectdefs.h

#define signals public
#define slots /* nothing */

That is right, signals and slots are simple functions: the compiler will handle them like any other functions. The macros still serve a purpose though: the MOC will see them.

Signals were protected in Qt4 and before. They are becoming public in Qt5 in order to enable the new syntax.

#define Q_OBJECT \
public: \
    static const QMetaObject staticMetaObject; \
    virtual const QMetaObject *metaObject() const; \
    virtual void *qt_metacast(const char *); \
    virtual int qt_metacall(QMetaObject::Call, int, void **); \
    QT_TR_FUNCTIONS /* translations helper */ \
private: \
    Q_DECL_HIDDEN static void qt_static_metacall(QObject *, QMetaObject::Call, int, void **);

Q_OBJECT defines a bunch of functions and a static QMetaObject. Those functions are implemented in the file generated by MOC.

#define emit /* nothing */

emit is an empty macro. It is not even parsed by MOC. In other words, emit is just optional and means nothing (except being a hint to the developer).

Q_CORE_EXPORT const char *qFlagLocation(const char *method);
#ifndef QT_NO_DEBUG
# define QLOCATION "\0" __FILE__ ":" QTOSTRING(__LINE__)
# define SLOT(a)     qFlagLocation("1"#a QLOCATION)
# define SIGNAL(a)   qFlagLocation("2"#a QLOCATION)
#else
# define SLOT(a)     "1"#a
# define SIGNAL(a)   "2"#a
#endif

Those macros just use the preprocessor to convert the parameter into a string and prepend a one-character code in front.

In debug mode we also annotate the string with the file location for a warning message if the signal connection did not work. This was added in Qt 4.5 in a compatible way. In order to know which strings have the line information, we use qFlagLocation which will register the string address in a table with two entries.

MOC Generated Code

We will now go over portion of the code generated by moc in Qt5.

The QMetaObject

const QMetaObject Counter::staticMetaObject = {
    { &QObject::staticMetaObject, qt_meta_stringdata_Counter.data,
      qt_meta_data_Counter,  qt_static_metacall, 0, 0}
};


const QMetaObject *Counter::metaObject() const
{
    return QObject::d_ptr->metaObject ? QObject::d_ptr->dynamicMetaObject() : &staticMetaObject;
}

We see here the implementation of Counter::metaObject() and Counter::staticMetaObject. They are declared in the Q_OBJECT macro. QObject::d_ptr->metaObject is only used for dynamic meta objects (QML Objects), so in general, the virtual function metaObject() just returns the staticMetaObject of the class.

The staticMetaObject is constructed in the read-only data. QMetaObject as defined in qobjectdefs.h looks like this:

struct QMetaObject
{
    /* ... Skipped all the public functions ... */

    enum Call { InvokeMetaMethod, ReadProperty, WriteProperty, /*...*/ };

    struct { // private data
        const QMetaObject *superdata;
        const QByteArrayData *stringdata;
        const uint *data;
        typedef void (*StaticMetacallFunction)(QObject *, QMetaObject::Call, int, void **);
        StaticMetacallFunction static_metacall;
        const QMetaObject **relatedMetaObjects;
        void *extradata; //reserved for future use
    } d;
};

The d indirection is there to symbolize that all the members should be private. They are not private in order to keep the type a POD and allow static initialization.

The QMetaObject is initialized with the meta object of the parent object (QObject::staticMetaObject in this case) as superdata. stringdata and data are initialized with some data explained later in this article. static_metacall is a function pointer initialized to Counter::qt_static_metacall.

Introspection Tables

First, let us analyze the integer data of QMetaObject.

static const uint qt_meta_data_Counter[] = {

 // content:
       7,       // revision
       0,       // classname
       0,    0, // classinfo
       2,   14, // methods
       0,    0, // properties
       0,    0, // enums/sets
       0,    0, // constructors
       0,       // flags
       1,       // signalCount

 // signals: name, argc, parameters, tag, flags
       1,    1,   24,    2, 0x05,

 // slots: name, argc, parameters, tag, flags
       4,    1,   27,    2, 0x0a,

 // signals: parameters
    QMetaType::Void, QMetaType::Int,    3,

 // slots: parameters
    QMetaType::Void, QMetaType::Int,    5,

       0        // eod
};

The first 14 ints make up the header. When there are two columns, the first column is a count and the second column is the index in this array where the corresponding description starts.
In this case we have 2 methods, and the method descriptions start at index 14.

Each method description is composed of 5 ints. The first one is the name; it is an index into the string table (we will look at the details later). The second integer is the number of parameters, followed by the index at which the parameter descriptions start. We will ignore the tag and flags for now. For each function, moc also saves the return type and, for each parameter, its type and an index to its name.

String Table

struct qt_meta_stringdata_Counter_t {
    QByteArrayData data[6];
    char stringdata[47];
};
#define QT_MOC_LITERAL(idx, ofs, len) \
    Q_STATIC_BYTE_ARRAY_DATA_HEADER_INITIALIZER_WITH_OFFSET(len, \
    offsetof(qt_meta_stringdata_Counter_t, stringdata) + ofs \
        - idx * sizeof(QByteArrayData) \
    )
static const qt_meta_stringdata_Counter_t qt_meta_stringdata_Counter = {
    {
QT_MOC_LITERAL(0, 0, 7),
QT_MOC_LITERAL(1, 8, 12),
QT_MOC_LITERAL(2, 21, 0),
QT_MOC_LITERAL(3, 22, 8),
QT_MOC_LITERAL(4, 31, 8),
QT_MOC_LITERAL(5, 40, 5)
    },
    "Counter\0valueChanged\0\0newValue\0setValue\0"
    "value\0"
};
#undef QT_MOC_LITERAL

This is basically a static array of QByteArray. The QT_MOC_LITERAL macro creates a static QByteArray that references a particular offset in the string data below.

Signals

The MOC also implements the signals. They are simple functions that just create an array of pointers to the arguments and pass it to QMetaObject::activate. The first element of the array is the return value; in our example it is 0 because the return value is void.
The third parameter passed to activate is the signal index (0 in this case).

// SIGNAL 0
void Counter::valueChanged(int _t1)
{
    void *_a[] = { 0, const_cast<void*>(reinterpret_cast<const void*>(&_t1)) };
    QMetaObject::activate(this, &staticMetaObject, 0, _a);
}

Calling a Slot

It is also possible to call a slot by its index in the qt_static_metacall function:

void Counter::qt_static_metacall(QObject *_o, QMetaObject::Call _c, int _id, void **_a)
{
    if (_c == QMetaObject::InvokeMetaMethod) {
        Counter *_t = static_cast<Counter *>(_o);
        switch (_id) {
        case 0: _t->valueChanged((*reinterpret_cast< int(*)>(_a[1]))); break;
        case 1: _t->setValue((*reinterpret_cast< int(*)>(_a[1]))); break;
        default: ;
        }
    }
}

The array of pointers to the arguments uses the same format as the one used for the signals. _a[0] is not touched because everything here returns void.

A Note About Indexes.

In each QMetaObject, the slots, signals and other invokable methods of that object are given an index, starting from 0. They are ordered so that the signals come first, then the slots, then the other methods. This index is internally called the relative index. It does not include the indexes of the parents.

But in general, we want a global index that is not relative to a particular class and includes all the other methods in the inheritance chain. To get it, we just add an offset to the relative index, which gives the absolute index. That is the index used in the public API, returned by functions like QMetaObject::indexOf{Signal,Slot,Method}.

The connection mechanism uses a vector indexed by signals. But the slots would waste space in that vector, and there are usually more slots than signals in an object. So since Qt 4.6, a new internal signal index which only counts the signals is used.

While developing with Qt, you only need to know about the absolute method index. But while browsing Qt's QObject source code, you must be aware of the difference between those three.

How Connecting Works.

The first thing Qt does when doing a connection is to find out the index of the signal and the slot. Qt will look up in the string tables of the meta object to find the corresponding indexes.

Then a QObjectPrivate::Connection object is created and added in the internal linked lists.

What information needs to be stored for each connection? We need a way to quickly access the connections for a given signal index. Since several slots can be connected to the same signal, each signal needs a list of the connected slots. Each connection must contain the receiver object and the index of the slot. We also want the connections to be automatically destroyed when the receiver is destroyed, so each receiver object needs to know which connections point to it so it can clear them.

Here is the QObjectPrivate::Connection as defined in qobject_p.h :

struct QObjectPrivate::Connection
{
    QObject *sender;
    QObject *receiver;
    union {
        StaticMetaCallFunction callFunction;
        QtPrivate::QSlotObjectBase *slotObj;
    };
    // The next pointer for the singly-linked ConnectionList
    Connection *nextConnectionList;
    //senders linked list
    Connection *next;
    Connection **prev;
    QAtomicPointer<const int> argumentTypes;
    QAtomicInt ref_;
    ushort method_offset;
    ushort method_relative;
    uint signal_index : 27; // In signal range (see QObjectPrivate::signalIndex())
    ushort connectionType : 3; // 0 == auto, 1 == direct, 2 == queued, 4 == blocking
    ushort isSlotObject : 1;
    ushort ownArgumentTypes : 1;
    Connection() : nextConnectionList(0), ref_(2), ownArgumentTypes(true) {
        //ref_ is 2 for the use in the internal lists, and for the use in QMetaObject::Connection
    }
    ~Connection();
    int method() const { return method_offset + method_relative; }
    void ref() { ref_.ref(); }
    void deref() {
        if (!ref_.deref()) {
            Q_ASSERT(!receiver);
            delete this;
        }
    }
};

Each object then has a connection vector: for each signal, it holds a linked list of QObjectPrivate::Connection.
Each object also has a reversed list of the connections it is the receiver of, used for automatic deletion. It is a doubly linked list.

Linked lists are used because they allow quickly adding and removing objects. They are implemented by having the pointers to the next/previous nodes within QObjectPrivate::Connection.
Note that the prev pointer of the sender list is a pointer to a pointer. That is because we don't really point to the previous node, but rather to the pointer to the next node within the previous node. This pointer is only used when the connection is destroyed, not to iterate backwards. It allows us to avoid a special case for the first item.

Signal Emission

When we emit a signal, we have seen that the MOC-generated code calls QMetaObject::activate.

Here is an annotated version of its implementation from qobject.cpp

void QMetaObject::activate(QObject *sender, const QMetaObject *m, int local_signal_index,
                           void **argv)
{
    activate(sender, QMetaObjectPrivate::signalOffset(m), local_signal_index, argv);
    /* We just forward to the next function here. We pass the signal offset of
     * the meta object rather than the QMetaObject itself
     * It is split into two functions because QML internals will call the later. */
}

void QMetaObject::activate(QObject *sender, int signalOffset, int local_signal_index, void **argv)
{
    int signal_index = signalOffset + local_signal_index;

    /* The first thing we do is quickly check a bit-mask of 64 bits. If it is 0,
     * we are sure there is nothing connected to this signal, and we can return
     * quickly, which means emitting a signal connected to no slot is extremely
     * fast. */
    if (!sender->d_func()->isSignalConnected(signal_index))
        return; // nothing connected to these signals, and no spy

    /* ... Skipped some debugging and QML hooks, and some sanity check ... */

    /* We lock a mutex because all operations in the connectionLists are thread safe */
    QMutexLocker locker(signalSlotLock(sender));

    /* Get the ConnectionList for this signal.  I simplified a bit here. The real code
     * also refcount the list and do sanity checks */
    QObjectConnectionListVector *connectionLists = sender->d_func()->connectionLists;
    const QObjectPrivate::ConnectionList *list =
        &connectionLists->at(signal_index);

    QObjectPrivate::Connection *c = list->first;
    if (!c) continue;
    // We need to check against last here to ensure that signals added
    // during the signal emission are not emitted in this emission.
    QObjectPrivate::Connection *last = list->last;

    /* Now iterates, for each slot */
    do {
        if (!c->receiver)
            continue;

        QObject * const receiver = c->receiver;
        const bool receiverInSameThread = QThread::currentThreadId() == receiver->d_func()->threadData->threadId;

        // determine if this connection should be sent immediately or
        // put into the event queue
        if ((c->connectionType == Qt::AutoConnection && !receiverInSameThread)
            || (c->connectionType == Qt::QueuedConnection)) {
            /* Will basically copy the argument and post an event */
            queued_activate(sender, signal_index, c, argv);
            continue;
        } else if (c->connectionType == Qt::BlockingQueuedConnection) {
            /* ... Skipped ... */
            continue;
        }

        /* Helper struct that sets the sender() (and reset it backs when it
         * goes out of scope */
        QConnectionSenderSwitcher sw;
        if (receiverInSameThread)
            sw.switchSender(receiver, sender, signal_index);

        const QObjectPrivate::StaticMetaCallFunction callFunction = c->callFunction;
        const int method_relative = c->method_relative;
        if (c->isSlotObject) {
            /* ... Skipped....  Qt5-style connection to function pointer */
        } else if (callFunction && c->method_offset <= receiver->metaObject()->methodOffset()) {
            /* If we have a callFunction (a pointer to the qt_static_metacall
             * generated by moc) we will call it. We also need to check the
             * saved metodOffset is still valid (we could be called from the
             * destructor) */
            locker.unlock(); // We must not keep the lock while calling use code
            callFunction(receiver, QMetaObject::InvokeMetaMethod, method_relative, argv);
            locker.relock();
        } else {
            /* Fallback for dynamic objects */
            const int method = method_relative + c->method_offset;
            locker.unlock();
            metacall(receiver, QMetaObject::InvokeMetaMethod, method, argv);
            locker.relock();
        }

        // Check if the object was not deleted by the slot
        if (connectionLists->orphaned) break;
    } while (c != last && (c = c->nextConnectionList) != 0);
}

Conclusion

We saw how connections are made and how signals are emitted. What we have not seen is the implementation of the new Qt5 syntax, but that will be the subject of another post.

Update: Part 2 is available.

How Qt Signals and Slots Work - Part 2 - Qt5 New Syntax


This is the sequel to my previous article explaining the implementation details of signals and slots. In Part 1, we saw the general principle and how it works with the old syntax. In this blog post, we will see the implementation details behind the new function-pointer-based syntax in Qt5.

New Syntax in Qt5

The new syntax looks like this:

  QObject::connect(&a, &Counter::valueChanged,
                   &b, &Counter::setValue);

Why the new syntax?

I already explained the advantages of the new syntax in a dedicated blog entry. To summarize, the new syntax allows compile-time checking of the signals and slots. It also allows automatic conversion of the arguments if they do not have the same types. As a bonus, it enables the support for lambda expressions.

New overloads

Only a few changes were required to make that possible.
The main idea is to add new overloads of QObject::connect which take pointers to functions as arguments instead of char*.

There are three new static overloads of QObject::connect: (not actual code)

  1. QObject::connect(const QObject *sender, PointerToMemberFunction signal,
                     const QObject *receiver, PointerToMemberFunction slot,
                     Qt::ConnectionType type)
  2. QObject::connect(const QObject *sender, PointerToMemberFunction signal,
                     PointerToFunction method)
  3. QObject::connect(const QObject *sender, PointerToMemberFunction signal,
                     Functor method)

The first one is the closest to the old syntax: you connect a signal from the sender to a slot in a receiver object. The two other overloads connect a signal to a static function or a functor object without a receiver.

They are very similar and we will only analyze the first one in this article.

Pointer to Member Functions

Before continuing my explanation, I would like to open a parenthesis to talk a bit about pointers to member functions.

Here is a simple sample code that declares a pointer to member function and calls it.

  void (QPoint::*myFunctionPtr)(int); // Declares myFunctionPtr as a pointer to
                                      // a member function returning void and
                                      // taking (int) as parameter
  myFunctionPtr = &QPoint::setX;
  QPoint p;
  QPoint *pp = &p;
  (p.*myFunctionPtr)(5); // calls p.setX(5);
  (pp->*myFunctionPtr)(5); // calls pp->setX(5);

Pointers to members and pointers to member functions belong to a subset of C++ that is not much used and thus lesser known.
The good news is that you still do not really need to know much about them to use Qt and its new syntax. All you need to remember is to put the & before the name of the signal in your connect call. You will not need to cope with the cryptic ::*, .* or ->* operators.

These cryptic operators allow you to declare a pointer to a member or access it. The type of such pointers includes the return type, the class which owns the member, the types of each argument and the const-ness of the function.

You cannot really convert pointers to member functions to anything else, and in particular not to void*, because they have a different sizeof.
If the function varies slightly in signature, you cannot convert from one to the other. For example, even converting from void (MyClass::*)(int) const to void (MyClass::*)(int) is not allowed. (You could do it with reinterpret_cast, but calling through the result would be undefined behaviour according to the standard.)

Pointers to member functions are not just like normal function pointers. A normal function pointer is just the address where the code of the function lies. But a pointer to a member function needs to store more information: member functions can be virtual, and there is also an offset to apply to the hidden this in case of multiple inheritance.
The sizeof of a pointer to a member function can even vary depending on the class. This is why we need to take special care when manipulating them.

Type Traits: QtPrivate::FunctionPointer

Let me introduce you to the QtPrivate::FunctionPointer type trait.
A trait is basically a helper class that gives meta data about a given type. Another example of trait in Qt is QTypeInfo.

What we will need to know in order to implement the new syntax is information about a function pointer.

The template<typename T> struct FunctionPointer will give us information about T via its member.

  • ArgumentCount: An integer representing the number of arguments of the function.
  • Object: Exists only for pointer to member function. It is a typedef to the class of which the function is a member.
  • Arguments: Represents the list of arguments. It is a typedef to a meta-programming list.
  • call(T &function, QObject *receiver, void **args): A static function that will call the function, applying the given parameters.

Qt still supports C++98 compilers, which means we unfortunately cannot require support for variadic templates. Therefore we had to specialize our trait for each number of arguments. We have four kinds of specializations: normal function pointer, pointer to member function, pointer to const member function, and functor. For each kind, we specialize for each number of arguments, up to six. We also have a specialization using variadic templates, so an arbitrary number of arguments is supported when the compiler allows it.

The implementation of FunctionPointer lies in qobjectdefs_impl.h.

QObject::connect

The implementation relies on a lot of template code. I am not going to explain all of it.

Here is the code of the first new overload from qobject.h:

template <typename Func1, typename Func2>
static inline QMetaObject::Connection connect(
    const typename QtPrivate::FunctionPointer<Func1>::Object *sender, Func1 signal,
    const typename QtPrivate::FunctionPointer<Func2>::Object *receiver, Func2 slot,
    Qt::ConnectionType type = Qt::AutoConnection)
{
  typedef QtPrivate::FunctionPointer<Func1> SignalType;
  typedef QtPrivate::FunctionPointer<Func2> SlotType;

  //compilation error if the arguments does not match.
  Q_STATIC_ASSERT_X(int(SignalType::ArgumentCount) >= int(SlotType::ArgumentCount),
                    "The slot requires more arguments than the signal provides.");
  Q_STATIC_ASSERT_X((QtPrivate::CheckCompatibleArguments<typename SignalType::Arguments,
                                                         typename SlotType::Arguments>::value),
                    "Signal and slot arguments are not compatible.");
  Q_STATIC_ASSERT_X((QtPrivate::AreArgumentsCompatible<typename SlotType::ReturnType,
                                                       typename SignalType::ReturnType>::value),
                    "Return type of the slot is not compatible with the return type of the signal.");

  const int *types;
  /* ... Skipped initialization of types, used for QueuedConnection ...*/

  QtPrivate::QSlotObjectBase *slotObj = new QtPrivate::QSlotObject<Func2,
        typename QtPrivate::List_Left<typename SignalType::Arguments, SlotType::ArgumentCount>::Value,
        typename SignalType::ReturnType>(slot);


  return connectImpl(sender, reinterpret_cast<void **>(&signal),
                     receiver, reinterpret_cast<void **>(&slot), slotObj,
                     type, types, &SignalType::Object::staticMetaObject);
}

You will notice in the function signature that sender and receiver are not just QObject* as the documentation suggests. They are pointers to typename FunctionPointer::Object instead. This uses SFINAE to enable this overload only for pointers to member functions, because Object only exists in FunctionPointer if the type is a pointer to member function.

We then start with a bunch of Q_STATIC_ASSERTs. They should generate sensible compilation error messages when the user makes a mistake. If the user did something wrong, it is important that he or she sees an error here and not in the soup of template code in the _impl.h files. We want to hide the underlying implementation from the user, who should not need to care about it.
That means that if you ever see a confusing error in the implementation details, it should be considered a bug and reported.

We then allocate a QSlotObject that is going to be passed to connectImpl(). The QSlotObject is a wrapper around the slot that will help in calling it. It also knows the types of the signal arguments, so it can do the proper type conversion.
We use List_Left to pass only as many arguments as the slot accepts, which allows connecting a signal with many arguments to a slot with fewer arguments.

QObject::connectImpl is the private internal function that will perform the connection. It is similar to the original syntax, the difference is that instead of storing a method index in the QObjectPrivate::Connection structure, we store a pointer to the QSlotObjectBase.

The reason why we pass &slot as a void** is only to be able to compare it if the type is Qt::UniqueConnection.

We also pass &signal as a void**. It is a pointer to the member function pointer. (Yes, a pointer to the pointer.)

Signal Index

We need to make a relationship between the signal pointer and the signal index.
We use MOC for that. Yes, that means this new syntax is still using the MOC and that there are no plans to get rid of it :-).

MOC will generate code in qt_static_metacall that compares the parameter and returns the right index. connectImpl will call the qt_static_metacall function with the pointer to the function pointer.

void Counter::qt_static_metacall(QObject *_o, QMetaObject::Call _c, int _id, void **_a)
{
    if (_c == QMetaObject::InvokeMetaMethod) {
        /* .... skipped ....*/
    } else if (_c == QMetaObject::IndexOfMethod) {
        int *result = reinterpret_cast<int *>(_a[0]);
        void **func = reinterpret_cast<void **>(_a[1]);
        {
            typedef void (Counter::*_t)(int );
            if (*reinterpret_cast<_t *>(func) == static_cast<_t>(&Counter::valueChanged)) {
                *result = 0;
            }
        }
        {
            typedef QString (Counter::*_t)(const QString & );
            if (*reinterpret_cast<_t *>(func) == static_cast<_t>(&Counter::someOtherSignal)) {
                *result = 1;
            }
        }
        {
            typedef void (Counter::*_t)();
            if (*reinterpret_cast<_t *>(func) == static_cast<_t>(&Counter::anotherSignal)) {
                *result = 2;
            }
        }
    }
}

Once we have the signal index, we can proceed like in the other syntax.

The QSlotObjectBase

QSlotObjectBase is the object passed to connectImpl that represents the slot.

Before showing the real code, this is what QObject::QSlotObjectBase was in Qt5 alpha:

struct QSlotObjectBase {
    QAtomicInt ref;
    QSlotObjectBase() : ref(1) {}
    virtual ~QSlotObjectBase();
    virtual void call(QObject *receiver, void **a) = 0;
    virtual bool compare(void **) { return false; }
};

It is basically an interface that is meant to be re-implemented by template classes implementing the call and comparison of the function pointers.

It is re-implemented by one of the QSlotObject, QStaticSlotObject or QFunctorSlotObject template class.

Fake Virtual Table

The problem with that is that each instantiation of those objects would need its own virtual table, which contains not only pointers to the virtual functions but also a lot of information we do not need, such as RTTI. That would result in a lot of superfluous data and relocations in the binaries.

In order to avoid that, QSlotObjectBase was changed not to be a C++ polymorphic class. Virtual functions are emulated by hand.

class QSlotObjectBase {
  QAtomicInt m_ref;
  typedef void (*ImplFn)(int which, QSlotObjectBase* this_,
                         QObject *receiver, void **args, bool *ret);
  const ImplFn m_impl;
protected:
  enum Operation { Destroy, Call, Compare };
public:
  explicit QSlotObjectBase(ImplFn fn) : m_ref(1), m_impl(fn) {}
  inline int ref() Q_DECL_NOTHROW { return m_ref.ref(); }
  inline void destroyIfLastRef() Q_DECL_NOTHROW {
    if (!m_ref.deref()) m_impl(Destroy, this, 0, 0, 0);
  }

  inline bool compare(void **a) { bool ret; m_impl(Compare, this, 0, a, &ret); return ret; }
  inline void call(QObject *r, void **a) {  m_impl(Call,    this, r, a, 0); }
};

The m_impl is a (normal) function pointer which performs the three operations that were previously virtual functions. The "re-implementations" set it to their own implementation in the constructor.

Please do not go and replace all the virtual functions in your code with such a hack just because you read here that it was done. It is only done in this case because almost every call to connect generates a new, different type (since the QSlotObject has template parameters which depend on the signature of the signal and the slot).

Protected, Public, or Private Signals.

Signals were protected in Qt 4 and before. It was a design choice: signals should be emitted by the object when it changes its state. They should not be emitted from outside the object, and calling a signal on another object is almost always a bad idea.

However, with the new syntax, you need to be able to take the address of the signal from the point where you make the connection. The compiler will only let you do that if you have access to that signal. Writing &Counter::valueChanged would generate a compiler error if the signal was not public.

In Qt 5 we had to change signals from protected to public. This is unfortunate since this means anyone can emit the signals. We found no way around it. We tried a trick with the emit keyword. We tried returning a special value. But nothing worked. I believe that the advantages of the new syntax outweigh the problem that signals are now public.

Sometimes it is even desirable to have the signal private. This is the case for example in QAbstractItemModel, where otherwise developers tend to emit signals from the derived class, which is not what the API wants. There used to be a pre-processor trick that made signals private, but it broke the new connection syntax.
A new hack has been introduced. QPrivateSignal is a dummy (empty) struct declared private in the Q_OBJECT macro. It can be used as the last parameter of the signal. Because it is private, only the object has the right to construct it for calling the signal. MOC will ignore the QPrivateSignal last argument while generating signature information. See qabstractitemmodel.h for an example.

More Template Code

The rest of the code is in qobjectdefs_impl.h and qobject_impl.h. It is mostly standard dull template code.

I will not go into much more detail in this article, but I will go over a few items that are worth mentioning.

Meta-Programming List

As pointed out earlier, FunctionPointer::Arguments is a list of the arguments. The code needs to operate on that list: iterate over each element, take only a part of it or select a given item.

That is why there is QtPrivate::List, which can represent a list of types. Some helpers to operate on it are QtPrivate::List_Select and QtPrivate::List_Left, which respectively give the N-th element in the list and a sub-list containing the first N elements.

The implementation of List is different for compilers that support variadic templates and compilers that do not.

With variadic templates, it is a template<typename... T> struct List;. The list of arguments is just encapsulated in the template parameters.
For example: the type of a list containing the arguments (int, QString, QObject*) would simply be:

List<int, QString, QObject *>

Without variadic templates, it is a LISP-style list: template<typename Head, typename Tail > struct List; where Tail can be either another List or void for the end of the list.
The same example as before would be:

List<int, List<QString, List<QObject *, void> > >

ApplyReturnValue Trick

In the function FunctionPointer::call, args[0] is meant to receive the return value of the slot. If the signal returns a value, it is a pointer to an object of the return type of the signal; otherwise it is 0. If the slot returns a value, we need to copy it into args[0]. If it returns void, we do nothing.

The problem is that it is not syntactically correct to use the return value of a function that returns void. Should I then duplicate the already huge amount of template code, once for the void return type and once for non-void? No, thanks to the comma operator.

In C++ you can do something like that:

functionThatReturnsVoid(), somethingElse();

You could have replaced the comma by a semicolon and everything would have been fine.

Where it becomes interesting is when you call it with something that is not void:

functionThatReturnsInt(), somethingElse();

There, the comma will actually call an operator that you can even overload. That is what we do in qobjectdefs_impl.h:

template <typename T>
struct ApplyReturnValue {
    void *data;
    ApplyReturnValue(void *data_) : data(data_) {}
};

template<typename T, typename U>
void operator,(const T &value, const ApplyReturnValue<U> &container) {
    if (container.data)
        *reinterpret_cast<U*>(container.data) = value;
}
template<typename T>
void operator,(T, const ApplyReturnValue<void> &) {}

ApplyReturnValue is just a wrapper around a void*. Then it can be used in each helper. This is for example the case of a functor without arguments:

  static void call(Function &f, void *, void **arg) {
      f(), ApplyReturnValue<SignalReturnType>(arg[0]);
  }

This code is inlined, so it will not cost anything at run-time.

Conclusion

This is it for this blog post. There is still a lot to talk about (I have not even mentioned QueuedConnection or thread safety yet), but I hope you found this interesting and that you learned something that might help you as a programmer.

You were not doing so wrong.


This post is about the use of QThread. It is an answer to a three-year-old blog post by Brad, my colleague at the time:
You're doing it wrong

In his blog post, Brad explains that he saw many users misusing QThread by sub-classing it, adding some slots to that subclass and doing something like this in the constructor:

 moveToThread(this);

They move a thread to itself. As Brad mentions, it is wrong: the QThread is supposed to be the interface to manage the thread. So it is supposed to be used from the creating thread.

Slots in the QThread object are then not run in that thread and having slots in a subclass of QThread is a bad practice.

But then Brad continues and discourages any sub-classing of QThread at all. He claims it is against proper object-oriented design. This is where I disagree. Putting code in run() is a valid object-oriented way to extend a QThread: A QThread represents a thread that just starts an event loop, a subclass represents a thread that is extended to do what's in run().

After Brad's post, some members of the community went on a crusade against sub-classing QThread. The problem is that there are many perfectly valid reasons to subclass QThread.

With Qt 5.0 and Qt 4.8.4, the documentation of QThread was changed so the sample code does not involve sub-classing. Look at the first code sample of the Qt 4.8 QThread documentation. It has many lines of boilerplate just to run some code in a thread. And there is even a leak: the QThread is never going to quit and be destroyed.

On IRC, a user who had followed that example in order to run some simple code in a thread asked me a question. He had a hard time figuring out how to properly destroy the thread. That is what motivated me to write this blog entry.

If you allow yourself to subclass QThread, this is what you get:

class WorkerThread : public QThread {
    void run() {
        // ...
    }
};

void MyObject::startWorkInAThread()
{
    WorkerThread *workerThread = new WorkerThread;
    connect(workerThread, SIGNAL(finished()),
            workerThread, SLOT(deleteLater()));
    workerThread->start();
}

This code no longer leaks, is much simpler, and has less overhead than the previous one, as it does not create useless objects.

The Qt threading example threadedfortuneserver uses this pattern to run blocking operations and is much simpler than the equivalent using a worker object.

I have submitted a patch to the documentation to not discourage sub-classing QThread anymore.

Rules of thumb

When to subclass and when not to?

  • If you do not really need an event loop in the thread, you should subclass.
  • If you need an event loop and handle signals and slots within the thread, you may not need to subclass.

What about using QtConcurrent instead?

QThread is quite low level, and you may be better off using a higher-level API such as QtConcurrent.

Now, QtConcurrent has its own set of problems: it is tied to a single thread pool, so it is not a good solution if you want to run blocking operations. It also has some problems in its implementation that cause some performance overhead. All of this is fixable; perhaps Qt 5.1 will even see some improvements.

A good alternative is also the C++11 standard library, with std::thread and std::async, which are now the standard ways to run code in a thread. And the good news is that they still work fine with Qt: all the other Qt threading primitives can be used with native threads. (Qt will automatically create a QThread if required.)

Property Bindings and Declarative Syntax in C++


QtQuick and QML form a really nice language to develop user interfaces. The QML Bindings are very productive and convenient. The declarative syntax is really a pleasure to work with.
Would it be possible to do the same in C++? In this blog post, I will show a working implementation of property bindings in pure C++.

Disclaimer: This was done for the fun of it and is not made for production.

If you read this article via RSS, you may want to open it at its original URL to see properly formatted code.

Bindings

The goal of bindings is to have one property which depends on other properties. When its dependencies are changed, the property is automatically updated.

Here is an example inspired from the QML documentation.

int calculateArea(int width, int height) {
  return (width * height) * 0.5;
}

struct rectangle {
  property<rectangle*> parent = nullptr;
  property<int> width = 150;
  property<int> height = 75;
  property<int> area = [&]{ return calculateArea(width, height); };

  property<std::string> color = [&]{
    if (parent() && area > parent()->area)
      return std::string("blue");
    else
      return std::string("red");
  };
};

If you are not familiar with the [&]{ ... } syntax, this is a lambda function. I'm also using the fact that in C++11, you can initialize the members directly in the declaration.

Now, we'll see how this property class works. At the end I will show a cool demo of what you can do.

The code is using lots of C++11 constructs. It has been tested with GCC 4.7 and Clang 3.2.

Property

I have used my knowledge from QML and the QObject system to build something similar with C++ bindings.
The goal is to make a proof of concept. It is not optimized. I just wanted to have comprehensible code for this demo.

The idea behind the property class is the same as in QML. Each property keeps a list of its dependencies. When a binding is evaluated, all access to the property will be recorded as dependencies.

property<T> is a template class. The common part is put in a base class: property_base.

class property_base
{
  /* Set of properties which are subscribed to this one.
     When this property is changed, subscriptions are refreshed */
  std::unordered_set<property_base *> subscribers;

  /* Set of properties this property is depending on. */
  std::unordered_set<property_base *> dependencies;

public:
  virtual ~property_base()
  { clearSubscribers(); clearDependencies(); }

  // re-evaluate this property
  virtual void evaluate() = 0;
   
  // [...]
protected:
  /* This function is called by the derived class when the property has changed
     The default implementation re-evaluates all the property subscribed to this one. */
  virtual void notify() {
    auto copy = subscribers;
    for (property_base *p : copy) {
      p->evaluate();
    }
  }

  /* Derived class call this function whenever this property is accessed.
     It register the dependencies. */
  void accessed() {
    if (current && current != this) {
      subscribers.insert(current);
      current->dependencies.insert(this);
    }
  }

  void clearSubscribers() {
      for (property_base *p : subscribers)
          p->dependencies.erase(this);
      subscribers.clear();
  }
  void clearDependencies() {
      for (property_base *p : dependencies)
          p->subscribers.erase(this);
      dependencies.clear();
  }

  /* Helper class that is used on the stack to set the current property being evaluated */
  struct evaluation_scope {
    evaluation_scope(property_base *prop) : previous(current) {
      current = prop;
    }
    ~evaluation_scope() { current = previous; }
    property_base *previous;
  };
private:
  friend struct evaluation_scope;
  /* thread_local */ static property_base *current;
};

Then we have the implementation of the class property.

template <typename T>
struct property : property_base {
  typedef std::function<T()> binding_t;

  property() = default;
  property(const T &t) : value(t) {}
  property(const binding_t &b) : binding(b) { evaluate(); }

  void operator=(const T &t) {
      value = t;
      clearDependencies();
      notify();
  }
  void operator=(const binding_t &b) {
      binding = b;
      evaluate();
  }

  const T &get() const {
    const_cast<property*>(this)->accessed();
    return value;
  }

  //automatic conversions
  const T &operator()() const { return get();  }
  operator const T&() const { return get(); }

  void evaluate() override {
    if (binding) {
      clearDependencies();
      evaluation_scope scope(this);
      value = binding();
    }
    notify();
  }

protected:
  T value;
  binding_t binding;
};

property_hook

It is also desirable to be notified when a property is changed, so we can for example call update(). The property_hook class lets you specify a function which will be called when the property changes.

Qt bindings

Now that we have the property class, we can build everything on top of that. We could build for example a set of widgets and use those. I'm going to use Qt Widgets for that. If the QtQuick elements had a C++ API, I could have used those instead.

The property_qobject

I introduce a property_qobject which is basically wrapping a property in a QObject. You initialize it by passing a pointer to the QObject and the string of the property you want to track, and voilà.

The implementation is not efficient; it could be optimized by sharing the QObject rather than having one for each property. With Qt 5, I could also connect to a lambda instead of doing this hack, but I used Qt 4.8 here.

Wrappers

Then I create a wrapper around each class I'm going to use that exposes its properties as property_qobject members.

A Demo

Now let's see what we are capable of doing:

This small demo just has a line edit which lets you specify a color and few sliders to change the rotation and the opacity of a graphics item.

Let the code speak for itself.

We need a Rectangle object with the proper bindings:

struct GraphicsRectObject : QGraphicsWidget {
  // bind the QObject properties.
  property_qobject<QRectF> geometry { this, "geometry" };
  property_qobject<qreal> opacity { this, "opacity" };
  property_qobject<qreal> rotation { this, "rotation" };

  // add a color property, with a hook to update when it changes
  property_hook<QColor> color { [this]{ this->update(); } };
private:
  void paint(QPainter* painter, const QStyleOptionGraphicsItem* option, QWidget*) override {
    painter->setBrush(color());
    painter->drawRect(boundingRect());
  }
};

Then we can proceed and declare a window object with all the subwidgets:

struct MyWindow : Widget {
  LineEdit colorEdit {this};

  Slider rotationSlider {Qt::Horizontal, this};
  Slider opacitySlider {Qt::Horizontal, this};

  QGraphicsScene scene;
  GraphicsView view {&scene, this};
  GraphicsRectObject rectangle;

  ::property<int> margin {10};

  MyWindow() {
    // Layout the items.  Not really as good as real layouts, but it demonstrates bindings
    colorEdit.geometry = [&]{ return QRect(margin, margin,
                                             geometry().width() - 2*margin,
                                             colorEdit.sizeHint().height()); };
    rotationSlider.geometry = [&]{ return QRect(margin,
                                                  colorEdit.geometry().bottom() + margin,
                                                  geometry().width() - 2*margin,
                                                  rotationSlider.sizeHint().height()); };
    opacitySlider.geometry = [&]{ return QRect(margin,
                                                 rotationSlider.geometry().bottom() + margin,
                                                 geometry().width() - 2*margin,
                                                 opacitySlider.sizeHint().height()); };
    view.geometry = [&]{
        int x = opacitySlider.geometry().bottom() + margin;
        return QRect(margin, x, width() - 2*margin, geometry().height() - x - margin); 
    };

    // Some proper default value
    colorEdit.text = QString("blue");
    rotationSlider.minimum = -180;
    rotationSlider.maximum = 180;
    opacitySlider.minimum = 0;
    opacitySlider.maximum = 100;
    opacitySlider.value = 100;

    scene.addItem(&rectangle);

    // now the 'cool' bindings
    rectangle.color = [&]{ return QColor(colorEdit.text);  };
    rectangle.opacity = [&]{ return qreal(opacitySlider.value/100.); };
    rectangle.rotation = [&]{ return rotationSlider.value(); };
  }
};

int main(int argc, char **argv)
{
    QApplication app(argc,argv);
    MyWindow window;
    window.show();
    return app.exec();
}

Conclusion

You can clone the code repository and try it for yourself.

Perhaps one day, a library will provide such property bindings.

Data initialization in C++


In this blog post, I am going to review the different kinds of data and how they are initialized in a program.

What I am going to explain here is valid for Linux and GCC.

Code Example

I'll just start by showing a small piece of code. What is going to interest us is where the data will end up in memory and how it is initialized.

const char string_data[] = "hello world"; // .rodata
const int even_numbers[] = { 0*2 , 1*2,  2*2,  3*2, 4*2}; //.rodata

int all_numbers[] = { 0, 1, 2, 3, 4 };  //.data

static inline int odd(int n) { return n*2 + 1; }
const int odd_numbers[] = { odd(0), odd(1), odd(2), odd(3), odd(4) }; //initialized

QString qstring_data("hello QString"); //object with constructor and destructor

I'll analyze the assembly. It has been generated with the following command, then re-formatted for better presentation in this blog post.

g++ -O2 -S data.cpp

(I also had to add a function that uses the data, to prevent the compiler from removing the arrays that were not otherwise used.)

The sections

On Linux, binaries (programs or libraries) are stored as files in the ELF format. Those files are composed of many sections. I'll just go over a few of them:

The code: .text

This section is the actual code of your library or program; it contains all the instructions for each function. That part of the code is mapped into memory and shared between the instances of the processes that use it (provided the library is compiled as position independent, which is usually the case).

I am not interested in the code in this blog post, let us move to the data sections.

The read-only data: .rodata

This section will be loaded the same way as the .text section is loaded. It will also be shared between processes.

It contains the arrays that are marked as const such as string_data and even_numbers.

.section    .rodata
_ZL11string_data:
    .string "hello world"
_ZL12even_numbers:
    .long   0
    .long   2
    .long   4
    .long   6
    .long   8

You can see that even if the even_numbers array was initialized with multiplications, the compiler was able to optimize and generate the array at compile time.

The _ZL11 part of the name comes from name mangling: the L marks internal linkage (which const implies for a namespace-scope variable) and 11 is the length of the identifier.

Writable data: .data

The data section contains the pre-initialized data that is not read-only.
This section is not shared between processes but copied for each process instance that uses it. (Actually, with the copy-on-write optimization in the kernel, it only needs to be copied if the data changes.)

There goes our all_numbers array, which has not been declared as const.

.data
all_numbers:
    .long   0
    .long   1
    .long   2
    .long   3
    .long   4

Initialized at run-time: .bss + .ctors

The compiler was not able to optimize the calls to odd(), it has to be computed at run-time. Where will our odd_numbers array be stored?

What will happen is that it will not be stored in the binary; instead, some space is reserved in the .bss section. That section is just some memory allocated to each process, initialized to 0.

The binary also contains a section with code that is going to be executed before main() is being called.

.section    .text.startup
_GLOBAL__sub_I_odd_numbers:
    movl    $1, _ZL11odd_numbers(%rip)
    movl    $3, _ZL11odd_numbers+4(%rip)
    movl    $5, _ZL11odd_numbers+8(%rip)
    movl    $7, _ZL11odd_numbers+12(%rip)
    movl    $9, _ZL11odd_numbers+16(%rip)
    ret

.section    .ctors,"aw",@progbits
    .quad   _GLOBAL__sub_I_odd_numbers

.local  _ZL11odd_numbers  ; reserve 20 bytes in the .bss section
    .comm   _ZL11odd_numbers,20,16

The .ctors section contains a table of pointers to functions that are going to be called by the loader before it calls main(). In our case, there is only one: the code that initializes the odd_numbers array.

Global Object

How about our QString? It is a global C++ object with a constructor and destructor. It is simply initialized by running the constructor at start-up.

.section    .rodata.str1.1,"aMS",@progbits,1
.LC0:
    .string "hello QString"

.section    .text.startup,"ax",@progbits
_GLOBAL__sub_I_qstring_data:
       ; QString constructor (inlined)
    movl    $-1, %esi
    movl    $.LC0, %edi
    call    _ZN7QString16fromAscii_helperEPKci
    movq    %rax, _ZL12qstring_data(%rip)
       ; register the destructor
    movl    $__dso_handle, %edx
    movl    $_ZL12qstring_data, %esi
    movl    $_ZN7QStringD1Ev, %edi
    jmp __cxa_atexit   ; (tail call)

Here is the code of the constructor, which has been inlined.

We can also see that the code calls the function __cxa_atexit with the parameters $_ZL12qstring_data and $_ZN7QStringD1Ev, which are respectively the address of the QString object and a function pointer to the QString destructor. In other words, this code registers the destructor of QString to be run on exit.
The third parameter $__dso_handle is a handle to this dynamic shared object (used to run the destructor when a plugin is unloaded for example).

What is the problem with global objects with constructor?

  • The order in which the constructors are called is not specified by the C++ standard. If you have dependencies between your global objects, you will run into trouble.
  • All the constructors of all the globals in all the libraries need to be run before main() and slow down the startup of the application (even for objects that will never be used).

This is why it is not recommended to have global objects in libraries. Instead, one can use function static objects, which are initialized on first use. (Qt provides a macro for that: Q_GLOBAL_STATIC, which is made public in Qt 5.1.)

Here comes C++11

C++11 comes with a new feature: constexpr

That keyword can be used in two ways: if you specify that a function is constexpr, it means that the function can be evaluated at compile time.
If you specify that a variable is constexpr, it means it can be computed at compile time.

Let us slightly modify the example above and see what it does:

static inline constexpr int odd(int n) { return n*2 + 1; }
constexpr int odd_numbers[] = { odd(0), odd(1), odd(2), odd(3), odd(4) };

Two constexpr were added.

.section    .rodata
_ZL11odd_numbers:
    .long   1
    .long   3
    .long   5
    .long   7
    .long   9

Now they are generated at compile time.

If a class has a constructor that is declared as constexpr and has no destructor, you can have this as global object and it will be initialized at compile time.

Since Qt 4.8, there is a macro Q_DECL_CONSTEXPR which expands to constexpr if the compiler supports it, or to nothing otherwise.

Proof Of Concept: Re-implementing Qt moc using libclang


I have been trying to re-write Qt's moc using libclang from the LLVM project.

The result is moc-ng. It is really two different things:

  1. A plugin for clang to be used when compiling your code with clang;
  2. and an executable that can be used as a drop-in replacement for moc.

What is moc again?

moc is a developer tool which is part of the Qt library. Its role is to handle Qt's extensions within the C++ code to offer introspection and enable Qt's signals and slots.

What are clang and libclang?

clang is the C and C++ frontend to the LLVM compiler. It is not only a compiler though, it also contains a library (libclang) which helps to write a C++ parser.

Motivation

moc is implemented using a custom, naive C++ parser which does just enough to extract the right information from your source files. The limitation is that it can sometimes choke on more complex C++ code, and it is not compatible with some of the features provided by the new versions of the C++ standard (such as C++11 trailing return types or advanced template argument types).

Using clang as a frontend gives it a perfect parser that can handle even the most complicated constructs allowed by C++.

Having it as a plugin for clang would also allow passing meta-data directly to LLVM without going through the generated code, enabling things that would not be possible with generated code, such as having Q_OBJECT in a function-local class. (That is not yet implemented.)

Expressive Diagnostics

Clang also has a very good diagnostics framework, which allows better error analysis.
Compare the error from moc:

With moc-ng

See how I used clang's look-up system to check the existence of the identifiers and suggest typo corrections, while moc ignores such errors and you get a weird error in the generated code.

Meet moc-ng

moc-ng is my proof of concept attempt of re-implementing the moc using clang as a frontend. It is not officially supported by the Qt-project.

It is currently in alpha state, but is already working very well. I was able to replace moc and compile many modules of qt5, including qtbase, qtdeclarative and qt-creator.

All the Qt tests that I ran passed or had an expected failure (for example tst_moc is parsing moc's error output, which has now changed)

Compatibility with the official moc

I have tried as much as possible to stay compatible with the real moc. But there are some differences to be aware of.

Q_MOC_RUN

There is a Q_MOC_RUN macro that is defined when the original moc is run. It is typically used to hide from moc some complicated C++ constructs it would otherwise choke on. Because we need to see the full C++ like a normal compiler, we do not define this. This may be a problem when signals, slots, or other Qt meta things are only defined when Q_MOC_RUN is set.

Missing or not Self-Contained Headers

The official moc ignores any headers that are not found. So if include paths are not passed to moc, it won't complain. Also, the moc parser does not care if types have not been declared, and it won't report any of those errors.

moc-ng has a stricter C++ parser that requires a self-contained header. Fortunately, clang falls back gracefully when there are errors, and I managed to turn all the errors into warnings. So when parsing a header that is not self-contained, or if the include flags are wrong, one gets lots of warnings from moc.

Implementation details and Challenges

I am now going to go over some implementation details and challenges I encountered.

I used the C++ clang tooling API directly, instead of libclang's C wrapper, even though the C++ API does not maintain source compatibility. The reasons are that the C++ API is much more complete, and that I want to use C++. I did not want to write a C++ wrapper around a C wrapper around the C++ clang API.
In my experience with the code browser (which also uses the C++ API directly), there are not so many API changes, and keeping compatibility is not that hard.

Annotations

The clang libraries parse the C++ and give us the AST. From that AST, one can list all the classes and their methods in a translation unit. It has all the information you can find in the code, with the location of each declaration.

But the pre-processor removed all the special macros like signals or slots. I needed a way to know which methods are tagged with special Qt keywords.
At first, I thought I would use pre-processor hooks to remember the locations where those special macros are expanded. That could have worked. But there is a better way. I got the idea from the qt-creator wip/clang branch, which tries to use clang as a code model. They use the attribute extension to annotate the methods. Annotations are meant exactly for this use case: annotate the source code with non-standard extensions so a plugin can act upon them. And the good news is that they can be placed exactly where the signals or slots keyword can be placed.

#define Q_SIGNAL  __attribute__((annotate("qt_signal")))
#define Q_SLOT    __attribute__((annotate("qt_slot")))
#define Q_INVOKABLE  __attribute__((annotate("qt_invokable")))

#define signals    public Q_SIGNAL
#define slots      Q_SLOT

We do the same for all the other macros that annotate methods. But we still need to find something for the macros that annotate classes: Q_OBJECT, Q_PROPERTY, Q_ENUMS.
Those were a bit more tricky. The solution I found is to use a static_assert with a given pattern. However, static_assert is C++11 only and I want this to work without C++11 enabled. Fortunately, clang accepts C11's _Static_assert as an extension in all modes. Using this trick, I can walk the AST to find the specific static_assert that matches the pattern and get the content within a string literal.

#define QT_ANNOTATE_CLASS(type, annotation)  \
    __extension__ _Static_assert(sizeof (#annotation), #type);

#define Q_ENUMS(x) QT_ANNOTATE_CLASS(qt_enums, x)
#define Q_FLAGS(x) QT_ANNOTATE_CLASS(qt_flags, x)

#define Q_OBJECT   QT_ANNOTATE_CLASS(qt_qobject, "") \
        /*... other Q_OBJECT declarations ... */

We just have to replace the Qt macros with our macros. I do that by injecting code right when we exit qobjectdefs.h, which defines all the Qt macros.

Tags

QMetaMethod::tag allows the programmer to leave a tag for some extension in front of a method. It is not widely used. To my knowledge, only QtDBus relies on this feature, for Q_NOREPLY.

The problem is that this relies on macros that are defined only if Q_MOC_RUN is not defined. So I had to hook into the pre-processor to see when we are defining macros in places that are conditioned on Q_MOC_RUN. I can do that because the pre-processor callback has hooks on #if and #endif, so I can see if we are currently handling a block of code that would be hidden from the moc. When one defines a macro there, I register it as a possible tag. Later, when such a macro is expanded, I register its location. For each method, I can then query whether there was a tag on the same line. There are many cases where this would fail. But fortunately, tags are not a commonly used feature, and the simple cases are covered.

Suppressing The Errors

As stated, the Qt moc ignores most errors. I already tell clang not to parse the bodies of functions. But you may still get errors if types used in declarations are not found. When moc-ng is run as a binary, it is desirable not to abort on those errors, for compatibility with moc. I did not find an easy way to change errors into warnings. You can promote some warnings into errors or change fatal errors into normal errors, but you cannot easily suppress errors or turn them into warnings.

What I did is create my own diagnostic consumer, which proxies the errors to the default one, but turns some of them into warnings. The problem is that clang would still count them as errors. So the hack I did was to reset the error count. I wish there was a better way.

When used as a plugin, there is only one kind of error that should be ignored: an include of a "foo.moc" file, which will not exist because moc was not run. Fortunately, clang has a callback for when an include file has not been found. If it looks like a file that should have been generated by moc (starting with moc_ or ending with .moc), that include can be ignored.

Qt's Binary JSON

Since Qt 5, there is a macro Q_PLUGIN_METADATA which you can use to load a JSON file; moc then embeds this JSON in the binary proprietary format used internally by QJsonDocument.

I did not want to depend on Qt (to avoid the bootstrapping issue). Fortunately, LLVM already has a good YAML parser (YAML is a super-set of JSON), so parsing was not a problem at all. The problem was generating Qt's binary format. I spent too much time trying to figure out why Qt would not accept my binary before noticing that QJsonDocument enforces alignment constraints on some items. Bummer.

Error Reporting within String Literal

When parsing the contents of things like Q_PROPERTY, I wish to report errors at their location in the source code. Using the macro described earlier, the content of Q_PROPERTY is turned into a string literal. Clang supports reporting errors within string literals in macros. As you can see on the screenshot, this works pretty well.

But there are still two levels of indirection I would like to hide. It would be nice to hide some builtin macros from the diagnostic (I've hidden one level in the screenshot).
Also, I want to be able to report the location in the Q_PROPERTY line and not in the scratch space. But when using # in a macro, clang does not track the exact spelling location anymore.

Consider compiling this snippet with clang: it should warn you about the escape sequences \o, \p and \q not being valid. Look where the caret is for each warning:

#define M(A, B)  A "\p" #B;
char foo[] = M("\o",   \q );

For \o and \p, clang puts the caret at the right place when the macro is expanded. But for \q, the caret is not put at its spelling location.

The way clang tracks the real origin of a source location is very clever and efficient. Each source location is represented by a clang::SourceLocation, which is basically a 32-bit integer. The source location space is divided into consecutive entries that represent files or macro expansions. Each time a macro is expanded, a new macro expansion entry is added, containing the source location of the expansion and the location of the #define. In principle, there could be a new entry for each expanded token, but consecutive entries are merged.
One cannot do the same for stringified tokens, because the string literal is only one token but comes from possibly many tokens. There are also some escaping rules to take into account that make it harder.

The way to do it is probably to leave the source locations as they are, but to special-case the scratch space when trying to find out the location of the caret.

Built-in includes

Some headers required by the standard library are not located in a standard location, but are shipped with clang and looked up in ../lib/clang/3.2/include relative to the binary.
I don't want to require external files. I would like to have a simple single static binary without dependencies.

The solution is to bundle those headers within the binary. I have nothing like qrc resources, but I can do the same in a few lines of CMake:

file(GLOB BUILTINS_HEADERS "${LLVM_BIN_DIR}/../lib/clang/${LLVM_VERSION}/include/*.h")
foreach(BUILTIN_HEADER ${BUILTINS_HEADERS})
    file(READ ${BUILTIN_HEADER} BINARY_DATA HEX)
    string(REGEX REPLACE "(..)" "\\\\x\\1" BINARY_DATA "${BINARY_DATA}")
    string(REPLACE "${LLVM_BIN_DIR}/../lib/clang/${LLVM_VERSION}/include/" 
                   "/builtins/" FN "${BUILTIN_HEADER}")
    set(EMBEDDED_DATA "${EMBEDDED_DATA} { \"${FN}\" , \"${BINARY_DATA}\" } , ")
endforeach()
configure_file(embedded_includes.h.in embedded_includes.h)

This just goes over all the *.h files in the builtin include directory and reads each of them into a hex string. The regexp transforms that into something suitable for a C++ string literal. Then configure_file will replace @EMBEDDED_DATA@ with its value.
Here is what embedded_includes.h.in looks like:

static struct { const char *filename; const char *data; } EmbeddedFiles[] = {
    @EMBEDDED_DATA@
    {0, 0}
};

Conclusion

moc-ng was a fun project, just like developing our C/C++ code browser. The clang/llvm frameworks are really powerful and nice to work with.

Please have a look at the moc-ng project on GitHub or browse the source online.

Can Qt's moc be replaced by C++ reflection?


The Qt toolkit has often been criticized for extending C++ and requiring a non-standard code generator (moc) to provide introspection.
Now, the C++ standardization committee is looking at how to extend C++ with introspection and reflection. As the current maintainer of Qt's moc, I thought I could write a bit about the needs of Qt, and even experiment a little.

In this blog post, I will comment on the current proposal draft, and try to analyze what one would need to be able to get rid of moc.

If you read this article from the RSS feed or a planet, you may want to open it at its original URL to see properly formatted code.

Current draft proposal

Here is the draft proposal: N3951: C++ type reflection via variadic template expansion. It is a clever way to add compile time introspection to C++. It gives new meaning to typedef and typename such that it would work like this:

/* Given a simple class */
class SomeClass { 
public:
  int foo();
  void bar(int x);
};

#if 0
/* The new typename<>... and typedef<>... 'operators' : */
  vector<string> names = { typename<SomeClass>... } ;
  auto members = std::make_tuple(typedef<SomeClass>...) ;
#else
/* Would be expanded to something equivalent to: */
  vector<string> names =  { "SomeClass",  "foo", "bar" };
  auto members = std::make_tuple(static_cast<SomeClass*>(nullptr), 
                               &SomeClass::foo, &SomeClass::bar);
#endif

We can use that to go over the members of a class at compile time and do things like generating a QMetaObject with a normal compiler.

With the help of some more traits, that is a very good start towards implementing moc features in pure C++.

The experiment

I have managed to re-implement most of the moc features, such as signals and slots and properties, using the proposal, without the need of moc. Of course, since no compiler supports the proposal yet, I have manually expanded the typedef... and typename... in the prototype.

The code does a lot of template tricks to handle strings and arrays at compile time, and generates a QMetaObject that is even binary compatible with the one generated by moc.

The code is available here.

About Qt and moc

Qt is a cross platform C++ toolkit specialized for developing applications with user interfaces. Qt code is purely standard C++ code, however it needs a code generator to provide introspection data: the Meta Object Compiler (moc). That little utility parses the C++ headers and generates additional C++ code that is compiled alongside the program. The generated code contains the implementations of the Qt signals, and builds the QMetaObject (which embeds string tables with the names of all methods and properties).

Historically, the first mission of the moc was to enable signals and slots using a nice syntax. It is also used for the property system. The first use of the properties was for the property editor in Qt designer, then it became used for integration with a scripting language (QtScript), and is now widely used to access C++ objects from QML.

(For an explanation of the inner working of the signals and slots, read one of my previous articles: How Qt signals and slots work.)

Generating the QMetaObject at compile time

We could ask the programmer to add a macro in the .cpp such as Q_OBJECT_IMPL(MyObject), which would be expanded to this code:

const QMetaObject MyObject::staticMetaObject = createMetaObject<MyObject>();
const QMetaObject *MyObject::metaObject() const { return &staticMetaObject; }

int MyObject::qt_metacall(QMetaObject::Call _c, int _id, void** _a) {
    return qt_metacall_impl<MyObject>(this, _c, _id, _a);
}
void MyObject::qt_static_metacall(QObject *_o, QMetaObject::Call _c, int _id, void** _a) {
    qt_static_metacall_impl<MyObject>(_o, _c, _id, _a);
}

The implementation of createMetaObject uses the reflection capabilities to find out all the slots, signals and properties in order to build the metaobject at compile time. The function qt_metacall_impl and qt_static_metacall_impl are generic implementations that use the same data to call the right function. Click on the function name if you are interested in the implementation.

Annotating signals and slots

We could perhaps use C++11 attributes for that. In that case, it would be convenient if attributes could be placed next to the access specifiers. (There is already a proposal to add group access specifiers, but it does not cover the attributes.)

class MyObject : public QObject {
    Q_OBJECT
public [[qt::slot]]:
    void fooBar();
    void otherSlot(int);
public [[qt::signal]]:
    void mySignal(int param);
public:
   enum [[qt::enum]] Foobar { Value1, Value2  };
};

Then we would need compile-time traits such as has_attribute<&MyObject::myFunction>("qt::signal").

Function traits

I just mentioned has_attribute. Another trait will be needed to determine whether a function is public, protected or private.
The proposal also mentions that we could use typename<&MyObject::myFunction>... to get the parameter names. We indeed need them, as they are used when you connect to a signal in QML to access the parameters.
Also, we are currently able to call a function without specifying all the parameters if there are default parameters. So we need to know the default parameters at compile time in order to create them at run time.

However, there is a problem with function traits in that form: non-type template arguments of function pointer type must name the function directly. (See this stackoverflow question.) This is best explained with code:

struct Obj { void func(); };
template<void (Obj::*)()> struct Trait {};
int main() {
    Trait<&Obj::func> t1;  //Ok.  The function is directly written

    constexpr auto var = &Obj::func;
    Trait<var> t2;  //Error:  var is not a function directly written.
}

But as we are introspecting, we get, at best, the functions in constexpr form. So this restriction would need to be removed.

The properties

We have not yet solved the Q_PROPERTY feature.

I'm afraid we will have to introduce a new macro because it is most likely not possible to keep the source compatibility with Q_PROPERTY. A way to do it would be to add static constexpr members of a recognizable type. For example, this is my prototype implementation:

template <typename Type, typename... T> struct QProperty : std::tuple<T...> {
    using std::tuple<T...>::tuple;
    using PropertyType = Type;
};
template <typename Type, typename... T> constexpr auto qt_makeProperty(T&& ...t)
{ return QProperty<Type, typename std::decay<T>::type...>{ std::forward<T>(t)... }; }

#define Q_PROPERTY2(TYPE, NAME, ...) static constexpr auto qt_property_##NAME = \
                                            qt_makeProperty<TYPE>(__VA_ARGS__);

To be used like this

    Q_PROPERTY2(int, foo, &MyObject::getFoo, &MyObject::setFoo)

We can find the properties by looking for the QProperty<...> members and removing the "qt_property_" part of the name. Then all the information about the getter, setter and others are available.

And if we want to keep the old Q_PROPERTY?

I was wondering if it would be possible to keep source compatibility using the same macro. I almost managed:

template<typename... Fs> struct QPropertyHolder { template<Fs... Types> struct Property {}; };
template<typename... Fs> QPropertyHolder<Fs...> qPropertyGenerator(Fs...);

#define WRITE , &ThisType::
#define READ , &ThisType::
#define NOTIFY , &ThisType::
#define MEMBER , &ThisType::

#define Q_PROPERTY(A) Q_PROPERTY_IMPL(A) /* expands the WRITE and READ macro */

#define Q_PROPERTY_IMPL(Prop, ...) static void qt_property_ ## __COUNTER__(\
    Prop, decltype(qPropertyGenerator(__VA_ARGS__))::Property<__VA_ARGS__>) = delete;

class MyPropObject : public QObject {
    Q_OBJECT
    typedef MyPropObject ThisType; // FIXME: how to do that automatically
                                   //        from within the Q_OBJECT macro?

signals: // would expand to public [[qt::signal]]:
    void fooChanged();
public:
    QString foo() const;
    void setFoo(const QString&);

    Q_PROPERTY(QString foo READ foo WRITE setFoo NOTIFY fooChanged)

};

This basically creates a function with two arguments. The name of the first argument is the name of the property, which we can get via reflection. Its type is the type of the property. The second argument is of the type QPropertyHolder<...>::Property<...>, which contains pointers to the member functions for the different attributes of the property. Introspection would allow us to dig into this type.
But the problem here is that it needs the ThisType typedef. It would be nice if there was something like decltype(*this) that worked in class scope outside any member function; then we could put this typedef within the Q_OBJECT macro.

Re-implementing the signals

This is going to be the big problem, as I have no idea how this could possibly be done. We need to generate the code of each signal. Something like this made-up syntax:

int signalId=0;
/* Somehow loop over all the signals to implement them  (made up syntax) */
for(auto signal : {typedef<MyObject requires has_attribute("qt::signal")>... }) {
    signalId++;
    signal(auto... arguments) = { 
        SignalImplementation<decltype(signal), signalId>::impl(this, arguments...); 
    }
}

The implementation of SignalImplementation::impl is then easy.

Summary: What would we need

In summary, this is what would be needed in the standard to implement Qt like features without the need of moc:

  • The N3951 proposal: C++ type reflection via variadic template expansion would be a really good start.
  • Allow attributes within the access specifier (public [[qt::slot]]:)
  • Traits to get the attributes (constexpr std::has_attribute<&MyClass::mySignal>("qt::signal"))
  • Traits to get the access of a function (public, private, protected) (for QMetaMethod::access)
  • A way to declare functions.
  • Getting default value of arguments.
  • Accessing function traits via constexpr expression.
  • Listing the constructors. (for Q_INVOKABLE constructors.)

What would then still be missing

  • Q_PLUGIN_METADATA, which allows loading a JSON file and putting the information in the binary:
    I'm afraid we will still need a tool for that. (I hardly see the C++ compiler opening a file and parsing JSON.) This does not really belong in moc anyway and is only there because moc already existed.
  • Whatever else I missed or forgot. :-)

Conclusion: will moc finally disappear?

Until Qt 6, we have to maintain source and binary compatibility. Therefore moc is not going to disappear, but it may very well become optional for new classes. We could have a Q_OBJECT2 which does not need moc, but uses only standard C++.

In general, while it would be nice to avoid the moc, there is also no hurry to get rid of it. It is generally working fine and serving its purpose quite well. A pure template C++ implementation is not necessarily easier to maintain. Template meta-programming should not be abused too much.

For a related experiment, have a look at my attempt to reimplement moc using libclang.


Solving the Unavoidable Race


This is the story of how I have (not) solved a race condition that impacts QWaitCondition and is also present in every other condition variable implementation (pthread, boost, std::condition_variable).

bool QWaitCondition::wait(int timeout) is supposed to return true if the condition variable was met and false if it timed out. The race is that it may return false (for timeout) even if it was actually woken up.

The problem was already reported in 2012. But I only came to look at it when David Faure was trying to fix another bug in QThreadPool that was caused by this race.

The problem in QThreadPool

When starting a task, QThreadPool did something along the lines of:

QMutexLocker locker(&mutex);

taskQueue.append(task); // Place the task on the task queue
if (waitingThreads > 0) {
   // There are already idle threads running. They are waiting on the
   // 'runnableReady' QWaitCondition. Wake one of them up.
   waitingThreads--;
   runnableReady.wakeOne();
} else if (runningThreadCount < maxThreadCount) {
   startNewThread(task);
}

And the thread's main loop looks like this:

void QThreadPoolThread::run()
{
  QMutexLocker locker(&manager->mutex);
  while (true) {
    /* ... */
    if (manager->taskQueue.isEmpty()) {
      // no pending task, wait for one.
      bool expired = !manager->runnableReady.wait(locker.mutex(), 
                                                  manager->expiryTimeout);
      if (expired) {
        manager->runningThreadCount--;
        return;
      } else {
        continue;
      }
    }
    QRunnable *r = manager->taskQueue.takeFirst();
    // run the task
    locker.unlock();
    r->run();
    locker.relock();
  }
}

The idea is that the thread will wait a given amount of time for a task, but if no task is added within that time, the thread expires and is terminated. The problem here is that we rely on the return value of runnableReady.wait(). If a task is scheduled at exactly the same time as the thread expires, the thread will see false and expire, but the main thread will not start any other thread. That may leave the application hanging, as the task will never be run.

The Race

Many of the implementations of a condition variable have the same issue.
It is even documented in the POSIX documentation:

[W]hen pthread_cond_timedwait() returns with the timeout error, the associated predicate may be true due to an unavoidable race between the expiration of the timeout and the predicate state change.

The pthread documentation describes it as an unavoidable race. But is it? The wait condition is associated with a mutex, which is held by the user when calling wake() and which is also passed, locked, to wait(). The implementation is supposed to unlock and wait atomically.

The C++11 standard library's condition_variable even has an enum (cv_status) for the return code. The C++ standard does not document the race, but all the implementations I have tried suffer from it. (No implementation is therefore conformant.)

Let me try to explain the race better: this code shows a typical use of QWaitCondition.

// Thread 1
mutex.lock();
if (!ready) {
    ready = true;
    condition.wakeOne();
}
mutex.unlock();

// Thread 2
mutex.lock();
ready = false;
bool success = condition.wait(&mutex, timeout);
assert(success == ready);
mutex.unlock();

The race is that the wait condition in Thread 2 times out and returns false, while at the same time Thread 1 wakes the condition. One could expect that, since everything is protected by a mutex, this should not happen. Internally, the wait condition unlocks its internal mutex, but does not check that it has not been woken up once the user mutex is locked again.

QWaitCondition has internal state that counts the number of waiting threads and the number of threads that should be woken up.
Let's review the actual code of QWaitCondition (edited for readability):

bool QWaitCondition::wait(QMutex *mutex, unsigned long time)
{
    // [...]
    pthread_mutex_lock(&d->mutex);
    ++d->waiters;
    mutex->unlock();
 
    // (simplified for brevity)
    int code = 0;
    do {
      code = d->wait_relative(time); // calls pthread_cond_timedwait
    } while (code == 0 && d->wakeups == 0);
    --d->waiters;
    if (code == 0)
      --d->wakeups; // [!!]
    pthread_mutex_unlock(&d->mutex);
    mutex->lock();
    return code == 0;
}

void QWaitCondition::wakeOne()
{
    pthread_mutex_lock(&d->mutex);
    d->wakeups = qMin(d->wakeups + 1, d->waiters);
    pthread_cond_signal(&d->cond);
    pthread_mutex_unlock(&d->mutex);
}

Notice that d->mutex is a native pthread mutex, while the local variable mutex is the user's mutex. On the line marked with [!!] we effectively take the right to wake up. But we do that before locking the user's mutex. What if we checked again for wakeups under the user's lock?

Attempt 1: check again under the user's lock

bool QWaitCondition::wait(QMutex *mutex, unsigned long time)
{
// Same as before:
    pthread_mutex_lock(&d->mutex);
    ++d->waiters;
    mutex->unlock();
    int code = 0;
    do {
      code = d->wait_relative(time); // calls pthread_cond_timedwait
    } while (code == 0 && d->wakeups == 0);
//    --d->waiters; // Moved below
    if (code == 0)
      --d->wakeups;
    pthread_mutex_unlock(&d->mutex);
    mutex->lock();

//  Now check the wakeups again:
    pthread_mutex_lock(&d->mutex);
    --d->waiters;
    if (code != 0 && d->wakeups) {
      // The race is detected, and corrected
      --d->wakeups;
      code = 0;
    }
    pthread_mutex_unlock(&d->mutex);

    return code == 0;
}

And there we have fixed the race! We just had to lock the internal mutex again, because d->waiters and d->wakeups need to be protected by it. We had to unlock it first because locking the user's mutex while holding the internal mutex could cause a deadlock, as the lock order would not be respected.

However, we have now introduced another problem: with three threads, a wakeup can be stolen:

//    Thread 1              // Thread 2             // Thread 3
mutex->lock()
cond->wait(mutex);
                            mutex->lock()
                            cond->wake();
                            mutex->unlock()
                                                    mutex->lock()
                                                    cond->wait(mutex, 0);

We don't want Thread 3 to steal the signal from Thread 1. But that can happen if Thread 1 sleeps a bit too long and does not manage to lock the internal mutex in time before Thread 3 expires.

The only way to solve this problem is to order the threads by the time they started to wait.
Inspired by bitcoin's blockchain, I created a linked list of nodes on the threads' stacks that represents this order. When a thread starts to wait, it adds itself to the end of the doubly linked list. When a thread wakes other threads, it marks the last node of the linked list (by incrementing a woken counter inside the node). When a thread times out, it checks whether it, or any other thread after it in the linked list, was marked. Only in that case do we resolve the race; otherwise we consider it a timeout.

You can see the patch on the code review tool.

Performance

This patch adds quite a bit of code to add and remove nodes in the linked list, and also to go over the list to check whether we were indeed woken up. The linked list's length is bounded by the number of waiting threads. I was expecting this linked list handling to be negligible compared to the other costs of QWaitCondition.

However, the results of the QWaitCondition benchmark show that, with 10 threads and high contention, we have a ~10% penalty. With 5 threads there is ~5% penalty.

Is it worth it to pay this penalty to solve the race? So far, we decided not to merge the patch and keep the race.

Conclusion

Fixing the race is possible, but it has a small performance impact. None of the implementations attempt to fix it. I wonder why there even is a return status at all, if you cannot rely on it.

Nicer debug output using QT_MESSAGE_PATTERN


If you are using Qt, you might have some qDebug or qWarning statements in your code. But did you know that you can greatly improve the output of those with the QT_MESSAGE_PATTERN environment variable? This blog post will give you some hints and examples of what you can do.

The default message pattern just prints the message (and the category if one was specified), but qDebug has the possibility to output more information. You can display cool things like the line of code, the function name or more by using some placeholders in the pattern.

QT_MESSAGE_PATTERN="%{message}"

Some examples of placeholders:

  • %{file} and %{line} are the location of the qDebug statement (file and line number)
  • %{function} just shows the function name. Contrary to Q_FUNC_INFO, which is really the raw function name, this shows a shorter, prettier version of the function name, without the arguments or the not-so-useful decorators
  • %{time [format]} shows the time at which the debug statement is emitted. Using the format, you can show the time since process startup, or an absolute time, with or without the date. Having the milliseconds in the debug output is helpful for getting timing information about your code
  • %{threadid}, %{pid} and %{appname} are useful if the logs of several applications are mixed, or to find out from which thread something is run
  • And you can find even more placeholders in the documentation.

Colorize it!

In order to make the output much prettier and easier to read, you can add some color by means of terminal escape sequences.

Putting the escape sequences in an environment variable might be a bit tricky. The trick I use, which works with bash or zsh, is to use echo -e in backquotes.

export QT_MESSAGE_PATTERN="`echo -e "\033[34m%{function}\033[0m: %{message}"`"

That example will print the function name in blue, and then the message in the normal color.

Conditions

KDE's kDebug has had colored debug output support since KDE 4.0 (enabled by setting the KDE_COLOR_DEBUG environment variable). It prints the function name in blue for normal debug messages, and in red for warnings or critical messages. I wanted the same in Qt, so placeholders were added to have output that depends on the type of message.

The content between %{if-debug} and %{endif} will only be used for qDebug statements, not for qWarning. Similarly, we have %{if-warning} and %{if-critical}. There is also %{if-category}, which is only displayed if there is a category associated with the message.

Backtrace (linux-only)

On Linux, it is possible to show a short backtrace for every debug output.

Use the %{backtrace} placeholder, which can be configured to show more or less call frames.

In order for Qt to be able to determine the backtrace, it needs to find the symbol names from the symbol table. By default, this is only going to display exported functions within a library. But you can tell the linker to include this information for every function. So if you wish to use this feature, you need to link your code with the -rdynamic option.

Add this in your .pro file if you are using qmake:

QMAKE_LFLAGS += -rdynamic

Remember while reading such a backtrace that symbols might be optimized away by the compiler. That is the case for inlined functions, or functions subject to tail-call optimization.
See man backtrace.

Examples of patterns

And now, here are a few ready to use patterns that you can put in your /etc/profile, ~/.bashrc, ~/.zshrc or wherever you store your shell configuration.

KDE4 style:
export QT_MESSAGE_PATTERN="`echo -e "%{appname}(%{pid})/(%{category}) \033\[31m%{if-debug}\033\[34m%{endif}%{function}\033\[0m: %{message}"`"

Time in green; blue function name for debug; red 3-frame backtrace for warnings. Category in yellow if present:
export QT_MESSAGE_PATTERN="`echo -e "\033[32m%{time h:mm:ss.zzz}%{if-category}\033[32m %{category}:%{endif} %{if-debug}\033[34m%{function}%{endif}%{if-warning}\033[31m%{backtrace depth=3}%{endif}%{if-critical}\033[31m%{backtrace depth=3}%{endif}%{if-fatal}\033[31m%{backtrace depth=3}%{endif}\033[0m %{message}"`"

Note that since Qt 5.4, the information about the function name or the file location is only available if your code is compiled in debug mode or if you define QT_MESSAGELOGCONTEXT in your compiler flags. For this reason, %{backtrace depth=1} might be more accurate than %{function}.

Don't hesitate to post your own favorite pattern in the comments.

Final words

The logging system has become quite powerful in Qt5. You can have categories and hooks. I invite you to read the documentation for more information about the debugging options that are at your disposal while using Qt.

QDockWidget improvements in Qt 5.6

We made some improvements to QDockWidget for Qt 5.6. You can now re-order tabbed QDockWidgets with the mouse. There is also a new mode you can set on your QMainWindow so that you can drag and drop whole groups of tabbed QDockWidgets. Furthermore, there is a new API which allows you to programmatically resize the QDockWidgets.

Images (or in this case, animations) are worth a thousand words:

Re-order the tabbed QDockWidgets:

This change applies to all applications using QDockWidget, without any modification of their code.

New QMainWindow mode to drag tabs by group

This mode is not enabled by default because it changes the behaviour; application developers need to opt in by enabling QMainWindow::GroupedDragging in their code:

MyMainWindow::MyMainWindow(/*...*/) : QMainWindow(/*...*/)
{
    /*...*/
    setDockOptions(dockOptions() | QMainWindow::GroupedDragging);
}

Without this flag, the user can only drag the QDockWidgets one by one. In this new mode, the user can drag a whole group of tabbed QDockWidgets together by dragging the title. An individual QDockWidget can still be dragged alone by dragging its tab out of the tab bar.

This animation shows the example in qtbase/examples/widgets/mainwindows/mainwindow:

This changes the behaviour slightly, as the QDockWidget may be reparented to an internal floating tab window. So if your application assumed that a QDockWidget's parent was always the main window, it needs to be changed.

Programmatically resize your dock widgets

If you want to give your application a default layout that looks nice when using many QDockWidgets, you can use QMainWindow::resizeDocks to achieve that goal.
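For example, from within your QMainWindow subclass (dock1 and dock2 are hypothetical QDockWidget pointers; resizeDocks itself is the actual Qt 5.6 API):

```cpp
// Give dock1 and dock2 a 1:2 width ratio. The sizes are hints: Qt
// distributes the available space proportionally, while still respecting
// each widget's minimum and maximum sizes.
resizeDocks({dock1, dock2}, {200, 400}, Qt::Horizontal);
```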

Conclusion

You can try these changes in the Qt 5.6 beta.

We made these changes because they were requested by one of our customers.

QtWidgets still have their place on desktop and many applications are still using QDockWidgets.

How Qt Signals and Slots Work - Part 3 - Queued and Inter Thread Connections

This blog post is part of a series explaining the internals of signals and slots.

In this article, we will explore the mechanisms powering the Qt queued connections.

Summary from Part 1

In the first part, we saw that signals are just simple functions, whose body is generated by moc. They are just calling QMetaObject::activate, with an array of pointers to arguments on the stack. Here is the code of a signal, as generated by moc: (from part 1)

// SIGNAL 0
void Counter::valueChanged(int _t1)
{
    void *_a[] = { Q_NULLPTR, const_cast<void*>(reinterpret_cast<const void*>(&_t1)) };
    QMetaObject::activate(this, &staticMetaObject, 0, _a);
}

QMetaObject::activate will then look in internal data structures to find out what are the slots connected to that signal. As seen in part 1, for each slot, the following code will be executed:

// Determine if this connection should be sent immediately or
// put into the event queue
if ((c->connectionType == Qt::AutoConnection && !receiverInSameThread)
        || (c->connectionType == Qt::QueuedConnection)) {
    queued_activate(sender, signal_index, c, argv, locker);
    continue;
} else if (c->connectionType == Qt::BlockingQueuedConnection) {
    /* ... Skipped ... */
    continue;
}
/* ... DirectConnection: call the slot as seen in Part 1 */

So in this blog post we will see what exactly happens in queued_activate, and in the other parts that were skipped for the BlockingQueuedConnection.

Qt Event Loop

A QueuedConnection will post an event to the event loop to eventually be handled.

When posting an event (in QCoreApplication::postEvent), the event will be pushed into a per-thread queue (QThreadData::postEventList). The event queue is protected by a mutex, so there are no race conditions when threads push events to another thread's event queue.

Once the event has been added to the queue, and if the receiver is living in another thread, we notify the event dispatcher of that thread by calling QAbstractEventDispatcher::wakeUp. This will wake up the dispatcher if it was sleeping while waiting for more events. If the receiver is in the same thread, the event will be processed later, as the event loop iterates.

The event will be deleted right after being processed in the thread that processes it.

An event posted using a QueuedConnection is a QMetaCallEvent. When processed, that event will call the slot the same way we call slots for direct connections. All the information (slot to call, parameter values, ...) is stored inside the event.

Copying the parameters

The argv coming from the signal is an array of pointers to the arguments. The problem is that these pointers point to the stack of the signal where the arguments are. Once the signal returns, they will not be valid anymore. So we'll have to copy the parameter values of the function onto the heap. In order to do that, we just ask QMetaType. We saw in the QMetaType article that QMetaType::create has the ability to copy any type, knowing its QMetaType ID and a pointer to the value.

To know the QMetaType ID of a particular parameter, we look in the QMetaObject, which contains the names of all the types. We can then look up the particular type in the QMetaType database.
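As a sketch of what that copying looks like with the Qt 5 QMetaType API (using an int as the example type; in the real code the type ID comes from the lookup just described):

```cpp
int value = 42;                             // lives on the stack of the signal
// Allocate a heap copy, knowing only the type ID and a pointer to the value:
void *copy = QMetaType::create(QMetaType::Int, &value);
// ... the copy travels inside the QMetaCallEvent to the receiving thread ...
QMetaType::destroy(QMetaType::Int, copy);   // freed after the slot has run
```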

queued_activate

We can now put it all together and read through the code of queued_activate, which is called by QMetaObject::activate to prepare a Qt::QueuedConnection slot call. The code shown here has been slightly simplified and commented:

static void queued_activate(QObject *sender, int signal,
                            QObjectPrivate::Connection *c, void **argv,
                            QMutexLocker &locker)
{
  const int *argumentTypes = c->argumentTypes;
  // c->argumentTypes is an array of int containing the argument types.
  // It might have been initialized in the connection statement when using the
  // new syntax, but usually it is `nullptr` until the first queued activation
  // of that connection.

  // DIRECT_CONNECTION_ONLY is a dummy int which means that there was an error
  // fetching the type ID of the arguments.

  if (!argumentTypes) {
    // Ask the QMetaObject for the parameter names, and use the QMetaType
    // system to look up type IDs
    QMetaMethod m = QMetaObjectPrivate::signal(sender->metaObject(), signal);
    argumentTypes = queuedConnectionTypes(m.parameterTypes());
    if (!argumentTypes) // Cannot queue arguments
      argumentTypes = &DIRECT_CONNECTION_ONLY;
    c->argumentTypes = argumentTypes; /* ... skipped: atomic update ... */
  }
  if (argumentTypes == &DIRECT_CONNECTION_ONLY) // Cannot activate
      return;
  int nargs = 1; // Include the return type
  while (argumentTypes[nargs-1])
      ++nargs;
  // Copy the argumentTypes array since the event is going to take ownership
  int *types = (int *) malloc(nargs*sizeof(int));
  void **args = (void **) malloc(nargs*sizeof(void *));

  // Ignore the return value as it makes no sense in a queued connection
  types[0] = 0; // Return type
  args[0] = 0; // Return value

  if (nargs > 1) {
    for (int n = 1; n < nargs; ++n)
      types[n] = argumentTypes[n-1];

    // We must unlock the object's signal mutex while calling the copy
    // constructors of the arguments as they might re-enter and cause a deadlock
    locker.unlock();
    for (int n = 1; n < nargs; ++n)
      args[n] = QMetaType::create(types[n], argv[n]);
    locker.relock();

    if (!c->receiver) {
      // We have been disconnected while the mutex was unlocked
      /* ... skipped cleanup ... */
      return;
    }
  }

  // Post an event
  QMetaCallEvent *ev = c->isSlotObject ?
    new QMetaCallEvent(c->slotObj, sender, signal, nargs, types, args) :
    new QMetaCallEvent(c->method_offset, c->method_relative, c->callFunction,
                       sender, signal, nargs, types, args);
  QCoreApplication::postEvent(c->receiver, ev);
}

Upon reception of this event, QObject::event will set the sender and call QMetaCallEvent::placeMetaCall. That latter function will dispatch just the same way QMetaObject::activate would do it for direct connections, as seen in Part 1:

  case QEvent::MetaCall:
  {
    QMetaCallEvent *mce = static_cast<QMetaCallEvent*>(e);

    QConnectionSenderSwitcher sw(this, const_cast<QObject*>(mce->sender()),
                                 mce->signalId());

    mce->placeMetaCall(this);
    break;
  }

BlockingQueuedConnection

BlockingQueuedConnection is a mix between a DirectConnection and a QueuedConnection. Like with a DirectConnection, the arguments can stay on the stack: the thread that owns that stack is blocked, so they remain valid and there is no need to copy them. Like with a QueuedConnection, an event is posted to the other thread's event loop. The event also contains a pointer to a QSemaphore. The thread that delivers the event will release the semaphore right after the slot has been called. Meanwhile, the thread that emitted the signal will acquire the semaphore in order to wait until the event is processed.

} else if (c->connectionType == Qt::BlockingQueuedConnection) {
  locker.unlock(); // unlock the QObject's signal mutex.
  if (receiverInSameThread) {
    qWarning("Qt: Dead lock detected while activating a BlockingQueuedConnection: "
             "Sender is %s(%p), receiver is %s(%p)",
             sender->metaObject()->className(), sender,
             receiver->metaObject()->className(), receiver);
  }
  QSemaphore semaphore;
  QMetaCallEvent *ev = c->isSlotObject ?
    new QMetaCallEvent(c->slotObj, sender, signal_index, 0, 0, argv, &semaphore) :
    new QMetaCallEvent(c->method_offset, c->method_relative, c->callFunction,
                       sender, signal_index, 0, 0, argv , &semaphore);
  QCoreApplication::postEvent(receiver, ev);
  semaphore.acquire();
  locker.relock();
  continue;
}

It is the destructor of QMetaCallEvent which will release the semaphore. This is good because the event will be deleted right after it is delivered (i.e. the slot has been called) but also when the event is not delivered (e.g. because the receiving object was deleted).

A BlockingQueuedConnection can be useful for thread communication when you want to invoke a function in another thread and wait for its result before continuing. However, it must be done with care.
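A typical use is QMetaObject::invokeMethod with Qt::BlockingQueuedConnection to call a slot in a worker object's thread and retrieve its return value. In this sketch, worker, its fetchValue slot, and the argument are hypothetical; the invokeMethod call itself is the real Qt API:

```cpp
QString result;
// Blocks the calling thread until the worker's thread has executed the slot:
QMetaObject::invokeMethod(worker, "fetchValue", Qt::BlockingQueuedConnection,
                          Q_RETURN_ARG(QString, result),
                          Q_ARG(int, 42));
```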

The dangers of BlockingQueuedConnection

You must be careful in order to avoid deadlocks.

Obviously, if you connect two objects living in the same thread using a BlockingQueuedConnection, you will deadlock immediately. You are sending an event to the sender's own thread and then locking that thread while waiting for the event to be processed. Since the thread is blocked, the event will never be processed and the thread will be blocked forever. Qt detects this at run time and prints a warning, but does not attempt to fix the problem for you. It has been suggested that Qt could then just do a normal DirectConnection if both objects are in the same thread. But we chose not to, because BlockingQueuedConnection is something that should only be used if you know what you are doing: you must know from which thread to which other thread the event will be sent.

The real danger is that you must keep your design such that if, anywhere in your application, you do a BlockingQueuedConnection from thread A to thread B, thread B must never wait for thread A, or you will have a deadlock again.

When emitting the signal or calling QMetaObject::invokeMethod(), you must not have any mutex locked that thread B might also try locking.

A problem will typically appear when you need to terminate a thread using a BlockingQueuedConnection, for example in this pseudo code:

void MyOperation::stop()
{
    m_thread->quit();
    m_thread->wait(); // Waits for the callee thread, might deadlock
    cleanup();
}

// Connected via a BlockingQueuedConnection
Stuff MyOperation::slotGetSomeData(const Key &k)
{
    return m_data->get(k);
}

You cannot just call wait() here, because the child thread might have already emitted, or be about to emit, the signal that will wait for the parent thread, which won't go back to its event loop. All thread cleanup communication must happen via events posted between threads, without using wait(). A better way to do it would be:

void MyOperation::stop()
{
    connect(m_thread, &QThread::finished, this, &MyOperation::cleanup);
    m_thread->quit();
    /* (note that we connected before calling quit to avoid a race) */
}

The downside is that MyOperation::cleanup() is now called asynchronously, which may complicate the design.

Conclusion

This article should conclude the series. I hope these articles have demystified signals and slots, and that knowing a bit how this works under the hood will help you make better use of them in your applications.

Moc myths debunked

I have often read, in various places, criticism of Qt because of its use of moc. As the maintainer of moc, I thought it would be good to write an article debunking some of the myths.

Introduction

moc is a developer tool and is part of the Qt library. Its role is to handle Qt's extension within the C++ code to offer introspection and enable reflection for Qt Signals and Slots, and for QML. For a more detailed explanation, read my previous article How Qt Signals and Slots work.

The use of moc is often one of the criticisms leveled at Qt. It even led to forks of Qt, such as CopperSpice. However, most of the alleged drawbacks are not well founded.

Myths

Moc rewrites your code before passing it to the compiler

This is often misunderstood: moc does not modify or rewrite your code. It simply parses part of your code to generate additional C++ files, which are then usually compiled independently.
This does not make a big difference overall, but it is still a technical misconception.

The moc just generates some boilerplate code that would be tedious to write otherwise. If you were a masochist, you could write all the introspection tables and the implementation of signals by hand. It is so much more convenient to have this auto-generated.

When writing Qt code, you are not writing real C++ ¹

I have read this many times, but this is simply false. The macros understood by moc to annotate the code are simply standard C++ macros defined in a header. They can be understood by any tool that understands C++. When you write Q_OBJECT, it is a standard C++ macro that expands to some function declarations. When you write signals:, it is just a macro that expands to public:. Many other Qt macros expand to nothing. The moc then locates these macros and generates the code of the signal emitter functions, together with some additional introspection tables.

The fact that your code is also read by a tool other than the compiler does not make it less C++. I've never heard anyone claim that you are not writing vanilla C++ if you use tools like gettext or doxygen, which also parse your code to extract some information.

Moc makes the build process much more complicated ²

If you are using any mainstream build system, such as CMake or qmake, it will have native integration for Qt. Even with a custom build system, we are just talking about invoking one additional command on your header files. All build systems allow this, because many projects have some sort of generated code as part of the build (think of yacc/bison or gperf; LLVM has TableGen).

It makes the debugging experience harder

Since the moc-generated code is pure C++, debuggers and other tools have no problems with it. We try to keep the generated code free of warnings or of anything that would trigger static or dynamic code analyzers.
You will sometimes see backtraces containing frames within the moc-generated code. In some rare cases you can have errors within the moc-generated code, but it is usually straightforward to find their cause. The moc-generated code is human-readable for the most part. It is also probably easier to debug than the infamous compiler error messages you can get when using advanced template code.

Removing moc improves run time performance ³

This is a quote from the CopperSpice home page, and is probably their biggest lie. The moc-generated code is carefully crafted to avoid dynamic allocation and to reduce relocations.
All the moc-generated tables go into const arrays that are stored in the shareable read-only data segment. CopperSpice, on the other hand, registers the QMetaObject data (information about signals, slots, and properties) at run time.

Milian Wolff did some measurements to compare Qt and CopperSpice for his CppCon2015 talk. Here is a screenshot from one of his slides (smaller is better):

It is also worth mentioning that Qt code with moc compiles faster than CopperSpice code.

Outdated Myths

Some criticisms used to be true, but are long outdated.

A macro cannot be used to declare a signal, a slot, the base class of an object, or ...

Before Qt 5, moc did not expand macros. But since Qt 5.0, moc fully expands macros, and this is no longer an issue at all.

Enums and typedefs must be fully qualified for signal and slot parameters

This is only an issue if you use the string-based connection syntax (as it is implemented with a string comparison). With the Qt 5 function-pointer syntax, this is not an issue.
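To illustrate the difference (MyClass::stateChanged, its State enum, and Receiver::onStateChanged are hypothetical; the two connect overloads are the real Qt API):

```cpp
// String-based syntax: "State" and "MyClass::State" are different strings,
// so the enum has to be written fully qualified for the match to succeed.
connect(obj, SIGNAL(stateChanged(MyClass::State)),
        receiver, SLOT(onStateChanged(MyClass::State)));

// Qt 5 function-pointer syntax: the types are resolved by the compiler,
// so qualification is no longer an issue.
connect(obj, &MyClass::stateChanged, receiver, &Receiver::onStateChanged);
```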

Q_PROPERTY does not allow commas in its type

Q_PROPERTY is a macro with one argument that expands to nothing and is only understood by moc. But since it is a macro, a comma in QMap<Foo, Bar> would be seen as separating macro arguments, causing a compilation error. When I saw that CopperSpice used this as an argument against Qt, I spent five minutes fixing it using C++11 variadic macros.

Other criticisms

Template, nested, or multiply inherited classes cannot be QObjects

While true, these are just missing features of QObject, which could be implemented in moc if we wanted them. The Qt project does not consider these features important.

For example, I implemented support for templated QObjects in moc, but it was not merged because it did not raise enough interest within the Qt project.

As a side note, moc-ng supports template and nested classes.

Multiple inheritance is itself controversial. Often considered bad design, it has been left out of many languages. You can have multiple inheritance with Qt as long as QObject comes first as the base class. This small restriction allows us to make useful optimizations. Ever wondered why qobject_cast is so much faster than dynamic_cast?

Conclusion

I believe moc is not a problem. The API and usability of the Qt meta-object macros help. Compare them to CopperSpice's to see the excessive boilerplate and user-unfriendly macros (not even talking about the loss in performance). The Qt signals and slots syntax, which has existed since the 90s, is among the things that made Qt so successful.

You might also be interested in some research projects around moc, like moc-ng, a re-implementation of moc using the clang libraries; or this blog post researching whether moc can be replaced by C++ reflection.
