Discussion:
[cython-users] create a unicode dictionary using C++'s hashmap
Ecolss Logan
9 years ago
Hi Cythoners, reeeeally need your help

I'm writing a text processing tool in Cython and have run into a
problem.
Briefly, what I want to do is this: given a string (which might contain CJK
characters) and a dictionary (holding the whole vocabulary, which might also
contain CJK characters), segment the string into words that exist in the
dictionary.

I intend to store all the strings and the vocabulary as *Unicode*, since they
might be *CJK*.
As for the dictionary, I could use a hash table container to store all
the vocabulary, like Python's *dict/set*, but what about C++'s
*unordered_map/unordered_set*? I tried something like this:
from libcpp.unordered_set cimport unordered_set
cdef unordered_set[unicode] vocab


but it doesn't work at all; the error says:
Python object type 'unicode object' cannot be used as a template argument

So how can I do this if I insist on using a C++ hash table container to store
all the CJK vocabulary?
Thanks
--
---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Stefan Behnel
9 years ago
...
Why would you insist? Just use Python dicts. They are perfect for that job.

Stefan
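
As a rough illustration of the dict/set approach suggested above, here is a minimal Python sketch of dictionary-based segmentation. The thread does not say which algorithm the original poster uses, so greedy forward maximum matching is an assumption, and the function name `segment` and the sample vocabulary are hypothetical.

```python
def segment(text, vocab, max_len=None):
    """Greedy forward maximum matching (assumed algorithm, not from the thread).

    At each position take the longest vocabulary word that starts there;
    if none matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest word in the vocabulary, measured in code points.
        max_len = max((len(w) for w in vocab), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary word starts here: fall back to one character.
            words.append(text[i])
            i += 1
    return words


# Hypothetical vocabulary; Python str objects are Unicode already.
vocab = {"北京", "大学", "北京大学", "学生"}
print(segment("北京大学学生", vocab))  # ['北京大学', '学生']
```

Because Python `str` slices are indexed by code point, CJK text needs no special handling here; the cost of each step is a slice plus one hash lookup in the set.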
Ecolss Logan
9 years ago
On Wednesday, March 30, 2016 at 6:15:45 PM UTC+8, Stefan Behnel wrote:
...
Well, I'm a little biased against Python's containers, thinking
that the C++ ones might be more efficient.
I also wonder how it could be done using only C++ containers.
Ecolss Logan
9 years ago
On Thursday, March 31, 2016 at 8:14:40 AM UTC+8, Ecolss Logan wrote:
...
One more reason is that I want to put all my code in a *nogil* context (to
make it faster), but Python objects cannot be used inside a nogil
block.
Daniele Nicolodi
9 years ago
Post by Ecolss Logan
One more reason is that I want to put all my code in a *nogil* context
(to make it faster), but Python objects cannot be used inside a
nogil block.

`unicode` is also a Python object that you cannot manipulate without
holding the GIL. I don't see the advantage of storing a Python object
into a non-Python container.

Cheers,
Daniele
Ecolss Logan
9 years ago
On Thursday, March 31, 2016 at 8:48:01 AM UTC+8, Daniele Nicolodi wrote:
Post by Daniele Nicolodi
Post by Ecolss Logan
One more reason is that I want to put all my code in a *nogil* context
(to make it faster), but Python objects cannot be used inside a
nogil block.
`unicode` is also a Python object that you cannot manipulate without
holding the GIL. I don't see the advantage of storing a Python object
into a non-Python container.
Cheers,
Daniele

You're right, storing a Python object into a C++ container is inappropriate.
Frankly, I just want to squeeze more performance out of the
code, which makes me think I should use C++ libraries as much as
possible.
Am I radically wrong in thinking this?
Yury V. Zaytsev
9 years ago
Post by Ecolss Logan
You're right, storing a Python object into a C++ container is
inappropriate.
Regarding your original question, the answer depends on what kind of
processing you want to perform on the strings, on top of storing them.

For instance, you can simply store them as byte sequences, e.g.
std::vector<char>, as C++ couldn't care less what's inside.
Post by Ecolss Logan
Frankly, I just want to squeeze more performance out of the
code, which makes me think I should use C++ libraries as much as
possible. Am I radically wrong in thinking this?

Yes. One improves performance by profiling the code, identifying
bottlenecks (including algorithmic problems), and removing them with the
appropriate tools, rather than by reaching for the lowest-level tools in
every situation.
--
Sincerely yours,
Yury V. Zaytsev
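
To make the "profile first" advice concrete, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The workload (`run_workload`, `count_known`) is a hypothetical stand-in for a real segmentation loop, not code from this thread.

```python
import cProfile
import io
import pstats


def count_known(tokens, vocab):
    # Hypothetical hot loop: count the tokens that appear in the vocabulary.
    hits = 0
    for token in tokens:
        if token in vocab:
            hits += 1
    return hits


def run_workload():
    # Synthetic data standing in for a real vocabulary and input text.
    vocab = {str(i) for i in range(10_000)}
    tokens = [str(i % 20_000) for i in range(200_000)]
    return count_known(tokens, vocab)


# Record where the time goes while the workload runs.
profiler = cProfile.Profile()
profiler.enable()
result = run_workload()
profiler.disable()

# Print the five entries with the largest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

If the lookup loop turns out to be a small share of the total, swapping the container is unlikely to pay off, whatever language it is written in.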
Ecolss Logan
9 years ago
On Thursday, March 31, 2016 at 2:18:32 PM UTC+8, Yury V. Zaytsev wrote:
Post by Yury V. Zaytsev
Post by Ecolss Logan
You're right, storing a Python object into a C++ container is inappropriate.
Regarding your original question, the answer depends on what kind of
processing you want to perform on the strings, on top of storing them.
For instance, you can simply store them as byte sequences, e.g.
std::vector<char>, as C++ couldn't care less what's inside.
You mean I could read a file from disk, encode all the text in, say,
UTF-8, and then store it in a std::vector as plain chars, right?
Yes, I could do that, but what if I want to do Unicode string search and
matching? Encoded text won't do for that, will it?
...
such as `nogil`, and because no Python objects are allowed in a nogil context, I
went for the C++ counterparts.
Yury V. Zaytsev
9 years ago
Post by Ecolss Logan
You mean I could read a file from disk, encode all the text in, say,
UTF-8, and then store it in a std::vector as plain chars, right?
Yes, I could do that, but what if I want to do Unicode string search and
matching? Encoded text won't do for that, will it?
Oh, I missed the original message where you were explaining what you want
to use the dictionary for. My point was that if you don't need to deal
with code points at the C++ level, you can store them as opaque byte
sequences in std::vector<char> in any encoding you like, as long as you
are consistent about encoding / decoding on the C++ <-> Python boundary.

In your specific case, if you go for UTF-8, of course, this isn't going to
work unless you implement your algorithm in a way that is aware of the
variable-length encoding. You can instead use UCS-4 as your internal
encoding with std::vector<char32_t> / std::u32string, which would work just
fine, except that you'd incur additional encoding/decoding overhead.
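
The difference between the two encodings can be seen from Python with a small sketch. It uses `utf-32-le` (UTF-32 without a byte order mark) as a stand-in for the UCS-4 / std::u32string layout; the helper `slice_code_points` and the sample strings are hypothetical.

```python
text = "我在北京大学"

utf8 = text.encode("utf-8")        # variable width: 1 to 4 bytes per code point
utf32 = text.encode("utf-32-le")   # fixed width: 4 bytes per code point, no BOM


def slice_code_points(buf, start, stop):
    """Slice a UTF-32-LE buffer by code point index using plain byte arithmetic."""
    return buf[4 * start:4 * stop]


# Code points 2 and 3 of the text, located without decoding anything.
piece = slice_code_points(utf32, 2, 4)

# A vocabulary stored as opaque byte strings in the same encoding.
vocab = {w.encode("utf-32-le") for w in ("北京", "大学")}
print(piece in vocab)  # membership is decided on raw bytes

# The same kind of byte arithmetic on UTF-8 can cut a character in half.
try:
    utf8[0:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(len(text), len(utf8), len(utf32), split_ok)
```

With a fixed-width encoding, "the k-th character" is just an offset, which is what a segmentation loop wants; the price is the extra encode/decode step at the Python boundary and four bytes per character.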
Post by Ecolss Logan
Yes. One improves performance by profiling the code, identifying
bottlenecks (including algorithmic problems), and removing them with the
appropriate tools, rather than by reaching for the lowest-level tools in
every situation.
Yes, sometimes I just want to make use of all the fancy stuff in Cython,
such as `nogil`, and because no Python objects are allowed in a nogil
context, I went for the C++ counterparts.

This is, in general, a very questionable justification.
--
Sincerely yours,
Yury V. Zaytsev
Daπid
9 years ago
Post by Ecolss Logan
Yes, sometimes I just want to make use of all the fancy stuff in Cython,
such as `nogil`, and because no Python objects are allowed in a nogil
context, I went for the C++ counterparts.
Python's dicts are very well implemented, so they may be faster than
you think. You should do a quick experiment comparing Python's dict with a
C++ hash map, without Unicode, to get an idea of the performance hit, and
measure how much of your program's time is spent on this
operation. Then you can decide whether the change is worth the
effort.


/David.
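
For the Python half of such an experiment, a minimal sketch with the standard `timeit` module might look like this. The vocabulary and query list are synthetic placeholders, and the C++ side of the comparison would have to be measured separately.

```python
import timeit

# Synthetic vocabulary and queries (hypothetical data): half the queries hit.
vocab = {f"word{i}" for i in range(100_000)}
queries = [f"word{i}" for i in range(0, 200_000, 2)]


def lookup_all():
    # Count how many queries are present in the set.
    return sum(1 for q in queries if q in vocab)


# Take the best of three runs of five passes each to reduce noise.
passes = 5
seconds = min(timeit.repeat(lookup_all, number=passes, repeat=3))
print(f"{len(queries) * passes / seconds:,.0f} lookups per second")
```

Comparing this rate with the number of lookups the real program performs gives a first estimate of how much time the dictionary can possibly be costing.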
Ecolss Logan
9 years ago
Thank you all,
highly appreciated.