Add sentencepiece preset #1384
If you don't need those functions, we could just skip them for now. |
69fc311 to 8bdb65d
I've skipped all functions triggering the error for now, to get the build and an encoding example working. The build now works everywhere except Windows, which keeps causing symlink and path issues, so I'm inclined to disable it for now as well.
The one important function triggering the issue is
I was able to get encoding working via the example, well, half of the time on my Mac.
javacpp-presets/sentencepiece/samples/SentencepieceExample.java Lines 5 to 28 in aaa9448
The other half it crashes even if I only instantiate the
|
Please try to set the "org.bytedeco.javacpp.nopointergc" system property to "true". |
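For anyone following along, here is a minimal sketch of one way to set that property from code. The class name `NoPointerGcExample` is hypothetical, and the sketch assumes the property has to be in place before any JavaCPP class is loaded, so it should run first in `main`:

```java
public class NoPointerGcExample {
    /** Name of the JavaCPP system property quoted in the comment above. */
    static final String NO_POINTER_GC = "org.bytedeco.javacpp.nopointergc";

    /** Sets the property; assumed to be called before any JavaCPP class is touched. */
    static void disablePointerGc() {
        System.setProperty(NO_POINTER_GC, "true");
    }

    public static void main(String[] args) {
        disablePointerGc();
        // Assumption: the sentencepiece classes are only loaded after this point.
        System.out.println(NO_POINTER_GC + "=" + System.getProperty(NO_POINTER_GC));
    }
}
```

Passing `-Dorg.bytedeco.javacpp.nopointergc=true` on the `java` command line should have the same effect without touching the code.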
It still shows the same behaviour with nopointergc enabled. |
You could try it on Linux, and if it works there, we'll know the problem is something related to Mac. |
Thanks! Looks like some Mac weirdness indeed, as it's working fine on my Linux machine. Perhaps it's even something weird on my local machine, so I think I should try the Mac CI builds to see if it makes a difference. I think I've asked you this before, but right now it's not possible to download the artifacts from the CI build if they are not yet published to Sonatype, right? |
Right, but the builds need to pass anyway to get anything published.
GNU make doesn't support Visual Studio; we typically use ninja, like this: |
Link against sentencepiece_train
I figured that I can use cmake itself for building by looking at the CI of upstream sentencepiece (they're building for windows too). The windows build also surfaced a missing linker config. So with decoding skipped, it builds on all enabled platforms now. |
I've removed Docker and configured cross-compiling for the arm64 build. |
I can see how it works for linux-arm64, but cross-compilation is missing for macosx-arm64. |
I've now tried to enable cross-building for macosx-arm64 as well. I'm not really sure if it works as intended though, because before it seems like it linked the arm64 compiled jni lib against the x86_64 sentencepiece lib without errors. Could be due to Apple's compat layer perhaps? I guess I could try to build it locally (I have an intel mac) with this new config and try to see if the object files are different. |
Ok, I built both variants locally on my Intel Mac with Maven, and I think this is what we want:
❯ mvn package --projects sentencepiece -Djavacpp.platform=macosx-x86_64
❯ lipo -info sentencepiece/cppbuild/macosx-x86_64/lib/*.a
Non-fat file: sentencepiece/cppbuild/macosx-x86_64/lib/libsentencepiece.a is architecture: x86_64
Non-fat file: sentencepiece/cppbuild/macosx-x86_64/lib/libsentencepiece_train.a is architecture: x86_64
❯ lipo -info sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-x86_64/libjnisentencepiece.dylib
Non-fat file: sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-x86_64/libjnisentencepiece.dylib is architecture: x86_64
❯ mvn package --projects sentencepiece -Djavacpp.platform=macosx-arm64
❯ lipo -info sentencepiece/cppbuild/macosx-arm64/lib/*.a
Non-fat file: sentencepiece/cppbuild/macosx-arm64/lib/libsentencepiece.a is architecture: arm64
Non-fat file: sentencepiece/cppbuild/macosx-arm64/lib/libsentencepiece_train.a is architecture: arm64
❯ lipo -info sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-arm64/libjnisentencepiece.dylib
Non-fat file: sentencepiece/target/native/org/bytedeco/sentencepiece/macosx-arm64/libjnisentencepiece.dylib is architecture: arm64 |
I'm currently mapping strings like this to make the
infoMap.put(new Info("std::string").annotations("@StdString").valueTypes("String").pointerTypes("@Cast({\"char*\", \"std::string*\"}) BytePointer"))
Now for decoding, in C++ I'd create a
std::vector<std::string> pieces = { "▁This", "▁is", "▁a", "▁", "te", "st", "." }; // sequence of pieces
std::string text;
processor.Decode(pieces, &text);
std::cout << text << std::endl;
I'm not sure how to create
The linux-arm64 build fails again, but I think it is unrelated to my changes since I didn't touch that part in the latest commit. It already fails while installing the system packages, so perhaps some update of the ubuntu/ubuntu-ports packages caused it. |
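To make the expected result of the Decode call in the C++ snippet above concrete, here is a rough plain-Java illustration. It does not use the native library or the preset; `PieceJoinSketch` and `joinPieces` are hypothetical names. It only assumes the usual sentencepiece convention that pieces are concatenated and the meta symbol `▁` (U+2581) stands for a space, and it ignores things the real decoder handles, such as control symbols or byte fallback:

```java
import java.util.Arrays;
import java.util.List;

public class PieceJoinSketch {
    /** The sentencepiece whitespace meta symbol, written as an escape for portability. */
    static final char META = '\u2581';

    /** Concatenates the pieces, maps the meta symbol to a space, drops one leading space. */
    static String joinPieces(List<String> pieces) {
        String joined = String.join("", pieces).replace(META, ' ');
        // Assumption: the model added a dummy prefix, so the text starts with one extra space.
        return joined.startsWith(" ") ? joined.substring(1) : joined;
    }

    public static void main(String[] args) {
        List<String> pieces = Arrays.asList(
                "\u2581This", "\u2581is", "\u2581a", "\u2581", "te", "st", ".");
        System.out.println(joinPieces(pieces)); // prints "This is a test."
    }
}
```

A binding that works correctly should print the same text for this input, which makes this a handy sanity check for the example.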
Don't map that to BytePointer, it most likely won't work. Map it to something like a StringVector with Info.define(): |
If you're talking only about the std::string, that can be mapped to a BytePointer like this, yes. What issue are you having with the String constructor? |
Ah, I hadn't realized the role of
Thanks! Decoding is working now. I've updated the example and added the missing README. |
@saudet if you don't have any additional remarks, this PR is ready now. |
Thanks for your help and patience @saudet! |
I'm trying to add a preset for the sentencepiece library, an
frequently used in current large language model architectures.
C++ API examples
Current state
The native library builds and the parser runs successfully, but the generated JNI code still fails to compile. I'd be grateful for any feedback, and apologies if I'm missing something obvious here; my C++ knowledge is still rather limited.
I think it could be related to the use of string_view, which I've just tried to map to a std::string, but which perhaps needs to be treated differently. Here's one example where it fails: